Daily Digest — 2026-05-14
340 items · 4 research labs, 329 arxiv papers, 7 industry media
🏛️ Research Labs (4)
Building a safe, effective sandbox to enable Codex on Windows
OpenAI developed a custom sandbox implementation for Codex on Windows to balance safety and productivity, addressing the lack of native OS-level isolation. The solution combines synthetic SIDs and write-restricted tokens to enforce granular filesystem access controls without requiring admin privileges, while advisory environment variables limit network access. Initial prototypes demonstrated effective write restrictions but revealed weaknesses in network suppression, prompting exploration of Windows Firewall integration for stronger isolation.
sandboxsynthetic sidswrite-restricted tokensmandatory integrity controlappcontainer
How finance teams use Codex
OpenAI Codex enables finance teams to automate repetitive tasks and generate review-ready assets for business operations. By leveraging existing workbooks, dashboards, and owner notes, Codex transforms unstructured inputs into structured narratives, variance analyses, and forecast updates. The system integrates with plugins like Google Drive, SharePoint, and Slack to process source-backed data, flag risks, and draft CFO-ready reports. Example workflows include preparing monthly business reviews, cleaning financial models, and updating executive reporting packs. Codex reduces manual effort, ensuring accuracy and consistency while allowing teams to focus on strategic decision-making.
codexforecast updatesvariance analysiscfo-readyplugins
AutoScout24 scales engineering with AI-powered workflows
AutoScout24 Group implemented AI-powered workflows using OpenAI's Codex and ChatGPT to accelerate software development and enhance code quality across its engineering teams. The dual-layer strategy combined broad organizational access to ChatGPT for 2,000 employees with deep integration of Codex into workflows for 1,000 builder roles. Key outcomes included a 10x reduction in development cycles (from weeks to days), improved code consistency through automated pull request reviews, and expanded innovation capacity via AI-enabled prototyping. The company established an AI Champions network to drive organic adoption, focusing on augmenting existing capabilities rather than replacing them.
codexchatgptpull request reviewsworkflow integrationai champions
How NVIDIA engineers and researchers build with Codex
NVIDIA engineers leverage OpenAI's Codex (built on GPT-5.5) to accelerate complex engineering and ML research workflows, achieving 10× speed improvements in end-to-end experimentation. The system autonomously handles long coding sessions, bug detection, and tool selection while maintaining context across compactions. Key results include 40,000 NVIDIA employees adopting Codex, automated research loops (from literature review to experiment execution via SSH), and 20× efficiency gains in Python-to-Rust translation. The model demonstrates superior autonomy and creativity compared to predecessors, enabling rapid prototyping of production systems like an internal podcast app.
codexgpt-5.5kv-cacheautonomous agentsmachine translation
📜 arXiv Papers (329)
AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward
AlphaGRPO introduces Group Relative Policy Optimization (GRPO) for AR-Diffusion Unified Multimodal Models (UMMs), enabling advanced multimodal generation tasks without cold-start initialization. The framework supports Reasoning Text-to-Image Generation by inferring implicit user intents and Self-Reflective Refinement through autonomous error diagnosis and correction. A Decompositional Verifiable Reward (DVReward) mechanism decomposes user requests into atomic, verifiable questions evaluated by a general Multimodal Large Language Model (MLLM) for stable supervision. Experiments on GenEval, TIIF-Bench, DPG-Bench, WISE, and GEdit demonstrate robust improvements in generation and editing tasks, validating the self-reflective reinforcement approach.
group relative policy optimizationar-diffusion unified multimodal modelsdecompositional verifiable rewardreasoning text-to-image generationself-reflective refinement
Learning, Fast and Slow: Towards LLMs That Adapt Continually
We introduce Fast-Slow Training (FST), a framework combining in-context learning (fast weights) and parameter updates (slow weights) for continual adaptation in large language models (LLMs). FST leverages optimized context as fast weights to absorb task-specific information while maintaining slow weights closer to the base model, preserving general reasoning behaviors. FST achieves up to 3x greater sample efficiency and higher performance asymptotes compared to parameter-only reinforcement learning (RL) across reasoning tasks. It reduces KL divergence by up to 70%, mitigating catastrophic forgetting and preserving plasticity for subsequent tasks. In continual learning scenarios, FST consistently acquires new tasks, outperforming parameter-only RL approaches.
fast-slow trainingin-context learningcatastrophic forgettingkl divergencecontinual learning
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
The paper introduces a reward-density principle for efficient allocation of labeled training data in language model post-training, arguing that sparse sequence-level rewards should train exploratory models while dense token-level teacher rewards compress behavior into smaller models. It proposes using scarce labeled data upstream on the strongest model to generate reward-shaped behavior, then transferring it downstream as dense supervision. Evaluations on Qwen3 and Llama models for verifiable math tasks show that an RL-improved 8B teacher distilled through dense supervision outperforms direct GRPO on a 1.7B student, improving MATH accuracy from 75.4% to 78.5%.
reward-densitysparse rewarddense supervisionrl-improvedtoken-level
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces an end-to-end agent for optimal GUI-Tool path orchestration in Computer Use Agents (CUAs), addressing suboptimal execution paths caused by hybrid action spaces. The method employs an Interleaved GUI-Tool Trajectory Scaling Pipeline to synthesize diverse trajectories, Tool-Bootstrapped GUI RFT combining supervised fine-tuning and single-turn RL for improved switching decisions, and Online Agentic RL guided by a Tool-Efficient Path Reward. Evaluated on OSWorld-MCP, ToolCUA achieves 46.85% accuracy, a 66% relative improvement over the baseline and 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. The approach highlights the potential of hybrid action space training for real-world digital agents.
gui-tool orchestrationhybrid action spacetool-bootstrapped gui rftonline agentic rltool-efficient path reward
OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
OmniNFT introduces a modality-aware online diffusion RL framework for joint audio-video generation, addressing three key challenges: multi-objective advantages inconsistency, multi-modal gradients imbalance, and uniform credit assignment. The method employs modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting to enhance per-modality fidelity and cross-modal alignment. Experiments on JavisBench and VBench with LTX-2 show improvements in audio-video perceptual quality, alignment, and synchronization.
reinforcement learningdiffusion modelsmultimodal generationgradient surgerycredit assignment
Reward Hacking in Rubric-Based Reinforcement Learning
This work investigates reward hacking in rubric-based reinforcement learning, where policies optimized against training verifiers diverge from rubric-free judge evaluations. The authors introduce a framework separating verifier failure (training verifier credits rejected criteria) and rubric-design limitations (rubric-based verifiers favor worse responses). Experiments in medical and science domains show weak verifiers yield proxy-reward gains that fail to transfer, with exploitation growing over training. Stronger verifiers reduce but do not eliminate exploitation. A self-internalization gap metric tracks reference-verifier quality. Results indicate stronger verification reduces reward hacking but does not ensure rubric gains align with broader quality improvements.
reward hackingrubric-based rlverifier failureself-internalization gapproxy-reward
KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
KV-Fold introduces a training-free long-context inference protocol that treats the key-value (KV) cache as an accumulator in a left fold over sequence chunks. The method processes each chunk conditioned on the accumulated cache, appends new keys and values, and passes the enlarged cache forward, enabling stable recurrence without model modification. Results demonstrate robustness across chunk sizes, numerical precision, and model families, achieving 100% exact-match retrieval on a needle-in-a-haystack benchmark with contexts up to 128K tokens and chain depths up to 511 on Llama-3.1-8B. KV-Fold maintains long-range retrieval within single GPU memory limits, outperforming streaming methods.
kv-cachelong-context inferencerecurrencetransformerneedle-in-a-haystack
Solve the Loop: Attractor Models for Language and Reasoning
The paper introduces Attractor Models, a novel architecture combining backbone and attractor modules to refine output embeddings via fixed-point solving with implicit differentiation. This approach maintains constant training memory and adaptively selects iteration depth. Empirical results demonstrate Pareto improvements over standard Transformers in language modeling (46.6% lower perplexity, 19.7% higher accuracy) and reasoning tasks (91.4% accuracy on Sudoku-Extreme with 27M parameters). The models exhibit equilibrium internalization, enabling solver removal at inference with minimal performance loss.
attractor modelsfixed-point solvingimplicit differentiationequilibrium internalizationiterative refinement
Towards Affordable Energy: A Gymnasium Environment for Electric Utility Demand-Response Programs
We introduce DR-Gym, an open-source Gymnasium-compatible environment for training and evaluating demand-response programs from an electric utility's perspective. The simulator addresses limitations of offline historical data by modeling the dynamic feedback loop between pricing signals and customer adaptation, featuring a regime-switching wholesale price model calibrated to extreme events and physics-based building demand profiles. A configurable multi-objective reward function enables diverse learning objectives. Baseline strategies and data snapshots demonstrate the simulator's capability to create realistic and learnable environments for optimizing sequential decision-making in demand-response programs.
demand-responsegymnasiumregime-switchingmulti-objectivesequential decision-making
Enabling AI-Native Mobility in 6G: A Real-World Dataset for Handover, Beam Management, and Timing Advance
This work introduces a real-world dataset for AI/ML-driven mobility optimization in 6G networks, addressing limitations of simulated data in high-speed 5G scenarios. The dataset captures user equipment (UE) mobility across pedestrian, bike, car, bus, and train modes, focusing on handover (HO) scenarios to reduce interruption time and maintain throughput. It includes timing advance (TA) measurements at key signaling events (RACH trigger, MAC CE, PDCCH grant), previously absent in existing datasets. The authors detail dataset creation, experimental setup, and exploratory analysis, highlighting its utility for training and evaluating AI/ML models in TA prediction and beam management.
handovertiming advancebeam managementuser equipment6g
The Algorithmic Caricature: Auditing LLM-Generated Political Discourse Across Crisis Events
This study introduces a Computational Social Science framework to audit the population-level realism of LLM-generated political discourse across crisis events. Using a paired corpus of 1,789,406 posts from nine events, the authors compare observed social media discourse with synthetic counterparts across four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency. Results indicate that synthetic discourse is fluent but less realistic at the population level, exhibiting more negative sentiment, structural regularity, and lexical abstraction compared to observed discourse. Differences vary by event type, quantified via the Caricature Gap measure. The findings highlight reduced population realism as a key limitation of synthetic political discourse.
computational social sciencepopulation realismcaricature gaplexical-ideological framingcross-event dependency
A Causal Language Modeling Detour Improves Encoder Continued Pretraining
The study demonstrates that temporarily switching from Masked Language Modeling (MLM) to Causal Language Modeling (CLM) during encoder adaptation improves downstream performance. Using ModernBERT on biomedical texts, this CLM detour followed by MLM decay outperformed MLM-only baselines by +0.3-2.8pp across 19 French and English tasks. Analysis reveals CLM's dense supervision primarily affects lower transformer layers (0-7), with gains persisting through MLM decay and scaling with model capacity. The authors release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders.
masked language modelingcausal language modelingencoder adaptationtransformer layersbiomedical nlp
CAAFC: Chronological Actionable Automated Fact-Checker for misinformation / non-factual hallucination detection and correction
We introduce CAAFC (Chronological Actionable Automated Fact-Checker), a framework addressing limitations in existing Automated Fact-Checking (AFC) systems by aligning with professional fact-checking practices. CAAFC processes claims, conversations, and dialogues to detect factual errors and hallucinations, providing actionable corrections with primary source justifications. It dynamically updates evidence and knowledge bases by incorporating recent contextual information. Evaluations demonstrate that CAAFC outperforms state-of-the-art AFC and hallucination detection systems across multiple benchmark datasets, enhancing fact verification reliability.
automated fact-checkinghallucination detectionknowledge base updateprimary source justificationcontextual information
Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers
The paper introduces CP-SynC-XL, a benchmark of 100 combinatorial problems (4,577 instances), to evaluate three LLM-generated solver-construction paradigms: native Python, Python + OR-Tools, and MiniZinc + OR-Tools. It finds that Python + OR-Tools achieves highest correctness, while MiniZinc + OR-Tools has lower coverage despite using the same back-end. Prompting for search optimization yields minimal speed-ups (1.03-1.12x median) and often degrades correctness due to heuristic traps like local approximations or redundant constraints. The results advocate formalizing variables and constraints for verified solvers while separately verifying LLM-authored optimizations.
combinatorial solversllm-generated codeconstraint programmingheuristic trapneuro-symbolic systems
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
This work proposes that Large Language Models (LLMs) update beliefs through trajectories in a low-dimensional conceptual belief space, analogous to Bayesian inference. The study analyzes belief dynamics using story understanding tasks, combining behavioral and representational analyses. Results show that belief updates follow structured manifolds, reflected consistently in model behavior and internal representations, which can be decoded using linear probes. Interventions on these representations causally steer belief trajectories, predictable from the geometry of the conceptual space. These findings provide a geometric framework for understanding in-context learning in LLMs.
large language modelsbayesian inferenceconceptual belief spacein-context learninglinear probes
Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling
The study introduces a target-adaptive text-tabular modeling approach to predict decisions of unfamiliar AI agents from limited interactions, leveraging structured game state, offer history, and dialogue. The method employs a tabular foundation model augmented with LLM-as-Observer, where a frozen LLM encodes decision-time state and dialogue into hidden state features, enhancing prediction without direct few-shot prompting. Evaluated on 13 frontier-LLM agents and 91 scaffolded agents, the model outperforms baselines, with Observer features improving response-prediction AUC by ~4 points and reducing bargaining offer-prediction error by 14%. This demonstrates the efficacy of hidden LLM representations in decision prediction.
tabular foundation modelllm-as-observertarget-adaptive predictionhidden state featuresdecision prediction
Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems
The paper introduces Semantic Reward Collapse (SRC), a phenomenon where semantically distinct forms of evaluative dissatisfaction are compressed into generalized optimization signals in reinforcement learning from human feedback (RLHF) systems. This leads to epistemic drift, where systems suppress visible uncertainty rather than preserving calibrated uncertainty integrity. Drawing on institutional proxy collapse and human learning theory, the authors propose Constitutional Reward Stratification (CRS), a domain-aware reward framework designed to preserve differentiated epistemic attribution. CRS is presented as a governance-oriented research direction requiring further empirical validation.
semantic reward collapsereinforcement learning from human feedbackepistemic integrityconstitutional reward stratificationoptimization signals
OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning
The paper proposes OGLS-SD, an outcome-guided logit-steering framework for on-policy self-distillation (OPSD) in large language models (LLMs). The method addresses token-level supervision mismatch caused by reflection-induced bias and response templates by leveraging verifiable outcome rewards to contrast successful and failed trajectories. Experiments demonstrate improved reasoning performance over standard OPSD and variants across multiple benchmarks through calibrated teacher logits combining outcome-level correctness with dense token-level guidance.
on-policy self-distillationlogit steeringoutcome-guided learningreasoning calibrationtoken-level supervision
Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory
A novel Random Matrix Theory method detects overfitting in Neural Networks without access to train or test data by identifying Correlation Traps—large outliers in the empirical spectral distribution of randomized weight matrices. The method involves element-wise randomization of weight matrices, fitting with a Marchenko-Pastur distribution, and evaluating JS divergence of output logits on random data. Results reveal an 'anti-grokking' phase characterized by increasing Correlation Traps, high train accuracy, and decreasing test accuracy, distinct from pre-grokking phases. The method also identifies Correlation Traps in some foundation-scale LLMs, indicating potential harmful overfitting.
random matrix theorycorrelation trapsmarchenko-pastur distributionanti-grokkingjs divergence
SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation
The paper introduces SEMIR, a graph-based representation learning framework for segmenting small, sparse structures in high-resolution images. SEMIR decouples inference from native grids by constructing topology-preserving graph minors through parameterized edge contraction and deletion, optimized via boundary-alignment objectives. The method employs a GNN with relational edge features for efficient region-level inference. Evaluated on BraTS 2021, KiTS23, and LiTS datasets, SEMIR improves Dice scores for minority structures while maintaining practical runtime performance, demonstrating robustness to structural variability and distributional uncertainty.
graph minorboundary-alignmentdice criteriongnnedge contraction
Scalable Token-Level Hallucination Detection in Large Language Models
TokenHD introduces a scalable pipeline for token-level hallucination detection in large language models (LLMs), addressing limitations of step-level analysis. The method combines a data engine for synthesizing hallucination annotations with an importance-weighted training strategy, enabling direct detection on free-form text without predefined segmentation. Experiments demonstrate that a 0.6B detector outperforms larger reasoning models like QwQ-32B, with detection performance scaling consistently from 0.6B to 8B. The detector exhibits strong generalization across diverse scenarios, and strategies for enhancing cross-domain generalization are explored.
hallucination detectiontoken-level analysisimportance-weighted trainingscalable pipelinecross-domain generalization
Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training
The paper introduces a batch-adaptive objective for reinforcement learning that dynamically adjusts trust-region and off-policy concerns based on the policy-ratio distribution, eliminating the need for fixed hyper-parameters. The method uses normalized effective sample size to cap score-function weights and set regularization strength, automatically tightening updates when data becomes stale or mismatched. Experiments demonstrate that this approach matches or outperforms tuned baselines across diverse settings without introducing new hyper-parameters. The implementation is available as open-source.
policy optimizationoff-policy learningtrust-region methodseffective sample sizereinforcement learning
Discrete Flow Matching for Offline-to-Online Reinforcement Learning
The paper introduces DRIFT, a method for offline-to-online reinforcement learning (RL) in discrete action spaces, addressing challenges in fine-tuning generative policies. DRIFT updates an offline pretrained continuous-time Markov chain (CTMC) policy using advantage-weighted discrete flow matching, with a path-space penalty to preserve pretrained knowledge and a candidate-set approximation for large action spaces. Theoretical analysis shows controlled error in candidate-set approximation and adaptive CTMC generators. Experiments on Jericho demonstrate stable improvement, achieving the highest average score with a GRU encoder, outperforming pretrained language model methods.
offline-to-online rldiscrete flow matchingcontinuous-time markov chainadvantage-weighted losscandidate-set approximation
ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows
ProfiliTable introduces an autonomous multi-agent framework for robust tabular data processing, addressing limitations of LLM-based approaches through dynamic profiling. The system combines a Profiler (ReAct-style exploration), Generator (knowledge-augmented code synthesis), and Evaluator-Summarizer (closed-loop refinement via execution feedback). Evaluated on 18 tabular task types, it outperforms baselines in multi-step scenarios, demonstrating improved semantic accuracy and governance compliance through iterative context refinement.
tabular data processingdynamic profilingmulti-agent frameworkreact-style explorationclosed-loop refinement
Agent-Based Post-Hoc Correction of Agricultural Yield Forecasts
A structured LLM agent framework is proposed for post-hoc correction of agricultural yield forecasts, addressing limitations in commercial farm records lacking sensor networks and high-resolution inputs. The framework integrates domain knowledge via phase detection, bias learning, and range validation tools. Evaluations on proprietary strawberry and USDA corn datasets demonstrate significant improvements: agent refinement reduced MAE by 20% and MASE by 56% for strawberry yields across XGBoost, Moirai2, and Random Forest baselines. Llama 3.1 8B outperformed LLaVA 13B, achieving consistent gains across configurations.
llm agentpost-hoc correctionphase detectionbias learningrange validation
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
The Granular Alignment Paradigm (GAP) addresses feature-space mismatch in visual latent reasoning for multimodal large language models (MLLMs) by aligning visual latents at three levels. GAP employs feature-level alignment via a PCA-aligned latent head, context-level alignment with auxiliary visual supervision, and capacity-guided alignment targeting challenging examples. Evaluated on Qwen2.5-VL 7B, GAP achieves superior mean aggregate perception and reasoning performance compared to supervised variants. Inference-time probing indicates that generated latents provide task-relevant visual signals beyond token slot expansion.
multimodal large language modelsvisual latent reasoninggranular alignment paradigmfeature-space mismatchpca-aligned latent head
Classifier Context Rot: Monitor Performance Degrades with Context Length
The study demonstrates that frontier language models (Opus 4.6, GPT 5.4, Gemini 3.1) exhibit degraded performance in classifying dangerous actions within long coding transcripts (>500K tokens), with failure rates increasing by 2× to 30× beyond 800K tokens compared to shorter contexts. The authors propose prompting techniques like periodic reminders as partial mitigation and highlight the need for long-context evaluations in monitor benchmarks. Results indicate current evaluations overestimate monitor performance by neglecting context-length effects.
long-context degradationcoding agentsprompting techniquesmonitor performancefrontier models
QAP-Router: Tackling Qubit Routing as Dynamic Quadratic Assignment with Reinforcement Learning
QAP-Router introduces a reinforcement learning approach to qubit routing by framing it as a dynamic Quadratic Assignment Problem (QAP), capturing interaction-distance coupling through flow and distance matrices. The policy network employs a solution-aware Transformer backbone to encode matrix interactions into attention mechanisms, integrating a lookahead mechanism to mitigate myopic decisions. Evaluated on 1,831 quantum circuits from MQTBench, AgentQ, and QUEKO datasets, QAP-Router reduces CNOT gate counts by 15.7%, 30.4%, and 12.1% respectively compared to existing industry compilers.
qubit routingquadratic assignment problemreinforcement learningtransformercnot gate
A Family of Quaternion-Valued Differential Evolution Algorithms for Numerical Function Optimization
The authors introduce Quaternion-Valued Differential Evolution (QDE), a family of novel algorithms extending Differential Evolution (DE) to operate directly in quaternion space. Several mutation strategies are proposed to exploit the algebraic and geometric properties of quaternions. Evaluated on the BBOB benchmark, the QDE variants demonstrate faster convergence and superior performance across multiple function classes compared to traditional real-valued DE, highlighting the potential of quaternion-based optimization in computational intelligence.
quaternion-valued differential evolutionnumerical optimizationmutation strategiesbbob benchmarkquaternion algebra
MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering
MedHopQA introduces a disease-centered multi-hop reasoning benchmark for evaluating large language models (LLMs) in biomedical question answering, addressing limitations of existing benchmarks. The dataset comprises 1,000 expert-curated question-answer pairs requiring synthesis across two distinct Wikipedia articles, with open-ended free-text answers and ontology-grounded synonym sets for evaluation. Constructed through human annotation, triage, iterative verification, and LLM-as-a-judge validation, MedHopQA embeds scored questions within a larger set of 10,000 questions to mitigate contamination risk. The benchmark prioritizes compositional reasoning, saturation resistance, and contamination resistance, providing a reusable framework for future biomedical QA datasets.
multi-hop reasoningbiomedical question answeringontology-groundedcontamination resistancecompositional reasoning
$δ$-mem: Efficient Online Memory for Large Language Models
The paper introduces $\delta$-mem, a lightweight memory mechanism for large language models that enhances a frozen full-attention backbone with a compact online associative memory state. $\delta$-mem compresses historical information into an $8\times8$ state matrix updated via delta-rule learning and generates low-rank corrections to attention computation during generation. Evaluations show $\delta$-mem improves average scores to $1.10\times$ the frozen backbone and $1.15\times$ the strongest non-$\delta$-mem baseline, with larger gains on memory-heavy benchmarks like MemoryAgentBench ($1.31\times$) and LoCoMo ($1.20\times$), while preserving general capabilities.
delta-rule learningassociative memorylow-rank correctionsattention computationmemory-heavy benchmarks
A New Technique for AI Explainability using Feature Association Map
The paper introduces FAMeX, a novel explainable AI (XAI) algorithm based on Feature Association Maps (FAM) that models feature relationships using graph theory. FAMeX outperforms established XAI methods (Permutation Feature Importance and SHAP) in feature importance estimation across eight benchmark datasets, demonstrating superior classification explainability. Experimental results validate FAMeX's efficacy in enhancing AI system transparency through association-based feature interpretation.
explainable aifeature association mapshappermutation feature importancegraph theory
BSO: Safety Alignment Is Density Ratio Matching
The authors introduce Bregman Safety Optimization (BSO), a principled framework for safety alignment in language models that reduces the task to density ratio matching. By decomposing the likelihood ratio of the optimal safe policy and minimizing Bregman divergences between data and model ratios, BSO yields a family of single-stage loss functions induced by convex generators. This approach eliminates the need for auxiliary models, requires only one additional hyperparameter, and subsumes existing safety-aware methods as special cases. Experiments demonstrate that BSO consistently improves the safety-helpfulness trade-off across benchmarks.
bregman safety optimizationdensity ratio matchingsafety alignmentlanguage modelsbregman divergences
Manifold Sampling via Entropy Maximization
The paper introduces MAnifold Sampling via Entropy Maximization (MASEM), a method for sampling from distributions on manifolds with disconnected components defined by smooth constraints. MASEM employs a resampling scheme to maximize empirical distribution entropy using k-nearest neighbor density estimation, achieving exponential KL-divergence reduction in the mean field. Evaluated with local samplers on synthetic and robotics benchmarks, MASEM outperforms alternatives by an order of magnitude in Sinkhorn distance while maintaining competitive runtime.
constrained samplingentropy maximizationmanifold learningk-nearest neighborsinkhorn distance
EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records
EHR-RAGp introduces a retrieval-augmented foundation model for Electronic Health Records (EHR) that dynamically integrates relevant patient history across diverse clinical event types. The model employs a prototype-guided retrieval module to align and estimate the relevance of historical chunks for a given prediction task, addressing challenges like long trajectories, heterogeneous events, and temporal irregularity. EHR-RAGp consistently outperforms state-of-the-art EHR foundation models and transformer-based baselines across multiple clinical prediction tasks. Integration with existing clinical foundation models yields substantial performance gains, providing a scalable and efficient framework for leveraging long-range clinical context.
electronic health recordsretrieval-augmentedprototype-guidedclinical predictionfoundation model
Reinforcing VLAs in Task-Agnostic World Models
RAW-Dream introduces a task-agnostic paradigm for reinforcing Vision-Language-Action (VLA) models by disentangling world model learning from task dependencies. It employs a pre-trained world model for predicting future rollouts and an off-the-shelf Vision-Language Model (VLM) for reward generation, enabling zero-shot inference. A dual-noise verification mechanism is introduced to mitigate world model hallucinations by filtering unreliable rollouts. Experiments across simulation and real-world settings demonstrate consistent performance gains, showing that generalized physical priors can replace costly task-dependent data, offering a scalable approach for VLA adaptation.
vision-language-actionzero-shot inferenceworld modeldual-noise verificationtask-agnostic
Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models
The study proposes a vision-language model (VLM) framework for post-flight safety analysis at non-towered airports, leveraging transcribed Common Traffic Advisory Frequency (CTAF) communications, METAR weather data, ADS-B flight trajectories, and Visual Flight Rules charts. A preliminary evaluation at Half Moon Bay Airport uses Gemini 2.5 Pro for qualitative case studies and benchmarks three open-source (Qwen 2.5-7B, Mistral-7B, Gemma-2-9B) and three closed-source (GPT-4o, GPT-5.4, Claude Sonnet 4.6) LLMs on a synthetic dataset with a 12-category hazard taxonomy. Results show macro F1 scores above 0.85 for binary nominal/danger classification using CTAF and METAR inputs, suggesting VLMs as a promising tool for air traffic safety assessment.
vision-language modelcommon traffic advisory frequencymeteorological aerodrome reportautomatic dependent surveillance-broadcastvisual flight rules
LISA: Cognitive Arbitration for Signal-Free Autonomous Intersection Management
We propose LISA (LLM-Based Intent-Driven Speed Advisory), a signal-free cognitive arbitration framework for autonomous intersection management using large language models (LLMs) to reason over vehicle intents, priority classes, queue pressure, and energy preferences. LISA eliminates dependency on signal infrastructure while addressing LLM inference latency challenges. Evaluated against fixed-cycle control, SCATS, AIM, and GLOSA across varying traffic loads, LISA reduces mean control delay by up to 89.1%, maintains Level of Service C, and decreases mean waiting time by 93% and peak queue length by 60.6% under near-saturated demand. Additionally, it lowers fuel consumption by up to 48.8% and achieves 86.2% intent satisfaction, outperforming non-LLM methods.
autonomous intersection managementlarge language modelscognitive arbitrationsignal-free controlintent-driven reasoning
Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling
The paper proposes a transferable delay-aware reinforcement learning method using implicit causal graph modeling to address action-effect propagation challenges in delayed feedback scenarios. The method employs a field-node encoder for latent state representation with node-level semantics and a message-passing mechanism to capture dynamic causal dependencies, enabling transferable structured representations. Imagination-driven behavior learning and latent space planning facilitate cross-task knowledge transfer. Experiments on DMC continuous control tasks with random delays show superior performance over baselines, with cross-task transfer demonstrating accelerated policy adaptation.
reinforcement learningcausal graph modelinglatent state representationmessage-passing mechanismcross-task transfer
KAN-CL: Per-Knot Importance Regularization for Continual Learning with Kolmogorov-Arnold Networks
KAN-CL introduces a continual learning framework leveraging the compact-support spline parameterization of Kolmogorov-Arnold Networks (KANs) to mitigate catastrophic forgetting through per-knot importance regularization. The method combines a KAN classification head with standard EWC regularization on a convolutional backbone (bbEWC), achieving 88% and 93% reductions in forgetting over a KAN-only baseline on Split-CIFAR-10/5T and Split-CIFAR-100/10T benchmarks, respectively, while maintaining or surpassing baseline accuracy. Neural Tangent Kernel (NTK) analysis reveals that KAN's spline locality induces a structural rank deficit in the cross-task NTK, providing a forgetting bound applicable even in the feature-learning regime.
catastrophic forgettingkolmogorov-arnold networksneural tangent kernelper-knot regularizationcontinual learning
Executable Agentic Memory for GUI Agent
The paper introduces Executable Agentic Memory (EAM), a Knowledge Graph (KG)-based framework that replaces fragile step-wise GUI agent planning with robust retrieval-and-execution. EAM employs state-aware DFS and action-group mining for memory construction, coupled with a Q-function-guided Monte Carlo Tree Search (MCTS) for efficient KG traversal. Theoretical analysis proves bias-consistency and sample complexity bounds. Empirical results show EAM outperforms UI-TARS-7B by 19.6% on AndroidWorld while reducing token costs 6× versus GPT-4o, achieving 2.8s average latency for reliable long-horizon automation.
executable agentic memoryknowledge graphmonte carlo tree searchaction-group miningbias-consistency
PriorZero: Bridging Language Priors and World Models for Decision Making
PriorZero introduces a unified framework that integrates Large Language Model (LLM) priors into world-model-based planning for Reinforcement Learning (RL) agents, addressing the prior-dynamics mismatch in long-horizon tasks. The method employs a decoupled rollout-training design: during rollout, LLM priors are injected at the root node of Monte Carlo Tree Search (MCTS) to guide exploration; during training, world-model learning and LLM adaptation are decoupled, enabling stable fine-tuning via alternating optimization. Experiments on benchmarks like Jericho and BabyAI demonstrate improved exploration efficiency and asymptotic performance, validating the framework's effectiveness in LLM-empowered decision-making.
reinforcement learningmonte carlo tree searchworld modellarge language modelfine-tuning
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TokenRatio introduces Token-level Bregman Preference Optimization (TBPO), a method for principled token-level preference optimization via ratio matching, addressing limitations of Direct Preference Optimization (DPO) which operates at the sequence level. TBPO posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, deriving a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving optimal policy induction. Two instantiations are proposed: TBPO-Q, which learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Experiments across instruction following, helpfulness/harmlessness, and summarization benchmarks demonstrate improved alignment quality, training stability, and output diversity compared to sequence-level and token-level baselines.
token-level preferencebregman-divergenceratio matchingoptimal policyadvantage normalization
Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction
The authors introduce Set-Aggregated Genome Embeddings (SAGE) for predicting community-level microbiome abundance profiles from raw DNA sequences. The method leverages genomic language models (GLMs) and their few-shot learning capabilities, aggregating genome embeddings at the community level. Benchmarking demonstrates improved generalization on novel genomes compared to classical bioinformatics approaches. Model ablation reveals that community-level latent representations directly enhance performance. The study also highlights the benefits of intermediate transformations between latent representations and compares different GLM embedding choices.
set-aggregated genome embeddingsmicrobiome abundance predictiongenomic language modelsfew-shot learninglatent representations
Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance
This paper contributes a case study of iterative, agent-driven auditing for prompt-specification quality assurance in LLM-managed multi-agent systems, focusing on AEGIS, a production seven-lane orchestration pipeline with 7150 lines of prompt specifications. Nine sequential audit rounds, executed by Claude sub-agents using a checklist-driven walkthrough, identified 51 prompt-specification consistency defects, distinct from adversarial code findings. Defect counts per round were 15, 8, 12, 2, 8, 1, 4, 1, and 0, showing non-monotonic convergence. The study proposes a seven-category defect taxonomy, an audit protocol, and a final locked checklist for reproducibility.
prompt-specificationmulti-agent systemsiterative auditingdefect taxonomyclaude sub-agents
NARA: Anchor-Conditioned Relation-Aware Contextualization of Heterogeneous Geoentities
NARA introduces a self-supervised framework for learning context-dependent representations of vector geoentities by jointly modeling semantics, geometry, and spatial relations. Unlike existing methods, NARA captures unified spatial context across heterogeneous geoentities (points, polylines, polygons) by incorporating relational spatial structure beyond proximity alone. The framework leverages neural anchor-conditioned relation-aware representation learning to enable rich contextualized representations. Evaluations on building function classification, traffic speed prediction, and next point-of-interest recommendation tasks demonstrate consistent improvements over prior methods, underscoring the effectiveness of unified relational modeling for vector geospatial data.
vector geoentitiesself-supervised learningspatial relationsneural anchor-conditioningheterogeneous geoentities
How Useful Is Cross-Domain Generalization for Training LLM Monitors?
The study investigates cross-domain generalization in prompted language models for classification tasks, demonstrating partial generalization to adjacent domains and improved performance on unseen tasks. Training on multiple classification tasks with distinct prompts enhances robustness, though edge cases persist where models fail to adapt to entirely new prompts within the same domain. Combining classification training with general instruction following mitigates these failures while retaining classification benefits. Notably, supervised 'no-thinking' classification training generalizes to 'with-thinking' tasks like summarization, suggesting its utility in developing diverse classifiers and monitoring systems.
cross-domain generalizationprompted language modelsinstruction followingclassification traininggeneralization failures
Reconnecting Fragmented Citation Networks with Semantic Augmentation
This work introduces a hybrid framework for reducing fragmentation in citation networks by integrating citation topology with LLM-based text similarity. The method augments the original graph by adding semantic edges between disconnected components and weighting existing citations based on textual similarity, using 662,369 Web of Science publications in Mathematics and Operations Research & Management Science. Semantic augmentation significantly reduces fragmentation while maintaining disciplinary homogeneity, and cluster detection via the Leiden algorithm preserves structural interpretability with multi-scale organization. The approach scales efficiently to large datasets, enhancing citation-based indicators without collapsing disciplinary boundaries.
citation networkssemantic augmentationtext similarityleiden algorithmdisciplinary homogeneity
Missingness-MDPs: Bridging the Theory of Missing Data and POMDPs
The paper introduces missingness-MDPs (miss-MDPs), a POMDP subclass integrating missing data theory, where observations follow missingness functions classifying features as MCAR, MAR, or MNAR. The authors present PAC algorithms leveraging missingness-type structural properties to learn these functions from action-observation trajectories, enabling planning via off-the-shelf methods. Theoretical guarantees show ε-optimal policies in the true miss-MDP with high probability, empirically outperforming model-free POMDP baselines.
missingness-mdpspomdpsmissing datapac learningplanning
Why Conclusions Diverge from the Same Observations: Formalizing World-Model Non-Identifiability via an Inference
The paper formalizes non-identifiability in inference and learning as the root cause of divergent conclusions from shared observations, rather than cognitive defects. It introduces a two-level framework: (i) θ-level non-identifiability, where inference settings (Reference, Exploration, Stabilization, Horizon) vary under the same world model W; and (ii) W-level non-identifiability, where repeated inference biases data exposure and updates, causing W to diverge. The analysis shows how disagreements project onto abstract/concrete, externalizability, and order/freedom bases due to computational, observational, and coordination constraints. The framework connects to deep representation learning, illustrated via AI regulation debates.
non-identifiabilityinference profileworld modelrepresentation learninglatent-state estimation
Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs
The authors propose a multilingual speech correction pipeline leveraging large language models (LLMs) to address disfluencies in Automatic Speech Recognition (ASR) transcripts. Their method combines a sequence tagger for disfluent token detection with instruction fine-tuning of an LLM, enhanced by a contrastive learning objective that penalizes disfluent token reproduction while preserving grammatical integrity. Experiments across Hindi, Bengali, and Marathi demonstrate consistent improvements over multilingual sequence-to-sequence baselines, highlighting the insufficiency of detection-only approaches. The approach offers a scalable solution for multilingual disfluency correction in speech-driven NLP systems, with code publicly available.
automatic speech recognitiondisfluency correctioninstruction fine-tuningcontrastive learningmultilingual sequence-to-sequence
No Action Without a NOD: A Heterogeneous Multi-Agent Architecture for Reliable Service Agents
We propose NOD (Navigator-Operator-Director), a heterogeneous multi-agent architecture to enhance the reliability of LLM-based service agents in long-horizon tasks. NOD externalizes a structured Global State for explicit task tracking and introduces selective external oversight via a Director agent to verify critical actions, mitigating error propagation and unsafe behavior. Evaluated on $τ^2$-Bench, NOD achieves higher task success rates and critical action precision compared to baselines, significantly reducing policy violations, tool hallucinations, and user-intent misalignment.
multi-agent architectureglobal stateexternal oversighttask success rateerror propagation
Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study
This systematic study evaluates pretraining strategies and scaling effects for ECG foundation models, comparing five self-supervised learning objectives on datasets up to 11M samples. Methods include contrastive and non-contrastive approaches, with architectures spanning transformers, CNNs, and structured state space models. Results show contrastive predictive coding outperforms other objectives, particularly JEPA, in transferability across clinical tasks. Scaling pretraining data improves performance up to 11M samples, while structured state space models demonstrate superior representation learning, attributed to their strong inductive biases over pretraining scale alone.
contrastive predictive codingstructured state space modelsself-supervised learningelectrocardiographyinductive biases
Harness Engineering as Categorical Architecture
The paper formalizes agent harness engineering through categorical architecture, proposing the triple (G, Know, Phi) from ArchAgents as a theoretical foundation. Memory, Skills, Protocols, and Harness Engineering map to coalgebraic state, operad-composed objects, syntactic wiring, and Architecture respectively. Structural guarantees like integrity gates and convergence checks are preserved via compiler functors targeting Swarms, DeerFlow, Ralph, and LangGraph. Validation shows certificate preservation across configurations, with LangGraph enabling native observability. An escalation experiment confirms model-parametric quality control in multi-agent tasks.
categorical architectureagent harnesscoalgebraic stateoperad-composed objectssyntactic wiring
TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning
The authors propose TMRL, a unified framework bridging behavioral cloning (BC) pre-training and reinforcement learning (RL) fine-tuning for robot policies. Their Context-Smoothed Pre-training (CSP) method injects forward-diffusion noise into policy inputs, creating a continuum between precise imitation and broad action coverage. During fine-tuning, Timestep-Modulated Reinforcement Learning (TMRL) enables dynamic adjustment of diffusion timestep conditioning, granting explicit control over exploration. The approach integrates with arbitrary policy inputs (states, 3D point clouds, image-based VLA policies) and improves RL fine-tuning sample efficiency. TMRL achieves successful real-world fine-tuning on complex manipulation tasks in under one hour.
behavioral cloningreinforcement learningforward-diffusion noisetimestep modulationsample efficiency
No More, No Less: Task Alignment in Terminal Agents
The paper introduces TAB (Task Alignment Benchmark), a suite of 89 terminal tasks designed to evaluate agents' ability to selectively use relevant environmental cues while ignoring distractors. Derived from Terminal-Bench 2.1, TAB tasks are intentionally underspecified, requiring agents to interpret embedded cues in natural artifacts. Evaluation of ten frontier agents reveals a systematic gap between task capability and task alignment, with the strongest Terminal-Bench agent achieving high task completion but low task alignment. Analysis of six prompt-injection defenses shows that suppressing distractors also suppresses necessary cues, highlighting the need for selective instruction use in task-aligned agents.
task alignment benchmarkterminal agentsenvironmental cuesprompt-injection defensestask capability
TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion
TriBand-BEV introduces a real-time LiDAR-only 3D pedestrian detection method using a novel bird's eye view (BEV) encoding with three height bands, reformulating 3D detection as 2D. The approach employs area attention, hierarchical bidirectional feature fusion (P1-P4), and distribution focal learning for oriented box prediction, with vertical rebinning and reflectance jitter for robustness. On KITTI, it achieves 58.7/52.6/47.2 BEV AP(%) for pedestrian detection (easy/moderate/hard) at 49 FPS, outperforming Complex-YOLO by +12.6/+7.5/+3.1%. The method includes an IQR filter for outlier removal and demonstrates stable occlusion handling.
bird's eye viewlidar3d detectionfeature fusionreal-time
Heterogeneous SoC Integrating an Open-Source Recurrent SNN Accelerator for Neuromorphic Edge Computing on FPGA
The authors present a heterogeneous System-on-Chip (SoC) integrating ReckOn, an open-source recurrent Spiking Neural Network (SNN) accelerator, with traditional processors (RISC-V-based X-HEEP and ARM) for neuromorphic edge computing on FPGA. The design validates functional equivalence with the taped-out ReckOn version through FPGA implementation, maintaining classification accuracy while offering a cost-effective alternative to custom silicon. Experimental evaluation demonstrates online learning capabilities on a Braille digit dataset subset, benchmarking against existing neuromorphic platforms. The work addresses prohibitive ASIC costs by leveraging FPGA programmability for flexible, open-source neuromorphic hardware development.
spiking neural networksneuromorphic computingfpga acceleratorsystem-on-chiponline learning
Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
The paper introduces Goal-Mem, a goal-oriented reasoning framework for RAG-based agentic memory in conversational LLMs that addresses limitations in multi-hop and commonsense reasoning. The method performs backward chaining from user utterances, decomposing goals into atomic subgoals and using targeted memory retrieval with Natural Language Logic for verifiable reasoning. Experiments on two datasets against nine baselines demonstrate consistent performance improvements, particularly in multi-hop reasoning and implicit inference tasks.
rag-based memorymulti-hop reasoningbackward chainingnatural language logicagentic llms
Self-Supervised Laplace Approximation for Bayesian Uncertainty Quantification
The authors propose Self-Supervised Laplace Approximation (SSLA), a method for directly approximating posterior predictive distributions without computing parameter posteriors. Inspired by self-training in self-supervised learning, SSLA quantifies predictive uncertainty by refitting models on self-predicted data, yielding a deterministic, sampling-free approximation. An approximate variant, ASSLA, reduces computational costs by avoiding expensive refitting. Theoretical and empirical evaluations across Bayesian linear models and neural networks demonstrate superior predictive calibration compared to classical Laplace approximations, while maintaining computational efficiency, as validated on simulated and real-world regression tasks.
posterior predictive distributionself-supervised learningbayesian uncertainty quantificationlaplace approximationpredictive calibration
Not How Many, But Which: Parameter Placement in Low-Rank Adaptation
The paper investigates the parameter placement problem in Low-Rank Adaptation (LoRA), focusing on which k trainable entries in the B matrix (with A frozen) impact performance. Under supervised fine-tuning (SFT), random and informed parameter subsets achieve comparable results, while gradient-informed placement is crucial for Generalized Reward-Penalty Optimization (GRPO) to recover standard LoRA accuracy. This divergence stems from gradient structure: SFT gradients are low-rank and stable, enabling coherent updates from any subset, whereas GRPO gradients are high-rank and orthogonal, requiring consistently signed gradients. A scoring procedure identifies critical parameters in under 10 seconds at <0.5% training cost, revealing concentration on residual-stream-writing projections (V, O, Down) across model families and scales (1.5B-8B).
low-rank adaptationparameter placementsupervised fine-tuninggradient-informed placementresidual-stream-writing projections
Uncertainty Quantification for LLM-based Code Generation
We propose RisCoSet, a method for uncertainty quantification in LLM-based code generation that addresses limitations of prior PAC prediction sets. RisCoSet leverages multiple hypothesis testing to construct risk-controlling prediction sets represented by partial programs, guaranteeing correct solutions with high confidence. The approach accommodates multiple valid outputs inherent to code generation, overcoming the single-label classification framework and monotonicity constraints of previous work. Experiments on three LLMs demonstrate RisCoSet's effectiveness, reducing code removal by up to 24.5% at equivalent risk levels compared to state-of-the-art methods.
uncertainty quantificationllm-based code generationrisk-controlling prediction setsmultiple hypothesis testingpartial programs
Overtrained, Not Misaligned
This study provides the most comprehensive analysis to date of emergent misalignment (EM) in fine-tuned language models, demonstrating that EM is not universal but correlates with model size and emerges late in training. The authors evaluate 12 open-source models (8B to 671B parameters) across 4 families (Llama, Qwen, DeepSeek, GPT-OSS), analyzing over one million responses with multiple random seeds. Results show EM replicates in GPT-4o but occurs consistently in only 17% of models, with a strong size-EM correlation (r = 0.90). Practical mitigations include early stopping, which eliminates EM while retaining 93% task performance, and careful learning rate selection. Cross-domain validation confirms these findings generalize, particularly in medical fine-tuning.
emergent misalignmentfine-tuningearly stoppingcross-domain validationtask convergence
Mitigating Context-Memory Conflicts in LLMs through Dynamic Cognitive Reconciliation Decoding
The paper introduces Dynamic Cognitive Reconciliation Decoding (DCRD), a two-stage method for mitigating context-memory conflicts in large language models (LLMs). DCRD first predicts conflicts via attention map analysis, then routes inputs to either greedy decoding or context fidelity-based dynamic decoding. The approach maintains performance in conflict-free scenarios while resolving knowledge conflicts. Evaluated on the new ConflictKG benchmark and six QA datasets across four LLMs, DCRD achieves state-of-the-art results compared to existing baselines.
context-memory conflictsdynamic decodingattention mapparametric knowledgeknowledge conflict
DriftXpress: Faster Drifting Models via Projected RKHS Fields
DriftXpress accelerates drifting models for one-step generative modeling by approximating the drifting kernel in a low-rank feature space via projected RKHS fields. The method preserves the attraction-repulsion structure of the original drifting field while reducing computational costs during training. Evaluated on image-generation benchmarks, DriftXpress maintains comparable FID scores to standard drifting models while significantly decreasing wall-clock training time, demonstrating improved efficiency without sacrificing one-step inference advantages.
drifting modelsone-step generative modelingprojected rkhs fieldslow-rank approximationfid scores
MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification
MolDeTox introduces a novel benchmark for molecular detoxification, addressing limitations in existing toxicity repair benchmarks such as limited data diversity, low structural validity, and reliance on proxy models for toxicity assessment. The benchmark enables fine-grained evaluation of toxicity-aware molecular optimization across stepwise tasks. General-purpose Large Language Models (LLMs) and Vision Language Models (VLMs) are evaluated under diverse settings, demonstrating that fragment-level understanding and generation improves structural validity and molecular quality. Detailed task-level performance analysis provides interpretable insights into the detoxification process. The dataset is publicly available.
molecular detoxificationtoxicity repairstructural validityfragment-level generationbenchmark evaluation
A Deep Learning-based Receiver for Asynchronous Grant-Free Random Access in Control-to-Control Networks
The paper introduces a deep learning-based receiver for asynchronous grant-free control-to-control (C2C) networks, addressing uncoordinated transmissions over shared channels. A convolutional neural network (CNN) detects command unit boundaries (start/tail sequences) directly from the received signal, leveraging LDPC-coded payloads and channel estimates for tail-sequence detection. Successive interference cancellation (SIC) improves decoding post-boundary identification. Simulations demonstrate reliable packet-boundary detection and low packet loss rates under high-traffic conditions.
grant-freeldpccnnsicasynchronous
Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics
The paper argues that enterprise systems require runtime discovery of transition dynamics rather than relying solely on offline-trained world models, which degrade under deployment shift. It introduces enterprise discovery agents that read system configurations at inference time, and CascadeBench, a benchmark for evaluating cascade prediction in synthetic environments. Empirical results show discovery-based agents outperform traditional world models under dynamic shifts by grounding predictions in current instance logic.
world modelsenterprise systemsdeployment shiftruntime discoverycascadebench
Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete
Premover introduces a lightweight module for Vision-Language-Action (VLA) policies that enables acting during instruction input rather than waiting for completion. The method freezes the VLA backbone and adds two small projection heads (image patches and language tokens) mapping to a shared space, supervised by target-object segmentation masks. A readiness threshold determines when to act. On LIBERO, Premover reduces wall-clock time by 13.6% (34.0s to 29.4s) while maintaining 95.1% success rate versus the full-prompt baseline, outperforming naive premoving (66.4%).
vision-language-actionprecomputationsegmentation masksreadiness thresholdprojection heads
ALGOGEN: Tool-Generated Verifiable Traces for Reliable Algorithm Visualization
ALGOGEN introduces a decoupled paradigm for reliable Algorithm Visualization (AV) by separating algorithm execution from rendering. The method employs Visualization Trace Algebra (VTA) to model algorithm states and operations, generating VTA-JSON traces via a Python tracker. Rendering is templatized using a Rendering Style Language (RSL), compiled deterministically into Manim, LaTeX/TikZ, or Three.js outputs. Evaluated on a LeetCode AV benchmark of 200 tasks, ALGOGEN achieves a 17.3% improvement in success rate (99.8% vs. 82.5%) over end-to-end methods, effectively mitigating LLM hallucinations and enhancing AV reliability.
visualization trace algebravta-jsonrendering style languagemanimllm hallucinations
MM-OptBench: A Solver-Grounded Benchmark for Multimodal Optimization Modeling
We introduce MM-OptBench, a solver-grounded benchmark for multimodal optimization modeling that evaluates the ability of multimodal large language models (MLLMs) to construct mathematical formulations and solver-executable code from text-and-visual problem specifications. The framework generates 780 solver-verified instances across 6 optimization families, 26 subcategories, and 3 difficulty levels, ensuring structured inputs and reference files are derived from verified sources. Evaluations of 9 MLLMs (6 general-purpose, 3 math-specialized) reveal significant challenges: the top models achieve 52.1% and 51.3% pass@1, while math-specialized models solve 0/780 instances. Errors stem from data extraction and formulation/code generation. MM-OptBench establishes a testbed for solver-grounded multimodal intelligence.
multimodal optimizationsolver-groundedlarge language modelsmathematical formulationpass@1
CIDR: A Large-Scale Industrial Source Code Dataset for Software Engineering Research
The Curated Industrial Developer Repository (CIDR) introduces a large-scale dataset of 2,440 proprietary software repositories totaling 373 million lines of code across 138 programming languages, collected through collaboration with 12 industrial partners. The dataset was constructed via a multi-stage pipeline involving structured partner onboarding, automated metadata filtering, manual code review, and deterministic anonymization of full version control histories. CIDR exclusively contains proprietary production codebases from domains including enterprise web/mobile development, fintech, and custom software consultancy, distinguishing it from open-source code corpora. The dataset supports research in code intelligence, software quality analysis, code language model training, developer behavior studies, and agent evaluation benchmarks, available under a restricted commercial license.
curated industrial developer repositoryproprietary codebasesdeterministic anonymizationversion control historycode intelligence
BoolXLLM: LLM-Assisted Explainability for Boolean Models
BoolXLLM introduces a hybrid framework integrating Large Language Models (LLMs) into Boolean rule-based classifiers to enhance explainability. The method employs LLMs at three stages: feature selection for domain-relevant variables, threshold recommendation for numerical feature discretization, and rule compression for natural language explanations at global and local levels. This approach combines symbolic reasoning with language-based models to bridge formal explanations with human-understandable narratives. Empirical results indicate improved interpretability while maintaining competitive predictive performance, demonstrating the potential of LLM-assisted pipelines in explainable AI systems.
boolean modelslarge language modelsfeature selectionrule compressionexplainable ai
Rollout Cards: A Reproducibility Standard for Agent Research
We propose rollout cards, a reproducibility standard for agent research that preserves rollout records and declares reporting rules, addressing inconsistencies in task-success rates, cost/token accounting, and timing measurements. Through a structured audit of 50 repositories, we identify 37 cases where reporting rules alter outcomes and demonstrate that none report run failures or errors alongside headline scores. Validation in four public releases and re-grading benchmarks shows that reporting rules alone can change scores by 20.9 absolute percentage points and invert model rankings. A reference implementation integrated into Ergon is released, with rollout-card exports for benchmarks in tool use, software engineering, and multi-agent coordination.
rollout cardsreporting rulestask-success ratescost/token accountingbenchmark re-grading
It's Not the Size: Harness Design Determines Operational Stability in Small Language Models
This paper demonstrates that harness engineering significantly impacts the operational stability of small language models (SLMs) independent of model size. Through systematic experimentation with three SLMs (Gemma4 E2B, Qwen3.5:2B, LLaMA 3.2 3B) across 24 tasks, the authors evaluate three harness conditions: model-only, minimal-shell, and a 4-stage pipeline (plan->execute->verify->recover). The pipeline harness achieves optimal performance (TSR=0.952, VTSR=1.000) on Gemma4 E2B, with planning and recovery each contributing ~24.7% to total gains. Notably, scaffold collapse is observed in LLaMA 3.2 3B without harness support, yielding TSR=0.429 due to JSON structure violations.
harness engineeringscaffold collapsetask success ratesmall language modelsverification catch rate
Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning
SAEParate introduces concept-specific clustering in sparse autoencoder (SAE)-based unlearning for text-to-image diffusion models, addressing shared latent features across concepts. The method employs a concept-aware contrastive objective and enhances the encoder with a GeLU-based nonlinear transformation to achieve a more discriminative and disentangled latent space. Evaluated on UnlearnCanvas, SAEParate demonstrates state-of-the-art performance, particularly excelling in joint style-object unlearning by reducing interference between target and non-target concepts.
sparse autoencoderconcept-aware contrastivegelu transformationlatent spacetext-to-image diffusion
To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands
This study investigates principal hierarchies in language models under high-stakes competing demands, revealing inconsistent alignment across domains and model families. The authors evaluate ten frontier models across 7,136 scenarios in legal and medical domains, testing adherence to professional standards when user instructions conflict with institutional or normative demands. Results show frequent failures to uphold professional standards during task execution, primarily through knowledge omission, even when models demonstrate relevant knowledge internally. Alignment hierarchies prove unstable across contexts and inconsistent across models, suggesting current alignment methods are insufficient for high-stakes professional deployments.
principal hierarchiesknowledge omissionalignment methodstask executionprofessional standards
Adaptive Multi-Round Allocation with Stochastic Arrivals
The paper presents an adaptive multi-round resource allocation framework for stochastic network recruitment, where budget-constrained resources exhibit diminishing returns. The authors derive an exact greedy solution for single-round allocations via marginal survival probabilities, then address intractability in multi-round settings through a population-level surrogate value function. This enables polynomial-time dynamic programming using truncated probability generating functions. Theoretical analysis provides robustness guarantees under model misspecification, decomposing error into frontier and transition components. Empirical validation demonstrates effectiveness in real-world-inspired recruitment scenarios.
stochastic arrivalsdynamic programmingdiminishing returnssurrogate value functionprobability generating functions
Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization
The paper introduces DIPS, a framework fine-tuning large language models (LLMs) as amortized Pareto-front generators for constrained bi-objective convex optimization. DIPS employs Numerically Grounded Token Initialization, a compact discretization scheme, and Three-Phase Curriculum Optimization to align structural validity, feasibility, and Pareto-front quality. A fine-tuned 7B-parameter model achieves normalized hypervolume ratios of 95.29% to 98.18% across five families of problems, solving instances in as little as 0.16 seconds with vLLM-accelerated inference. Results demonstrate LLMs' effectiveness in continuous Pareto-front approximation.
pareto-frontconstrained optimizationlanguage modelshypervolume ratiocurriculum optimization
Autonomy and Agency in Agentic AI: Architectural Tactics for Regulated Contexts
This work introduces a two-dimensional design space for agentic AI in regulated contexts, coupling agency (what the system can do) and autonomy (how much it acts without human involvement). Both dimensions are organized into five operational levels, ranging from human-commanded operation (L1) to fully autonomous monitoring (L5) for autonomy, and from reasoning over supplied context (L1) to committed writes to authoritative records (L5) for agency. Six architectural tactics—checkpoints, escalation, multi-agent delegation, tool provisioning, tool fencing, and write staging—are proposed to navigate this space, grounded in public-sector examples. Five deployment parameters—model capability, agent architecture, tool fidelity, workflow bottlenecks, and evaluation—are examined to shape achievable configurations independently of agency and autonomy.
agentic aiautonomyagencyarchitectural tacticsdeployment parameters
Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems
The paper introduces a systems-level data model for durable intermediate artifacts in agentic AI systems, addressing the ephemeral nature of intermediate work in multi-step, revisable tasks. The model formalizes intermediate artifacts as typed, structured, versioned, and dependency-aware entities, distinct from chat transcripts or hidden chain-of-thought. It specifies additive and superseding update semantics with explicit current-state resolution and emphasizes artifact lineage for durable state maintenance. The approach aims to enhance inspectability, revisability, and maintainability of AI-generated work, shifting evaluation focus from final-output quality to maintained-state quality.
intermediate artifactsagentic systemsupdate semanticsartifact lineagemaintained-state quality
Learning What Matters: Adaptive Information-Theoretic Objectives for Robot Exploration
The paper introduces Quasi-Optimal Experimental Design (QOED), an adaptive information-theoretic objective for robot exploration that addresses challenges in parameter learnability. QOED employs eigenspace analysis of the Fisher information matrix to identify observable subspaces and suppress nuisance parameters, providing a constant-factor approximation to ideal exploration objectives. Evaluated on navigation and manipulation tasks, QOED achieves performance improvements of 35.23% and 21.98% in identifiable-direction selection and nuisance suppression, respectively, and enhances model-based policy optimization over RL baselines.
fisher informationparameter identifiabilityrobot explorationoptimal experimental designnuisance suppression
Property-Level Reconstructability of Agent Decisions: An Anchor-Level Pilot Across Vendor SDK Adapter Regimes
The study evaluates property-level reconstructability of agent decisions across six vendor SDK regimes using an unmodified Decision Trace Reconstructor. Analyzing pinned worked-example anchors, it classifies Decision Event Schema (DES) properties into four reconstructability categories (fully fillable to opaque). Results show strict-governance-completeness tiers ranging from 42.9% to 85.7%, identifying one regime-independent gap (reasoning trace), four regime-dependent gaps, and one Mixed property. The single-annotator pilot study provides checksum-verifiable outputs via a deposited reproducibility package.
decision trace reconstructorvendor sdk regimesproperty-level reconstructabilitydecision event schemastrict-governance-completeness
The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments
The paper introduces GAP, a novel dataset of jigsaw puzzles featuring synthetic, eroded fragments with unrestricted shapes, derived from real-world archaeological artifacts. It proposes PuzzleFlow, a Vision Transformer (ViT) and Flow-Matching based framework for solving such puzzles, outperforming existing methods on the GAP dataset. The approach addresses limitations of prior work constrained to square pieces, demonstrating superior performance through advanced architectural and computational techniques.
jigsaw puzzlesarchaeological fragmentsvision transformerflow-matchingdataset
The Deepfakes We Missed: We Built Detectors for a Threat That Didn't Arrive
This position paper identifies a misalignment between deepfake research priorities and observed real-world harms, arguing that the field's focus on public-figure manipulation (2017-2019 threat model) has overlooked dominant emerging threats. Through empirical analysis of 2022-2026 incidents, the authors demonstrate that non-consensual intimate imagery (NCII), voice-clone scams, and emotional-manipulation fraud constitute primary harms, while predicted large-scale misinformation failed to materialize. The paper attributes this gap to structural research inertia, proposes rebalancing efforts toward under-defended harm categories, and outlines three concrete technical research agendas to address current threats.
deepfake detectionnon-consensual intimate imageryvoice-clone scamsthreat modelingmisinformation defense
Clausal Deletion Backdoors for QBF: a Parameterized Complexity Approach
The paper introduces clause covering (CC) backdoors as a new parameterized approach for solving quantified Boolean formulas (QBF), focusing on tractable base classes (Horn, 2-CNF, linear equations). It establishes W[1]-hardness for Horn backdoors but proves fixed-parameter tractability (FPT) for 2-CNF and linear equations via propagation and Gaussian elimination techniques. The work identifies a key missing case for a complete dichotomy, advancing theoretical understanding of QBF solvers in parameterized complexity.
quantified boolean formulasparameterized complexityclause covering backdoorsfixed-parameter tractabilitygaussian elimination
Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction
The paper identifies and addresses the missing-old-logit problem in asynchronous reinforcement learning for large language model agents, where delayed updates and partial rollouts lead to semantic entanglement between training--inference discrepancy and policy-staleness correction. Three exact old-logit acquisition strategies are proposed: snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption, alongside an approximate correction method using revised PPO-EWMA. The revised PPO-EWMA method demonstrates significant improvements in both training speed and optimization performance.
asynchronous reinforcement learningoff-policy correctionppo-ewmamissing-old-logit problemsemantic entanglement
Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection
The paper introduces AVA-DINO, an anomaly-aware vision-language adaptation framework for zero-shot anomaly detection that leverages asymmetric distributions of normal and anomalous data. The method employs dual specialized branches for normal and anomalous patterns, trained jointly with text-guided routing and regularization to ensure branch specialization. At inference, it dynamically combines branches using image inputs and predefined language descriptions. Evaluated across nine benchmarks, AVA-DINO achieves 93.5% image-AUROC on MVTec-AD and demonstrates robust cross-domain generalization to medical imaging without domain-specific tuning.
zero-shotanomaly detectionvision-languageasymmetric distributionsdynamic routing
SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
SAGE introduces a Self-evolving Agentic Graph-memory Engine to address long-term memory bottlenecks in language agents by modeling graph memory as a dynamic substrate. The framework combines a memory writer, which incrementally constructs structured graph memory from interaction histories, with a Graph Foundation Model-based memory reader for retrieval and feedback. Evaluations on multi-hop QA, open-domain retrieval, domain-specific review QA, and long-term agent-memory benchmarks demonstrate improved evidence recovery, answer grounding, and retrieval efficiency. SAGE achieves the best average rank on multi-hop QA after two self-evolution rounds and reaches 82.5/91.6 Recall@2/5 on NQ in zero-shot open-domain transfer.
graph memoryself-evolvingmulti-hop qaretrieval efficiencyhallucination-diagnostic
Hölder Policy Optimisation
The paper introduces Hölder Policy Optimisation (HölderPO), a generalized framework for policy optimization in large language models that unifies token-level probability aggregation via the Hölder mean. By modulating the parameter $p$, the method dynamically balances gradient concentration and variance bounds, addressing limitations of fixed aggregation mechanisms in Group Relative Policy Optimisation (GRPO). The approach includes a dynamic annealing algorithm to schedule $p$ across training. Evaluations show HölderPO achieves 54.9% accuracy on mathematical benchmarks, a 7.2% improvement over GRPO, and 93.8% success rate on ALFWorld.
hölderpopolicy optimizationgradient concentrationdynamic annealingalfworld
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
OmniRefine introduces a training-free two-stage framework for efficient audio-visual token compression in omnimodal large language models (Omni-LLMs), addressing inference cost challenges. The method first refines native chunk boundaries into cross-modally aligned compression units via frame-audio similarity and dynamic programming (Correspondence-Preserving Chunk Refinement). Second, it jointly compresses video and audio tokens within each unit to reduce redundancy while preserving critical evidence (Modality-Aware Cooperative Compression). Experiments demonstrate OmniRefine achieves superior efficiency-performance trade-offs, maintaining 46.7% accuracy on WorldSense at a 44% token retention ratio, nearly matching full-token performance.
omnimodal large language modelstoken compressioncross-modal alignmentdynamic programmingmodality-aware cooperative compression
Scaling Laws and Tradeoffs in Recurrent Networks of Expressive Neurons
The paper introduces the ELM Network, a recurrent architecture with Expressive Leaky Memory (ELM) neurons, designed to explore optimal parameter allocation between unit count ($N$), per-unit complexity ($k_e$), and connectivity ($k_c$). ELM neurons emulate cortical functionality, enabling stable training across scales. Evaluated on the SHD-Adding task and Enwik8 character-level language modeling, performance improves monotonically along each axis, with a non-trivial tradeoff optimum under fixed budgets. Larger budgets favor more complex neurons. An information-theoretic model explains diminishing returns via signal-to-noise saturation and redundancy. Scaling laws trace a near-Pareto frontier, challenging the default use of simple units in machine learning.
elm networkexpressive leaky memoryparameter allocationscaling lawssignal-to-noise saturation
Rethink the Role of Neural Decoders in Quantum Error Correction
This work reevaluates neural decoders for quantum error correction (QEC) in surface codes, focusing on accuracy-latency tradeoffs for code distances up to d=9 (161 physical qubits). The authors unify and redesign neural decoders into five architectural paradigms and develop an end-to-end compression pipeline for FPGA deployment. Key findings include: (i) decoding performance is more dependent on data scale than architectural complexity, (ii) appropriate inductive bias is crucial for high accuracy, and (iii) INT4 quantization is necessary to meet microsecond latency requirements. These insights provide actionable guidance for scalable, real-time neural QEC decoding.
quantum error correctionneural decoderssurface codesfpgainductive bias
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
The paper introduces OmniClean, a visually debiased evaluation subset (8,551 queries from 16,968) for omni-modal language models, addressing benchmark inflation from visual shortcuts. It proposes OmniBoost, a three-stage post-training method for Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR (reinforcement learning with visual rewards), and SFT on self-distilled data. Results show RLVR drives broad improvements, while self-distillation reshapes performance profiles, enabling the 3B model to match Qwen3-Omni-30B-A3B-Instruct without stronger teacher supervision.
omni-modalvisual debiasingreinforcement learningself-distillationpost-training
Spectral Vision Transformer for Efficient Tokenization with Limited Data
The paper introduces a spectral vision transformer (ViT) architecture optimized for efficient tokenization in data-scarce scenarios, particularly medical imaging. The method leverages spectral projections to achieve spatial invariance and optimal signal-to-noise ratio, reducing computational complexity compared to spatial ViTs. Evaluations demonstrate competitive or superior performance against compact/standard ViTs, CNNs with attention, shifted window transformers, MLPs, and logistic regression, despite fewer parameters. Validation includes simulated, public, and clinical datasets, with code released on GitHub.
spectral vision transformertokenizationspatial invariancesignal-to-noise ratiomedical imaging
Efficient and Adaptive Human Activity Recognition via LLM Backbones
This paper introduces a novel approach for Human Activity Recognition (HAR) by repurposing large pretrained language models (LLMs) as generic temporal backbones, eliminating the need for task-specific Transformer models. A structured convolutional projection bridges the modality gap between inertial sensor data and LLMs, while parameter-efficient Low-Rank Adaptation (LoRA) adapts the frozen pretrained backbone. Experiments on standard HAR benchmarks demonstrate rapid convergence, strong data efficiency, and robust cross-dataset transfer, particularly in low-data and few-shot settings. The results highlight the complementary roles of convolutional frontends and LLMs in handling local invariances and capturing long-range temporal dependencies, respectively.
human activity recognitionlarge language modelslow-rank adaptationconvolutional projectiontemporal dependencies
LLMs and the ZPD
The article proposes that large language models (LLMs) engage in 'primitive thinking' through practices rather than distributed representations, aligning with Vygotsky's concept of Zones of Proximal Development (ZPD). It argues that LLMs do not hallucinate but 'dream,' suggesting a shift from guardrails to investigating cognitive tools enabling common-sense behaviors. The core claim is that interaction is fundamental to human communication, not merely supplementary to understanding. This perspective reinterprets LLM mechanisms by emphasizing the role of practices and interaction in cognitive processes.
large language modelszones of proximal developmentprimitive thinkingdistributed representationscognitive tools
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
SkillSafetyBench introduces a benchmark for evaluating safety vulnerabilities in modular skill-based LLM agents, focusing on non-user attack surfaces. The benchmark comprises 155 adversarial cases across 47 tasks, 6 risk domains, and 30 safety categories, each verified by case-specific rule-based verifiers. Experiments with CLI agents and multiple model backends demonstrate that localized attacks can consistently induce unsafe behavior, revealing distinct failure patterns across domains, attack methods, and scaffold-model pairings. Results highlight that agent safety depends on skill interpretation, workflow context trust, and executable environment interactions, beyond model-level alignment.
skill-based agentsadversarial casesrule-based verifiernon-user attacksworkflow context
L2P: Unlocking Latent Potential for Pixel Generation
The paper introduces Latent-to-Pixel (L2P), a framework for efficiently transferring knowledge from pre-trained Latent Diffusion Models (LDMs) to pixel-space generation. L2P replaces the VAE with large-patch tokenization, freezes intermediate LDM layers, and trains only shallow layers to map latent to pixel space, using solely LDM-generated synthetic data. This approach achieves rapid convergence with minimal resources (8 GPUs), enables native 4K resolution, and matches source LDM performance on DPG-Bench while reaching 93% on GenEval.
latent diffusion modelspixel-space generationlarge-patch tokenizationsynthetic training data4k resolution
LegalCheck: Retrieval- and Context-Augmented Generation for Drafting Municipal Legal Advice Letters
LegalCheck introduces a Retrieval- and Context-Augmented Generation (RAG/CAG) system for automating municipal legal advice letter drafting, addressing public-sector legal staff shortages in the Netherlands. The system integrates a large language model (LLM) with curated legal knowledge bases, retrieving relevant laws and precedents while incorporating case-specific details via controlled prompting. Deployed in the Municipality of Amsterdam, LegalCheck generated near-final letters in minutes, achieving 80-100% legal reasoning accuracy and ensuring high legal consistency. Expert-in-the-loop review maintained legal soundness, reducing workload while preserving human judgment. Results demonstrate efficiency gains, improved consistency, and positive user acceptance, showcasing responsible AI deployment in legal domains.
retrieval-augmented generationcontext-augmented generationlarge language modellegal knowledge basescontrolled prompting
CR^2: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference
The paper introduces CR^2, a cost-aware risk-controlled routing framework for wireless device-edge LLM inference. CR^2 employs a two-stage architecture with a lightweight on-device margin gate and an edge-side utility selector, optimizing latency and energy trade-offs under constrained resources. A conformal risk control (CRC) procedure calibrates thresholds for explicit false-acceptance risk management. Experiments demonstrate CR^2 matches full-information routing performance while reducing deployment costs by up to 16.9% at comparable accuracy.
llm inferencedevice-edge routingconformal risk controlcost-aware optimizationwireless edge deployment
The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
The study reveals that power capping in LLM serving is ineffective during autoregressive decode, the dominant production phase. Testing four attention architectures (GQA, MLA, Gated DeltaNet, Mamba2) on NVIDIA H200, decode consumes only 137-300W of 700W GPU capacity, leaving power headroom unused due to memory-bound operations. Firmware clock throttling further distorts measurements. Clock locking emerges as a superior alternative, recovering up to 32% decode energy with minimal throughput loss. Three DVFS behavioral classes are identified, with attention replacements showing a pattern of high prefill cost offset by efficient decode, reducing total request energy by half versus GQA at production batches.
power cappingautoregressive decodeclock lockingattention architecturesdvfs
BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts
The paper introduces BadSKP, a backdoor attack targeting knowledge graph (KG)-enhanced LLMs that use soft prompts. Unlike text-channel attacks, BadSKP exploits the graph-to-prompt interface via multi-stage optimization: constructing adversarial embeddings, optimizing poisoned nodes, and approximating them with fluent attributes. Experiments on two KG-enhanced LLMs across four datasets demonstrate high attack success in frozen and trojaned settings, while text-only attacks remain ineffective due to semantic anchoring by graph-derived prompts.
backdoor attackknowledge graphsoft promptssemantic anchoringmulti-stage optimization
A Transfer Learning Evaluation of Deep Neural Networks for Image Classification
The paper presents a systematic evaluation of transfer learning performance for image classification across multiple pre-trained models. The study compares eleven ImageNet-trained architectures (unspecified) on five target datasets, modifying output layers and network parameters. Metrics include accuracy, accuracy density (unspecified), training time, and model size, evaluated across single-episode and ten-episode training regimes. Results demonstrate trade-offs between model performance and computational resources, though specific numerical outcomes are not provided in the excerpt.
transfer learningimage classificationpre-trained modelsaccuracy densityimagenet
Random-Set Graph Neural Networks
The paper introduces Random-Set Graph Neural Networks (RS-GNNs), a novel framework for modeling node-level epistemic uncertainty in GNNs using belief functions (finite random sets). The approach incorporates a belief-function head that predicts a random set over classes, enabling both precise probability predictions and epistemic uncertainty quantification. Evaluated on 9 graph learning datasets, including Nuscene and ROAD autonomous driving benchmarks, RS-GNNs demonstrate superior uncertainty quantification performance compared to existing methods.
graph neural networksepistemic uncertaintybelief functionsrandom setsautonomous driving
On the Limitations of Large Language Models for Conceptual Database Modeling
The study evaluates the limitations of Large Language Models (LLMs) in conceptual database modeling by generating Entity-Relationship (ER) diagrams from natural language requirements. Three LLMs were tested using Zero-Shot, Chain of Thought, and Chain of Thought + Verifier prompting techniques across scenarios of increasing complexity. Results show that while LLMs perform reasonably in simpler contexts, their reliability declines with complexity, exhibiting inconsistencies, ambiguities, and constraint representation failures. The findings suggest LLMs are not yet mature for complex modeling tasks, and validation costs may outweigh productivity benefits.
large language modelsentity-relationship diagramsconceptual modelingprompt engineeringrelational databases
High-lift Wing Separation Control via Bayesian Optimization and Deep Reinforcement Learning
The study demonstrates active flow control optimization for a 30P30N high-lift wing using Bayesian optimization (BO) and deep reinforcement learning (DRL) at Re_c = 450,000 and α = 23°. Wall-resolved large-eddy simulations (LES) validated the uncontrolled configuration against literature. BO achieved a +10.9% efficiency improvement via -9.7% drag reduction while maintaining lift, whereas DRL yielded minor aerodynamic gains due to constrained exploration from penalty-dominated rewards. Results emphasize the importance of reward design and computational acceleration for DRL-based flow control at high Reynolds numbers.
active flow controlbayesian optimizationdeep reinforcement learninglarge-eddy simulationshigh reynolds numbers
Cooperative Robotics Reinforced by Collective Perception for Traffic Moderation
The paper introduces a cooperative robotics system enhanced by collective perception for traffic moderation at non-line-of-sight (NLOS) intersections. The system employs a humanoid robot that integrates dual-camera infrastructure for vehicle detection and V2X-based cooperative awareness messages (CAM) to assess collision risks. A fusion module combines these data streams to maintain real-time situational awareness, while a Zone of Danger (ZoD) predicts unsafe merges. Upon detecting a hazard, the robot issues a STOP gesture and physically blocks the merging path. Deployed at the Future Mobility Park in Rotterdam, the system demonstrated reliable hazard prediction and prevention of unsafe merges in NLOS conditions.
non-line-of-sightcollective perceptionv2xzone of dangerfusion module
Assessing and Mitigating Miscalibration in LLM-Based Social Science Measurement
The paper investigates miscalibration in LLM-based social science measurements, demonstrating its impact on downstream analyses through a Federal Open Market Committee (FOMC) case study. Auditing 14 constructs across proprietary (GPT-5-mini, DeepSeek-V3.2) and open-source models reveals poor alignment between confidence and correctness. A soft label distillation pipeline is proposed, converting LLM scores into calibrated targets for training smaller classifiers. This method reduces Expected Calibration Error (ECE) by 43.2% and Brier score by 34.0%, emphasizing calibration as essential for measurement validity.
llmcalibrationsocial sciencedistillationece
Counterfactual Trace Auditing of LLM Agent Skills
The paper introduces Counterfactual Trace Auditing (CTA), a framework for evaluating how skills affect LLM agent behavior beyond pass rates. CTA compares agent traces with and without skills, segments them into goal-directed phases, and annotates Skill Influence Patterns (SIPs). Applied to Claude on 49 software engineering tasks, CTA reveals 522 SIP instances despite a mere +0.3pp pass rate change, uncovering behavioral shifts like template copying and excess planning. Key findings include SIP concentration in high-baseline tasks, recoverable gains in moderate tasks (with higher token costs), and baseline-dependent SIP dominance.
counterfactual trace auditingskill influence patternsllm agentsbehavioral auditpass rate
From Noise to Diversity: Random Embedding Injection in LLM Reasoning
This work introduces Random Soft Prompts (RSPs), a training-free method that appends freshly sampled random embedding vectors to LLM inputs, isolating the structural effect of soft prompt injection. RSP vectors are drawn from an isotropic Gaussian fitted to the pretrained embedding table's statistics, inducing early-stage token diversity and branching reasoning trajectories. Empirical results show RSPs achieve accuracy comparable to optimized soft prompts on math reasoning benchmarks, widen Pass@N via temperature sampling, and extend benefits to DAPO training. The mechanism involves attention flattening initial token distributions, followed by natural dilution toward a single completion.
random soft promptstoken diversityisotropic gaussiantemperature samplingdapo training
When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents
The paper introduces RobustBench-TC, a benchmark with 22 perturbation types for tool-use agents, organized by four POMDP components, addressing real-world deployment failures. It evaluates 21 models (1.5B to 32B parameters), revealing uneven robustness: observation perturbations reduce accuracy by <5%, while reward-relevant and transition perturbations reduce it by ~40% and ~30%, respectively. ToolRL-DR, a domain-randomization RL recipe, trains agents on perturbation-augmented trajectories, achieving ~75% clean accuracy and narrowing the gap to baselines. It closes ~27% of the transition gap, demonstrating transfer to unseen runtime failures.
pomdpdomain-randomizationtool-use agentsrobustbench-tctoolrl-dr
Domain Restriction via Multi SAE Layer Transitions
The paper introduces a method for detecting out-of-domain (OOD) inputs in Large Language Models (LLMs) by analyzing internal layer transitions via sparse autoencoders (SAEs). It proposes lightweight techniques to learn domain-specific signatures from these transitions, offering interpretability into the LLM's decision process. Evaluated on Gemma-2B and Gemma-9B, the approach demonstrates strong OOD detection capabilities while revealing fine-grained input processing details.
large language modelsout-of-domain detectionsparse autoencoderlayer transitionsinterpretability
Rethinking Positional Encoding for Neural Vehicle Routing
The paper introduces a hierarchical anisometric positional encoding (PE) tailored for transformer-based neural combinatorial optimization (NCO) of vehicle routing problems (VRPs). The proposed PE addresses three structural properties of routing solutions: anisometric node distances, cyclic and direction-aware topology, and hierarchical depot-anchored global multi-route structure, grounded in geometric principles. It combines distance-indexed, circularly consistent in-route encoding with depot-anchored angular cross-route encoding. Extensive experiments across diverse VRP variants show that geometry-grounded PE consistently outperforms index-based alternatives, with gains generalizing across problem variants, model architectures, and distribution shifts.
positional encodingneural combinatorial optimizationvehicle routing problemsanisometric distancesgeometry-grounded
Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving
The paper introduces segment-level supervision, a novel training strategy for LLM-based theorem proving that extracts locally coherent proof segments to balance the granularity of supervision. This approach addresses limitations of step-level tactic prediction and whole-proof generation by preserving both local coherence and global structure. Evaluated on STP, LeanWorkbook, and NuminaMath-LEAN, the method achieves proof success rates of 64.84%, 60.90%, and 66.31% on miniF2F, outperforming baselines. Goal-aware rollout further enhances existing step-level provers, increasing BFS-Prover-V2-7B's success rate from 68.77% to 70.74% and InternLM2.5-StepProver's from 59.59% to 60.33%.
segment-level supervisiontheorem provingproof trajectoriesgoal-aware rolloutlean 4
Beyond Point-wise Neural Collapse: A Topology-Aware Hierarchical Classifier for Class-Incremental Learning
We propose Hierarchical-Cluster SOINN (HC-SOINN), a topology-aware hierarchical classifier for Class-Incremental Learning (CIL) that addresses the limitations of Nearest Class Mean (NCM) by capturing complex class manifolds. HC-SOINN employs a 'local-to-global' representation and integrates Structure-Topology Alignment via Residuals (STAR) to adapt to non-linear feature drift through fine-grained pointwise trajectory tracking. Theoretical analysis and Procrustes distance experiments demonstrate resilience to manifold deformations. When integrated into seven state-of-the-art CIL methods, HC-SOINN consistently improves performance, validating its robustness and effectiveness.
class-incremental learningnearest class meanhierarchical-cluster soinnstructure-topology alignmentprocrustes distance
AccLock: Unlocking Identity with Heartbeat Using In-Ear Accelerometers
AccLock introduces a passive earphone-based authentication system leveraging in-ear ballistocardiogram (BCG) signals for secure and unobtrusive user verification. The system employs a two-stage denoising scheme to mitigate inherent and sporadic interference, a disentanglement-based deep learning model (HIDNet) to isolate user-specific features from shared nuisance components, and a scalable Siamese network framework eliminating per-user classifier training. Extensive experiments with 33 participants demonstrate AccLock's efficacy, achieving an average false acceptance rate (FAR) of 3.13% and false rejection rate (FRR) of 2.99%, validating its practical feasibility.
ballistocardiogramsiamese networkdisentanglementdenoisingauthentication
Toward Modeling Player-Specific Chess Behaviors
A novel architecture is proposed to model player-specific chess behaviors by adapting the unified Maia-2 model with champion-specific embeddings and integrating a limited Monte Carlo Tree Search (MCTS) process for tactical exploration. The approach introduces a behavioral metric based on Jensen-Shannon divergence, compressing high-dimensional board representations into a latent space using AutoEncoder and Uniform Manifold Approximation and Projection (UMAP) for move distribution comparison. Evaluation across 16 historical world champions shows that while MCTS decreases standard move accuracy, it improves stylistic alignment, reducing average Jensen-Shannon divergence. The metric effectively discriminates between individual players, advancing behavioral alignment evaluation between players and AI models.
monte carlo tree searchjensen-shannon divergenceautoencoderuniform manifold approximation and projectionmove accuracy
Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems
Proteus introduces a self-evolving red-team framework to measure adaptive leakage risk in agent skill ecosystems, where attackers iteratively revise skills to bypass audits and cause runtime harm. The framework explores a five-axis skill-attack space using an audit-sandbox-oracle pipeline, enabling cross-round mutation, path expansion for alternative attack implementations, and surface expansion for transferring attack patterns to new objectives. Proteus achieves 40-90% Attack Success Rate at 5 rounds (ASR@5) across eight phase-1 cells, with phase-2 expansion producing 438 bypassing and lethal variants. SkillVetter is bypassed ≥93% in every cell, while AI-Infra-Guard admits up to 41.3% joint-success, demonstrating significant underestimation of residual risk in current skill vetting.
adaptive leakageaudit-sandbox-oracle pipelinepath expansionsurface expansionattack success rate
Incentivizing Truthfulness and Collaborative Fairness in Bayesian Learning
The paper introduces a novel mechanism ensuring collaborative fairness (F) and incentivizing data truthfulness (T) in Bayesian collaborative learning. The approach combines semivalues (e.g., Shapley value) for fairness with a truthful data valuation function (DVF) based on an undisclosed validation set. A key condition ensures sources maximize expected data values by submitting truthful datasets. Theoretical analysis explores relaxations of (F) and (T) under budget constraints or absence of validation sets. Empirical validation on synthetic and real-world datasets confirms the mechanism's effectiveness.
bayesian learningsemivaluesdata valuation functioncollaborative fairnesstruthfulness
From Clever Hans to Scientific Discovery: Interpreting EEG Foundational Transformers with LRP
This study evaluates Layer-wise Relevance Propagation (LRP) as a post-hoc attribution method for interpreting Transformer-based EEG foundation models (EEG-FMs). The authors extend LRP from CNNs to Transformer architectures, demonstrating its utility in verifying model decisions and uncovering biologically plausible hypotheses. Key findings include detecting 'Clever Hans' behaviors in motor imagery tasks (where models rely on ocular artifacts) and identifying a central electrode cluster as a potential sensorimotor arousal signature in affect prediction. The work positions LRP as a critical tool for both validation and discovery in EEG-FMs as they scale.
eegtransformerlrpattributionfoundation models
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
The paper introduces FATE, an on-policy self-evolution framework for improving LLM agent safety by leveraging failure trajectories as repair supervision. FATE employs verifier-scored failures to generate repair candidates, filtered across security, utility, over-refusal control, and trajectory validity, and uses Pareto-Front Policy Optimization (PFPO) to balance safety-utility trade-offs. Evaluations on AgentDojo, AgentHarm, and ATBench demonstrate FATE's effectiveness: it reduces attack success rate by 33.5%, harmful compliance by 82.6%, and improves trajectory-safety diagnosis by 6.5% compared to baselines.
on-policy learningfailure trajectoriespareto-front optimizationsafety alignmentllm agents
Modulation Consistency-based Contrastive Learning for Self-Supervised Automatic Modulation Classification
The paper introduces Mod-CL, a self-supervised contrastive learning framework for Automatic Modulation Classification (AMC) that leverages intra-instance modulation consistency as a structural prior. By constructing positive pairs from temporal segments of the same signal, Mod-CL learns modulation-invariant representations while suppressing nuisance variations like noise and channel effects. The method includes a tailored contrastive objective combining temporal segmentation and data augmentation. Experiments on RadioML datasets demonstrate Mod-CL's superiority over baselines, particularly in low-label regimes, with significant gains in linear probing accuracy.
modulation classificationcontrastive learningself-supervised learningtemporal segmentationradioml
IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection
IPI-proxy introduces an intercepting proxy for red-teaming web-browsing AI agents against indirect prompt injection (IPI). The tool dynamically rewrites HTTP responses from whitelisted domains, embedding 820 deduplicated attack strings from six benchmarks into HTML via configurable techniques (e.g., hidden CSS, LLM-generated prose). A YAML-driven harness parameterizes payloads, embedding methods, and insertion points (6 locations), enabling systematic evaluation without mock environments. The proxy logs exfiltration attempts, providing a reproducible substrate for hardening agents against real-world IPI threats on live retrieval surfaces.
indirect prompt injectionweb-browsing agentsintercepting proxyred-teamingyaml-driven harness
Very Efficient Listwise Multimodal Reranking for Long Documents
ZipRerank introduces a highly efficient listwise multimodal reranker addressing computational bottlenecks in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) for long documents. It reduces input length via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. The model employs a two-stage training strategy: listwise pretraining on large-scale text data rendered as images, followed by multimodal finetuning with VLM-teacher-distilled soft-ranking supervision. Experiments on the MMDocIR benchmark demonstrate that ZipRerank matches or surpasses state-of-the-art rerankers while reducing LLM inference latency by up to an order of magnitude.
multimodal rerankerautoregressive decodinglistwise pretrainingsoft-ranking supervisionquery-image interaction
EvoNav: Evolutionary Reward Function Design for Robot Navigation with Large Language Models
EvoNav introduces an evolutionary framework for automating robot navigation reward function design using large language models (LLMs), addressing the limitations of hand-crafted rewards in reinforcement learning (RL). The method employs a progressive three-stage warm-up-boost procedure, transitioning from low-cost analytical proxies to lightweight rollouts and full policy training, enabling computationally efficient exploration with effective feedback. Experimental results demonstrate that EvoNav generates navigation policies superior to manually designed RL rewards and state-of-the-art reward design methods.
reinforcement learningrobot navigationreward functionlarge language modelsevolutionary framework
Improving the Performance and Learning Stability of Parallelizable RNNs Designed for Ultra-Low Power Applications
The paper introduces the Cumulative Memory Recurrent Unit (CMRU) and its relaxed variant αCMRU, addressing gradient blocking in the Bistable Memory Recurrent Unit (BMRU) for ultra-low power RNNs. The proposed cumulative update formulation restores gradient flow through skip-connections while preserving persistent memory and quantized states. Experiments demonstrate improved convergence stability and reduced initialization sensitivity, with CMRU/αCMRU matching or outperforming Linear Recurrent Units (LRUs) and minimal Gated Recurrent Units (minGRUs) on diverse benchmarks, particularly in long-range retention tasks, while maintaining analog implementation benefits.
recurrent neural networksgradient blockingquantized statespersistent memoryanalog implementation
GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
GEAR introduces Granularity-adaptivE Advantage Reweighting, a framework for adaptive-granularity credit assignment in LLM agents via self-distillation. It reshapes trajectory-level GRPO advantage using token- and segment-level signals derived from comparing an on-policy student with a ground-truth-conditioned teacher. Divergence spikes identify semantic deviations, forming adaptive credit regions: aligned tokens preserve token-level resolution, while divergent continuations group into segments with modulated advantage. Experiments on eight benchmarks with Qwen3 4B and 8B models show GEAR outperforms GRPO, self-distillation-only baselines, and token/turn-level methods, especially in challenging long-horizon settings, with gains up to 20% over GRPO.
credit assignmentself-distillationadaptive granularitytrajectory-level advantagesemantic deviation
Martingale-Consistent Self-Supervised Learning
The paper introduces a martingale-consistent self-supervised learning (SSL) framework addressing prediction coherence under partial observation. By formalizing coherence through martingale constraints, the method ensures refined predictions match coarse-view expectations without systematic drift. The approach includes prediction- and latent-space variants with a two-sample Monte Carlo estimator for stochastic refinement. Evaluations on synthetic and real datasets (time-series, tabular, image) demonstrate improved robustness and calibration in partial-observation regimes, outperforming standard SSL in semi-supervised and label-free settings.
martingaleself-supervised learningpartial observationmonte carlo estimatorcoherence
Minimax Rates and Spectral Distillation for Tree Ensembles
The paper establishes minimax-optimal convergence rates for random forest (RF) regression, demonstrating that eigenvalue decay of the induced kernel operator determines statistical rates under standard tree growth conditions. It proposes spectral distillation techniques for tree ensembles: RFs use kernel operator eigenfunctions, while gradient boosting machines (GBMs) employ smoother matrix singular vectors to compress models. These nonlinear spectral representations yield order-of-magnitude smaller distilled models maintaining competitive accuracy, outperforming state-of-the-art pruning and rule extraction methods in resource-constrained settings.
random forestsgradient boostingspectral distillationminimax rateskernel operator
Trade-offs in Decentralized Agentic AI Discovery Across the Compute Continuum
The paper analyzes trade-offs in decentralized discovery mechanisms for agentic AI systems across cloud-edge environments, comparing Chord, Pastry, and Kademlia as structured overlay networks. Using a shared control-plane framework, the study evaluates these overlays through stationary and churn benchmarks on 4096-node networks, measuring discovery reliability, startup behavior, and control-plane overhead. Results characterize the operating points exposed by each overlay for agent discovery in edge-to-cloud deployments, providing insights into their suitability for intermittently connected domains.
decentralized discoverystructured overlayagentic aicontrol-planecloud-edge continuum
Multi-Timescale Conductance Spiking Networks: A Sparse, Gradient-Trainable Framework with Rich Firing Dynamics for Enhanced Temporal Processing
The authors introduce multi-timescale conductance spiking networks, a gradient-trainable SNN framework that combines rich firing dynamics with activity sparsity by parametrizing fast, slow, and ultra-slow conductances to shape current-voltage curves. The method employs a discrete-time formulation enabling direct backpropagation through time without surrogate gradients, supporting diverse regimes (tonic, phasic, bursting) within a single model. Evaluated on Mackey-Glass time-series regression, the networks outperform LIF and AdLIF baselines while achieving 2-3× sparser activity, demonstrating advantages for energy-aware temporal processing and neuromorphic implementation.
spiking neural networksconductance dynamicsgradient-based trainingtemporal processingneuromorphic computing
REFNet++: Multi-Task Efficient Fusion of Camera and Radar Sensor Data in Bird's-Eye Polar View
REFNet++ introduces a computationally efficient multimodal fusion framework for camera and radar data in autonomous driving. The method employs dual encoder-decoder architectures: a variational network transforms front-view camera images into Bird's-Eye View (BEV) polar coordinates, while a radar network converts range-Doppler spectra into range-azimuth features for domain alignment. Evaluated on the RADIal dataset, the approach demonstrates state-of-the-art performance in vehicle detection and free space segmentation tasks by leveraging complementary sensor strengths while maintaining computational efficiency.
sensor fusionbird's-eye viewrange-doppler spectrumvariational encoder-decoderautonomous driving
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
The paper introduces MedMemoryBench, a benchmark for evaluating memory mechanisms in personalized healthcare agents, addressing the gap in existing benchmarks that focus on open-domain conversations. Using a human-agent collaborative pipeline, the authors synthesize clinically grounded, long-horizon medical trajectories, resulting in a dataset of 2,000 sessions and 16,000 interaction turns. The benchmark employs a streaming assessment protocol to mirror dynamic memory accumulation and investigates memory saturation. Results reveal significant bottlenecks in mainstream architectures, particularly in medical reasoning and noise resilience, highlighting the need for robust production-ready agents.
personalized healthcarememory mechanismsstreaming assessmentmemory saturationmedical reasoning
Automated Reformulation of Robust Optimization via Memory-Augmented Large Language Models
The authors introduce Automated Reformulation with Experience Memory (AutoREM), a memory-augmented framework for automating robust optimization (RO) reformulation without domain expertise or parameter updates. AutoREM builds structured textual memory by reflecting on past failed trajectories through offline adaptation, enabling transfer across diverse large language models (LLMs). They also develop AutoRO-Bench, a benchmark for evaluating LLM-based RO reformulation, featuring automated data generation and a curated dataset. Experiments demonstrate AutoREM's consistent improvement in accuracy and efficiency across in-distribution, out-of-distribution datasets, and various base LLMs.
robust optimizationlarge language modelsmemory-augmented frameworkoffline adaptationautomated reformulation
Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models
The paper introduces MCF-Proto, a lightweight action head for Vision-Language-Action (VLA) models that replaces fixed world-frame action prediction with Motion-Centric Action Frames (MCF) and prototype-based parameterization. The method predicts SO(3) rotations to transform actions into local frames, composes them from learned prototypes, and maps back to world coordinates—requiring only standard demonstrations. Results show emergent geometric structure in local frames, compact action representations with dominant directions, and improved robustness to geometric perturbations, demonstrating the benefits of structured action heads for robotic manipulation.
vision-language-action modelsmotion-centric action framesso(3) rotationprototype-based parameterizationrobotic manipulation
Why Users Go There: World Knowledge-Augmented Generative Next POI Recommendation
The paper introduces AWARE, a generative POI recommendation system that augments LLMs with dynamic world knowledge through agent-generated contextual narratives. The method employs an LLM agent to produce location- and time-aware narratives capturing cultural traits, seasonal trends, and real-world events, while grounding them in user-specific spatial-temporal patterns. Evaluations on three real-world datasets show AWARE achieves up to 12.4% relative improvement over baselines by effectively integrating evolving external knowledge.
point-of-interest recommendationlarge language modelsagent-based narrativesspatial-temporal patternsworld knowledge augmentation
OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models
OTT-Vid introduces an optimal transport-based framework for temporal token compression in Video Large Language Models (Video-LLMs), addressing the inference cost bottleneck caused by accumulating visual tokens across frames. The method employs a two-stage process: spatial pruning identifies representative content within frames, followed by optimal transport (OT) with non-uniform token mass and locality-aware cost to estimate temporal compressibility. This approach dynamically allocates compression budgets based on transport difficulty, balancing token importance and matching cost. Evaluations on six benchmarks demonstrate that OTT-Vid retains 95.8% of VQA and 73.9% of VTG performance while preserving only 10% of tokens, outperforming existing training-free compression methods.
optimal transporttemporal token compressionvideo large language modelsspatial pruninglocality-aware cost
Beyond Inefficiency: Systemic Costs of Incivility in Multi-Agent Monte Carlo Simulations
This study quantifies the systemic costs of incivility in multi-agent debates using LLM-based simulations, addressing limitations of human subject research. A Monte Carlo framework generates thousands of 1-on-1 adversarial debates across toxicity conditions, measuring convergence time as an efficiency metric. Experiments extend prior findings across LLM agents of varying parameter sizes, confirming 25% convergence latency and showing increased latency in smaller models. Results reveal a significant first-mover advantage, where initiating agents win above chance regardless of toxicity. The method enables systematic manipulation of communicative behavior at scale.
monte carlo simulationmulti-agent systemsconvergence latencytoxicity conditionsfirst-mover advantage
Crash Assessment via Mesh-Based Graph Neural Networks and Physics-Aware Attention
The work proposes hybrid neural surrogate models (MeshTransolver, MeshGeoTransolver, MeshGeoFLARE) for predicting full-field structural deformations in vehicle crash simulations, addressing computational bottlenecks in design exploration. The architectures combine mesh-based graph neural networks, geometry-aware global attention, and sparse contact-aware correction for autoregressive rollout, capturing both local interactions and long-range deformation patterns. On a 25-sample test set, the best hybrid model achieves 3.20 mm mean RMSE, with qualitative analysis showing superior physical interpretability over pure attention baselines despite comparable quantitative performance.
mesh-based gnnsphysics-aware attentioncrash simulationsurrogate modelingstructural deformation
Is Monotonic Sampling Necessary in Diffusion Models?
This study challenges the necessity of monotonic sampling schedules in diffusion models by testing four nonmonotonic schedule families across DDPM, EDM, and Flow Matching architectures on CIFAR-10. Results from 90 configurations show no performance improvement over monotonic baselines, with penalty magnitudes varying by architecture: significant in DDPM, moderate in Flow Matching, and negligible in EDM. The Schedule Sensitivity Coefficient is introduced as a diagnostic tool for denoiser quality, validating conventional monotonic approaches and offering a new metric complementary to sample-quality benchmarks.
diffusion modelsnonmonotonic schedulesschedule sensitivity coefficientdenoiser qualitycifar-10
Behavioral Integrity Verification for AI Agent Skills
The paper introduces Behavioral Integrity Verification (BIV), a framework for verifying AI agent skills by comparing declared versus actual capabilities using a shared taxonomy. BIV combines deterministic code analysis with LLM-assisted capability extraction to detect deviations, classify root causes, and identify malicious skills. Evaluation on 49,943 OpenClaw skills reveals 80.0% deviate from declared behavior, with 81.1% due to developer oversight and 18.9% to adversarial intent; BIV achieves 0.946 F1 on malicious-skill detection, outperforming baselines.
behavioral integrity verificationai agent skillscapability extractiondeviation taxonomymalicious-skill detection
Focusable Monocular Depth Estimation
Focusable Monocular Depth Estimation (FDE) introduces a region-aware depth estimation task prioritizing user-specified target regions while maintaining global scene geometry. The proposed FocusDepth framework employs Multi-Scale Spatial-Aligned Fusion (MSSA) to spatially align multi-scale features from Segment Anything Model 3 with Depth Anything models, enabling prompt-conditioned depth estimation via box/text cues. FDE-Bench, a benchmark with 252.9K/72.5K train/val image-target-depth triplets across 972 categories, evaluates the approach. FocusDepth outperforms globally fine-tuned DA2/DA3 baselines, particularly in target boundary and foreground regions, with MSSA's spatial alignment reducing AbsRel errors by up to 13.8%.
monocular depth estimationmulti-scale fusionprompt-conditionedspatial alignmenttarget-centric
Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention
We propose SPeCTrA-Sum, a unified framework for multimodal summarization that jointly performs text summarization and representative image selection. The system introduces two innovations: a Deep Visual Processor (DVP) enabling hierarchical, layer-wise fusion between visual and language encoders, and a Visual Relevance Predictor (VRP) selecting salient images via Determinantal Point Processes distillation. Training employs a multi-objective loss combining autoregressive summarization, cross-modal alignment, and DPP-based distillation. Experiments demonstrate SPeCTrA-Sum generates more accurate, visually grounded summaries and selects more representative images compared to existing methods, highlighting the benefits of depth-aware fusion and principled image selection.
multimodal summarizationcross-modal transformerdeterminantal point processesvisual relevance predictordepth-aware fusion
DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies
DreamAvoid introduces a critical-phase test-time dreaming framework to enhance Vision-Language-Action (VLA) models' ability to anticipate and avoid failures in fine-grained manipulation tasks. The method employs a Dream Trigger to identify critical phases, an Action Proposer to sample candidate actions, and a Dream Evaluator trained on mixed success, failure, and boundary cases to predict and select optimal actions. Extensive evaluations on real-world manipulation tasks and simulation benchmarks demonstrate that DreamAvoid significantly improves task success rates by effectively avoiding failures.
vision-language-actioncritical-phasedream evaluatoraction proposerfailure avoidance
When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel
This study challenges the assumption that chain-of-thought (CoT) reasoning traces reliably reflect model computation timing, introducing a step-level Detect-Classify-Compare framework validated via Patchscopes, tuned-lens probes, and causal direction ablation. Analyzing nine models across seven reasoning benchmarks, latent answer commitment and explicit trace alignment occur in only 61.9% of steps, with 58.0% mismatches attributed to confabulated continuation after answer stabilization. Architecture-matched comparisons reveal CoT utility increases with lower step-level alignment, suggesting CoT's usefulness despite temporal unreliability. Truncation and donor-corruption tests confirm post-commitment text often lacks functional relevance to final answers.
chain-of-thoughtpatchscopesconfabulated continuationtuned-lens probesanswer-commitment proxy
OptArgus: A Multi-Agent System to Detect Hallucinations in LLM-based Optimization Modeling
OptArgus introduces a multi-agent system for detecting hallucinations in LLM-based optimization modeling, addressing structural inconsistencies across problem descriptions, symbolic models, and solver implementations. The method employs a fine-grained hallucination taxonomy spanning objective, variable, constraint, and implementation failures, alongside conductor routing, specialist auditors, and evidence consolidation. Evaluated on a benchmark suite of 484 clean artifacts, 1266 controlled injected artifacts, and 6292 natural LLM-generated artifacts, OptArgus outperforms a single-agent baseline in false alarm reduction, localization accuracy, and detection strength. This work establishes optimization-modeling hallucination detection as a concrete empirical problem and demonstrates the efficacy of modular, taxonomy-grounded auditing.
optimization modelinghallucination detectionmulti-agent systemsymbolic modelevidence consolidation
Allegory of the Cave: Measurement-Grounded Vision-Language Learning
We introduce measurement-grounded vision-language learning to address information loss in RGB rendering, proposing PRISM-VL as an instantiation. PRISM-VL combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation to transfer supervision from RGB proxies to measurement-domain observations. Evaluated on a 150K instruction-tuning set and a held-out benchmark targeting challenging visual conditions, PRISM-VL-8B achieves 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66% LLM-Judge accuracy, outperforming the RGB-based Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points. Results demonstrate that preserving measurement-domain evidence enhances multimodal reasoning.
measurement-groundedprism-vlexposure-bracketed supervisionmeas.-xyzvision-language models
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
The paper introduces Concentrate and Concentrate (CaC), a hierarchical spatiotemporal anomaly reward model leveraging Vision-Language Models for video anomaly detection. CaC employs a coarse-to-fine approach: global temporal scanning identifies anomalous time windows, followed by fine-grained spatial grounding within localized intervals, and structured spatiotemporal Chain-of-Thought reasoning for robust judgments. The model is trained on a novel large-scale video anomaly dataset with per-frame annotations, using a three-stage progressive training paradigm involving supervised fine-tuning and Group Relative Policy Optimization (GRPO). CaC achieves a 25.7% accuracy improvement on fine-grained anomaly benchmarks and reduces generated-video anomalies by 11.7% while enhancing video quality.
vision-language modelsspatiotemporal reasoninggroup relative policy optimizationchain-of-thoughtvideo anomaly detection
A Research Agenda on Agents and Software Engineering: Outcomes from the Rio A2SE Seminar
The A2SE seminar in Rio de Janeiro established a research agenda addressing the dual impact of agentic AI on software engineering: agents as tools for software engineering tasks and agents as complex systems requiring novel engineering practices. Eighteen experts from academia and industry participated in structured presentations, collaborative topic clustering, and group discussions to identify six thematic areas: Governance, Software Engineering for Agents, Agents for Software Architecture, Quality and Evaluation, Sustainability, and Code. The seminar prioritized short-term and long-term research directions for each area, providing a structured foundation for coordinated community efforts in this evolving field.
agentic aisoftware engineeringgovernancequality evaluationsustainability
Self-organized MT Direction Maps Emerge from Spatiotemporal Contrastive Optimization
We demonstrate that self-organized direction-selective maps in the middle temporal (MT) area emerge from spatiotemporal contrastive optimization, unifying the computational origins of the ventral and dorsal streams. A 3D ResNet was trained on naturalistic videos using Momentum Contrast (MoCo) self-supervised learning combined with a biologically inspired spatial loss function. The model spontaneously developed brain-like direction maps and topological pinwheel structures, with MT tuning properties quantitatively matching macaque physiological baselines in direction selectivity index, circular variance, and pinwheel density. These results establish a general mechanism for cortical self-organization driven by optimization trade-offs between discriminative pressure and spatial regularization.
middle temporalmomentum contrastspatiotemporal optimizationdirection selectivitytopographic deep artificial neural network
SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models
The paper introduces SafeSteer, a decoding-level defense mechanism for multimodal large language models (MLLMs) to mitigate jailbreak attacks without costly fine-tuning. It leverages a Decoding-Probe to detect and correct harmful outputs during decoding and employs modal semantic alignment to extend textual safety to vision. Experiments show SafeSteer improves safety by up to 33.40% while maintaining model performance, balancing helpfulness and harmlessness.
multimodaljailbreakdecoding-probesemantic alignmentsafety
Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance
The Stable Value Guidance Transformer (SVGT) introduces an independent value module for stable alignment of large language models with human values. SVGT employs two key designs: independent value modeling maintains normative representations in a dedicated value space isolated from the backbone, while explicit behavioral guidance transduces these stable signals into learnable latent Bridge Tokens that dynamically steer the generative trajectory. Experiments across multiple backbones and safety benchmarks demonstrate SVGT reduces harmful scores by over 70% while preserving generation fluency, validating its efficacy in architecturally grounded value modeling.
stable value guidance transformerindependent value modelingbridge tokensgenerative trajectoryvalue alignment
Debiased Model-based Representations for Sample-efficient Continuous Control
The paper introduces DR.Q, a debiased model-based representation method for continuous control that addresses biases in existing approaches. The method maximizes mutual information between current state-action pairs and next states while minimizing deviations, using faded prioritized experience replay. Evaluated on continuous control benchmarks with fixed hyperparameters, DR.Q matches or exceeds recent baselines, sometimes by significant margins.
model-based representationscontinuous controlmutual informationprioritized experience replayactor-critic learning
WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting
The paper introduces WildRelight, the first real-world benchmark dataset for single-image relighting, addressing the synthetic-to-real domain gap in current methods. The dataset features high-resolution outdoor scenes with temporally aligned natural illuminations and paired HDR environment maps. A physics-guided inference framework combining Diffusion Posterior Sampling (DPS) and test-time adaptation (TTA) is proposed, demonstrating that synthetic models can adapt to real-world statistics through self-supervised learning on this temporal data.
single-image relightingdomain adaptationdiffusion posterior samplingtest-time adaptationhdr environment maps
Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning
The study demonstrates that heterogeneous visual agents can develop shared symbolic communication through decentralized learning, despite private perceptual representations. Authors introduce the Metropolis-Hastings Captioning Game (MHCG), where agents exchange discrete token sequences and update models based on local visual evidence, without a shared communicative objective. Experiments on MS-COCO reveal that MHCG generates visually informative token sequences outperforming no-communication baselines in cross-agent alignment, visual-feature prediction, and image-text retrieval. Performance declines with increasing encoder mismatch, with moderate heterogeneity reducing sequence count while preserving specificity, and strong heterogeneity yielding fewer, coarser, and asymmetric sequences. Listener-side MH acceptance proves crucial for avoiding degenerate token formation.
emergent communicationdecentralized learningmetropolis-hastingsvisual encoderstoken sequences
Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity
We introduce MM-Eval, a unified evaluation framework for Multimodal Summarization with Multimodal Output (MSMO) that integrates textual quality, cross-modal alignment, and visual diversity assessments. The framework employs OpenFActScore for factual consistency, G-Eval for coherence, an MLLM-as-a-judge approach for image-text relevance, and Truncated CLIP Entropy for image-set diversity. A learned aggregation model, calibrated on the mLLM-EVAL news benchmark, aligns component contributions with human preferences. Results indicate a text-dominant hierarchy where factual consistency critically determines overall quality, while visual relevance and diversity provide complementary signals. MM-Eval outperforms heuristic aggregation baselines and offers an interpretable, reference-weak framework for multimodal summary evaluation.
multimodal summarizationfactual consistencycross-modal alignmenttruncated clip entropymllm-as-a-judge
Shaping Zero-Shot Coordination via State Blocking
The paper introduces State-Blocked Coordination (SBC), a framework enhancing zero-shot coordination (ZSC) by generating diverse virtual environments via state blocking. This method exposes agents to varied suboptimal partner policies without direct environment modification, improving generalization to unseen partners. Evaluations across benchmarks show SBC outperforms existing approaches in ZSC, including robust performance with human partners.
zero-shot coordinationstate blockingmulti-agent systemsgeneralizationhuman-ai collaboration
Persistent and Conversational Multi-Method Explainability for Trustworthy Financial AI
The paper introduces a human-centered explainable AI architecture for financial sentiment analysis, combining persistent XAI artifacts, multi-method explanation triangulation, and faithfulness evaluation. XAI artifacts, including LIME feature attributions, occlusion-based word importance scores, and saliency heatmaps, are stored persistently in S3-compatible storage with structured metadata and natural-language summaries, enabling semantic retrieval and index reconstruction. A retrieval-augmented generation (RAG) assistant synthesizes explanations from multiple XAI methods, allowing conversational robustness assessment. Automated checks evaluate explanation faithfulness, focusing on grounding completeness, hallucinated claims, and method-attribution behavior. Evaluations on an EXTRA-BRAIN pipeline with FinBERT show constrained prompting reduces hallucination by 36% and increases method-attribution citations by 73% compared to naive prompting.
explainable airetrieval-augmented generationsentiment analysisfeature attributionhallucination rate
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
We propose MORA (Multi-Objective Reward Assimilation), a novel method addressing the zero-sum conflict in multi-objective alignment of large language models by expanding reward diversity through prompt rewriting. MORA isolates single-reward prompts via pre-sampling and rewrites them to incorporate multi-dimensional intents, breaking the Pareto frontier limitation. Experiments show MORA achieves single-preference improvements of 5%-12.4% in sequential alignment, with significant gains in harmlessness, and an average overall reward improvement of 4.6% in simultaneous alignment across helpfulness, harmlessness, and truthfulness dimensions.
multi-objective alignmentpareto frontierprompt rewritingreward diversitysequential alignment
OOM-Free Alpamayo via CPU-GPU Memory Swapping for Vision-Language-Action Models
A framework enabling memory-efficient inference for Vision-Language-Action (VLA) models on VRAM-constrained GPUs is introduced, achieving up to 3.55x speedup over Accelerate offloading while maintaining full BF16 precision. The method employs Sequential Demand Layering to reduce VRAM usage to layer-level granularity, Pipelined Demand Layering to overlap parameter transfer with computation, and a GPU-Resident Layer Decision Policy informed by per-module residency benefit analysis to eliminate residual transfer overhead. A performance prediction model determines optimal configurations with less than 1.3% error. Evaluated on NVIDIA's Alpamayo-R1-10B (21.52GB) on an RTX 5070Ti (16GB), the framework avoids out-of-memory errors without model modification.
vision-language-actionvram-constrainedsequential demand layeringpipelined demand layeringgpu-resident layer
A CAP-like Trilemma for Large Language Models: Correctness, Non-bias, and Utility under Semantic Underdetermination
This paper proposes a CAP-like trilemma for Large Language Models (LLMs), asserting that under semantic underdetermination, LLMs cannot simultaneously guarantee strong correctness, strict non-bias, and high utility. Semantic underdetermination occurs when premises do not determine a unique answer, requiring the model to introduce selection criteria or preferences. The authors formalize this trilemma, develop illustrative examples, and argue that certain LLM failures stem from the inherent structure of underdetermined decision requests rather than model limitations alone.
semantic underdeterminationcap theoremlarge language modelsselection criteriondecision requests
Cochise: A Reference Harness for Autonomous Penetration Testing
The authors present Cochise, a minimal 597-line Python reference harness for evaluating LLM-driven autonomous penetration testing systems. The framework implements a Planner-Executor architecture with ReAct-style execution over SSH, maintaining long-term state externally while adapting prompts to target environments. Evaluated on the Game of Active Directory (GOAD) testbed, Cochise includes replay tools (cochise-replay), analysis utilities (cochise-analyze-logs/graphs), and a corpus of JSON trajectory logs to enable reproducible research without provisioning the resource-intensive 48-64GB RAM testbed.
autonomous penetration testingplanner-executor architecturereact-style executionssh command executionreference harness
Evolutionary Task Discovery: Advancing Reasoning Frontiers via Skill Composition and Complexity Scaling
The paper introduces Evolutionary Task Discovery (EvoTD), a framework for advancing LLM reasoning through structured evolutionary operators: Crossover for skill composition and Parametric Mutation for complexity scaling. EvoTD employs a Zone of Proximal Development filter to maintain task learnability. Empirical results show consistent reasoning improvements across diverse model architectures, pretraining regimes, and scales, validating the efficacy of evolutionary curricula. The method addresses limitations of unstructured data synthesis by navigating a dual-axis manifold of Algorithmic Skills and Complexity Attributes.
evolutionary task discoveryskill compositioncomplexity scalingzone of proximal developmentreasoning frontier
Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning
The paper revives in-domain fine-tuning methods for source-free cross-domain few-shot learning (CDFSL) by analyzing and rectifying attention collapse in CLIP models. Through establishing baselines, the authors find adapter-based methods (e.g., LoRA) outperform prompt-based ones (e.g., MaPLe) in CDFSL, attributing this to LoRA's ability to correct visual CLS token attention and enhance modality alignment. They propose Semantic Probe, a plug-and-play framework that rectifies attention by leveraging textual EOS tokens and improves both adapter- and prompt-based methods. Experiments on four CDFSL benchmarks demonstrate state-of-the-art performance, validating the approach.
cross-domain few-shot learningcliploramodality alignmentattention rectification
Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery
SkyPart introduces a lightweight swappable head for patch-based vision transformers (ViTs) to address limitations in cross-view geo-localization (CVGL). The method employs learnable prototypes for patch token assignment, altitude-conditioned linear modulation during training, graph-attention readout over prototypes, and a Kendall uncertainty-weighted multi-objective loss. With 26.95M parameters and 22.14 GFLOPs, SkyPart achieves state-of-the-art performance on SUES-200, University-1652, and DenseUAV benchmarks under a single-pass, no-re-ranking, no-TTA protocol. It demonstrates superior robustness under the WeatherPrompt corruption benchmark compared to existing baselines.
cross-view geo-localizationpatch-based vision transformerslearnable prototypesgraph-attention readoutkendall uncertainty-weighted loss
Every Bit, Everywhere, All at Once: A Binomial Multibit LLM Watermark
The paper introduces a binomial multibit watermarking scheme for LLMs that encodes every bit of a payload at every token position, coupled with a stateful encoder to dynamically balance encoding pressure. This approach outperforms 8 baselines on 64-bit payloads, showing superior message accuracy and robustness, particularly for large payloads and low-distortion regimes. The authors also critique prior evaluation metrics and propose per-bit confidence scoring as a more practical alternative.
multibit watermarkingbinomial encodingstateful encoderper-bit confidencelow-distortion regimes
Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
We propose a novel think-answer distillation framework for visual-language models (VLMs) that enhances visual-anchored reasoning by masking salient reasoning prefixes during training. Our method employs token-wise salient reasoning-prefix masking and self-paced masking budget scheduling to encourage reliance on visual evidence, replacing standard causal masks with reasoning-prefix masks that block both future tokens and reasoning cues. Experiments demonstrate superior performance over existing VLM distillation and self-distillation methods on multimodal reasoning benchmarks, with analysis confirming improved visual utilization throughout the reasoning process.
visual-language modelsreasoning-prefix maskingself-paced maskingmultimodal reasoningdistillation framework
Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
Seirênes introduces a self-play reinforcement learning framework that transforms contextual interference into a training signal for enhancing large language model (LLM) reasoning robustness. The method employs a parameter-shared adversarial loop where a single model both generates plausible distracting contexts to expose its reasoning blind spots and solves problems by discerning essential tasks from these perturbations. This co-evolutionary curriculum drives the model beyond superficial pattern matching. Evaluated across seven mathematical reasoning benchmarks with model scales from 4B to 30B parameters, Seirênes achieves average accuracy gains of +10.2, +9.1, and +7.2 points. Additionally, its distracting contexts reduce top-tier closed-source model accuracy by 4--5 points.
self-playcontextual interferenceco-evolutionary curriculumreasoning robustnessadversarial loop
Unlocking UML Class Diagram Understanding in Vision Language Models
The work introduces a benchmark for visual question answering (VQA) on UML class diagrams, addressing a gap in Vision Language Model (VLM) capabilities for computer science diagrams. Using a dataset of 16,000 image-question-answer triples, the authors demonstrate that LoRA-based fine-tuning surpasses the performance of Qwen 3.5 27B, a state-of-the-art VLM, on this specialized task. The benchmark is designed to be both challenging and tractable, filling a niche in diagram understanding research.
vision language modelsuml class diagramsvisual question answeringlora-based fine-tuningbenchmark
Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations
The Disaster Operational Response Agent benchmark (DORA) introduces the first agentic benchmark for end-to-end disaster response, comprising 515 expert-authored tasks across 45 real-world events and 10 disaster types, with 3,500 tool-call steps in gold trajectories. Tasks span five dimensions: disaster perception, spatial relational analysis, rescue planning, temporal reasoning, and multi-modal report synthesis, utilizing a 108-tool MCP library over heterogeneous geospatial data. Evaluation of 13 frontier LLMs reveals persistent challenges in disaster-domain grounding, tool selection, argument grounding, and compositional fragility, with agent-to-gold performance gaps widening from 7% to 56% on long pipelines.
disaster responsegeospatial datatool-call stepsmulti-modal synthesiscompositional fragility
Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization
The paper introduces Macro, a preference alignment framework for multilingual counterfactual explanation (SCE) generation using Direct Preference Optimization (DPO). It addresses the trade-off between validity and minimality in non-dominant languages by constructing preference pairs via a composite scoring function. Experiments across four LLMs and seven languages demonstrate Macro's 12.55% average validity improvement over chain-of-thought baselines while preserving minimality, outperforming supervised fine-tuning. Analyses show enhanced cross-lingual perturbation alignment and reduced generation errors.
counterfactual explanationspreference optimizationmultilingual generationvalidity-minimality trade-offdirect preference optimization
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
The paper introduces Budget-Efficient Thinking (BET), a two-stage framework for adaptive reasoning that optimizes compute allocation by aligning solve-or-fold decisions with solvability expectations. BET combines behavioral cold-start with GRPO under an investment-cost-aware reward, learning three key behaviors: short solve for easy queries, nice fold for unsolvable cases, and hero call for hard-but-solvable problems. Evaluated across seven benchmarks and three base models, BET reduces reasoning tokens by ~55% while maintaining or improving performance, demonstrating zero-shot transferability from mathematical to scientific QA and logical reasoning tasks.
adaptive reasoningbudget-efficient thinkingsolvabilitygrpozero-shot transfer
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
The paper introduces CREDIT (Contrastive REward from DIsTillation), a method for isolating input-specific reasoning in on-policy self-distillation for language models. Under a posterior-compatibility interpretation, the authors demonstrate that self-distillation token rewards correspond to Bayesian filtering increments, whose sum equals the pointwise mutual information (pMI) between response and feedback given the input. CREDIT employs a batch-contrastive baseline to decompose teacher log-probability along the input axis, penalizing responses likely under unrelated inputs. Evaluated across coding, scientific reasoning, and tool-use benchmarks on two model families, CREDIT achieves superior aggregate performance with minimal computational overhead.
self-distillationbayesian filteringpointwise mutual informationcontrastive rewardinput-specific reasoning
When Emotion Becomes Trigger: Emotion-style dynamic Backdoor Attack Parasitising Large Language Models
We propose Paraesthesia, a dynamic backdoor attack leveraging emotion as a semantic trigger in fine-tuned large language models (LLMs). Unlike token-level attacks, Paraesthesia manipulates emotional style as a decoupled factor in LLM representation space, enabling stealthy parasitic behavior. The method combines emotional style quantification and rewriting, injecting poisoned samples during fine-tuning to induce predefined harmful outputs upon emotional inputs. Evaluated on instruction-following generation and classification tasks across four LLMs, Paraesthesia achieves 99% attack success rate while preserving model utility on clean inputs.
backdoor attackemotional stylelarge language modelsfine-tuningsemantic manipulation
CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
The paper introduces CuSearch, a curriculum rollout sampling framework for optimizing agentic retrieval-augmented generation (RAG) systems trained via reinforcement learning with verifiable rewards (RLVR). CuSearch employs Search-Depth Greedy Allocation (SDGA) to prioritize deeper-search trajectories during training, which provide denser supervision for retrieval sub-policies. Two variants, SDGA-Auto and SDGA-Phase, adaptively allocate update budgets based on trajectory depth. Experiments demonstrate consistent improvements, including an 11.8-point exact-match gain over GRPO on ZeroSearch, validating search depth as an effective proxy for supervision density.
retrieval-augmented generationreinforcement learningcurriculum learningrollout samplingverifiable rewards
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
The paper introduces Anti-Self-Distillation (AntiSD), a method that improves reasoning in reinforcement learning by ascending the divergence between student and teacher models rather than descending it, addressing inconsistent gains in math reasoning from on-policy self-distillation. AntiSD reverses the per-token sign of the distillation signal and uses an entropy-triggered gate to disable the term once teacher entropy collapses. Evaluated across five models (4B to 30B parameters) on math reasoning benchmarks, AntiSD achieves the GRPO baseline's accuracy in 2 to 10x fewer steps and improves final accuracy by up to 11.5 points.
anti-self-distillationpointwise mutual informationreasoning rlon-policy self-distillationentropy-triggered gate
PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head
The paper introduces PRISM (Proxy Risk Inference via Structural Mapping), a geometric risk bound that decomposes LLM representation drift into scale, shape, and head divergence components. By leveraging the linear output head and near-isometric structure of LLM backbones, PRISM provides a closed-form upper bound on cross-entropy risk gaps between target models and post-training variants (e.g., quantized, LoRA-adapted). The method enables variant ranking and identifies specific failure modes, guiding remediation. Evaluated across two model families and five benchmarks, PRISM achieves mean Spearman correlations of 0.820 (quantization) and 0.831 (LoRA forgetting), while its shape regularizer outperforms experience replay in mitigating catastrophic forgetting.
prismrepresentation driftcross-entropy risklora-adaptedquantization
Exact Stiefel Optimization for Probabilistic PLS: Closed-Form Updates, Error Bounds, and Calibrated Uncertainty
The paper introduces an end-to-end framework for probabilistic partial least squares (PPLS) that addresses noise-signal coupling and orthogonality constraints. The method combines noise pre-estimation, constrained likelihood optimization, and prediction calibration, replacing full-spectrum noise averaging with noise-subspace estimation and interior-point penalty handling with exact Stiefel-manifold optimization. The noise-subspace estimator achieves a signal-strength-independent finite-sample rate and matches a minimax lower bound. The framework extends to sub-Gaussian settings via optional Gaussianization and provides closed-form standard errors through block-structured Fisher analysis. Evaluated on synthetic high-noise settings and multi-omics benchmarks (TCGA-BRCA and PBMC CITE-seq), the method achieves near-nominal coverage, Ridge-level point accuracy at rank r=3, and improved parameter recovery stability.
probabilistic partial least squaresstiefel-manifold optimizationnoise-subspace estimationblock-structured fisher analysismulti-omics benchmarks
Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs
ContextGuard, an inference-time token pruning framework for Omni-LLMs, preserves broad audio-visual context while reducing cross-modal redundancy by predicting coarse visual semantics from audio and pruning recoverable video tokens. It retains localized visual details unspecified by audio and merges temporally similar video tokens, requiring no downstream LLM fine-tuning and using only a lightweight predictor. Evaluated on Qwen2.5-Omni and Video-SALMONN2+ at 3B and 7B scales across six benchmarks, ContextGuard outperforms prior pruning methods, achieving full-token-level performance on five benchmarks while pruning 55% of input tokens on Qwen2.5-Omni 7B.
contextguardtoken pruningomni-llmscross-modal redundancyaudio-visual context
GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization
The paper introduces Green-Aware Routing (GAR), a constrained multi-objective optimization framework for carbon-efficient LLM inference routing. GAR minimizes CO2 emissions per request while enforcing accuracy floors and p95-latency SLOs, using adaptive constraint optimization and lightweight estimators for correctness, latency, and emissions. The proposed GAR-PD algorithm and heuristic variants achieve 15-30% carbon reduction across heterogeneous LLM pools (7B-70B parameters) on NLP benchmarks, maintaining competitive accuracy and latency guarantees.
carbon-aware routingllm inferenceconstrained optimizationservice-level objectivesprimal-dual algorithm
DiffScore: Text Evaluation Beyond Autoregressive Likelihood
DiffScore introduces masked reconstruction as an alternative to autoregressive text evaluation, addressing positional bias inherent in left-to-right factorization. Leveraging Masked Large Diffusion Language Models, it scores tokens using full bidirectional context across continuous masking rates, establishing a hierarchy from local fluency to global coherence. The framework provides diagnostic tools like multi-timestep quality profiles and bidirectional PMI decomposition, disentangling fluency from faithfulness. Experiments across ten benchmarks demonstrate DiffScore's consistent superiority over autoregressive baselines in both zero-shot and fine-tuned settings.
masked reconstructionpositional biasbidirectional contextmasking ratesfluency-faithfulness disentanglement
EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting
EpiCastBench introduces a large-scale benchmarking framework for multivariate epidemic forecasting, addressing the lack of diverse, high-quality datasets. The framework comprises 40 curated multivariate datasets spanning various infectious diseases and geographical regions, characterized by diverse temporal granularity, series length, and sparsity. Standardized evaluation settings, including unified forecasting horizons, preprocessing pipelines, and performance metrics, ensure reproducibility and fair comparison. The framework evaluates 15 multivariate forecasting models, ranging from statistical baselines to state-of-the-art deep learning and foundation models. All datasets and code are publicly available on Kaggle and GitHub.
multivariate forecastingepidemic forecastingtemporal granularitydeep learningfoundation models
Native Explainability for Bayesian Confidence Propagation Neural Networks: A Framework for Trusted Brain-Like AI
The paper introduces a native explainability framework for Bayesian Confidence Propagation Neural Networks (BCPNNs), addressing the EU AI Act's transparency requirements for high-risk systems. It proposes a taxonomy mapping BCPNN architectural primitives to explainable-AI modalities, introduces 16 explanation primitives with closed-form algorithms, and 5 configuration-as-explanation primitives for hyperparameter auditing. The method leverages BCPNN's inherent transparency properties to provide attribution, prototype, and mechanistic explanations without computational overhead. Results demonstrate feasibility for edge deployment and alignment with Industry 5.0 standards through FPGA implementations and neuromorphic sparsity.
bcpnnexplainable-aieu ai actbayesian inferenceedge computing
SoK: Unlearnability and Unlearning for Model Dememorization
This paper presents the first integrated analysis of model dememorization techniques, focusing on unlearnability and machine unlearning. The authors develop a unified taxonomy of these methods and conduct empirical evaluations to assess their robustness, interplay, and limitations regarding shallow dememorization. They identify vulnerabilities such as falsely claimed data learnability reduction, weight perturbation effects, and domain knowledge recovery during unlearning. The study also establishes the first theoretical guarantee on dememorization depth for models processed through certified unlearning. These contributions provide foundational insights for achieving deeper immemor states of sensitive knowledge across the machine learning lifecycle.
dememorizationunlearnabilitymachine unlearningcertified unlearningimemor state
NexOP: Joint Optimization of NEX-Aware k-space Sampling and Image Reconstruction for Low-Field MRI
NexOP introduces a deep-learning framework for joint optimization of k-space sampling and image reconstruction in multi-NEX acquisitions for low-field MRI. The method optimizes sampling density probabilities across the k-space-NEX domain under fixed sampling-budget constraints and employs a novel architecture to reconstruct high-SNR images from multiple low-SNR measurements. Experiments on 0.3T brain data show NexOP outperforms existing methods across various acceleration factors and tissue contrasts, yielding non-uniform sampling strategies that decrease across repetitions. Theoretical analysis supports these findings, demonstrating NexOP's potential for faster, higher-quality imaging in low-cost MRI systems.
k-space samplingsignal-to-noise ratiolow-field mrideep-learning architecturemulti-nex acquisitions
Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation
This paper resolves empirical contradictions in how large language models handle conflicts between training knowledge and contradicting documents by proposing a three-regime framework: Regime 1 (single-source updating, dominant predictor: evidence coherence), Regime 2 (competitive integration, dominant predictor: parametric certainty), and Regime 3 (task-appropriate selection, dominant predictor: task knowledge requirement). The authors formalize parametric strength (exposure frequency) and parametric uniqueness (encoding consistency) as orthogonal dimensions, with strength being the operative predictor in stable factual domains. The framework is validated across Claude Sonnet 4.6, GPT-5.5, Gemini 2.5 Flash, Llama 4 Maverick, and DeepSeek V3 using 9,970 API calls. GEE logistic regression confirms the predicted Regime 2 certainty gradient (beta = -0.38 to -0.50, all p <= .013), and a Regime 3 ablation shows task framing flips context-following from near-100% to 6-71% (p < .001).
parametric strengthparametric uniquenessgee logistic regressioncertainty gradienttask framing
Dual-Temporal LSTM with Hybrid Attention for Airline Passenger Load Factor Forecasting: Integrating Intra-Flight and Inter-Flight Booking Dynamics
A dual-stream Long Short-Term Memory (LSTM) with hybrid attention is proposed for airline passenger load factor forecasting, addressing the limitations of unidimensional temporal modeling. The model processes two complementary sequences: intra-flight booking accumulation and inter-flight booking patterns at fixed days-before-departure offsets. Multiple architectural variants combining self-attention, cross-attention, and hybrid attention with concatenation, residual, and gated fusion strategies are evaluated. Experiments on Biman Bangladesh Airlines data show the hybrid model achieves a Mean Absolute Error of 2.8167 and an R² of 0.9495, outperforming baselines and prior dual-LSTM architectures. The model generalizes across diverse route types and has been integrated into airline operations.
long short-term memoryattention mechanismbooking dynamicsmean absolute errordual-stream architecture
TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
TCP-SSM introduces token-conditioned poles to improve State Space Models (SSMs) for vision tasks, addressing implicit recurrence dynamics and memory control limitations. The method employs real poles for decay patterns and complex-conjugate poles for oscillatory responses, with token-dependent pole adaptation via bounded radius/angle modulation. Grouped pole sharing and low-rank pathways enable efficient linear-time scans. Evaluations on image classification, segmentation, and detection show 44% computation reduction in Vision Mamba models while maintaining accuracy.
state space modelstoken-conditioned polesscan operatorcomplex-conjugate poleslinear-time complexity
When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs
The paper introduces LaSCD (Laplacian-Spectral Contrastive Decoding), a training-free method to mitigate visual hallucinations in multimodal large language models (MLLMs). By analyzing high-frequency visual attention structure via layer-wise Laplacian energy, LaSCD identifies hallucination-prone layers and remaps next-token logits in closed form. Evaluations on hallucination and general multimodal benchmarks demonstrate consistent hallucination reduction while maintaining model capabilities, achieving this without additional training. The approach reveals that hallucination correlates with specific attention patterns rather than simple attention mass distribution.
multimodal large language modelsvisual hallucinationlaplacian energycontrastive decodingattention structure
Hindsight Hint Distillation: Scaffolded Reasoning for SWE Agents from CoT-free Answers
Hindsight Hint Distillation (HHD) introduces a method to enhance reasoning in long-horizon tasks without requiring costly chain-of-thought (CoT) annotations. HHD synthesizes hindsight hints from failed self-rollouts, scaffolds on-policy rollouts to complete tasks, and self-distills these trajectories for generalization. Experiments demonstrate HHD's superiority, achieving an 8% absolute improvement on SWE-bench Verified compared to iterative RFT and trajectory-synthesis baselines, which improve by only 2%. Notably, HHD-induced reasoning strategies generalize effectively to out-of-distribution tasks, yielding significant gains on SWE-bench Multilingual without multilingual training.
hindsight hint distillationchain-of-thoughtself-rolloutson-policy rolloutsself-distillation
Sharpen Your Flow: Sharpness-Aware Sampling for Flow Matching
SharpEuler introduces a training-free sampler for flow matching models that optimizes sample quality under fixed evaluation budgets. The method profiles pretrained models offline by estimating velocity field sharpness along calibration trajectories, converting this profile into a non-uniform timestep grid via quantile transform. SharpEuler is theoretically justified through numerical, variational, and statistical principles, demonstrating stability at the terminal distribution level. Empirical results show improved sample quality, reducing inter-mode leakage and increasing mode coverage compared to uniform sampling schedules.
flow matchingsharpness profileeuler integrationtimestep gridmode coverage
Optimal LTLf Synthesis
The paper introduces optimal LTLf synthesis, addressing the limitation of traditional strategy synthesis by maximizing the realization of objectives when not all are jointly achievable. Three approaches are proposed: max-guarantee synthesis, which identifies a maximal set of a priori guaranteed objectives; max-observation synthesis, which maximizes a posteriori realized objectives across executions; and incremental max-observation synthesis, enhancing strategies by leveraging stronger guarantees during execution. Experimental results demonstrate that these variations scale effectively, solving a significant fraction of benchmark instances within timeout constraints, confirming their practical feasibility.
ltlf synthesismax-guarantee synthesismax-observation synthesisincremental synthesisstrategy synthesis
Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting
We introduce Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting, a hyperparameter-free optimization method that improves Group Relative Policy Optimization (GRPO) for large language models. The method dynamically down-weights extreme token-level updates using a Gaussian kernel, leveraging the covariance between token probabilities and their advantages to stabilize entropy changes during training. Empirical evaluations demonstrate that this approach enhances downstream reasoning performance across benchmarks compared to standard GRPO, while effectively maintaining training stability and preserving informative learning signals.
group relative policy optimizationcovariance-awaregaussian-kerneltoken probabilitiesadvantage reweighting
Read, Grep, and Synthesize: Diagnosing Cross-Domain Seed Exposure for LLM Research Ideation
The paper investigates whether LLM-based research ideation benefits from cross-domain retrieval or mere exposure to diverse mechanisms. It introduces PaperGym, a three-stage pipeline: (1) tool-augmented seed extraction (read, grep, bash) from isolated papers, (2) cross-domain seed retrieval via paraphrasing across seven ML domains, and (3) rubric-scored method synthesis. Results show cross-domain retrieval outperforms no-retrieval and same-domain baselines in novelty but matches random diverse-seed controls, suggesting LLMs benefit from diversity but lack semantic retrieval exploitation. The authors release seed libraries and evaluation scripts.
llm ideationcross-domain retrievaltool-augmented extractionmethod synthesisrubric-based evaluation
Efficient and provably convergent end-to-end training of deep neural networks with linear constraints
The paper introduces an efficient, provably convergent method for end-to-end training of deep neural networks with linear constraints via projection layers. Key innovation is the HS-Jacobian, a conservative mapping for polyhedral projection operators that enables nonsmooth automatic differentiation and integration with optimizers like Adam. Theoretical convergence guarantees are established for the HS-Jacobian-based Adam algorithm. Experiments across finance, computer vision, and architecture design demonstrate superior performance over existing methods.
hs-jacobianpolyhedral projectionnonsmooth automatic differentiationlinear constraintsend-to-end training
PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting
PointGS introduces an unsupervised 3D point cloud segmentation pipeline leveraging 3D Gaussian Splatting to bridge the discrete-continuous domain gap between sparse point clouds and dense 2D images. The method reconstructs sparse point clouds into dense 3D Gaussian spaces via multi-view observations, renders multi-view dense images, extracts 2D semantic masks using the Segment Anything Model (SAM), and distills semantics to 3D Gaussian primitives through contrastive learning. Point semantics are assigned via nearest-neighbor search on labeled Gaussians after two-step registration. PointGS achieves +0.9% mIoU on ScanNet-V2 and +2.8% mIoU on S3DIS, outperforming state-of-the-art unsupervised methods.
3d gaussian splattingpoint cloud segmentationcontrastive learningsegment anything modelsemantic consistency
Controllable User Simulation
The paper formalizes controllable user simulation as a causal inference problem, demonstrating that standard supervised fine-tuning introduces structural bias via trajectory labels coupled to behavior policies. This look-ahead bias causes controllability collapse under policy shift, where evaluation metric variance grows geometrically. The authors propose causally consistent training mitigations: a priori controls, dynamic step-wise controls, and policy-conditioned learning. Experiments show their method eliminates bias, preserves conversational variance, and generalizes zero-shot to unseen agent behaviors, unlike standard approaches that distort distributions and collapse diversity.
causal inferencecontrollability collapselook-ahead biaspolicy-conditioned learningzero-shot generalization
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
AutoLLMResearch introduces an agentic framework for automating high-cost LLM experiment configurations, addressing the inefficiency of manual expert-driven approaches. The method leverages LLMConfig-Gym, a multi-fidelity environment with over one million GPU hours of verifiable outcomes across four critical LLM experiment tasks, and a structured training pipeline formulated as a long-horizon Markov Decision Process to incentivize cross-fidelity extrapolation reasoning. Evaluations against diverse baselines demonstrate the framework's effectiveness, generalization, and interpretability, supporting its practical utility for scalable LLM experiment automation.
multi-fidelity environmentmarkov decision processcross-fidelity extrapolationllm experiment configurationagentic framework
A Study on Hidden Layer Distillation for Large Language Model Pre-Training
The study evaluates Hidden Layer Distillation (HLD) for decoder-only LLM pre-training, comparing it to logit-based Knowledge Distillation (KD) and self-supervised baselines. Using Gemma3 3.4B as teacher and 123M/735M student models trained on up to 168B C4 tokens, HLD shows systematic perplexity gains over KD but no consistent downstream task improvement. Results suggest HLD extracts latent signals, though further breakthroughs may be needed for broader pre-training utility.
hidden layer distillationknowledge distillationlarge language modelsdecoder-only architectureperplexity
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
The authors propose Pion, a spectrum-preserving optimizer for LLM training that uses orthogonal equivalence transformations to update weight matrices while preserving their singular values. Unlike additive optimizers like Adam, Pion applies left and right orthogonal transformations to modulate weight matrix geometry without altering spectral norms. Theoretical analysis covers update rule derivation, design choices, and convergence properties. Experiments demonstrate Pion's stability and competitiveness in both LLM pretraining and finetuning compared to standard optimizers.
spectrum-preserving optimizerorthogonal equivalence transformationsingular value preservationllm trainingspectral norm
Elastic Attention Cores for Scalable Vision Transformers
We propose Visual Elastic Core Attention (VECA), a Vision Transformer architecture that replaces quadratic-complexity all-to-all self-attention with linear-time core-periphery structured attention. VECA introduces a small set of learned core tokens that mediate information exchange between image patches, reducing complexity to O(N) for N patches interacting with C resolution-invariant cores. The model maintains and updates all N input tokens while avoiding a C-way bottleneck through nested training along the core axis. Evaluated on classification and dense tasks, VECA achieves competitive performance with state-of-the-art vision foundation models while significantly reducing computational costs, establishing elastic core-periphery attention as a scalable alternative for Vision Transformers.
vision transformerscore-periphery attentionlinear complexityelastic attentionnested training
Task-Adaptive Embedding Refinement via Test-time LLM Guidance
We propose a task-adaptive embedding refinement method using test-time LLM guidance to enhance zero-shot search and classification performance. The approach refines query embeddings in real-time by leveraging generative LLM feedback on a small document set, enabling embeddings to adapt to task-specific constraints. Extensive experiments on diverse benchmarks demonstrate consistent improvements, with up to +25% gains in literature search, intent detection, key-point matching, and query-instruction following. The refined embeddings improve ranking quality and binary separation, expanding the practical deployment scope of embedding models as a cost-effective alternative to LLM pipelines. Code is released for reproducibility.
embedding refinementzero-shot classificationllm guidancetask adaptationquery embedding
MEME: Multi-entity & Evolving Memory Evaluation
MEME introduces a benchmark evaluating LLM-based agents in persistent environments, focusing on multi-entity and evolving memory tasks. It defines six tasks, including Cascade, Absence, and Deletion, which assess dependency reasoning and post-removal state. The study evaluates six memory systems across three paradigms on 100 episodes, revealing severe performance collapse on dependency reasoning tasks (Cascade: 3%, Absence: 1% average accuracy). Prompt optimization, deeper retrieval, and stronger LLMs fail to bridge this gap; only a file-based agent with Claude Opus 4.7 partially improves performance at ~70x baseline cost, highlighting scalability challenges.
llm-based agentsdependency reasoningmemory systemsprompt optimizationscalability challenges
Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
The paper demonstrates geometric coupling between routers and experts in Sparse Mixture-of-Experts (SMoE) models, where router-expert weight gradients align along input directions. Analyzing a 1B-parameter SMoE, the authors show router scores predict expert activations, revealing shared routing-expert dynamics. Auxiliary load-balancing losses disrupt this coupling by homogenizing router directions. A parameter-free online K-Means router, leveraging geometric coupling, achieves low load imbalance with minimal perplexity increase, outperforming loss-based balancing methods. Results indicate routers learn assignment geometries that facilitate effective expert specialization.
sparse mixture-of-expertsgeometric couplingrouter-expert alignmentload-balancing lossesonline k-means router
High-arity Sample Compression
The work introduces a high-arity variant of sample compression schemes, extending concepts from learning theory to product spaces. By analyzing the properties of these schemes, the authors demonstrate that the existence of a high-arity sample compression scheme with non-trivial quality implies high-arity PAC learnability. This result bridges high-arity learning theory with classical PAC learning frameworks, providing a theoretical foundation for understanding learnability in complex, multi-dimensional spaces.
sample compression schemeshigh-arity learning theorypac learnabilityproduct spaceslearning theory
Search Your Block Floating Point Scales!
ScaleSearch introduces a fine-grained search strategy for selecting optimal scale factors in Block Floating Point (BFP) quantization, minimizing quantization errors by leveraging mantissa bits in microscaling formats. The method integrates with existing techniques like Post Training Quantization (PTQ) and low-precision attention, enhancing their performance. ScaleSearchAttention, an NVFP4-based attention algorithm, ensures near-zero performance loss in causal language modeling. Experiments demonstrate a 27% reduction in quantization error for NVFP4, a 15-point improvement in PTQ for Qwen3-8B on MATH500, and a 0.77-point improvement in Wikitext-2 PPL for Llama 3.1 70B.
block floating pointquantizationmicroscalingnvfp4post training quantization
A proximal gradient algorithm for composite log-concave sampling
The authors propose a proximal gradient algorithm for sampling from composite log-concave distributions of the form π∝e^(-f-g), where f is smooth and g admits a restricted Gaussian oracle (RGO). The method leverages gradient evaluations of f and RGO calls for g, achieving ε error in total variation distance in O~(κ√d log^4(1/ε)) iterations when f+g is α-strongly convex and f is β-smooth (κ=β/α). The results extend to non-log-concave distributions satisfying Poincaré/log-Sobolev inequalities and non-smooth Lipschitz f.
log-concave samplingproximal gradientrestricted gaussian oracletotal variation distancepoincaré inequality
Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs
The paper proposes Multi-Stream LLMs, a paradigm shift from sequential message processing to parallel streams of computation in language models. By instruction-tuning models to simultaneously read from multiple input streams and generate tokens across multiple output streams in each forward pass, the method addresses limitations of sequential processing (e.g., inability to act while reading/react while writing). The approach improves efficiency through parallelization, enhances security via separation of concerns, and increases monitorability while maintaining causal dependencies across timesteps.
multi-stream llmsparallel computationinstruction-tuningcausal dependenciesautonomous agents
TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection
TextSeal introduces a localized watermark for large language models that combines dual-key generation, entropy-weighted scoring, and multi-region localization to enhance detection robustness without inference overhead. The method builds on Gumbel-max sampling and supports optimizations like speculative decoding and multi-token prediction. Evaluations demonstrate TextSeal strictly outperforms baselines like SynthID-text in detection strength, maintains downstream performance on reasoning benchmarks, and shows no perceptible quality degradation in multilingual human evaluations (6000 A/B comparisons, 5 languages). Additionally, its 'radioactive' property enables detection of unauthorized use through model distillation.
gumbel-max samplingdual-key generationspeculative decodingmulti-region localizationmodel distillation
ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models
The paper introduces ORCE, an order-aware framework for improving verbalized confidence alignment in LLMs by decoupling confidence estimation from answer generation. The method first generates answers, then estimates confidence conditioned on fixed question-answer pairs using a sampling-based surrogate and rank-based RL objectives to align confidence with correctness likelihood. Experiments on reasoning and knowledge benchmarks demonstrate improved calibration and failure prediction while maintaining answer accuracy, showing the benefits of decoupled confidence optimization.
verbalized confidenceconfidence calibrationreinforcement learninglarge language modelsfailure prediction
Environment-Adaptive Preference Optimization for Wildfire Prediction
We propose Environment-Adaptive Preference Optimization (EAPO), a framework for wildfire prediction that addresses long-tailed distributions and environmental shifts. EAPO constructs distribution-aligned datasets via $k$-nearest neighbor retrieval and performs hybrid fine-tuning combining supervised learning with preference optimization, emphasizing rare extreme events. Evaluated on a real-world wildfire prediction task with environmental shifts, EAPO achieves robust performance (ROC-AUC 0.7310) and improves detection in extreme regimes, demonstrating effectiveness in dynamic wildfire prediction systems.
environment-adaptive preference optimizationlong-tailed distributionwildfire predictionk-nearest neighbor retrievalpreference optimization
Learning Minimally Rigid Graphs with High Realization Counts
The paper introduces a reinforcement learning method to construct minimally rigid graphs with high realization counts, addressing an extremal problem in rigidity theory. The approach uses the Deep Cross-Entropy Method with a policy combining a Graph Isomorphism Network encoder and a permutation-equivariant action head to perform 0- and 1-extensions (Henneberg moves). Empirical results show the method matches known optima for planar realization counts and improves bounds for spherical realization counts, producing new record graphs.
minimally rigid graphsrealization countshenneberg movesgraph isomorphism networkdeep cross-entropy method
ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging
ORBIT preserves foundational language capabilities during Generative Retrieval (GenRetrieval) fine-tuning by regulating weight drift. The method monitors distance between fine-tuned and original model parameters, applying weight averaging when divergence exceeds a threshold. Experiments demonstrate ORBIT's superiority over continual learning baselines and regularization methods, maintaining both text generation and retrieval performance.
generative retrievalcatastrophic forgettingweight averagingmodel driftfine-tuning
Aligning Flow Map Policies with Optimal Q-Guidance
The paper introduces flow map policies, a novel class of generative policies enabling fast action generation via arbitrary-size jumps in flow-based dynamics, addressing inference latency in sequential decision-making. The method combines FLOW MAP Q-GUIDANCE (FMQ), a trust-region optimization for offline-to-online RL adaptation, with Q-GUIDED BEAM SEARCH (QGBS) for iterative inference-time refinement. Evaluated on 12 robotic tasks from OGBench and RoboMimic, FMQ achieves a 21.3% relative improvement in average success rate over prior one-step policies.
flow map policiesoffline-to-online rltrust-region optimizationq-guided beam searchgenerative policies
Model-based Bootstrap of Controlled Markov Chains
The authors introduce a model-based bootstrap method for estimating transition kernels in finite controlled Markov chains (CMCs) under nonstationary or history-dependent control policies, addressing offline reinforcement learning scenarios with unknown behavior policies. The method leverages a novel bootstrap law of large numbers for visitation counts and applies the martingale central limit theorem to bootstrap transition increments. Distributional consistency is established for both single long-chain and episodic offline RL regimes, extending to offline policy evaluation and optimal policy recovery via the delta method. Empirical results on the RiverSwim problem demonstrate superior coverage of percentile bootstrap confidence intervals compared to baselines.
controlled markov chainsoffline reinforcement learningbootstrap law of large numbersmartingale central limit theoremdelta method
Trajectory-Agnostic Asteroid Detection in TESS with Deep Learning
The paper introduces a deep learning method for asteroid detection in TESS data using a novel W-Net architecture comprising two stacked 3D U-Nets with skip connections. The approach eliminates the need for speed/direction assumptions by employing data augmentation through image cube rotation. A key innovation is Adaptive Normalization, a learned data scaling technique optimizing input processing. The publicly released tess-asteroid-ml toolkit generates training data with asteroid masks. The method's generalizability makes it suitable for future missions like the Nancy Grace Roman Space Telescope and NEOSurveyor.
w-netadaptive normalizationtess3d u-netshift-and-stack
Events as Triggers for Behavioral Diversity in Multi-Agent Reinforcement Learning
The authors propose a Multi-Agent Reinforcement Learning (MARL) framework that decouples agent identity from behavior using event-triggered behavioral transitions. The framework introduces Neural Manifold Diversity (NMD), a formal distance metric for transient, agent-agnostic behaviors, and employs an event-based hypernetwork generating Low-Rank Adaptation (LoRA) modules over a shared team policy for dynamic agent-policy reconfiguration. Theoretical analysis ensures diversity does not interfere with reward maximization. Empirical results show the framework outperforms baselines across benchmarks, achieves zero-shot generalization, and uniquely solves tasks requiring sequential behavior reassignment.
multi-agent reinforcement learningneural manifold diversitylow-rank adaptationevent-triggered transitionsbehavioral diversity
A Semi-Supervised Framework for Speech Confidence Detection using Whisper
A semi-supervised hybrid framework is proposed for speech confidence detection, addressing data scarcity and annotation subjectivity. The method combines deep semantic embeddings from the Whisper encoder with interpretable acoustic features (eGeMAPS descriptors) and auxiliary probability estimates of vocal stress and disfluency. An Uncertainty-Aware Pseudo-Labelling strategy is introduced to generate high-quality pseudo-labels for unlabelled data, prioritizing data quality over quantity. The framework achieves a Macro-F1 score of 0.751, outperforming self-supervised baselines (WavLM, HuBERT, Wav2Vec 2.0) and the unimodal Whisper baseline, with a 3% improvement in the minority class. Ablation studies confirm the superiority of curated pseudo-labels over indiscriminate augmentation.
whisper encoderegemaps descriptorspseudo-labellingmacro-f1 scoreacoustic features
MetaColloc: Optimization-Free PDE Solving via Meta-Learned Basis Functions
MetaColloc introduces an optimization-free, data-free framework for solving partial differential equations (PDEs) by decoupling basis discovery from the solving process. The method meta-trains a dual-branch neural network on Gaussian Random Fields offline to create a universal dictionary of neural basis functions. At test time, the frozen network assembles a collocation matrix, solving PDEs via a single linear least squares step or Newton-Raphson for non-linear cases. Experiments across six 2D and 3D PDEs demonstrate state-of-the-art accuracy and test-time computation reductions by orders of magnitude. Frequency sweep analysis reveals a critical mismatch between function approximation and operator stability at high frequencies, guiding future operator-aware meta-learning.
partial differential equationsmeta-learningcollocation matrixgaussian random fieldsnewton-raphson
Attacks and Mitigations for Distributed Governance of Agentic AI under Byzantine Adversaries
This work analyzes vulnerabilities in SAGA, a state-of-the-art agentic AI governance system, focusing on attacks from a compromised Provider. The authors identify concrete attacks, including undermining agent attributability, extracting private data, and bypassing access control. They propose three mitigation strategies: SAGA-BFT, a Byzantine-resilient architecture with strong security but high overhead; SAGA-MON and SAGA-AUD, leveraging lightweight monitoring and auditing for minimal overhead; and SAGA-HYB, a hybrid approach balancing security and performance. Evaluations compare these architectures against SAGA, discussing optimal solutions under varying conditions.
agentic ai governancebyzantine resilienceaccess controlmonitoringauditing
From Message-Passing to Linearized Graph Sequence Models
The authors introduce Linearized Graph Sequence Models, a framework that reformulates message-passing graph computation through the lens of sequence modeling to simplify architectural decisions. This approach decouples computational processing depth from information propagation depth, enabling graph architectural choices to be treated as sequence modeling problems. The study empirically and theoretically analyzes sequence properties that effectively preserve graph inductive bias, particularly demonstrating improved performance on long-range information tasks in graphs. The findings provide a principled method for integrating modern sequence modeling advances into message-passing based graph learning while recasting architectural questions as input modeling choices.
message-passingsequence modelinggraph inductive biasinformation propagationlinearized graph
Neural-Schwarz Tiling for Geometry-Universal PDE Solving at Scale
The paper introduces Neural-Schwarz Tiling (NEST), a local-to-global framework for scalable PDE solving that shifts learning from full-domain solution operators to reusable local physical solvers. NEST trains a neural operator on minimal voxel patches (3×3×3) with diverse local geometries and boundary/interface data, then composes global solutions through domain decomposition and iterative Schwarz coupling with partition-of-unity assembly. Evaluated on nonlinear static equilibrium in compressible neo-Hookean solids, NEST demonstrates generalization across domain size, shape, and boundary-condition configurations, offering a reusable path for scalable learned PDE solvers.
neural operatordomain decompositionschwarz couplingvoxel patchespartition-of-unity
Multi-Variable Conformal Prediction: Optimizing Prediction Sets without Data Splitting
The paper introduces multi-variable conformal prediction (MCP), a framework extending conformal prediction to vector-valued score functions with multiple calibration variables, eliminating data splitting while preserving coverage guarantees. MCP unifies prediction set design and calibration via scenario theory, proposing two variants: RemMCP (constrained optimization with constraint removal) and RelMCP (iterative optimization with constraint relaxation). Experiments on ellipsoidal and multi-modal prediction sets show both variants achieve target coverage with smaller or comparable set sizes than split conformal baselines, while reducing calibration variance by using all data simultaneously.
conformal predictionscenario theoryprediction setscoverage guaranteesconstrained optimization
Online Learning-to-Defer with Varying Experts
(No summary returned.)
Grid Games: The Power of Multiple Grids for Quantizing Large Language Models
The paper introduces multiple-grid quantization for large language models, formalizing the power-of-two-grids (PO2) problem and demonstrating its efficacy for small-group formats like MXFP and NVFP. Four grid families are instantiated: PO2(NF4), MPO2, PO2(Split87), and SFP4, each leveraging adaptive grids to enhance quantization accuracy. Empirical results show consistent improvements in post-training quantization of open models and pre-training of Llama-like models, outperforming single-grid FP4 in both weight-only and weight+activation scenarios. Theoretical analysis indicates diminishing returns for very large groups. Source code is provided for reproducibility.
quantizationpower-of-two-gridsmxfpnf4tensorcore
Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds
(No summary returned.)
In-context learning to predict critical transitions in dynamical systems
We introduce TipPFN, an in-context learning framework for predicting critical transitions in dynamical systems using a prior-data fitted network. The method leverages a novel synthetic data generator based on canonical bifurcation scenarios with randomized stochastic dynamics, enabling flexible adaptation to contexts of varying size, complexity, and dimensionality. TipPFN achieves robust, state-of-the-art performance in early detection of critical transitions across unseen tipping regimes, sim-to-real scenarios, and real-world observations, demonstrating effectiveness in both in-context learning and zero-shot settings.
in-context learningcritical transitionsprior-data fitted networkbifurcation scenarioszero-shot learning
From Model Uncertainty to Human Attention: Localization-Aware Visual Cues for Scalable Annotation Review
The study introduces a novel interface for visualizing spatial uncertainty in AI-assisted annotation workflows, addressing the challenge of mislocalized predictions in tasks requiring both class labels and spatial boundaries. Through a controlled experiment with 120 participants, the interface demonstrates that annotators receiving spatial uncertainty cues achieve higher label quality and increased efficiency. Box-level analysis reveals that these cues effectively redirect annotator attention toward high-uncertainty predictions and away from well-localized boxes. The findings establish localization uncertainty as a critical factor in improving human-in-the-loop annotation processes.
spatial uncertaintyai-assisted annotationlocalization-awarehuman-in-the-looplabel quality
Approximation of Maximally Monotone Operators : A Graph Convergence Perspective
The paper introduces a graph convergence framework for approximating maximally monotone operators, addressing limitations of classical uniform and $L^p$ approximation methods. By leveraging Painlevé-Kuratowski convergence, the authors demonstrate that continuous encoder-decoder architectures can approximate such operators locally in the graph sense. Additionally, they propose resolvent-based parameterizations to construct structure-preserving approximations that maintain maximal monotonicity. This approach extends operator learning to discontinuous and set-valued operators, which are prevalent in differential operator contexts.
graph convergencemaximally monotone operatorspainlevé-kuratowski convergenceresolvent-based parameterizationsencoder-decoder architectures
STRABLE: Benchmarking Tabular Machine Learning with Strings
STRABLE introduces a benchmarking corpus of 108 real-world tabular datasets containing both string and numerical entries, addressing the understudied area of tabular learning with strings. The study evaluates 445 pipelines, comparing end-to-end architectures against modular pipelines where strings are encoded, post-processed, and fed to tabular learners. Results show that advanced tabular learners paired with simple string embeddings perform well on categorical-dominant tables, while large LLM encoders excel on free-text-dominant tables, with performance sensitive to post-processing. STRABLE provides generalizable pipeline rankings, establishing it as a foundational resource for research on string tabular learning.
tabular learningstring embeddingsllm encoderspost-processingbenchmarking corpus
Targeted Neuron Modulation via Contrastive Pair Search
We introduce contrastive neuron attribution (CNA), a method for identifying the 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts using only forward passes. Applying CNA to Llama and Qwen models (1B-72B parameters), we find that ablating these neurons reduces refusal rates by over 50% on a jailbreak benchmark while preserving fluency. Base models exhibit similar late-layer discrimination structures, but steering these neurons produces content shifts rather than behavioral change. Results suggest alignment fine-tuning transforms pre-existing discrimination structures into sparse, targetable refusal gates.
contrastive neuron attributionmlp neuronsalignment fine-tuningrefusal gatejailbreak benchmark
What makes a word hard to learn? Modeling L1 influence on English vocabulary difficulty
The study computationally models English vocabulary difficulty for learners with Spanish, German, or Chinese as their first language (L1), using gradient-boosted models trained on word familiarity, meaning, surface form, and cross-linguistic transfer features. Shapley value analysis reveals word familiarity as the dominant feature across all L1s, with orthographic transfer additionally influencing Spanish and German learners. Chinese learners' difficulty is determined solely by familiarity and surface features, lacking orthographic transfer. The models yield interpretable, L1-specific difficulty estimates for curriculum design.
gradient-boosted modelsshapley valuesorthographic transfercross-linguistic transfervocabulary difficulty
Hypernetworks for Dynamic Feature Selection
We propose Hyper-DFS, a hypernetwork-based dynamic feature selection (DFS) approach that generates feature subset-specific classifier parameters on demand. The method employs a Set Transformer encoding to create a smooth conditioning space, ensuring geometric proximity for functionally similar tasks. Structural analysis shows Hyper-DFS achieves a smaller complexity bound than mask-embedding methods. Experiments demonstrate state-of-the-art performance on synthetic and real-life tabular data, competitive results on image datasets, and superior zero-shot generalization to unseen feature subsets compared to existing DFS approaches.
hypernetworkdynamic feature selectionset transformerzero-shot generalizationcomplexity bound
Reconstruction of Personally Identifiable Information from Supervised Finetuned Models
We introduce COVA, a novel decoding algorithm for reconstructing personally identifiable information (PII) from supervised finetuned (SFT) language models under prefix-based attacks. Using multi-turn, user-centric Q&A datasets in medical and legal domains containing PII, we evaluate PII leakage across varying levels of attacker knowledge about the fine-tuning dataset. Results demonstrate that partial attacker knowledge significantly improves reconstruction success, with leakage varying substantially across PII types. COVA consistently outperforms existing extraction methods in reconstructing sensitive information from SFT models.
supervised finetuningpersonally identifiable informationprefix-based attacksdecoding algorithmmulti-turn q&a
Delay-Empowered Causal Hierarchical Reinforcement Learning
The paper proposes Delay-Empowered Causal Hierarchical Reinforcement Learning (DECHRL), a novel method for handling stochastic delayed effects in reinforcement learning. DECHRL explicitly models causal state transitions and their delay distributions, integrating them into a delay-aware empowerment objective to guide exploration toward controllable states. Evaluated on modified 2D-Minecraft and MiniGrid environments with stochastic delays, DECHRL outperforms baselines in decision-making under temporal uncertainty by effectively modeling and adapting to variable delays.
hierarchical reinforcement learningstochastic delayscausal modelingempowerment objectivetemporal uncertainty
Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models
The Instruction Lens Score (InsLen) is introduced as a novel plug-and-play object hallucination detector for multimodal large language models (MLLMs), addressing a critical challenge in their reliable deployment. InsLen leverages instruction token embeddings, which implicitly encode visual information while filtering misleading visual embeddings, combining a Calibrated Local Score with a Context Consistency Score to measure object token consistency. Extensive experiments across multiple benchmarks and diverse MLLM architectures demonstrate InsLen's consistent outperformance of existing hallucination detection methods, highlighting its effectiveness and robustness without requiring auxiliary models or additional training.
instruction lens scoreobject hallucinationmultimodal large language modelscontext consistency scorecalibrated local score
SOAR: Scale Optimization for Accurate Reconstruction in NVFP4 Quantization
The paper proposes SOAR, a post-training quantization framework that improves NVFP4 (4-bit microscaling format) accuracy for large language models. It introduces Closed-form Joint Scale Optimization (CJSO) to analytically optimize global and block-wise scales via reconstruction error minimization, and Decoupled Scale Search (DSS) to separate quantization/dequantization scales with discrete search. Experiments demonstrate SOAR outperforms existing NVFP4 methods across multiple LLMs, achieving higher accuracy at identical memory footprints without hardware overhead.
nvfp4post-training quantizationreconstruction errorscale optimization4-bit quantization
Optimal Policy Learning under Budget and Coverage Constraints
The paper presents a framework for optimal policy learning under joint budget and minimum coverage constraints, demonstrating that the problem exhibits a knapsack-type structure. The optimal policy is characterized by an affine threshold rule incorporating budget and coverage shadow prices. The linear programming relaxation of the combinatorial solution is shown to have an O(1) integrality gap, ensuring asymptotic equivalence with optimal discrete allocation. Two algorithms, Greedy-Lagrangian (GLC) and rank-and-cut (RC), are analyzed: GLC achieves near-optimal performance in finite samples, while RC performs optimally under slack coverage constraints or homogeneous costs but misallocates when cost heterogeneity interacts with binding coverage constraints. Monte Carlo simulations validate these findings.
knapsack-type structureaffine threshold ruleintegrality gapgreedy-lagrangianrank-and-cut
Intrinsic Vicarious Conditioning for Deep Reinforcement Learning
The paper introduces vicarious conditioning as an intrinsic reward mechanism for deep reinforcement learning, enabling agents to learn from demonstrators without accessing their policies or reward functions. The method implements four cognitive steps (attention, retention, reproduction, reinforcement) using memory-based approaches, supporting low-shot learning. Evaluations in MiniWorld Sidewalk and Box2D CarRacing show improved episode lengths by avoiding non-descriptive terminal states and guiding toward desirable states, demonstrating applicability to single-life and continual learning scenarios.
intrinsic rewardvicarious conditioninglow-shot learningmemory-based methodsnon-descriptive terminal
On the Importance of Multistability for Horizon Generalization in Reinforcement Learning
We formalize temporal horizon generalization in reinforcement learning (RL) for partially observable Markov decision processes (POMDPs), deriving necessary and sufficient conditions for policies to remain optimal across arbitrary horizons. Through empirical evaluation of nonlinear and parallelizable recurrent neural network (RNN) variants, we demonstrate that multistability is necessary for horizon generalization and sufficient in simple tasks, while complex tasks additionally require transient dynamics. Modern parallelizable architectures, including state space models and gated linear RNNs, fail to generalize due to inherent monostability. These findings establish multistability and transient dynamics as complementary dynamical regimes essential for scalable long-horizon RL, motivating the design of parallelizable architectures combining both properties.
temporal horizon generalizationmultistabilitytransient dynamicspartially observable markov decision processesparallelizable architectures
Investigating simple target-covariate relationships for Chronos-2 and TabPFN-TS
The paper evaluates the ability of Time Series Foundation Models (TSFMs) to integrate covariates by conducting controlled experiments on simple target-covariate relationships. Specifically, it compares Chronos-2 and TabPFN-TS, two recent TSFM architectures, in their capacity to model these dependencies. Results indicate that TabPFN-TS outperforms Chronos-2 in capturing such relationships, particularly for short prediction horizons. This suggests that Chronos-2's strong benchmark performance does not necessarily correlate with optimal modeling of simple covariate-target dependencies.
time series foundation modelscovariateschronos-2tabpfn-tszero-shot
A Unified Graph Language Model for Multi-Domain Multi-Task Graph Alignment Instruction Tuning
UniGraphLM introduces a unified graph language model addressing multi-domain, multi-task graph alignment instruction tuning. It integrates a multi-domain, multi-task GNN encoder to learn generalizable graph representations aligned with textual semantics, overcoming challenges of domain-task variability and LLM token space compatibility. The model adaptively aligns these representations with LLMs, enhancing generalization across diverse graph data. This approach bridges the gap between GNN-encoded representations and LLM token spaces, enabling effective instruction tuning for graph-language alignment.
graph neural networkslarge language modelsinstruction tuninggraph alignmenttoken space
ECTO: Exogenous-Conditioned Temporal Operator for Ultra-Short-Term Wind Power Forecasting
We propose ECTO (Exogenous-Conditioned Temporal Operator), a unified framework for ultra-short-term wind power forecasting that decomposes exogenous variable modeling into Physically-Grounded Variable Selection (PGVS) and Exogenous-Conditioned Regime Refinement (ECRR). PGVS performs hierarchical, group-aware sparse selection using domain-informed physical priors and sparsemax activations, while ECRR routes forecasts through learned regime experts with gain-bias calibration and horizon-specific corrections. Experiments on three wind farms (66-200 MW, 11-13 exogenous variables) show ECTO achieves the lowest MSE across all sites, with relative improvements of 2.2%-5.2% over baselines, widening to 8.6% at H=32. Ablations confirm positive contributions from PGVS (+1.84%) and ECRR (+2.86%), with interpretability analysis revealing physically meaningful variable selection and consistent calibration strategies.
ultra-short-term forecastingexogenous variablessparsemax activationgain-bias calibrationmixture-of-experts
Fair Conformal Classification via Learning Representation-Based Groups
The paper introduces a fair conformal inference framework for classification that guarantees conditional coverage on adaptively identified subgroups, addressing algorithmic biases in traditional conformal prediction methods. The method constructs compact prediction sets by learning representation-based groups through nonlinear feature combinations, balancing effectiveness and efficiency while ensuring adaptive equalized coverage across subgroups. Experiments on synthetic and real-world datasets demonstrate the framework's effectiveness in achieving fair and trustworthy machine learning.
conformal predictionconditional coveragealgorithmic biasrepresentation learningfair classification
Probing Non-Equilibrium Grain Boundary Dynamics with XPCS and Domain-Adaptive Machine Learning
The authors introduce a domain-adaptive machine learning framework combined with X-ray photon correlation spectroscopy (XPCS) to quantitatively probe non-equilibrium grain boundary (GB) dynamics in nanocrystalline materials. They employ a semi-supervised learning approach that transfers physical parameter labels from continuum simulations to unlabeled experimental XPCS maps via domain-adaptive representation alignment. This method enables direct extraction of kinetic parameters, including bulk diffusivity, GB stiffness, and effective GB concentration, from noisy XPCS fluctuation maps. Results demonstrate that GB relaxation in nanocrystalline silicon deviates from time-translation invariance, remaining far from equilibrium over experimental timescales.
grain boundary dynamicsx-ray photon correlation spectroscopydomain-adaptive machine learningnon-equilibrium relaxationsemi-supervised learning
Information-Theoretic Generalization Bounds for Sequential Decision Making
The paper introduces a sequential supersample framework to extend information-theoretic generalization bounds to sequential decision-making problems, addressing limitations in existing supersample conditional mutual information (CMI) bounds. The method separates learner filtration from proof-side enlargement, leveraging row-wise exchangeability to control the sequential generalization gap via sequential CMI, a sum of roundwise selector-loss information terms. A Bernstein-type refinement is also established for faster rates under variance conditions. The framework applies to online learning, streaming active learning with importance weighting, and stochastic multi-armed bandits.
sequential supersampleconditional mutual informationrow-wise exchangeabilityselector-loss informationbernstein-type refinement
Multi-Task Representation Learning for Conservative Linear Bandits
The paper introduces Constrained Multi-Task Representation Learning (CMTRL), a framework for conservative linear bandits that leverages shared low-dimensional representations across tasks. It proposes Safe-AltGDmin, an algorithm combining alternating projected gradient descent and minimization to recover a low-rank feature matrix while adhering to safety or performance constraints. Theoretical guarantees for regret and sample complexity bounds are established. Empirical evaluations demonstrate the algorithm's performance against benchmark methods in multi-task linear bandit settings.
linear banditsmulti-task learninglow-rank representationregret boundssample complexity
Expected Batch Optimal Transport Plans and Consequences for Flow Matching
(No summary returned.)
Lower bounds for one-layer transformers that compute parity
The work establishes a lower bound for one-layer transformers, proving that no self-attention layer post-processed by a rational function can sign-represent the parity function unless the product of heads and post-processing degree grows linearly with input length. The method combines this bound with ReLU network rational approximations to extend the result to ReLU-post-processed self-attention layers. Results show fundamental limitations in transformer expressivity for parity computation under these architectural constraints.
self-attentionparity functionrational functionlower boundrelu networks
On What We Can Learn from Low-Resolution Data
The paper theoretically analyzes the informativeness of low-resolution data when models are evaluated on high-resolution inputs, using Kullback-Leibler divergence to characterize how datapoint influence varies with resolution. It derives bounds relating the contributions of high- and low-resolution observations to information loss under downsampling. Empirical validation with a vision transformer and convolutional neural network shows that incorporating low-resolution data improves performance when high-resolution samples are scarce.
low-resolution datakullback-leibler divergencevision transformerconvolutional neural networkdownsampling
Machine Learning for neutron source distributions
A novel machine learning approach for neutron source distribution estimation is proposed, leveraging probabilistic generative models trained on Monte Carlo particle lists. The method eliminates dependency on particle lists post-training, enabling efficient, rapid, and memory-efficient sampling. Four generative models—variational autoencoder, normalizing flow, generative adversarial network, and denoising diffusion model—are evaluated and compared against existing estimation techniques. Results demonstrate the feasibility of modeling neutron source distributions using probabilistic generative models, highlighting their potential for advancing this field.
probabilistic generative modelsmonte carloneutron source distributionvariational autoencoderdenoising diffusion model
Fused Gromov-Wasserstein Distance with Feature Selection
The paper introduces Fused Gromov-Wasserstein (FGW) distances with feature selection, enhancing interpretability and robustness in high-dimensional settings by adaptively suppressing irrelevant features. Two methods are proposed: (1) regularized FGW with Lasso/Ridge penalties and (2) simplex-constrained weights, including groupwise extensions. Theoretical analysis establishes bounds relative to classical FGW and Gromov-Wasserstein distances, alongside metric properties. An alternating minimization algorithm is developed. Experiments demonstrate improved interpretability and task-relevant structure identification, particularly in computational redistricting applications.
fused gromov-wassersteinfeature selectionalternating minimizationcomputational redistrictingmetric learning
PrivacySIM: Evaluating LLM Simulation of User Privacy Behavior
The paper introduces PrivacySIM, an evaluation suite for assessing large language models' (LLMs) ability to simulate individual privacy decisions. The method benchmarks nine frontier LLMs against ground-truth responses from 1,000 users across five privacy studies, conditioning models on three persona facets (demographics, previous experiences, stated privacy attitudes). Results show persona conditioning improves simulation accuracy (best model: 40.4%), but LLMs still fail to faithfully replicate individual decisions, particularly for users with high AI experience but low privacy concerns.
privacy simulationllm evaluationpersona conditioningbehavioral modelingdata-sharing scenarios
STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts
STRUM introduces an end-to-end audio-to-chart pipeline for generating playable rhythm-game charts (Clone Hero/YARG) across five instruments without oracle metadata. The hybrid system combines specialized modules: a two-stage CRNN onset detector and six-model ensemble for drums, neural onset detectors with monophonic pitch tracking for guitar/bass, word-aligned ASR for vocals, and spectral keyboard detection. Evaluation on a 30-song benchmark (selected via drum-stem RMS criteria) shows F1 scores of 0.838 (drums), 0.694 (bass), 0.651 (guitar), and 0.539 (vocals) at ±100ms tolerance. The work includes ablation studies, timing distribution analysis, and releases code/models/benchmark data.
audio-to-chartonset detectionmonophonic pitch trackingsource separationrhythm-game
MULTI: Disentangling Camera Lens, Sensor, View, and Domain for Novel Image Generation
The paper introduces MULTI, a method for disentangling imaging factors (camera lens, sensor, viewpoint, domain) in text-to-image generation to address limitations of current content-focused approaches. The two-stage approach first learns general factors via textual inversion, then extracts dataset-specific factors, enabling novel factor combinations and distribution gap reduction. Evaluation on the DF-RICO benchmark demonstrates MULTI's effectiveness, establishing factor disentanglement as a new research direction for precise image generation control.
factor disentanglementtextual inversionimage generationcontrolnetsdf-rico benchmark
Keeping Score: Efficiency Improvements in Neural Likelihood Surrogate Training via Score-Augmented Loss Functions
The authors propose a score-augmented loss function to improve the efficiency of neural likelihood surrogate training for structured stochastic process models. By augmenting binary cross-entropy loss with exact score information ∇θlog p(x∣θ) and adaptive weighting based on loss gradients, they bypass the black-box assumption of simulation-based inference. Evaluations on network dynamics and spatial processes demonstrate that the method achieves inference performance equivalent to a 10x increase in training data with less than a 1.1x increase in training time, drastically reducing computational costs.
simulation-based inferencelikelihood surrogatescore augmentationadaptive weightingstochastic process
Elicitation-Augmented Bayesian Optimization
The authors propose elicitation-augmented Bayesian optimization (BO), a method that integrates pairwise comparison queries from domain experts to improve sample efficiency in human-in-the-loop BO. Unlike prior approaches requiring explicit quantification of expert knowledge, this method interprets pairwise judgments as noisy evidence about the objective function, combining them with direct observations via a cost-aware value-of-information acquisition function. The method adapts to query cost and noise: it outperforms observation-only BO when queries are cheap and reverts to standard BO when queries are costly or noisy, achieving performance near the convex hull of individual information sources.
bayesian optimizationpairwise comparisonsvalue-of-informationsample efficiencyelicitation
Learning plug-in surrogate endpoints for randomized experiments
The paper introduces plug-in composite surrogates as functions of post-treatment variables that can substitute for primary outcomes in randomized experiments. Two methods are proposed for learning these surrogates by maximizing effect predictiveness, with theoretical analysis of unbiased effect estimation in representative scenarios. Empirical evaluation on synthetic and real-world experimental data demonstrates that the proposed method outperforms established approaches in predicting primary effects.
surrogate endpointsrandomized experimentseffect predictivenessplug-in estimatorscausal inference
Resilient Vision-Tabular Multimodal Learning under Modality Missingness
A multimodal transformer framework is proposed for joint vision-tabular learning under pervasive modality missingness, eliminating the need for imputation or heuristic model switching. The architecture integrates vision, tabular, and multimodal fusion encoders, utilizing learnable modality tokens and masked self-attention to exclude missing tokens and modalities during information aggregation and gradient propagation. A modality-dropout regularization strategy stochastically removes available modalities during training to enhance resilience. Evaluated on the MIMIC-CXR dataset paired with MIMIC-IV clinical data for multilabel classification of 14 diagnostic findings, the method consistently outperforms baselines across all missingness regimes, demonstrating smoother performance degradation and improved robustness. Attention-level masking and intermediate fusion with joint fine-tuning are identified as critical for resilient multimodal inference.
multimodal transformermodality missingnessmasked self-attentionmodality-dropoutintermediate fusion
Approximation Theory of Laplacian-Based Neural Operators for Reaction-Diffusion System
The paper establishes explicit approximation error bounds for Laplacian-based neural operators applied to the generalized Gierer-Meinhardt reaction-diffusion system, a nonlinear PDE model of pattern formation. By leveraging the Laplacian spectral representation of the Green's function, the authors derive bounds in terms of network depth, width, and spectral rank, demonstrating polynomial growth in parameter complexity relative to target accuracy. This alleviates the curse of parametric complexity in generic operator learning. Numerical experiments on the Gierer-Meinhardt system empirically validate the theoretical findings.
neural operatorsreaction-diffusion systemlaplacian spectral representationapproximation error boundsparametric complexity
Limits of Learning Linear Dynamics from Experiments
The work establishes fundamental limits on learning linear time-invariant (LTI) dynamics from experimental data, showing that identifiability depends on the experimental setup (initial state and control input) rather than just classical controllability conditions. Using geometric analysis, the authors derive a closed-form characterization of all systems consistent with observed trajectories and prove that dynamics remain uniquely identifiable on the reachable subspace, even when full system identification fails. This provides a theoretical framework for partial identifiability in data-driven system identification.
linear time-invariant systemssystem identificationidentifiabilityreachabilitygeometric control theory
Estimating Subgraph Importance with Structural Prior Domain Knowledge
The authors propose a subgraph importance estimation method for pretrained Graph Neural Networks (GNNs) on graph-level tasks, formulated as a linear Group Lasso regression problem in the embedding space. The method leverages prior domain knowledge of graph substructures while remaining architecture-agnostic regarding output layers and readout functions, and operates without ground-truth labels. Experiments on real-world graph datasets demonstrate consistent outperformance over existing baselines in subgraph importance estimation. The method is further extended to identify important nodes within graphs.
graph neural networkssubgraph importancegroup lassoembedding spacegraph-level tasks
Split the Differences, Pool the Rest: Provably Efficient Multi-Objective Imitation
We propose Multi-Output Augmented Behavioral Cloning (MA-BC), a provably efficient algorithm for multi-objective imitation learning in Multi-Objective Markov Decision Processes (MOMDPs). MA-BC systematically partitions divergent expert demonstrations while pooling non-conflicting state-action pairs, addressing limitations of standard imitation approaches that aggregate conflicting trajectories. Theoretical analysis shows MA-BC converges to Pareto-optimal policies faster than independent expert dataset learners and achieves minimax optimality. Empirical validation across discrete environments and a continuous Linear Quadratic Regulator task demonstrates MA-BC's effectiveness.
multi-objective imitation learningpareto-optimal policiesbehavioral cloningmulti-objective markov decision processminimax optimality
QDSB: Quantized Diffusion Schrödinger Bridges
Quantized Diffusion Schrödinger Bridges (QDSB) accelerate training of generative models between unpaired source and target distributions by avoiding costly global coupling computations. QDSB computes endpoint coupling on anchor-quantized distributions and lifts the plan back to original data points via cell-wise sampling, ensuring stability with quantization error controlled by anchor approximation quality. Experiments demonstrate that QDSB achieves comparable sample quality to existing baselines while significantly reducing training time. The method addresses the computational inefficiency and geometric distortion of iterative minibatch-based entropic optimal transport solutions.
schrödinger bridgesquantized diffusionoptimal transportgenerative modelsanchor quantization
Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning
The authors introduce reach-avoid probability certificates (RAPCs) to address stochastic minimum-cost reach-avoid reinforcement learning, enabling agents to satisfy probabilistic reach-avoid constraints while minimizing expected cumulative costs. They develop a contraction-based Bellman formulation that integrates reach-avoid considerations into reinforcement learning, ensuring cost optimization under probabilistic constraints. The proposed algorithms achieve almost sure convergence to locally optimal policies. Experimental results in the MuJoCo simulator demonstrate improved cost performance and higher reach-avoid satisfaction rates compared to existing methods.
reach-avoid probability certificatesbellman formulationstochastic environmentscost optimizationmujoco simulator
Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization
We introduce Dual Group Advantage Optimization (DGAO), a reinforcement learning method to mitigate order sensitivity in Large Language Models (LLMs) while improving accuracy. DGAO balances intra-group relative accuracy advantage and inter-group relative stability advantage, rewarding order-stable and correct outputs while penalizing order-sensitive or incorrect responses. We also propose Consistency Rate and Overconfidence Rate metrics to evaluate pseudo-stability. Experiments show DGAO enhances order fairness and performance on Retrieval-Augmented Generation (RAG), mathematical reasoning, and classification tasks.
order sensitivitydual group advantage optimizationlarge language modelsretrieval-augmented generationreinforcement learning
NOFE -- Neural Operator Function Embedding
Neural Operator Function Embedding (NOFE) introduces a domain-aware framework for continuous dimensionality reduction, addressing limitations of discrete point cloud methods. NOFE learns function-to-function mappings via a Graph Kernel Operator, enabling mesh-free evaluation and generalizing Sheaf Neural Networks to continuous domains. Evaluated against PCA, t-SNE, and UMAP, NOFE significantly outperforms baselines in local structure preservation (local Stress: 0.111 vs. 0.398, 0.773, 0.791) and sampling independence (Patch Stitching Error reduced by 20.0× relative to UMAP). While maintaining competitive global structure preservation (Stress-1: 0.379 vs. PCA's 0.268), NOFE resolves fine-grained structures and ensures consistency across varying sample densities.
dimensionality reductiongraph kernel operatorsheaf neural networksmesh-free evaluationlocal structure preservation
Assessment of cloud and associated radiation fields from a GAN stochastic cloud subcolumn generator
A novel two-stage machine learning subcolumn generator for the GEOS atmospheric model is introduced, combining a Conditional Variational Autoencoder with a Generative Adversarial Network (CVAE-GAN) and a U-Net architecture. Trained on CloudSat-CALIPSO height-resolved cloud optical depth data, the generator produces 56 stochastic subcolumns representing cloud occurrence and optical depth profiles. Compared to the Räisänen method, it accurately reproduces bimodal cloud overlap distributions, reduces biases in grid-mean statistics, and halves the root-mean-square error in ISCCP-style cloud-top pressure and optical thickness joint histograms. The approach improves offline radiative transfer calculations, reducing the global-mean shortwave top-of-atmosphere cloud radiative effect bias by a factor of three.
generative adversarial networkcloud optical depthradiative transferearth system modelsvariational autoencoder
STAGE: Tackling Semantic Drift in Multimodal Federated Graph Learning
STAGE introduces a protocol-first framework for multimodal federated graph learning (MM-FGL) to address semantic drift across heterogeneous client modalities. The method constructs a shared semantic space by translating multimodal features into comparable representations before graph propagation, mitigating false agreement and inconsistency amplification. Evaluations on 8 multimodal-attributed graphs demonstrate state-of-the-art performance across 5 tasks while reducing communication overhead.
federated graph learningmultimodal learningsemantic driftrepresentation translationgraph propagation
Understanding Sample Efficiency in Predictive Coding
This work provides a mechanistic understanding of the higher sample efficiency in Predictive Coding (PC) compared to Backpropagation (BP) through the introduction of 'target alignment', a metric quantifying the alignment between network output changes and prediction error. The authors derive and empirically validate analytical expressions for target alignment in Deep Linear Networks, demonstrating that PC outperforms BP in efficiency, particularly in deep, narrow, and pre-trained networks. They establish exact conditions for optimal target alignment in PC and validate findings through experiments on linear and non-linear models, showing PC's benefits persist even when theoretical assumptions are violated.
predictive codingbackpropagationtarget alignmentdeep linear networkssample efficiency
Delightful Gradients Accelerate Corner Escape
The paper introduces Delightful Policy Gradient (DG), a modified policy gradient method that accelerates escape from sub-optimal simplex corners in reinforcement learning. DG gates each gradient term by the product of advantage and action surprisal, eliminating self-trapping behavior near corners. Theoretical analysis for $K$-armed bandits shows DG achieves logarithmic escape time and maintains $O(1/t)$ global convergence in both bandits and tabular MDPs. Experiments on MNIST contextual bandits demonstrate faster recovery from bad initializations compared to standard policy gradient, though a counterexample reveals limitations under shared function approximation.
policy gradientsimplex cornersself-trappingadvantage gatingtabular mdps
Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models
This work analyzes procedural-skill supervised fine-tuning (SFT) contributions across three Qwen3.5 model scales (0.8B, 2B, 4B) using a 200-task/40-skill holdout set, with Claude Haiku 4.5 as a frontier reference. The study employs a corpus of 353 demonstration rows and identifies a W-shaped pre-SFT trajectory, where SFT-attributable procedural-skill improvements are roughly uniform across model sizes (+0.070, +0.040, +0.075). Results reveal a regime-asymmetric pattern, with SFT providing the most significant absolute gains where the base model struggles with procedures. Cross-family validation via GPT-5.4 confirms findings with Cohen's κ ≥ 0.754 and agreement ≥ 93.25%. Earlier framings of format-only and shrinking SFT are identified as path-mismatch artifacts.
supervised fine-tuningprocedural-skillw-shaped trajectoryregime-asymmetriccross-family validation
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
Qwen-Scope introduces an open-source suite of sparse autoencoders (SAEs) built on the Qwen model family, comprising 14 SAE groups across 7 variants from Qwen3 and Qwen3.5 series, including dense and mixture-of-expert architectures. The SAEs serve as practical interfaces for model development, enabling inference-time steering, evaluation analysis, data-centric workflows, and post-training optimization. Results demonstrate SAEs' utility in controlling language and concepts, analyzing benchmark redundancy, supporting multilingual toxicity classification, and mitigating undesirable behaviors like code-switching and repetition. Qwen-Scope aims to advance mechanistic interpretability research and connect model internals to downstream behavior.
sparse autoencodersmechanistic interpretabilitymixture-of-expertinference-time steeringpost-training optimization
Sobolev Regularized MMD Gradient Flow
We introduce Sobolev-regularized Maximum Mean Discrepancy (SrMMD) gradient flow, a novel regularization of MMD gradient flow that imposes a gradient penalty on the witness function. This regularization addresses the non-convexity of the MMD objective, enabling provable global convergence guarantees in both continuous and discrete time without requiring isoperimetric assumptions on the target distribution. The method leverages regularity conditions on kernel mean embeddings and is applicable to both sampling from unnormalized distributions (using Stein kernels) and generative modeling, unlike prior gradient flows limited to one setting. Empirical validation demonstrates its effectiveness across diverse generative modeling and sampling tasks.
sobolev regularizationmaximum mean discrepancygradient flowstein kernelskernel mean embeddings
Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning
The paper introduces Adaptive TD($λ$) (ATD($λ$)) for Multi-agent Reinforcement Learning (MARL), addressing the challenge of policy distribution estimation in large joint action spaces. The method employs a parametric likelihood-free density ratio estimator with two replay buffers to approximate policy distributions without statistical calculation. ATD($λ$) dynamically assigns values to state-action pairs based on their likelihood under the current policy's stationary distribution. Evaluated on QMIX and MAPPO baselines across SMAC benchmarks and Gfootball academy scenarios, ATD($λ$) consistently outperforms or matches static $λ$ approaches.
adaptive td($λ$)multi-agent reinforcement learningdensity ratio estimatorreplay bufferspolicy distribution
LOFT: Low-Rank Orthogonal Fine-Tuning via Task-Aware Support Selection
LOFT introduces a low-rank orthogonal fine-tuning framework that decouples subspace adaptation from transformation, unifying various orthogonal PEFT methods. By framing adaptation as multiplicative subspace rotation, LOFT emphasizes support selection as a key design axis, informed by task-specific signals. Experiments across language understanding, visual transfer, and multilingual adaptation demonstrate that LOFT recovers principal-subspace orthogonal adaptation while gradient-informed supports enhance efficiency-performance trade-offs under constrained budgets.
orthogonal fine-tuninglow-rank adaptationsubspace rotationtask-aware supportparameter-efficient
Information theoretic underpinning of self-supervised learning by clustering
This paper contributes to the theoretical foundation of self-supervised learning (SSL) by formulating SSL as Kullback-Leibler (K-L) divergence optimization, specifically focusing on deep clustering approaches. The authors prevent mode collapse by imposing optimization constraints on the teacher distribution, leading to normalization using inverse cluster priors. Through the application of Jensen's inequality, this normalization simplifies to the batch centering procedure, a common heuristic in SSL. The theoretical model not only validates existing SSL methods but also provides a framework for future research directions.
self-supervised learningk-l divergencedeep clusteringbatch centeringmode collapse
FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
FIS-DiT introduces a training-free framework for accelerating Video Diffusion Transformers (DiTs) by exploiting frame interleaved sparsity in latent frames, overcoming limitations of step-wise optimization in few-step regimes. The method strategically processes frame subsets across model layers while maintaining structural consistency, enabling reduced computation without full block evaluations. Evaluations on Wan 2.2 and HunyuanVideo 1.5 show 2.11--2.41× speedup with minimal quality degradation on VBench-Q and CLIP metrics, advancing real-time HD video generation.
video diffusion transformersframe interleaved sparsityfew-step inferencelatent frame dualitytraining-free acceleration
Variance-aware Reward Modeling with Anchor Guidance
The paper introduces Anchor-guided Variance-aware Reward Modeling (AVRM) to address limitations in standard Bradley-Terry reward models and Gaussian reward models for handling pluralistic human preferences. AVRM resolves non-identifiability in Gaussian models by augmenting pairwise preference data with two coarse response-level anchor labels, proving two anchors suffice for identification. The method includes a joint training objective and establishes non-asymptotic convergence rates for reward mean and variance estimation. Empirical results across simulations and four real-world datasets demonstrate consistent improvements in reward modeling and downstream RLHF tasks, including PPO training and best-of-N selection.
bradley-terrygaussian reward modelsnon-identifiabilityanchor guidanceppo training
Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs
The paper proposes semantic consensus as an alternative to parameter aggregation for federated fine-tuning of LLMs, addressing scalability and heterogeneity challenges. Clients locally fine-tune models on private data and exchange generated outputs on public prompts; the server maps these to a semantic space, forms consensus pseudo-labels, and returns them for further local tuning. This approach reduces communication by orders of magnitude (e.g., 1006× for Llama3.1-405B), supports heterogeneous architectures, and matches federated fine-tuning performance while lowering runtime and energy costs.
federated learningsemantic consensuslarge language modelsparameter-efficient fine-tuningheterogeneous architectures
Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives
The authors propose proximal preconditioned gradient methods, extending Muon and Scion optimizers to handle convex and nonconvex constraints. They introduce stochastic algorithms with convergence guarantees under heavy-tailed noise, supported by a novel geometric analysis, and a variance-reduced variant for faster convergence under standard noise. The work demonstrates that polynomial iterations in Muon are better modeled by nonlinear preconditioners than ideal matrix signs, yielding more accurate convergence analysis for practical implementations.
proximal methodsspectral gradientnonconvex optimizationvariance reductionpreconditioning
A Fast and Energy-Efficient Latch-Based Memristive Analog Content-Addressable Memory
The authors propose a strong-arm latched memristor (SALM) analog content-addressable memory (aCAM) cell that addresses limitations of conventional 6T2M designs, including static search power, limited voltage gain, and match-line crosstalk. SALM replaces static voltage division with a dynamic current-race comparator, enabling high regenerative gain, intrinsic result latching, and near-zero static search power. Compared to 6T2M, SALM reduces read energy by 33% at identical latency and eliminates scalability constraints. A dataset-aware optimization framework achieves up to 50% energy reduction at 3x latency across workloads. Integrated into the X-TIME decision-tree compiler, SALM maintains near-software accuracy for high-dimensional datasets, outperforming baseline designs.
memristorcontent-addressable memorymatch-line crosstalkcurrent-race comparatordecision-tree inference
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
(No summary returned.)
More Edits, More Stable: Understanding the Lifelong Normalization in Sequential Model Editing
This work provides the first theoretical analysis of Lifelong Normalization (LN), a core strategy enabling stable lifelong model editing in Large Language Models. LN normalizes value gradients using running statistics, creating a self-reinforcing stability loop that yields asymptotically orthogonal parameter updates with bounded norms when combined with ridge-regularized regression. The authors propose StableEdit, which enhances LN via explicit warm-up and full whitening, improving long-horizon stability with minimal overhead. Experiments validate the theoretical insights, demonstrating competitive performance in mitigating catastrophic forgetting and model collapse during sequential editing.
lifelong normalizationsequential model editingridge-regularized regressionasymptotic orthogonalityvalue gradients
Bin Latent Transformer (BiLT): A shift-invariant autoencoder for calibration-free spectral unmixing of turbid media
(No summary returned.)
Fed-BAC: Federated Bandit-Guided Additive Clustering in Hierarchical Federated Learning
Fed-BAC introduces a federated bandit-guided additive clustering framework for hierarchical federated learning, addressing joint optimization of cluster assignment and client selection under data heterogeneity. The method employs a two-level bandit mechanism: contextual bandits at the cloud layer for server-to-cluster assignments and Thompson Sampling at edge servers for client selection. Additive decomposition enables knowledge sharing via a global network while capturing distribution variations through cluster-specific networks. Evaluated on CIFAR-10, SVHN, and Fashion-MNIST under non-IID settings, Fed-BAC achieves accuracy gains of up to +35.5pp over HierFAVG and +8.4pp over IFCA, converges 1.5 to 4.8× faster, and improves cross-server fairness, with scalability validated at 5× deployment scale.
hierarchical federated learningadditive clusteringcontextual banditsthompson samplingnon-iid
Stop Marginalizing My Dreams: Model Inversion via Laplace Kernel for Continual Learning
REMIX introduces a structured covariance modeling framework for data-free continual learning (DFCIL), addressing limitations of diagonal covariance assumptions in model inversion. By leveraging a Laplace kernel parameterization, REMIX enables scalable full-covariance modeling without dense matrix inversion or log-determinant computation, capturing feature dependencies with linear memory scaling and logarithmic computational overhead. This approach produces more coherent synthetic samples, improving performance on standard DFCIL benchmarks. Results demonstrate the necessity of modeling feature correlations for effective and scalable DFCIL. Code is available at https://github.com/pkrukowski1/REMIX-Model-Inversion-via-Laplace-Kernel.
data-free continual learningmodel inversionlaplace kernelstructured covariancefeature dependencies
ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems
ROMER introduces a post-training calibration framework for robust MoE-based LLMs on analog compute-in-memory (CIM) systems, addressing hardware noise-induced expert load imbalance and suboptimal routing. The method combines expert replacement (swapping underactivated experts with high-frequency ones) and router logit recalibration via percentile-based normalization. Evaluations on DeepSeek-MoE, Qwen-MoE, and OLMoE show perplexity reductions of 58.6%, 58.8%, and 59.8% respectively under real-chip noise conditions, demonstrating cross-architecture generalizability.
mixture-of-expertscompute-in-memoryrouter calibrationload balancehardware noise
Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
The paper introduces entropy polarity, a token-level quantity predicting how reinforcement learning updates affect policy entropy in LLMs. Through theoretical analysis, the authors identify structural asymmetry: high-probability tokens induce entropy contraction while low-probability samples promote expansion. They propose Polarity-Aware Policy Optimization (PAPO), which dynamically balances entropy-expanding and contracting updates via advantage reweighting. Experiments on mathematical reasoning and agentic tasks demonstrate PAPO's superior performance over baselines, with improved training efficiency and reward gains.
entropy polaritypolicy optimizationreinforcement learningtoken-level mechanismadvantage reweighting
From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction
The paper introduces Medical Token-Pair Encoding (MedTPE), a lossless prompt compression method for LLMs processing electronic health records (EHRs). MedTPE merges frequently co-occurring medical token pairs into composite tokens via dependency-aware replacement, requiring fine-tuning only 0.5-1.0% of the LLM's parameters. Experiments show MedTPE reduces token length by 31% and inference latency by 34-63% while maintaining or improving predictive performance across four clinical tasks, with demonstrated generalizability to other domains and languages.
prompt compressionelectronic health recordstoken-pair encodingdependency-aware replacementself-supervised learning
Decomposing the Generalization Gap in PROTAC Activity Prediction: Variance Attribution and the Inter-Laboratory Ceiling
The study decomposes the generalization gap in PROTAC activity prediction, identifying inter-laboratory measurement variance as the dominant factor (0.124 AUROC contribution) over binarization-threshold choice (0.05). Using PROTAC-Bench (10,748 measurements, 173 targets), the authors evaluate eight architectures and ESM-2 models up to 3B parameters, finding a LOTO AUROC plateau near 0.67. Few-shot k=5 stratified retraining with ADMET features improves LOTO AUROC from 0.668 to 0.705, while Platt scaling maintains calibration. The work releases a variance-decomposition framework, per-target calibration protocol, and evaluation code.
protacgeneralization gapaurocesm-2platt scaling
A nonlinear extension of parametric model embedding for dimensionality reduction in parametric shape design
The paper introduces NLPME, a nonlinear extension of Parametric Model Embedding (PME) for dimensionality reduction in parametric shape design. NLPME replaces PME's linear subspace with a nonlinear latent representation while maintaining geometry-driven latent variables and parameter-mediated reconstruction. Evaluated on a 32D bio-inspired underwater glider design, NLPME achieves 5% reconstruction error with 5 latent variables (vs PME's 8) and 1% error with 9 (vs PME's 15). The method retains most nonlinear compression benefits of deep autoencoders while preserving explicit backmapping to original design parameters.
dimensionality reductionparametric shape designnonlinear embeddinglatent representationgeometry reconstruction
One-Step Generative Modeling via Wasserstein Gradient Flows
We introduce W-Flow, a framework for one-step generative modeling via Wasserstein gradient flows, addressing the computational inefficiency of iterative sampling in diffusion and flow-based models. The method defines an evolution from a reference to a target distribution by minimizing an energy functional instantiated with Sinkhorn divergence, then trains a static neural generator to compress this evolution into a single step. Theoretical analysis shows convergence of finite-sample training dynamics to continuous-time distributional dynamics. W-Flow achieves state-of-the-art performance on ImageNet 256×256 generation with 1.29 FID, 100× faster sampling than comparable diffusion models, and improved mode coverage and domain transfer.
wasserstein gradient flowssinkhorn divergenceone-step generationenergy functionalfinite-sample dynamics
Federated Client Selection under Partial Visibility: A POMDP Approach with Spatio-Temporal Attention
We propose a spatial-temporal attention-based reinforcement learning framework for federated client selection under partial visibility, formulated as a Partially Observable Markov Decision Process (POMDP). The method integrates historical global models and client identity embeddings to capture both temporal training contexts and persistent client characteristics. Experiments across multiple datasets demonstrate superior performance compared to existing baselines in heterogeneous and partially visible settings, effectively addressing incomplete observations in practical federated learning systems.
federated learningclient selectionpomdpspatial-temporal attentionreinforcement learning
Learning Feature Encoder with Synthetic Anomalies for Weakly Supervised Graph Anomaly Detection
The authors propose a weakly supervised graph anomaly detection method that learns domain-specific feature representations through synthetic anomalies. The approach employs a multi-task learning scheme where synthetic anomalies are generated by perturbing normal graphs, with each anomaly type assigned a dedicated detection head to ensure sensitivity to deviations. A two-phase training strategy is used: initial warm-up with synthetic samples only, followed by full training integrating both synthetic and real data. Experiments on public datasets demonstrate superior performance over existing methods. Code is available on GitHub.
graph anomaly detectionweakly supervised learningsynthetic anomaliesmulti-task learningfeature representation
Training-Inference Consistent Segmented Execution for Long-Context LLMs
We introduce a training-inference consistent segment-level generation framework for Transformer-based large language models, addressing the computational and memory challenges of long-context generation. The method enforces consistency by restricting gradient propagation to KV states from the immediately preceding segment during training, while allowing head-specific access to past KV states in the forward pass. Evaluated on long-context benchmarks, the approach achieves performance comparable to full-context attention, with competitive latency-memory trade-offs and significantly improved scalability, reducing peak prefill memory by approximately 6x at 128K context length compared to full-context attention with FlashAttention.
transformerkv statesgradient propagationlong-context generationscalability
WorldComp2D: Spatio-semantic Representations of Object Identity and Location from Local Views
WorldComp2D introduces a lightweight representation learning framework for spatio-semantic reasoning by explicitly structuring latent space geometry based on object identity and spatial proximity. The framework comprises a proximity-dependent encoder mapping observations into spatio-semantic latent space and a localizer inferring object coordinates from this representation. Evaluated on facial landmark localization, WorldComp2D reduces parameters and FLOPs by up to 4.0X and 2.2X, respectively, compared to state-of-the-art lightweight models, while maintaining real-time CPU performance. This demonstrates the efficiency and generality of explicitly structured latent spaces for spatio-semantic reasoning.
spatio-semantic reasoninglatent space geometryproximity-dependent encoderlocalizerfacial landmark localization
Online Continual Learning with Dynamic Label Hierarchies
The paper introduces DHOCL (Online Continual Learning from Dynamic Hierarchies), a novel problem setting addressing evolving hierarchical label structures in online continual learning. To tackle partial supervision and granularity-dependent interference, the authors propose HALO (Hierarchical Adaptive Learning with Organized Prototypes), which combines adaptive classification heads with regularized hierarchical prototypes for rapid adaptation and semantic consistency. HALO outperforms existing methods on multiple benchmarks, achieving improvements in hierarchical accuracy, mistake severity, and continual performance metrics.
online continual learningdynamic hierarchiespartial supervisiongranularity-dependent interferencehierarchical prototypes
U-STS-LLM A Unified Spatio-Temporal Steered Large Language Model for Traffic Prediction and Imputation
U-STS-LLM introduces a unified spatio-temporal steered LLM framework for traffic prediction and imputation, addressing limitations of specialized STGNNs and weakly guided LLM adaptations. The method features a Dynamic Spatio-Temporal Attention Bias Generator for explicit structural guidance, LoRA-based parameter-efficient tuning, and Gated Adaptive Fusion for multi-task learning. Evaluations on cellular datasets show state-of-the-art performance in long-horizon forecasting (e.g., 12-step) and high-missing-rate imputation (e.g., 50% missing), with improved training stability and efficiency over baselines.
spatio-temporal attentionlow-rank adaptationtraffic imputationdynamic graphmulti-task learning
Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation
The paper introduces Persona-Conditioned Adversarial Prompting (PCAP), a method for multi-identity red-teaming that conditions adversarial search on diverse attacker personas (e.g., doctors, students) to discover transferable jailbreaks. PCAP generates rich defense datasets with automatic metadata tracking, increasing attack success from 57% to 97% on GPT-OSS 120B while producing 2-6× more diverse prompts. Fine-tuning lightweight adapters on PCAP-generated data improves model robustness (recall: 0.36→0.99, F1: 0.53→0.96) with minimal false positives, demonstrating a closed-loop approach from vulnerability discovery to alignment.
adversarial promptingred-teamingjailbreakspersona-conditionedrobustness
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
The paper introduces Block-R1, a framework addressing domain block size conflicts in multi-domain reinforcement learning (RL) for diffusion large language models (dLLMs). It formulates domain block size conflict, proposes a novel dataset (Block-R1-41K) with sample-level optimal block sizes, and establishes a benchmark for flexible RL post-training. The method includes a cross-domain post-training approach using sample-specific block sizes. Evaluations span 13 datasets, 7 RL algorithms, and multiple dLLM backbones, with resources open-sourced.
reinforcement learningdiffusion modelsblock size conflictmulti-domain learningpost-training
EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation
EPIC introduces a training-free inference-time refinement framework for compositional text-to-image generation, addressing challenges with multi-object, count, attribute, and relation prompts. The method parses prompts into visual programs of object variables and typed predicates, verifying generated images against these programs to guide targeted editing or resampling. EPIC improves prompt-level accuracy on GenEval2 from 34.16% to 71.46%, outperforming prior baselines by 19.23 points while reducing image-model executions by 31%, MLLM calls by 72%, and MLLM tokens by 81%.
text-to-imageinference-time controlvisual programpredicate-guided searchcompositional generation
Unlocking Compositional Generalization in Continual Few-Shot Learning
The paper introduces a novel paradigm for compositional generalization in continual few-shot learning by decoupling representation learning from compositional inference. The method leverages self-supervised Vision Transformers (ViTs) to preserve object-level geometries during training and dynamically composes slot representations at inference. This dual-phase strategy prevents representation drift and enables novel-concept transfer. Experiments show state-of-the-art performance in unseen-concept generalization and minimal forgetting across continual learning benchmarks.
compositional generalizationcontinual few-shot learningvision transformersrepresentation learningobject-centric representations
GRAFT: Graph-Tokenized LLMs for Tool Planning
The paper introduces GRAFT, a graph-tokenized LLM framework for tool planning that internalizes tool graphs by mapping nodes to special tokens and learning dependencies in representation space. It employs on-policy tool context distillation to train on sampled trajectories while distilling stepwise planning signals. Experiments demonstrate GRAFT's state-of-the-art performance in exact sequence matching and dependency legality, enhancing reliability in complex workflow planning.
graph-tokenizedtool planningdependency-awareon-policy distillationworkflow reliability
Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs
The paper proposes an augmented Lagrangian (AL) method for last-iterate convergence in constrained Markov decision processes (CMDPs), addressing the impracticality of mixture policies. It introduces projected Q-ascent (PQA) to solve AL sub-problems, proving global last-iterate convergence in tabular settings. The framework extends to log-linear and non-linear policies, validated on continuous control tasks. Theoretical guarantees match prior work while enabling practical deployment of single policies.
augmented lagrangianconstrained mdpslast-iterate convergenceprojected q-ascentlog-linear policies
Compositional Neural Operators for Multi-Dimensional Fluid Dynamics
The paper introduces Compositional Neural Operators (CompNO), a framework for solving 2D PDEs by decomposing complex systems into modular Foundation Blocks of specialized Neural Operators. Each block (convection, diffusion, nonlinear convection, Poisson Solver) is pretrained on elementary physics and assembled via an Adaptation Block with an Aggregator that learns nonlinear interactions through physics-informed loss minimization. Evaluated on Convection-Diffusion, Burgers', and Incompressible Navier-Stokes equations, CompNO demonstrates improved adaptability, interpretability, and pretrained block reuse compared to traditional encoding-decoding approaches.
compositional neural operatorsfoundation blocksphysics-informed learningneural operatorspde surrogates
Slicing and Dicing: Configuring Optimal Mixtures of Experts
This work presents the first systematic study of Mixture-of-Experts (MoE) architecture design choices, analyzing over 2,000 pretraining runs across models up to 6.6B parameters. The authors exhaustively vary expert count, dimension, heterogeneous sizing, shared experts, and load-balancing mechanisms. Results show that performance consistently improves with total MoE parameters, even at extreme active-to-total parameter ratios (e.g., 128:1). Optimal expert size depends primarily on active parameter count rather than total parameters, while other design choices have minimal impact relative to expert count and granularity. Dropless routing emerges as the only secondary factor with consistent performance gains.
mixture-of-expertspretrainingload-balancingheterogeneous expertsdropless routing
Partial Model Sharing Improves Byzantine Resilience in Federated Conformal Prediction
The paper introduces a Byzantine-resilient federated conformal prediction (FCP) method using partial model sharing, where only subsets of parameters are exchanged per round. This approach safeguards both training and calibration phases by limiting attack surfaces and compressing non-conformity scores into histogram vectors for Byzantine detection. Experiments demonstrate improved coverage and tighter prediction intervals under diverse Byzantine attacks compared to standard FCP, offering robust uncertainty quantification with reduced communication overhead.
federated conformal predictionbyzantine resiliencepartial model sharingnon-conformity scoreshistogram-based characterization
Posterior Contraction Rates for Sparse Kolmogorov-Arnold Networks in Anisotropic Besov Spaces
The paper establishes posterior contraction rates for sparse Bayesian Kolmogorov-Arnold networks (KANs) in anisotropic Besov spaces, providing a statistical foundation for KANs. Using spike-and-slab priors and a hyperprior on model size, the method achieves near-minimax contraction rates that adapt to unknown anisotropic smoothness. Key results show fixed-depth KANs control approximation complexity through width, spline-grid parameters, and sparsity, with rates depending on layerwise smoothness in compositional settings. Theoretical tools for spline-edge architectures are developed, avoiding the curse of dimensionality.
kolmogorov-arnold networksposterior contractionanisotropic besov spacesspike-and-slab priorscompositional smoothness
GeomHerd: A Forward-looking Herding Quantification via Ricci Flow Geometry on Agent Interactive Simulations
GeomHerd introduces a forward-looking geometric framework for quantifying herding behavior in financial markets by analyzing agent-interaction graphs rather than lagging price correlations. The method employs discrete Ollivier--Ricci curvature on graphs generated from LLM-driven multi-agent simulations, linking graph topology to macroscopic herding statistics (CSAD). Results show early detection: 272-step median lead time before order-parameter onset, 65% recall of critical trajectories 318 steps early, and 40-step precedence over price-correlation baselines. The approach generalizes to the Vicsek model and improves cascade-window forecasting (reduced MAE).
ollivier-ricci curvaturemulti-agent simulationherding quantificationprice-correlation laggeomherd
Finite Sentence-Interface Control for Learning Bounded-Fan-Out Linear MCFGs under Fixed Monoid Typing
(No summary returned.)
Learning U-Statistics with Active Inference
The paper proposes an active inference framework for efficient estimation of U-statistics under label acquisition constraints. The method employs augmented inverse probability weighting to incorporate sampling rules and machine learning predictions, characterizing the optimal sampling rule for variance minimization. Experimental results on real datasets show significant improvements in estimation efficiency over baselines while maintaining target coverage.
u-statisticsactive inferenceaugmented inverse probability weightingsampling ruleestimation efficiency
MIST: Reliable Streaming Decision Trees for Online Class-Incremental Learning via McDiarmid Bound
MIST introduces a reliable streaming decision tree for online class-incremental learning, addressing two miscalibrations in traditional approaches: unreliable split criteria and lack of knowledge transfer. The method combines (i) a K-independent McDiarmid confidence radius for Gini splitting, (ii) a Bayesian inheritance protocol for variance reduction, and (iii) per-leaf KLL quantile sketches for adaptive prediction. Evaluated on tabular streams, MIST matches parametric methods on near-Gaussian data and outperforms state-of-the-art benchmarks on non-Gaussian geometry.
streaming decision treesonline class-incremental learningmcdiarmid boundgini splittingquantile sketches
Targeted Tests for LLM Reasoning: An Audit-Constrained Protocol
The paper introduces an audit-constrained protocol for targeted evaluation of LLM reasoning, addressing limitations of fixed benchmarks by systematically testing prompt variations. Methodologically, it employs Component-Adaptive Prompt Sampling (CAPS) within a deterministic grammar-based framework, with strict semantic and extraction audits to distinguish genuine model errors from artifacts. Results show the protocol effectively identifies confirmed errors while filtering invalid cases, but CAPS does not outperform uniform sampling in audited yield or unique prompt discovery, emphasizing the need for audited metrics over proxy-guided policies.
llm reasoningprompt variationaudit protocolcomponent-adaptive samplingsemantic validation
Learning Weakly Communicating Average-Reward CMDPs: Strong Duality and Improved Regret
(No summary returned.)
A Mixture Autoregressive Image Generative Model on Quadtree Regions for Gaussian Noise Removal via Variational Bayes and Gradient Methods
The paper introduces a probabilistic generative model for grayscale image denoising, combining quadtree region-partitioning with a mixture autoregressive model. The framework reformulates MAP-estimation-based denoising as variational lower bound maximization, solved via alternating variational Bayes and gradient methods. Analytical computation of gradient updates eliminates numerical approximation. Experimental validation confirms noise reduction efficacy, with identified improvement pathways.
quadtreeautoregressivevariational bayesmap-estimationgradient methods
FedOUI: OUI-Guided Client Weighting for Federated Aggregation
FedOUI proposes a novel federated aggregation method using the Overfitting-Underfitting Indicator (OUI), an activation-based metric that captures input-space organization without requiring labels. Clients transmit local updates with OUI values computed on a fixed probe batch, enabling the server to reweight atypical clients via smooth distribution-based weighting. Evaluations on non-IID CIFAR-10 show FedOUI outperforms FedAvg, FedProx, and gradient-alignment baselines under strong heterogeneity, demonstrating activation structure's utility beyond traditional size/gradient criteria.
federated learningclient weightingactivation metricnon-iid dataaggregation rule
OUI as a Structural Observable: Towards an Activation-Centric View of Neural Network Training
The paper introduces the Overfitting--Underfitting Indicator (OUI) as a structural observable for analyzing neural network training dynamics from an activation-centric perspective. OUI serves as an early, label-free signal derived from activation patterns, enabling the identification of poor or promising training regimes prior to convergence. Empirical results demonstrate its utility across domains: in supervised learning, it anticipates weight decay regimes; in reinforcement learning, it discriminates learning-rate regimes in PPO actor--critic; and in online control, it facilitates layer-wise weight decay adaptation. These findings, combined with evidence of early activation pattern stabilization, suggest OUI as a foundational tool for developing an activation-centric theory of training dynamics.
overfitting-underfitting indicatoractivation patternstraining dynamicsppo actor-criticweight decay
A Composite Activation Function for Learning Stable Binary Representations
We propose Heavy Tailed Activation Function (HTAF), a smooth composite sigmoid-tanh approximation to the Heaviside function that enables stable gradient-based optimization for networks with binary activations. HTAF maintains large gradient mass near zero inputs while exhibiting slower gradient decay in tail regions, theoretically supporting stable training. Experiments demonstrate that HTAF enables stable training of Spiking Neural Networks, Binary Neural Networks, and Deep Heaviside Networks. Additionally, we introduce Implicit Concept Bottleneck Models (ICBMs), leveraging HTAF for discrete feature representations in image models, achieving comparable or superior prediction performance to standard models across architectures and datasets.
heavy tailed activation functionheaviside functiongradient-based optimizationimplicit concept bottleneck modelsbinary neural networks
A Controlled Counterexample to Strong Proxy-Based Explanations of OOD Performance: in a Fixed Pretraining-and-Probing Setup
The paper demonstrates a controlled counterexample where task-agnostic structure proxies fail to align with out-of-distribution (OOD) probe accuracy rankings, challenging strong proxy-based explanations of OOD performance. Using a fixed pretraining-and-probing setup motivated by computationally bounded notions like epiplexity, the authors construct a scenario where a formal structure quantity, its operational proxy, and task-relevant structure separate. In a synthetic sequence-model experiment, OOD accuracy rankings reversed proxy rankings in two of three seeds, supported by auxiliary diagnostics and ablations. This identifies a boundary on proxy-based explanations, showing proxies for total learned structure can fail to track task-relevant structure driving OOD performance.
out-of-distributionpretrainingprobe accuracyepiplexitytask-agnostic
VNDUQE: Information-Theoretic Novelty Detection using Deep Variational Information Bottleneck
The paper proposes VNDUQE, a novelty detection method using the Deep Variational Information Bottleneck (VIB) to constrain information flow in learned representations. The approach evaluates out-of-distribution (OOD) detection via KL divergence and prediction entropy, showing complementary strengths: KL divergence achieves 100% AUROC on far-OOD samples (e.g., noise), while prediction entropy attains 94.7% AUROC on near-OOD cases (novel digit classes). Combined, they yield 95.3% average AUROC, a 32 percentage point improvement over maximum softmax probability. VIB compression (β=10⁻³) reduces Expected Calibration Error by 38%, demonstrating improved uncertainty calibration for active learning applications.
novelty detectionvariational information bottleneckout-of-distributionkl divergenceuncertainty quantification
Fast MoE Inference via Predictive Prefetching and Expert Replication
A dynamic expert replication strategy is proposed to accelerate Mixture of Experts (MoE) inference by predicting overloaded experts and replicating them for concurrent batch processing. This approach addresses GPU underutilization, load imbalance, and latency issues caused by sparse expert activation in large-scale MoE models. The method enables near-complete GPU utilization (~100%) and achieves up to 3x inference speed improvement while maintaining 90-95% of baseline performance, as demonstrated on Switch-base-128 and Switch-base-256 architectures.
mixture of expertsgpu utilizationexpert replicationinference accelerationload balancing
Primal-Dual Policy Optimization for Linear CMDPs with Adversarial Losses
(No summary returned.)
Generative climate downscaling enables high-resolution compound risk assessment by preserving multivariate dependencies
A diffusion-based multivariate generative framework with bias correction significantly improves high-resolution climate downscaling by preserving inter-variable dependencies critical for compound risk assessment. The method addresses resolution gaps up to 50× while maintaining correlations among five meteorological variables, reducing correlation errors by over 4× compared to existing baselines. Applied to Japan, it enhances both univariate and spatial accuracy, enabling more reliable detection of severe drought and other compound hazards.
generative downscalingdiffusion modelscompound hazardsbias correctionmultivariate dependencies
Multi-Narrow Transformation as a Single-Model Ensemble: Boundary Conditions, Mechanisms, and Failure Modes
The paper investigates the trade-offs between single-wide and multi-narrow (MN) architectures in single-model ensembles (SMEs) under matched parameter budgets. Through systematic experiments with CNNs across varied data regimes, architectures, and datasets, it demonstrates that MN transformation excels in low-data settings by learning diverse, non-redundant path-wise features, while single-wide configurations dominate in data-rich scenarios due to imbalanced training. The findings provide empirical guidelines for capacity allocation between width and member multiplicity in resource-constrained settings.
single-model ensemblesmulti-narrow transformationparameter budgetfeature diversitylow-data regimes
FERMI: Exploiting Relations for Membership Inference Against Tabular Diffusion Models
FERMI introduces a novel membership inference attack tailored for tabular diffusion models in multi-relational settings, addressing the limitation of existing single-table approaches. The method leverages auxiliary relational information during training to enrich single-table features with relational membership signals, while requiring only target table attributes at inference time. Evaluated across three tabular diffusion architectures and three real-world relational datasets, FERMI demonstrates significant improvements in attack performance, achieving up to 53% higher true positive rate at 0.1 false positive rate (TPR@0.1FPR) in white-box settings and 22% in black-box settings compared to single-table baselines.
membership inferencetabular diffusion modelsmulti-relational datafeature-mappingprivacy risk
OverNaN: NaN-Aware Oversampling for Imbalanced Learning with Meaningful Missingness
OverNaN introduces a NaN-aware oversampling framework for imbalanced learning that preserves meaningful missingness in datasets. The method extends synthetic oversampling techniques to operate directly on incomplete feature vectors, allowing missing values to be preserved, propagated, or selectively interpolated based on defined strategies. By treating missingness as part of the feature space, OverNaN avoids introducing artificial certainty while addressing class imbalance. The framework is demonstrated to retain meaningful missingness during oversampling, making it suitable for small, incomplete, and imbalanced datasets in scientific and engineering domains where missingness is informative and unavoidable.
nan-aware oversamplingimbalanced learningmissing-data handlingsynthetic oversamplingfeature space
EqOD: Symmetry-Informed Stability Selection for PDE Identification
Equivariant Operator Discovery (EqOD) introduces a fully automatic method for partial differential equation (PDE) identification by combining symmetry-informed library reduction and randomized LASSO stability selection. When Galilean invariance is detected via a weak-form structural test, EqOD reduces the candidate library using a proven Galilean exclusion result; otherwise, it applies stability selection guided by false-positive bounds. EqOD achieves F1 = 1.000 ± 0.000 on the Heat equation at 20% noise, outperforming WF-LASSO (0.475 ± 0.181), PySINDy 2.0 (0.000), and WSINDy (0.789). It wins 7 of 32 test cases under strict criteria and outperforms PySINDy 2.0.0 in 23 of 32 cases. External validation yields F1 = 1.000 on all 5 clean benchmarks.
equivariant operator discoverygalilean invariancestability selectionweak-form structural testpartial differential equations
📰 Industry Media (7)
AI chatbots are giving out people’s real phone numbers
Generative AI chatbots, including Google Gemini, OpenAI ChatGPT, and Anthropic Claude, are increasingly exposing personal phone numbers and other PII due to training on web-scraped datasets containing sensitive information. Instances include incorrect customer service numbers and personal contacts surfaced during casual queries. Experts attribute this to LLMs memorizing and reproducing PII from training data, exacerbated by diminishing public datasets and reliance on data brokers. Guardrails like content filters and privacy instructions often fail, as evidenced by cases where chatbots bypassed safeguards to reveal addresses and phone numbers. Current privacy laws inadequately address this issue, and AI companies lack clear mechanisms for PII removal.
piillmsguardrailsdata brokersmemorization
Fastino Labs Open-Sources GLiGuard: A 300M Parameter Safety Moderation Model That Matches or Exceeds Accuracy of Models 23–90x Its Size
Fastino Labs introduces GLiGuard, a 300M parameter encoder-based safety moderation model that reframes moderation as a text classification task rather than autoregressive generation. GLiGuard evaluates four safety tasks concurrently in a single forward pass: safety classification, jailbreak strategy detection, harm category detection, and refusal detection. Trained on 87,000 human-annotated examples and synthetic data, GLiGuard achieves 87.7 F1 on prompt classification and 82.7 F1 on response classification, matching or exceeding models 23–90× its size while reducing latency by up to 16.6× (26ms vs. 426ms) on an NVIDIA A100 GPU.
encoder-basedautoregressive generationtext classificationharm category detectionmacro-averaged f1
Mira Murati’s Thinking Machines Lab Introduces Interaction Models: A Native Multimodal Architecture for Real-Time Human-AI Collaboration
Thinking Machines Lab introduces Interaction Models, a native multimodal architecture for real-time human-AI collaboration, addressing limitations of turn-based systems. The 276B parameter Mixture-of-Experts model employs micro-turn design (200ms chunks) with encoder-free early fusion, enabling simultaneous audio/video/text processing via co-trained lightweight embeddings. Benchmarks show superior performance on interaction metrics (77.8 FD-bench v1.5, 0.40s latency) and novel tasks like TimeSpeak (64.7 accuracy) where existing models score near-zero.
interaction modelsmicro-turn designencoder-free early fusionmixture-of-expertsreal-time multimodal
Google DeepMind Introduces an AI-Enabled Mouse Pointer Powered by Gemini That Captures Visual and Semantic Context Around the Cursor
Google DeepMind introduces an AI-enabled mouse pointer powered by Gemini, capturing visual and semantic context around the cursor to streamline user interactions. The system operates on four principles: maintaining workflow continuity, leveraging visual-semantic context, interpreting deictic language, and converting pixels into actionable entities. It dynamically processes cursor hover state and UI content as structured inputs, enabling intuitive interactions without manual prompting. Experimental demos for image editing and map search are available in Google AI Studio, with integrations rolling out in Chrome and planned for Googlebook laptops. The approach shifts AI assistance from isolated windows to cursor-level functionality across applications.
geminideictic languagestructured inputsmultimodal modelsentity extraction
Build a Hybrid-Memory Autonomous Agent with Modular Architecture and Tool Dispatch Using OpenAI
The article presents a modular architecture for building hybrid-memory autonomous agents using OpenAI's API, combining semantic vector search (text-embedding-3-small) and keyword retrieval (BM25) via Reciprocal Rank Fusion. The system implements abstract interfaces for MemoryBackend, LLMProvider, and Tool, with concrete implementations including a HybridMemory class and OpenAIProvider (gpt-4o-mini). Four tools (memory_store, memory_search, calculator, web_search) demonstrate tool dispatch, while an AgentPersona class enforces consistent behavior through compiled system prompts. The agent achieves multi-turn tool-augmented reasoning with an 8-round maximum dispatch loop.
hybrid-memoryreciprocal rank fusiontool dispatchmodular architectureautonomous agent
Meet AntAngelMed: A 103B-Parameter Open-Source Medical Language Model Built on a 1/32 Activation-Ratio MoE Architecture
Researchers introduce AntAngelMed, a 103B-parameter open-source medical LLM employing a 1/32 activation-ratio Mixture-of-Experts (MoE) architecture, activating only 6.1B parameters per inference. The model builds on Ling-flash-2.0 with architectural optimizations including sigmoid routing, QK-Norm, and Partial-RoPE, achieving 7× efficiency over dense models. Three-stage training combines medical pre-training, mixed-domain SFT, and GRPO-based RL. Benchmarks show state-of-the-art performance on HealthBench (surpassing proprietary models), MedAIBench, and MedBench, with 128K context via YaRN and >200 tokens/s throughput on H20 hardware.
mixture-of-expertsqk-normpartial-ropegrpoyarn
Physical AI Conference Comes to San Jose as Robotics & Autonomous AI Go Mainstream
The Physical AI Conference 2026 in San Jose focuses on scaling AI from digital to physical domains, emphasizing robotics, autonomous systems, and industrial automation. The event highlights enterprise-scale deployment strategies, infrastructure requirements, and real-world AI reliability. Sessions cover AI strategy, robotics, autonomous operations, and developer workflows, featuring insights from NVIDIA, Airbus, Qualcomm, and Hyundai. The conference aims to bridge the gap between AI experimentation and production, addressing challenges in scalability, infrastructure, and safety. Attendees will explore advancements in Physical AI, including sensing, reasoning, and acting in dynamic environments, marking a shift from software-based AI to embedded intelligent systems.
roboticsautonomous systemsindustrial automationai infrastructurephysical ai
Generated automatically at 2026-05-13 21:19 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.
