Daily Digest — 2026-05-14

Wednesday, May 13, 2026 · 340 items · model: deepseek/deepseek-chat

340 items · 4 research labs, 329 arxiv papers, 7 industry media

🏛️ Research Labs (4)

Building a safe, effective sandbox to enable Codex on Windows

OpenAI News · 2026-05-13

OpenAI developed a custom sandbox implementation for Codex on Windows to balance safety and productivity, addressing the lack of native OS-level isolation. The solution combines synthetic SIDs and write-restricted tokens to enforce granular filesystem access controls without requiring admin privileges, while advisory environment variables limit network access. Initial prototypes demonstrated effective write restrictions but revealed weaknesses in network suppression, prompting exploration of Windows Firewall integration for stronger isolation.

sandboxsynthetic sidswrite-restricted tokensmandatory integrity controlappcontainer

How finance teams use Codex

OpenAI News · 2026-05-12

OpenAI Codex enables finance teams to automate repetitive tasks and generate review-ready assets for business operations. By leveraging existing workbooks, dashboards, and owner notes, Codex transforms unstructured inputs into structured narratives, variance analyses, and forecast updates. The system integrates with plugins like Google Drive, SharePoint, and Slack to process source-backed data, flag risks, and draft CFO-ready reports. Example workflows include preparing monthly business reviews, cleaning financial models, and updating executive reporting packs. Codex reduces manual effort, ensuring accuracy and consistency while allowing teams to focus on strategic decision-making.

codexforecast updatesvariance analysiscfo-readyplugins

AutoScout24 scales engineering with AI-powered workflows

OpenAI News · 2026-05-12

AutoScout24 Group implemented AI-powered workflows using OpenAI's Codex and ChatGPT to accelerate software development and enhance code quality across its engineering teams. The dual-layer strategy combined broad organizational access to ChatGPT for 2,000 employees with deep integration of Codex into workflows for 1,000 builder roles. Key outcomes included a 10x reduction in development cycles (from weeks to days), improved code consistency through automated pull request reviews, and expanded innovation capacity via AI-enabled prototyping. The company established an AI Champions network to drive organic adoption, focusing on augmenting existing capabilities rather than replacing them.

codexchatgptpull request reviewsworkflow integrationai champions

How NVIDIA engineers and researchers build with Codex

OpenAI News · 2026-05-12

NVIDIA engineers leverage OpenAI's Codex (built on GPT-5.5) to accelerate complex engineering and ML research workflows, achieving 10× speed improvements in end-to-end experimentation. The system autonomously handles long coding sessions, bug detection, and tool selection while maintaining context across compactions. Key results include 40,000 NVIDIA employees adopting Codex, automated research loops (from literature review to experiment execution via SSH), and 20× efficiency gains in Python-to-Rust translation. The model demonstrates superior autonomy and creativity compared to predecessors, enabling rapid prototyping of production systems like an internal podcast app.

codexgpt-5.5kv-cacheautonomous agentsmachine translation

📜 arXiv Papers (329)

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

arXiv cs.AI · Runhui Huang, Jie Wu, Rui Yang, Zhe Liu · 2026-05-12

AlphaGRPO introduces Group Relative Policy Optimization (GRPO) for AR-Diffusion Unified Multimodal Models (UMMs), enabling advanced multimodal generation tasks without cold-start initialization. The framework supports Reasoning Text-to-Image Generation by inferring implicit user intents and Self-Reflective Refinement through autonomous error diagnosis and correction. A Decompositional Verifiable Reward (DVReward) mechanism decomposes user requests into atomic, verifiable questions evaluated by a general Multimodal Large Language Model (MLLM) for stable supervision. Experiments on GenEval, TIIF-Bench, DPG-Bench, WISE, and GEdit demonstrate robust improvements in generation and editing tasks, validating the self-reflective reinforcement approach.

group relative policy optimizationar-diffusion unified multimodal modelsdecompositional verifiable rewardreasoning text-to-image generationself-reflective refinement

Learning, Fast and Slow: Towards LLMs That Adapt Continually

arXiv cs.AI · Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal, Joseph E. Gonzalez · 2026-05-12

We introduce Fast-Slow Training (FST), a framework combining in-context learning (fast weights) and parameter updates (slow weights) for continual adaptation in large language models (LLMs). FST leverages optimized context as fast weights to absorb task-specific information while maintaining slow weights closer to the base model, preserving general reasoning behaviors. FST achieves up to 3x greater sample efficiency and higher performance asymptotes compared to parameter-only reinforcement learning (RL) across reasoning tasks. It reduces KL divergence by up to 70%, mitigating catastrophic forgetting and preserving plasticity for subsequent tasks. In continual learning scenarios, FST consistently acquires new tasks, outperforming parameter-only RL approaches.

fast-slow trainingin-context learningcatastrophic forgettingkl divergencecontinual learning

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

arXiv cs.AI · Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He · 2026-05-12

The paper introduces a reward-density principle for efficient allocation of labeled training data in language model post-training, arguing that sparse sequence-level rewards should train exploratory models while dense token-level teacher rewards compress behavior into smaller models. It proposes using scarce labeled data upstream on the strongest model to generate reward-shaped behavior, then transferring it downstream as dense supervision. Evaluations on Qwen3 and Llama models for verifiable math tasks show that an RL-improved 8B teacher distilled through dense supervision outperforms direct GRPO on a 1.7B student, improving MATH accuracy from 75.4% to 78.5%.

reward-densitysparse rewarddense supervisionrl-improvedtoken-level

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

arXiv cs.AI · Xuhao Hu, Xi Zhang, Haiyang Xu, Kyle Qiao · 2026-05-12

ToolCUA introduces an end-to-end agent for optimal GUI-Tool path orchestration in Computer Use Agents (CUAs), addressing suboptimal execution paths caused by hybrid action spaces. The method employs an Interleaved GUI-Tool Trajectory Scaling Pipeline to synthesize diverse trajectories, Tool-Bootstrapped GUI RFT combining supervised fine-tuning and single-turn RL for improved switching decisions, and Online Agentic RL guided by a Tool-Efficient Path Reward. Evaluated on OSWorld-MCP, ToolCUA achieves 46.85% accuracy, a 66% relative improvement over the baseline and 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. The approach highlights the potential of hybrid action space training for real-world digital agents.

gui-tool orchestrationhybrid action spacetool-bootstrapped gui rftonline agentic rltool-efficient path reward

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

arXiv cs.AI · Guohui Zhang, XiaoXiao Ma, Jie Huang, Hang Xu · 2026-05-12

OmniNFT introduces a modality-aware online diffusion RL framework for joint audio-video generation, addressing three key challenges: multi-objective advantages inconsistency, multi-modal gradients imbalance, and uniform credit assignment. The method employs modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting to enhance per-modality fidelity and cross-modal alignment. Experiments on JavisBench and VBench with LTX-2 show improvements in audio-video perceptual quality, alignment, and synchronization.

reinforcement learningdiffusion modelsmultimodal generationgradient surgerycredit assignment

Reward Hacking in Rubric-Based Reinforcement Learning

arXiv cs.AI · Anas Mahmoud, MohammadHossein Rezaei, Zihao Wang, Anisha Gunjal · 2026-05-12

This work investigates reward hacking in rubric-based reinforcement learning, where policies optimized against training verifiers diverge from rubric-free judge evaluations. The authors introduce a framework separating verifier failure (training verifier credits rejected criteria) and rubric-design limitations (rubric-based verifiers favor worse responses). Experiments in medical and science domains show weak verifiers yield proxy-reward gains that fail to transfer, with exploitation growing over training. Stronger verifiers reduce but do not eliminate exploitation. A self-internalization gap metric tracks reference-verifier quality. Results indicate stronger verification reduces reward hacking but does not ensure rubric gains align with broader quality improvements.

reward hackingrubric-based rlverifier failureself-internalization gapproxy-reward

KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

arXiv cs.AI · Alireza Nadali, Patrick Cooper, Ashutosh Trivedi, Alvaro Velasquez · 2026-05-12

KV-Fold introduces a training-free long-context inference protocol that treats the key-value (KV) cache as an accumulator in a left fold over sequence chunks. The method processes each chunk conditioned on the accumulated cache, appends new keys and values, and passes the enlarged cache forward, enabling stable recurrence without model modification. Results demonstrate robustness across chunk sizes, numerical precision, and model families, achieving 100% exact-match retrieval on a needle-in-a-haystack benchmark with contexts up to 128K tokens and chain depths up to 511 on Llama-3.1-8B. KV-Fold maintains long-range retrieval within single GPU memory limits, outperforming streaming methods.

kv-cachelong-context inferencerecurrencetransformerneedle-in-a-haystack

Solve the Loop: Attractor Models for Language and Reasoning

arXiv cs.AI · Jacob Fein-Ashley, Paria Rashidinejad · 2026-05-12

The paper introduces Attractor Models, a novel architecture combining backbone and attractor modules to refine output embeddings via fixed-point solving with implicit differentiation. This approach maintains constant training memory and adaptively selects iteration depth. Empirical results demonstrate Pareto improvements over standard Transformers in language modeling (46.6% lower perplexity, 19.7% higher accuracy) and reasoning tasks (91.4% accuracy on Sudoku-Extreme with 27M parameters). The models exhibit equilibrium internalization, enabling solver removal at inference with minimal performance loss.

attractor modelsfixed-point solvingimplicit differentiationequilibrium internalizationiterative refinement

Towards Affordable Energy: A Gymnasium Environment for Electric Utility Demand-Response Programs

arXiv cs.AI · Jose E. Aguilar Escamilla, Lingdong Zhou, Xiangqi Zhu, Huazheng Wang · 2026-05-12

We introduce DR-Gym, an open-source Gymnasium-compatible environment for training and evaluating demand-response programs from an electric utility's perspective. The simulator addresses limitations of offline historical data by modeling the dynamic feedback loop between pricing signals and customer adaptation, featuring a regime-switching wholesale price model calibrated to extreme events and physics-based building demand profiles. A configurable multi-objective reward function enables diverse learning objectives. Baseline strategies and data snapshots demonstrate the simulator's capability to create realistic and learnable environments for optimizing sequential decision-making in demand-response programs.

demand-responsegymnasiumregime-switchingmulti-objectivesequential decision-making

Enabling AI-Native Mobility in 6G: A Real-World Dataset for Handover, Beam Management, and Timing Advance

arXiv cs.AI · Mannam Veera Narayana, Rohit Singh, Deepa M. R, Radha Krishna Ganti · 2026-05-12

This work introduces a real-world dataset for AI/ML-driven mobility optimization in 6G networks, addressing limitations of simulated data in high-speed 5G scenarios. The dataset captures user equipment (UE) mobility across pedestrian, bike, car, bus, and train modes, focusing on handover (HO) scenarios to reduce interruption time and maintain throughput. It includes timing advance (TA) measurements at key signaling events (RACH trigger, MAC CE, PDCCH grant), previously absent in existing datasets. The authors detail dataset creation, experimental setup, and exploratory analysis, highlighting its utility for training and evaluating AI/ML models in TA prediction and beam management.

handovertiming advancebeam managementuser equipment6g

The Algorithmic Caricature: Auditing LLM-Generated Political Discourse Across Crisis Events

arXiv cs.AI · Gunjan, Sidahmed Benabderrahmane, Talal Rahwan · 2026-05-12

This study introduces a Computational Social Science framework to audit the population-level realism of LLM-generated political discourse across crisis events. Using a paired corpus of 1,789,406 posts from nine events, the authors compare observed social media discourse with synthetic counterparts across four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency. Results indicate that synthetic discourse is fluent but less realistic at the population level, exhibiting more negative sentiment, structural regularity, and lexical abstraction compared to observed discourse. Differences vary by event type, quantified via the Caricature Gap measure. The findings highlight reduced population realism as a key limitation of synthetic political discourse.

computational social sciencepopulation realismcaricature gaplexical-ideological framingcross-event dependency

A Causal Language Modeling Detour Improves Encoder Continued Pretraining

arXiv cs.AI · Rian Touchent, Eric de la Clergerie · 2026-05-12

The study demonstrates that temporarily switching from Masked Language Modeling (MLM) to Causal Language Modeling (CLM) during encoder adaptation improves downstream performance. Using ModernBERT on biomedical texts, this CLM detour followed by MLM decay outperformed MLM-only baselines by +0.3-2.8pp across 19 French and English tasks. Analysis reveals CLM's dense supervision primarily affects lower transformer layers (0-7), with gains persisting through MLM decay and scaling with model capacity. The authors release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders.

masked language modelingcausal language modelingencoder adaptationtransformer layersbiomedical nlp

CAAFC: Chronological Actionable Automated Fact-Checker for misinformation / non-factual hallucination detection and correction

arXiv cs.AI · Islam Eldifrawi, Shengrui Wang, Amine Trabelsi · 2026-05-12

We introduce CAAFC (Chronological Actionable Automated Fact-Checker), a framework addressing limitations in existing Automated Fact-Checking (AFC) systems by aligning with professional fact-checking practices. CAAFC processes claims, conversations, and dialogues to detect factual errors and hallucinations, providing actionable corrections with primary source justifications. It dynamically updates evidence and knowledge bases by incorporating recent contextual information. Evaluations demonstrate that CAAFC outperforms state-of-the-art AFC and hallucination detection systems across multiple benchmark datasets, enhancing fact verification reliability.

automated fact-checkinghallucination detectionknowledge base updateprimary source justificationcontextual information

Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers

arXiv cs.AI · Haoyu Wang, Yuliang Song, Tao Li, Zhiwei Deng · 2026-05-12

The paper introduces CP-SynC-XL, a benchmark of 100 combinatorial problems (4,577 instances), to evaluate three LLM-generated solver-construction paradigms: native Python, Python + OR-Tools, and MiniZinc + OR-Tools. It finds that Python + OR-Tools achieves highest correctness, while MiniZinc + OR-Tools has lower coverage despite using the same back-end. Prompting for search optimization yields minimal speed-ups (1.03-1.12x median) and often degrades correctness due to heuristic traps like local approximations or redundant constraints. The results advocate formalizing variables and constraints for verified solvers while separately verifying LLM-authored optimizations.

combinatorial solversllm-generated codeconstraint programmingheuristic trapneuro-symbolic systems

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

arXiv cs.AI · Eric Bigelow, Raphaël Sarfati, Daniel Wurgaft, Owen Lewis · 2026-05-12

This work proposes that Large Language Models (LLMs) update beliefs through trajectories in a low-dimensional conceptual belief space, analogous to Bayesian inference. The study analyzes belief dynamics using story understanding tasks, combining behavioral and representational analyses. Results show that belief updates follow structured manifolds, reflected consistently in model behavior and internal representations, which can be decoded using linear probes. Interventions on these representations causally steer belief trajectories, predictable from the geometry of the conceptual space. These findings provide a geometric framework for understanding in-context learning in LLMs.

large language modelsbayesian inferenceconceptual belief spacein-context learninglinear probes

Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling

arXiv cs.AI · Eilam Shapira, Moshe Tennenholtz, Roi Reichart · 2026-05-12

The study introduces a target-adaptive text-tabular modeling approach to predict decisions of unfamiliar AI agents from limited interactions, leveraging structured game state, offer history, and dialogue. The method employs a tabular foundation model augmented with LLM-as-Observer, where a frozen LLM encodes decision-time state and dialogue into hidden state features, enhancing prediction without direct few-shot prompting. Evaluated on 13 frontier-LLM agents and 91 scaffolded agents, the model outperforms baselines, with Observer features improving response-prediction AUC by ~4 points and reducing bargaining offer-prediction error by 14%. This demonstrates the efficacy of hidden LLM representations in decision prediction.

tabular foundation modelllm-as-observertarget-adaptive predictionhidden state featuresdecision prediction

Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems

arXiv cs.AI · William Parris · 2026-05-12

The paper introduces Semantic Reward Collapse (SRC), a phenomenon where semantically distinct forms of evaluative dissatisfaction are compressed into generalized optimization signals in reinforcement learning from human feedback (RLHF) systems. This leads to epistemic drift, where systems suppress visible uncertainty rather than preserving calibrated uncertainty integrity. Drawing on institutional proxy collapse and human learning theory, the authors propose Constitutional Reward Stratification (CRS), a domain-aware reward framework designed to preserve differentiated epistemic attribution. CRS is presented as a governance-oriented research direction requiring further empirical validation.

semantic reward collapsereinforcement learning from human feedbackepistemic integrityconstitutional reward stratificationoptimization signals

OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

arXiv cs.AI · Yuxiao Yang, Xiaoyun Wang, Weitong Zhang · 2026-05-12

The paper proposes OGLS-SD, an outcome-guided logit-steering framework for on-policy self-distillation (OPSD) in large language models (LLMs). The method addresses token-level supervision mismatch caused by reflection-induced bias and response templates by leveraging verifiable outcome rewards to contrast successful and failed trajectories. Experiments demonstrate improved reasoning performance over standard OPSD and variants across multiple benchmarks through calibrated teacher logits combining outcome-level correctness with dense token-level guidance.

on-policy self-distillationlogit steeringoutcome-guided learningreasoning calibrationtoken-level supervision

Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory

arXiv cs.AI · Hari K. Prakash, Charles H Martin · 2026-05-12

A novel Random Matrix Theory method detects overfitting in Neural Networks without access to train or test data by identifying Correlation Traps—large outliers in the empirical spectral distribution of randomized weight matrices. The method involves element-wise randomization of weight matrices, fitting with a Marchenko-Pastur distribution, and evaluating JS divergence of output logits on random data. Results reveal an 'anti-grokking' phase characterized by increasing Correlation Traps, high train accuracy, and decreasing test accuracy, distinct from pre-grokking phases. The method also identifies Correlation Traps in some foundation-scale LLMs, indicating potential harmful overfitting.

random matrix theorycorrelation trapsmarchenko-pastur distributionanti-grokkingjs divergence

SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation

arXiv cs.AI · Luke James Miller, Yugyung Lee · 2026-05-12

The paper introduces SEMIR, a graph-based representation learning framework for segmenting small, sparse structures in high-resolution images. SEMIR decouples inference from native grids by constructing topology-preserving graph minors through parameterized edge contraction and deletion, optimized via boundary-alignment objectives. The method employs a GNN with relational edge features for efficient region-level inference. Evaluated on BraTS 2021, KiTS23, and LiTS datasets, SEMIR improves Dice scores for minority structures while maintaining practical runtime performance, demonstrating robustness to structural variability and distributional uncertainty.

graph minorboundary-alignmentdice criteriongnnedge contraction

Scalable Token-Level Hallucination Detection in Large Language Models

arXiv cs.AI · Rui Min, Tianyu Pang, Chao Du, Minhao Cheng · 2026-05-12

TokenHD introduces a scalable pipeline for token-level hallucination detection in large language models (LLMs), addressing limitations of step-level analysis. The method combines a data engine for synthesizing hallucination annotations with an importance-weighted training strategy, enabling direct detection on free-form text without predefined segmentation. Experiments demonstrate that a 0.6B detector outperforms larger reasoning models like QwQ-32B, with detection performance scaling consistently from 0.6B to 8B. The detector exhibits strong generalization across diverse scenarios, and strategies for enhancing cross-domain generalization are explored.

hallucination detectiontoken-level analysisimportance-weighted trainingscalable pipelinecross-domain generalization

Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training

arXiv cs.AI · Rasool Fakoor, Murdock Aubry, Nicholas Stranges, Alexander J. Smola · 2026-05-12

The paper introduces a batch-adaptive objective for reinforcement learning that dynamically adjusts trust-region and off-policy concerns based on the policy-ratio distribution, eliminating the need for fixed hyper-parameters. The method uses normalized effective sample size to cap score-function weights and set regularization strength, automatically tightening updates when data becomes stale or mismatched. Experiments demonstrate that this approach matches or outperforms tuned baselines across diverse settings without introducing new hyper-parameters. The implementation is available as open-source.

policy optimizationoff-policy learningtrust-region methodseffective sample sizereinforcement learning

Discrete Flow Matching for Offline-to-Online Reinforcement Learning

arXiv cs.AI · Fairoz Nower Khan, Nabuat Zaman Nahim, Peizhong Ju · 2026-05-12

The paper introduces DRIFT, a method for offline-to-online reinforcement learning (RL) in discrete action spaces, addressing challenges in fine-tuning generative policies. DRIFT updates an offline pretrained continuous-time Markov chain (CTMC) policy using advantage-weighted discrete flow matching, with a path-space penalty to preserve pretrained knowledge and a candidate-set approximation for large action spaces. Theoretical analysis shows controlled error in candidate-set approximation and adaptive CTMC generators. Experiments on Jericho demonstrate stable improvement, achieving the highest average score with a GRU encoder, outperforming pretrained language model methods.

offline-to-online rldiscrete flow matchingcontinuous-time markov chainadvantage-weighted losscandidate-set approximation

ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows

arXiv cs.AI · Wei Liu, Yang Gu, Xi Yan, Zihan Nan · 2026-05-12

ProfiliTable introduces an autonomous multi-agent framework for robust tabular data processing, addressing limitations of LLM-based approaches through dynamic profiling. The system combines a Profiler (ReAct-style exploration), Generator (knowledge-augmented code synthesis), and Evaluator-Summarizer (closed-loop refinement via execution feedback). Evaluated on 18 tabular task types, it outperforms baselines in multi-step scenarios, demonstrating improved semantic accuracy and governance compliance through iterative context refinement.

tabular data processingdynamic profilingmulti-agent frameworkreact-style explorationclosed-loop refinement

Agent-Based Post-Hoc Correction of Agricultural Yield Forecasts

arXiv cs.AI · Matthew Beddows, Aiden Durrant, Georgios Leontidis · 2026-05-12

A structured LLM agent framework is proposed for post-hoc correction of agricultural yield forecasts, addressing limitations in commercial farm records lacking sensor networks and high-resolution inputs. The framework integrates domain knowledge via phase detection, bias learning, and range validation tools. Evaluations on proprietary strawberry and USDA corn datasets demonstrate significant improvements: agent refinement reduced MAE by 20% and MASE by 56% for strawberry yields across XGBoost, Moirai2, and Random Forest baselines. Llama 3.1 8B outperformed LLaVA 13B, achieving consistent gains across configurations.

llm agentpost-hoc correctionphase detectionbias learningrange validation

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

arXiv cs.AI · Yanting Miao, Yutao Sun, Dexin Wang, Mengyu Zhou · 2026-05-12

The Granular Alignment Paradigm (GAP) addresses feature-space mismatch in visual latent reasoning for multimodal large language models (MLLMs) by aligning visual latents at three levels. GAP employs feature-level alignment via a PCA-aligned latent head, context-level alignment with auxiliary visual supervision, and capacity-guided alignment targeting challenging examples. Evaluated on Qwen2.5-VL 7B, GAP achieves superior mean aggregate perception and reasoning performance compared to supervised variants. Inference-time probing indicates that generated latents provide task-relevant visual signals beyond token slot expansion.

multimodal large language modelsvisual latent reasoninggranular alignment paradigmfeature-space mismatchpca-aligned latent head

Classifier Context Rot: Monitor Performance Degrades with Context Length

arXiv cs.AI · Sam Martin, Fabien Roger · 2026-05-12

The study demonstrates that frontier language models (Opus 4.6, GPT 5.4, Gemini 3.1) exhibit degraded performance in classifying dangerous actions within long coding transcripts (>500K tokens), with failure rates increasing by 2× to 30× beyond 800K tokens compared to shorter contexts. The authors propose prompting techniques like periodic reminders as partial mitigation and highlight the need for long-context evaluations in monitor benchmarks. Results indicate current evaluations overestimate monitor performance by neglecting context-length effects.

long-context degradationcoding agentsprompting techniquesmonitor performancefrontier models

QAP-Router: Tackling Qubit Routing as Dynamic Quadratic Assignment with Reinforcement Learning

arXiv cs.AI · Kien X. Nguyen, Ankit Kulshrestha, Ilya Safro, Xiaoyuan Liu · 2026-05-12

QAP-Router introduces a reinforcement learning approach to qubit routing by framing it as a dynamic Quadratic Assignment Problem (QAP), capturing interaction-distance coupling through flow and distance matrices. The policy network employs a solution-aware Transformer backbone to encode matrix interactions into attention mechanisms, integrating a lookahead mechanism to mitigate myopic decisions. Evaluated on 1,831 quantum circuits from MQTBench, AgentQ, and QUEKO datasets, QAP-Router reduces CNOT gate counts by 15.7%, 30.4%, and 12.1% respectively compared to existing industry compilers.

qubit routingquadratic assignment problemreinforcement learningtransformercnot gate

A Family of Quaternion-Valued Differential Evolution Algorithms for Numerical Function Optimization

arXiv cs.AI · Gerardo Altamirano-Gomez, Álvaro Gallardo, Carlos Ignacio Hernández Castellanos · 2026-05-12

The authors introduce Quaternion-Valued Differential Evolution (QDE), a family of novel algorithms extending Differential Evolution (DE) to operate directly in quaternion space. Several mutation strategies are proposed to exploit the algebraic and geometric properties of quaternions. Evaluated on the BBOB benchmark, the QDE variants demonstrate faster convergence and superior performance across multiple function classes compared to traditional real-valued DE, highlighting the potential of quaternion-based optimization in computational intelligence.

quaternion-valued differential evolutionnumerical optimizationmutation strategiesbbob benchmarkquaternion algebra

MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering

arXiv cs.AI · Rezarta Islamaj, Robert Leaman, Joey Chan, Nicholas Wan · 2026-05-12

MedHopQA introduces a disease-centered multi-hop reasoning benchmark for evaluating large language models (LLMs) in biomedical question answering, addressing limitations of existing benchmarks. The dataset comprises 1,000 expert-curated question-answer pairs requiring synthesis across two distinct Wikipedia articles, with open-ended free-text answers and ontology-grounded synonym sets for evaluation. Constructed through human annotation, triage, iterative verification, and LLM-as-a-judge validation, MedHopQA embeds scored questions within a larger set of 10,000 questions to mitigate contamination risk. The benchmark prioritizes compositional reasoning, saturation resistance, and contamination resistance, providing a reusable framework for future biomedical QA datasets.

multi-hop reasoningbiomedical question answeringontology-groundedcontamination resistancecompositional reasoning

$δ$-mem: Efficient Online Memory for Large Language Models

arXiv cs.AI · Jingdi Lei, Di Zhang, Junxian Li, Weida Wang · 2026-05-12

The paper introduces $\delta$-mem, a lightweight memory mechanism for large language models that enhances a frozen full-attention backbone with a compact online associative memory state. $\delta$-mem compresses historical information into an $8\times8$ state matrix updated via delta-rule learning and generates low-rank corrections to attention computation during generation. Evaluations show $\delta$-mem improves average scores to $1.10\times$ the frozen backbone and $1.15\times$ the strongest non-$\delta$-mem baseline, with larger gains on memory-heavy benchmarks like MemoryAgentBench ($1.31\times$) and LoCoMo ($1.20\times$), while preserving general capabilities.

delta-rule learningassociative memorylow-rank correctionsattention computationmemory-heavy benchmarks

A New Technique for AI Explainability using Feature Association Map

arXiv cs.AI · Sayantani Ghosh, Amit Kumar Das, Amlan Chakrabarti · 2026-05-12

The paper introduces FAMeX, a novel explainable AI (XAI) algorithm based on Feature Association Maps (FAM) that models feature relationships using graph theory. FAMeX outperforms established XAI methods (Permutation Feature Importance and SHAP) in feature importance estimation across eight benchmark datasets, demonstrating superior classification explainability. Experimental results validate FAMeX's efficacy in enhancing AI system transparency through association-based feature interpretation.

explainable aifeature association mapshappermutation feature importancegraph theory

BSO: Safety Alignment Is Density Ratio Matching

arXiv cs.AI · Tien-Phat Nguyen, Truong Nguyen, Thin Nguyen, Duy Minh Ho Nguyen · 2026-05-12

The authors introduce Bregman Safety Optimization (BSO), a principled framework for safety alignment in language models that reduces the task to density ratio matching. By decomposing the likelihood ratio of the optimal safe policy and minimizing Bregman divergences between data and model ratios, BSO yields a family of single-stage loss functions induced by convex generators. This approach eliminates the need for auxiliary models, requires only one additional hyperparameter, and subsumes existing safety-aware methods as special cases. Experiments demonstrate that BSO consistently improves the safety-helpfulness trade-off across benchmarks.

bregman safety optimizationdensity ratio matchingsafety alignmentlanguage modelsbregman divergences

Manifold Sampling via Entropy Maximization

arXiv cs.AI · Cornelius V. Braun, Tilman Burghoff, Marc Toussaint · 2026-05-12

The paper introduces MAnifold Sampling via Entropy Maximization (MASEM), a method for sampling from distributions on manifolds with disconnected components defined by smooth constraints. MASEM employs a resampling scheme to maximize empirical distribution entropy using k-nearest neighbor density estimation, achieving exponential KL-divergence reduction in the mean field. Evaluated with local samplers on synthetic and robotics benchmarks, MASEM outperforms alternatives by an order of magnitude in Sinkhorn distance while maintaining competitive runtime.

constrained samplingentropy maximizationmanifold learningk-nearest neighborsinkhorn distance

EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records

arXiv cs.AI · Saeed Shurrab, Mariam Al-Omari, Dana El Samad, Farah E. Shamout · 2026-05-12

EHR-RAGp introduces a retrieval-augmented foundation model for Electronic Health Records (EHR) that dynamically integrates relevant patient history across diverse clinical event types. The model employs a prototype-guided retrieval module to align and estimate the relevance of historical chunks for a given prediction task, addressing challenges like long trajectories, heterogeneous events, and temporal irregularity. EHR-RAGp consistently outperforms state-of-the-art EHR foundation models and transformer-based baselines across multiple clinical prediction tasks. Integration with existing clinical foundation models yields substantial performance gains, providing a scalable and efficient framework for leveraging long-range clinical context.

electronic health recordsretrieval-augmentedprototype-guidedclinical predictionfoundation model

Reinforcing VLAs in Task-Agnostic World Models

arXiv cs.AI · Yucen Wang, Rui Yu, Fengming Zhang, Junjie Lu · 2026-05-12

RAW-Dream introduces a task-agnostic paradigm for reinforcing Vision-Language-Action (VLA) models by disentangling world model learning from task dependencies. It employs a pre-trained world model for predicting future rollouts and an off-the-shelf Vision-Language Model (VLM) for reward generation, enabling zero-shot inference. A dual-noise verification mechanism is introduced to mitigate world model hallucinations by filtering unreliable rollouts. Experiments across simulation and real-world settings demonstrate consistent performance gains, showing that generalized physical priors can replace costly task-dependent data, offering a scalable approach for VLA adaptation.

vision-language-actionzero-shot inferenceworld modeldual-noise verificationtask-agnostic

Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models

arXiv cs.AI · Torsten Darrell, Mahyar Ghazanfari, Jordan Kam, Alexandre Bayen · 2026-05-12

The study proposes a vision-language model (VLM) framework for post-flight safety analysis at non-towered airports, leveraging transcribed Common Traffic Advisory Frequency (CTAF) communications, METAR weather data, ADS-B flight trajectories, and Visual Flight Rules charts. A preliminary evaluation at Half Moon Bay Airport uses Gemini 2.5 Pro for qualitative case studies and benchmarks three open-source (Qwen 2.5-7B, Mistral-7B, Gemma-2-9B) and three closed-source (GPT-4o, GPT-5.4, Claude Sonnet 4.6) LLMs on a synthetic dataset with a 12-category hazard taxonomy. Results show macro F1 scores above 0.85 for binary nominal/danger classification using CTAF and METAR inputs, suggesting VLMs as a promising tool for air traffic safety assessment.

vision-language modelcommon traffic advisory frequencymeteorological aerodrome reportautomatic dependent surveillance-broadcastvisual flight rules

LISA: Cognitive Arbitration for Signal-Free Autonomous Intersection Management

arXiv cs.AI · Abderrahmane Lakas, Mohamed Amine Ferrag, Merouane Debbah · 2026-05-12

We propose LISA (LLM-Based Intent-Driven Speed Advisory), a signal-free cognitive arbitration framework for autonomous intersection management using large language models (LLMs) to reason over vehicle intents, priority classes, queue pressure, and energy preferences. LISA eliminates dependency on signal infrastructure while addressing LLM inference latency challenges. Evaluated against fixed-cycle control, SCATS, AIM, and GLOSA across varying traffic loads, LISA reduces mean control delay by up to 89.1%, maintains Level of Service C, and decreases mean waiting time by 93% and peak queue length by 60.6% under near-saturated demand. Additionally, it lowers fuel consumption by up to 48.8% and achieves 86.2% intent satisfaction, outperforming non-LLM methods.

autonomous intersection managementlarge language modelscognitive arbitrationsignal-free controlintent-driven reasoning

Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling

arXiv cs.AI · Chenran Zhao, Dianxi Shi, Yaowen Zhang, Chunping Qiu · 2026-05-12

The paper proposes a transferable delay-aware reinforcement learning method using implicit causal graph modeling to address action-effect propagation challenges in delayed feedback scenarios. The method employs a field-node encoder for latent state representation with node-level semantics and a message-passing mechanism to capture dynamic causal dependencies, enabling transferable structured representations. Imagination-driven behavior learning and latent space planning facilitate cross-task knowledge transfer. Experiments on DMC continuous control tasks with random delays show superior performance over baselines, with cross-task transfer demonstrating accelerated policy adaptation.

reinforcement learningcausal graph modelinglatent state representationmessage-passing mechanismcross-task transfer

KAN-CL: Per-Knot Importance Regularization for Continual Learning with Kolmogorov-Arnold Networks

arXiv cs.AI · Minjong Cheon · 2026-05-12

KAN-CL introduces a continual learning framework leveraging the compact-support spline parameterization of Kolmogorov-Arnold Networks (KANs) to mitigate catastrophic forgetting through per-knot importance regularization. The method combines a KAN classification head with standard EWC regularization on a convolutional backbone (bbEWC), achieving 88% and 93% reductions in forgetting over a KAN-only baseline on Split-CIFAR-10/5T and Split-CIFAR-100/10T benchmarks, respectively, while maintaining or surpassing baseline accuracy. Neural Tangent Kernel (NTK) analysis reveals that KAN's spline locality induces a structural rank deficit in the cross-task NTK, providing a forgetting bound applicable even in the feature-learning regime.

catastrophic forgettingkolmogorov-arnold networksneural tangent kernelper-knot regularizationcontinual learning

Executable Agentic Memory for GUI Agent

arXiv cs.AI · Zerui Qin, Sheng Yue, Xingyuan Hua, Yongjian Fu · 2026-05-12

The paper introduces Executable Agentic Memory (EAM), a Knowledge Graph (KG)-based framework that replaces fragile step-wise GUI agent planning with robust retrieval-and-execution. EAM employs state-aware DFS and action-group mining for memory construction, coupled with a Q-function-guided Monte Carlo Tree Search (MCTS) for efficient KG traversal. Theoretical analysis proves bias-consistency and sample complexity bounds. Empirical results show EAM outperforms UI-TARS-7B by 19.6% on AndroidWorld while reducing token costs 6× versus GPT-4o, achieving 2.8s average latency for reliable long-horizon automation.

executable agentic memoryknowledge graphmonte carlo tree searchaction-group miningbias-consistency

PriorZero: Bridging Language Priors and World Models for Decision Making

arXiv cs.AI · Junyu Xiong, Yuan Pu, Jia Tang, Yazhe Niu · 2026-05-12

PriorZero introduces a unified framework that integrates Large Language Model (LLM) priors into world-model-based planning for Reinforcement Learning (RL) agents, addressing the prior-dynamics mismatch in long-horizon tasks. The method employs a decoupled rollout-training design: during rollout, LLM priors are injected at the root node of Monte Carlo Tree Search (MCTS) to guide exploration; during training, world-model learning and LLM adaptation are decoupled, enabling stable fine-tuning via alternating optimization. Experiments on benchmarks like Jericho and BabyAI demonstrate improved exploration efficiency and asymptotic performance, validating the framework's effectiveness in LLM-empowered decision-making.

reinforcement learningmonte carlo tree searchworld modellarge language modelfine-tuning

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

arXiv cs.AI · Truong Nguyen, Tien-Phat Nguyen, Linh Ngo Van, Duy Minh Ho Nguyen · 2026-05-12

TokenRatio introduces Token-level Bregman Preference Optimization (TBPO), a method for principled token-level preference optimization via ratio matching, addressing limitations of Direct Preference Optimization (DPO) which operates at the sequence level. TBPO posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, deriving a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving optimal policy induction. Two instantiations are proposed: TBPO-Q, which learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Experiments across instruction following, helpfulness/harmlessness, and summarization benchmarks demonstrate improved alignment quality, training stability, and output diversity compared to sequence-level and token-level baselines.

token-level preferencebregman-divergenceratio matchingoptimal policyadvantage normalization

Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction

arXiv cs.AI · Younhun Kim, Georg K. Gerber, Travis E. Gibson · 2026-05-12

The authors introduce Set-Aggregated Genome Embeddings (SAGE) for predicting community-level microbiome abundance profiles from raw DNA sequences. The method leverages genomic language models (GLMs) and their few-shot learning capabilities, aggregating genome embeddings at the community level. Benchmarking demonstrates improved generalization on novel genomes compared to classical bioinformatics approaches. Model ablation reveals that community-level latent representations directly enhance performance. The study also highlights the benefits of intermediate transformations between latent representations and compares different GLM embedding choices.

set-aggregated genome embeddingsmicrobiome abundance predictiongenomic language modelsfew-shot learninglatent representations

Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance

arXiv cs.AI · Elias Calboreanu · 2026-05-12

This paper contributes a case study of iterative, agent-driven auditing for prompt-specification quality assurance in LLM-managed multi-agent systems, focusing on AEGIS, a production seven-lane orchestration pipeline with 7150 lines of prompt specifications. Nine sequential audit rounds, executed by Claude sub-agents using a checklist-driven walkthrough, identified 51 prompt-specification consistency defects, distinct from adversarial code findings. Defect counts per round were 15, 8, 12, 2, 8, 1, 4, 1, and 0, showing non-monotonic convergence. The study proposes a seven-category defect taxonomy, an audit protocol, and a final locked checklist for reproducibility.

prompt-specificationmulti-agent systemsiterative auditingdefect taxonomyclaude sub-agents

NARA: Anchor-Conditioned Relation-Aware Contextualization of Heterogeneous Geoentities

arXiv cs.AI · Jina Kim, Gengchen Mai, Lingyi Zhao, Khurram Shafique · 2026-05-12

NARA introduces a self-supervised framework for learning context-dependent representations of vector geoentities by jointly modeling semantics, geometry, and spatial relations. Unlike existing methods, NARA captures unified spatial context across heterogeneous geoentities (points, polylines, polygons) by incorporating relational spatial structure beyond proximity alone. The framework leverages neural anchor-conditioned relation-aware representation learning to enable rich contextualized representations. Evaluations on building function classification, traffic speed prediction, and next point-of-interest recommendation tasks demonstrate consistent improvements over prior methods, underscoring the effectiveness of unified relational modeling for vector geospatial data.

vector geoentitiesself-supervised learningspatial relationsneural anchor-conditioningheterogeneous geoentities

How Useful Is Cross-Domain Generalization for Training LLM Monitors?

arXiv cs.AI · Sam Martin, Fabien Roger · 2026-05-12

The study investigates cross-domain generalization in prompted language models for classification tasks, demonstrating partial generalization to adjacent domains and improved performance on unseen tasks. Training on multiple classification tasks with distinct prompts enhances robustness, though edge cases persist where models fail to adapt to entirely new prompts within the same domain. Combining classification training with general instruction following mitigates these failures while retaining classification benefits. Notably, supervised 'no-thinking' classification training generalizes to 'with-thinking' tasks like summarization, suggesting its utility in developing diverse classifiers and monitoring systems.

cross-domain generalizationprompted language modelsinstruction followingclassification traininggeneralization failures

Reconnecting Fragmented Citation Networks with Semantic Augmentation

arXiv cs.AI · Vu Thi Huong, Annika Buchholz, Imene Khebouri, Thorsten Koch · 2026-05-12

This work introduces a hybrid framework for reducing fragmentation in citation networks by integrating citation topology with LLM-based text similarity. The method augments the original graph by adding semantic edges between disconnected components and weighting existing citations based on textual similarity, using 662,369 Web of Science publications in Mathematics and Operations Research & Management Science. Semantic augmentation significantly reduces fragmentation while maintaining disciplinary homogeneity, and cluster detection via the Leiden algorithm preserves structural interpretability with multi-scale organization. The approach scales efficiently to large datasets, enhancing citation-based indicators without collapsing disciplinary boundaries.

citation networkssemantic augmentationtext similarityleiden algorithmdisciplinary homogeneity

Missingness-MDPs: Bridging the Theory of Missing Data and POMDPs

arXiv cs.AI · Joshua Wendland, Markel Zubia, Roman Andriushchenko, Maris F. L. Galesloot · 2026-05-12

The paper introduces missingness-MDPs (miss-MDPs), a POMDP subclass integrating missing data theory, where observations follow missingness functions classifying features as MCAR, MAR, or MNAR. The authors present PAC algorithms leveraging missingness-type structural properties to learn these functions from action-observation trajectories, enabling planning via off-the-shelf methods. Theoretical guarantees show ε-optimal policies in the true miss-MDP with high probability, empirically outperforming model-free POMDP baselines.

missingness-mdpspomdpsmissing datapac learningplanning

Why Conclusions Diverge from the Same Observations: Formalizing World-Model Non-Identifiability via an Inference

arXiv cs.AI · Toru Takahashi · 2026-05-12

The paper formalizes non-identifiability in inference and learning as the root cause of divergent conclusions from shared observations, rather than cognitive defects. It introduces a two-level framework: (i) θ-level non-identifiability, where inference settings (Reference, Exploration, Stabilization, Horizon) vary under the same world model W; and (ii) W-level non-identifiability, where repeated inference biases data exposure and updates, causing W to diverge. The analysis shows how disagreements project onto abstract/concrete, externalizability, and order/freedom bases due to computational, observational, and coordination constraints. The framework connects to deep representation learning, illustrated via AI regulation debates.

non-identifiabilityinference profileworld modelrepresentation learninglatent-state estimation

Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs

arXiv cs.AI · Deepak Kumar, Baban Gain, Asif Ekbal · 2026-05-12

The authors propose a multilingual speech correction pipeline leveraging large language models (LLMs) to address disfluencies in Automatic Speech Recognition (ASR) transcripts. Their method combines a sequence tagger for disfluent token detection with instruction fine-tuning of an LLM, enhanced by a contrastive learning objective that penalizes disfluent token reproduction while preserving grammatical integrity. Experiments across Hindi, Bengali, and Marathi demonstrate consistent improvements over multilingual sequence-to-sequence baselines, highlighting the insufficiency of detection-only approaches. The approach offers a scalable solution for multilingual disfluency correction in speech-driven NLP systems, with code publicly available.

automatic speech recognitiondisfluency correctioninstruction fine-tuningcontrastive learningmultilingual sequence-to-sequence

No Action Without a NOD: A Heterogeneous Multi-Agent Architecture for Reliable Service Agents

arXiv cs.AI · Zixu Yang, Hang Zheng, Nan Jiang, Zhiyang Tang · 2026-05-12

We propose NOD (Navigator-Operator-Director), a heterogeneous multi-agent architecture to enhance the reliability of LLM-based service agents in long-horizon tasks. NOD externalizes a structured Global State for explicit task tracking and introduces selective external oversight via a Director agent to verify critical actions, mitigating error propagation and unsafe behavior. Evaluated on $τ^2$-Bench, NOD achieves higher task success rates and critical action precision compared to baselines, significantly reducing policy violations, tool hallucinations, and user-intent misalignment.

multi-agent architectureglobal stateexternal oversighttask success rateerror propagation

Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study

arXiv cs.AI · M A Al-Masud, Nils Strodthoff · 2026-05-12

This systematic study evaluates pretraining strategies and scaling effects for ECG foundation models, comparing five self-supervised learning objectives on datasets up to 11M samples. Methods include contrastive and non-contrastive approaches, with architectures spanning transformers, CNNs, and structured state space models. Results show contrastive predictive coding outperforms other objectives, particularly JEPA, in transferability across clinical tasks. Scaling pretraining data improves performance up to 11M samples, while structured state space models demonstrate superior representation learning, attributed to their strong inductive biases over pretraining scale alone.

contrastive predictive codingstructured state space modelsself-supervised learningelectrocardiographyinductive biases

Harness Engineering as Categorical Architecture

arXiv cs.AI · Bogdan Banu · 2026-05-12

The paper formalizes agent harness engineering through categorical architecture, proposing the triple (G, Know, Phi) from ArchAgents as a theoretical foundation. Memory, Skills, Protocols, and Harness Engineering map to coalgebraic state, operad-composed objects, syntactic wiring, and Architecture respectively. Structural guarantees like integrity gates and convergence checks are preserved via compiler functors targeting Swarms, DeerFlow, Ralph, and LangGraph. Validation shows certificate preservation across configurations, with LangGraph enabling native observability. An escalation experiment confirms model-parametric quality control in multi-agent tasks.

categorical architectureagent harnesscoalgebraic stateoperad-composed objectssyntactic wiring

TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning

arXiv cs.AI · Matthew M. Hong, Jesse Zhang, Anusha Nagabandi, Abhishek Gupta · 2026-05-12

The authors propose TMRL, a unified framework bridging behavioral cloning (BC) pre-training and reinforcement learning (RL) fine-tuning for robot policies. Their Context-Smoothed Pre-training (CSP) method injects forward-diffusion noise into policy inputs, creating a continuum between precise imitation and broad action coverage. During fine-tuning, Timestep-Modulated Reinforcement Learning (TMRL) enables dynamic adjustment of diffusion timestep conditioning, granting explicit control over exploration. The approach integrates with arbitrary policy inputs (states, 3D point clouds, image-based VLA policies) and improves RL fine-tuning sample efficiency. TMRL achieves successful real-world fine-tuning on complex manipulation tasks in under one hour.

behavioral cloningreinforcement learningforward-diffusion noisetimestep modulationsample efficiency

No More, No Less: Task Alignment in Terminal Agents

arXiv cs.AI · Sina Mavali, David Pape, Jonathan Evertz, Samira Abedini · 2026-05-12

The paper introduces TAB (Task Alignment Benchmark), a suite of 89 terminal tasks designed to evaluate agents' ability to selectively use relevant environmental cues while ignoring distractors. Derived from Terminal-Bench 2.1, TAB tasks are intentionally underspecified, requiring agents to interpret embedded cues in natural artifacts. Evaluation of ten frontier agents reveals a systematic gap between task capability and task alignment, with the strongest Terminal-Bench agent achieving high task completion but low task alignment. Analysis of six prompt-injection defenses shows that suppressing distractors also suppresses necessary cues, highlighting the need for selective instruction use in task-aligned agents.

task alignment benchmarkterminal agentsenvironmental cuesprompt-injection defensestask capability

TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion

arXiv cs.AI · Mohammad Khoshkdahan, Alexey Vinel · 2026-05-12

TriBand-BEV introduces a real-time LiDAR-only 3D pedestrian detection method using a novel bird's eye view (BEV) encoding with three height bands, reformulating 3D detection as 2D. The approach employs area attention, hierarchical bidirectional feature fusion (P1-P4), and distribution focal learning for oriented box prediction, with vertical rebinning and reflectance jitter for robustness. On KITTI, it achieves 58.7/52.6/47.2 BEV AP(%) for pedestrian detection (easy/moderate/hard) at 49 FPS, outperforming Complex-YOLO by +12.6/+7.5/+3.1%. The method includes an IQR filter for outlier removal and demonstrates stable occlusion handling.

bird's eye viewlidar3d detectionfeature fusionreal-time

Heterogeneous SoC Integrating an Open-Source Recurrent SNN Accelerator for Neuromorphic Edge Computing on FPGA

arXiv cs.AI · Michelangelo Barocci, Vittorio Fra, Enrico Macii, Gianvito Urgese · 2026-05-12

The authors present a heterogeneous System-on-Chip (SoC) integrating ReckOn, an open-source recurrent Spiking Neural Network (SNN) accelerator, with traditional processors (RISC-V-based X-HEEP and ARM) for neuromorphic edge computing on FPGA. The design validates functional equivalence with the taped-out ReckOn version through FPGA implementation, maintaining classification accuracy while offering a cost-effective alternative to custom silicon. Experimental evaluation demonstrates online learning capabilities on a Braille digit dataset subset, benchmarking against existing neuromorphic platforms. The work addresses prohibitive ASIC costs by leveraging FPGA programmability for flexible, open-source neuromorphic hardware development.

spiking neural networksneuromorphic computingfpga acceleratorsystem-on-chiponline learning

Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

arXiv cs.AI · Jiazhou Liang, Armin Toroghi, Yifan Simon Liu, Faeze Moradi Kalarde · 2026-05-12

The paper introduces Goal-Mem, a goal-oriented reasoning framework for RAG-based agentic memory in conversational LLMs that addresses limitations in multi-hop and commonsense reasoning. The method performs backward chaining from user utterances, decomposing goals into atomic subgoals and using targeted memory retrieval with Natural Language Logic for verifiable reasoning. Experiments on two datasets against nine baselines demonstrate consistent performance improvements, particularly in multi-hop reasoning and implicit inference tasks.

rag-based memorymulti-hop reasoningbackward chainingnatural language logicagentic llms

Self-Supervised Laplace Approximation for Bayesian Uncertainty Quantification

arXiv cs.AI · Julian Rodemann, Alexander Marquard, Thomas Augustin, Michele Caprio · 2026-05-12

The authors propose Self-Supervised Laplace Approximation (SSLA), a method for directly approximating posterior predictive distributions without computing parameter posteriors. Inspired by self-training in self-supervised learning, SSLA quantifies predictive uncertainty by refitting models on self-predicted data, yielding a deterministic, sampling-free approximation. An approximate variant, ASSLA, reduces computational costs by avoiding expensive refitting. Theoretical and empirical evaluations across Bayesian linear models and neural networks demonstrate superior predictive calibration compared to classical Laplace approximations, while maintaining computational efficiency, as validated on simulated and real-world regression tasks.

posterior predictive distributionself-supervised learningbayesian uncertainty quantificationlaplace approximationpredictive calibration

Not How Many, But Which: Parameter Placement in Low-Rank Adaptation

arXiv cs.AI · Arijit Sehanobish, Charles Lovering · 2026-05-12

The paper investigates the parameter placement problem in Low-Rank Adaptation (LoRA), focusing on which k trainable entries in the B matrix (with A frozen) impact performance. Under supervised fine-tuning (SFT), random and informed parameter subsets achieve comparable results, while gradient-informed placement is crucial for Generalized Reward-Penalty Optimization (GRPO) to recover standard LoRA accuracy. This divergence stems from gradient structure: SFT gradients are low-rank and stable, enabling coherent updates from any subset, whereas GRPO gradients are high-rank and orthogonal, requiring consistently signed gradients. A scoring procedure identifies critical parameters in under 10 seconds at <0.5% training cost, revealing concentration on residual-stream-writing projections (V, O, Down) across model families and scales (1.5B-8B).

low-rank adaptationparameter placementsupervised fine-tuninggradient-informed placementresidual-stream-writing projections

Uncertainty Quantification for LLM-based Code Generation

arXiv cs.AI · Senrong Xu, Yuhao Tan, Yanke Zhou, Guangyuan Wu · 2026-05-12

We propose RisCoSet, a method for uncertainty quantification in LLM-based code generation that addresses limitations of prior PAC prediction sets. RisCoSet leverages multiple hypothesis testing to construct risk-controlling prediction sets represented by partial programs, guaranteeing correct solutions with high confidence. The approach accommodates multiple valid outputs inherent to code generation, overcoming the single-label classification framework and monotonicity constraints of previous work. Experiments on three LLMs demonstrate RisCoSet's effectiveness, reducing code removal by up to 24.5% at equivalent risk levels compared to state-of-the-art methods.

uncertainty quantificationllm-based code generationrisk-controlling prediction setsmultiple hypothesis testingpartial programs

Overtrained, Not Misaligned

arXiv cs.AI · Joel Schreiber, Ariel Goldstein · 2026-05-12

This study provides the most comprehensive analysis to date of emergent misalignment (EM) in fine-tuned language models, demonstrating that EM is not universal but correlates with model size and emerges late in training. The authors evaluate 12 open-source models (8B to 671B parameters) across 4 families (Llama, Qwen, DeepSeek, GPT-OSS), analyzing over one million responses with multiple random seeds. Results show EM replicates in GPT-4o but occurs consistently in only 17% of models, with a strong size-EM correlation (r = 0.90). Practical mitigations include early stopping, which eliminates EM while retaining 93% task performance, and careful learning rate selection. Cross-domain validation confirms these findings generalize, particularly in medical fine-tuning.

emergent misalignmentfine-tuningearly stoppingcross-domain validationtask convergence

Mitigating Context-Memory Conflicts in LLMs through Dynamic Cognitive Reconciliation Decoding

arXiv cs.AI · Yigeng Zhou, Wu Li, Yifan Lu, Yequan Wang · 2026-05-12

The paper introduces Dynamic Cognitive Reconciliation Decoding (DCRD), a two-stage method for mitigating context-memory conflicts in large language models (LLMs). DCRD first predicts conflicts via attention map analysis, then routes inputs to either greedy decoding or context fidelity-based dynamic decoding. The approach maintains performance in conflict-free scenarios while resolving knowledge conflicts. Evaluated on the new ConflictKG benchmark and six QA datasets across four LLMs, DCRD achieves state-of-the-art results compared to existing baselines.

context-memory conflictsdynamic decodingattention mapparametric knowledgeknowledge conflict

DriftXpress: Faster Drifting Models via Projected RKHS Fields

arXiv cs.AI · Ali Falahati, Elliot Creager, Gautam Kamath, Shubhankar Mohapatra · 2026-05-12

DriftXpress accelerates drifting models for one-step generative modeling by approximating the drifting kernel in a low-rank feature space via projected RKHS fields. The method preserves the attraction-repulsion structure of the original drifting field while reducing computational costs during training. Evaluated on image-generation benchmarks, DriftXpress maintains comparable FID scores to standard drifting models while significantly decreasing wall-clock training time, demonstrating improved efficiency without sacrificing one-step inference advantages.

drifting modelsone-step generative modelingprojected rkhs fieldslow-rank approximationfid scores

MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification

arXiv cs.AI · Jueon Park, Wonjune Jang, Jiwoo Lee, Yein Park · 2026-05-12

MolDeTox introduces a novel benchmark for molecular detoxification, addressing limitations in existing toxicity repair benchmarks such as limited data diversity, low structural validity, and reliance on proxy models for toxicity assessment. The benchmark enables fine-grained evaluation of toxicity-aware molecular optimization across stepwise tasks. General-purpose Large Language Models (LLMs) and Vision Language Models (VLMs) are evaluated under diverse settings, demonstrating that fragment-level understanding and generation improves structural validity and molecular quality. Detailed task-level performance analysis provides interpretable insights into the detoxification process. The dataset is publicly available.

molecular detoxificationtoxicity repairstructural validityfragment-level generationbenchmark evaluation

A Deep Learning-based Receiver for Asynchronous Grant-Free Random Access in Control-to-Control Networks

arXiv cs.AI · Massimo Battaglioni, Edoardo Carnevali, Dania De Crescenzo, Enrico Testi · 2026-05-12

The paper introduces a deep learning-based receiver for asynchronous grant-free control-to-control (C2C) networks, addressing uncoordinated transmissions over shared channels. A convolutional neural network (CNN) detects command unit boundaries (start/tail sequences) directly from the received signal, leveraging LDPC-coded payloads and channel estimates for tail-sequence detection. Successive interference cancellation (SIC) improves decoding post-boundary identification. Simulations demonstrate reliable packet-boundary detection and low packet loss rates under high-traffic conditions.

grant-freeldpccnnsicasynchronous

Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics

arXiv cs.AI · Jishnu Sethumadhavan Nair, Patrice Bechard, Rishabh Maheshwary, Surajit Dasgupta · 2026-05-12

The paper argues that enterprise systems require runtime discovery of transition dynamics rather than relying solely on offline-trained world models, which degrade under deployment shift. It introduces enterprise discovery agents that read system configurations at inference time, and CascadeBench, a benchmark for evaluating cascade prediction in synthetic environments. Empirical results show discovery-based agents outperform traditional world models under dynamic shifts by grounding predictions in current instance logic.

world modelsenterprise systemsdeployment shiftruntime discoverycascadebench

Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete

arXiv cs.AI · Joonha Park, Jiseung Jeong, Taesik Gong · 2026-05-12

Premover introduces a lightweight module for Vision-Language-Action (VLA) policies that enables acting during instruction input rather than waiting for completion. The method freezes the VLA backbone and adds two small projection heads (image patches and language tokens) mapping to a shared space, supervised by target-object segmentation masks. A readiness threshold determines when to act. On LIBERO, Premover reduces wall-clock time by 13.6% (34.0s to 29.4s) while maintaining 95.1% success rate versus the full-prompt baseline, outperforming naive premoving (66.4%).

vision-language-actionprecomputationsegmentation masksreadiness thresholdprojection heads

ALGOGEN: Tool-Generated Verifiable Traces for Reliable Algorithm Visualization

arXiv cs.AI · Kunpeng Liao, Yuexiao Ma, Yisheng Lin, Hualin Zeng · 2026-05-12

ALGOGEN introduces a decoupled paradigm for reliable Algorithm Visualization (AV) by separating algorithm execution from rendering. The method employs Visualization Trace Algebra (VTA) to model algorithm states and operations, generating VTA-JSON traces via a Python tracker. Rendering is templatized using a Rendering Style Language (RSL), compiled deterministically into Manim, LaTeX/TikZ, or Three.js outputs. Evaluated on a LeetCode AV benchmark of 200 tasks, ALGOGEN achieves a 17.3% improvement in success rate (99.8% vs. 82.5%) over end-to-end methods, effectively mitigating LLM hallucinations and enhancing AV reliability.

visualization trace algebravta-jsonrendering style languagemanimllm hallucinations

MM-OptBench: A Solver-Grounded Benchmark for Multimodal Optimization Modeling

arXiv cs.AI · Zhong Li, Qi Huang, Yuxuan Zhu, Mohammad Mohammadi Amiri · 2026-05-12

We introduce MM-OptBench, a solver-grounded benchmark for multimodal optimization modeling that evaluates the ability of multimodal large language models (MLLMs) to construct mathematical formulations and solver-executable code from text-and-visual problem specifications. The framework generates 780 solver-verified instances across 6 optimization families, 26 subcategories, and 3 difficulty levels, ensuring structured inputs and reference files are derived from verified sources. Evaluations of 9 MLLMs (6 general-purpose, 3 math-specialized) reveal significant challenges: the top models achieve 52.1% and 51.3% pass@1, while math-specialized models solve 0/780 instances. Errors stem from data extraction and formulation/code generation. MM-OptBench establishes a testbed for solver-grounded multimodal intelligence.

multimodal optimizationsolver-groundedlarge language modelsmathematical formulationpass@1

CIDR: A Large-Scale Industrial Source Code Dataset for Software Engineering Research

arXiv cs.AI · Vladislav Savenkov · 2026-05-12

The Curated Industrial Developer Repository (CIDR) introduces a large-scale dataset of 2,440 proprietary software repositories totaling 373 million lines of code across 138 programming languages, collected through collaboration with 12 industrial partners. The dataset was constructed via a multi-stage pipeline involving structured partner onboarding, automated metadata filtering, manual code review, and deterministic anonymization of full version control histories. CIDR exclusively contains proprietary production codebases from domains including enterprise web/mobile development, fintech, and custom software consultancy, distinguishing it from open-source code corpora. The dataset supports research in code intelligence, software quality analysis, code language model training, developer behavior studies, and agent evaluation benchmarks, available under a restricted commercial license.

curated industrial developer repositoryproprietary codebasesdeterministic anonymizationversion control historycode intelligence

BoolXLLM: LLM-Assisted Explainability for Boolean Models

arXiv cs.AI · Du Cheng, Serdar Kadioglu, Xin Wang · 2026-05-12

BoolXLLM introduces a hybrid framework integrating Large Language Models (LLMs) into Boolean rule-based classifiers to enhance explainability. The method employs LLMs at three stages: feature selection for domain-relevant variables, threshold recommendation for numerical feature discretization, and rule compression for natural language explanations at global and local levels. This approach combines symbolic reasoning with language-based models to bridge formal explanations with human-understandable narratives. Empirical results indicate improved interpretability while maintaining competitive predictive performance, demonstrating the potential of LLM-assisted pipelines in explainable AI systems.

boolean modelslarge language modelsfeature selectionrule compressionexplainable ai

Rollout Cards: A Reproducibility Standard for Agent Research

arXiv cs.AI · Charlie Masters, Ziyuan Liu, Stefano V. Albrecht · 2026-05-12

We propose rollout cards, a reproducibility standard for agent research that preserves rollout records and declares reporting rules, addressing inconsistencies in task-success rates, cost/token accounting, and timing measurements. Through a structured audit of 50 repositories, we identify 37 cases where reporting rules alter outcomes and demonstrate that none report run failures or errors alongside headline scores. Validation in four public releases and re-grading benchmarks shows that reporting rules alone can change scores by 20.9 absolute percentage points and invert model rankings. A reference implementation integrated into Ergon is released, with rollout-card exports for benchmarks in tool use, software engineering, and multi-agent coordination.

rollout cardsreporting rulestask-success ratescost/token accountingbenchmark re-grading

It's Not the Size: Harness Design Determines Operational Stability in Small Language Models

arXiv cs.AI · Yong-eun Cho · 2026-05-12

This paper demonstrates that harness engineering significantly impacts the operational stability of small language models (SLMs) independent of model size. Through systematic experimentation with three SLMs (Gemma4 E2B, Qwen3.5:2B, LLaMA 3.2 3B) across 24 tasks, the authors evaluate three harness conditions: model-only, minimal-shell, and a 4-stage pipeline (plan->execute->verify->recover). The pipeline harness achieves optimal performance (TSR=0.952, VTSR=1.000) on Gemma4 E2B, with planning and recovery each contributing ~24.7% to total gains. Notably, scaffold collapse is observed in LLaMA 3.2 3B without harness support, yielding TSR=0.429 due to JSON structure violations.

harness engineeringscaffold collapsetask success ratesmall language modelsverification catch rate

Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning

arXiv cs.AI · Hyeonjin Kim, Hangyeol Jung, Heechan Yun, Sungjun Yun · 2026-05-12

SAEParate introduces concept-specific clustering in sparse autoencoder (SAE)-based unlearning for text-to-image diffusion models, addressing shared latent features across concepts. The method employs a concept-aware contrastive objective and enhances the encoder with a GeLU-based nonlinear transformation to achieve a more discriminative and disentangled latent space. Evaluated on UnlearnCanvas, SAEParate demonstrates state-of-the-art performance, particularly excelling in joint style-object unlearning by reducing interference between target and non-target concepts.

sparse autoencoderconcept-aware contrastivegelu transformationlatent spacetext-to-image diffusion

To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands

arXiv cs.AI · Fangyi Yu, Nabeel Seedat, Jonathan Richard Schwarz, Andrew M. Bean · 2026-05-12

This study investigates principal hierarchies in language models under high-stakes competing demands, revealing inconsistent alignment across domains and model families. The authors evaluate ten frontier models across 7,136 scenarios in legal and medical domains, testing adherence to professional standards when user instructions conflict with institutional or normative demands. Results show frequent failures to uphold professional standards during task execution, primarily through knowledge omission, even when models demonstrate relevant knowledge internally. Alignment hierarchies prove unstable across contexts and inconsistent across models, suggesting current alignment methods are insufficient for high-stakes professional deployments.

principal hierarchiesknowledge omissionalignment methodstask executionprofessional standards

Adaptive Multi-Round Allocation with Stochastic Arrivals

arXiv cs.AI · Yuqi Pan, Davin Choo, Haichuan Wang, Milind Tambe · 2026-05-12

The paper presents an adaptive multi-round resource allocation framework for stochastic network recruitment, where budget-constrained resources exhibit diminishing returns. The authors derive an exact greedy solution for single-round allocations via marginal survival probabilities, then address intractability in multi-round settings through a population-level surrogate value function. This enables polynomial-time dynamic programming using truncated probability generating functions. Theoretical analysis provides robustness guarantees under model misspecification, decomposing error into frontier and transition components. Empirical validation demonstrates effectiveness in real-world-inspired recruitment scenarios.

stochastic arrivalsdynamic programmingdiminishing returnssurrogate value functionprobability generating functions

Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization

arXiv cs.AI · Peipei Xu, SiYuan Ma, Yaohua Liu, Yu Wu · 2026-05-12

The paper introduces DIPS, a framework fine-tuning large language models (LLMs) as amortized Pareto-front generators for constrained bi-objective convex optimization. DIPS employs Numerically Grounded Token Initialization, a compact discretization scheme, and Three-Phase Curriculum Optimization to align structural validity, feasibility, and Pareto-front quality. A fine-tuned 7B-parameter model achieves normalized hypervolume ratios of 95.29% to 98.18% across five families of problems, solving instances in as little as 0.16 seconds with vLLM-accelerated inference. Results demonstrate LLMs' effectiveness in continuous Pareto-front approximation.

pareto-frontconstrained optimizationlanguage modelshypervolume ratiocurriculum optimization

Autonomy and Agency in Agentic AI: Architectural Tactics for Regulated Contexts

arXiv cs.AI · Damir Safin, Dian Balta · 2026-05-12

This work introduces a two-dimensional design space for agentic AI in regulated contexts, coupling agency (what the system can do) and autonomy (how much it acts without human involvement). Both dimensions are organized into five operational levels, ranging from human-commanded operation (L1) to fully autonomous monitoring (L5) for autonomy, and from reasoning over supplied context (L1) to committed writes to authoritative records (L5) for agency. Six architectural tactics—checkpoints, escalation, multi-agent delegation, tool provisioning, tool fencing, and write staging—are proposed to navigate this space, grounded in public-sector examples. Five deployment parameters—model capability, agent architecture, tool fidelity, workflow bottlenecks, and evaluation—are examined to shape achievable configurations independently of agency and autonomy.

agentic aiautonomyagencyarchitectural tacticsdeployment parameters

Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems

arXiv cs.AI · Josh Rosen, Seth Rosen · 2026-05-12

The paper introduces a systems-level data model for durable intermediate artifacts in agentic AI systems, addressing the ephemeral nature of intermediate work in multi-step, revisable tasks. The model formalizes intermediate artifacts as typed, structured, versioned, and dependency-aware entities, distinct from chat transcripts or hidden chain-of-thought. It specifies additive and superseding update semantics with explicit current-state resolution and emphasizes artifact lineage for durable state maintenance. The approach aims to enhance inspectability, revisability, and maintainability of AI-generated work, shifting evaluation focus from final-output quality to maintained-state quality.

intermediate artifactsagentic systemsupdate semanticsartifact lineagemaintained-state quality

Learning What Matters: Adaptive Information-Theoretic Objectives for Robot Exploration

arXiv cs.AI · Youwei Yu, Jionghao Wang, Zhengming Yu, Wenping Wang · 2026-05-12

The paper introduces Quasi-Optimal Experimental Design (QOED), an adaptive information-theoretic objective for robot exploration that addresses challenges in parameter learnability. QOED employs eigenspace analysis of the Fisher information matrix to identify observable subspaces and suppress nuisance parameters, providing a constant-factor approximation to ideal exploration objectives. Evaluated on navigation and manipulation tasks, QOED achieves performance improvements of 35.23% and 21.98% in identifiable-direction selection and nuisance suppression, respectively, and enhances model-based policy optimization over RL baselines.

fisher informationparameter identifiabilityrobot explorationoptimal experimental designnuisance suppression

Property-Level Reconstructability of Agent Decisions: An Anchor-Level Pilot Across Vendor SDK Adapter Regimes

arXiv cs.AI · Oleg Solozobov · 2026-05-12

The study evaluates property-level reconstructability of agent decisions across six vendor SDK regimes using an unmodified Decision Trace Reconstructor. Analyzing pinned worked-example anchors, it classifies Decision Event Schema (DES) properties into four reconstructability categories (fully fillable to opaque). Results show strict-governance-completeness tiers ranging from 42.9% to 85.7%, identifying one regime-independent gap (reasoning trace), four regime-dependent gaps, and one Mixed property. The single-annotator pilot study provides checksum-verifiable outputs via a deposited reproducibility package.

decision trace reconstructorvendor sdk regimesproperty-level reconstructabilitydecision event schemastrict-governance-completeness

The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments

arXiv cs.AI · Ofir Itzhak Shahar, Gur Elkin, Ohad Ben-Shahar · 2026-05-12

The paper introduces GAP, a novel dataset of jigsaw puzzles featuring synthetic, eroded fragments with unrestricted shapes, derived from real-world archaeological artifacts. It proposes PuzzleFlow, a Vision Transformer (ViT) and Flow-Matching based framework for solving such puzzles, outperforming existing methods on the GAP dataset. The approach addresses limitations of prior work constrained to square pieces, demonstrating superior performance through advanced architectural and computational techniques.

jigsaw puzzlesarchaeological fragmentsvision transformerflow-matchingdataset

The Deepfakes We Missed: We Built Detectors for a Threat That Didn't Arrive

arXiv cs.AI · Shaina Raza · 2026-05-12

This position paper identifies a misalignment between deepfake research priorities and observed real-world harms, arguing that the field's focus on public-figure manipulation (2017-2019 threat model) has overlooked dominant emerging threats. Through empirical analysis of 2022-2026 incidents, the authors demonstrate that non-consensual intimate imagery (NCII), voice-clone scams, and emotional-manipulation fraud constitute primary harms, while predicted large-scale misinformation failed to materialize. The paper attributes this gap to structural research inertia, proposes rebalancing efforts toward under-defended harm categories, and outlines three concrete technical research agendas to address current threats.

deepfake detectionnon-consensual intimate imageryvoice-clone scamsthreat modelingmisinformation defense

Clausal Deletion Backdoors for QBF: a Parameterized Complexity Approach

arXiv cs.AI · Leif Eriksson, Victor Lagerkvist, Sebastian Ordyniak, George Osipov · 2026-05-12

The paper introduces clause covering (CC) backdoors as a new parameterized approach for solving quantified Boolean formulas (QBF), focusing on tractable base classes (Horn, 2-CNF, linear equations). It establishes W[1]-hardness for Horn backdoors but proves fixed-parameter tractability (FPT) for 2-CNF and linear equations via propagation and Gaussian elimination techniques. The work identifies a key missing case for a complete dichotomy, advancing theoretical understanding of QBF solvers in parameterized complexity.

quantified boolean formulasparameterized complexityclause covering backdoorsfixed-parameter tractabilitygaussian elimination

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

arXiv cs.AI · Zhong Guan, Yongjian Guo, Haoran Sun, Wen Huang · 2026-05-12

The paper identifies and addresses the missing-old-logit problem in asynchronous reinforcement learning for large language model agents, where delayed updates and partial rollouts lead to semantic entanglement between training--inference discrepancy and policy-staleness correction. Three exact old-logit acquisition strategies are proposed: snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption, alongside an approximate correction method using revised PPO-EWMA. The revised PPO-EWMA method demonstrates significant improvements in both training speed and optimization performance.

asynchronous reinforcement learningoff-policy correctionppo-ewmamissing-old-logit problemsemantic entanglement

Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection

arXiv cs.AI · Muhammad Aqeel, Maham Nazir, Uzair Khan, Marco Cristani · 2026-05-12

The paper introduces AVA-DINO, an anomaly-aware vision-language adaptation framework for zero-shot anomaly detection that leverages asymmetric distributions of normal and anomalous data. The method employs dual specialized branches for normal and anomalous patterns, trained jointly with text-guided routing and regularization to ensure branch specialization. At inference, it dynamically combines branches using image inputs and predefined language descriptions. Evaluated across nine benchmarks, AVA-DINO achieves 93.5% image-AUROC on MVTec-AD and demonstrates robust cross-domain generalization to medical imaging without domain-specific tuning.

zero-shotanomaly detectionvision-languageasymmetric distributionsdynamic routing

SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory

arXiv cs.AI · Juntong Wang, Haoyue Zhao, guanghui Pan, Xiyuan Wang · 2026-05-12

SAGE introduces a Self-evolving Agentic Graph-memory Engine to address long-term memory bottlenecks in language agents by modeling graph memory as a dynamic substrate. The framework combines a memory writer, which incrementally constructs structured graph memory from interaction histories, with a Graph Foundation Model-based memory reader for retrieval and feedback. Evaluations on multi-hop QA, open-domain retrieval, domain-specific review QA, and long-term agent-memory benchmarks demonstrate improved evidence recovery, answer grounding, and retrieval efficiency. SAGE achieves the best average rank on multi-hop QA after two self-evolution rounds and reaches 82.5/91.6 Recall@2/5 on NQ in zero-shot open-domain transfer.

graph memoryself-evolvingmulti-hop qaretrieval efficiencyhallucination-diagnostic

Hölder Policy Optimisation

arXiv cs.AI · Yuxiang Chen, Dingli Liang, Yihang Chen, Ziqin Gong · 2026-05-12

The paper introduces Hölder Policy Optimisation (HölderPO), a generalized framework for policy optimization in large language models that unifies token-level probability aggregation via the Hölder mean. By modulating the parameter $p$, the method dynamically balances gradient concentration and variance bounds, addressing limitations of fixed aggregation mechanisms in Group Relative Policy Optimisation (GRPO). The approach includes a dynamic annealing algorithm to schedule $p$ across training. Evaluations show HölderPO achieves 54.9% accuracy on mathematical benchmarks, a 7.2% improvement over GRPO, and 93.8% success rate on ALFWorld.

hölderpopolicy optimizationgradient concentrationdynamic annealingalfworld

OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

arXiv cs.AI · Yuchen Deng, Zidang Cai, Hai-Tao Zheng, Jie Wang · 2026-05-12

OmniRefine introduces a training-free two-stage framework for efficient audio-visual token compression in omnimodal large language models (Omni-LLMs), addressing inference cost challenges. The method first refines native chunk boundaries into cross-modally aligned compression units via frame-audio similarity and dynamic programming (Correspondence-Preserving Chunk Refinement). Second, it jointly compresses video and audio tokens within each unit to reduce redundancy while preserving critical evidence (Modality-Aware Cooperative Compression). Experiments demonstrate OmniRefine achieves superior efficiency-performance trade-offs, maintaining 46.7% accuracy on WorldSense at a 44% token retention ratio, nearly matching full-token performance.

omnimodal large language modelstoken compressioncross-modal alignmentdynamic programmingmodality-aware cooperative compression

Scaling Laws and Tradeoffs in Recurrent Networks of Expressive Neurons

arXiv cs.AI · Aaron Spieler, Georg Martius, Anna Levina · 2026-05-12

The paper introduces the ELM Network, a recurrent architecture with Expressive Leaky Memory (ELM) neurons, designed to explore optimal parameter allocation between unit count ($N$), per-unit complexity ($k_e$), and connectivity ($k_c$). ELM neurons emulate cortical functionality, enabling stable training across scales. Evaluated on the SHD-Adding task and Enwik8 character-level language modeling, performance improves monotonically along each axis, with a non-trivial tradeoff optimum under fixed budgets. Larger budgets favor more complex neurons. An information-theoretic model explains diminishing returns via signal-to-noise saturation and redundancy. Scaling laws trace a near-Pareto frontier, challenging the default use of simple units in machine learning.

elm networkexpressive leaky memoryparameter allocationscaling lawssignal-to-noise saturation

Rethink the Role of Neural Decoders in Quantum Error Correction

arXiv cs.AI · Ge Yan, Shanchuan Li, Yuxuan Du · 2026-05-12

This work reevaluates neural decoders for quantum error correction (QEC) in surface codes, focusing on accuracy-latency tradeoffs for code distances up to d=9 (161 physical qubits). The authors unify and redesign neural decoders into five architectural paradigms and develop an end-to-end compression pipeline for FPGA deployment. Key findings include: (i) decoding performance is more dependent on data scale than architectural complexity, (ii) appropriate inductive bias is crucial for high accuracy, and (iii) INT4 quantization is necessary to meet microsecond latency requirements. These insights provide actionable guidance for scalable, real-time neural QEC decoding.

quantum error correctionneural decoderssurface codesfpgainductive bias

Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

arXiv cs.AI · Che Liu, Lichao Ma, Xiangyu Tony Zhang, Yuxin Zhang · 2026-05-12

The paper introduces OmniClean, a visually debiased evaluation subset (8,551 queries from 16,968) for omni-modal language models, addressing benchmark inflation from visual shortcuts. It proposes OmniBoost, a three-stage post-training method for Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR (reinforcement learning with visual rewards), and SFT on self-distilled data. Results show RLVR drives broad improvements, while self-distillation reshapes performance profiles, enabling the 3B model to match Qwen3-Omni-30B-A3B-Instruct without stronger teacher supervision.

omni-modalvisual debiasingreinforcement learningself-distillationpost-training

Spectral Vision Transformer for Efficient Tokenization with Limited Data

arXiv cs.AI · Alexandra G. Roberts, Maneesh John, Jinwei Zhang, Dominick Romano · 2026-05-12

The paper introduces a spectral vision transformer (ViT) architecture optimized for efficient tokenization in data-scarce scenarios, particularly medical imaging. The method leverages spectral projections to achieve spatial invariance and optimal signal-to-noise ratio, reducing computational complexity compared to spatial ViTs. Evaluations demonstrate competitive or superior performance against compact/standard ViTs, CNNs with attention, shifted window transformers, MLPs, and logistic regression, despite fewer parameters. Validation includes simulated, public, and clinical datasets, with code released on GitHub.

spectral vision transformertokenizationspatial invariancesignal-to-noise ratiomedical imaging

Efficient and Adaptive Human Activity Recognition via LLM Backbones

arXiv cs.AI · Aleksandr Bredikhin, Philippe Lalanda, German Vega · 2026-05-12

This paper introduces a novel approach for Human Activity Recognition (HAR) by repurposing large pretrained language models (LLMs) as generic temporal backbones, eliminating the need for task-specific Transformer models. A structured convolutional projection bridges the modality gap between inertial sensor data and LLMs, while parameter-efficient Low-Rank Adaptation (LoRA) adapts the frozen pretrained backbone. Experiments on standard HAR benchmarks demonstrate rapid convergence, strong data efficiency, and robust cross-dataset transfer, particularly in low-data and few-shot settings. The results highlight the complementary roles of convolutional frontends and LLMs in handling local invariances and capturing long-range temporal dependencies, respectively.

human activity recognitionlarge language modelslow-rank adaptationconvolutional projectiontemporal dependencies

LLMs and the ZPD

arXiv cs.AI · Peter Wallis · 2026-05-12

The article proposes that large language models (LLMs) engage in 'primitive thinking' through practices rather than distributed representations, aligning with Vygotsky's concept of Zones of Proximal Development (ZPD). It argues that LLMs do not hallucinate but 'dream,' suggesting a shift from guardrails to investigating cognitive tools enabling common-sense behaviors. The core claim is that interaction is fundamental to human communication, not merely supplementary to understanding. This perspective reinterprets LLM mechanisms by emphasizing the role of practices and interaction in cognitive processes.

large language modelszones of proximal developmentprimitive thinkingdistributed representationscognitive tools

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

arXiv cs.AI · Chang Jin, An Wang, Zeming Wei, Kai Wang · 2026-05-12

SkillSafetyBench introduces a benchmark for evaluating safety vulnerabilities in modular skill-based LLM agents, focusing on non-user attack surfaces. The benchmark comprises 155 adversarial cases across 47 tasks, 6 risk domains, and 30 safety categories, each verified by case-specific rule-based verifiers. Experiments with CLI agents and multiple model backends demonstrate that localized attacks can consistently induce unsafe behavior, revealing distinct failure patterns across domains, attack methods, and scaffold-model pairings. Results highlight that agent safety depends on skill interpretation, workflow context trust, and executable environment interactions, beyond model-level alignment.

skill-based agentsadversarial casesrule-based verifiernon-user attacksworkflow context

L2P: Unlocking Latent Potential for Pixel Generation

arXiv cs.AI · Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang · 2026-05-12

The paper introduces Latent-to-Pixel (L2P), a framework for efficiently transferring knowledge from pre-trained Latent Diffusion Models (LDMs) to pixel-space generation. L2P replaces the VAE with large-patch tokenization, freezes intermediate LDM layers, and trains only shallow layers to map latent to pixel space, using solely LDM-generated synthetic data. This approach achieves rapid convergence with minimal resources (8 GPUs), enables native 4K resolution, and matches source LDM performance on DPG-Bench while reaching 93% on GenEval.

latent diffusion modelspixel-space generationlarge-patch tokenizationsynthetic training data4k resolution

LegalCheck: Retrieval- and Context-Augmented Generation for Drafting Municipal Legal Advice Letters

arXiv cs.AI · Virgill van der Meer, Julien Rossi · 2026-05-12

LegalCheck introduces a Retrieval- and Context-Augmented Generation (RAG/CAG) system for automating municipal legal advice letter drafting, addressing public-sector legal staff shortages in the Netherlands. The system integrates a large language model (LLM) with curated legal knowledge bases, retrieving relevant laws and precedents while incorporating case-specific details via controlled prompting. Deployed in the Municipality of Amsterdam, LegalCheck generated near-final letters in minutes, achieving 80-100% legal reasoning accuracy and ensuring high legal consistency. Expert-in-the-loop review maintained legal soundness, reducing workload while preserving human judgment. Results demonstrate efficiency gains, improved consistency, and positive user acceptance, showcasing responsible AI deployment in legal domains.

retrieval-augmented generationcontext-augmented generationlarge language modellegal knowledge basescontrolled prompting

CR^2: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference

arXiv cs.AI · Nan Xue, Shengkang Chen, Zhiyong Chen, Jiangchao Yao · 2026-05-12

The paper introduces CR^2, a cost-aware risk-controlled routing framework for wireless device-edge LLM inference. CR^2 employs a two-stage architecture with a lightweight on-device margin gate and an edge-side utility selector, optimizing latency and energy trade-offs under constrained resources. A conformal risk control (CRC) procedure calibrates thresholds for explicit false-acceptance risk management. Experiments demonstrate CR^2 matches full-information routing performance while reducing deployment costs by up to 16.9% at comparable accuracy.

llm inferencedevice-edge routingconformal risk controlcost-aware optimizationwireless edge deployment

The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

arXiv cs.AI · Bole Ma, Ayesha Afzal, Jan Eitzinger, Gerhard Wellein · 2026-05-12

The study reveals that power capping in LLM serving is ineffective during autoregressive decode, the dominant production phase. Testing four attention architectures (GQA, MLA, Gated DeltaNet, Mamba2) on NVIDIA H200, decode consumes only 137-300W of 700W GPU capacity, leaving power headroom unused due to memory-bound operations. Firmware clock throttling further distorts measurements. Clock locking emerges as a superior alternative, recovering up to 32% decode energy with minimal throughput loss. Three DVFS behavioral classes are identified, with attention replacements showing a pattern of high prefill cost offset by efficient decode, reducing total request energy by half versus GQA at production batches.

power cappingautoregressive decodeclock lockingattention architecturesdvfs

BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts

arXiv cs.AI · Xiaoting Lyu, Yufei Han, Hangwei Qian, Haoyuan Yu · 2026-05-12

The paper introduces BadSKP, a backdoor attack targeting knowledge graph (KG)-enhanced LLMs that use soft prompts. Unlike text-channel attacks, BadSKP exploits the graph-to-prompt interface via multi-stage optimization: constructing adversarial embeddings, optimizing poisoned nodes, and approximating them with fluent attributes. Experiments on two KG-enhanced LLMs across four datasets demonstrate high attack success in frozen and trojaned settings, while text-only attacks remain ineffective due to semantic anchoring by graph-derived prompts.

backdoor attackknowledge graphsoft promptssemantic anchoringmulti-stage optimization

A Transfer Learning Evaluation of Deep Neural Networks for Image Classification

arXiv cs.AI · Nermeen Abou Baker, Nico Zengeler, Uwe Handmann · 2026-05-12

The paper presents a systematic evaluation of transfer learning performance for image classification across multiple pre-trained models. The study compares eleven ImageNet-trained architectures (unspecified) on five target datasets, modifying output layers and network parameters. Metrics include accuracy, accuracy density (unspecified), training time, and model size, evaluated across single-episode and ten-episode training regimes. Results demonstrate trade-offs between model performance and computational resources, though specific numerical outcomes are not provided in the excerpt.

transfer learningimage classificationpre-trained modelsaccuracy densityimagenet

Random-Set Graph Neural Networks

arXiv cs.AI · Tommy Woodley, Shireen Kudukkil Manchingal, Matteo Tolloso, Davide Bacciu · 2026-05-12

The paper introduces Random-Set Graph Neural Networks (RS-GNNs), a novel framework for modeling node-level epistemic uncertainty in GNNs using belief functions (finite random sets). The approach incorporates a belief-function head that predicts a random set over classes, enabling both precise probability predictions and epistemic uncertainty quantification. Evaluated on 9 graph learning datasets, including Nuscene and ROAD autonomous driving benchmarks, RS-GNNs demonstrate superior uncertainty quantification performance compared to existing methods.

graph neural networksepistemic uncertaintybelief functionsrandom setsautonomous driving

On the Limitations of Large Language Models for Conceptual Database Modeling

arXiv cs.AI · Arthur F. Siqueira, Carlos D. S. Nogueira, Eduarda Farias, Claudio E. C. Campelo · 2026-05-12

The study evaluates the limitations of Large Language Models (LLMs) in conceptual database modeling by generating Entity-Relationship (ER) diagrams from natural language requirements. Three LLMs were tested using Zero-Shot, Chain of Thought, and Chain of Thought + Verifier prompting techniques across scenarios of increasing complexity. Results show that while LLMs perform reasonably in simpler contexts, their reliability declines with complexity, exhibiting inconsistencies, ambiguities, and constraint representation failures. The findings suggest LLMs are not yet mature for complex modeling tasks, and validation costs may outweigh productivity benefits.

large language modelsentity-relationship diagramsconceptual modelingprompt engineeringrelational databases

High-lift Wing Separation Control via Bayesian Optimization and Deep Reinforcement Learning

arXiv cs.AI · Ricard Montalà, Bernat Font, Oriol Lehmkuhl, Ricardo Vinuesa · 2026-05-12

The study demonstrates active flow control optimization for a 30P30N high-lift wing using Bayesian optimization (BO) and deep reinforcement learning (DRL) at Re_c = 450,000 and α = 23°. Wall-resolved large-eddy simulations (LES) validated the uncontrolled configuration against literature. BO achieved a +10.9% efficiency improvement via -9.7% drag reduction while maintaining lift, whereas DRL yielded minor aerodynamic gains due to constrained exploration from penalty-dominated rewards. Results emphasize the importance of reward design and computational acceleration for DRL-based flow control at high Reynolds numbers.

active flow controlbayesian optimizationdeep reinforcement learninglarge-eddy simulationshigh reynolds numbers

Cooperative Robotics Reinforced by Collective Perception for Traffic Moderation

arXiv cs.AI · Mohammad Khoshkdahan, John Pravin Arockiasamy, Andy Flores Comeca, Alexey Vinel · 2026-05-12

The paper introduces a cooperative robotics system enhanced by collective perception for traffic moderation at non-line-of-sight (NLOS) intersections. The system employs a humanoid robot that integrates dual-camera infrastructure for vehicle detection and V2X-based cooperative awareness messages (CAM) to assess collision risks. A fusion module combines these data streams to maintain real-time situational awareness, while a Zone of Danger (ZoD) predicts unsafe merges. Upon detecting a hazard, the robot issues a STOP gesture and physically blocks the merging path. Deployed at the Future Mobility Park in Rotterdam, the system demonstrated reliable hazard prediction and prevention of unsafe merges in NLOS conditions.

non-line-of-sightcollective perceptionv2xzone of dangerfusion module

Assessing and Mitigating Miscalibration in LLM-Based Social Science Measurement

arXiv cs.AI · Jinyuan Wang, Ningyuan Deng, Yi Yang · 2026-05-12

The paper investigates miscalibration in LLM-based social science measurements, demonstrating its impact on downstream analyses through a Federal Open Market Committee (FOMC) case study. Auditing 14 constructs across proprietary (GPT-5-mini, DeepSeek-V3.2) and open-source models reveals poor alignment between confidence and correctness. A soft label distillation pipeline is proposed, converting LLM scores into calibrated targets for training smaller classifiers. This method reduces Expected Calibration Error (ECE) by 43.2% and Brier score by 34.0%, emphasizing calibration as essential for measurement validity.

llmcalibrationsocial sciencedistillationece

Counterfactual Trace Auditing of LLM Agent Skills

arXiv cs.AI · Xiaolin Zhou, Jinbo Liu, Li Li, Ryan A. Rossi · 2026-05-12

The paper introduces Counterfactual Trace Auditing (CTA), a framework for evaluating how skills affect LLM agent behavior beyond pass rates. CTA compares agent traces with and without skills, segments them into goal-directed phases, and annotates Skill Influence Patterns (SIPs). Applied to Claude on 49 software engineering tasks, CTA reveals 522 SIP instances despite a mere +0.3pp pass rate change, uncovering behavioral shifts like template copying and excess planning. Key findings include SIP concentration in high-baseline tasks, recoverable gains in moderate tasks (with higher token costs), and baseline-dependent SIP dominance.

counterfactual trace auditingskill influence patternsllm agentsbehavioral auditpass rate

From Noise to Diversity: Random Embedding Injection in LLM Reasoning

arXiv cs.AI · Heejun Kim, Seungpil Lee, Jewon Yeom, Jaewon Sok · 2026-05-12

This work introduces Random Soft Prompts (RSPs), a training-free method that appends freshly sampled random embedding vectors to LLM inputs, isolating the structural effect of soft prompt injection. RSP vectors are drawn from an isotropic Gaussian fitted to the pretrained embedding table's statistics, inducing early-stage token diversity and branching reasoning trajectories. Empirical results show RSPs achieve accuracy comparable to optimized soft prompts on math reasoning benchmarks, widen Pass@N via temperature sampling, and extend benefits to DAPO training. The mechanism involves attention flattening initial token distributions, followed by natural dilution toward a single completion.

random soft promptstoken diversityisotropic gaussiantemperature samplingdapo training

When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents

arXiv cs.AI · Xiaolin Zhou, Aojie Yuan, Zheng Luo, Zipeng Ling · 2026-05-12

The paper introduces RobustBench-TC, a benchmark with 22 perturbation types for tool-use agents, organized by four POMDP components, addressing real-world deployment failures. It evaluates 21 models (1.5B to 32B parameters), revealing uneven robustness: observation perturbations reduce accuracy by <5%, while reward-relevant and transition perturbations reduce it by ~40% and ~30%, respectively. ToolRL-DR, a domain-randomization RL recipe, trains agents on perturbation-augmented trajectories, achieving ~75% clean accuracy and narrowing the gap to baselines. It closes ~27% of the transition gap, demonstrating transfer to unseen runtime failures.

pomdpdomain-randomizationtool-use agentsrobustbench-tctoolrl-dr

Domain Restriction via Multi SAE Layer Transitions

arXiv cs.AI · Elias Shaheen, Avi Mendelson · 2026-05-12

The paper introduces a method for detecting out-of-domain (OOD) inputs in Large Language Models (LLMs) by analyzing internal layer transitions via sparse autoencoders (SAEs). It proposes lightweight techniques to learn domain-specific signatures from these transitions, offering interpretability into the LLM's decision process. Evaluated on Gemma-2B and Gemma-9B, the approach demonstrates strong OOD detection capabilities while revealing fine-grained input processing details.

large language modelsout-of-domain detectionsparse autoencoderlayer transitionsinterpretability

Rethinking Positional Encoding for Neural Vehicle Routing

arXiv cs.AI · Chuanbo Hua, Federico Berto, Andre Hottung, Nayeli Gast Zepeda · 2026-05-12

The paper introduces a hierarchical anisometric positional encoding (PE) tailored for transformer-based neural combinatorial optimization (NCO) of vehicle routing problems (VRPs). The proposed PE addresses three structural properties of routing solutions: anisometric node distances, cyclic and direction-aware topology, and hierarchical depot-anchored global multi-route structure, grounded in geometric principles. It combines distance-indexed, circularly consistent in-route encoding with depot-anchored angular cross-route encoding. Extensive experiments across diverse VRP variants show that geometry-grounded PE consistently outperforms index-based alternatives, with gains generalizing across problem variants, model architectures, and distribution shifts.

positional encodingneural combinatorial optimizationvehicle routing problemsanisometric distancesgeometry-grounded

Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving

arXiv cs.AI · Shuo Xu, Jiakun Zhang, Junyu Lai, Chun Cao · 2026-05-12

The paper introduces segment-level supervision, a novel training strategy for LLM-based theorem proving that extracts locally coherent proof segments to balance the granularity of supervision. This approach addresses limitations of step-level tactic prediction and whole-proof generation by preserving both local coherence and global structure. Evaluated on STP, LeanWorkbook, and NuminaMath-LEAN, the method achieves proof success rates of 64.84%, 60.90%, and 66.31% on miniF2F, outperforming baselines. Goal-aware rollout further enhances existing step-level provers, increasing BFS-Prover-V2-7B's success rate from 68.77% to 70.74% and InternLM2.5-StepProver's from 59.59% to 60.33%.

segment-level supervisiontheorem provingproof trajectoriesgoal-aware rolloutlean 4

Beyond Point-wise Neural Collapse: A Topology-Aware Hierarchical Classifier for Class-Incremental Learning

arXiv cs.AI · Huiyu Yi, Zhiming Xu, Dunwei Tu, Zhicheng Wang · 2026-05-12

We propose Hierarchical-Cluster SOINN (HC-SOINN), a topology-aware hierarchical classifier for Class-Incremental Learning (CIL) that addresses the limitations of Nearest Class Mean (NCM) by capturing complex class manifolds. HC-SOINN employs a 'local-to-global' representation and integrates Structure-Topology Alignment via Residuals (STAR) to adapt to non-linear feature drift through fine-grained pointwise trajectory tracking. Theoretical analysis and Procrustes distance experiments demonstrate resilience to manifold deformations. When integrated into seven state-of-the-art CIL methods, HC-SOINN consistently improves performance, validating its robustness and effectiveness.

class-incremental learningnearest class meanhierarchical-cluster soinnstructure-topology alignmentprocrustes distance

AccLock: Unlocking Identity with Heartbeat Using In-Ear Accelerometers

arXiv cs.AI · Lei Wang, Jiangxuan Shen, Xi Zhang, Dalin Zhang · 2026-05-12

AccLock introduces a passive earphone-based authentication system leveraging in-ear ballistocardiogram (BCG) signals for secure and unobtrusive user verification. The system employs a two-stage denoising scheme to mitigate inherent and sporadic interference, a disentanglement-based deep learning model (HIDNet) to isolate user-specific features from shared nuisance components, and a scalable Siamese network framework eliminating per-user classifier training. Extensive experiments with 33 participants demonstrate AccLock's efficacy, achieving an average false acceptance rate (FAR) of 3.13% and false rejection rate (FRR) of 2.99%, validating its practical feasibility.

ballistocardiogramsiamese networkdisentanglementdenoisingauthentication

Toward Modeling Player-Specific Chess Behaviors

arXiv cs.AI · Loris Sogliuzzo, Aloïs Rautureau, Eric Piette · 2026-05-12

A novel architecture is proposed to model player-specific chess behaviors by adapting the unified Maia-2 model with champion-specific embeddings and integrating a limited Monte Carlo Tree Search (MCTS) process for tactical exploration. The approach introduces a behavioral metric based on Jensen-Shannon divergence, compressing high-dimensional board representations into a latent space using AutoEncoder and Uniform Manifold Approximation and Projection (UMAP) for move distribution comparison. Evaluation across 16 historical world champions shows that while MCTS decreases standard move accuracy, it improves stylistic alignment, reducing average Jensen-Shannon divergence. The metric effectively discriminates between individual players, advancing behavioral alignment evaluation between players and AI models.

monte carlo tree searchjensen-shannon divergenceautoencoderuniform manifold approximation and projectionmove accuracy

Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems

arXiv cs.AI · Zhaojiacheng Zhou · 2026-05-12

Proteus introduces a self-evolving red-team framework to measure adaptive leakage risk in agent skill ecosystems, where attackers iteratively revise skills to bypass audits and cause runtime harm. The framework explores a five-axis skill-attack space using an audit-sandbox-oracle pipeline, enabling cross-round mutation, path expansion for alternative attack implementations, and surface expansion for transferring attack patterns to new objectives. Proteus achieves 40-90% Attack Success Rate at 5 rounds (ASR@5) across eight phase-1 cells, with phase-2 expansion producing 438 bypassing and lethal variants. SkillVetter is bypassed ≥93% in every cell, while AI-Infra-Guard admits up to 41.3% joint-success, demonstrating significant underestimation of residual risk in current skill vetting.

adaptive leakageaudit-sandbox-oracle pipelinepath expansionsurface expansionattack success rate

Incentivizing Truthfulness and Collaborative Fairness in Bayesian Learning

arXiv cs.AI · Rachael Hwee Ling Sim, Jue Fan, Xiao Tian, Xinyi Xu · 2026-05-12

The paper introduces a novel mechanism ensuring collaborative fairness (F) and incentivizing data truthfulness (T) in Bayesian collaborative learning. The approach combines semivalues (e.g., Shapley value) for fairness with a truthful data valuation function (DVF) based on an undisclosed validation set. A key condition ensures sources maximize expected data values by submitting truthful datasets. Theoretical analysis explores relaxations of (F) and (T) under budget constraints or absence of validation sets. Empirical validation on synthetic and real-world datasets confirms the mechanism's effectiveness.

bayesian learningsemivaluesdata valuation functioncollaborative fairnesstruthfulness

From Clever Hans to Scientific Discovery: Interpreting EEG Foundational Transformers with LRP

arXiv cs.AI · Justus Meyer zu Bexten, Nico Scherf, Bogdan Franczyk, Simon M. Hofmann · 2026-05-12

This study evaluates Layer-wise Relevance Propagation (LRP) as a post-hoc attribution method for interpreting Transformer-based EEG foundation models (EEG-FMs). The authors extend LRP from CNNs to Transformer architectures, demonstrating its utility in verifying model decisions and uncovering biologically plausible hypotheses. Key findings include detecting 'Clever Hans' behaviors in motor imagery tasks (where models rely on ocular artifacts) and identifying a central electrode cluster as a potential sensorimotor arousal signature in affect prediction. The work positions LRP as a critical tool for both validation and discovery in EEG-FMs as they scale.

eegtransformerlrpattributionfoundation models

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

arXiv cs.AI · Bo Yin, Qi Li, Xinchao Wang · 2026-05-12

The paper introduces FATE, an on-policy self-evolution framework for improving LLM agent safety by leveraging failure trajectories as repair supervision. FATE employs verifier-scored failures to generate repair candidates, filtered across security, utility, over-refusal control, and trajectory validity, and uses Pareto-Front Policy Optimization (PFPO) to balance safety-utility trade-offs. Evaluations on AgentDojo, AgentHarm, and ATBench demonstrate FATE's effectiveness: it reduces attack success rate by 33.5%, harmful compliance by 82.6%, and improves trajectory-safety diagnosis by 6.5% compared to baselines.

on-policy learningfailure trajectoriespareto-front optimizationsafety alignmentllm agents

Modulation Consistency-based Contrastive Learning for Self-Supervised Automatic Modulation Classification

arXiv cs.AI · Chenxu Wang, Shuang Wang, Lirong Han, Xinyu Hu · 2026-05-12

The paper introduces Mod-CL, a self-supervised contrastive learning framework for Automatic Modulation Classification (AMC) that leverages intra-instance modulation consistency as a structural prior. By constructing positive pairs from temporal segments of the same signal, Mod-CL learns modulation-invariant representations while suppressing nuisance variations like noise and channel effects. The method includes a tailored contrastive objective combining temporal segmentation and data augmentation. Experiments on RadioML datasets demonstrate Mod-CL's superiority over baselines, particularly in low-label regimes, with significant gains in linear probing accuracy.

modulation classificationcontrastive learningself-supervised learningtemporal segmentationradioml

IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

arXiv cs.AI · Chia-Pei, Chen, Kentaroh Toyoda, Anita Lai · 2026-05-12

IPI-proxy introduces an intercepting proxy for red-teaming web-browsing AI agents against indirect prompt injection (IPI). The tool dynamically rewrites HTTP responses from whitelisted domains, embedding 820 deduplicated attack strings from six benchmarks into HTML via configurable techniques (e.g., hidden CSS, LLM-generated prose). A YAML-driven harness parameterizes payloads, embedding methods, and insertion points (6 locations), enabling systematic evaluation without mock environments. The proxy logs exfiltration attempts, providing a reproducible substrate for hardening agents against real-world IPI threats on live retrieval surfaces.

indirect prompt injectionweb-browsing agentsintercepting proxyred-teamingyaml-driven harness

Very Efficient Listwise Multimodal Reranking for Long Documents

arXiv cs.AI · Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh · 2026-05-12

ZipRerank introduces a highly efficient listwise multimodal reranker addressing computational bottlenecks in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) for long documents. It reduces input length via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. The model employs a two-stage training strategy: listwise pretraining on large-scale text data rendered as images, followed by multimodal finetuning with VLM-teacher-distilled soft-ranking supervision. Experiments on the MMDocIR benchmark demonstrate that ZipRerank matches or surpasses state-of-the-art rerankers while reducing LLM inference latency by up to an order of magnitude.

multimodal rerankerautoregressive decodinglistwise pretrainingsoft-ranking supervisionquery-image interaction

EvoNav: Evolutionary Reward Function Design for Robot Navigation with Large Language Models

arXiv cs.AI · Zhikai Zhao, Chuanbo Hua, Federico Berto, Zihan Ma · 2026-05-12

EvoNav introduces an evolutionary framework for automating robot navigation reward function design using large language models (LLMs), addressing the limitations of hand-crafted rewards in reinforcement learning (RL). The method employs a progressive three-stage warm-up-boost procedure, transitioning from low-cost analytical proxies to lightweight rollouts and full policy training, enabling computationally efficient exploration with effective feedback. Experimental results demonstrate that EvoNav generates navigation policies superior to manually designed RL rewards and state-of-the-art reward design methods.

reinforcement learningrobot navigationreward functionlarge language modelsevolutionary framework

Improving the Performance and Learning Stability of Parallelizable RNNs Designed for Ultra-Low Power Applications

arXiv cs.AI · Julien Brandoit, Arthur Fyon, Damien Ernst, Guillaume Drion · 2026-05-12

The paper introduces the Cumulative Memory Recurrent Unit (CMRU) and its relaxed variant αCMRU, addressing gradient blocking in the Bistable Memory Recurrent Unit (BMRU) for ultra-low power RNNs. The proposed cumulative update formulation restores gradient flow through skip-connections while preserving persistent memory and quantized states. Experiments demonstrate improved convergence stability and reduced initialization sensitivity, with CMRU/αCMRU matching or outperforming Linear Recurrent Units (LRUs) and minimal Gated Recurrent Units (minGRUs) on diverse benchmarks, particularly in long-range retention tasks, while maintaining analog implementation benefits.

recurrent neural networksgradient blockingquantized statespersistent memoryanalog implementation

GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

arXiv cs.AI · Sijia Li, Yuchen Huang, Zifan Liu, Yanping Li · 2026-05-12

GEAR introduces Granularity-adaptivE Advantage Reweighting, a framework for adaptive-granularity credit assignment in LLM agents via self-distillation. It reshapes trajectory-level GRPO advantage using token- and segment-level signals derived from comparing an on-policy student with a ground-truth-conditioned teacher. Divergence spikes identify semantic deviations, forming adaptive credit regions: aligned tokens preserve token-level resolution, while divergent continuations group into segments with modulated advantage. Experiments on eight benchmarks with Qwen3 4B and 8B models show GEAR outperforms GRPO, self-distillation-only baselines, and token/turn-level methods, especially in challenging long-horizon settings, with gains up to 20% over GRPO.

credit assignmentself-distillationadaptive granularitytrajectory-level advantagesemantic deviation

Martingale-Consistent Self-Supervised Learning

arXiv cs.AI · Moritz Gögl, Hanwen Xing, Christopher Yau · 2026-05-12

The paper introduces a martingale-consistent self-supervised learning (SSL) framework addressing prediction coherence under partial observation. By formalizing coherence through martingale constraints, the method ensures refined predictions match coarse-view expectations without systematic drift. The approach includes prediction- and latent-space variants with a two-sample Monte Carlo estimator for stochastic refinement. Evaluations on synthetic and real datasets (time-series, tabular, image) demonstrate improved robustness and calibration in partial-observation regimes, outperforming standard SSL in semi-supervised and label-free settings.

martingaleself-supervised learningpartial observationmonte carlo estimatorcoherence

Minimax Rates and Spectral Distillation for Tree Ensembles

arXiv cs.AI · Binh Duc Vu, David S. Watson · 2026-05-12

The paper establishes minimax-optimal convergence rates for random forest (RF) regression, demonstrating that eigenvalue decay of the induced kernel operator determines statistical rates under standard tree growth conditions. It proposes spectral distillation techniques for tree ensembles: RFs use kernel operator eigenfunctions, while gradient boosting machines (GBMs) employ smoother matrix singular vectors to compress models. These nonlinear spectral representations yield order-of-magnitude smaller distilled models maintaining competitive accuracy, outperforming state-of-the-art pruning and rule extraction methods in resource-constrained settings.

random forestsgradient boostingspectral distillationminimax rateskernel operator

Trade-offs in Decentralized Agentic AI Discovery Across the Compute Continuum

arXiv cs.AI · Patrizio Dazzi, Emanuele Carlini, Matteo Mordacchini, Saul Urso · 2026-05-12

The paper analyzes trade-offs in decentralized discovery mechanisms for agentic AI systems across cloud-edge environments, comparing Chord, Pastry, and Kademlia as structured overlay networks. Using a shared control-plane framework, the study evaluates these overlays through stationary and churn benchmarks on 4096-node networks, measuring discovery reliability, startup behavior, and control-plane overhead. Results characterize the operating points exposed by each overlay for agent discovery in edge-to-cloud deployments, providing insights into their suitability for intermittently connected domains.

decentralized discoverystructured overlayagentic aicontrol-planecloud-edge continuum

Multi-Timescale Conductance Spiking Networks: A Sparse, Gradient-Trainable Framework with Rich Firing Dynamics for Enhanced Temporal Processing

arXiv cs.AI · Alex Fulleda-Garcia, Saray Soldado-Magraner, Josep Maria Margarit-Taulé · 2026-05-12

The authors introduce multi-timescale conductance spiking networks, a gradient-trainable SNN framework that combines rich firing dynamics with activity sparsity by parametrizing fast, slow, and ultra-slow conductances to shape current-voltage curves. The method employs a discrete-time formulation enabling direct backpropagation through time without surrogate gradients, supporting diverse regimes (tonic, phasic, bursting) within a single model. Evaluated on Mackey-Glass time-series regression, the networks outperform LIF and AdLIF baselines while achieving 2-3× sparser activity, demonstrating advantages for energy-aware temporal processing and neuromorphic implementation.

spiking neural networksconductance dynamicsgradient-based trainingtemporal processingneuromorphic computing

REFNet++: Multi-Task Efficient Fusion of Camera and Radar Sensor Data in Bird's-Eye Polar View

arXiv cs.AI · Kavin Chandrasekaran, Sorin Grigorescu, Gijs Dubbelman, Pavol Jancura · 2026-05-12

REFNet++ introduces a computationally efficient multimodal fusion framework for camera and radar data in autonomous driving. The method employs dual encoder-decoder architectures: a variational network transforms front-view camera images into Bird's-Eye View (BEV) polar coordinates, while a radar network converts range-Doppler spectra into range-azimuth features for domain alignment. Evaluated on the RADIal dataset, the approach demonstrates state-of-the-art performance in vehicle detection and free space segmentation tasks by leveraging complementary sensor strengths while maintaining computational efficiency.

sensor fusionbird's-eye viewrange-doppler spectrumvariational encoder-decoderautonomous driving

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

arXiv cs.AI · Yihao Wang, Haoran Xu, Renjie Gu, Yixuan Ye · 2026-05-12

The paper introduces MedMemoryBench, a benchmark for evaluating memory mechanisms in personalized healthcare agents, addressing the gap in existing benchmarks that focus on open-domain conversations. Using a human-agent collaborative pipeline, the authors synthesize clinically grounded, long-horizon medical trajectories, resulting in a dataset of 2,000 sessions and 16,000 interaction turns. The benchmark employs a streaming assessment protocol to mirror dynamic memory accumulation and investigates memory saturation. Results reveal significant bottlenecks in mainstream architectures, particularly in medical reasoning and noise resilience, highlighting the need for robust production-ready agents.

personalized healthcarememory mechanismsstreaming assessmentmemory saturationmedical reasoning

Automated Reformulation of Robust Optimization via Memory-Augmented Large Language Models

arXiv cs.AI · Jinbiao Chen, Shuang Jin, Guoyun Zhang, Junyu Zhang · 2026-05-12

The authors introduce Automated Reformulation with Experience Memory (AutoREM), a memory-augmented framework for automating robust optimization (RO) reformulation without domain expertise or parameter updates. AutoREM builds structured textual memory by reflecting on past failed trajectories through offline adaptation, enabling transfer across diverse large language models (LLMs). They also develop AutoRO-Bench, a benchmark for evaluating LLM-based RO reformulation, featuring automated data generation and a curated dataset. Experiments demonstrate AutoREM's consistent improvement in accuracy and efficiency across in-distribution, out-of-distribution datasets, and various base LLMs.

robust optimizationlarge language modelsmemory-augmented frameworkoffline adaptationautomated reformulation

Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models

arXiv cs.AI · Huoren Yang, Jianchao Zhao, Hu Yusong, Qiguan Ou · 2026-05-12

The paper introduces MCF-Proto, a lightweight action head for Vision-Language-Action (VLA) models that replaces fixed world-frame action prediction with Motion-Centric Action Frames (MCF) and prototype-based parameterization. The method predicts SO(3) rotations to transform actions into local frames, composes them from learned prototypes, and maps back to world coordinates—requiring only standard demonstrations. Results show emergent geometric structure in local frames, compact action representations with dominant directions, and improved robustness to geometric perturbations, demonstrating the benefits of structured action heads for robotic manipulation.

vision-language-action modelsmotion-centric action framesso(3) rotationprototype-based parameterizationrobotic manipulation

Why Users Go There: World Knowledge-Augmented Generative Next POI Recommendation

arXiv cs.AI · Qiuyu Ding, Heng-Da Xu, Wei Zhang, Dongyi Lv · 2026-05-12

The paper introduces AWARE, a generative POI recommendation system that augments LLMs with dynamic world knowledge through agent-generated contextual narratives. The method employs an LLM agent to produce location- and time-aware narratives capturing cultural traits, seasonal trends, and real-world events, while grounding them in user-specific spatial-temporal patterns. Evaluations on three real-world datasets show AWARE achieves up to 12.4% relative improvement over baselines by effectively integrating evolving external knowledge.

point-of-interest recommendationlarge language modelsagent-based narrativesspatial-temporal patternsworld knowledge augmentation

OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models

arXiv cs.AI · Minseok Kang, Minhyeok Lee, Jungho Lee, Minjung Kim · 2026-05-12

OTT-Vid introduces an optimal transport-based framework for temporal token compression in Video Large Language Models (Video-LLMs), addressing the inference cost bottleneck caused by accumulating visual tokens across frames. The method employs a two-stage process: spatial pruning identifies representative content within frames, followed by optimal transport (OT) with non-uniform token mass and locality-aware cost to estimate temporal compressibility. This approach dynamically allocates compression budgets based on transport difficulty, balancing token importance and matching cost. Evaluations on six benchmarks demonstrate that OTT-Vid retains 95.8% of VQA and 73.9% of VTG performance while preserving only 10% of tokens, outperforming existing training-free compression methods.

optimal transporttemporal token compressionvideo large language modelsspatial pruninglocality-aware cost

Beyond Inefficiency: Systemic Costs of Incivility in Multi-Agent Monte Carlo Simulations

arXiv cs.AI · Alison Moldovan-Mauer, Benedikt Mangold · 2026-05-12

This study quantifies the systemic costs of incivility in multi-agent debates using LLM-based simulations, addressing limitations of human subject research. A Monte Carlo framework generates thousands of 1-on-1 adversarial debates across toxicity conditions, measuring convergence time as an efficiency metric. Experiments extend prior findings across LLM agents of varying parameter sizes, confirming 25% convergence latency and showing increased latency in smaller models. Results reveal a significant first-mover advantage, where initiating agents win above chance regardless of toxicity. The method enables systematic manipulation of communicative behavior at scale.

monte carlo simulationmulti-agent systemsconvergence latencytoxicity conditionsfirst-mover advantage

Crash Assessment via Mesh-Based Graph Neural Networks and Physics-Aware Attention

arXiv cs.AI · Gabriel Curtosi, Carlos Manuel Ruiz Ruiz, Fabiola Cavaliere, Xabier Larráyoz Izcara · 2026-05-12

The work proposes hybrid neural surrogate models (MeshTransolver, MeshGeoTransolver, MeshGeoFLARE) for predicting full-field structural deformations in vehicle crash simulations, addressing computational bottlenecks in design exploration. The architectures combine mesh-based graph neural networks, geometry-aware global attention, and sparse contact-aware correction for autoregressive rollout, capturing both local interactions and long-range deformation patterns. On a 25-sample test set, the best hybrid model achieves 3.20 mm mean RMSE, with qualitative analysis showing superior physical interpretability over pure attention baselines despite comparable quantitative performance.

mesh-based gnnsphysics-aware attentioncrash simulationsurrogate modelingstructural deformation

Is Monotonic Sampling Necessary in Diffusion Models?

arXiv cs.AI · Muhammad Haris Khan · 2026-05-12

This study challenges the necessity of monotonic sampling schedules in diffusion models by testing four nonmonotonic schedule families across DDPM, EDM, and Flow Matching architectures on CIFAR-10. Results from 90 configurations show no performance improvement over monotonic baselines, with penalty magnitudes varying by architecture: significant in DDPM, moderate in Flow Matching, and negligible in EDM. The Schedule Sensitivity Coefficient is introduced as a diagnostic tool for denoiser quality, validating conventional monotonic approaches and offering a new metric complementary to sample-quality benchmarks.

diffusion modelsnonmonotonic schedulesschedule sensitivity coefficientdenoiser qualitycifar-10

Behavioral Integrity Verification for AI Agent Skills

arXiv cs.AI · Yuhao Wu, Tung-Ling Li, Hongliang Liu · 2026-05-12

The paper introduces Behavioral Integrity Verification (BIV), a framework for verifying AI agent skills by comparing declared versus actual capabilities using a shared taxonomy. BIV combines deterministic code analysis with LLM-assisted capability extraction to detect deviations, classify root causes, and identify malicious skills. Evaluation on 49,943 OpenClaw skills reveals 80.0% deviate from declared behavior, with 81.1% due to developer oversight and 18.9% to adversarial intent; BIV achieves 0.946 F1 on malicious-skill detection, outperforming baselines.

behavioral integrity verificationai agent skillscapability extractiondeviation taxonomymalicious-skill detection

Focusable Monocular Depth Estimation

arXiv cs.AI · Yuxin Du, Tao Lin, Zile Zhong, Runting Li · 2026-05-12

Focusable Monocular Depth Estimation (FDE) introduces a region-aware depth estimation task prioritizing user-specified target regions while maintaining global scene geometry. The proposed FocusDepth framework employs Multi-Scale Spatial-Aligned Fusion (MSSA) to spatially align multi-scale features from Segment Anything Model 3 with Depth Anything models, enabling prompt-conditioned depth estimation via box/text cues. FDE-Bench, a benchmark with 252.9K/72.5K train/val image-target-depth triplets across 972 categories, evaluates the approach. FocusDepth outperforms globally fine-tuned DA2/DA3 baselines, particularly in target boundary and foreground regions, with MSSA's spatial alignment reducing AbsRel errors by up to 13.8%.

monocular depth estimationmulti-scale fusionprompt-conditionedspatial alignmenttarget-centric

Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention

arXiv cs.AI · Abid Ali, Diego Molla-Aliod, Usman Naseem · 2026-05-12

We propose SPeCTrA-Sum, a unified framework for multimodal summarization that jointly performs text summarization and representative image selection. The system introduces two innovations: a Deep Visual Processor (DVP) enabling hierarchical, layer-wise fusion between visual and language encoders, and a Visual Relevance Predictor (VRP) selecting salient images via Determinantal Point Processes distillation. Training employs a multi-objective loss combining autoregressive summarization, cross-modal alignment, and DPP-based distillation. Experiments demonstrate SPeCTrA-Sum generates more accurate, visually grounded summaries and selects more representative images compared to existing methods, highlighting the benefits of depth-aware fusion and principled image selection.

multimodal summarizationcross-modal transformerdeterminantal point processesvisual relevance predictordepth-aware fusion

DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

arXiv cs.AI · Xianzhe Fan, Yuxiang Lu, Shenyuan Gao, Xiaoyang Wu · 2026-05-12

DreamAvoid introduces a critical-phase test-time dreaming framework to enhance Vision-Language-Action (VLA) models' ability to anticipate and avoid failures in fine-grained manipulation tasks. The method employs a Dream Trigger to identify critical phases, an Action Proposer to sample candidate actions, and a Dream Evaluator trained on mixed success, failure, and boundary cases to predict and select optimal actions. Extensive evaluations on real-world manipulation tasks and simulation benchmarks demonstrate that DreamAvoid significantly improves task success rates by effectively avoiding failures.

vision-language-actioncritical-phasedream evaluatoraction proposerfailure avoidance

When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

arXiv cs.AI · Wenkai Li, Fan Yang, Ananya Hazarika, Shaunak A. Mehta · 2026-05-12

This study challenges the assumption that chain-of-thought (CoT) reasoning traces reliably reflect model computation timing, introducing a step-level Detect-Classify-Compare framework validated via Patchscopes, tuned-lens probes, and causal direction ablation. Analyzing nine models across seven reasoning benchmarks, latent answer commitment and explicit trace alignment occur in only 61.9% of steps, with 58.0% mismatches attributed to confabulated continuation after answer stabilization. Architecture-matched comparisons reveal CoT utility increases with lower step-level alignment, suggesting CoT's usefulness despite temporal unreliability. Truncation and donor-corruption tests confirm post-commitment text often lacks functional relevance to final answers.

chain-of-thoughtpatchscopesconfabulated continuationtuned-lens probesanswer-commitment proxy

OptArgus: A Multi-Agent System to Detect Hallucinations in LLM-based Optimization Modeling

arXiv cs.AI · Zhong Li, Zihan Guo, Xiaohan Lu, Juntao Wang · 2026-05-12

OptArgus introduces a multi-agent system for detecting hallucinations in LLM-based optimization modeling, addressing structural inconsistencies across problem descriptions, symbolic models, and solver implementations. The method employs a fine-grained hallucination taxonomy spanning objective, variable, constraint, and implementation failures, alongside conductor routing, specialist auditors, and evidence consolidation. Evaluated on a benchmark suite of 484 clean artifacts, 1266 controlled injected artifacts, and 6292 natural LLM-generated artifacts, OptArgus outperforms a single-agent baseline in false alarm reduction, localization accuracy, and detection strength. This work establishes optimization-modeling hallucination detection as a concrete empirical problem and demonstrates the efficacy of modular, taxonomy-grounded auditing.

optimization modelinghallucination detectionmulti-agent systemsymbolic modelevidence consolidation

Allegory of the Cave: Measurement-Grounded Vision-Language Learning

arXiv cs.AI · Kepeng Xu, Li Xu, Gang He, Wenxin Yu · 2026-05-12

We introduce measurement-grounded vision-language learning to address information loss in RGB rendering, proposing PRISM-VL as an instantiation. PRISM-VL combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation to transfer supervision from RGB proxies to measurement-domain observations. Evaluated on a 150K instruction-tuning set and a held-out benchmark targeting challenging visual conditions, PRISM-VL-8B achieves 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66% LLM-Judge accuracy, outperforming the RGB-based Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points. Results demonstrate that preserving measurement-domain evidence enhances multimodal reasoning.

measurement-groundedprism-vlexposure-bracketed supervisionmeas.-xyzvision-language models

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

arXiv cs.AI · Jiyuan Wang, Huan Ouyang, Jiuzhou Lin, Chunyu Lin · 2026-05-12

The paper introduces Concentrate and Concentrate (CaC), a hierarchical spatiotemporal anomaly reward model leveraging Vision-Language Models for video anomaly detection. CaC employs a coarse-to-fine approach: global temporal scanning identifies anomalous time windows, followed by fine-grained spatial grounding within localized intervals, and structured spatiotemporal Chain-of-Thought reasoning for robust judgments. The model is trained on a novel large-scale video anomaly dataset with per-frame annotations, using a three-stage progressive training paradigm involving supervised fine-tuning and Group Relative Policy Optimization (GRPO). CaC achieves a 25.7% accuracy improvement on fine-grained anomaly benchmarks and reduces generated-video anomalies by 11.7% while enhancing video quality.

vision-language modelsspatiotemporal reasoninggroup relative policy optimizationchain-of-thoughtvideo anomaly detection

A Research Agenda on Agents and Software Engineering: Outcomes from the Rio A2SE Seminar

arXiv cs.AI · Davide Taibi, Henry Muccini, Karthik Vaidhyanathan, Marcos Kalinowski · 2026-05-12

The A2SE seminar in Rio de Janeiro established a research agenda addressing the dual impact of agentic AI on software engineering: agents as tools for software engineering tasks and agents as complex systems requiring novel engineering practices. Eighteen experts from academia and industry participated in structured presentations, collaborative topic clustering, and group discussions to identify six thematic areas: Governance, Software Engineering for Agents, Agents for Software Architecture, Quality and Evaluation, Sustainability, and Code. The seminar prioritized short-term and long-term research directions for each area, providing a structured foundation for coordinated community efforts in this evolving field.

agentic aisoftware engineeringgovernancequality evaluationsustainability

Self-organized MT Direction Maps Emerge from Spatiotemporal Contrastive Optimization

arXiv cs.AI · Zhaotian Gu, Molan Li, Jie Su, Chang Liu · 2026-05-12

We demonstrate that self-organized direction-selective maps in the middle temporal (MT) area emerge from spatiotemporal contrastive optimization, unifying the computational origins of the ventral and dorsal streams. A 3D ResNet was trained on naturalistic videos using Momentum Contrast (MoCo) self-supervised learning combined with a biologically inspired spatial loss function. The model spontaneously developed brain-like direction maps and topological pinwheel structures, with MT tuning properties quantitatively matching macaque physiological baselines in direction selectivity index, circular variance, and pinwheel density. These results establish a general mechanism for cortical self-organization driven by optimization trade-offs between discriminative pressure and spatial regularization.

middle temporalmomentum contrastspatiotemporal optimizationdirection selectivitytopographic deep artificial neural network

SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

arXiv cs.AI · Xinyi Zeng, Xue Yang, Jingyuan Zhang, Huanqian Yan · 2026-05-12

The paper introduces SafeSteer, a decoding-level defense mechanism for multimodal large language models (MLLMs) to mitigate jailbreak attacks without costly fine-tuning. It leverages a Decoding-Probe to detect and correct harmful outputs during decoding and employs modal semantic alignment to extend textual safety to vision. Experiments show SafeSteer improves safety by up to 33.40% while maintaining model performance, balancing helpfulness and harmlessness.

multimodaljailbreakdecoding-probesemantic alignmentsafety

Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance

arXiv cs.AI · Wenhao Chen, Sirui Sun, Shengyuan Bai, Guojie Song · 2026-05-12

The Stable Value Guidance Transformer (SVGT) introduces an independent value module for stable alignment of large language models with human values. SVGT employs two key designs: independent value modeling maintains normative representations in a dedicated value space isolated from the backbone, while explicit behavioral guidance transduces these stable signals into learnable latent Bridge Tokens that dynamically steer the generative trajectory. Experiments across multiple backbones and safety benchmarks demonstrate SVGT reduces harmful scores by over 70% while preserving generation fluency, validating its efficacy in architecturally grounded value modeling.

stable value guidance transformerindependent value modelingbridge tokensgenerative trajectoryvalue alignment

Debiased Model-based Representations for Sample-efficient Continuous Control

arXiv cs.AI · Jiafei Lyu, Zichuan Lin, Scott Fujimoto, Kai Yang · 2026-05-12

The paper introduces DR.Q, a debiased model-based representation method for continuous control that addresses biases in existing approaches. The method maximizes mutual information between current state-action pairs and next states while minimizing deviations, using faded prioritized experience replay. Evaluated on continuous control benchmarks with fixed hyperparameters, DR.Q matches or exceeds recent baselines, sometimes by significant margins.

model-based representationscontinuous controlmutual informationprioritized experience replayactor-critic learning

WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting

arXiv cs.AI · Lezhong Wang, Mehmet Onurcan Kaya, Siavash Bigdeli, Jeppe Revall Frisvad · 2026-05-12

The paper introduces WildRelight, the first real-world benchmark dataset for single-image relighting, addressing the synthetic-to-real domain gap in current methods. The dataset features high-resolution outdoor scenes with temporally aligned natural illuminations and paired HDR environment maps. A physics-guided inference framework combining Diffusion Posterior Sampling (DPS) and test-time adaptation (TTA) is proposed, demonstrating that synthetic models can adapt to real-world statistics through self-supervised learning on this temporal data.

single-image relightingdomain adaptationdiffusion posterior samplingtest-time adaptationhdr environment maps

Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning

arXiv cs.AI · Mikako Ochiai, Masatoshi Nagano, Tadahiro Taniguchi · 2026-05-12

The study demonstrates that heterogeneous visual agents can develop shared symbolic communication through decentralized learning, despite private perceptual representations. Authors introduce the Metropolis-Hastings Captioning Game (MHCG), where agents exchange discrete token sequences and update models based on local visual evidence, without a shared communicative objective. Experiments on MS-COCO reveal that MHCG generates visually informative token sequences outperforming no-communication baselines in cross-agent alignment, visual-feature prediction, and image-text retrieval. Performance declines with increasing encoder mismatch, with moderate heterogeneity reducing sequence count while preserving specificity, and strong heterogeneity yielding fewer, coarser, and asymmetric sequences. Listener-side MH acceptance proves crucial for avoiding degenerate token formation.

emergent communicationdecentralized learningmetropolis-hastingsvisual encoderstoken sequences

Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity

arXiv cs.AI · Abid Ali, Diego Molla-Aliod, Usman Naseem · 2026-05-12

We introduce MM-Eval, a unified evaluation framework for Multimodal Summarization with Multimodal Output (MSMO) that integrates textual quality, cross-modal alignment, and visual diversity assessments. The framework employs OpenFActScore for factual consistency, G-Eval for coherence, an MLLM-as-a-judge approach for image-text relevance, and Truncated CLIP Entropy for image-set diversity. A learned aggregation model, calibrated on the mLLM-EVAL news benchmark, aligns component contributions with human preferences. Results indicate a text-dominant hierarchy where factual consistency critically determines overall quality, while visual relevance and diversity provide complementary signals. MM-Eval outperforms heuristic aggregation baselines and offers an interpretable, reference-weak framework for multimodal summary evaluation.

multimodal summarizationfactual consistencycross-modal alignmenttruncated clip entropymllm-as-a-judge

Shaping Zero-Shot Coordination via State Blocking

arXiv cs.AI · Mingu Kang, Sunwoo Lee, Yonghyeon Jo, Seungyul Han · 2026-05-12

The paper introduces State-Blocked Coordination (SBC), a framework enhancing zero-shot coordination (ZSC) by generating diverse virtual environments via state blocking. This method exposes agents to varied suboptimal partner policies without direct environment modification, improving generalization to unseen partners. Evaluations across benchmarks show SBC outperforms existing approaches in ZSC, including robust performance with human partners.

zero-shot coordinationstate blockingmulti-agent systemsgeneralizationhuman-ai collaboration

Persistent and Conversational Multi-Method Explainability for Trustworthy Financial AI

arXiv cs.AI · Georgios Makridis, Georgios Fatouros, John Soldatos, George Katsis · 2026-05-12

The paper introduces a human-centered explainable AI architecture for financial sentiment analysis, combining persistent XAI artifacts, multi-method explanation triangulation, and faithfulness evaluation. XAI artifacts, including LIME feature attributions, occlusion-based word importance scores, and saliency heatmaps, are stored persistently in S3-compatible storage with structured metadata and natural-language summaries, enabling semantic retrieval and index reconstruction. A retrieval-augmented generation (RAG) assistant synthesizes explanations from multiple XAI methods, allowing conversational robustness assessment. Automated checks evaluate explanation faithfulness, focusing on grounding completeness, hallucinated claims, and method-attribution behavior. Evaluations on an EXTRA-BRAIN pipeline with FinBERT show constrained prompting reduces hallucination by 36% and increases method-attribution citations by 73% compared to naive prompting.

explainable airetrieval-augmented generationsentiment analysisfeature attributionhallucination rate

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

arXiv cs.AI · ShiYing Huang, Liang Lin, Yuer Li, Kaiwen Luo · 2026-05-12

We propose MORA (Multi-Objective Reward Assimilation), a novel method addressing the zero-sum conflict in multi-objective alignment of large language models by expanding reward diversity through prompt rewriting. MORA isolates single-reward prompts via pre-sampling and rewrites them to incorporate multi-dimensional intents, breaking the Pareto frontier limitation. Experiments show MORA achieves single-preference improvements of 5%-12.4% in sequential alignment, with significant gains in harmlessness, and an average overall reward improvement of 4.6% in simultaneous alignment across helpfulness, harmlessness, and truthfulness dimensions.

multi-objective alignmentpareto frontierprompt rewritingreward diversitysequential alignment

OOM-Free Alpamayo via CPU-GPU Memory Swapping for Vision-Language-Action Models

arXiv cs.AI · Seungwoo Roh, Huiyeong Kim, Jong-Chan Kim · 2026-05-12

A framework enabling memory-efficient inference for Vision-Language-Action (VLA) models on VRAM-constrained GPUs is introduced, achieving up to 3.55x speedup over Accelerate offloading while maintaining full BF16 precision. The method employs Sequential Demand Layering to reduce VRAM usage to layer-level granularity, Pipelined Demand Layering to overlap parameter transfer with computation, and a GPU-Resident Layer Decision Policy informed by per-module residency benefit analysis to eliminate residual transfer overhead. A performance prediction model determines optimal configurations with less than 1.3% error. Evaluated on NVIDIA's Alpamayo-R1-10B (21.52GB) on an RTX 5070Ti (16GB), the framework avoids out-of-memory errors without model modification.

vision-language-actionvram-constrainedsequential demand layeringpipelined demand layeringgpu-resident layer

A CAP-like Trilemma for Large Language Models: Correctness, Non-bias, and Utility under Semantic Underdetermination

arXiv cs.AI · Vinu Ellampallil Venugopal · 2026-05-12

This paper proposes a CAP-like trilemma for Large Language Models (LLMs), asserting that under semantic underdetermination, LLMs cannot simultaneously guarantee strong correctness, strict non-bias, and high utility. Semantic underdetermination occurs when premises do not determine a unique answer, requiring the model to introduce selection criteria or preferences. The authors formalize this trilemma, develop illustrative examples, and argue that certain LLM failures stem from the inherent structure of underdetermined decision requests rather than model limitations alone.

semantic underdeterminationcap theoremlarge language modelsselection criteriondecision requests

Cochise: A Reference Harness for Autonomous Penetration Testing

arXiv cs.AI · Andreas Happe, Jürgen Cito · 2026-05-12

The authors present Cochise, a minimal 597-line Python reference harness for evaluating LLM-driven autonomous penetration testing systems. The framework implements a Planner-Executor architecture with ReAct-style execution over SSH, maintaining long-term state externally while adapting prompts to target environments. Evaluated on the Game of Active Directory (GOAD) testbed, Cochise includes replay tools (cochise-replay), analysis utilities (cochise-analyze-logs/graphs), and a corpus of JSON trajectory logs to enable reproducible research without provisioning the resource-intensive 48-64GB RAM testbed.

autonomous penetration testingplanner-executor architecturereact-style executionssh command executionreference harness

Evolutionary Task Discovery: Advancing Reasoning Frontiers via Skill Composition and Complexity Scaling

arXiv cs.AI · Liqin Ye, Yanbin Yin, Michael Galarnyk, Yuzhao Heng · 2026-05-12

The paper introduces Evolutionary Task Discovery (EvoTD), a framework for advancing LLM reasoning through structured evolutionary operators: Crossover for skill composition and Parametric Mutation for complexity scaling. EvoTD employs a Zone of Proximal Development filter to maintain task learnability. Empirical results show consistent reasoning improvements across diverse model architectures, pretraining regimes, and scales, validating the efficacy of evolutionary curricula. The method addresses limitations of unstructured data synthesis by navigating a dual-axis manifold of Algorithmic Skills and Complexity Attributes.

evolutionary task discoveryskill compositioncomplexity scalingzone of proximal developmentreasoning frontier

Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning

arXiv cs.AI · Yaze Zhao, Yicong Liu, Yixiong Zou, Yuhua Li · 2026-05-12

The paper revives in-domain fine-tuning methods for source-free cross-domain few-shot learning (CDFSL) by analyzing and rectifying attention collapse in CLIP models. Through establishing baselines, the authors find adapter-based methods (e.g., LoRA) outperform prompt-based ones (e.g., MaPLe) in CDFSL, attributing this to LoRA's ability to correct visual CLS token attention and enhance modality alignment. They propose Semantic Probe, a plug-and-play framework that rectifies attention by leveraging textual EOS tokens and improves both adapter- and prompt-based methods. Experiments on four CDFSL benchmarks demonstrate state-of-the-art performance, validating the approach.

cross-domain few-shot learningcliploramodality alignmentattention rectification

Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery

arXiv cs.AI · Chi-Nguyen Tran, Dao Sy Duy Minh, Huynh Trung Kiet, Nguyen Lam Phu Quy · 2026-05-12

SkyPart introduces a lightweight swappable head for patch-based vision transformers (ViTs) to address limitations in cross-view geo-localization (CVGL). The method employs learnable prototypes for patch token assignment, altitude-conditioned linear modulation during training, graph-attention readout over prototypes, and a Kendall uncertainty-weighted multi-objective loss. With 26.95M parameters and 22.14 GFLOPs, SkyPart achieves state-of-the-art performance on SUES-200, University-1652, and DenseUAV benchmarks under a single-pass, no-re-ranking, no-TTA protocol. It demonstrates superior robustness under the WeatherPrompt corruption benchmark compared to existing baselines.

cross-view geo-localizationpatch-based vision transformerslearnable prototypesgraph-attention readoutkendall uncertainty-weighted loss

Every Bit, Everywhere, All at Once: A Binomial Multibit LLM Watermark

arXiv cs.AI · Thibaud Gloaguen, Robin Staab, Mark Vero, Martin Vechev · 2026-05-12

The paper introduces a binomial multibit watermarking scheme for LLMs that encodes every bit of a payload at every token position, coupled with a stateful encoder to dynamically balance encoding pressure. This approach outperforms 8 baselines on 64-bit payloads, showing superior message accuracy and robustness, particularly for large payloads and low-distortion regimes. The authors also critique prior evaluation metrics and propose per-bit confidence scoring as a more practical alternative.

multibit watermarkingbinomial encodingstateful encoderper-bit confidencelow-distortion regimes

Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

arXiv cs.AI · Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, Jeany Son · 2026-05-12

We propose a novel think-answer distillation framework for visual-language models (VLMs) that enhances visual-anchored reasoning by masking salient reasoning prefixes during training. Our method employs token-wise salient reasoning-prefix masking and self-paced masking budget scheduling to encourage reliance on visual evidence, replacing standard causal masks with reasoning-prefix masks that block both future tokens and reasoning cues. Experiments demonstrate superior performance over existing VLM distillation and self-distillation methods on multimodal reasoning benchmarks, with analysis confirming improved visual utilization throughout the reasoning process.

visual-language modelsreasoning-prefix maskingself-paced maskingmultimodal reasoningdistillation framework

Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

arXiv cs.AI · Chi Zhang, Haibo Qiu, Qiming Zhang, Yufei Xu · 2026-05-12

Seirênes introduces a self-play reinforcement learning framework that transforms contextual interference into a training signal for enhancing large language model (LLM) reasoning robustness. The method employs a parameter-shared adversarial loop where a single model both generates plausible distracting contexts to expose its reasoning blind spots and solves problems by discerning essential tasks from these perturbations. This co-evolutionary curriculum drives the model beyond superficial pattern matching. Evaluated across seven mathematical reasoning benchmarks with model scales from 4B to 30B parameters, Seirênes achieves average accuracy gains of +10.2, +9.1, and +7.2 points. Additionally, its distracting contexts reduce top-tier closed-source model accuracy by 4--5 points.

self-playcontextual interferenceco-evolutionary curriculumreasoning robustnessadversarial loop

Unlocking UML Class Diagram Understanding in Vision Language Models

arXiv cs.AI · Artem Naboichenko, René Peinl · 2026-05-12

The work introduces a benchmark for visual question answering (VQA) on UML class diagrams, addressing a gap in Vision Language Model (VLM) capabilities for computer science diagrams. Using a dataset of 16,000 image-question-answer triples, the authors demonstrate that LoRA-based fine-tuning surpasses the performance of Qwen 3.5 27B, a state-of-the-art VLM, on this specialized task. The benchmark is designed to be both challenging and tractable, filling a niche in diagram understanding research.

vision language modelsuml class diagramsvisual question answeringlora-based fine-tuningbenchmark

Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations

arXiv cs.AI · Junjue Wang, Weihao Xuan, Heli Qi, Pengyu Dai · 2026-05-12

The Disaster Operational Response Agent benchmark (DORA) introduces the first agentic benchmark for end-to-end disaster response, comprising 515 expert-authored tasks across 45 real-world events and 10 disaster types, with 3,500 tool-call steps in gold trajectories. Tasks span five dimensions: disaster perception, spatial relational analysis, rescue planning, temporal reasoning, and multi-modal report synthesis, utilizing a 108-tool MCP library over heterogeneous geospatial data. Evaluation of 13 frontier LLMs reveals persistent challenges in disaster-domain grounding, tool selection, argument grounding, and compositional fragility, with agent-to-gold performance gaps widening from 7% to 56% on long pipelines.

disaster responsegeospatial datatool-call stepsmulti-modal synthesiscompositional fragility

Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization

arXiv cs.AI · Yilong Wang, Qianli Wang, Bohao Chu, Yihong Liu · 2026-05-12

The paper introduces Macro, a preference alignment framework for multilingual counterfactual explanation (SCE) generation using Direct Preference Optimization (DPO). It addresses the trade-off between validity and minimality in non-dominant languages by constructing preference pairs via a composite scoring function. Experiments across four LLMs and seven languages demonstrate Macro's 12.55% average validity improvement over chain-of-thought baselines while preserving minimality, outperforming supervised fine-tuning. Analyses show enhanced cross-lingual perturbation alignment and reduced generation errors.

counterfactual explanationspreference optimizationmultilingual generationvalidity-minimality trade-offdirect preference optimization

Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

arXiv cs.AI · Zhaomeng Zhou, Lan Zhang, Junyang Wang, Mu Yuan · 2026-05-12

The paper introduces Budget-Efficient Thinking (BET), a two-stage framework for adaptive reasoning that optimizes compute allocation by aligning solve-or-fold decisions with solvability expectations. BET combines behavioral cold-start with GRPO under an investment-cost-aware reward, learning three key behaviors: short solve for easy queries, nice fold for unsolvable cases, and hero call for hard-but-solvable problems. Evaluated across seven benchmarks and three base models, BET reduces reasoning tokens by ~55% while maintaining or improving performance, demonstrating zero-shot transferability from mathematical to scientific QA and logical reasoning tasks.

adaptive reasoningbudget-efficient thinkingsolvabilitygrpozero-shot transfer

From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

arXiv cs.AI · Guobin Shen, Lei Huang, Xiang Cheng, Chenxiao Zhao · 2026-05-12

The paper introduces CREDIT (Contrastive REward from DIsTillation), a method for isolating input-specific reasoning in on-policy self-distillation for language models. Under a posterior-compatibility interpretation, the authors demonstrate that self-distillation token rewards correspond to Bayesian filtering increments, whose sum equals the pointwise mutual information (pMI) between response and feedback given the input. CREDIT employs a batch-contrastive baseline to decompose teacher log-probability along the input axis, penalizing responses likely under unrelated inputs. Evaluated across coding, scientific reasoning, and tool-use benchmarks on two model families, CREDIT achieves superior aggregate performance with minimal computational overhead.

self-distillationbayesian filteringpointwise mutual informationcontrastive rewardinput-specific reasoning

When Emotion Becomes Trigger: Emotion-style dynamic Backdoor Attack Parasitising Large Language Models

arXiv cs.AI · Ziyu Liu, Tao Li, Tianjie Ni, Xiaolong Lan · 2026-05-12

We propose Paraesthesia, a dynamic backdoor attack leveraging emotion as a semantic trigger in fine-tuned large language models (LLMs). Unlike token-level attacks, Paraesthesia manipulates emotional style as a decoupled factor in LLM representation space, enabling stealthy parasitic behavior. The method combines emotional style quantification and rewriting, injecting poisoned samples during fine-tuning to induce predefined harmful outputs upon emotional inputs. Evaluated on instruction-following generation and classification tasks across four LLMs, Paraesthesia achieves 99% attack success rate while preserving model utility on clean inputs.

backdoor attackemotional stylelarge language modelsfine-tuningsemantic manipulation

CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

arXiv cs.AI · Jianghan Shen, Siqi Luo, Xinyu Cheng, Jing Xiong · 2026-05-12

The paper introduces CuSearch, a curriculum rollout sampling framework for optimizing agentic retrieval-augmented generation (RAG) systems trained via reinforcement learning with verifiable rewards (RLVR). CuSearch employs Search-Depth Greedy Allocation (SDGA) to prioritize deeper-search trajectories during training, which provide denser supervision for retrieval sub-policies. Two variants, SDGA-Auto and SDGA-Phase, adaptively allocate update budgets based on trajectory depth. Experiments demonstrate consistent improvements, including an 11.8-point exact-match gain over GRPO on ZeroSearch, validating search depth as an effective proxy for supervision density.

retrieval-augmented generationreinforcement learningcurriculum learningrollout samplingverifiable rewards

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

arXiv cs.AI · Guobin Shen, Xiang Cheng, Chenxiao Zhao, Lei Huang · 2026-05-12

The paper introduces Anti-Self-Distillation (AntiSD), a method that improves reasoning in reinforcement learning by ascending the divergence between student and teacher models rather than descending it, addressing inconsistent gains in math reasoning from on-policy self-distillation. AntiSD reverses the per-token sign of the distillation signal and uses an entropy-triggered gate to disable the term once teacher entropy collapses. Evaluated across five models (4B to 30B parameters) on math reasoning benchmarks, AntiSD achieves the GRPO baseline's accuracy in 2 to 10x fewer steps and improves final accuracy by up to 11.5 points.

anti-self-distillationpointwise mutual informationreasoning rlon-policy self-distillationentropy-triggered gate

PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head

arXiv cs.AI · Chieh-Yen Lin, Shao-Hua Sun · 2026-05-12

The paper introduces PRISM (Proxy Risk Inference via Structural Mapping), a geometric risk bound that decomposes LLM representation drift into scale, shape, and head divergence components. By leveraging the linear output head and near-isometric structure of LLM backbones, PRISM provides a closed-form upper bound on cross-entropy risk gaps between target models and post-training variants (e.g., quantized, LoRA-adapted). The method enables variant ranking and identifies specific failure modes, guiding remediation. Evaluated across two model families and five benchmarks, PRISM achieves mean Spearman correlations of 0.820 (quantization) and 0.831 (LoRA forgetting), while its shape regularizer outperforms experience replay in mitigating catastrophic forgetting.

prismrepresentation driftcross-entropy risklora-adaptedquantization

Exact Stiefel Optimization for Probabilistic PLS: Closed-Form Updates, Error Bounds, and Calibrated Uncertainty

arXiv cs.AI · Haoran Hu, Xingce Wang · 2026-05-12

The paper introduces an end-to-end framework for probabilistic partial least squares (PPLS) that addresses noise-signal coupling and orthogonality constraints. The method combines noise pre-estimation, constrained likelihood optimization, and prediction calibration, replacing full-spectrum noise averaging with noise-subspace estimation and interior-point penalty handling with exact Stiefel-manifold optimization. The noise-subspace estimator achieves a signal-strength-independent finite-sample rate and matches a minimax lower bound. The framework extends to sub-Gaussian settings via optional Gaussianization and provides closed-form standard errors through block-structured Fisher analysis. Evaluated on synthetic high-noise settings and multi-omics benchmarks (TCGA-BRCA and PBMC CITE-seq), the method achieves near-nominal coverage, Ridge-level point accuracy at rank r=3, and improved parameter recovery stability.

probabilistic partial least squaresstiefel-manifold optimizationnoise-subspace estimationblock-structured fisher analysismulti-omics benchmarks

Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs

arXiv cs.AI · Chaeyoung Jung, Kyeongha Rho, Joon Son Chung · 2026-05-12

ContextGuard, an inference-time token pruning framework for Omni-LLMs, preserves broad audio-visual context while reducing cross-modal redundancy by predicting coarse visual semantics from audio and pruning recoverable video tokens. It retains localized visual details unspecified by audio and merges temporally similar video tokens, requiring no downstream LLM fine-tuning and using only a lightweight predictor. Evaluated on Qwen2.5-Omni and Video-SALMONN2+ at 3B and 7B scales across six benchmarks, ContextGuard outperforms prior pruning methods, achieving full-token-level performance on five benchmarks while pruning 55% of input tokens on Qwen2.5-Omni 7B.

contextguardtoken pruningomni-llmscross-modal redundancyaudio-visual context

GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization

arXiv cs.AI · Disha Sheshanarayana, Rajat Subhra Pal, Manjira Sinha, Tirthankar Dasgupta · 2026-05-12

The paper introduces Green-Aware Routing (GAR), a constrained multi-objective optimization framework for carbon-efficient LLM inference routing. GAR minimizes CO2 emissions per request while enforcing accuracy floors and p95-latency SLOs, using adaptive constraint optimization and lightweight estimators for correctness, latency, and emissions. The proposed GAR-PD algorithm and heuristic variants achieve 15-30% carbon reduction across heterogeneous LLM pools (7B-70B parameters) on NLP benchmarks, maintaining competitive accuracy and latency guarantees.

carbon-aware routingllm inferenceconstrained optimizationservice-level objectivesprimal-dual algorithm

DiffScore: Text Evaluation Beyond Autoregressive Likelihood

arXiv cs.AI · Wen Lai, Yingli Shen, Dingnan Jin, Qing Cui · 2026-05-12

DiffScore introduces masked reconstruction as an alternative to autoregressive text evaluation, addressing positional bias inherent in left-to-right factorization. Leveraging Masked Large Diffusion Language Models, it scores tokens using full bidirectional context across continuous masking rates, establishing a hierarchy from local fluency to global coherence. The framework provides diagnostic tools like multi-timestep quality profiles and bidirectional PMI decomposition, disentangling fluency from faithfulness. Experiments across ten benchmarks demonstrate DiffScore's consistent superiority over autoregressive baselines in both zero-shot and fine-tuned settings.

masked reconstructionpositional biasbidirectional contextmasking ratesfluency-faithfulness disentanglement

EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting

arXiv cs.AI · Madhurima Panja, Danny D'Agostino, Huitao Li, Tanujit Chakraborty · 2026-05-12

EpiCastBench introduces a large-scale benchmarking framework for multivariate epidemic forecasting, addressing the lack of diverse, high-quality datasets. The framework comprises 40 curated multivariate datasets spanning various infectious diseases and geographical regions, characterized by diverse temporal granularity, series length, and sparsity. Standardized evaluation settings, including unified forecasting horizons, preprocessing pipelines, and performance metrics, ensure reproducibility and fair comparison. The framework evaluates 15 multivariate forecasting models, ranging from statistical baselines to state-of-the-art deep learning and foundation models. All datasets and code are publicly available on Kaggle and GitHub.

multivariate forecastingepidemic forecastingtemporal granularitydeep learningfoundation models

Native Explainability for Bayesian Confidence Propagation Neural Networks: A Framework for Trusted Brain-Like AI

arXiv cs.AI · Georgios Makridis, Georgios Fatouros, John Soldatos, George Katsis · 2026-05-12

The paper introduces a native explainability framework for Bayesian Confidence Propagation Neural Networks (BCPNNs), addressing the EU AI Act's transparency requirements for high-risk systems. It proposes a taxonomy mapping BCPNN architectural primitives to explainable-AI modalities, introduces 16 explanation primitives with closed-form algorithms, and 5 configuration-as-explanation primitives for hyperparameter auditing. The method leverages BCPNN's inherent transparency properties to provide attribution, prototype, and mechanistic explanations without computational overhead. Results demonstrate feasibility for edge deployment and alignment with Industry 5.0 standards through FPGA implementations and neuromorphic sparsity.

bcpnnexplainable-aieu ai actbayesian inferenceedge computing

SoK: Unlearnability and Unlearning for Model Dememorization

arXiv cs.AI · Mengying Zhang, Derui Wang, Ruoxi Sun, Xiaoyu Xia · 2026-05-12

This paper presents the first integrated analysis of model dememorization techniques, focusing on unlearnability and machine unlearning. The authors develop a unified taxonomy of these methods and conduct empirical evaluations to assess their robustness, interplay, and limitations regarding shallow dememorization. They identify vulnerabilities such as falsely claimed data learnability reduction, weight perturbation effects, and domain knowledge recovery during unlearning. The study also establishes the first theoretical guarantee on dememorization depth for models processed through certified unlearning. These contributions provide foundational insights for achieving deeper immemor states of sensitive knowledge across the machine learning lifecycle.

dememorizationunlearnabilitymachine unlearningcertified unlearningimemor state

NexOP: Joint Optimization of NEX-Aware k-space Sampling and Image Reconstruction for Low-Field MRI

arXiv cs.AI · Tal Oved, Efrat Shimron · 2026-05-12

NexOP introduces a deep-learning framework for joint optimization of k-space sampling and image reconstruction in multi-NEX acquisitions for low-field MRI. The method optimizes sampling density probabilities across the k-space-NEX domain under fixed sampling-budget constraints and employs a novel architecture to reconstruct high-SNR images from multiple low-SNR measurements. Experiments on 0.3T brain data show NexOP outperforms existing methods across various acceleration factors and tissue contrasts, yielding non-uniform sampling strategies that decrease across repetitions. Theoretical analysis supports these findings, demonstrating NexOP's potential for faster, higher-quality imaging in low-cost MRI systems.

k-space samplingsignal-to-noise ratiolow-field mrideep-learning architecturemulti-nex acquisitions

Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation

arXiv cs.AI · Pruthvinath Jeripity Venkata · 2026-05-12

This paper resolves empirical contradictions in how large language models handle conflicts between training knowledge and contradicting documents by proposing a three-regime framework: Regime 1 (single-source updating, dominant predictor: evidence coherence), Regime 2 (competitive integration, dominant predictor: parametric certainty), and Regime 3 (task-appropriate selection, dominant predictor: task knowledge requirement). The authors formalize parametric strength (exposure frequency) and parametric uniqueness (encoding consistency) as orthogonal dimensions, with strength being the operative predictor in stable factual domains. The framework is validated across Claude Sonnet 4.6, GPT-5.5, Gemini 2.5 Flash, Llama 4 Maverick, and DeepSeek V3 using 9,970 API calls. GEE logistic regression confirms the predicted Regime 2 certainty gradient (beta = -0.38 to -0.50, all p <= .013), and a Regime 3 ablation shows task framing flips context-following from near-100% to 6-71% (p < .001).

parametric strengthparametric uniquenessgee logistic regressioncertainty gradienttask framing

Dual-Temporal LSTM with Hybrid Attention for Airline Passenger Load Factor Forecasting: Integrating Intra-Flight and Inter-Flight Booking Dynamics

arXiv cs.AI · ASM Nazrul Islam, Md. Hasanul Kabir, Md. Liakot Ali, Joydeb Kumar Sana · 2026-05-12

A dual-stream Long Short-Term Memory (LSTM) with hybrid attention is proposed for airline passenger load factor forecasting, addressing the limitations of unidimensional temporal modeling. The model processes two complementary sequences: intra-flight booking accumulation and inter-flight booking patterns at fixed days-before-departure offsets. Multiple architectural variants combining self-attention, cross-attention, and hybrid attention with concatenation, residual, and gated fusion strategies are evaluated. Experiments on Biman Bangladesh Airlines data show the hybrid model achieves a Mean Absolute Error of 2.8167 and an R² of 0.9495, outperforming baselines and prior dual-LSTM architectures. The model generalizes across diverse route types and has been integrated into airline operations.

long short-term memoryattention mechanismbooking dynamicsmean absolute errordual-stream architecture

TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

arXiv cs.AI · Sara Shoouri, Morteza Tavakoli Taba, Hun-Seok Kim · 2026-05-12

TCP-SSM introduces token-conditioned poles to improve State Space Models (SSMs) for vision tasks, addressing implicit recurrence dynamics and memory control limitations. The method employs real poles for decay patterns and complex-conjugate poles for oscillatory responses, with token-dependent pole adaptation via bounded radius/angle modulation. Grouped pole sharing and low-rank pathways enable efficient linear-time scans. Evaluations on image classification, segmentation, and detection show 44% computation reduction in Vision Mamba models while maintaining accuracy.

state space modelstoken-conditioned polesscan operatorcomplex-conjugate poleslinear-time complexity

When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs

arXiv cs.AI · Fanpu Cao, Xin Zou, Xuming Hu, Hui Xiong · 2026-05-12

The paper introduces LaSCD (Laplacian-Spectral Contrastive Decoding), a training-free method to mitigate visual hallucinations in multimodal large language models (MLLMs). By analyzing high-frequency visual attention structure via layer-wise Laplacian energy, LaSCD identifies hallucination-prone layers and remaps next-token logits in closed form. Evaluations on hallucination and general multimodal benchmarks demonstrate consistent hallucination reduction while maintaining model capabilities, achieving this without additional training. The approach reveals that hallucination correlates with specific attention patterns rather than simple attention mass distribution.

multimodal large language modelsvisual hallucinationlaplacian energycontrastive decodingattention structure

Hindsight Hint Distillation: Scaffolded Reasoning for SWE Agents from CoT-free Answers

arXiv cs.AI · Shengjie Wang, Guanghe Li, Zonghan Yang, Yang Gao · 2026-05-12

Hindsight Hint Distillation (HHD) introduces a method to enhance reasoning in long-horizon tasks without requiring costly chain-of-thought (CoT) annotations. HHD synthesizes hindsight hints from failed self-rollouts, scaffolds on-policy rollouts to complete tasks, and self-distills these trajectories for generalization. Experiments demonstrate HHD's superiority, achieving an 8% absolute improvement on SWE-bench Verified compared to iterative RFT and trajectory-synthesis baselines, which improve by only 2%. Notably, HHD-induced reasoning strategies generalize effectively to out-of-distribution tasks, yielding significant gains on SWE-bench Multilingual without multilingual training.

hindsight hint distillationchain-of-thoughtself-rolloutson-policy rolloutsself-distillation

Sharpen Your Flow: Sharpness-Aware Sampling for Flow Matching

arXiv cs.AI · Aditi Gupta, Soon Hoe Lim, Annan Yu, N. Benjamin Erichson · 2026-05-12

SharpEuler introduces a training-free sampler for flow matching models that optimizes sample quality under fixed evaluation budgets. The method profiles pretrained models offline by estimating velocity field sharpness along calibration trajectories, converting this profile into a non-uniform timestep grid via quantile transform. SharpEuler is theoretically justified through numerical, variational, and statistical principles, demonstrating stability at the terminal distribution level. Empirical results show improved sample quality, reducing inter-mode leakage and increasing mode coverage compared to uniform sampling schedules.

flow matchingsharpness profileeuler integrationtimestep gridmode coverage

Optimal LTLf Synthesis

arXiv cs.AI · Yujian Cao, Sven Schewe, Qiyi Tang, Shufang Zhu · 2026-05-12

The paper introduces optimal LTLf synthesis, addressing the limitation of traditional strategy synthesis by maximizing the realization of objectives when not all are jointly achievable. Three approaches are proposed: max-guarantee synthesis, which identifies a maximal set of a priori guaranteed objectives; max-observation synthesis, which maximizes a posteriori realized objectives across executions; and incremental max-observation synthesis, enhancing strategies by leveraging stronger guarantees during execution. Experimental results demonstrate that these variations scale effectively, solving a significant fraction of benchmark instances within timeout constraints, confirming their practical feasibility.

ltlf synthesismax-guarantee synthesismax-observation synthesisincremental synthesisstrategy synthesis

Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

arXiv cs.AI · Cheng Wang, Qin Liu, Wenxuan Zhou, Muhao Chen · 2026-05-12

We introduce Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting, a hyperparameter-free optimization method that improves Group Relative Policy Optimization (GRPO) for large language models. The method dynamically down-weights extreme token-level updates using a Gaussian kernel, leveraging the covariance between token probabilities and their advantages to stabilize entropy changes during training. Empirical evaluations demonstrate that this approach enhances downstream reasoning performance across benchmarks compared to standard GRPO, while effectively maintaining training stability and preserving informative learning signals.

group relative policy optimizationcovariance-awaregaussian-kerneltoken probabilitiesadvantage reweighting

Read, Grep, and Synthesize: Diagnosing Cross-Domain Seed Exposure for LLM Research Ideation

arXiv cs.AI · Yunju Choi, Min Song · 2026-05-12

The paper investigates whether LLM-based research ideation benefits from cross-domain retrieval or mere exposure to diverse mechanisms. It introduces PaperGym, a three-stage pipeline: (1) tool-augmented seed extraction (read, grep, bash) from isolated papers, (2) cross-domain seed retrieval via paraphrasing across seven ML domains, and (3) rubric-scored method synthesis. Results show cross-domain retrieval outperforms no-retrieval and same-domain baselines in novelty but matches random diverse-seed controls, suggesting LLMs benefit from diversity but lack semantic retrieval exploitation. The authors release seed libraries and evaluation scripts.

llm ideationcross-domain retrievaltool-augmented extractionmethod synthesisrubric-based evaluation

Efficient and provably convergent end-to-end training of deep neural networks with linear constraints

arXiv cs.AI · Zonglin Yang, Zhexuan Gu, Yancheng Yuan · 2026-05-12

The paper introduces an efficient, provably convergent method for end-to-end training of deep neural networks with linear constraints via projection layers. Key innovation is the HS-Jacobian, a conservative mapping for polyhedral projection operators that enables nonsmooth automatic differentiation and integration with optimizers like Adam. Theoretical convergence guarantees are established for the HS-Jacobian-based Adam algorithm. Experiments across finance, computer vision, and architecture design demonstrate superior performance over existing methods.

hs-jacobianpolyhedral projectionnonsmooth automatic differentiationlinear constraintsend-to-end training

PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting

arXiv cs.AI · Yixiao Song, Qingyong Li, Wen Wang, Zhicheng Yan · 2026-05-12

PointGS introduces an unsupervised 3D point cloud segmentation pipeline leveraging 3D Gaussian Splatting to bridge the discrete-continuous domain gap between sparse point clouds and dense 2D images. The method reconstructs sparse point clouds into dense 3D Gaussian spaces via multi-view observations, renders multi-view dense images, extracts 2D semantic masks using the Segment Anything Model (SAM), and distills semantics to 3D Gaussian primitives through contrastive learning. Point semantics are assigned via nearest-neighbor search on labeled Gaussians after two-step registration. PointGS achieves +0.9% mIoU on ScanNet-V2 and +2.8% mIoU on S3DIS, outperforming state-of-the-art unsupervised methods.

3d gaussian splattingpoint cloud segmentationcontrastive learningsegment anything modelsemantic consistency

Controllable User Simulation

arXiv cs.AI · Guy Tennenholtz, Ofer Meshi, Amir Globerson, Uri Shalit · 2026-05-12

The paper formalizes controllable user simulation as a causal inference problem, demonstrating that standard supervised fine-tuning introduces structural bias via trajectory labels coupled to behavior policies. This look-ahead bias causes controllability collapse under policy shift, where evaluation metric variance grows geometrically. The authors propose causally consistent training mitigations: a priori controls, dynamic step-wise controls, and policy-conditioned learning. Experiments show their method eliminates bias, preserves conversational variance, and generalizes zero-shot to unseen agent behaviors, unlike standard approaches that distort distributions and collapse diversity.

causal inferencecontrollability collapselook-ahead biaspolicy-conditioned learningzero-shot generalization

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive

arXiv cs.AI · Taicheng Guo, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang · 2026-05-12

AutoLLMResearch introduces an agentic framework for automating high-cost LLM experiment configurations, addressing the inefficiency of manual expert-driven approaches. The method leverages LLMConfig-Gym, a multi-fidelity environment with over one million GPU hours of verifiable outcomes across four critical LLM experiment tasks, and a structured training pipeline formulated as a long-horizon Markov Decision Process to incentivize cross-fidelity extrapolation reasoning. Evaluations against diverse baselines demonstrate the framework's effectiveness, generalization, and interpretability, supporting its practical utility for scalable LLM experiment automation.

multi-fidelity environmentmarkov decision processcross-fidelity extrapolationllm experiment configurationagentic framework

A Study on Hidden Layer Distillation for Large Language Model Pre-Training

arXiv cs.AI · Maxime Guigon, Lucas Dixon, Michaël E. Sander · 2026-05-12

The study evaluates Hidden Layer Distillation (HLD) for decoder-only LLM pre-training, comparing it to logit-based Knowledge Distillation (KD) and self-supervised baselines. Using Gemma3 3.4B as teacher and 123M/735M student models trained on up to 168B C4 tokens, HLD shows systematic perplexity gains over KD but no consistent downstream task improvement. Results suggest HLD extracts latent signals, though further breakthroughs may be needed for broader pre-training utility.

hidden layer distillationknowledge distillationlarge language modelsdecoder-only architectureperplexity

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

arXiv cs.LG · Kexuan Shi, Hanxuan Li, Zeju Qiu, Yandong Wen · 2026-05-12

The authors propose Pion, a spectrum-preserving optimizer for LLM training that uses orthogonal equivalence transformations to update weight matrices while preserving their singular values. Unlike additive optimizers like Adam, Pion applies left and right orthogonal transformations to modulate weight matrix geometry without altering spectral norms. Theoretical analysis covers update rule derivation, design choices, and convergence properties. Experiments demonstrate Pion's stability and competitiveness in both LLM pretraining and finetuning compared to standard optimizers.

spectrum-preserving optimizerorthogonal equivalence transformationsingular value preservationllm trainingspectral norm

Elastic Attention Cores for Scalable Vision Transformers

arXiv cs.LG · Alan Z. Song, Yinjie Chen, Mu Nan, Rui Zhang · 2026-05-12

We propose Visual Elastic Core Attention (VECA), a Vision Transformer architecture that replaces quadratic-complexity all-to-all self-attention with linear-time core-periphery structured attention. VECA introduces a small set of learned core tokens that mediate information exchange between image patches, reducing complexity to O(N) for N patches interacting with C resolution-invariant cores. The model maintains and updates all N input tokens while avoiding a C-way bottleneck through nested training along the core axis. Evaluated on classification and dense tasks, VECA achieves competitive performance with state-of-the-art vision foundation models while significantly reducing computational costs, establishing elastic core-periphery attention as a scalable alternative for Vision Transformers.

vision transformerscore-periphery attentionlinear complexityelastic attentionnested training

Task-Adaptive Embedding Refinement via Test-time LLM Guidance

arXiv cs.LG · Ariel Gera, Shir Ashury-Tahan, Gal Bloch, Ohad Eytan · 2026-05-12

We propose a task-adaptive embedding refinement method using test-time LLM guidance to enhance zero-shot search and classification performance. The approach refines query embeddings in real-time by leveraging generative LLM feedback on a small document set, enabling embeddings to adapt to task-specific constraints. Extensive experiments on diverse benchmarks demonstrate consistent improvements, with up to +25% gains in literature search, intent detection, key-point matching, and query-instruction following. The refined embeddings improve ranking quality and binary separation, expanding the practical deployment scope of embedding models as a cost-effective alternative to LLM pipelines. Code is released for reproducibility.

embedding refinementzero-shot classificationllm guidancetask adaptationquery embedding

MEME: Multi-entity & Evolving Memory Evaluation

arXiv cs.LG · Seokwon Jung, Alexander Rubinstein, Arnas Uselis, Sangdoo Yun · 2026-05-12

MEME introduces a benchmark evaluating LLM-based agents in persistent environments, focusing on multi-entity and evolving memory tasks. It defines six tasks, including Cascade, Absence, and Deletion, which assess dependency reasoning and post-removal state. The study evaluates six memory systems across three paradigms on 100 episodes, revealing severe performance collapse on dependency reasoning tasks (Cascade: 3%, Absence: 1% average accuracy). Prompt optimization, deeper retrieval, and stronger LLMs fail to bridge this gap; only a file-based agent with Claude Opus 4.7 partially improves performance at ~70x baseline cost, highlighting scalability challenges.

llm-based agentsdependency reasoningmemory systemsprompt optimizationscalability challenges

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

arXiv cs.LG · Sagi Ahrac, Noya Hochwald, Mor Geva · 2026-05-12

The paper demonstrates geometric coupling between routers and experts in Sparse Mixture-of-Experts (SMoE) models, where router-expert weight gradients align along input directions. Analyzing a 1B-parameter SMoE, the authors show router scores predict expert activations, revealing shared routing-expert dynamics. Auxiliary load-balancing losses disrupt this coupling by homogenizing router directions. A parameter-free online K-Means router, leveraging geometric coupling, achieves low load imbalance with minimal perplexity increase, outperforming loss-based balancing methods. Results indicate routers learn assignment geometries that facilitate effective expert specialization.

sparse mixture-of-expertsgeometric couplingrouter-expert alignmentload-balancing lossesonline k-means router

High-arity Sample Compression

arXiv cs.LG · Leonardo N. Coregliano, William Opich · 2026-05-12

The work introduces a high-arity variant of sample compression schemes, extending concepts from learning theory to product spaces. By analyzing the properties of these schemes, the authors demonstrate that the existence of a high-arity sample compression scheme with non-trivial quality implies high-arity PAC learnability. This result bridges high-arity learning theory with classical PAC learning frameworks, providing a theoretical foundation for understanding learnability in complex, multi-dimensional spaces.

sample compression schemeshigh-arity learning theorypac learnabilityproduct spaceslearning theory

Search Your Block Floating Point Scales!

arXiv cs.LG · Tanmaey Gupta, Hayden Prairie, Xiaoxia Wu, Reyna Abhyankar · 2026-05-12

ScaleSearch introduces a fine-grained search strategy for selecting optimal scale factors in Block Floating Point (BFP) quantization, minimizing quantization errors by leveraging mantissa bits in microscaling formats. The method integrates with existing techniques like Post Training Quantization (PTQ) and low-precision attention, enhancing their performance. ScaleSearchAttention, an NVFP4-based attention algorithm, ensures near-zero performance loss in causal language modeling. Experiments demonstrate a 27% reduction in quantization error for NVFP4, a 15-point improvement in PTQ for Qwen3-8B on MATH500, and a 0.77-point improvement in Wikitext-2 PPL for Llama 3.1 70B.

block floating pointquantizationmicroscalingnvfp4post training quantization

A proximal gradient algorithm for composite log-concave sampling

arXiv cs.LG · Linghai Liu, Sinho Chewi · 2026-05-12

The authors propose a proximal gradient algorithm for sampling from composite log-concave distributions of the form π∝e^(-f-g), where f is smooth and g admits a restricted Gaussian oracle (RGO). The method leverages gradient evaluations of f and RGO calls for g, achieving ε error in total variation distance in O~(κ√d log^4(1/ε)) iterations when f+g is α-strongly convex and f is β-smooth (κ=β/α). The results extend to non-log-concave distributions satisfying Poincaré/log-Sobolev inequalities and non-smooth Lipschitz f.

log-concave samplingproximal gradientrestricted gaussian oracletotal variation distancepoincaré inequality

Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

arXiv cs.LG · Guinan Su, Yanwu Yang, Xueyan Li, Jonas Geiping · 2026-05-12

The paper proposes Multi-Stream LLMs, a paradigm shift from sequential message processing to parallel streams of computation in language models. By instruction-tuning models to simultaneously read from multiple input streams and generate tokens across multiple output streams in each forward pass, the method addresses limitations of sequential processing (e.g., inability to act while reading/react while writing). The approach improves efficiency through parallelization, enhances security via separation of concerns, and increases monitorability while maintaining causal dependencies across timesteps.

multi-stream llmsparallel computationinstruction-tuningcausal dependenciesautonomous agents

TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection

arXiv cs.LG · Tom Sander, Hongyan Chang, Tomáš Souček, Tuan Tran · 2026-05-12

TextSeal introduces a localized watermark for large language models that combines dual-key generation, entropy-weighted scoring, and multi-region localization to enhance detection robustness without inference overhead. The method builds on Gumbel-max sampling and supports optimizations like speculative decoding and multi-token prediction. Evaluations demonstrate TextSeal strictly outperforms baselines like SynthID-text in detection strength, maintains downstream performance on reasoning benchmarks, and shows no perceptible quality degradation in multilingual human evaluations (6000 A/B comparisons, 5 languages). Additionally, its 'radioactive' property enables detection of unauthorized use through model distillation.

gumbel-max samplingdual-key generationspeculative decodingmulti-region localizationmodel distillation

ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models

arXiv cs.LG · Chen Li, Xiaoling Hu, Songzhu Zheng, Jiawei Zhou · 2026-05-12

The paper introduces ORCE, an order-aware framework for improving verbalized confidence alignment in LLMs by decoupling confidence estimation from answer generation. The method first generates answers, then estimates confidence conditioned on fixed question-answer pairs using a sampling-based surrogate and rank-based RL objectives to align confidence with correctness likelihood. Experiments on reasoning and knowledge benchmarks demonstrate improved calibration and failure prediction while maintaining answer accuracy, showing the benefits of decoupled confidence optimization.

verbalized confidenceconfidence calibrationreinforcement learninglarge language modelsfailure prediction

Environment-Adaptive Preference Optimization for Wildfire Prediction

arXiv cs.LG · Enyi Jiang, Wu Sun · 2026-05-12

We propose Environment-Adaptive Preference Optimization (EAPO), a framework for wildfire prediction that addresses long-tailed distributions and environmental shifts. EAPO constructs distribution-aligned datasets via $k$-nearest neighbor retrieval and performs hybrid fine-tuning combining supervised learning with preference optimization, emphasizing rare extreme events. Evaluated on a real-world wildfire prediction task with environmental shifts, EAPO achieves robust performance (ROC-AUC 0.7310) and improves detection in extreme regimes, demonstrating effectiveness in dynamic wildfire prediction systems.

environment-adaptive preference optimizationlong-tailed distributionwildfire predictionk-nearest neighbor retrievalpreference optimization

Learning Minimally Rigid Graphs with High Realization Counts

arXiv cs.LG · Oleksandr Slyvka, Jan Rubeš, Rodrigo Alves, Jan Legerský · 2026-05-12

The paper introduces a reinforcement learning method to construct minimally rigid graphs with high realization counts, addressing an extremal problem in rigidity theory. The approach uses the Deep Cross-Entropy Method with a policy combining a Graph Isomorphism Network encoder and a permutation-equivariant action head to perform 0- and 1-extensions (Henneberg moves). Empirical results show the method matches known optima for planar realization counts and improves bounds for spherical realization counts, producing new record graphs.

minimally rigid graphsrealization countshenneberg movesgraph isomorphism networkdeep cross-entropy method

ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging

arXiv cs.LG · Neha Verma, Nikhil Mehta, Shao-Chuan Wang, Naijing Zhang · 2026-05-12

ORBIT preserves foundational language capabilities during Generative Retrieval (GenRetrieval) fine-tuning by regulating weight drift. The method monitors distance between fine-tuned and original model parameters, applying weight averaging when divergence exceeds a threshold. Experiments demonstrate ORBIT's superiority over continual learning baselines and regularization methods, maintaining both text generation and retrieval performance.

generative retrievalcatastrophic forgettingweight averagingmodel driftfine-tuning

Aligning Flow Map Policies with Optimal Q-Guidance

arXiv cs.LG · Christos Ziakas, Alessandra Russo, Avishek Joey Bose · 2026-05-12

The paper introduces flow map policies, a novel class of generative policies enabling fast action generation via arbitrary-size jumps in flow-based dynamics, addressing inference latency in sequential decision-making. The method combines FLOW MAP Q-GUIDANCE (FMQ), a trust-region optimization for offline-to-online RL adaptation, with Q-GUIDED BEAM SEARCH (QGBS) for iterative inference-time refinement. Evaluated on 12 robotic tasks from OGBench and RoboMimic, FMQ achieves a 21.3% relative improvement in average success rate over prior one-step policies.

flow map policiesoffline-to-online rltrust-region optimizationq-guided beam searchgenerative policies

Model-based Bootstrap of Controlled Markov Chains

arXiv cs.LG · Ziwei Su, Imon Banerjee, Diego Klabjan · 2026-05-12

The authors introduce a model-based bootstrap method for estimating transition kernels in finite controlled Markov chains (CMCs) under nonstationary or history-dependent control policies, addressing offline reinforcement learning scenarios with unknown behavior policies. The method leverages a novel bootstrap law of large numbers for visitation counts and applies the martingale central limit theorem to bootstrap transition increments. Distributional consistency is established for both single long-chain and episodic offline RL regimes, extending to offline policy evaluation and optimal policy recovery via the delta method. Empirical results on the RiverSwim problem demonstrate superior coverage of percentile bootstrap confidence intervals compared to baselines.

controlled markov chainsoffline reinforcement learningbootstrap law of large numbersmartingale central limit theoremdelta method

Trajectory-Agnostic Asteroid Detection in TESS with Deep Learning

arXiv cs.LG · Brian P. Powell, Jorge Martinez-Palomera, Amy Tuson, Christina Hedges · 2026-05-12

The paper introduces a deep learning method for asteroid detection in TESS data using a novel W-Net architecture comprising two stacked 3D U-Nets with skip connections. The approach eliminates the need for speed/direction assumptions by employing data augmentation through image cube rotation. A key innovation is Adaptive Normalization, a learned data scaling technique optimizing input processing. The publicly released tess-asteroid-ml toolkit generates training data with asteroid masks. The method's generalizability makes it suitable for future missions like the Nancy Grace Roman Space Telescope and NEOSurveyor.

w-netadaptive normalizationtess3d u-netshift-and-stack

Events as Triggers for Behavioral Diversity in Multi-Agent Reinforcement Learning

arXiv cs.LG · Hannes Büchi, Manon Flageat, Eduardo Sebastián, Amanda Prorok · 2026-05-12

The authors propose a Multi-Agent Reinforcement Learning (MARL) framework that decouples agent identity from behavior using event-triggered behavioral transitions. The framework introduces Neural Manifold Diversity (NMD), a formal distance metric for transient, agent-agnostic behaviors, and employs an event-based hypernetwork generating Low-Rank Adaptation (LoRA) modules over a shared team policy for dynamic agent-policy reconfiguration. Theoretical analysis ensures diversity does not interfere with reward maximization. Empirical results show the framework outperforms baselines across benchmarks, achieves zero-shot generalization, and uniquely solves tasks requiring sequential behavior reassignment.

multi-agent reinforcement learningneural manifold diversitylow-rank adaptationevent-triggered transitionsbehavioral diversity

A Semi-Supervised Framework for Speech Confidence Detection using Whisper

arXiv cs.LG · Adam Wynn, Jingyun Wang · 2026-05-12

A semi-supervised hybrid framework is proposed for speech confidence detection, addressing data scarcity and annotation subjectivity. The method combines deep semantic embeddings from the Whisper encoder with interpretable acoustic features (eGeMAPS descriptors) and auxiliary probability estimates of vocal stress and disfluency. An Uncertainty-Aware Pseudo-Labelling strategy is introduced to generate high-quality pseudo-labels for unlabelled data, prioritizing data quality over quantity. The framework achieves a Macro-F1 score of 0.751, outperforming self-supervised baselines (WavLM, HuBERT, Wav2Vec 2.0) and the unimodal Whisper baseline, with a 3% improvement in the minority class. Ablation studies confirm the superiority of curated pseudo-labels over indiscriminate augmentation.

whisper encoderegemaps descriptorspseudo-labellingmacro-f1 scoreacoustic features

MetaColloc: Optimization-Free PDE Solving via Meta-Learned Basis Functions

arXiv cs.LG · Zichuan Yang · 2026-05-12

MetaColloc introduces an optimization-free, data-free framework for solving partial differential equations (PDEs) by decoupling basis discovery from the solving process. The method meta-trains a dual-branch neural network on Gaussian Random Fields offline to create a universal dictionary of neural basis functions. At test time, the frozen network assembles a collocation matrix, solving PDEs via a single linear least squares step or Newton-Raphson for non-linear cases. Experiments across six 2D and 3D PDEs demonstrate state-of-the-art accuracy and test-time computation reductions by orders of magnitude. Frequency sweep analysis reveals a critical mismatch between function approximation and operator stability at high frequencies, guiding future operator-aware meta-learning.

partial differential equationsmeta-learningcollocation matrixgaussian random fieldsnewton-raphson

Attacks and Mitigations for Distributed Governance of Agentic AI under Byzantine Adversaries

arXiv cs.LG · Matthew D. Laws, Alina Oprea, Cristina Nita-Rotaru · 2026-05-12

This work analyzes vulnerabilities in SAGA, a state-of-the-art agentic AI governance system, focusing on attacks from a compromised Provider. The authors identify concrete attacks, including undermining agent attributability, extracting private data, and bypassing access control. They propose three mitigation strategies: SAGA-BFT, a Byzantine-resilient architecture with strong security but high overhead; SAGA-MON and SAGA-AUD, leveraging lightweight monitoring and auditing for minimal overhead; and SAGA-HYB, a hybrid approach balancing security and performance. Evaluations compare these architectures against SAGA, discussing optimal solutions under varying conditions.

agentic ai governancebyzantine resilienceaccess controlmonitoringauditing

From Message-Passing to Linearized Graph Sequence Models

arXiv cs.LG · Joël Mathys, Basil Rohner, Saku Peltonen, Roger Wattenhofer · 2026-05-12

The authors introduce Linearized Graph Sequence Models, a framework that reformulates message-passing graph computation through the lens of sequence modeling to simplify architectural decisions. This approach decouples computational processing depth from information propagation depth, enabling graph architectural choices to be treated as sequence modeling problems. The study empirically and theoretically analyzes sequence properties that effectively preserve graph inductive bias, particularly demonstrating improved performance on long-range information tasks in graphs. The findings provide a principled method for integrating modern sequence modeling advances into message-passing based graph learning while recasting architectural questions as input modeling choices.

message-passingsequence modelinggraph inductive biasinformation propagationlinearized graph

Neural-Schwarz Tiling for Geometry-Universal PDE Solving at Scale

arXiv cs.LG · Paolo Secchi, Daniel S. Balint, Marco Maurizi · 2026-05-12

The paper introduces Neural-Schwarz Tiling (NEST), a local-to-global framework for scalable PDE solving that shifts learning from full-domain solution operators to reusable local physical solvers. NEST trains a neural operator on minimal voxel patches (3×3×3) with diverse local geometries and boundary/interface data, then composes global solutions through domain decomposition and iterative Schwarz coupling with partition-of-unity assembly. Evaluated on nonlinear static equilibrium in compressible neo-Hookean solids, NEST demonstrates generalization across domain size, shape, and boundary-condition configurations, offering a reusable path for scalable learned PDE solvers.

neural operatordomain decompositionschwarz couplingvoxel patchespartition-of-unity

Multi-Variable Conformal Prediction: Optimizing Prediction Sets without Data Splitting

arXiv cs.LG · Laura Lützow, Simone Garatti, Marco C. Campi, Lars Lindemann · 2026-05-12

The paper introduces multi-variable conformal prediction (MCP), a framework extending conformal prediction to vector-valued score functions with multiple calibration variables, eliminating data splitting while preserving coverage guarantees. MCP unifies prediction set design and calibration via scenario theory, proposing two variants: RemMCP (constrained optimization with constraint removal) and RelMCP (iterative optimization with constraint relaxation). Experiments on ellipsoidal and multi-modal prediction sets show both variants achieve target coverage with smaller or comparable set sizes than split conformal baselines, while reducing calibration variance by using all data simultaneously.

conformal predictionscenario theoryprediction setscoverage guaranteesconstrained optimization

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

arXiv cs.LG · Vage Egiazarian, Erik Schultheis, Andrei Panferov, Earl Killian · 2026-05-12

The paper introduces multiple-grid quantization for large language models, formalizing the power-of-two-grids (PO2) problem and demonstrating its efficacy for small-group formats like MXFP and NVFP. Four grid families are instantiated: PO2(NF4), MPO2, PO2(Split87), and SFP4, each leveraging adaptive grids to enhance quantization accuracy. Empirical results show consistent improvements in post-training quantization of open models and pre-training of Llama-like models, outperforming single-grid FP4 in both weight-only and weight+activation scenarios. Theoretical analysis indicates diminishing returns for very large groups. Source code is provided for reproducibility.

quantizationpower-of-two-gridsmxfpnf4tensorcore

In-context learning to predict critical transitions in dynamical systems

arXiv cs.LG · Yunus Sevinchan, Juan Nathaniel, Kai Ueltzhöffer, Carla Roesch · 2026-05-12

We introduce TipPFN, an in-context learning framework for predicting critical transitions in dynamical systems using a prior-data fitted network. The method leverages a novel synthetic data generator based on canonical bifurcation scenarios with randomized stochastic dynamics, enabling flexible adaptation to contexts of varying size, complexity, and dimensionality. TipPFN achieves robust, state-of-the-art performance in early detection of critical transitions across unseen tipping regimes, sim-to-real scenarios, and real-world observations, demonstrating effectiveness in both in-context learning and zero-shot settings.

in-context learningcritical transitionsprior-data fitted networkbifurcation scenarioszero-shot learning

From Model Uncertainty to Human Attention: Localization-Aware Visual Cues for Scalable Annotation Review

arXiv cs.LG · Moussa Kassem Sbeyti, Joshua Holstein, Philipp Spitzer, Nadja Klein · 2026-05-12

The study introduces a novel interface for visualizing spatial uncertainty in AI-assisted annotation workflows, addressing the challenge of mislocalized predictions in tasks requiring both class labels and spatial boundaries. Through a controlled experiment with 120 participants, the interface demonstrates that annotators receiving spatial uncertainty cues achieve higher label quality and increased efficiency. Box-level analysis reveals that these cues effectively redirect annotator attention toward high-uncertainty predictions and away from well-localized boxes. The findings establish localization uncertainty as a critical factor in improving human-in-the-loop annotation processes.

spatial uncertaintyai-assisted annotationlocalization-awarehuman-in-the-looplabel quality

Approximation of Maximally Monotone Operators : A Graph Convergence Perspective

arXiv cs.LG · Takashi Furuya, Yury Korolev, Takaharu Yaguchi · 2026-05-12

The paper introduces a graph convergence framework for approximating maximally monotone operators, addressing limitations of classical uniform and $L^p$ approximation methods. By leveraging Painlevé-Kuratowski convergence, the authors demonstrate that continuous encoder-decoder architectures can approximate such operators locally in the graph sense. Additionally, they propose resolvent-based parameterizations to construct structure-preserving approximations that maintain maximal monotonicity. This approach extends operator learning to discontinuous and set-valued operators, which are prevalent in differential operator contexts.

graph convergencemaximally monotone operatorspainlevé-kuratowski convergenceresolvent-based parameterizationsencoder-decoder architectures

STRABLE: Benchmarking Tabular Machine Learning with Strings

arXiv cs.LG · Gioia Blayer, Myung Jun Kim, Félix Lefebvre, Lennart Purucker · 2026-05-12

STRABLE introduces a benchmarking corpus of 108 real-world tabular datasets containing both string and numerical entries, addressing the understudied area of tabular learning with strings. The study evaluates 445 pipelines, comparing end-to-end architectures against modular pipelines where strings are encoded, post-processed, and fed to tabular learners. Results show that advanced tabular learners paired with simple string embeddings perform well on categorical-dominant tables, while large LLM encoders excel on free-text-dominant tables, with performance sensitive to post-processing. STRABLE provides generalizable pipeline rankings, establishing it as a foundational resource for research on string tabular learning.

tabular learningstring embeddingsllm encoderspost-processingbenchmarking corpus

Targeted Neuron Modulation via Contrastive Pair Search

arXiv cs.LG · Sam Herring, Jake Naviasky, Karan Malhotra · 2026-05-12

We introduce contrastive neuron attribution (CNA), a method for identifying the 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts using only forward passes. Applying CNA to Llama and Qwen models (1B-72B parameters), we find that ablating these neurons reduces refusal rates by over 50% on a jailbreak benchmark while preserving fluency. Base models exhibit similar late-layer discrimination structures, but steering these neurons produces content shifts rather than behavioral change. Results suggest alignment fine-tuning transforms pre-existing discrimination structures into sparse, targetable refusal gates.

contrastive neuron attributionmlp neuronsalignment fine-tuningrefusal gatejailbreak benchmark

What makes a word hard to learn? Modeling L1 influence on English vocabulary difficulty

arXiv cs.LG · Jonas Mayer Martins, Zhuojing Huang, Aaricia Herygers, Lisa Beinborn · 2026-05-12

The study computationally models English vocabulary difficulty for learners with Spanish, German, or Chinese as their first language (L1), using gradient-boosted models trained on word familiarity, meaning, surface form, and cross-linguistic transfer features. Shapley value analysis reveals word familiarity as the dominant feature across all L1s, with orthographic transfer additionally influencing Spanish and German learners. Chinese learners' difficulty is determined solely by familiarity and surface features, lacking orthographic transfer. The models yield interpretable, L1-specific difficulty estimates for curriculum design.

gradient-boosted modelsshapley valuesorthographic transfercross-linguistic transfervocabulary difficulty

Hypernetworks for Dynamic Feature Selection

arXiv cs.LG · Javier Fumanal-Idocin, Raquel Fernandez-Peralta, Javier Andreu-Perez · 2026-05-12

We propose Hyper-DFS, a hypernetwork-based dynamic feature selection (DFS) approach that generates feature subset-specific classifier parameters on demand. The method employs a Set Transformer encoding to create a smooth conditioning space, ensuring geometric proximity for functionally similar tasks. Structural analysis shows Hyper-DFS achieves a smaller complexity bound than mask-embedding methods. Experiments demonstrate state-of-the-art performance on synthetic and real-life tabular data, competitive results on image datasets, and superior zero-shot generalization to unseen feature subsets compared to existing DFS approaches.

hypernetworkdynamic feature selectionset transformerzero-shot generalizationcomplexity bound

Reconstruction of Personally Identifiable Information from Supervised Finetuned Models

arXiv cs.LG · Sae Furukawa, Alina Oprea · 2026-05-12

We introduce COVA, a novel decoding algorithm for reconstructing personally identifiable information (PII) from supervised finetuned (SFT) language models under prefix-based attacks. Using multi-turn, user-centric Q&A datasets in medical and legal domains containing PII, we evaluate PII leakage across varying levels of attacker knowledge about the fine-tuning dataset. Results demonstrate that partial attacker knowledge significantly improves reconstruction success, with leakage varying substantially across PII types. COVA consistently outperforms existing extraction methods in reconstructing sensitive information from SFT models.

supervised finetuningpersonally identifiable informationprefix-based attacksdecoding algorithmmulti-turn q&a

Delay-Empowered Causal Hierarchical Reinforcement Learning

arXiv cs.LG · Chenran Zhao, Dianxi Shi, Haotian Wang, Mengzhu Wang · 2026-05-12

The paper proposes Delay-Empowered Causal Hierarchical Reinforcement Learning (DECHRL), a novel method for handling stochastic delayed effects in reinforcement learning. DECHRL explicitly models causal state transitions and their delay distributions, integrating them into a delay-aware empowerment objective to guide exploration toward controllable states. Evaluated on modified 2D-Minecraft and MiniGrid environments with stochastic delays, DECHRL outperforms baselines in decision-making under temporal uncertainty by effectively modeling and adapting to variable delays.

hierarchical reinforcement learningstochastic delayscausal modelingempowerment objectivetemporal uncertainty

Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models

arXiv cs.LG · Runhe Lai, Xinhua Lu, Yanqi Wu, Jinlun Ye · 2026-05-12

The Instruction Lens Score (InsLen) is introduced as a novel plug-and-play object hallucination detector for multimodal large language models (MLLMs), addressing a critical challenge in their reliable deployment. InsLen leverages instruction token embeddings, which implicitly encode visual information while filtering misleading visual embeddings, combining a Calibrated Local Score with a Context Consistency Score to measure object token consistency. Extensive experiments across multiple benchmarks and diverse MLLM architectures demonstrate InsLen's consistent outperformance of existing hallucination detection methods, highlighting its effectiveness and robustness without requiring auxiliary models or additional training.

instruction lens scoreobject hallucinationmultimodal large language modelscontext consistency scorecalibrated local score

SOAR: Scale Optimization for Accurate Reconstruction in NVFP4 Quantization

arXiv cs.LG · Chengzhu Bao, Xianglong Yan, Zhiteng Li, Guangshuo Qin · 2026-05-12

The paper proposes SOAR, a post-training quantization framework that improves NVFP4 (4-bit microscaling format) accuracy for large language models. It introduces Closed-form Joint Scale Optimization (CJSO) to analytically optimize global and block-wise scales via reconstruction error minimization, and Decoupled Scale Search (DSS) to separate quantization/dequantization scales with discrete search. Experiments demonstrate SOAR outperforms existing NVFP4 methods across multiple LLMs, achieving higher accuracy at identical memory footprints without hardware overhead.

nvfp4post-training quantizationreconstruction errorscale optimization4-bit quantization

Optimal Policy Learning under Budget and Coverage Constraints

arXiv cs.LG · Giovanni Cerulli · 2026-05-12

The paper presents a framework for optimal policy learning under joint budget and minimum coverage constraints, demonstrating that the problem exhibits a knapsack-type structure. The optimal policy is characterized by an affine threshold rule incorporating budget and coverage shadow prices. The linear programming relaxation of the combinatorial solution is shown to have an O(1) integrality gap, ensuring asymptotic equivalence with optimal discrete allocation. Two algorithms, Greedy-Lagrangian (GLC) and rank-and-cut (RC), are analyzed: GLC achieves near-optimal performance in finite samples, while RC performs optimally under slack coverage constraints or homogeneous costs but misallocates when cost heterogeneity interacts with binding coverage constraints. Monte Carlo simulations validate these findings.

knapsack-type structureaffine threshold ruleintegrality gapgreedy-lagrangianrank-and-cut

Intrinsic Vicarious Conditioning for Deep Reinforcement Learning

arXiv cs.LG · Rodney A Sanchez, Ferat Sahin, Alex Ororbia, Jamison Heard · 2026-05-12

The paper introduces vicarious conditioning as an intrinsic reward mechanism for deep reinforcement learning, enabling agents to learn from demonstrators without accessing their policies or reward functions. The method implements four cognitive steps (attention, retention, reproduction, reinforcement) using memory-based approaches, supporting low-shot learning. Evaluations in MiniWorld Sidewalk and Box2D CarRacing show improved episode lengths by avoiding non-descriptive terminal states and guiding toward desirable states, demonstrating applicability to single-life and continual learning scenarios.

intrinsic rewardvicarious conditioninglow-shot learningmemory-based methodsnon-descriptive terminal

On the Importance of Multistability for Horizon Generalization in Reinforcement Learning

arXiv cs.LG · Asad Bakija, Florent De Geeter, Julien Brandoit, Pierre Sacré · 2026-05-12

We formalize temporal horizon generalization in reinforcement learning (RL) for partially observable Markov decision processes (POMDPs), deriving necessary and sufficient conditions for policies to remain optimal across arbitrary horizons. Through empirical evaluation of nonlinear and parallelizable recurrent neural network (RNN) variants, we demonstrate that multistability is necessary for horizon generalization and sufficient in simple tasks, while complex tasks additionally require transient dynamics. Modern parallelizable architectures, including state space models and gated linear RNNs, fail to generalize due to inherent monostability. These findings establish multistability and transient dynamics as complementary dynamical regimes essential for scalable long-horizon RL, motivating the design of parallelizable architectures combining both properties.

temporal horizon generalizationmultistabilitytransient dynamicspartially observable markov decision processesparallelizable architectures

Investigating simple target-covariate relationships for Chronos-2 and TabPFN-TS

arXiv cs.LG · Gaspard Berthelier, Mariia Baranova, Andrei-Tiberiu Pantea, Etienne Le Naour · 2026-05-12

The paper evaluates the ability of Time Series Foundation Models (TSFMs) to integrate covariates by conducting controlled experiments on simple target-covariate relationships. Specifically, it compares Chronos-2 and TabPFN-TS, two recent TSFM architectures, in their capacity to model these dependencies. Results indicate that TabPFN-TS outperforms Chronos-2 in capturing such relationships, particularly for short prediction horizons. This suggests that Chronos-2's strong benchmark performance does not necessarily correlate with optimal modeling of simple covariate-target dependencies.

time series foundation modelscovariateschronos-2tabpfn-tszero-shot

A Unified Graph Language Model for Multi-Domain Multi-Task Graph Alignment Instruction Tuning

arXiv cs.LG · Haibo Chen, Xin Wang, Jiaheng Chao, Ling Feng · 2026-05-12

UniGraphLM introduces a unified graph language model addressing multi-domain, multi-task graph alignment instruction tuning. It integrates a multi-domain, multi-task GNN encoder to learn generalizable graph representations aligned with textual semantics, overcoming challenges of domain-task variability and LLM token space compatibility. The model adaptively aligns these representations with LLMs, enhancing generalization across diverse graph data. This approach bridges the gap between GNN-encoded representations and LLM token spaces, enabling effective instruction tuning for graph-language alignment.

graph neural networkslarge language modelsinstruction tuninggraph alignmenttoken space

ECTO: Exogenous-Conditioned Temporal Operator for Ultra-Short-Term Wind Power Forecasting

arXiv cs.LG · Cao Yuan, Junjun Wang · 2026-05-12

We propose ECTO (Exogenous-Conditioned Temporal Operator), a unified framework for ultra-short-term wind power forecasting that decomposes exogenous variable modeling into Physically-Grounded Variable Selection (PGVS) and Exogenous-Conditioned Regime Refinement (ECRR). PGVS performs hierarchical, group-aware sparse selection using domain-informed physical priors and sparsemax activations, while ECRR routes forecasts through learned regime experts with gain-bias calibration and horizon-specific corrections. Experiments on three wind farms (66-200 MW, 11-13 exogenous variables) show ECTO achieves the lowest MSE across all sites, with relative improvements of 2.2%-5.2% over baselines, widening to 8.6% at H=32. Ablations confirm positive contributions from PGVS (+1.84%) and ECRR (+2.86%), with interpretability analysis revealing physically meaningful variable selection and consistent calibration strategies.

ultra-short-term forecastingexogenous variablessparsemax activationgain-bias calibrationmixture-of-experts

Fair Conformal Classification via Learning Representation-Based Groups

arXiv cs.LG · Senrong Xu, Yanke Zhou, Yuhao Tan, Zenan Li · 2026-05-12

The paper introduces a fair conformal inference framework for classification that guarantees conditional coverage on adaptively identified subgroups, addressing algorithmic biases in traditional conformal prediction methods. The method constructs compact prediction sets by learning representation-based groups through nonlinear feature combinations, balancing effectiveness and efficiency while ensuring adaptive equalized coverage across subgroups. Experiments on synthetic and real-world datasets demonstrate the framework's effectiveness in achieving fair and trustworthy machine learning.

conformal predictionconditional coveragealgorithmic biasrepresentation learningfair classification

Probing Non-Equilibrium Grain Boundary Dynamics with XPCS and Domain-Adaptive Machine Learning

arXiv cs.LG · Mouyang Cheng, Bowen Yu, Chu-Liang Fu, Nina Andrejevic · 2026-05-12

The authors introduce a domain-adaptive machine learning framework combined with X-ray photon correlation spectroscopy (XPCS) to quantitatively probe non-equilibrium grain boundary (GB) dynamics in nanocrystalline materials. They employ a semi-supervised learning approach that transfers physical parameter labels from continuum simulations to unlabeled experimental XPCS maps via domain-adaptive representation alignment. This method enables direct extraction of kinetic parameters, including bulk diffusivity, GB stiffness, and effective GB concentration, from noisy XPCS fluctuation maps. Results demonstrate that GB relaxation in nanocrystalline silicon deviates from time-translation invariance, remaining far from equilibrium over experimental timescales.

grain boundary dynamicsx-ray photon correlation spectroscopydomain-adaptive machine learningnon-equilibrium relaxationsemi-supervised learning

Information-Theoretic Generalization Bounds for Sequential Decision Making

arXiv cs.LG · Futoshi Futami, Masahiro Fujisawa · 2026-05-12

The paper introduces a sequential supersample framework to extend information-theoretic generalization bounds to sequential decision-making problems, addressing limitations in existing supersample conditional mutual information (CMI) bounds. The method separates learner filtration from proof-side enlargement, leveraging row-wise exchangeability to control the sequential generalization gap via sequential CMI, a sum of roundwise selector-loss information terms. A Bernstein-type refinement is also established for faster rates under variance conditions. The framework applies to online learning, streaming active learning with importance weighting, and stochastic multi-armed bandits.

sequential supersampleconditional mutual informationrow-wise exchangeabilityselector-loss informationbernstein-type refinement

Multi-Task Representation Learning for Conservative Linear Bandits

arXiv cs.LG · Jiabin Lin, Shana Moothedath · 2026-05-12

The paper introduces Constrained Multi-Task Representation Learning (CMTRL), a framework for conservative linear bandits that leverages shared low-dimensional representations across tasks. It proposes Safe-AltGDmin, an algorithm combining alternating projected gradient descent and minimization to recover a low-rank feature matrix while adhering to safety or performance constraints. Theoretical guarantees for regret and sample complexity bounds are established. Empirical evaluations demonstrate the algorithm's performance against benchmark methods in multi-task linear bandit settings.

linear banditsmulti-task learninglow-rank representationregret boundssample complexity

Lower bounds for one-layer transformers that compute parity

arXiv cs.LG · Daniel Hsu · 2026-05-12

The work establishes a lower bound for one-layer transformers, proving that no self-attention layer post-processed by a rational function can sign-represent the parity function unless the product of heads and post-processing degree grows linearly with input length. The method combines this bound with ReLU network rational approximations to extend the result to ReLU-post-processed self-attention layers. Results show fundamental limitations in transformer expressivity for parity computation under these architectural constraints.

self-attentionparity functionrational functionlower boundrelu networks

On What We Can Learn from Low-Resolution Data

arXiv cs.LG · Theresa Dahl Frehr, Niels Henrik Pontoppidan, Hiba Nassar, Tommy Sonne Alstrøm · 2026-05-12

The paper theoretically analyzes the informativeness of low-resolution data when models are evaluated on high-resolution inputs, using Kullback-Leibler divergence to characterize how datapoint influence varies with resolution. It derives bounds relating the contributions of high- and low-resolution observations to information loss under downsampling. Empirical validation with a vision transformer and convolutional neural network shows that incorporating low-resolution data improves performance when high-resolution samples are scarce.

low-resolution datakullback-leibler divergencevision transformerconvolutional neural networkdownsampling

Machine Learning for neutron source distributions

arXiv cs.LG · Jose Ignacio Robledo, Norberto Schmidt, Klaus Lieutenant, Jingjing Li · 2026-05-12

A novel machine learning approach for neutron source distribution estimation is proposed, leveraging probabilistic generative models trained on Monte Carlo particle lists. The method eliminates dependency on particle lists post-training, enabling efficient, rapid, and memory-efficient sampling. Four generative models—variational autoencoder, normalizing flow, generative adversarial network, and denoising diffusion model—are evaluated and compared against existing estimation techniques. Results demonstrate the feasibility of modeling neutron source distributions using probabilistic generative models, highlighting their potential for advancing this field.

probabilistic generative modelsmonte carloneutron source distributionvariational autoencoderdenoising diffusion model

Fused Gromov-Wasserstein Distance with Feature Selection

arXiv cs.LG · Harlin Lee, Ying Yu, Mingxin Li, Ranthony Clark · 2026-05-12

The paper introduces Fused Gromov-Wasserstein (FGW) distances with feature selection, enhancing interpretability and robustness in high-dimensional settings by adaptively suppressing irrelevant features. Two methods are proposed: (1) regularized FGW with Lasso/Ridge penalties and (2) simplex-constrained weights, including groupwise extensions. Theoretical analysis establishes bounds relative to classical FGW and Gromov-Wasserstein distances, alongside metric properties. An alternating minimization algorithm is developed. Experiments demonstrate improved interpretability and task-relevant structure identification, particularly in computational redistricting applications.

fused gromov-wassersteinfeature selectionalternating minimizationcomputational redistrictingmetric learning

PrivacySIM: Evaluating LLM Simulation of User Privacy Behavior

arXiv cs.LG · James Flemings, Murali Annavaram · 2026-05-12

The paper introduces PrivacySIM, an evaluation suite for assessing large language models' (LLMs) ability to simulate individual privacy decisions. The method benchmarks nine frontier LLMs against ground-truth responses from 1,000 users across five privacy studies, conditioning models on three persona facets (demographics, previous experiences, stated privacy attitudes). Results show persona conditioning improves simulation accuracy (best model: 40.4%), but LLMs still fail to faithfully replicate individual decisions, particularly for users with high AI experience but low privacy concerns.

privacy simulationllm evaluationpersona conditioningbehavioral modelingdata-sharing scenarios

STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts

arXiv cs.LG · Joshua Opria · 2026-05-12

STRUM introduces an end-to-end audio-to-chart pipeline for generating playable rhythm-game charts (Clone Hero/YARG) across five instruments without oracle metadata. The hybrid system combines specialized modules: a two-stage CRNN onset detector and six-model ensemble for drums, neural onset detectors with monophonic pitch tracking for guitar/bass, word-aligned ASR for vocals, and spectral keyboard detection. Evaluation on a 30-song benchmark (selected via drum-stem RMS criteria) shows F1 scores of 0.838 (drums), 0.694 (bass), 0.651 (guitar), and 0.539 (vocals) at ±100ms tolerance. The work includes ablation studies, timing distribution analysis, and releases code/models/benchmark data.

audio-to-chartonset detectionmonophonic pitch trackingsource separationrhythm-game

MULTI: Disentangling Camera Lens, Sensor, View, and Domain for Novel Image Generation

arXiv cs.LG · Sonali Godavarthy, Matthias Neuwirth-Trapp, Tim-Felix Faasch, Maarten Bieshaar · 2026-05-12

The paper introduces MULTI, a method for disentangling imaging factors (camera lens, sensor, viewpoint, domain) in text-to-image generation to address limitations of current content-focused approaches. The two-stage approach first learns general factors via textual inversion, then extracts dataset-specific factors, enabling novel factor combinations and distribution gap reduction. Evaluation on the DF-RICO benchmark demonstrates MULTI's effectiveness, establishing factor disentanglement as a new research direction for precise image generation control.

factor disentanglementtextual inversionimage generationcontrolnetsdf-rico benchmark

Keeping Score: Efficiency Improvements in Neural Likelihood Surrogate Training via Score-Augmented Loss Functions

arXiv cs.LG · Alexander Shen, Mikael Kuusela · 2026-05-12

The authors propose a score-augmented loss function to improve the efficiency of neural likelihood surrogate training for structured stochastic process models. By augmenting binary cross-entropy loss with exact score information ∇θlog p(x∣θ) and adaptive weighting based on loss gradients, they bypass the black-box assumption of simulation-based inference. Evaluations on network dynamics and spatial processes demonstrate that the method achieves inference performance equivalent to a 10x increase in training data with less than a 1.1x increase in training time, drastically reducing computational costs.

simulation-based inferencelikelihood surrogatescore augmentationadaptive weightingstochastic process

Elicitation-Augmented Bayesian Optimization

arXiv cs.LG · Alvar Haltia, Ville Hyvönen, Samuel Kaski · 2026-05-12

The authors propose elicitation-augmented Bayesian optimization (BO), a method that integrates pairwise comparison queries from domain experts to improve sample efficiency in human-in-the-loop BO. Unlike prior approaches requiring explicit quantification of expert knowledge, this method interprets pairwise judgments as noisy evidence about the objective function, combining them with direct observations via a cost-aware value-of-information acquisition function. The method adapts to query cost and noise: it outperforms observation-only BO when queries are cheap and reverts to standard BO when queries are costly or noisy, achieving performance near the convex hull of individual information sources.

bayesian optimizationpairwise comparisonsvalue-of-informationsample efficiencyelicitation

Learning plug-in surrogate endpoints for randomized experiments

arXiv cs.LG · Alessandro-Umberto Margueritte, Ahmet Zahid Balcıoğlu, Jesse Krijthe, Dave Zachariah · 2026-05-12

The paper introduces plug-in composite surrogates as functions of post-treatment variables that can substitute for primary outcomes in randomized experiments. Two methods are proposed for learning these surrogates by maximizing effect predictiveness, with theoretical analysis of unbiased effect estimation in representative scenarios. Empirical evaluation on synthetic and real-world experimental data demonstrates that the proposed method outperforms established approaches in predicting primary effects.

surrogate endpointsrandomized experimentseffect predictivenessplug-in estimatorscausal inference

Resilient Vision-Tabular Multimodal Learning under Modality Missingness

arXiv cs.LG · Camillo Maria Caruso, Valerio Guarrasi, Paolo Soda · 2026-05-12

A multimodal transformer framework is proposed for joint vision-tabular learning under pervasive modality missingness, eliminating the need for imputation or heuristic model switching. The architecture integrates vision, tabular, and multimodal fusion encoders, utilizing learnable modality tokens and masked self-attention to exclude missing tokens and modalities during information aggregation and gradient propagation. A modality-dropout regularization strategy stochastically removes available modalities during training to enhance resilience. Evaluated on the MIMIC-CXR dataset paired with MIMIC-IV clinical data for multilabel classification of 14 diagnostic findings, the method consistently outperforms baselines across all missingness regimes, demonstrating smoother performance degradation and improved robustness. Attention-level masking and intermediate fusion with joint fine-tuning are identified as critical for resilient multimodal inference.

multimodal transformermodality missingnessmasked self-attentionmodality-dropoutintermediate fusion

Approximation Theory of Laplacian-Based Neural Operators for Reaction-Diffusion System

arXiv cs.LG · Takashi Furuya, Ryo Ozawa, Jenn-Nan Wang · 2026-05-12

The paper establishes explicit approximation error bounds for Laplacian-based neural operators applied to the generalized Gierer-Meinhardt reaction-diffusion system, a nonlinear PDE model of pattern formation. By leveraging the Laplacian spectral representation of the Green's function, the authors derive bounds in terms of network depth, width, and spectral rank, demonstrating polynomial growth in parameter complexity relative to target accuracy. This alleviates the curse of parametric complexity in generic operator learning. Numerical experiments on the Gierer-Meinhardt system empirically validate the theoretical findings.

neural operatorsreaction-diffusion systemlaplacian spectral representationapproximation error boundsparametric complexity

Limits of Learning Linear Dynamics from Experiments

arXiv cs.LG · Aybüke Ulusarslan, Niki Kilbertus, Nora Schneider · 2026-05-12

The work establishes fundamental limits on learning linear time-invariant (LTI) dynamics from experimental data, showing that identifiability depends on the experimental setup (initial state and control input) rather than just classical controllability conditions. Using geometric analysis, the authors derive a closed-form characterization of all systems consistent with observed trajectories and prove that dynamics remain uniquely identifiable on the reachable subspace, even when full system identification fails. This provides a theoretical framework for partial identifiability in data-driven system identification.

linear time-invariant systemssystem identificationidentifiabilityreachabilitygeometric control theory

Estimating Subgraph Importance with Structural Prior Domain Knowledge

arXiv cs.LG · Changhyun Kim, Seunghwan An, Jong-June Jeon · 2026-05-12

The authors propose a subgraph importance estimation method for pretrained Graph Neural Networks (GNNs) on graph-level tasks, formulated as a linear Group Lasso regression problem in the embedding space. The method leverages prior domain knowledge of graph substructures while remaining architecture-agnostic regarding output layers and readout functions, and operates without ground-truth labels. Experiments on real-world graph datasets demonstrate consistent outperformance over existing baselines in subgraph importance estimation. The method is further extended to identify important nodes within graphs.

graph neural networkssubgraph importancegroup lassoembedding spacegraph-level tasks

Split the Differences, Pool the Rest: Provably Efficient Multi-Objective Imitation

arXiv cs.LG · Ziyad Sheebaelhamd, Luca Viano, Volkan Cevher, Claire Vernade · 2026-05-12

We propose Multi-Output Augmented Behavioral Cloning (MA-BC), a provably efficient algorithm for multi-objective imitation learning in Multi-Objective Markov Decision Processes (MOMDPs). MA-BC systematically partitions divergent expert demonstrations while pooling non-conflicting state-action pairs, addressing limitations of standard imitation approaches that aggregate conflicting trajectories. Theoretical analysis shows MA-BC converges to Pareto-optimal policies faster than independent expert dataset learners and achieves minimax optimality. Empirical validation across discrete environments and a continuous Linear Quadratic Regulator task demonstrates MA-BC's effectiveness.

multi-objective imitation learningpareto-optimal policiesbehavioral cloningmulti-objective markov decision processminimax optimality

QDSB: Quantized Diffusion Schrödinger Bridges

arXiv cs.LG · Tobias Fuchs, Florian Kalinke, Nadja Klein · 2026-05-12

Quantized Diffusion Schrödinger Bridges (QDSB) accelerate training of generative models between unpaired source and target distributions by avoiding costly global coupling computations. QDSB computes endpoint coupling on anchor-quantized distributions and lifts the plan back to original data points via cell-wise sampling, ensuring stability with quantization error controlled by anchor approximation quality. Experiments demonstrate that QDSB achieves comparable sample quality to existing baselines while significantly reducing training time. The method addresses the computational inefficiency and geometric distortion of iterative minibatch-based entropic optimal transport solutions.

schrödinger bridgesquantized diffusionoptimal transportgenerative modelsanchor quantization

Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning

arXiv cs.LG · Jingduo Pan, Taoran Wu, Yiling Xue, Bai Xue · 2026-05-12

The authors introduce reach-avoid probability certificates (RAPCs) to address stochastic minimum-cost reach-avoid reinforcement learning, enabling agents to satisfy probabilistic reach-avoid constraints while minimizing expected cumulative costs. They develop a contraction-based Bellman formulation that integrates reach-avoid considerations into reinforcement learning, ensuring cost optimization under probabilistic constraints. The proposed algorithms achieve almost sure convergence to locally optimal policies. Experimental results in the MuJoCo simulator demonstrate improved cost performance and higher reach-avoid satisfaction rates compared to existing methods.

reach-avoid probability certificatesbellman formulationstochastic environmentscost optimizationmujoco simulator

Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization

arXiv cs.LG · Xu Chu, Guanyu Wang, Zhijie Tan, Xinrong Chen · 2026-05-12

We introduce Dual Group Advantage Optimization (DGAO), a reinforcement learning method to mitigate order sensitivity in Large Language Models (LLMs) while improving accuracy. DGAO balances intra-group relative accuracy advantage and inter-group relative stability advantage, rewarding order-stable and correct outputs while penalizing order-sensitive or incorrect responses. We also propose Consistency Rate and Overconfidence Rate metrics to evaluate pseudo-stability. Experiments show DGAO enhances order fairness and performance on Retrieval-Augmented Generation (RAG), mathematical reasoning, and classification tasks.

order sensitivitydual group advantage optimizationlarge language modelsretrieval-augmented generationreinforcement learning

NOFE -- Neural Operator Function Embedding

arXiv cs.LG · Lars Uebbing, Harald L. Joakimsen, Siyan Chen, Georgios Leontidis · 2026-05-12

Neural Operator Function Embedding (NOFE) introduces a domain-aware framework for continuous dimensionality reduction, addressing limitations of discrete point cloud methods. NOFE learns function-to-function mappings via a Graph Kernel Operator, enabling mesh-free evaluation and generalizing Sheaf Neural Networks to continuous domains. Evaluated against PCA, t-SNE, and UMAP, NOFE significantly outperforms baselines in local structure preservation (local Stress: 0.111 vs. 0.398, 0.773, 0.791) and sampling independence (Patch Stitching Error reduced by 20.0× relative to UMAP). While maintaining competitive global structure preservation (Stress-1: 0.379 vs. PCA's 0.268), NOFE resolves fine-grained structures and ensures consistency across varying sample densities.

dimensionality reductiongraph kernel operatorsheaf neural networksmesh-free evaluationlocal structure preservation

Assessment of cloud and associated radiation fields from a GAN stochastic cloud subcolumn generator

arXiv cs.LG · Dongmin Lee, Lazaros Oreopoulos, Nayeong Cho, Daeho Jin · 2026-05-12

A novel two-stage machine learning subcolumn generator for the GEOS atmospheric model is introduced, combining a Conditional Variational Autoencoder with a Generative Adversarial Network (CVAE-GAN) and a U-Net architecture. Trained on CloudSat-CALIPSO height-resolved cloud optical depth data, the generator produces 56 stochastic subcolumns representing cloud occurrence and optical depth profiles. Compared to the Räisänen method, it accurately reproduces bimodal cloud overlap distributions, reduces biases in grid-mean statistics, and halves the root-mean-square error in ISCCP-style cloud-top pressure and optical thickness joint histograms. The approach improves offline radiative transfer calculations, reducing the global-mean shortwave top-of-atmosphere cloud radiative effect bias by a factor of three.

generative adversarial networkcloud optical depthradiative transferearth system modelsvariational autoencoder

STAGE: Tackling Semantic Drift in Multimodal Federated Graph Learning

arXiv cs.LG · Zekai Chen, Xun Wu, Xunkai Li, Yihan Sun · 2026-05-12

STAGE introduces a protocol-first framework for multimodal federated graph learning (MM-FGL) to address semantic drift across heterogeneous client modalities. The method constructs a shared semantic space by translating multimodal features into comparable representations before graph propagation, mitigating false agreement and inconsistency amplification. Evaluations on 8 multimodal-attributed graphs demonstrate state-of-the-art performance across 5 tasks while reducing communication overhead.

federated graph learningmultimodal learningsemantic driftrepresentation translationgraph propagation

Understanding Sample Efficiency in Predictive Coding

arXiv cs.LG · Gaspard Oliviers, Elene Lominadze, Rafal Bogacz · 2026-05-12

This work provides a mechanistic understanding of the higher sample efficiency in Predictive Coding (PC) compared to Backpropagation (BP) through the introduction of 'target alignment', a metric quantifying the alignment between network output changes and prediction error. The authors derive and empirically validate analytical expressions for target alignment in Deep Linear Networks, demonstrating that PC outperforms BP in efficiency, particularly in deep, narrow, and pre-trained networks. They establish exact conditions for optimal target alignment in PC and validate findings through experiments on linear and non-linear models, showing PC's benefits persist even when theoretical assumptions are violated.

predictive codingbackpropagationtarget alignmentdeep linear networkssample efficiency

Delightful Gradients Accelerate Corner Escape

arXiv cs.LG · Jincheng Mei, Ian Osband · 2026-05-12

The paper introduces Delightful Policy Gradient (DG), a modified policy gradient method that accelerates escape from sub-optimal simplex corners in reinforcement learning. DG gates each gradient term by the product of advantage and action surprisal, eliminating self-trapping behavior near corners. Theoretical analysis for $K$-armed bandits shows DG achieves logarithmic escape time and maintains $O(1/t)$ global convergence in both bandits and tabular MDPs. Experiments on MNIST contextual bandits demonstrate faster recovery from bad initializations compared to standard policy gradient, though a counterexample reveals limitations under shared function approximation.

policy gradientsimplex cornersself-trappingadvantage gatingtabular mdps

Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models

arXiv cs.LG · Igor Strozzi · 2026-05-12

This work analyzes procedural-skill supervised fine-tuning (SFT) contributions across three Qwen3.5 model scales (0.8B, 2B, 4B) using a 200-task/40-skill holdout set, with Claude Haiku 4.5 as a frontier reference. The study employs a corpus of 353 demonstration rows and identifies a W-shaped pre-SFT trajectory, where SFT-attributable procedural-skill improvements are roughly uniform across model sizes (+0.070, +0.040, +0.075). Results reveal a regime-asymmetric pattern, with SFT providing the most significant absolute gains where the base model struggles with procedures. Cross-family validation via GPT-5.4 confirms findings with Cohen's κ ≥ 0.754 and agreement ≥ 93.25%. Earlier framings of format-only and shrinking SFT are identified as path-mismatch artifacts.

supervised fine-tuningprocedural-skillw-shaped trajectoryregime-asymmetriccross-family validation

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

arXiv cs.LG · Boyi Deng, Xu Wang, Yaoning Wang, Yu Wan · 2026-05-12

Qwen-Scope introduces an open-source suite of sparse autoencoders (SAEs) built on the Qwen model family, comprising 14 SAE groups across 7 variants from Qwen3 and Qwen3.5 series, including dense and mixture-of-expert architectures. The SAEs serve as practical interfaces for model development, enabling inference-time steering, evaluation analysis, data-centric workflows, and post-training optimization. Results demonstrate SAEs' utility in controlling language and concepts, analyzing benchmark redundancy, supporting multilingual toxicity classification, and mitigating undesirable behaviors like code-switching and repetition. Qwen-Scope aims to advance mechanistic interpretability research and connect model internals to downstream behavior.

sparse autoencodersmechanistic interpretabilitymixture-of-expertinference-time steeringpost-training optimization

Sobolev Regularized MMD Gradient Flow

arXiv cs.LG · Chenyang Tian, Bharath K. Sriperumbudur, Arthur Gretton, Zonghao Chen · 2026-05-12

We introduce Sobolev-regularized Maximum Mean Discrepancy (SrMMD) gradient flow, a novel regularization of MMD gradient flow that imposes a gradient penalty on the witness function. This regularization addresses the non-convexity of the MMD objective, enabling provable global convergence guarantees in both continuous and discrete time without requiring isoperimetric assumptions on the target distribution. The method leverages regularity conditions on kernel mean embeddings and is applicable to both sampling from unnormalized distributions (using Stein kernels) and generative modeling, unlike prior gradient flows limited to one setting. Empirical validation demonstrates its effectiveness across diverse generative modeling and sampling tasks.

sobolev regularizationmaximum mean discrepancygradient flowstein kernelskernel mean embeddings

Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning

arXiv cs.LG · Yue Deng, Zirui Wang, Yin Zhang · 2026-05-12

The paper introduces Adaptive TD($λ$) (ATD($λ$)) for Multi-agent Reinforcement Learning (MARL), addressing the challenge of policy distribution estimation in large joint action spaces. The method employs a parametric likelihood-free density ratio estimator with two replay buffers to approximate policy distributions without statistical calculation. ATD($λ$) dynamically assigns values to state-action pairs based on their likelihood under the current policy's stationary distribution. Evaluated on QMIX and MAPPO baselines across SMAC benchmarks and Gfootball academy scenarios, ATD($λ$) consistently outperforms or matches static $λ$ approaches.

adaptive td($λ$)multi-agent reinforcement learningdensity ratio estimatorreplay bufferspolicy distribution

LOFT: Low-Rank Orthogonal Fine-Tuning via Task-Aware Support Selection

arXiv cs.LG · Lanxin Zhao, Bamdev Mishra, Pratik Jawanpuria, Lequan Lin · 2026-05-12

LOFT introduces a low-rank orthogonal fine-tuning framework that decouples subspace adaptation from transformation, unifying various orthogonal PEFT methods. By framing adaptation as multiplicative subspace rotation, LOFT emphasizes support selection as a key design axis, informed by task-specific signals. Experiments across language understanding, visual transfer, and multilingual adaptation demonstrate that LOFT recovers principal-subspace orthogonal adaptation while gradient-informed supports enhance efficiency-performance trade-offs under constrained budgets.

orthogonal fine-tuninglow-rank adaptationsubspace rotationtask-aware supportparameter-efficient

Information theoretic underpinning of self-supervised learning by clustering

arXiv cs.LG · Josef Kittler, Sara Atito, Muhammad Awais · 2026-05-12

This paper contributes to the theoretical foundation of self-supervised learning (SSL) by formulating SSL as Kullback-Leibler (K-L) divergence optimization, specifically focusing on deep clustering approaches. The authors prevent mode collapse by imposing optimization constraints on the teacher distribution, leading to normalization using inverse cluster priors. Through the application of Jensen's inequality, this normalization simplifies to the batch centering procedure, a common heuristic in SSL. The theoretical model not only validates existing SSL methods but also provides a framework for future research directions.

self-supervised learningk-l divergencedeep clusteringbatch centeringmode collapse

FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity

arXiv cs.LG · Jian Tang, Jiawei Fan, Qingbin Liu, Zheng Wei · 2026-05-12

FIS-DiT introduces a training-free framework for accelerating Video Diffusion Transformers (DiTs) by exploiting frame interleaved sparsity in latent frames, overcoming limitations of step-wise optimization in few-step regimes. The method strategically processes frame subsets across model layers while maintaining structural consistency, enabling reduced computation without full block evaluations. Evaluations on Wan 2.2 and HunyuanVideo 1.5 show 2.11--2.41× speedup with minimal quality degradation on VBench-Q and CLIP metrics, advancing real-time HD video generation.

video diffusion transformersframe interleaved sparsityfew-step inferencelatent frame dualitytraining-free acceleration

Variance-aware Reward Modeling with Anchor Guidance

arXiv cs.LG · Shuxing Fang, Ruijian Han, Liangyu Zhang, Fan Zhou · 2026-05-12

The paper introduces Anchor-guided Variance-aware Reward Modeling (AVRM) to address limitations in standard Bradley-Terry reward models and Gaussian reward models for handling pluralistic human preferences. AVRM resolves non-identifiability in Gaussian models by augmenting pairwise preference data with two coarse response-level anchor labels, proving two anchors suffice for identification. The method includes a joint training objective and establishes non-asymptotic convergence rates for reward mean and variance estimation. Empirical results across simulations and four real-world datasets demonstrate consistent improvements in reward modeling and downstream RLHF tasks, including PPO training and best-of-N selection.

bradley-terrygaussian reward modelsnon-identifiabilityanchor guidanceppo training

Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs

arXiv cs.LG · Amr Abourayya, Jens Kleesiek, Michael Kamp · 2026-05-12

The paper proposes semantic consensus as an alternative to parameter aggregation for federated fine-tuning of LLMs, addressing scalability and heterogeneity challenges. Clients locally fine-tune models on private data and exchange generated outputs on public prompts; the server maps these to a semantic space, forms consensus pseudo-labels, and returns them for further local tuning. This approach reduces communication by orders of magnitude (e.g., 1006× for Llama3.1-405B), supports heterogeneous architectures, and matches federated fine-tuning performance while lowering runtime and energy costs.

federated learningsemantic consensuslarge language modelsparameter-efficient fine-tuningheterogeneous architectures

Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives

arXiv cs.LG · Konstantinos Oikonomidis, Jan Quan, Kimon Antonakopoulos, Antonio Silveti-Falls · 2026-05-12

The authors propose proximal preconditioned gradient methods, extending Muon and Scion optimizers to handle convex and nonconvex constraints. They introduce stochastic algorithms with convergence guarantees under heavy-tailed noise, supported by a novel geometric analysis, and a variance-reduced variant for faster convergence under standard noise. The work demonstrates that polynomial iterations in Muon are better modeled by nonlinear preconditioners than ideal matrix signs, yielding more accurate convergence analysis for practical implementations.

proximal methodsspectral gradientnonconvex optimizationvariance reductionpreconditioning

A Fast and Energy-Efficient Latch-Based Memristive Analog Content-Addressable Memory

arXiv cs.LG · Paul-Philipp Manea, Aishwarya Natarajan, Jim Ignowski, John Paul Strachan · 2026-05-12

The authors propose a strong-arm latched memristor (SALM) analog content-addressable memory (aCAM) cell that addresses limitations of conventional 6T2M designs, including static search power, limited voltage gain, and match-line crosstalk. SALM replaces static voltage division with a dynamic current-race comparator, enabling high regenerative gain, intrinsic result latching, and near-zero static search power. Compared to 6T2M, SALM reduces read energy by 33% at identical latency and eliminates scalability constraints. A dataset-aware optimization framework achieves up to 50% energy reduction at 3x latency across workloads. Integrated into the X-TIME decision-tree compiler, SALM maintains near-software accuracy for high-dimensional datasets, outperforming baseline designs.

memristorcontent-addressable memorymatch-line crosstalkcurrent-race comparatordecision-tree inference

More Edits, More Stable: Understanding the Lifelong Normalization in Sequential Model Editing

arXiv cs.LG · Xin Ma, Wei Chen, Qi Liu, Derong Xu · 2026-05-12

This work provides the first theoretical analysis of Lifelong Normalization (LN), a core strategy enabling stable lifelong model editing in Large Language Models. LN normalizes value gradients using running statistics, creating a self-reinforcing stability loop that yields asymptotically orthogonal parameter updates with bounded norms when combined with ridge-regularized regression. The authors propose StableEdit, which enhances LN via explicit warm-up and full whitening, improving long-horizon stability with minimal overhead. Experiments validate the theoretical insights, demonstrating competitive performance in mitigating catastrophic forgetting and model collapse during sequential editing.

lifelong normalizationsequential model editingridge-regularized regressionasymptotic orthogonalityvalue gradients

Fed-BAC: Federated Bandit-Guided Additive Clustering in Hierarchical Federated Learning

arXiv cs.LG · Satwat Bashir, Tasos Dagiuklas, Muddesar Iqbal · 2026-05-12

Fed-BAC introduces a federated bandit-guided additive clustering framework for hierarchical federated learning, addressing joint optimization of cluster assignment and client selection under data heterogeneity. The method employs a two-level bandit mechanism: contextual bandits at the cloud layer for server-to-cluster assignments and Thompson Sampling at edge servers for client selection. Additive decomposition enables knowledge sharing via a global network while capturing distribution variations through cluster-specific networks. Evaluated on CIFAR-10, SVHN, and Fashion-MNIST under non-IID settings, Fed-BAC achieves accuracy gains of up to +35.5pp over HierFAVG and +8.4pp over IFCA, converges 1.5 to 4.8× faster, and improves cross-server fairness, with scalability validated at 5× deployment scale.

hierarchical federated learningadditive clusteringcontextual banditsthompson samplingnon-iid

Stop Marginalizing My Dreams: Model Inversion via Laplace Kernel for Continual Learning

arXiv cs.LG · Patryk Krukowski, Jacek Tabor, Przemysław Spurek, Marek Śmieja · 2026-05-12

REMIX introduces a structured covariance modeling framework for data-free continual learning (DFCIL), addressing limitations of diagonal covariance assumptions in model inversion. By leveraging a Laplace kernel parameterization, REMIX enables scalable full-covariance modeling without dense matrix inversion or log-determinant computation, capturing feature dependencies with linear memory scaling and logarithmic computational overhead. This approach produces more coherent synthetic samples, improving performance on standard DFCIL benchmarks. Results demonstrate the necessity of modeling feature correlations for effective and scalable DFCIL. Code is available at https://github.com/pkrukowski1/REMIX-Model-Inversion-via-Laplace-Kernel.

data-free continual learningmodel inversionlaplace kernelstructured covariancefeature dependencies

ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems

arXiv cs.LG · Wenyong Zhou, Yuannuo Feng, Yizhe Chen, Taiqiang Wu · 2026-05-12

ROMER introduces a post-training calibration framework for robust MoE-based LLMs on analog compute-in-memory (CIM) systems, addressing hardware noise-induced expert load imbalance and suboptimal routing. The method combines expert replacement (swapping underactivated experts with high-frequency ones) and router logit recalibration via percentile-based normalization. Evaluations on DeepSeek-MoE, Qwen-MoE, and OLMoE show perplexity reductions of 58.6%, 58.8%, and 59.8% respectively under real-chip noise conditions, demonstrating cross-architecture generalizability.

mixture-of-expertscompute-in-memoryrouter calibrationload balancehardware noise

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

arXiv cs.LG · Jiazheng Zhang, Ziche Fu, Junrui Shen, Yunbin Zhao · 2026-05-12

The paper introduces entropy polarity, a token-level quantity predicting how reinforcement learning updates affect policy entropy in LLMs. Through theoretical analysis, the authors identify structural asymmetry: high-probability tokens induce entropy contraction while low-probability samples promote expansion. They propose Polarity-Aware Policy Optimization (PAPO), which dynamically balances entropy-expanding and contracting updates via advantage reweighting. Experiments on mathematical reasoning and agentic tasks demonstrate PAPO's superior performance over baselines, with improved training efficiency and reward gains.

entropy polaritypolicy optimizationreinforcement learningtoken-level mechanismadvantage reweighting

From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction

arXiv cs.LG · Mingcheng Zhu, Zhiyao Luo, Yu Liu, Tingting Zhu · 2026-05-12

The paper introduces Medical Token-Pair Encoding (MedTPE), a lossless prompt compression method for LLMs processing electronic health records (EHRs). MedTPE merges frequently co-occurring medical token pairs into composite tokens via dependency-aware replacement, requiring fine-tuning only 0.5-1.0% of the LLM's parameters. Experiments show MedTPE reduces token length by 31% and inference latency by 34-63% while maintaining or improving predictive performance across four clinical tasks, with demonstrated generalizability to other domains and languages.

prompt compressionelectronic health recordstoken-pair encodingdependency-aware replacementself-supervised learning

Decomposing the Generalization Gap in PROTAC Activity Prediction: Variance Attribution and the Inter-Laboratory Ceiling

arXiv cs.LG · Thor Klamt, Wolfgang Nejdl, Ming Tang · 2026-05-12

The study decomposes the generalization gap in PROTAC activity prediction, identifying inter-laboratory measurement variance as the dominant factor (0.124 AUROC contribution) over binarization-threshold choice (0.05). Using PROTAC-Bench (10,748 measurements, 173 targets), the authors evaluate eight architectures and ESM-2 models up to 3B parameters, finding a LOTO AUROC plateau near 0.67. Few-shot k=5 stratified retraining with ADMET features improves LOTO AUROC from 0.668 to 0.705, while Platt scaling maintains calibration. The work releases a variance-decomposition framework, per-target calibration protocol, and evaluation code.

protacgeneralization gapaurocesm-2platt scaling

A nonlinear extension of parametric model embedding for dimensionality reduction in parametric shape design

arXiv cs.LG · Andrea Serani, Giorgio Palma, Matteo Diez · 2026-05-12

The paper introduces NLPME, a nonlinear extension of Parametric Model Embedding (PME) for dimensionality reduction in parametric shape design. NLPME replaces PME's linear subspace with a nonlinear latent representation while maintaining geometry-driven latent variables and parameter-mediated reconstruction. Evaluated on a 32D bio-inspired underwater glider design, NLPME achieves 5% reconstruction error with 5 latent variables (vs PME's 8) and 1% error with 9 (vs PME's 15). The method retains most nonlinear compression benefits of deep autoencoders while preserving explicit backmapping to original design parameters.

dimensionality reductionparametric shape designnonlinear embeddinglatent representationgeometry reconstruction

One-Step Generative Modeling via Wasserstein Gradient Flows

arXiv cs.LG · Jiaqi Han, Puheng Li, Qiushan Guo, Renyuan Xu · 2026-05-12

We introduce W-Flow, a framework for one-step generative modeling via Wasserstein gradient flows, addressing the computational inefficiency of iterative sampling in diffusion and flow-based models. The method defines an evolution from a reference to a target distribution by minimizing an energy functional instantiated with Sinkhorn divergence, then trains a static neural generator to compress this evolution into a single step. Theoretical analysis shows convergence of finite-sample training dynamics to continuous-time distributional dynamics. W-Flow achieves state-of-the-art performance on ImageNet 256×256 generation with 1.29 FID, 100× faster sampling than comparable diffusion models, and improved mode coverage and domain transfer.

wasserstein gradient flowssinkhorn divergenceone-step generationenergy functionalfinite-sample dynamics

Federated Client Selection under Partial Visibility: A POMDP Approach with Spatio-Temporal Attention

arXiv cs.LG · Qijun Hou, Yuchen Shi, Pingyi Fan, Khaled B. Letaief · 2026-05-12

We propose a spatial-temporal attention-based reinforcement learning framework for federated client selection under partial visibility, formulated as a Partially Observable Markov Decision Process (POMDP). The method integrates historical global models and client identity embeddings to capture both temporal training contexts and persistent client characteristics. Experiments across multiple datasets demonstrate superior performance compared to existing baselines in heterogeneous and partially visible settings, effectively addressing incomplete observations in practical federated learning systems.

federated learningclient selectionpomdpspatial-temporal attentionreinforcement learning

Learning Feature Encoder with Synthetic Anomalies for Weakly Supervised Graph Anomaly Detection

arXiv cs.LG · Yingjie Zhou, Yuqin Xie, Fanxing Liu, Dongjin Song · 2026-05-12

The authors propose a weakly supervised graph anomaly detection method that learns domain-specific feature representations through synthetic anomalies. The approach employs a multi-task learning scheme where synthetic anomalies are generated by perturbing normal graphs, with each anomaly type assigned a dedicated detection head to ensure sensitivity to deviations. A two-phase training strategy is used: initial warm-up with synthetic samples only, followed by full training integrating both synthetic and real data. Experiments on public datasets demonstrate superior performance over existing methods. Code is available on GitHub.

graph anomaly detectionweakly supervised learningsynthetic anomaliesmulti-task learningfeature representation

Training-Inference Consistent Segmented Execution for Long-Context LLMs

arXiv cs.LG · Xianpeng Shang, Jiang Li, Zehua Duo, Qianyi Cai · 2026-05-12

We introduce a training-inference consistent segment-level generation framework for Transformer-based large language models, addressing the computational and memory challenges of long-context generation. The method enforces consistency by restricting gradient propagation to KV states from the immediately preceding segment during training, while allowing head-specific access to past KV states in the forward pass. Evaluated on long-context benchmarks, the approach achieves performance comparable to full-context attention, with competitive latency-memory trade-offs and significantly improved scalability, reducing peak prefill memory by approximately 6x at 128K context length compared to full-context attention with FlashAttention.

transformerkv statesgradient propagationlong-context generationscalability

WorldComp2D: Spatio-semantic Representations of Object Identity and Location from Local Views

arXiv cs.LG · SeongMin Jin, Doo Seok Jeong · 2026-05-12

WorldComp2D introduces a lightweight representation learning framework for spatio-semantic reasoning by explicitly structuring latent space geometry based on object identity and spatial proximity. The framework comprises a proximity-dependent encoder mapping observations into spatio-semantic latent space and a localizer inferring object coordinates from this representation. Evaluated on facial landmark localization, WorldComp2D reduces parameters and FLOPs by up to 4.0X and 2.2X, respectively, compared to state-of-the-art lightweight models, while maintaining real-time CPU performance. This demonstrates the efficiency and generality of explicitly structured latent spaces for spatio-semantic reasoning.

spatio-semantic reasoninglatent space geometryproximity-dependent encoderlocalizerfacial landmark localization

Online Continual Learning with Dynamic Label Hierarchies

arXiv cs.LG · Xinrui Wang, Shao-Yuan Li, Bartłomiej Twardowski, Alexandra Gomez-Villa · 2026-05-12

The paper introduces DHOCL (Online Continual Learning from Dynamic Hierarchies), a novel problem setting addressing evolving hierarchical label structures in online continual learning. To tackle partial supervision and granularity-dependent interference, the authors propose HALO (Hierarchical Adaptive Learning with Organized Prototypes), which combines adaptive classification heads with regularized hierarchical prototypes for rapid adaptation and semantic consistency. HALO outperforms existing methods on multiple benchmarks, achieving improvements in hierarchical accuracy, mistake severity, and continual performance metrics.

online continual learningdynamic hierarchiespartial supervisiongranularity-dependent interferencehierarchical prototypes

U-STS-LLM A Unified Spatio-Temporal Steered Large Language Model for Traffic Prediction and Imputation

arXiv cs.LG · Yichen Zhang, Jun Li · 2026-05-12

U-STS-LLM introduces a unified spatio-temporal steered LLM framework for traffic prediction and imputation, addressing limitations of specialized STGNNs and weakly guided LLM adaptations. The method features a Dynamic Spatio-Temporal Attention Bias Generator for explicit structural guidance, LoRA-based parameter-efficient tuning, and Gated Adaptive Fusion for multi-task learning. Evaluations on cellular datasets show state-of-the-art performance in long-horizon forecasting (e.g., 12-step) and high-missing-rate imputation (e.g., 50% missing), with improved training stability and efficiency over baselines.

spatio-temporal attentionlow-rank adaptationtraffic imputationdynamic graphmulti-task learning

Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation

arXiv cs.LG · Cristian Morasso, Anisa Halimi, Muhammad Zaid Hameed, Douglas Leith · 2026-05-12

The paper introduces Persona-Conditioned Adversarial Prompting (PCAP), a method for multi-identity red-teaming that conditions adversarial search on diverse attacker personas (e.g., doctors, students) to discover transferable jailbreaks. PCAP generates rich defense datasets with automatic metadata tracking, increasing attack success from 57% to 97% on GPT-OSS 120B while producing 2-6× more diverse prompts. Fine-tuning lightweight adapters on PCAP-generated data improves model robustness (recall: 0.36→0.99, F1: 0.53→0.96) with minimal false positives, demonstrating a closed-loop approach from vulnerability discovery to alignment.

adversarial promptingred-teamingjailbreakspersona-conditionedrobustness

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

arXiv cs.LG · Yan Jiang, Ruihong Qiu, Zi Huang · 2026-05-12

The paper introduces Block-R1, a framework addressing domain block size conflicts in multi-domain reinforcement learning (RL) for diffusion large language models (dLLMs). It formulates domain block size conflict, proposes a novel dataset (Block-R1-41K) with sample-level optimal block sizes, and establishes a benchmark for flexible RL post-training. The method includes a cross-domain post-training approach using sample-specific block sizes. Evaluations span 13 datasets, 7 RL algorithms, and multiple dLLM backbones, with resources open-sourced.

reinforcement learningdiffusion modelsblock size conflictmulti-domain learningpost-training

EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation

arXiv cs.LG · Sunung Mun, Sunghyun Cho, Jungseul Ok · 2026-05-12

EPIC introduces a training-free inference-time refinement framework for compositional text-to-image generation, addressing challenges with multi-object, count, attribute, and relation prompts. The method parses prompts into visual programs of object variables and typed predicates, verifying generated images against these programs to guide targeted editing or resampling. EPIC improves prompt-level accuracy on GenEval2 from 34.16% to 71.46%, outperforming prior baselines by 19.23 points while reducing image-model executions by 31%, MLLM calls by 72%, and MLLM tokens by 81%.

text-to-imageinference-time controlvisual programpredicate-guided searchcompositional generation

Unlocking Compositional Generalization in Continual Few-Shot Learning

arXiv cs.LG · Phu-Quy Nguyen-Lam, Phu-Hoa Pham, Dao Sy Duy Minh, Chi-Nguyen Tran · 2026-05-12

The paper introduces a novel paradigm for compositional generalization in continual few-shot learning by decoupling representation learning from compositional inference. The method leverages self-supervised Vision Transformers (ViTs) to preserve object-level geometries during training and dynamically composes slot representations at inference. This dual-phase strategy prevents representation drift and enables novel-concept transfer. Experiments show state-of-the-art performance in unseen-concept generalization and minimal forgetting across continual learning benchmarks.

compositional generalizationcontinual few-shot learningvision transformersrepresentation learningobject-centric representations

GRAFT: Graph-Tokenized LLMs for Tool Planning

arXiv cs.LG · Xinyi Gao, Xinyu Ren, Junliang Yu, Tong Chen · 2026-05-12

The paper introduces GRAFT, a graph-tokenized LLM framework for tool planning that internalizes tool graphs by mapping nodes to special tokens and learning dependencies in representation space. It employs on-policy tool context distillation to train on sampled trajectories while distilling stepwise planning signals. Experiments demonstrate GRAFT's state-of-the-art performance in exact sequence matching and dependency legality, enhancing reliability in complex workflow planning.

graph-tokenizedtool planningdependency-awareon-policy distillationworkflow reliability

Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs

arXiv cs.LG · Michael Lu, Max Qiushi Lin, Mo Chen, Sharan Vaswani · 2026-05-12

The paper proposes an augmented Lagrangian (AL) method for last-iterate convergence in constrained Markov decision processes (CMDPs), addressing the impracticality of mixture policies. It introduces projected Q-ascent (PQA) to solve AL sub-problems, proving global last-iterate convergence in tabular settings. The framework extends to log-linear and non-linear policies, validated on continuous control tasks. Theoretical guarantees match prior work while enabling practical deployment of single policies.

augmented lagrangianconstrained mdpslast-iterate convergenceprojected q-ascentlog-linear policies

Compositional Neural Operators for Multi-Dimensional Fluid Dynamics

arXiv cs.LG · Hamda Hmida, Hsiu-Wen Chang, Youssef Mesri · 2026-05-12

The paper introduces Compositional Neural Operators (CompNO), a framework for solving 2D PDEs by decomposing complex systems into modular Foundation Blocks of specialized Neural Operators. Each block (convection, diffusion, nonlinear convection, Poisson Solver) is pretrained on elementary physics and assembled via an Adaptation Block with an Aggregator that learns nonlinear interactions through physics-informed loss minimization. Evaluated on Convection-Diffusion, Burgers', and Incompressible Navier-Stokes equations, CompNO demonstrates improved adaptability, interpretability, and pretrained block reuse compared to traditional encoding-decoding approaches.

compositional neural operatorsfoundation blocksphysics-informed learningneural operatorspde surrogates

Slicing and Dicing: Configuring Optimal Mixtures of Experts

arXiv cs.LG · Margaret Li, Sneha Kudugunta, Danielle Rothermel, Luke Zettlemoyer · 2026-05-12

This work presents the first systematic study of Mixture-of-Experts (MoE) architecture design choices, analyzing over 2,000 pretraining runs across models up to 6.6B parameters. The authors exhaustively vary expert count, dimension, heterogeneous sizing, shared experts, and load-balancing mechanisms. Results show that performance consistently improves with total MoE parameters, even at extreme active-to-total parameter ratios (e.g., 128:1). Optimal expert size depends primarily on active parameter count rather than total parameters, while other design choices have minimal impact relative to expert count and granularity. Dropless routing emerges as the only secondary factor with consistent performance gains.

mixture-of-expertspretrainingload-balancingheterogeneous expertsdropless routing

Partial Model Sharing Improves Byzantine Resilience in Federated Conformal Prediction

arXiv cs.LG · Ehsan Lari, Reza Arablouei, Stefan Werner · 2026-05-12

The paper introduces a Byzantine-resilient federated conformal prediction (FCP) method using partial model sharing, where only subsets of parameters are exchanged per round. This approach safeguards both training and calibration phases by limiting attack surfaces and compressing non-conformity scores into histogram vectors for Byzantine detection. Experiments demonstrate improved coverage and tighter prediction intervals under diverse Byzantine attacks compared to standard FCP, offering robust uncertainty quantification with reduced communication overhead.

federated conformal predictionbyzantine resiliencepartial model sharingnon-conformity scoreshistogram-based characterization

Posterior Contraction Rates for Sparse Kolmogorov-Arnold Networks in Anisotropic Besov Spaces

arXiv cs.LG · Jeunghun Oh, Kyeongwon Lee, Jaeyong Lee, Lizhen Lin · 2026-05-12

The paper establishes posterior contraction rates for sparse Bayesian Kolmogorov-Arnold networks (KANs) in anisotropic Besov spaces, providing a statistical foundation for KANs. Using spike-and-slab priors and a hyperprior on model size, the method achieves near-minimax contraction rates that adapt to unknown anisotropic smoothness. Key results show fixed-depth KANs control approximation complexity through width, spline-grid parameters, and sparsity, with rates depending on layerwise smoothness in compositional settings. Theoretical tools for spline-edge architectures are developed, avoiding the curse of dimensionality.

kolmogorov-arnold networksposterior contractionanisotropic besov spacesspike-and-slab priorscompositional smoothness

GeomHerd: A Forward-looking Herding Quantification via Ricci Flow Geometry on Agent Interactive Simulations

arXiv cs.LG · Lake Yang, Junwei Su, Jingfeng Zeng, Wenhao Lu · 2026-05-12

GeomHerd introduces a forward-looking geometric framework for quantifying herding behavior in financial markets by analyzing agent-interaction graphs rather than lagging price correlations. The method employs discrete Ollivier--Ricci curvature on graphs generated from LLM-driven multi-agent simulations, linking graph topology to macroscopic herding statistics (CSAD). Results show early detection: 272-step median lead time before order-parameter onset, 65% recall of critical trajectories 318 steps early, and 40-step precedence over price-correlation baselines. The approach generalizes to the Vicsek model and improves cascade-window forecasting (reduced MAE).

ollivier-ricci curvaturemulti-agent simulationherding quantificationprice-correlation laggeomherd

Learning U-Statistics with Active Inference

arXiv cs.LG · Xiaoning Wang, Yuyang Huo, Liuhua Peng, Changliang Zou · 2026-05-12

The paper proposes an active inference framework for efficient estimation of U-statistics under label acquisition constraints. The method employs augmented inverse probability weighting to incorporate sampling rules and machine learning predictions, characterizing the optimal sampling rule for variance minimization. Experimental results on real datasets show significant improvements in estimation efficiency over baselines while maintaining target coverage.

u-statisticsactive inferenceaugmented inverse probability weightingsampling ruleestimation efficiency

MIST: Reliable Streaming Decision Trees for Online Class-Incremental Learning via McDiarmid Bound

arXiv cs.LG · Phu-Hoa Pham, Chi-Nguyen Tran, Nguyen Lam Phu Quy, Dao Sy Duy Minh · 2026-05-12

MIST introduces a reliable streaming decision tree for online class-incremental learning, addressing two miscalibrations in traditional approaches: unreliable split criteria and lack of knowledge transfer. The method combines (i) a K-independent McDiarmid confidence radius for Gini splitting, (ii) a Bayesian inheritance protocol for variance reduction, and (iii) per-leaf KLL quantile sketches for adaptive prediction. Evaluated on tabular streams, MIST matches parametric methods on near-Gaussian data and outperforms state-of-the-art benchmarks on non-Gaussian geometry.

streaming decision treesonline class-incremental learningmcdiarmid boundgini splittingquantile sketches

Targeted Tests for LLM Reasoning: An Audit-Constrained Protocol

arXiv cs.LG · Hongmin Li · 2026-05-12

The paper introduces an audit-constrained protocol for targeted evaluation of LLM reasoning, addressing limitations of fixed benchmarks by systematically testing prompt variations. Methodologically, it employs Component-Adaptive Prompt Sampling (CAPS) within a deterministic grammar-based framework, with strict semantic and extraction audits to distinguish genuine model errors from artifacts. Results show the protocol effectively identifies confirmed errors while filtering invalid cases, but CAPS does not outperform uniform sampling in audited yield or unique prompt discovery, emphasizing the need for audited metrics over proxy-guided policies.

llm reasoningprompt variationaudit protocolcomponent-adaptive samplingsemantic validation

A Mixture Autoregressive Image Generative Model on Quadtree Regions for Gaussian Noise Removal via Variational Bayes and Gradient Methods

arXiv cs.LG · Shota Saito, Yuta Nakahara, Kohei Horinouchi, Naoki Ichijo · 2026-05-12

The paper introduces a probabilistic generative model for grayscale image denoising, combining quadtree region-partitioning with a mixture autoregressive model. The framework reformulates MAP-estimation-based denoising as variational lower bound maximization, solved via alternating variational Bayes and gradient methods. Analytical computation of gradient updates eliminates numerical approximation. Experimental validation confirms noise reduction efficacy, with identified improvement pathways.

quadtreeautoregressivevariational bayesmap-estimationgradient methods

FedOUI: OUI-Guided Client Weighting for Federated Aggregation

arXiv cs.LG · Alberto Fernández-Hernández, Jose I. Mestre, Cristian Pérez-Corral, Manuel F. Dolz · 2026-05-12

FedOUI proposes a novel federated aggregation method using the Overfitting-Underfitting Indicator (OUI), an activation-based metric that captures input-space organization without requiring labels. Clients transmit local updates with OUI values computed on a fixed probe batch, enabling the server to reweight atypical clients via smooth distribution-based weighting. Evaluations on non-IID CIFAR-10 show FedOUI outperforms FedAvg, FedProx, and gradient-alignment baselines under strong heterogeneity, demonstrating activation structure's utility beyond traditional size/gradient criteria.

federated learningclient weightingactivation metricnon-iid dataaggregation rule

OUI as a Structural Observable: Towards an Activation-Centric View of Neural Network Training

arXiv cs.LG · Alberto Fernández-Hernández, Jose I. Mestre, Cristian Pérez-Corral, Manuel F. Dolz · 2026-05-12

The paper introduces the Overfitting--Underfitting Indicator (OUI) as a structural observable for analyzing neural network training dynamics from an activation-centric perspective. OUI serves as an early, label-free signal derived from activation patterns, enabling the identification of poor or promising training regimes prior to convergence. Empirical results demonstrate its utility across domains: in supervised learning, it anticipates weight decay regimes; in reinforcement learning, it discriminates learning-rate regimes in PPO actor--critic; and in online control, it facilitates layer-wise weight decay adaptation. These findings, combined with evidence of early activation pattern stabilization, suggest OUI as a foundational tool for developing an activation-centric theory of training dynamics.

overfitting-underfitting indicatoractivation patternstraining dynamicsppo actor-criticweight decay

A Composite Activation Function for Learning Stable Binary Representations

arXiv cs.LG · Seokhun Park, Choeun Kim, Kwanho Lee, Sehyun Park · 2026-05-12

We propose Heavy Tailed Activation Function (HTAF), a smooth composite sigmoid-tanh approximation to the Heaviside function that enables stable gradient-based optimization for networks with binary activations. HTAF maintains large gradient mass near zero inputs while exhibiting slower gradient decay in tail regions, theoretically supporting stable training. Experiments demonstrate that HTAF enables stable training of Spiking Neural Networks, Binary Neural Networks, and Deep Heaviside Networks. Additionally, we introduce Implicit Concept Bottleneck Models (ICBMs), leveraging HTAF for discrete feature representations in image models, achieving comparable or superior prediction performance to standard models across architectures and datasets.

heavy tailed activation functionheaviside functiongradient-based optimizationimplicit concept bottleneck modelsbinary neural networks

A Controlled Counterexample to Strong Proxy-Based Explanations of OOD Performance: in a Fixed Pretraining-and-Probing Setup

arXiv cs.LG · Hongmin Li · 2026-05-12

The paper demonstrates a controlled counterexample where task-agnostic structure proxies fail to align with out-of-distribution (OOD) probe accuracy rankings, challenging strong proxy-based explanations of OOD performance. Using a fixed pretraining-and-probing setup motivated by computationally bounded notions like epiplexity, the authors construct a scenario where a formal structure quantity, its operational proxy, and task-relevant structure separate. In a synthetic sequence-model experiment, OOD accuracy rankings reversed proxy rankings in two of three seeds, supported by auxiliary diagnostics and ablations. This identifies a boundary on proxy-based explanations, showing proxies for total learned structure can fail to track task-relevant structure driving OOD performance.

out-of-distributionpretrainingprobe accuracyepiplexitytask-agnostic

VNDUQE: Information-Theoretic Novelty Detection using Deep Variational Information Bottleneck

arXiv cs.LG · Aryan Gondkar, Hayder Radha, Yiming Deng · 2026-05-12

The paper proposes VNDUQE, a novelty detection method using the Deep Variational Information Bottleneck (VIB) to constrain information flow in learned representations. The approach evaluates out-of-distribution (OOD) detection via KL divergence and prediction entropy, showing complementary strengths: KL divergence achieves 100% AUROC on far-OOD samples (e.g., noise), while prediction entropy attains 94.7% AUROC on near-OOD cases (novel digit classes). Combined, they yield 95.3% average AUROC, a 32 percentage point improvement over maximum softmax probability. VIB compression (β=10⁻³) reduces Expected Calibration Error by 38%, demonstrating improved uncertainty calibration for active learning applications.

novelty detectionvariational information bottleneckout-of-distributionkl divergenceuncertainty quantification

Fast MoE Inference via Predictive Prefetching and Expert Replication

arXiv cs.LG · Ankit Jyothish, Ali Jannesari, Aishwarya Sarkar, Joseph Zuber · 2026-05-12

A dynamic expert replication strategy is proposed to accelerate Mixture of Experts (MoE) inference by predicting overloaded experts and replicating them for concurrent batch processing. This approach addresses GPU underutilization, load imbalance, and latency issues caused by sparse expert activation in large-scale MoE models. The method enables near-complete GPU utilization (~100%) and achieves up to 3x inference speed improvement while maintaining 90-95% of baseline performance, as demonstrated on Switch-base-128 and Switch-base-256 architectures.

mixture of expertsgpu utilizationexpert replicationinference accelerationload balancing

Generative climate downscaling enables high-resolution compound risk assessment by preserving multivariate dependencies

arXiv cs.LG · Takuro Kutsuna, Noriko N. Ishizaki, Norihiro Oyama, Hiroaki Yoshida · 2026-05-12

A diffusion-based multivariate generative framework with bias correction significantly improves high-resolution climate downscaling by preserving inter-variable dependencies critical for compound risk assessment. The method addresses resolution gaps up to 50× while maintaining correlations among five meteorological variables, reducing correlation errors by over 4× compared to existing baselines. Applied to Japan, it enhances both univariate and spatial accuracy, enabling more reliable detection of severe drought and other compound hazards.

generative downscalingdiffusion modelscompound hazardsbias correctionmultivariate dependencies

Multi-Narrow Transformation as a Single-Model Ensemble: Boundary Conditions, Mechanisms, and Failure Modes

arXiv cs.LG · Tatsuhito Hasegawa, Taisei Tanaka · 2026-05-12

The paper investigates the trade-offs between single-wide and multi-narrow (MN) architectures in single-model ensembles (SMEs) under matched parameter budgets. Through systematic experiments with CNNs across varied data regimes, architectures, and datasets, it demonstrates that MN transformation excels in low-data settings by learning diverse, non-redundant path-wise features, while single-wide configurations dominate in data-rich scenarios due to imbalanced training. The findings provide empirical guidelines for capacity allocation between width and member multiplicity in resource-constrained settings.

single-model ensemblesmulti-narrow transformationparameter budgetfeature diversitylow-data regimes

FERMI: Exploiting Relations for Membership Inference Against Tabular Diffusion Models

arXiv cs.LG · Abtin Mahyar, Masoumeh Shafieinejad, Yuhan Liu, Xi He · 2026-05-12

FERMI introduces a novel membership inference attack tailored for tabular diffusion models in multi-relational settings, addressing the limitation of existing single-table approaches. The method leverages auxiliary relational information during training to enrich single-table features with relational membership signals, while requiring only target table attributes at inference time. Evaluated across three tabular diffusion architectures and three real-world relational datasets, FERMI demonstrates significant improvements in attack performance, achieving up to 53% higher true positive rate at 0.1 false positive rate (TPR@0.1FPR) in white-box settings and 22% in black-box settings compared to single-table baselines.

membership inferencetabular diffusion modelsmulti-relational datafeature-mappingprivacy risk

OverNaN: NaN-Aware Oversampling for Imbalanced Learning with Meaningful Missingness

arXiv cs.LG · Amanda S Barnard · 2026-05-12

OverNaN introduces a NaN-aware oversampling framework for imbalanced learning that preserves meaningful missingness in datasets. The method extends synthetic oversampling techniques to operate directly on incomplete feature vectors, allowing missing values to be preserved, propagated, or selectively interpolated based on defined strategies. By treating missingness as part of the feature space, OverNaN avoids introducing artificial certainty while addressing class imbalance. The framework is demonstrated to retain meaningful missingness during oversampling, making it suitable for small, incomplete, and imbalanced datasets in scientific and engineering domains where missingness is informative and unavoidable.

nan-aware oversamplingimbalanced learningmissing-data handlingsynthetic oversamplingfeature space

EqOD: Symmetry-Informed Stability Selection for PDE Identification

arXiv cs.LG · Gnankan Landry Regis N'guessan, Bum Jun Kim · 2026-05-12

Equivariant Operator Discovery (EqOD) introduces a fully automatic method for partial differential equation (PDE) identification by combining symmetry-informed library reduction and randomized LASSO stability selection. When Galilean invariance is detected via a weak-form structural test, EqOD reduces the candidate library using a proven Galilean exclusion result; otherwise, it applies stability selection guided by false-positive bounds. EqOD achieves F1 = 1.000 ± 0.000 on the Heat equation at 20% noise, outperforming WF-LASSO (0.475 ± 0.181), PySINDy 2.0 (0.000), and WSINDy (0.789). It wins 7 of 32 test cases under strict criteria and outperforms PySINDy 2.0.0 in 23 of 32 cases. External validation yields F1 = 1.000 on all 5 clean benchmarks.

equivariant operator discoverygalilean invariancestability selectionweak-form structural testpartial differential equations

📰 Industry Media (7)

AI chatbots are giving out people’s real phone numbers

MIT Tech Review — AI · Eileen Guo · 2026-05-13

Generative AI chatbots, including Google Gemini, OpenAI ChatGPT, and Anthropic Claude, are increasingly exposing personal phone numbers and other PII due to training on web-scraped datasets containing sensitive information. Instances include incorrect customer service numbers and personal contacts surfaced during casual queries. Experts attribute this to LLMs memorizing and reproducing PII from training data, exacerbated by diminishing public datasets and reliance on data brokers. Guardrails like content filters and privacy instructions often fail, as evidenced by cases where chatbots bypassed safeguards to reveal addresses and phone numbers. Current privacy laws inadequately address this issue, and AI companies lack clear mechanisms for PII removal.

piillmsguardrailsdata brokersmemorization

Fastino Labs Open-Sources GLiGuard: A 300M Parameter Safety Moderation Model That Matches or Exceeds Accuracy of Models 23–90x Its Size

MarkTechPost · Asif Razzaq · 2026-05-13

Fastino Labs introduces GLiGuard, a 300M parameter encoder-based safety moderation model that reframes moderation as a text classification task rather than autoregressive generation. GLiGuard evaluates four safety tasks concurrently in a single forward pass: safety classification, jailbreak strategy detection, harm category detection, and refusal detection. Trained on 87,000 human-annotated examples and synthetic data, GLiGuard achieves 87.7 F1 on prompt classification and 82.7 F1 on response classification, matching or exceeding models 23–90× its size while reducing latency by up to 16.6× (26ms vs. 426ms) on an NVIDIA A100 GPU.

encoder-basedautoregressive generationtext classificationharm category detectionmacro-averaged f1

Mira Murati’s Thinking Machines Lab Introduces Interaction Models: A Native Multimodal Architecture for Real-Time Human-AI Collaboration

MarkTechPost · Asif Razzaq · 2026-05-13

Thinking Machines Lab introduces Interaction Models, a native multimodal architecture for real-time human-AI collaboration, addressing limitations of turn-based systems. The 276B parameter Mixture-of-Experts model employs micro-turn design (200ms chunks) with encoder-free early fusion, enabling simultaneous audio/video/text processing via co-trained lightweight embeddings. Benchmarks show superior performance on interaction metrics (77.8 FD-bench v1.5, 0.40s latency) and novel tasks like TimeSpeak (64.7 accuracy) where existing models score near-zero.

interaction modelsmicro-turn designencoder-free early fusionmixture-of-expertsreal-time multimodal

Google DeepMind Introduces an AI-Enabled Mouse Pointer Powered by Gemini That Captures Visual and Semantic Context Around the Cursor

MarkTechPost · Michal Sutter · 2026-05-13

Google DeepMind introduces an AI-enabled mouse pointer powered by Gemini, capturing visual and semantic context around the cursor to streamline user interactions. The system operates on four principles: maintaining workflow continuity, leveraging visual-semantic context, interpreting deictic language, and converting pixels into actionable entities. It dynamically processes cursor hover state and UI content as structured inputs, enabling intuitive interactions without manual prompting. Experimental demos for image editing and map search are available in Google AI Studio, with integrations rolling out in Chrome and planned for Googlebook laptops. The approach shifts AI assistance from isolated windows to cursor-level functionality across applications.

geminideictic languagestructured inputsmultimodal modelsentity extraction

Build a Hybrid-Memory Autonomous Agent with Modular Architecture and Tool Dispatch Using OpenAI

MarkTechPost · Sana Hassan · 2026-05-12

The article presents a modular architecture for building hybrid-memory autonomous agents using OpenAI's API, combining semantic vector search (text-embedding-3-small) and keyword retrieval (BM25) via Reciprocal Rank Fusion. The system implements abstract interfaces for MemoryBackend, LLMProvider, and Tool, with concrete implementations including a HybridMemory class and OpenAIProvider (gpt-4o-mini). Four tools (memory_store, memory_search, calculator, web_search) demonstrate tool dispatch, while an AgentPersona class enforces consistent behavior through compiled system prompts. The agent achieves multi-turn tool-augmented reasoning with an 8-round maximum dispatch loop.

hybrid-memoryreciprocal rank fusiontool dispatchmodular architectureautonomous agent

Meet AntAngelMed: A 103B-Parameter Open-Source Medical Language Model Built on a 1/32 Activation-Ratio MoE Architecture

MarkTechPost · Asif Razzaq · 2026-05-12

Researchers introduce AntAngelMed, a 103B-parameter open-source medical LLM employing a 1/32 activation-ratio Mixture-of-Experts (MoE) architecture, activating only 6.1B parameters per inference. The model builds on Ling-flash-2.0 with architectural optimizations including sigmoid routing, QK-Norm, and Partial-RoPE, achieving 7× efficiency over dense models. Three-stage training combines medical pre-training, mixed-domain SFT, and GRPO-based RL. Benchmarks show state-of-the-art performance on HealthBench (surpassing proprietary models), MedAIBench, and MedBench, with 128K context via YaRN and >200 tokens/s throughput on H20 hardware.

mixture-of-expertsqk-normpartial-ropegrpoyarn

Physical AI Conference Comes to San Jose as Robotics & Autonomous AI Go Mainstream

AI News · TechEx · 2026-05-13

The Physical AI Conference 2026 in San Jose focuses on scaling AI from digital to physical domains, emphasizing robotics, autonomous systems, and industrial automation. The event highlights enterprise-scale deployment strategies, infrastructure requirements, and real-world AI reliability. Sessions cover AI strategy, robotics, autonomous operations, and developer workflows, featuring insights from NVIDIA, Airbus, Qualcomm, and Hyundai. The conference aims to bridge the gap between AI experimentation and production, addressing challenges in scalability, infrastructure, and safety. Attendees will explore advancements in Physical AI, including sensing, reasoning, and acting in dynamic environments, marking a shift from software-based AI to embedded intelligent systems.

roboticsautonomous systemsindustrial automationai infrastructurephysical ai


Generated automatically at 2026-05-13 21:19 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.