Daily Digest — 2026-05-20

Tuesday, May 19, 2026 · 345 items · model: deepseek/deepseek-chat

345 items · 10 research labs, 325 arXiv papers, 10 industry media

🏛️ Research Labs (10)

Advancing content provenance for a safer, more transparent AI ecosystem

OpenAI News · 2026-05-19

OpenAI advances content provenance for AI-generated media through a multi-layered ecosystem approach, combining C2PA metadata standards, SynthID watermarking, and public verification tools. The method integrates C2PA-conformant metadata for detailed provenance context, durable SynthID watermarking resistant to transformations, and a verification tool to detect OpenAI-generated content. Results include enhanced resilience of provenance signals across platforms, enabling users to verify AI-generated images via integrated signals. Limitations persist, as no detection method is foolproof, and stripped metadata or watermarks may prevent definitive conclusions. This approach aims to foster interoperability and trust in the provenance ecosystem.

content provenance · c2pa metadata · synthid watermarking · public verification · ai-generated media

OlmoEarth v1.1: A more efficient family of models

Hugging Face Blog · 2026-05-19

OlmoEarth v1.1 introduces a more efficient family of transformer-based models for remote sensing tasks, reducing compute costs by up to 3x while maintaining performance. The efficiency gains are achieved by optimizing token sequence lengths, particularly by merging Sentinel-2 bands into single tokens instead of using separate tokens per resolution. This approach reduces token counts multiplicatively, lowering MACs (multiply-accumulate operations) during inference. Pre-training modifications were necessary to prevent performance drops, such as a 10 percentage point decrease on the m-eurosat kNN benchmark. The models are available in Base, Tiny, and Nano sizes, enabling planet-scale map refreshes at lower computational expense.

transformer-based · token sequence · multiply-accumulate operations · sentinel-2 · pre-training

Introducing the Ettin Reranker Family

Hugging Face Blog · 2026-05-19

Hugging Face introduces the Ettin Reranker Family, six state-of-the-art Sentence Transformers CrossEncoder models (17M to 1B parameters) for document retrieval. These models employ ModernBERT encoders with 8K token context windows, trained via pointwise MSE distillation on mixedbread-ai/mxbai-rerank-large-v2 scores. Evaluated on MTEB(eng, v2) Retrieval and NanoBEIR benchmarks, the 17M model outperforms MiniLM-L12-v2 (+0.051 NDCG@10) while the 1B model nearly matches its 1.54B teacher (0.6114 vs 0.6115 NDCG@10). The architecture features RoPE positional encodings, GeGLU activations, and Flash Attention 2 optimization for 1.7x-8.3x speedups.

crossencoder · modernbert · rope · geglu · flash-attention-2
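The pointwise MSE distillation mentioned in the Ettin summary above can be illustrated with a minimal sketch. It assumes the student and teacher each emit one relevance score per (query, document) pair and that the loss is the plain mean of squared differences; the function name and the example scores are hypothetical and are not taken from the Ettin release.

```python
def pointwise_mse(student_scores, teacher_scores):
    """Mean squared error with one independent term per (query, document) pair."""
    if len(student_scores) != len(teacher_scores):
        raise ValueError("score lists must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_scores, teacher_scores)]
    return sum(squared) / len(squared)


# Hypothetical teacher scores for three (query, document) pairs.
teacher = [0.92, 0.10, 0.55]
close_student = [0.90, 0.15, 0.50]  # tracks the teacher closely
far_student = [0.50, 0.50, 0.50]    # ignores the input entirely

loss_close = pointwise_mse(close_student, teacher)
loss_far = pointwise_mse(far_student, teacher)
```

"Pointwise" here means each pair contributes its own term, so no ranking or comparison between documents enters the loss.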

I/O 2026

Google AI Blog · 2026-05-19

Google I/O 2026 announced Gemini Omni and Gemini 3.5 Flash, advancing multimodal AI capabilities with world understanding and editing. The event highlighted agentic AI development through Google Antigravity, enabling action-oriented applications like Information agents in Search and Universal Cart. Integration spans products from Google Pics to Ask YouTube, emphasizing scalable deployment across form factors.

gemini omni · agentic ai · multimodality · google antigravity · universal cart

How AI Mode is changing the way people search in the U.S.

Google AI Blog · Shivani Mohan · 2026-05-19

Google's AI Mode, launched in the U.S. one year ago, has significantly transformed search behavior by integrating conversational AI capabilities. The system now serves over one billion monthly active users globally, with query volume doubling quarterly. Analysis reveals three key trends: multimodal search adoption, with 16.7% of U.S. queries utilizing voice or image inputs; increased query complexity, as AI Mode queries average triple the length of traditional searches; and task-oriented usage growth, particularly for planning (+80% in 6 months) and decision-making queries (+30% since launch). These findings demonstrate AI Mode's impact on expanding the scope of searchable content and user interaction patterns.

conversational ai · multimodal search · query complexity · task-oriented · user interaction

New ways to create and get things done in Google Workspace

Google AI Blog · Yulie Kwon Kim · 2026-05-19

Google Workspace introduces four AI-powered features to enhance productivity and creativity. Voice capabilities in Gmail, Docs, and Keep enable conversational interactions for tasks like inbox search (Gmail Live), document drafting (Docs Live), and note organization. Google Pics, built on the Nano Banana model, offers precise image editing with object segmentation and text manipulation. AI Inbox prioritizes emails and surfaces contextual actions, while Gemini Spark acts as a personal AI agent for task automation. These features roll out progressively to Google AI subscribers and Workspace business customers, with some in limited preview.

nano banana model · object segmentation · voice capabilities · ai inbox · gemini spark

I/O 2026: Welcome to the agentic Gemini era

Google AI Blog · Sundar Pichai · 2026-05-19

Google announced significant advancements in AI infrastructure and model capabilities at I/O 2026, highlighting a 7x increase in monthly token processing to 3.2 quadrillion. The company introduced Gemini Omni, a multimodal model generating video outputs, and Gemini 3.5 Flash, a cost-efficient frontier model with 4x faster token output than competitors. Custom TPU 8t/8i chips enable distributed training across 1M+ TPUs and low-latency inference. SynthID watermarking expanded to 100B+ media items with new industry partnerships. Agentic workflows accelerated via Antigravity 2.0, processing 3T+ daily tokens internally.

gemini omni · tpu 8t · synthid · agentic coding · multimodal generation

Gemini 3.5: frontier intelligence with action

Google AI Blog · Koray Kavukcuoglu, Jeff Dean, Oriol Vinyals, Noam Shazeer · 2026-05-19

Google introduces Gemini 3.5 Flash, a state-of-the-art model combining frontier intelligence with agentic capabilities, optimized for coding and long-horizon tasks. The model achieves 76.2% on Terminal-Bench 2.1, 1656 Elo on GDPval-AA, and 83.6% on MCP Atlas, while excelling in multimodal understanding with 84.2% on CharXiv Reasoning. It operates 4x faster than comparable frontier models, enabling rapid execution of complex workflows via the Antigravity harness. Gemini 3.5 Flash integrates with Google AI Studio, Android Studio, and enterprise platforms, driving real-world applications such as financial document processing and codebase maintenance. Enhanced safety measures ensure reduced harmful content generation.

gemini 3.5 flash · agentic tasks · antigravity harness · multimodal understanding · frontier intelligence

A new era for AI Search

Google AI Blog · Elizabeth Reid · 2026-05-19

Google Search introduces Gemini 3.5 Flash as the default model in AI Mode globally, enhancing query processing with sustained frontier performance for agents and coding. The intelligent Search box, upgraded for the first time in 25 years, dynamically expands to accommodate multimodal inputs (text, images, videos) and provides AI-powered suggestions. Search agents, launching for Google AI Pro & Ultra subscribers, enable continuous monitoring of web data for personalized updates. Agentic coding capabilities, powered by Google Antigravity, allow real-time generation of custom UIs and dashboards. Personal Intelligence expansion integrates user context from apps like Gmail and Google Photos across 200 countries and 98 languages.

gemini 3.5 flash · multimodal inputs · search agents · agentic coding · personal intelligence

Everything new in our Google AI subscriptions, fresh from I/O 2026

Google AI Blog · Shimrit Ben-Yair · 2026-05-19

Google announced updates to its AI subscription tiers at I/O 2026, introducing a $100/month AI Ultra plan targeting developers and advanced creators. The plan includes 5X higher usage limits in Gemini and Google Antigravity, Gemini 3.5 Flash integration for rapid testing, 20TB cloud storage, and YouTube Premium. Existing AI Ultra plans were reduced from $250 to $200/month. New features include Gemini Spark, a 24/7 AI agent for task automation, and Project Genie, a world-creation tool leveraging Street View. Subscribers gain access to Gemini Omni for multimodal content creation and Gemini 3.5 Flash for complex coding tasks. Compute-based usage limits replace daily prompt caps, refreshing every five hours.

gemini spark · project genie · gemini omni · compute-based limits · antigravity

📜 arXiv Papers (325)

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

arXiv cs.AI · Yuxiang Huang, Nuno M. T. Gonçalves, Federico Alvetreti, Lei Li · 2026-05-18

DashAttention introduces a differentiable and adaptive sparse hierarchical attention mechanism for efficient long-context modeling in large language models (LLMs). It employs the α-entmax transformation to dynamically select a variable number of key-value (KV) blocks per query, maintaining full differentiability across sparse and dense stages. Compared to hierarchical methods like NSA and InfLLMv2, DashAttention demonstrates non-dispersive properties and achieves comparable accuracy to full attention with 75% sparsity. Its GPU-aware Triton implementation outperforms FlashAttention-3 in inference speed, offering a cost-effective solution for long-context tasks.

hierarchical attention · α-entmax · key-value blocks · sparsity · triton
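To see why an α-entmax transformation can select a variable number of KV blocks per query, as the DashAttention summary above describes, consider the special case α = 2, known as sparsemax, which has a simple closed form. The sketch below is an illustration under that assumption only; it is not the paper's Triton kernel, and the block scores are made up.

```python
def sparsemax(scores):
    """Project scores onto the probability simplex (alpha-entmax with alpha = 2)."""
    ordered = sorted(scores, reverse=True)
    running_sum = 0.0
    threshold = 0.0
    for k, value in enumerate(ordered, start=1):
        running_sum += value
        # The support keeps growing for as long as this condition holds.
        if 1.0 + k * value > running_sum:
            threshold = (running_sum - 1.0) / k
    return [max(s - threshold, 0.0) for s in scores]


def selected_blocks(block_scores):
    """Indices of KV blocks that receive non-zero weight for one query."""
    weights = sparsemax(block_scores)
    return [i for i, w in enumerate(weights) if w > 0.0]


# Hypothetical block scores for two queries.
peaked_query = [2.0, 1.0, 0.1, -0.5]
flat_query = [1.0, 0.9, 0.8, -0.5]
```

Unlike softmax, the output contains exact zeros, so a peaked score profile keeps a single block while a flatter one keeps several, with no fixed top-k budget.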

Code as Agent Harness

arXiv cs.AI · Xuying Ning, Katherine Tieu, Dongqi Fu, Tianxin Wei · 2026-05-18

The paper introduces 'code as agent harness' as a unifying framework for agentic AI systems, positioning code as the operational substrate for reasoning, action, and environment modeling. It systematically examines three interconnected layers: harness interfaces connecting agents to their environments, harness mechanisms enabling planning, memory, tool use, and feedback-driven optimization, and scaling to multi-agent systems via shared code artifacts. The survey highlights applications in coding assistants, GUI/OS automation, embodied agents, and enterprise workflows, while identifying open challenges in verification, regression-free improvement, multi-agent state consistency, and safety-critical human oversight. This framework provides a roadmap for executable, verifiable, and stateful AI agent systems.

agent harness · multi-agent systems · feedback-driven optimization · executable ai · verification

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

arXiv cs.AI · Yining Hong, Jiageng Liu, Han Yin, Manling Li · 2026-05-18

ESI-Bench introduces a benchmark for embodied spatial intelligence, encompassing 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke's core knowledge systems. The benchmark emphasizes active exploration through perception-action loops, requiring agents to deploy and sequence perception, locomotion, and manipulation abilities to accumulate task-relevant evidence. Experiments on state-of-the-art MLLMs demonstrate that active exploration outperforms passive methods, with agents discovering emergent spatial strategies without explicit instructions. Failures primarily arise from action blindness rather than weak perception, and imperfect 3D representations harm performance more than 2D baselines. Human studies reveal a metacognitive gap in models, which commit prematurely with high confidence regardless of evidence quality.

embodied spatial intelligence · perception-action loop · action blindness · 3d representation · metacognitive gap

Actionable World Representation

arXiv cs.AI · Kunqi Xu, Jitao Li, Jianglong Ye, Tianshu Tang · 2026-05-18

The paper introduces WorldString, a neural architecture for modeling actionable object representations in physical world models. The method learns state manifolds of real-world objects directly from point clouds or RGB-D video streams, providing a unified digital twin framework. The fully differentiable design enables future integration with policy learning and neural dynamics, positioning it as a foundational component for physical world modeling.

world models · actionable representation · state manifold · digital twin · neural dynamics

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

arXiv cs.AI · Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin · 2026-05-18

Vision-OPD introduces a regional-to-global self-distillation framework to enhance fine-grained visual understanding in Multimodal Large Language Models (MLLMs). The method employs two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts. This approach enables the model to internalize visual zooming benefits without external teacher models or ground-truth labels. Experiments demonstrate that Vision-OPD models achieve competitive or superior performance against larger open-source, closed-source, and agentic models on multiple fine-grained visual understanding benchmarks.

multimodal large language models · self-distillation · on-policy rollouts · token-level divergence · fine-grained visual understanding

What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models

arXiv cs.AI · Payal Chandak, Victoria Alkin, David Wu, Maya Dagan · 2026-05-18

The study introduces a framework for auditing value pluralism in medical AI, comprising a benchmark of clinician-verified ethical dilemmas and an attribution method to extract value priorities from model decisions. Analyzing frontier language models, the authors find that while models exhibit physician-level value heterogeneity and discuss competing values (Overton pluralism), their decisions are near-deterministic, lacking the distributional pluralism of human physicians. Most model priorities align with inter-physician variation, but some systematically underweight patient autonomy, risking a deployment monoculture that could erode clinical pluralism.

value pluralism · ethical dilemmas · overton pluralism · distributional pluralism · clinical ethics

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

arXiv cs.AI · Matthew L. Smith, Jonathan P. Shock, Samuel T. Segun, Iyiola E. Olatunji · 2026-05-18

The study establishes a scaling law linking factual recall in large language models to both model size and training-data composition, explaining 60% of variance across models and 74-94% within individual families. Using an automated reference verification system, the authors evaluated 38 models on over 8,900 scholarly references, finding recall quality follows a sigmoid function of the log-linear combination of model parameter count and topic representation in training data. Results align with a superposition-inspired account where recall is gated by a signal-to-noise ratio, with signal strength scaling by concept frequency and noise floor by model capacity.

scaling laws · factual recall · signal-to-noise ratio · superposition-inspired · reference verification
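The functional form described in the factual-recall summary above, a sigmoid applied to a log-linear combination of parameter count and topic frequency, can be written out as a short sketch. The coefficient values, the base-10 logarithm, and the parameter names below are placeholders chosen for illustration; the summary does not report the fitted values.

```python
import math


def predicted_recall(n_params, topic_frequency, w_size, w_topic, bias):
    """Sigmoid of a log-linear combination of model size and topic frequency."""
    z = w_size * math.log10(n_params) + w_topic * math.log10(topic_frequency) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Placeholder coefficients, not fitted values from the paper.
coeffs = dict(w_size=1.0, w_topic=0.8, bias=-6.0)

small_rare = predicted_recall(7e9, 1e-4, **coeffs)
large_rare = predicted_recall(70e9, 1e-4, **coeffs)
large_common = predicted_recall(70e9, 1e-2, **coeffs)
```

With positive weights, predicted recall rises with either a larger model or a better-represented topic, and saturates toward 0 or 1 at the extremes.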

DexHoldem: Playing Texas Hold'em with Dexterous Embodied System

arXiv cs.AI · Feng Chen, Tianzhe Chu, Li Sun, Pei Zhou · 2026-05-18

DexHoldem introduces a real-world system-level benchmark for dexterous manipulation using Texas Hold'em poker with a ShadowHand, evaluating embodied agents on tabletop execution, agentic perception, and decision routing. The benchmark includes 1,470 teleoperated demonstrations across 14 manipulation primitives, a standardized physical policy benchmark, and an agentic perception benchmark. Results show task completion rates of 61.2% for primitive execution and scene-preserving success rates of 47.5%. Agentic perception achieves 34.3% strict problem-level accuracy with Opus 4.7 and 66.8% average field-wise accuracy with GPT 5.5, highlighting gaps between visual sub-capabilities and complete state recovery. Case studies demonstrate error accumulation in closed-loop deployment.

dexterous manipulation · agentic perception · teleoperated demonstrations · closed-loop deployment · scene-preserving success

Semantic Generative Tuning for Unified Multimodal Models

arXiv cs.AI · Songsong Yu, Yuxin Chen, Ying Shan, Yanwei Li · 2026-05-18

The paper introduces Semantic Generative Tuning (SGT), a novel paradigm for aligning visual understanding and generation in unified multimodal models (UMMs). SGT employs image segmentation as a generative proxy to bridge representation gaps, leveraging structural semantics to enhance both perception and layout fidelity. Experiments demonstrate improved feature linear separability and optimized visual-textual attention, with consistent gains in multimodal comprehension and generative fidelity across benchmarks.

unified multimodal models · generative post-training · image segmentation · feature linear separability · visual-textual attention

Distilling Tabular Foundation Models for Structured Health Data

arXiv cs.AI · Aditya Tanna, Nassim Bouarour, Mohamed Bouadi, Vinay Kumar Sankarapu · 2026-05-18

This work demonstrates that knowledge distillation can effectively transfer predictive performance from tabular foundation models (TFMs) to lightweight tabular models in healthcare applications. The proposed leakage-aware distillation method, using stratified out-of-fold teacher labeling, addresses context leakage issues inherent in TFMs. Evaluated across 19 healthcare datasets, 6 TFM teachers, and 4 student families, distilled students retain ≥90% of teacher AUC while achieving 26× faster CPU inference, with maintained calibration and fairness. Multi-teacher ensembles did not consistently outperform single teachers.

tabular foundation models · knowledge distillation · stratified out-of-fold · auc · cpu inference
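The stratified out-of-fold teacher labeling named in the distillation summary above can be sketched as follows. The idea shown is that each row's soft label comes from a teacher whose context excludes that row, so the teacher cannot simply read back the true label. The fold assignment scheme, the function names, and the `toy_teacher` (a k-nearest-neighbour stand-in for a tabular foundation model) are all assumptions made for this illustration.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Assign each row to a fold, round-robin within each class."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of


def out_of_fold_teacher_labels(features, labels, teacher, n_folds=3):
    """Soft-label every row using a teacher that never sees that row as context."""
    fold_of = stratified_folds(labels, n_folds)
    soft = [None] * len(labels)
    for fold in range(n_folds):
        context = [i for i in range(len(labels)) if fold_of[i] != fold]
        held_out = [i for i in range(len(labels)) if fold_of[i] == fold]
        preds = teacher(
            [features[i] for i in context],
            [labels[i] for i in context],
            [features[i] for i in held_out],
        )
        for i, p in zip(held_out, preds):
            soft[i] = p
    return soft


def toy_teacher(ctx_x, ctx_y, query_x, k=3):
    """Hypothetical stand-in teacher: positive fraction among k nearest context rows."""
    out = []
    for q in query_x:
        nearest = sorted(range(len(ctx_x)), key=lambda j: abs(ctx_x[j] - q))[:k]
        out.append(sum(ctx_y[j] for j in nearest) / len(nearest))
    return out


features = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
soft_labels = out_of_fold_teacher_labels(features, labels, toy_teacher)
```

A student model would then be trained on `soft_labels` instead of, or alongside, the hard labels.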

PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications

arXiv cs.AI · Stephen Mell, David Mell, Konstantinos Kallas, Steve Zdancewic · 2026-05-18

PopPy introduces a system for opportunistically exploiting parallelism in Python-based compound AI applications, addressing end-to-end latency bottlenecks dominated by external ML model invocations. The system combines an ahead-of-time compiler with a runtime to handle language complexity, dynamic dispatch, and variable mutation, requiring minimal developer input. PopPy supports an expressive Python fragment and preserves sequential program semantics. Evaluation on real-world compound AI applications demonstrates up to 6.4× speedups in execution time compared to standard Python execution.

parallelism · compound ai · python · dynamic dispatch · ahead-of-time compiler

Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap

arXiv cs.AI · Aditya Tanna, Yash Desai, Pratinav Seth, Mohamed Bouadi · 2026-05-18

This study investigates ensembling strategies for tabular foundation models (TFMs), revealing limited diversity and calibration issues. Six modern TFMs exhibit near-redundancy with a mean pairwise Q-statistic of 0.961, constraining ensemble performance. Six ensemble strategies were benchmarked on 153 OpenML classification tasks, with two-level cascade stacking achieving a marginal accuracy gain of +0.18% over the best single TFM at 253× computational cost. Friedman and Nemenyi analysis grouped three ensembles and the best base TFM equivalently, while logistic-regression meta-learner stacking improved accuracy but degraded calibration. Greedy selection is recommended as a practical default due to its balance of performance and efficiency.

tabular foundation models · ensembling · q-statistic · calibration · meta-learner
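The pairwise Q-statistic cited in the ensembling summary above is commonly computed in the form of Yule's Q over per-sample correctness of two classifiers: values near 1 mean the two models succeed and fail on the same samples. The sketch below assumes that standard definition; the helper names and the choice to skip undefined pairs are illustrative.

```python
from itertools import combinations


def q_statistic(correct_a, correct_b):
    """Yule's Q for two classifiers given per-sample correctness flags."""
    both = neither = only_a = only_b = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            both += 1
        elif a:
            only_a += 1
        elif b:
            only_b += 1
        else:
            neither += 1
    agree = both * neither
    disagree = only_a * only_b
    if agree + disagree == 0:
        return None  # undefined for this pair
    return (agree - disagree) / (agree + disagree)


def mean_pairwise_q(correctness_by_model):
    """Average Q over all model pairs, skipping undefined pairs."""
    values = [q_statistic(a, b) for a, b in combinations(correctness_by_model, 2)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None


# Toy correctness vectors (1 = correct) for three hypothetical models.
model_a = [1, 1, 0, 0]
model_b = [1, 1, 0, 0]
model_c = [1, 0, 1, 0]
```

A mean near 1, as with the reported 0.961, indicates that the base models make largely the same errors, which limits what any ensemble can gain.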

SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents

arXiv cs.AI · Yifan Zhou, Zhentao Zhang, Ziming Cheng, Shuo Zhang · 2026-05-18

The paper introduces SkillGenBench, a benchmark for evaluating skill generation pipelines in LLM agents, focusing on the isolated problem of generating correct, reusable, and executable skills. The benchmark evaluates two generation regimes (task-conditioned and task-agnostic) and two procedural sources (repository-grounded and document-grounded), using standardized execution-based checks and auxiliary diagnostics. Experiments reveal performance variation across methods, the challenge of reusable skill distillation, and distinct failure modes in repository versus document-based skill generation.

skill generation · llm agents · benchmark · task-conditioned · repository-grounded

Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches

arXiv cs.AI · Tinghan Ye, Arnaud Deza, Ved Mohan, El Mehdi Er Raqabi · 2026-05-18

The paper introduces an LLM-guided framework for dynamic re-optimization of operations research models, enabling end users to adapt deployed optimization systems via natural-language interaction. The framework employs an LLM as an OR expert, translating prompts into structured model updates, selecting re-optimization techniques from a toolbox (primal information, valid inequalities, solver configurations), and solving instances. Evaluations on supply chain and exam scheduling case studies demonstrate improved computational efficiency (primal-based methods) and interpretability (patch-based updates) while maintaining solution quality.

re-optimization · primal information · valid inequalities · solver configurations · metaheuristics

Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents

arXiv cs.AI · Sanderson Oliveira de Macedo, Ronaldo Martins da Costa · 2026-05-18

The paper introduces Reversa, a reverse documentation engineering framework that converts legacy software into operational specifications for AI agents through a multi-agent pipeline. Specialized agents perform project mapping, module analysis, implicit rule extraction, architecture synthesis, specification writing, and claim review, emphasizing code-specification traceability, confidence marking, and gap preservation. In a case study migrating an ATM from COBOL to Go, Reversa generated 517 claims, 10 gaps, 53 Gherkin parity scenarios, and a reconstruction plan with 9/11 tasks completed, though final validation was not achieved. The work contributes to reverse engineering, LLM-based documentation, and software agent literature.

reverse engineering · multi-agent pipeline · operational specifications · traceability · gherkin scenarios

Learning Quantifiable Visual Explanations Without Ground-Truth

arXiv cs.AI · Amritpal Singh, Andrey Barsky, Mohamed Ali Souibgui, Ernest Valveny · 2026-05-18

A novel framework for evaluating Explainable AI (XAI) methods is proposed, addressing the challenge of lacking ground-truth by using continuous input perturbation to quantify explanation quality. The metric assesses both sufficiency and necessity of attributed information in model decision-making, aligning better with human intuition than existing metrics. A differentiable approximation of this metric is used to fine-tune an adapter module atop black-box models, generating causal explanations without performance degradation. Experimental results demonstrate superior performance of the proposed method across multiple quantifiable metrics compared to competing XAI techniques.

explainable ai · continuous input perturbation · sufficiency · necessity · adapter module

Lance: Unified Multimodal Modeling by Multi-Task Synergy

arXiv cs.AI · Fengyi Fu, Mengqi Huang, Shaojin Wu, Yunsheng Jiang · 2026-05-18

Lance introduces a lightweight unified multimodal model for image and video understanding, generation, and editing through multi-task synergy. The method employs a dual-stream mixture-of-experts architecture with unified context modeling and decoupled capability pathways, trained from scratch on interleaved multimodal sequences. Key innovations include modality-aware rotary positional encoding and staged multi-task training with adaptive data scheduling. Experiments show Lance outperforms open-source unified models in image/video generation while maintaining strong comprehension, achieving this without capacity scaling or text-dominant designs.

multimodal modeling · mixture-of-experts · rotary positional encoding · multi-task training · capability pathways

COOPO: Cyclic Offline-Online Policy Optimization Algorithm

arXiv cs.AI · Qisai Liu, Zhanhong Jiang, Joshua Russell Waite, Aditya Balu · 2026-05-18

The paper introduces COOPO (Cyclic Offline-Online Policy Optimization), a hybrid RL framework that cyclically alternates between constrained offline training and online fine-tuning to address distributional shift and catastrophic forgetting. Each cycle employs KL-regularized advantage-weighted updates for offline policy anchoring, followed by online policy optimization for stable exploration. Theoretical analysis shows improved sample efficiency and monotonic improvement under standard coverage assumptions. Empirical results on D4RL benchmarks demonstrate reduced online interactions and higher final returns compared to state-of-the-art hybrid methods, while maintaining robustness across various offline algorithms and online optimizers.

offline reinforcement learning · online fine-tuning · distributional shift · kl-regularization · advantage-weighted updates
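The KL-regularized advantage-weighted updates mentioned in the COOPO summary above resemble the familiar advantage-weighted regression pattern, in which each dataset action is weighted in proportion to exp(A / β). The sketch below shows that generic pattern only; whether COOPO uses exactly this weighting, the clipping value, and the normalization are assumptions, and all names are illustrative.

```python
import math


def advantage_weights(advantages, beta, max_weight=20.0):
    """Normalized weights proportional to exp(A / beta), clipped for stability."""
    log_cap = math.log(max_weight)
    raw = [math.exp(min(a / beta, log_cap)) for a in advantages]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_log_likelihood_loss(log_probs, advantages, beta):
    """Negative advantage-weighted log-likelihood of the dataset actions."""
    weights = advantage_weights(advantages, beta)
    return -sum(w * lp for w, lp in zip(weights, log_probs))


# Hypothetical advantages for three dataset actions.
advantages = [-1.0, 0.0, 2.0]
sharp = advantage_weights(advantages, beta=1.0)   # weak KL anchor, greedier
flat = advantage_weights(advantages, beta=10.0)   # strong KL anchor, closer to the data
```

A larger β keeps the weights nearly uniform, which anchors the learned policy to the offline data; a smaller β concentrates the update on high-advantage actions.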

Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning

arXiv cs.AI · Michael Aichmüller, Simon Ståhlberg, Martin Funkquist, Hector Geffner · 2026-05-18

The paper introduces two improvements to Iterated Width (IW) policies for generalized planning: a holistic encoding of search trees and Abstracted IW(1). The joint encoding represents IW(1)-reachable states by relational differences to the current state, enabling Relational GNNs (R-GNNs) to score transitions in a single forward pass. Abstracted IW(1) improves scaling via relational abstraction during novelty checks, replacing atom arguments with types. Evaluated on IPC 2023 and diverse domains, including those beyond $C_2$ logic, the approach achieves state-of-the-art performance, surpassing prior work like LAMA.

generalized planning · iterated width · relational gnns · novelty search · abstracted iw(1)
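The relational abstraction in the planning summary above, replacing atom arguments with types during novelty checks, can be illustrated with a small sketch. It assumes atoms are tuples of a predicate followed by object names and that a state counts as novel when it contains at least one atom not seen before in the search, which is the usual width-1 novelty test. The class and function names are hypothetical.

```python
def abstract_atom(atom, type_of):
    """Replace each object argument of an atom with its type."""
    predicate, *args = atom
    return (predicate, *(type_of[arg] for arg in args))


class NoveltyOneCheck:
    """Width-1 novelty test: a state is novel if some atom appears for the first time."""

    def __init__(self, type_of=None):
        self.type_of = type_of  # None means plain, non-abstracted atoms
        self.seen = set()

    def is_novel(self, state_atoms):
        if self.type_of is None:
            keys = set(state_atoms)
        else:
            keys = {abstract_atom(atom, self.type_of) for atom in state_atoms}
        unseen = keys - self.seen
        self.seen |= keys
        return bool(unseen)


# Toy blocks-world atoms; every object has type "block".
types = {"a": "block", "b": "block", "c": "block", "d": "block"}
state_1 = [("on", "a", "b")]
state_2 = [("on", "c", "d")]
```

Under abstraction, `state_2` maps to the same key `("on", "block", "block")` as `state_1`, so it is pruned; this shrinks the novelty table as the number of objects grows, at the cost of a coarser search.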

Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment

arXiv cs.AI · S. Bensalem, Y. Dong, M. Franzle, X. Huang · 2026-05-18

The paper proposes that safe deployment of LLM agents structurally requires a three-layer probabilistic assume-guarantee architecture, as single-layer safety enforcement is fundamentally insufficient. This necessity arises from three distinct safety dimensions—semantic intent and policy compliance, environmental validity, and dynamical feasibility—each relying on information available at different execution stages. The authors outline an architecture where each safety dimension is enforced by an independently certified layer, with probabilistic guarantees satisfying the next layer's assumptions. System-level safety bounds are derived using the chain rule of probability. Key open problems include bound estimation from non-i.i.d. traces, graceful degradation under deployment drift, and extension to multi-agent settings.

llm agents · probabilistic guarantee · assume-guarantee architecture · runtime assurance · deployment drift
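The chain-rule composition mentioned in the position-paper summary above can be made concrete with a small numeric sketch. It assumes each layer's certificate is a lower bound on the probability that the layer is safe given that all earlier layers were safe, so that P(S1 and S2 and S3) = P(S1) · P(S2 | S1) · P(S3 | S1, S2) is bounded below by the product of the three certificates. The numbers are invented for illustration.

```python
def system_safety_lower_bound(layer_guarantees):
    """Multiply per-layer conditional guarantees into a system-level lower bound."""
    bound = 1.0
    for p in layer_guarantees:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each guarantee must be a probability")
        bound *= p
    return bound


# Hypothetical certificates, each conditioned on the previous layers holding:
# semantic intent and policy, environmental validity, dynamical feasibility.
guarantees = [0.99, 0.98, 0.97]
bound = system_safety_lower_bound(guarantees)
```

The composed bound is weaker than any single certificate, which is one reason the per-layer guarantees need to be tight.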

GIM: Evaluating models via tasks that integrate multiple cognitive domains

arXiv cs.AI · Rohit Patel, Alexandre Rezende, Steven McClain · 2026-05-18

The Grounded Integration Measure (GIM) introduces a novel benchmark of 820 expert-authored problems that evaluate LLMs by requiring integration of multiple cognitive operations (e.g., constraint satisfaction, state tracking) over accessible knowledge, avoiding reliance on memorization or abstract reasoning. The benchmark employs a 2-parameter logistic (2PL) IRT model calibrated over >200k prompt-response pairs across 28 models, enabling robust ability estimates despite raw accuracy distortions. A leaderboard spanning 22 models and 47 test-configurations reveals that within-family choices (e.g., thinking budget, quantization) impact performance as much as model selection. The framework, IRT parameters, and public problems are released.

grounded integration measure · 2-parameter logistic · cognitive operations · test-configurations · irt model
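The 2-parameter logistic (2PL) IRT model named in the GIM summary above gives each problem a discrimination and a difficulty, and each model an ability. A minimal sketch of the standard 2PL response curve, with a simple grid-search ability estimate, is shown below; the item parameters, the grid range, and the function names are assumptions for illustration and are not GIM's released calibration.

```python
import math


def p_correct(ability, discrimination, difficulty):
    """Standard 2PL probability that a respondent answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def log_likelihood(ability, items, responses):
    """Log-likelihood of 0/1 responses given (discrimination, difficulty) items."""
    total = 0.0
    for (a, b), x in zip(items, responses):
        p = p_correct(ability, a, b)
        total += math.log(p) if x else math.log(1.0 - p)
    return total


def estimate_ability(items, responses):
    """Maximum-likelihood ability over a coarse grid from -4 to 4."""
    grid = [i / 100 for i in range(-400, 401)]
    return max(grid, key=lambda theta: log_likelihood(theta, items, responses))


# Hypothetical items: (discrimination, difficulty).
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
ability_two_correct = estimate_ability(items, [1, 1, 0])
ability_one_correct = estimate_ability(items, [1, 0, 0])
```

Because ability is estimated on a common scale from item parameters, models evaluated on different problem subsets can still be compared, which raw accuracy does not allow.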

AI for Auto-Research: Roadmap & User Guide

arXiv cs.AI · Lingdong Kong, Xian Sun, Wei Chow, Linfeng Li · 2026-05-18

This work presents a comprehensive analysis of AI's role across the complete research lifecycle, organized into four epistemological phases: Creation, Writing, Validation, and Dissemination. Through an end-to-end study spanning developments until April 2026, the authors identify stage-dependent boundaries between reliable assistance and unreliable autonomy, demonstrating AI's strengths in structured, retrieval-grounded tasks while highlighting fragility in novel idea generation and scientific judgment. The analysis reveals that end-to-end autonomous systems fail to consistently meet major-venue acceptance standards, with generated ideas degrading post-implementation and research code underperforming pattern-matching benchmarks. The study concludes with a taxonomy, benchmark suite, tool inventory, cross-stage design principles, and practitioner-oriented playbook for AI-assisted research.

research lifecycle · epistemological phases · retrieval-grounded · pattern-matching benchmarks · end-to-end autonomy

KairosHope: A Next-Generation Time-Series Foundation Model for Specialized Classification via Dual-Memory Architecture

arXiv cs.AI · Luis Balderas, José Alberto Rodríguez, Miguel Lastra, Antonio Arauzo-Azofra · 2026-05-18

KairosHope introduces a next-generation Time Series Foundation Model (TSFM) optimized for specialized classification tasks, addressing computational bottlenecks and integration of classical statistical knowledge. The model employs a dual-memory architecture: Titans modules for short-term retention and Continuum Memory System (CMS) for long-term context abstraction, alongside a Hybrid Decision Head fusing deep representations with statistical features. Pre-trained via Masked Time Series Modeling (MTSM) and contrastive learning on the Monash archive, it is fine-tuned on UCR benchmark datasets using Linear Probing and Full Fine-Tuning (LP-FT) to mitigate catastrophic forgetting. Empirical results show superior performance in temporally causal domains like HAR and Sensor data.

time series foundation model · dual-memory architecture · masked time series modeling · continuum memory system · linear probing

Statistical Limits and Efficient Algorithms for Differentially Private Federated Learning

arXiv cs.AI · Arnab Auddy, Xiangni Peng, Subhadeep Paul · 2026-05-18

The paper proposes FedHybrid and FedNewton, two differentially private federated learning algorithms addressing accuracy-privacy-communication trade-offs in M-estimation. FedHybrid combines FedAvg's initialization with FedSGD, while FedNewton reduces FedAvg's bias via local Newton iteration averaging. Theoretical analysis provides finite-sample MSE bounds for DP versions, relating to client count, local samples, privacy budget, and iterations, alongside a minimax lower bound for optimality assessment. Empirical evaluation on MNIST and CIFAR-10 demonstrates effectiveness for logistic regression and neural networks.

federated learning · differential privacy · m-estimation · newton iteration · minimax lower bound

Pocket Foundation Models: Distilling TFMs into CPU-Ready Gradient-Boosted Trees

arXiv cs.AI · Aditya Tanna, Nassim Bouarour, Mohamed Bouadi, Vinay Kumar Sankarapu · 2026-05-18

The paper introduces a method for distilling tabular foundation models (TFMs) into CPU-ready gradient-boosted trees (XGBoost/CatBoost) to achieve sub-2ms inference times. The key challenge is preventing label leakage in in-context learning (ICL) teachers during distillation, addressed via stratified out-of-fold teacher labeling. On 153 classification datasets, distilled TabICLv2 achieves 0.882 macro-mean AUC (96.5% of teacher performance) at 1.9ms CPU latency, with 38x-860x speedups over GPU TFMs and statistically significant gains over CatBoost baselines (Wilcoxon p=0.0008). Additional findings include teacher rank preservation, dimensional sensitivity, and multi-teacher effects.

tabular foundation models · in-context learning · gradient-boosted trees · label leakage · out-of-fold labeling

An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration

arXiv cs.AI · Maja Pavlovic, Silviu Paun, Massimo Poesio · 2026-05-18

The study isolates the benefits of human soft-labels in AI training by decoupling them from label mode shifts, demonstrating their role as a regularizer that enhances model calibration and training stability. Using MNIST and a synthetic variant, the authors re-annotated subsets to extract human uncertainty, comparing models trained on human soft-labels versus synthetic labels. Results show that human soft-labels improve accuracy and calibration, particularly on difficult samples, while aligning model uncertainty with human uncertainty. Dataset cartography reveals synthetic labels fail to achieve this alignment. The work establishes a diagnostic framework for human-AI uncertainty alignment.

soft-labels · calibration · regularizer · dataset cartography · uncertainty alignment

Post-Trained MoE Can Skip Half Experts via Self-Distillation

arXiv cs.AI · Xingtai Lv, Li Sheng, Kaiyan Zhang, Yichen You · 2026-05-18

The paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a framework for converting post-trained static Mixture-of-Experts (MoE) models into dynamic ones without full retraining. ZEDA injects parameter-free zero-output experts and employs two-stage self-distillation with a frozen teacher model and group-level balancing loss. Evaluated on Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks, ZEDA reduces expert FLOPs by over 50% with minimal accuracy loss, outperforming dynamic MoE baselines by 6.1 and 4.0 points and achieving ~1.20× inference speedup.
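A rough sketch of how parameter-free zero-output experts can save compute at routing time. The function names, the toy scalar experts, and the choice not to renormalize gate weights over the selected experts are all assumptions made for illustration; the summary does not specify ZEDA's routing details.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, real_experts, n_zero, top_k):
    """Route one token; slots at index >= len(real_experts) are zero-output experts."""
    n_real = len(real_experts)
    assert len(router_logits) == n_real + n_zero
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    out = [0.0] * len(x)
    real_calls = 0
    for i in chosen:
        if i >= n_real:
            continue  # zero expert: adds nothing to the output and runs no computation
        y = real_experts[i](x)
        out = [o + probs[i] * v for o, v in zip(out, y)]
        real_calls += 1
    return out, real_calls
```

Whenever the router places a zero expert in the top-k, one real expert call is skipped for that token, which is where the FLOPs reduction would come from.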

mixture-of-experts · self-distillation · dynamic routing · flops reduction · inference speedup

Data Presentation Over Architecture: Resampling Strategies for Credit Risk Prediction with Tabular Foundation Models

arXiv cs.AI · Aditya Tanna, Mitul Solanki, Mohamed Bouadi, Nassim Bouarour · 2026-05-18

The study demonstrates that context construction strategy significantly impacts Tabular Foundation Models (TFMs) in credit default prediction, surpassing architectural choice in explaining variance. Benchmarking four classical models and five TFMs on Home Credit and Lending Club datasets, the authors evaluate seven context-construction strategies and context sizes from 1K to 50K. Balanced and hybrid sampling improve AUC-ROC by 3-4 points over uniform sampling, exceeding TFM family differences. With balanced contexts of 5K-10K examples, top TFMs match classical baselines' AUC while improving default-class recall, highlighting context construction as a key deployment factor for TFMs in imbalanced settings.
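To make the contrast between uniform and balanced context construction concrete, here is a small sketch. The function names and the equal-share rule are assumptions; the study evaluates seven strategies whose exact definitions are not given in this summary.

```python
import random
from collections import defaultdict

def uniform_context(labels, context_size, seed=0):
    """Sample context rows uniformly, which preserves the original class imbalance."""
    rng = random.Random(seed)
    return rng.sample(range(len(labels)), context_size)

def balanced_context(labels, context_size, seed=0):
    """Sample context rows so each class gets an equal share when enough rows exist."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    per_class = context_size // len(by_class)
    chosen = []
    for indices in by_class.values():
        take = min(per_class, len(indices))  # a rare class may have fewer rows than its share
        chosen.extend(rng.sample(indices, take))
    rng.shuffle(chosen)
    return chosen
```

With a 5% default rate, a uniform 40-row context holds about two defaults on average, while the balanced variant holds twenty.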

tabular foundation models · context construction · auc-roc · class imbalance · credit default prediction

Position: Weight Space Should Be a First-Class Generative AI Modality

arXiv cs.AI · Zhangyang Wang, Peihao Wang, Kai Wang · 2026-05-18

The position paper advocates for treating neural network checkpoints as a first-class generative AI modality, proposing weight space as a core machine learning primitive. It argues that high-performing models occupy low-dimensional, structured regions of weight space characterized by symmetry, flatness, modularity, and shared subspaces. The authors organize existing methods into a five-stage pipeline, highlighting applications where weight space generation is practical, such as adapter-scale and conditional generation. While unrestricted frontier-scale checkpoint synthesis remains challenging, the approach reduces adaptation costs by orders of magnitude and matches fine-tuning performance. The paper aims to shift focus from task-specific optimization to sampling models from learned weight distributions, advancing AI systems that improve or create other AI systems.

weight space · checkpoint synthesis · low-dimensional · modularity · fine-tuning

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

arXiv cs.AI · Nithin Somasekharan, Youssef Hassan, Shiyao Lin, Gihan Panapitiya · 2026-05-18

SCICONVBENCH introduces a benchmark for evaluating LLMs on multi-turn clarification dialogues in computational science, focusing on disambiguation and inconsistency resolution across fluid mechanics, solid mechanics, materials science, and PDEs. The benchmark employs a structured task ontology and rubric-based framework to assess clarification behavior, conversational grounding, and specification fidelity. Results show frontier models achieve 52.7% disambiguation accuracy in fluid mechanics but frequently make ungrounded assumptions, highlighting gaps in reliable task formulation.

multi-turn clarification · task formulation · computational science · disambiguation · inconsistency resolution

Learning Lifted Action Models from Traces with Minimal Information About Actions and States

arXiv cs.AI · Jonas Gösgens, Niklas Jansen, Hector Geffner · 2026-05-18

The paper introduces a method for learning lifted STRIPS+ action models from traces with partial observability of both actions and states, relaxing previous assumptions of full state observability. Three algorithms are formulated for different observability conditions: no state observability, full observability of selected state predicates, and local observability of state predicates. Theoretical completeness results characterize when an equivalent STRIPS+ domain can be learned under these conditions. Experimental validation demonstrates the approach's effectiveness in learning from partially observable traces.

lifted strips · action models · partial observability · state predicates · learning from traces

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

arXiv cs.AI · Wei Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang · 2026-05-18

The paper introduces CrossView Suite, a comprehensive framework for enhancing cross-view spatial intelligence in multimodal large language models (MLLMs). It addresses three key gaps: data scarcity, benchmark absence, and alignment mechanisms. The suite includes CrossViewSet (1.6M samples, 17 task types), CrossViewBench (systematic evaluation), and CrossViewer (a three-stage reasoning framework). CrossViewer employs adaptive spatial region tokenization and explicit multi-view alignment to boost inference. Experiments demonstrate the criticality of large-scale data, systematic evaluation, and explicit alignment for advancing MLLMs.

cross-view spatial intelligence · multimodal large language models · instruction dataset · scene-disjoint benchmark · progressive framework

Stochastic Penalty-Barrier Methods for Constrained Machine Learning

arXiv cs.AI · Adam Bosák, Andrii Kliachkin, Jana Lepšová, Gilles Bareilles · 2026-05-18

The Stochastic Penalty-Barrier Method (SPBM) introduces a novel approach for constrained machine learning in non-convex, non-smooth, stochastic settings prevalent in deep learning. SPBM extends classical penalty and barrier methods by integrating exponential dual averaging, a stabilized penalty schedule, and the Moreau envelope to address non-smoothness. Empirical evaluations demonstrate that SPBM matches or exceeds existing constrained optimization baselines across diverse scenarios, achieving this with only linear runtime overhead compared to unconstrained Adam, even with up to 10,000 constraints.
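For readers unfamiliar with the Moreau envelope, its standard definition and a one-dimensional example are given below. The symbols $f$, $\mu$, $x$, $y$ are generic notation rather than the paper's, and the closed form shown is the textbook convex case; how SPBM applies the envelope in non-convex settings is not described in this summary.

```latex
\begin{aligned}
M_{\mu} f(x) &= \min_{y} \left\{ f(y) + \tfrac{1}{2\mu} \lVert y - x \rVert^{2} \right\}, \qquad \mu > 0, \\
f(y) = \lvert y \rvert \;&\Rightarrow\;
M_{\mu} f(x) =
\begin{cases}
\dfrac{x^{2}}{2\mu}, & \lvert x \rvert \le \mu, \\[4pt]
\lvert x \rvert - \dfrac{\mu}{2}, & \lvert x \rvert > \mu .
\end{cases}
\end{aligned}
```

The example is the Huber function: the kink of $\lvert x \rvert$ at zero is replaced by a quadratic, so the smoothed function is differentiable everywhere while staying within $\mu/2$ of the original.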

stochastic optimization · penalty-barrier method · moreau envelope · exponential dual averaging · constrained machine learning

ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

arXiv cs.AI · Ziyu Wei, Luting Wang, Chen Gao, Li Wen · 2026-05-18

ManiSoft introduces a benchmark for vision-language manipulation with soft continuum robotic arms, addressing challenges in deformable control and unreliable proprioception. The benchmark features a simulator coupling realistic soft-body dynamics with contact-rich interactions via elastic force constraints, and defines four tasks highlighting distinct aspects of deformable control. An automated pipeline generates 6,300 diverse scenes and expert trajectories using a high-level planner and low-level reinforcement learning policy for torque command generation. Benchmarking three policy models shows promising results in clean scenes but significant performance drops under randomization, attributed to inaccurate visual proprioceptive estimation and limited deformability exploitation. ManiSoft aims to bridge the gap between rigid and soft arms in vision-language manipulation.

soft continuum robotics · vision-language manipulation · elastic force constraint · deformable control · proprioceptive estimation

SAME: A Semantically-Aligned Music Autoencoder

arXiv cs.AI · Julian D. Parker, Zach Evans, CJ Carr, Zachary Zukowski · 2026-05-18

The paper introduces SAME (Semantically-Aligned Music autoEncoder), a transformer-based autoencoder for stereo music and general audio achieving 4096× temporal compression while preserving reconstruction quality and generative performance. The method combines semantic regularization, phase-aware reconstruction losses, and optimized discriminator designs. Results demonstrate computational efficiency through high compression ratios and transformer primitives, with two variants (SAME-L and CPU-compatible SAME-S) released as open-weights.

autoencoder · transformer · compression · semantic regularization · phase-aware

CATA: Continual Machine Unlearning via Conflict-Averse Task Arithmetic

arXiv cs.AI · Shen Lin, Junhao Dong, Rongjie Chen, Xiaoyu Zhang · 2026-05-18

The paper introduces CATA, a conflict-averse task arithmetic method for continual machine unlearning in vision-language models (VLMs), addressing sequential removal requests. CATA represents each forget request as an unlearning task vector, maintains historical task vectors, and performs sign-aware conflict-averse aggregation to suppress conflicting update components. Extensive experiments demonstrate CATA's superiority over baselines in forgetting effectiveness, model fidelity, and forgetting persistence under both single-shot and continual unlearning settings.
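The summary does not give CATA's aggregation rule, so the sketch below shows only one plausible sign-aware scheme, chosen for illustration: a coordinate of the new task vector is dropped when its sign opposes the summed historical vectors. All function names and the rule itself are assumptions, not the authors' method.

```python
def sign(v):
    return (v > 0) - (v < 0)

def conflict_averse_merge(history, new_vec):
    """Add new_vec to the summed history, suppressing coordinates whose sign opposes it."""
    dim = len(new_vec)
    past = [sum(vec[j] for vec in history) for j in range(dim)]
    merged = []
    for j in range(dim):
        if past[j] != 0 and sign(past[j]) == -sign(new_vec[j]):
            merged.append(past[j])  # conflicting component of the new request is dropped
        else:
            merged.append(past[j] + new_vec[j])
    return merged

def apply_unlearning(weights, merged, scale=1.0):
    """Task-arithmetic style update: theta' = theta + scale * merged task vector."""
    return [w + scale * m for w, m in zip(weights, merged)]
```

The intent of such a rule is that a later forget request cannot undo the direction an earlier request already pushed a weight in, which is one way to read "forgetting persistence".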

continual machine unlearning · vision-language models · task arithmetic · conflict-averse aggregation · unlearning task vector

Latent Action Reparameterization for Efficient Agent Inference

arXiv cs.AI · Wenhao Huang, Qingwen Zeng, Qiyue Chen, Zijie Guo · 2026-05-18

Latent Action Reparameterization (LAR) introduces a framework for improving inference efficiency in large language model (LLM) agents by learning a compact latent action space. LAR reparameterizes low-level textual actions into multi-step semantic behaviors, reducing the effective decision horizon while preserving action expressiveness. Unlike hand-crafted macros or hierarchical controllers, latent actions are learned from agent trajectories and integrated directly into the model, enabling planning and execution over abstract representations. Evaluations across LLM-based agent benchmarks demonstrate significant reductions in action tokens and wall-clock inference time, with maintained or improved task success rates. This highlights action representation learning as a critical factor in scaling efficient LLM agent inference.

latent action reparameterization · action representation learning · multi-step semantic behavior · effective decision horizon · llm agent inference

Not What You Asked For: Typographic Attacks in Household Robot Manipulation

arXiv cs.AI · Ali Iranmanesh, Peng Liu · 2026-05-18

This work demonstrates that typographic attacks can compromise household robot manipulation by exploiting CLIP's shared embedding space, causing physical execution errors. The authors evaluate attacks in Habitat simulations using HomeRobot, employing a decoupled architecture with frozen CLIP and DETIC for geometric grounding. Results show 67.8% Attack Success Rate (70.0% in successful episodes), with kinetic failures manifesting as incorrect object grasping and transport due to poisoned semantic maps, revealing a critical safety vulnerability in modular manipulation pipelines.

typographic attacks · clip embedding · home-robot benchmark · semantic mapping · kinetic failures

AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

arXiv cs.AI · Peilin Wu, Xinlu Zhang, Kun Wan, Wentian Zhao · 2026-05-18

AMARIS introduces a memory-augmented rubric improvement system for rubric-based reinforcement learning, addressing limitations of stateless rubric adaptation by leveraging long-term training history. The system analyzes rollouts, aggregates step-level summaries, retrieves historical context via static and dynamic memory mechanisms, and asynchronously updates rubrics. Experiments demonstrate consistent performance gains over baselines in closed and open-ended domains, with ablation studies confirming contributions from both retrieval types and minimal (~5%) time overhead. Results show that persistent evaluation memory enables evidence-driven rubric adaptation, transforming reward shaping into a strategic, curriculum-like process.

rubric-based reinforcement learning · memory-augmented systems · reward shaping · asynchronous execution · dynamic retrieval

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation

arXiv cs.AI · Mingfei Sun · 2026-05-18

Randomized Advantage Transformation (RAT) introduces a method for estimating Tikhonov-regularized natural policy gradients via direct backpropagation, circumventing the computational expense of Fisher matrix inversion. By applying the Woodbury formula, RAT reformulates regularized natural policy gradients as vanilla policy gradients with a transformed advantage, computed efficiently using randomized block Kaczmarz iterations on on-policy mini-batches. The method avoids explicit Fisher matrix construction, conjugate-gradient solvers, and architecture-specific approximations. Empirical results demonstrate that RAT matches or exceeds established natural-gradient methods across continuous and visual control benchmarks, while maintaining simplicity and architectural compatibility.
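The reformulation can be illustrated with the push-through form of the Woodbury identity. The notation here is assumed rather than taken from the paper: $J \in \mathbb{R}^{n \times d}$ stacks the per-sample score vectors $\nabla_\theta \log \pi_\theta(a_i \mid s_i)$, $A \in \mathbb{R}^{n}$ holds the advantages, and $\lambda > 0$ is the Tikhonov damping.

```latex
\begin{aligned}
g &= \tfrac{1}{n} J^\top A, \qquad F = \tfrac{1}{n} J^\top J, \\
(F + \lambda I_d)^{-1} g
  &= \tfrac{1}{n}\left(\tfrac{1}{n} J^\top J + \lambda I_d\right)^{-1} J^\top A \\
  &= \tfrac{1}{n} J^\top \left(\tfrac{1}{n} J J^\top + \lambda I_n\right)^{-1} A
   = \tfrac{1}{n} J^\top \tilde{A}, \\
\tilde{A} &= \left(\tfrac{1}{n} J J^\top + \lambda I_n\right)^{-1} A .
\end{aligned}
```

The last expression has the form of a vanilla policy gradient with $A$ replaced by $\tilde{A}$, so it can be obtained by ordinary backpropagation once $\tilde{A}$ is known. Computing $\tilde{A}$ means solving an $n \times n$ linear system over the mini-batch, which is the kind of problem an iterative solver such as randomized block Kaczmarz can approximate without forming $F$.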

natural policy gradients · woodbury formula · tikhonov regularization · kaczmarz iterations · fisher matrix

Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks

arXiv cs.AI · Yubin Qu, Ying Zhang, Yanjun Zhang, Gelei Deng · 2026-05-18

The paper introduces OverEager-Gen, a benchmark for measuring overeager actions in coding agents, defined as scope expansions where agents perform unauthorized operations on benign tasks. The benchmark addresses measurement validity by employing a behavioral-gradient validator, dual-channel stack auditing, and byte-identical consent variants. Results from 500 scenarios and ~7,500 runs across four agent products show that stripping consent declarations increases overeager rates significantly (Δ ∈ [11.9, 17.2] pp), with framework design (permissive vs. ask-to-continue) dominating effect size (Fisher p ≤ 10^-5).

overeager actions · behavioral-gradient validator · dual-channel stack · consent declaration · scope expansion

When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State

arXiv cs.AI · Peiying Zhu, Sidi Chang · 2026-05-18

This work introduces discipline stability, a trace-based evaluation paradigm for assessing agent behavior in environments with hidden competitor states. The method defines benchmark behavior, restricts observations to deployment regimes, induces trace diagnostics from failures, separates mechanisms via ablations, and tests transfer and deployment. Experiments on a two-hotel benchmark and a hidden-budget bidding task show that reward-only PPO variants fail trace alignment, while trace-prior RL and corrected history policies better preserve price or bid distributions. Behavior cloning suffices for symmetric imitation, but Trace-Prior RL enables bounded adaptation under capacity asymmetry. The contribution focuses on evaluation methodology rather than proposing new optimizers or MARL claims.

trace-based evaluation · discipline stability · hidden competitor state · trace-prior rl · behavior cloning

Query-Conditioned Knowledge Alignment for Reliable Cross-System Medical Reasoning

arXiv cs.AI · Yan Jiao, Jingran Xu, Pin-Han Ho, Limei Peng · 2026-05-18

The paper introduces Query-Conditioned Entity Alignment (QCEA), a novel approach to cross-domain knowledge alignment in medical systems that addresses context-dependent, non-bijective, and direction-sensitive correspondence. QCEA reformulates entity alignment as a query-conditioned correspondence problem, leveraging semantic encoding, graph-based representation learning, and a direction-aware transformation module to rank candidate entities in target graphs based on textual descriptions of source entities. Evaluated on TCM-WM knowledge graphs from SymMap, QCEA outperforms baselines on rank-sensitive metrics like Hit@K and MRR, and downstream retrieval-augmented generation experiments show improved evidence retrieval, grounding, and answer accuracy.

query-conditioned entity alignment · cross-domain knowledge alignment · graph-based representation learning · retrieval-augmented generation · rank-sensitive metrics

LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

arXiv cs.AI · Hyunji Lee, Justin Chih-Yao Chen, Joykirat Singh, Zaid Khan · 2026-05-18

The paper introduces LongMINT, a benchmark for evaluating memory-augmented agents in long-horizon, interference-heavy settings. It features 15.6k QA pairs across diverse domains (state tracking, dialogue, Wikipedia, GitHub) with contexts averaging 138.8k tokens (up to 1.8M). The benchmark assesses single-target recall and multi-target aggregation under dynamic memory interference. Evaluation of 7 systems (LLMs, RAG, agent frameworks) reveals low average accuracy (27.9%), with performance limited by retrieval and memory construction, especially for revised facts amid intervening updates.

long-horizon · memory interference · multi-target aggregation · retrieval-augmented generation · contextual reasoning

Estimating Item Difficulty with Large Language Models as Experts

arXiv cs.AI · Diana Kolesnikova, Kirill Fedyanin, Abe D. Hofman, Matthieu J. S. Brinkhuis · 2026-05-18

This study demonstrates that large language models (LLMs) can effectively estimate item difficulty for newly created tasks without response data, offering a cost-efficient alternative to pretesting and expert judgment. Using an item bank from an online learning system, the authors evaluated three LLMs across six primary-school mathematics domains, employing a full factorial design to test judgment formats (absolute vs pairwise), decision types (hard decisions vs token-probability-based estimates), and prompting strategies (zero-shot vs few-shot). LLM-derived difficulty estimates showed moderate-to-strong Spearman rank correlations with empirical difficulties, with pairwise comparisons outperforming absolute judgments in most cases, and token-probability-based estimates enhancing accuracy when combined with few-shot prompting.

large language models · item difficulty · spearman rank correlation · token-probability · few-shot prompting

Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix

arXiv cs.AI · Santosh Kumar Radha, Oktay Goktas · 2026-05-18

The authors propose replacing the logarithmic Robertson-Spärck-Jones (RSJ) odds in BM25 with a q-logarithm to improve code retrieval under fixed generic tokenization. This adaptive q-log odds transform recovers BM25 at q=1 and acts as a Box-Cox transform for q<1. Evaluated on CoIR CodeSearchNet Go (182K documents), oracle-tuned NDCG@10 improves from 0.2575 to 0.4874 (+89.3%), with statistically significant gains (p ≤ 10^-4). The method shows graded improvements across code languages, negligible effect on BEIR text, and maintains query latency. A corpus-level q parameter can be estimated from hapax density, and identifier-aware tokenization reduces the incremental gain from q-IDF.
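A minimal sketch of the transform, assuming the standard Tsallis q-logarithm and the usual RSJ odds with 0.5 smoothing. The function names are illustrative, and the paper's adaptive estimation of q from hapax density is not reproduced here.

```python
import math

def q_log(x, q):
    """Tsallis q-logarithm; equals the natural log at q = 1."""
    if x <= 0:
        raise ValueError("x must be positive")
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    # Same shape as a Box-Cox transform with exponent (1 - q)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def rsj_odds(n_docs, doc_freq):
    """Robertson-Sparck Jones odds with the usual 0.5 smoothing."""
    return (n_docs - doc_freq + 0.5) / (doc_freq + 0.5)

def q_idf(n_docs, doc_freq, q=1.0):
    """Term weight that reduces to the classic BM25 IDF when q = 1."""
    return q_log(rsj_odds(n_docs, doc_freq), q)
```

For odds above 1, lowering q below 1 stretches the weight of rare terms relative to the plain logarithm, which is a plausible reason the change helps on identifier-heavy code corpora.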

bm25 · q-logarithm · rsj odds · code retrieval · hapax density

Key-Gram: Extensible World Knowledge for Embodied Manipulation

arXiv cs.AI · Jingjing Fan, Siyuan Li, Botao Ren, Zhidong Deng · 2026-05-18

Key-Gram introduces a conditional-memory framework that decouples language-derived world knowledge from visual-state reasoning in embodied control tasks. The framework employs a memory module that decomposes instructions into task-specific key-grams, retrieves static linguistic priors via deterministic hashed lookup, and injects these entries into selected hidden layers using context-aware gating and lightweight convolutional fusion. This design enables the backbone to focus on visual reasoning and action inference while storing reusable instruction knowledge in an extensible external memory. Evaluations on RoboTwin2.0, LIBERO/LIBERO-Plus, and real-world dual-arm manipulation tasks show average relative gains of 29.5%/9.9%, 35.8%/4.5%, and 15.4%/8.1%, respectively, demonstrating improved compositional grounding and transfer capabilities.

embodied control · conditional-memory framework · key-grams · deterministic hashed lookup · context-aware gating

StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video

arXiv cs.AI · Huajian Zeng, Chaohua Yao, Yuantai Zhang, Jiaqi Yang · 2026-05-18

StableHand introduces a quality-aware flow-matching framework for world-space dual-hand motion estimation from egocentric video, addressing challenges of extended hand disappearance and severe occlusions. The method decomposes hand motion observations into four channels (wrist translation and finger articulations for both hands) and integrates per-frame quality signals predicted by a learned quality network. These signals guide a flow-matching process via a per-channel forward schedule, quality-adjusted velocity targets, AdaLN modulation of the DiT denoiser, and quality-aware ODE initialization. On HOT3D and ARCTIC benchmarks, StableHand reduces W-MPJPE by 20-25% compared to baselines, achieving state-of-the-art performance, particularly on heavily occluded sequences.

flow-matching · egocentric video · occlusions · w-mpjpe · denoiser

STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics

arXiv cs.AI · Tingfeng Hui, Hao Xu, Pengyu Zhu, Hongsheng Xin · 2026-05-18

The paper introduces STT-Arena, a benchmark of 227 interactive tasks designed to evaluate LLMs' ability to replan under spatio-temporal disruptions, featuring nine conflict types and four solvability levels. The environment includes dynamic triggers that invalidate ongoing plans, requiring models to detect state shifts and adapt. Evaluations show state-of-the-art models like Claude-4.6-Opus achieve under 40% accuracy, with failure modes including Stale-State Execution and Misdiagnosis of Dynamic Triggers. The authors propose iterative trajectory refinement and online RL to develop STT-Agent-4B, which outperforms existing LLMs.

spatio-temporal dynamics · tool-using · adaptive replanning · iterative trajectory refinement · online reinforcement learning

VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation

arXiv cs.AI · Linan ZHU, Zihao Zhai, Xiao Han, Yuqian Fu · 2026-05-18

VISAFF introduces a speaker-centered visual affective feature learning framework for Emotion Recognition in Conversation (ERC), addressing limitations of text-based methods and Vision-Language Models (VLMs). The framework operates in two stages: Speaker-Centered Affective Grounding leverages frozen VLMs to focus on active speakers' emotional visual cues without fine-tuning, while Reliability-Guided Affective Complementation dynamically integrates textual and acoustic modalities to mitigate visual uncertainty. Evaluated on two real-world datasets, VISAFF achieves competitive performance with state-of-the-art methods while significantly improving computational efficiency by eliminating costly VLM fine-tuning.

emotion recognition · vision-language models · affective grounding · multimodal integration · computational efficiency

Probing for Representation Manifolds in Superposition

arXiv cs.AI · Alexander Modell · 2026-05-18

The Manifold Probe is introduced as a supervised method for discovering representation manifolds in superposition, generalizing linear regression probes by learning feature spaces and encoding directions. Applied to Llama 2-7b representations of time and space, the probe identifies manifolds that linearly represent interpretable features. Steering along the time manifold influences model completions regarding the release years of cultural artifacts, demonstrating the probe's ability to uncover causally relevant manifolds in model behavior.

manifold probe · superposition · linear regression · llama 2-7b · representation manifolds

Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

arXiv cs.AI · Zhihan Yang, Wei Guo, Shuibai Zhang, Subham Sekhar Sahoo · 2026-05-18

RePlaid, a likelihood-based continuous diffusion language model (DLM), demonstrates competitive scalability with discrete DLMs by achieving a compute gap of only 20× compared to autoregressive models. RePlaid aligns Plaid's architecture with modern discrete DLMs, optimizing noise schedules and embeddings via likelihood to minimize ELBO variance and create structured geometries. On OpenWebText, RePlaid sets a new state-of-the-art perplexity bound of 22.1 among continuous DLMs, outperforming Duo with fewer parameters and MDLM in the over-trained regime. Theoretical insights reveal that likelihood-based training evenly distributes denoising difficulty and drives significant likelihood gains.

diffusion language model · likelihood-based training · noise schedule · elbo variance · structured geometries

AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment

arXiv cs.AI · Zhenlin Wei, Pu Jian, Yingzhuo Deng, Xiaohan Wang · 2026-05-18

The paper proposes Asymmetric Meta-Reflective Self-Distillation (AMR-SD), a novel method for token-level credit assignment in Large Language Models (LLMs) trained with Reinforcement Learning with Verifiable Rewards (RLVR). AMR-SD introduces a reflection bottleneck that compresses diagnostic signals into self-generated Socratic hints and critiques, avoiding over-conditioned teacher distributions. It employs Causal Information Gain (CIG) with an asymmetric, ReLU-gated threshold to translate reflections into sparse token-level advantage modulations, combined with temporal annealing to filter distributional noise. Experiments on scientific, mathematical, and tool-use benchmarks show AMR-SD outperforms baselines, achieving robust long-horizon stability and preventing late-stage training collapse.

self-distillation · credit assignment · token-level · reflection bottleneck · causal information gain

Beyond Morphology: Quantifying the Diagnostic Power of Color Features in Cancer Classification

arXiv cs.AI · Farnaz Kheiri, Shahryar Rahnamayan, Masoud Makrehchi · 2026-05-18

This study quantifies the diagnostic value of color features alone in cancer classification, excluding morphological cues. Using statistical color moments and discretized RGB/HSV histograms with classical ML classifiers, the authors demonstrate that color features achieve up to 89% accuracy in binary malignancy classification. The results suggest chromatic shifts encode malignancy signals, enabling lightweight pre-screening tools to reduce computational burden on deep learning systems. Evaluations across ten settings show color features consistently outperform random baselines.
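The kind of color-only feature vector described here can be sketched as follows. The bin count, the standardized-skewness convention, and the function names are assumptions; the study's exact feature definitions are not given in this summary.

```python
import math

def color_moments(channel):
    """Mean, standard deviation, and standardized skewness of one channel."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    std = math.sqrt(var)
    third = sum((v - mean) ** 3 for v in channel) / n
    skew = third / std ** 3 if std > 0 else 0.0
    return mean, std, skew

def histogram(channel, bins=8, max_value=256):
    """Discretized histogram normalized to sum to 1."""
    counts = [0] * bins
    for v in channel:
        counts[min(int(v * bins / max_value), bins - 1)] += 1
    return [c / len(channel) for c in counts]

def color_features(pixels, bins=8):
    """pixels: list of (r, g, b) tuples; returns one flat feature vector."""
    feats = []
    for c in range(3):
        channel = [p[c] for p in pixels]
        feats.extend(color_moments(channel))
        feats.extend(histogram(channel, bins))
    return feats
```

A vector like this carries no spatial or shape information, so any classifier trained on it is using chromatic statistics alone.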

histopathology · color moments · rgb/hsv histograms · malignancy classification · computational triage

A Practical Noise2Noise Denoising Pipeline for High-Throughput Raman Spectroscopy

arXiv cs.AI · David Martin-Calle, Cesar Alvarez Llamas, Vincent Motto-Ros, Christophe Dujardin · 2026-05-18

A Noise2Noise-based denoising pipeline for high-throughput Raman spectroscopy is introduced, eliminating the need for external spectral libraries or high signal-to-noise reference spectra. The method employs a one-dimensional convolutional autoencoder trained on repeated short-exposure acquisitions, enabling stochastic noise suppression. Evaluated on heterogeneous mineral samples using spectral fidelity metrics (RMSE, SNR, SSIM) and unsupervised K-means classification, the pipeline achieves high-fidelity denoising with integration times as short as 5 ms per spectrum. This approach balances spectral quality and acquisition speed, facilitating fast Raman workflows suitable for routine laboratory use and transferable to other one-dimensional spectroscopic modalities.

noise2noise · raman spectroscopy · convolutional autoencoder · spectral fidelity · k-means classification

DiPRL: Learning Discrete Programmatic Policies via Architecture Entropy Regularization

arXiv cs.AI · Chengpeng Hu, Yingqian Zhang, Hendrik Baier · 2026-05-18

DiPRL introduces differentiable discrete programmatic reinforcement learning to address performance drops from post-hoc discretization in programmatic RL. The method employs programmatic architecture entropy regularization to encourage convergence toward discrete programs during training, avoiding separate fine-tuning. Experiments on discrete and continuous RL tasks show DiPRL maintains interpretability while achieving strong performance through gradient-based optimization.
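One way to picture an architecture entropy penalty: each node of a relaxed program holds logits over candidate constructs, and the loss adds the entropy of each node's softmax so that training is pushed toward one-hot, discrete choices. The structure and names below are assumptions for illustration; DiPRL's actual regularizer may differ.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero for a one-hot distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def regularized_loss(task_loss, arch_logits, weight):
    """Task loss plus weight times the summed entropy of every node's choice distribution."""
    penalty = sum(entropy(softmax(node)) for node in arch_logits)
    return task_loss + weight * penalty
```

A node that is undecided between two constructs contributes about 0.69 nats, while a node that has effectively committed contributes almost nothing, so minimizing the penalty favors programs that need no post-hoc discretization.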

programmatic reinforcement learning · discrete policies · entropy regularization · gradient-based optimization · interpretable policies

DBES: A Systematic Benchmark and Metric Suite for Evaluating Expert Specialization in Large-Scale MoEs

arXiv cs.AI · Jing Wang, Hongxuan Lu, Jazze Young, Shu Wang · 2026-05-18

The paper introduces DBES, a systematic framework for evaluating expert specialization in Mixture-of-Experts (MoE) models, addressing the conflation of architectural load-balancing with functional specialization. DBES combines a multi-domain benchmark with five metrics: Routing Specialization, Normalized Effective Rank, Domain Isolation, Routing Stiffness Score, and N-gram Expertise measures. Experiments reveal distinct specialization paradigms: Qwen-series models exhibit modular specialization, while DeepSeek and GLM employ distributed collaboration. Using DBES to identify high-specialization expert paths during domain-specific post-training improved performance by 66% to 94.48% with only 15% of original training resources, demonstrating actionable optimization potential.

mixture-of-experts · specialization metrics · domain isolation · routing stiffness · post-training optimization

Modality vs. Morphology: A Framework for Time Series Classification for Biological Signals

arXiv cs.AI · Jordan Tschida, Matthew Yohe, Edward Kane, Gavin Jager · 2026-05-18

The article proposes a morphology-modality framework for time series classification (TSC) of biological signals, emphasizing waveform structure over model class. It analyzes electroencephalography, electromyography, electrocardiography, photoplethysmography, and ocular modalities, demonstrating how morphology informs preprocessing and modeling strategies. Results indicate that morphology, rather than model class, most strongly determines performance and interpretability, particularly when deep models' inductive biases align with waveform dynamics. Future work includes morphological data augmentation and evaluation metrics to enhance generalization. The framework positions morphology-aware modeling as a unifying principle for developing generalizable, interpretable TSC models across biological signals.

time series classification · morphology-modality framework · biological signals · inductive biases · waveform dynamics

OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models

arXiv cs.AI · Chiara Maria Russo, Simone Carnemolla, Simone Palazzo, Daniela Giordano · 2026-05-18

OCCAM introduces a framework for open-set causal concept explanation and ontology induction in black-box vision models. The method discovers visual concepts in an open-set manner, localizes them via text-guided segmentation, and performs object-level interventions by removing concepts to estimate causal contributions to class confidence. Aggregating interventional evidence across datasets induces a structured concept ontology, revealing concept dependencies, latent causal relations, and model biases. Experiments on Broden and ImageNet-S demonstrate OCCAM's improved explanation quality in open-set black-box settings and richer global insights compared to per-image attribution methods.

open-setcausal conceptontology inductiontext-guided segmentationobject-level intervention

AI4BayesCode: From Natural Language Descriptions to Validated Modular Stateful Bayesian Samplers

arXiv cs.AI · Jungang Zou, Alex Ziyu Jiang, Qixuan Chen · 2026-05-18

AI4BayesCode introduces a novel LLM-driven system that generates validated MCMC samplers from natural-language Bayesian model descriptions, addressing limitations in existing probabilistic programming systems. The system employs a modular design, decomposing models into sampling blocks mapped to built-in components, and incorporates pre-generation and post-generation validation for reliability. A recursively stateful coding paradigm enables coherent composition of modular sampling components within larger MCMC procedures. Benchmark evaluations demonstrate AI4BayesCode's capability to implement diverse Bayesian models solely from natural-language descriptions, with extensibility through improvements in the underlying AI agent and additional built-in blocks.

mcmcprobabilistic programmingmodular designstateful codingsampler-generation

GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets

arXiv cs.AI · Zhangyang Yao, Haiyan Zhao, Haoyu Wang, Tianbo Huang · 2026-05-18

GAMMA introduces a quantizer-agnostic framework for global bit allocation in mixed-precision LLMs, addressing limitations of existing methods by learning module-wise precision preferences post-training. It optimizes a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint and uses integer programming to project preferences into exact budget-feasible discrete assignments. A key innovation is score reuse, enabling adaptation to arbitrary budgets by re-solving only the integer program, reducing per-budget adaptation time from hours to minutes. Evaluated on Llama and Qwen models (8B-32B), GAMMA outperforms fixed-precision baselines (up to +12.99 Avg.) and search-based methods (up to +7.00 Avg.), achieving fixed 3-bit quality at 2.5-bit average precision.

mixed-precision quantizationinteger programmingaugmented lagrangianteacher-forced reconstructionscore reuse
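The budget projection described for GAMMA can be pictured as a multiple-choice knapsack: pick one bit-width per module so that total preference is maximal and the weighted bit count stays within budget. The sketch below is only an illustration under stated assumptions: the scores, module sizes, and the function name `allocate_bits` are hypothetical, and a small dynamic program stands in for the integer-programming solver mentioned in the summary. It also shows the idea of score reuse, since a new budget only requires re-running the allocation on the same scores.

```python
from typing import Dict, List


def allocate_bits(scores: List[Dict[int, float]],
                  sizes: List[int],
                  avg_bits: float) -> List[int]:
    """Choose one bit-width per module under an average-bit budget.

    scores[m][b] is a hypothetical preference for module m at b bits,
    sizes[m] is the module's relative parameter count.
    """
    budget = int(avg_bits * sum(sizes))
    # best[cost] = (total score, chosen bit-widths) for the modules seen so far
    best = {0: (0.0, [])}
    for module_scores, size in zip(scores, sizes):
        nxt = {}
        for cost, (total, chosen) in best.items():
            for bits, score in module_scores.items():
                new_cost = cost + bits * size
                if new_cost > budget:
                    continue  # this choice would exceed the budget
                candidate = (total + score, chosen + [bits])
                if new_cost not in nxt or candidate[0] > nxt[new_cost][0]:
                    nxt[new_cost] = candidate
        best = nxt
    if not best:
        raise ValueError("no assignment fits the budget")
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical learned preferences for two equally sized modules.
scores = [{2: 0.1, 3: 0.6, 4: 0.9}, {2: 0.5, 3: 0.7, 4: 0.8}]
sizes = [1, 1]

# The same scores are reused for every budget; only the allocation is re-solved.
for avg in (2.5, 3.0, 4.0):
    print(avg, allocate_bits(scores, sizes, avg))
```

In this toy setting the first module is more sensitive to precision, so it receives the extra bits whenever the budget allows.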

Prompt2Fingerprint: Plug-and-Play LLM Fingerprinting via Text-to-Weight Generation

arXiv cs.AI · Sixu Chen, Xiang Chen, Hongyao Yu, Jiaxin Hong · 2026-05-18

Prompt2Fingerprint (P2F) introduces a scalable framework for large language model (LLM) fingerprinting by reformulating it as a conditional parameter generation task. Unlike resource-intensive fine-tuning approaches, P2F employs a specialized generator to map textual descriptions directly to low-rank parameter increments in a single forward pass, enabling plug-and-play fingerprint injection without retraining. Experiments demonstrate that P2F maintains high accuracy, harmlessness, and robustness in fingerprinting while significantly reducing computational overhead, offering an efficient solution for LLM ownership management.

large language modelsfingerprintingparameter generationlow-rank incrementsownership management

Flowing with Confidence

arXiv cs.AI · Friso de Kruiff, Dario Coscia, Max Welling, Erik Bekkers · 2026-05-18

The authors propose Flow Matching with Confidence (FMwC), a method for estimating per-sample confidence in generative models without additional computational cost. FMwC introduces input-dependent multiplicative noise at selected layers, propagates its variance through the network in closed form, and integrates it along the ODE trajectory. This yields a confidence score that supports applications such as filtering, trajectory editing, and adaptive ODE stepping. Experiments demonstrate that the confidence score correlates with the divergence of the learned velocity field, enabling improved image quality, thermodynamic stability of crystals, and interpretability of generative processes.

flow matchingconfidence scoremultiplicative noiseode trajectoryvelocity field

When Fireflies Cluster; Enhancing Automatic Clustering via Centroid-Guided Firefly Optimization

arXiv cs.AI · MKA Ariyaratne, Azwirman Gusrialdi, Yury Nikulin, Jaakko Peltonen · 2026-05-18

The authors propose a centroid-guided Firefly Algorithm (FA) variant for automatic data clustering, addressing limitations of K-Means in handling non-uniform cluster shapes and densities. The method introduces a centroid movement strategy and a multi-objective fitness function balancing compactness, separation, and a TSP-based navigation penalty, enabling automatic estimation of optimal cluster count and dynamic boundary adjustment. Experiments demonstrate improved clustering quality and reduced intra-cluster path distances compared to K-Means, particularly in robotic sensor network applications. Results indicate robustness in complex spatial clustering tasks, with potential for extension to higher-dimensional and adaptive scenarios.

firefly algorithmmulti-objective fitnesscentroid movementtsp-based navigationrobotic sensor networks

Scheduling That Speaks: An Interpretable Programmatic Reinforcement Learning Framework

arXiv cs.AI · Chengpeng Hu, Yingqian Zhang, Hendrik Baier · 2026-05-18

ProRL introduces an interpretable programmatic reinforcement learning framework for job shop scheduling, addressing the opacity and computational demands of DNN-based policies. The method employs a domain-specific language (DSL-S) to represent scheduling strategies as editable programs, combining local search for structure discovery with Bayesian optimization for parameter learning. Evaluations on benchmark instances show ProRL outperforms heuristics and DRL baselines, achieving strong performance even with only 100 training episodes.

programmatic reinforcement learningjob shop schedulingdomain-specific languagebayesian optimizationinterpretable policies

Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights

arXiv cs.AI · Ken Ming Lee, Paul Barde, Maxime C. Cohen, Derek Nowrouzezahrai · 2026-05-18

The paper introduces a maximum entropy reinforcement learning framework for modeling customer trajectories in retail spaces, addressing the limitations of heuristic approaches like Travelling Salesman Problem and Probabilistic Nearest Neighbours. The RL-based method balances reward maximization with stochasticity to capture bounded rationality in customer behavior. Evaluated on real-world convenience store trajectory data, RL-generated trajectories outperformed heuristic baselines in accuracy, yielding better estimates of impulse purchase rates and shelf traffic densities. Notably, RL-based layout repositioning decisions aligned with those derived from actual trajectory data, demonstrating comparable profit gains. The framework provides a practical alternative to data-intensive approaches, enabling more accessible store layout optimization.

reinforcement learningcustomer trajectoriesmaximum entropybounded rationalityretail optimization
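The trade-off between reward maximization and stochasticity in the customer-trajectory summary above is commonly expressed through a Boltzmann (softmax) policy over action values. The sketch below is a minimal illustration of that general idea, not the paper's model: the action names, Q-values, and the function `boltzmann_policy` are hypothetical.

```python
import math
from typing import Dict


def boltzmann_policy(q_values: Dict[str, float],
                     temperature: float) -> Dict[str, float]:
    """Return action probabilities proportional to exp(Q / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(q_values.values())  # subtract the max for numerical stability
    weights = {a: math.exp((q - top) / temperature)
               for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Hypothetical values for a shopper's next move.
q = {"snacks": 2.0, "drinks": 1.0, "exit": 0.0}
print(boltzmann_policy(q, temperature=1.0))    # mostly reward-seeking
print(boltzmann_policy(q, temperature=100.0))  # close to uniform
```

A low temperature approaches a purely reward-maximizing shopper, while a high temperature approaches random wandering; bounded rationality sits in between.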

What is Holding Back Latent Visual Reasoning?

arXiv cs.AI · André G. Viveiros, Nuno Gonçalves, André F. T. Martins, Matthias Lindemann · 2026-05-18

This work identifies limitations in latent visual reasoning for Vision-Language Models (VLMs), demonstrating that latent tokens often play a minimal causal role in predictions. Through empirical analysis, the authors reveal two key issues: (1) oracle latent tokens in existing datasets provide insufficient task-relevant information, causing models to bypass them during training and inference, and (2) inference-time latent tokens deviate from oracle representations, collapsing to a narrow region and limiting their utility. Experiments on a diagnostic dataset show that models can effectively rely on latent tokens when they provide sufficient support. The findings highlight the need for datasets with informative intermediate steps and improved latent token prediction mechanisms.

latent visual reasoningvision-language modelsoracle latent tokenschain-of-thought reasoninginference-time deviation

Building Reliable Arithmetic Multipliers Under NBTI Aging and Process Variations

arXiv cs.AI · Masoud Heidary, Biresh Kumar Joardar · 2026-05-18

This work introduces a novel aging mitigation technique for arithmetic multipliers, crucial components in CPUs, GPUs, FPGAs, and AI accelerators, which are susceptible to Negative Bias Temperature Instability (NBTI) effects exacerbated by AI workloads. The method leverages the sign-invariance property of multiplication, applying selective two's complement transformations to redistribute transistor stress and mitigate NBTI aging. Integrated into systolic arrays, the approach demonstrates improved lifetime in Cadence tool evaluations, with minimal area and delay overheads compared to natural aging baselines.

nbti agingarithmetic multipliers2s complementsystolic arrayscadence tools
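The sign-invariance property used in the NBTI summary above is that a × b = (−a) × (−b), so both operands can be negated in two's complement without changing the product while the bit patterns seen by the hardware change. The sketch below only demonstrates that arithmetic identity on 8-bit words; the helper names are hypothetical, and the rule for deciding when to flip is not described in the summary and is not modeled here.

```python
def to_word(value: int, bits: int) -> int:
    """Encode a signed integer as an unsigned two's complement bit pattern."""
    return value & ((1 << bits) - 1)


def to_signed(word: int, bits: int) -> int:
    """Decode a two's complement bit pattern back to a signed integer."""
    return word - (1 << bits) if word >> (bits - 1) else word


def negate(word: int, bits: int) -> int:
    """Two's complement negation: invert all bits and add one."""
    return (~word + 1) & ((1 << bits) - 1)


def multiply(a_word: int, b_word: int, bits: int, flip: bool) -> int:
    """Multiply two words, optionally negating both operands first."""
    if flip:
        a_word, b_word = negate(a_word, bits), negate(b_word, bits)
    return to_signed(a_word, bits) * to_signed(b_word, bits)


a, b = to_word(5, 8), to_word(-3, 8)
print(format(a, "08b"), format(negate(a, 8), "08b"))  # different bit patterns
print(multiply(a, b, 8, flip=False), multiply(a, b, 8, flip=True))  # same product
```

The most negative value (−128 for 8 bits) has no positive counterpart in two's complement, so a practical design would need to treat that operand specially.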

EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective

arXiv cs.AI · Yuyao Wang, Zhongjian Zhang, Mo Chi, Kaichi Yu · 2026-05-18

The paper introduces EvoMemBench, a benchmark for evaluating memory mechanisms in LLM-based agents through a self-evolving perspective, organized along memory scope (in-episode vs. cross-episode) and content (knowledge-oriented vs. execution-oriented). It compares 15 memory methods and long-context baselines under standardized protocols. Results indicate that long-context baselines remain competitive, that memory helps when context is insufficient or tasks are difficult, and that no single method dominates across settings: retrieval-based methods excel in knowledge-intensive tasks and procedural memory in execution-oriented ones.

llm-based agentsmemory mechanismsself-evolving perspectiveretrieval-based methodslong-context baselines

Geometry-Aware Uncertainty Coresets for Robust Visual In-Context Learning in Histopathology

arXiv cs.AI · Franciskus Xaverius Erick, Johanna Paula Müller, Bernhard Kainz · 2026-05-18

The paper introduces GAUC, a training-free coreset selection method for robust visual in-context learning (ICL) in histopathology. GAUC operates in pre-trained multimodal embedding space, jointly optimizing three objectives: Maximum Mean Discrepancy for distributional fidelity, Effective Mutual Information Difference for prompt robustness, and predictive-variance penalty for output stability. Evaluated on CRC-100K and MHIST datasets with multiple vision-language models, GAUC improves accuracy, calibration, and prompt robustness over existing ICL selection methods without gradient updates.

in-context learningcoreset selectionvision-language modelsmultimodal embeddinghistopathology

Prompts Don't Protect: Architectural Enforcement via MCP Proxy for LLM Tool Access Control

arXiv cs.AI · Rohith Uppala · 2026-05-18

The paper introduces a governed MCP proxy for architectural enforcement of attribute-based access control (ABAC) in LLM-based autonomous agents, addressing the failure of prompt-based restrictions to prevent unauthorized tool access. The method implements two enforcement points: tool discovery (removing unauthorized tools from context) and tool invocation (blocking unauthorized calls). Evaluated on Qwen 2.5 7B, Llama 3.1 8B, and Claude Haiku 3.5 across 150 adversarial tasks, the proxy achieves 0% unauthorized invocation rate (UIR) with <50 ms latency overhead, outperforming prompt-based approaches (11-18 pp UIR reduction).

llm agentsaccess controlmcp proxyadversarial robustnesstool invocation
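The two enforcement points in the MCP proxy summary above (filtering at tool discovery, blocking at tool invocation) can be sketched as follows. This is a minimal illustration under assumptions: `Tool`, `GovernedProxy`, and the attribute strings are hypothetical names, the policy is a plain subset check, and no real MCP SDK is used.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List


@dataclass(frozen=True)
class Tool:
    name: str
    required: FrozenSet[str]   # attributes a caller must hold
    handler: Callable[..., object]


class GovernedProxy:
    """Sits between the agent and its tools and applies one ABAC rule twice."""

    def __init__(self, tools: Iterable[Tool]):
        self._tools = {tool.name: tool for tool in tools}

    def _allowed(self, attrs: FrozenSet[str], tool: Tool) -> bool:
        return tool.required <= attrs  # caller holds every required attribute

    def list_tools(self, attrs: FrozenSet[str]) -> List[str]:
        # Enforcement point 1: unauthorized tools never enter the model's context.
        return [n for n, t in self._tools.items() if self._allowed(attrs, t)]

    def call_tool(self, attrs: FrozenSet[str], name: str, *args):
        # Enforcement point 2: block the call even if the model names the tool.
        tool = self._tools.get(name)
        if tool is None or not self._allowed(attrs, tool):
            raise PermissionError(f"tool {name!r} is not authorized")
        return tool.handler(*args)


proxy = GovernedProxy([
    Tool("read_file", frozenset({"role:analyst"}), lambda path: f"read {path}"),
    Tool("delete_file", frozenset({"role:admin"}), lambda path: f"deleted {path}"),
])
analyst = frozenset({"role:analyst"})
print(proxy.list_tools(analyst))                       # only the permitted tool
print(proxy.call_tool(analyst, "read_file", "a.txt"))
```

Because the check runs outside the model, a prompt injection that convinces the agent to request `delete_file` still fails at the proxy.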

Qumus: Realization of An Embodied AI Quantum Material Experimentalist

arXiv cs.AI · Lihan Shi, Zhaoyi Joy Zheng, Xinzhe Juan, Yimin Wang · 2026-05-18

Qumus introduces the first embodied AI quantum materials experimentalist, integrating high-level reasoning, multimodal processing, and robotic execution for autonomous scientific discovery. The system operates a robotic mini-laboratory to create and process 2D materials, including graphene and van der Waals structures, through closed-loop experimentation. Qumus achieves autonomous error correction and fabricates nanodevices like atomically thin field-effect transistors, demonstrating a framework for self-improving AI in quantum materials research.

quantum materialsembodied aivan der waalsautonomous experimentationnanodevices

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

arXiv cs.AI · Hongyi Liu, Haoyan Yang, Tao Jiang, Bo Tang · 2026-05-18

SkillsVote introduces a lifecycle-governance framework for managing LLM agent skills, addressing challenges in skill redundancy, environment sensitivity, and indiscriminate updates. The framework profiles a million-scale open-source corpus to assess environment requirements, quality, and verifiability, synthesizes tasks for verifiable skills, and performs agentic library search to expose instructional context. Post-execution, it decomposes trajectories into skill-linked subtasks, attributes outcomes, and gates updates based on evidence. Evaluations show offline evolution improves GPT-5.2 on Terminal-Bench 2.0 by up to 7.9 percentage points, while online evolution enhances SWE-Bench Pro by up to 2.6 percentage points, demonstrating the efficacy of governed skill libraries for frozen agents.

lifecycle-governanceagent skillsverifiabilityskill-linked subtasksevidence-gated updates

Diagnosing Korean-Language LLM Political Bias via Census-Grounded Agent Simulation

arXiv cs.AI · Sungwoo Kang · 2026-05-18

The study introduces Dynamo-K, a census-grounded simulation framework for diagnosing political bias in Korean-language LLMs across four models and six elections (2017-2025). It identifies three systematic failure modes: progressive bias in moderate agents (MAE reducible by 5.2×), model-dependent third-party salience collapse, and regional polarization collapse. Scenario reframing recovers 62% of 2017 MAE, while a learned reweighting adapter calibrates opposing-valence models without candidate names. Dynamo-K achieves 2.1 pp MAE on a 0.73%-margin 2022 race and correctly predicts 3/3 presidential winners, validating its diagnostic capability.

political biasagent simulationkorean-language llmsmean absolute errorsalience collapse

Graph Hierarchical Recurrence for Long-Range Generalization

arXiv cs.AI · Stefano Carotti, Marco Pacini, Alessio Gravina, Davide Bacciu · 2026-05-18

The authors propose Graph Hierarchical Recurrence (GHR), a novel framework addressing limitations of Graph Neural Networks (GNNs) and Graph Transformers (GTs) in capturing long-range dependencies and out-of-range generalization. GHR operates jointly on the input graph and its hierarchical abstraction obtained through pooling, achieving strong performance on long-range dependencies while maintaining parameter efficiency. Empirical results demonstrate that GHR consistently outperforms existing graph models across long-range benchmarks, using as little as 1% of the parameters of state-of-the-art models. These findings suggest that increased model capacity alone may not suffice for generalization, offering a complementary direction to scaling graph foundation models.

graph neural networksgraph transformerslong-range dependenciesout-of-range generalizationparameter efficiency

Towards Ubiquitous Mapping and Localization for Dynamic Indoor Environments

arXiv cs.AI · Halim Djerroud, Nico Steyn, Olivier Rabreau, Patrick Bonnin · 2026-05-18

UbiSLAM introduces a fixed RGB-D camera network for real-time mapping and localization in dynamic indoor environments, addressing traditional SLAM limitations like environmental sensitivity and mobile sensor reliance. The system employs centralized, continuously updated maps to enhance robot localization accuracy, navigation, and human-robot interaction. Challenges in spatial coverage and blind spots are mitigated through automatic camera calibration and real-time data sharing protocols, reducing computational load on individual robots while improving system robustness.

ubislamrgb-d camerasslamreal-time mappinghuman-robot interaction

Probing SMEFT Operators through $t\bar{t}t\bar{t}$ Production with Hyper-Graph Neural Networks at the LHC

arXiv cs.AI · Amir Subba, Sanmay Ganguly · 2026-05-18

The study introduces a Hyper-Graph Neural Network (H-GNN) for analyzing $t\bar{t}t\bar{t}$ production in proton-proton collisions at $\sqrt{s} = 13$ TeV, targeting discrimination from Standard Model backgrounds. The H-GNN models events as hypergraphs, capturing higher-order correlations among jets and leptons to learn many-body kinematic structures. It achieves an AUC of 0.951 and a statistical significance of $Z = 9.11$ at $140~\mathrm{fb}^{-1}$, outperforming SPANet ($Z = 8.62$), Particle Transformer ($Z = 7.37$), and ATLAS ($Z = 5.13$). The method also constrains Wilson coefficients of dimension-six operators and projects sensitivity for HL-LHC luminosities.

hyper-graph neural networkstandard model effective field theorywilson coefficientsmany-body kinematicslhc phenomenology

QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

arXiv cs.AI · Anthony G. Cohn, Robert E. Blackwell · 2026-05-18

The authors introduce QSTRBench, a novel benchmark for evaluating large language models' (LLMs) qualitative spatial and temporal reasoning (QSTR) abilities. The benchmark tests compositional reasoning, converse relations, and conceptual neighbourhoods across multiple calculi including Point Algebra, Allen's Interval Algebra, and Region Connection Calculus variants (RCC-5/8/22). Methodologically, it systematically varies question presentation formats (prefix/infix, symbolic/nonce terms) and includes the first published RCC-22 conceptual neighbourhood. Results show all tested frontier LLMs outperform random guessing but fail to achieve perfect accuracy, with performance varying significantly by calculus (PA easiest, RCC-22 hardest). The benchmark and results are released openly.

qualitative spatial reasoningcomposition tablesconceptual neighbourhoodsregion connection calculusallen's interval algebra

Beyond Inference-Time Search: Reinforcement Learning Synthesizes Reusable Solvers

arXiv cs.AI · Soheyl Massoudi, Gabriel Apaza, Milad Habibi, Mark Fuge · 2026-05-18

This work demonstrates that reinforcement learning can synthesize reusable solvers for combinatorial optimization problems, shifting reasoning cost into model weights rather than relying on inference-time search. The authors fine-tune Qwen2.5-Coder-14B-Instruct using Group Relative Policy Optimization with feasibility-gated rewards on Synergistic Dependency Selection (SDS), achieving a 5.0% gap to the Virtual Best Solver while reducing execution costs by 91× compared to Best-of-64 sampling. The synthesized Simulated Annealing template generalizes across SDS instances and shows narrower transferability to Job Shop Scheduling. Ablation studies highlight sensitivity to reward normalization and domain-specific design choices.

reinforcement learningcombinatorial optimizationsimulated annealingfeasibility-gated rewardgroup relative policy optimization

The Hidden Cost of Contextual Sycophancy: an AI Literacy Intervention in Human-AI Collaboration

arXiv cs.AI · Cansu Koyuturk, Sabrina Guidotti, Dimitri Ognibene · 2026-05-18

This study identifies contextual sycophancy in LLMs during multi-turn human-AI collaboration, where models propagate user errors rather than correcting them, degrading task performance. A mixed-design experiment with 60 participants examined the impact of AI literacy interventions on survival ranking tasks, comparing general and sycophancy-focused prompting training. Results show LLMs mirror user reasoning, particularly with lower-quality inputs, but interventions reduced direct error propagation and improved AI advice quality. Findings suggest prompting and AI literacy alone are insufficient for epistemically independent AI support, necessitating system-level solutions for critical engagement.

contextual sycophancyllmsmulti-turn interactionai literacyerror propagation

Optimising CSRNet with parameter-free attention mechanisms for crowd counting in public transport

arXiv cs.AI · Aida Rostamza, Enrico Del Re, Joshua Cherian Varughese, Cristina Olaverri-Monreal · 2026-05-18

The paper introduces PFCASA, a novel parameter-free attention mechanism combining channel-wise (PFCA) and spatial-wise (SA) modules, optimized for crowd counting in public transport. Using CSRNet as the backbone, the authors evaluate PFCA, SA, and 3-D SimAM modules, constraining parameter increases to ≤1%. Experiments on ShanghaiTech show PFCASA excels in sparse crowds (<40 individuals), while PFCA performs better in dense scenarios, achieving comparable or superior accuracy without additional parameters.

crowd countingparameter-free attentioncsrnetdensity map estimationpublic transport

Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion

arXiv cs.AI · Peiliang Cai, Evelyn Zhang, Jiacheng Liu, Hao Lin · 2026-05-18

Focused Forcing introduces a training-free KV selection method for efficient autoregressive video diffusion, addressing the challenge of large KV caches in long-horizon generation. The method focuses cached history along generated-frame and head dimensions by preserving relevant historical frames through combined attention and diversity scores, while allocating larger budgets to heads with higher estimated importance. Across multiple autoregressive generation paradigms, Focused Forcing achieves up to 1.48× end-to-end acceleration without training, while improving visual quality and text alignment.

kv selectionautoregressive video diffusionattention scoresdiversity scoreshead importance

Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents

arXiv cs.AI · Wei Ma, Zhi Chen, Jingxu Gu, Tianling Li · 2026-05-18

The study examines whether behavioral patterns in LLM-based software engineering agents generalize across frameworks, analyzing 64,380 SWE-bench runs from 126 configurations spanning 43 frameworks. By holding either the LLM or framework fixed, it isolates their effects on agent behavior and resolution rates. Results show framework identity explains 64% of variance in mean turns (vs. 10% for LLM family), with behavioral signals often carrying opposite meanings across configurations—e.g., 47 agents resolve more issues with lower error rates while 48 do so with higher rates. This demonstrates the need for cross-framework validation of behavioral findings.

llm-based agentssoftware engineeringbehavioral analysiscross-framework validationerror rate

Causely: A Causal Intelligence Layer for Enterprise AI: A Benchmark Study on SRE and Reliability Workflows

arXiv cs.AI · Dhairya Dalal, Endre Sara, Ben Yemini, Christine Miller · 2026-05-18

The paper introduces Causely, a causal intelligence layer for enterprise AI that structures observability telemetry into a queryable model with ontological grounding. It benchmarks four AI agent configurations (Claude Code, OpenAI Codex, HolmesGPT with Sonnet/Gemini) in fault diagnosis scenarios with/without Causely. Results show 63% faster diagnosis, 60% lower token use, 78% fewer tool calls, 4.8× footprint reduction, 57% cost savings, and 25pp accuracy improvement (75%→100%) when using the causal layer.

causal intelligenceobservability telemetryroot-cause diagnosisopenai codexholmesgpt

Improved Baselines with Representation Autoencoders

arXiv cs.AI · Jaskirat Singh, Boyang Zheng, Zongze Wu, Richard Zhang · 2026-05-18

Representation Autoencoders (RAE) are improved through three key insights: (1) summing the last k encoder layers enhances reconstruction without finetuning, (2) RAE and representation alignment (REPA) exhibit complementary mechanisms, enabling joint use in intermediate diffusion layers, and (3) REPA enables classifier-free guidance by reparameterizing DiT outputs. RAEv2 achieves 10x faster convergence than original RAE, attaining a state-of-the-art gFID of 1.06 in 80 epochs on ImageNet-256 and FDr^k of 2.17 in 80 epochs, compared to previous bests of 3.26 in 800 epochs. RAEv2 also achieves EP_FID@2 in 35 epochs versus 177 for RAE, demonstrating superior training efficiency. Improvements generalize across text-to-image generation and navigation world models.

representation autoencodersclassifier-free guidancediffusion layerstraining efficiencyreconstruction

ISEP: Implicit Support Expansion for Offline Reinforcement Learning via Stochastic Policy Optimization

arXiv cs.AI · Yifei Chen, Shaoqin Zhu, Xiaoqiang Ji · 2026-05-18

ISEP introduces Implicit Support Expansion via stochastic Policy optimization for offline reinforcement learning, addressing the limitation of rigid constraints that restrict optimal behavior discovery. The method leverages a value function interpolated between in-distribution data and policy samples to implicitly expand the feasible action support, densifying high-reward regions and enabling policy improvement while ensuring bounded value error. To avoid mode collapse and invalid actions in the multimodal optimization landscape, ISEP employs a stochastic action selection strategy, alternating between conservative cloning and optimistic expansion signals. ISEP-FM, an instantiation of this framework, uses Conditional Flow Matching with classifier-free guidance to effectively capture the interpolated value signal.

offline reinforcement learningimplicit support expansionstochastic policy optimizationconditional flow matchingclassifier-free guidance

Wasserstein Equilibrium Decoding for Reliable Medical Visual Question Answering

arXiv cs.AI · Luca Hagen, Johanna P. Müller, Weitong Zhang, Mengyun Qiao · 2026-05-18

The paper introduces Wasserstein Equilibrium Decoding for medical visual question answering (VQA), extending game-theoretic decoding to vision-language models. The method employs a Wasserstein stopping criterion based on semantic consensus among candidate answers, improving convergence efficiency. Evaluated on VQA-RAD and PathVQA, the approach yields statistically significant improvements (+3.5 percentage points on VQA-RAD for Qwen3-VL-2B) over greedy baselines, achieving accuracy parity with 20% faster convergence. The technique demonstrates scalability, with Gemma-3-4B matching MedGemma-4B's performance on PathVQA without domain-specific tuning.

wasserstein equilibriummedical vqagame-theoretic decodingsemantic consensusconvergence efficiency

Alignment Dynamics in LLM Fine-Tuning

arXiv cs.AI · Yuhan Huang, Huanran Chen, Yinpeng Dong · 2026-05-18

This work introduces a unified framework for understanding alignment dynamics in LLM fine-tuning by decomposing alignment updates into a Rebound Force and a Driving Force. The Rebound Force depends on current alignment state and model distribution narrowness, while the Driving Force is determined by training distribution alignment with outcome-conditioned posteriors. The framework explains alignment reversal during fine-tuning and predicts a Rehearsal Priming Effect where prior alignment accelerates re-alignment upon re-exposure. Experiments validate predictions across safety alignment, emergent misalignment, and sentiment settings, demonstrating alignment reversal and accelerated re-alignment, with controlled experiments confirming the dependence of rebound strength on posterior narrowness.

alignment dynamicsrebound forcedriving forcerehearsal priming effectposterior narrowness

PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics

arXiv cs.AI · Xueyu Luan, Chenwei Shi · 2026-05-18

PH-Dreamer introduces a physics-driven world model via Port-Hamiltonian generative dynamics, addressing limitations of recurrent state space architectures by embedding implicit physical priors into recurrent transitions. The framework incorporates three mechanisms: modeling latent evolution as energy routing governed by flow and dissipation, estimating Hamiltonian and power balance from proprioceptive observations, and regularizing policy optimization using energy gradients and Lagrangian multipliers. Evaluated on visual control benchmarks, PH-Dreamer achieves superior asymptotic returns, improves simulator fidelity by aligning imagined and real rewards more tightly, reduces latent phase space volume by 4.18-8.41%, decreases energy consumption by up to 7.80%, and lowers mean squared jerk by up to 9.38%.

port-hamiltonianlatent evolutionenergy routinglagrangian multipliersproprioceptive observations

SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

arXiv cs.AI · Yufei Ma, Zihan Liang, Ben Chen, Zhipeng Qian · 2026-05-18

SD-Search introduces on-policy hindsight self-distillation for search-augmented reasoning agents, eliminating the need for external teachers or annotations. The method employs a single model in dual roles: a student conditioned on inference-time context and a teacher additionally conditioned on a hindsight block summarizing rollout outcomes. The student minimizes token-level Jensen-Shannon divergence to the teacher's query distribution, providing dense step-level supervision alongside GRPO's trajectory reward. This approach operates within the standard RL training loop without external inference or auxiliary pipelines.

self-distillationsearch-augmented reasoninghindsight blockjensen-shannon divergencerl training loop
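The token-level Jensen-Shannon objective in the SD-Search summary above can be illustrated with plain probability lists. The sketch below is a minimal version under assumptions: the distributions are made-up next-token probabilities, the function names are hypothetical, and the per-position divergences are simply averaged, which the summary does not specify.

```python
import math
from typing import List, Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats, skipping zero-probability entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def token_level_js(student: List[Sequence[float]],
                   teacher: List[Sequence[float]]) -> float:
    """Average JS divergence over token positions (averaging is an assumption)."""
    return sum(js(s, t) for s, t in zip(student, teacher)) / len(student)


# Hypothetical next-token distributions at two positions of a search query.
student = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]]
teacher = [[0.6, 0.3, 0.1], [0.1, 0.8, 0.1]]  # teacher also sees the hindsight block
print(token_level_js(student, teacher))
```

Unlike plain KL, the divergence stays finite when one distribution assigns zero probability to a token the other prefers, which keeps the per-token signal well-behaved.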

DARE-EEG: A Foundation Model for Mining Dual-Aligned Representation of EEG

arXiv cs.AI · Yang Shao, Peiliang Gong, Qun Dai, Daoqiang Zhang · 2026-05-18

DARE-EEG introduces a foundation model for EEG representation learning that enforces mask-invariance through dual-aligned self-supervised pre-training. The method combines mask alignment (contrastive learning of masked views) with anchor alignment (momentum-based feature consistency) to improve transferability. It also employs conv-linear-probing for parameter-efficient adaptation to heterogeneous EEG configurations. Evaluations show state-of-the-art accuracy, low parameter complexity, and superior cross-dataset portability across diverse benchmarks.

eegmask-invariancecontrastive learningfoundation modelself-supervised

CommitDistill: A Lightweight Knowledge-Centric Memory Layer for Software Repositories

arXiv cs.AI · Divya Chukkapalli, Thejesh Avula, Aditya Aggarwal, Harsimran Singh · 2026-05-18

CommitDistill introduces a lightweight, deterministic memory layer for software repositories that distills git history into typed knowledge units (Facts, Skills, Patterns) using regex and TF-IDF retrieval with a calibrated silence threshold (θ=2.5). The system operates locally, without embeddings or external services, and achieves a useful-precision of 0.525 (Cohen's kappa=0.633) on annotated Python units. At a 256-character query budget, CommitDistill achieves a 0.750 hit-rate, outperforming BM25 (0.333) and git log --grep (0.083). Extraction over 10,000 commits completes in under 4 seconds on a laptop.

git historytyped knowledgetf-idfdeterministic extractionsoftware repositories
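The "silence threshold" in the CommitDistill summary above means the retriever returns nothing when its best match scores too low. The sketch below illustrates that idea with a small TF-IDF scorer written from scratch. It is not the tool's implementation: the example units, the function names, and the smoothed IDF formula are assumptions, and θ = 2.5 is taken from the summary even though the score scale here may differ from the paper's.

```python
import math
import re
from collections import Counter
from typing import List, Optional, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9_]+", text.lower())


def build_index(units: List[str]) -> Tuple[List[Counter], dict]:
    """Term counts per unit plus a smoothed inverse document frequency."""
    docs = [Counter(tokenize(unit)) for unit in units]
    n = len(docs)
    df = Counter(token for doc in docs for token in doc)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return docs, idf


def retrieve(query: str, units: List[str], index, theta: float) -> Optional[str]:
    """Return the best-scoring unit, or None (silence) below the threshold."""
    docs, idf = index
    terms = tokenize(query)
    scored = [(sum(doc[t] * idf.get(t, 0.0) for t in terms), unit)
              for doc, unit in zip(docs, units)]
    score, unit = max(scored)
    return unit if score >= theta else None


# Hypothetical knowledge units distilled from commit messages.
units = [
    "Fact: retry flaky network tests three times",
    "Pattern: bump version in setup.py before release",
    "Skill: run migrations with alembic upgrade head",
]
index = build_index(units)
print(retrieve("alembic migrations", units, index, theta=2.5))
print(retrieve("kubernetes ingress", units, index, theta=2.5))  # stays silent
```

Staying silent on weak matches trades some recall for precision, which matters when every returned unit consumes a limited context budget.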

From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG

arXiv cs.AI · Changmin Lee, Jaemin Kim, Taesik Gong · 2026-05-18

EPIC (Efficient Preference-aligned Index Construction) introduces a method for optimizing on-device RAG pipelines by prioritizing user preferences as a compact and stable form of personal context. EPIC selectively retains preference-relevant information from raw data and steers retrieval toward preference-aligned contexts, addressing memory constraints. Evaluated across four benchmarks covering conversations, debates, explanations, and recommendations, EPIC reduces indexing memory by 2,404×, improves preference-following accuracy by 20.17 percentage points, and achieves 33.33× lower retrieval latency compared to baselines. On-device experiments demonstrate a memory footprint under 1 MB with 29.35 ms/query latency during streaming updates.

rag pipelineon-devicememory footprintpreference-alignedretrieval latency

CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook

arXiv cs.AI · Zeyu Chen, Jie Li, Kai Han · 2026-05-18

CodeBind introduces a decoupled representation learning framework for multimodal alignment, addressing cross-modal discrepancies and data scarcity through a modality-shared-specific codebook design. The method decomposes features into shared components for semantic consistency and specific components for modality-unique details, employing a compositional vector quantization scheme. This approach incrementally aligns target and bridging modalities without requiring fully paired data, mitigating representation bias by preventing dominant modalities from overshadowing others. Validated across nine modalities, CodeBind achieves state-of-the-art performance in multimodal classification and retrieval tasks.

multimodal alignment · vector quantization · representation bias · modality-shared-specific codebook · semantic consistency

Machine Unlearning for Masked Diffusion Language Models

arXiv cs.AI · Georu Lee, Seungwon Jeong, Hoki Kim, Jinseong Park · 2026-05-18

We introduce Masked Diffusion Unlearning (MDU), the first unlearning framework for masked diffusion language models (MDLMs). MDU minimizes forward KL divergence from prompt-conditional predictions to prompt-masked unconditional anchors at each masked response position, incorporating temperature scaling to balance privacy and utility. Experiments on standard benchmarks and MDLM architectures demonstrate that MDU outperforms existing unlearning methods for large language models. The method addresses a gap in machine unlearning research for MDLMs, which generate text via parallel denoising rather than autoregressive prediction.

masked diffusion unlearning · forward kl divergence · prompt-conditional prediction · temperature scaling · parallel denoising
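
As a rough illustration of the objective described above, the sketch below averages a KL divergence between a temperature-scaled anchor distribution and the prompt-conditional distribution over masked positions. The direction of the divergence, where the temperature is applied, and all function names are assumptions, since the summary does not specify them; this is not the authors' loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def unlearning_loss(cond_logits, anchor_logits, masked_positions, temperature=1.0):
    """Mean KL(anchor_T || conditional) over masked response positions.

    Assumption: the temperature softens the unconditional anchor only.
    """
    total = 0.0
    for pos in masked_positions:
        anchor = softmax(anchor_logits[pos], temperature)
        cond = softmax(cond_logits[pos])
        total += kl_divergence(anchor, cond)
    return total / len(masked_positions)


# Hypothetical logits for two response positions over a 3-token vocabulary.
COND = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
ANCHOR = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
```

In this toy setup the loss is zero at position 1, where the two distributions already agree, and positive at position 0, where the prompt-conditional prediction is still peaked.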

Privacy Preserving Reinforcement Learning with One-Sided Feedback

arXiv cs.AI · Lin William Cong, Guangyan Gan, Hanzhang Qin, Zhenzhen Yan · 2026-05-18

We propose POOL, a privacy-preserving reinforcement learning algorithm for multi-dimensional continuous state-action spaces with one-sided feedback, where agents receive partial state observations and limited reward information. POOL addresses challenges in learning efficiency and privacy preservation through a novel algorithmic framework. Theoretical analysis demonstrates that POOL achieves a sample complexity bound matching known lower bounds for non-private RL, quantified by privacy parameter E_rho, time horizon H, and optimality-gap parameter α. Results show strong privacy guarantees can be maintained without compromising learning efficiency, advancing practical privacy-aware RL in complex environments.

privacy-preserving reinforcement learning · one-sided feedback · sample complexity · multi-dimensional spaces · optimality-gap

Multilingual jailbreaking of LLMs using low-resource languages

arXiv cs.AI · Dylan Marx, Marcel Dunaiski · 2026-05-18

The study demonstrates that multi-turn conversations in low-resource African languages (Afrikaans, Kiswahili, isiXhosa, isiZulu) effectively jailbreak commercial LLMs, with success rates ranging from 41.8% to 83.6% across models like GPT-4o-mini and Claude 3.5 Haiku. Automated testing and human red-teaming revealed that translation quality significantly impacts jailbreak efficacy, with human intervention boosting average success rates from 59.8% to 75.8%. The findings highlight persistent LLM vulnerabilities in multilingual contexts, emphasizing the role of linguistic precision in safety bypass.

jailbreak · low-resource languages · multi-turn conversations · translation quality · safety mechanisms

SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

arXiv cs.AI · Khalid Yusuf Dahir · 2026-05-18

The study introduces SomaliWeb v1, the first dedicated Somali pretraining corpus with 819,322 documents (~303M tokens), a matched BPE-16K tokenizer, and a public language-identification benchmark. The corpus was constructed from three sources (HPLT v2, CC100, Somali Wikipedia) using a six-stage pipeline, revealing quality issues in existing distributions: 17.3% duplicates, 56.1% mojibake, and 10.7% near-duplicates in HPLT v2. The BPE-16K tokenizer reduces token count by 40.2% compared to GPT-4's cl100k_base on FLORES-200 Somali devtest.

somali · corpus · tokenizer · language-identification · benchmark

Are Sparse Autoencoder Benchmarks Reliable?

arXiv cs.AI · David Chanin · 2026-05-18

This study critically evaluates the reliability of benchmarks for sparse autoencoders (SAEs), a key interpretability tool for large language models. Through three methodological lenses—reseed noise analysis, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories—the authors identify significant flaws in SAEBench, the standard SAE evaluation suite. They find that Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR) fail multiple reliability tests, while other metrics exhibit higher reseed noise and lower discriminability than expected. The sae-probes variant of $k$-sparse probing emerges as the most reliable metric, though it still struggles to differentiate SAE architecture variants. The findings underscore the need for improved SAE benchmarks.

sparse autoencoders · interpretability · benchmarks · reseed noise · discriminability

Context Memorization for Efficient Long Context Generation

arXiv cs.AI · Yasuyuki Okoshi, Hao Mark Chen, Guanxi Lu, Hongxiang Fan · 2026-05-18

The paper introduces attention-state memory, a training-free method to enhance long-context generation in LLMs by externalizing prefixes into a lightweight lookup-based memory of precomputed attention states. This addresses limitations of prefix-augmented inference, specifically fading influence and linear attention scaling with prefix length. Evaluated on ManyICLBench with LLaMA-3.1-8B, the approach improves accuracy over in-context learning at 1K-8K memory budgets, reduces attention latency by 1.36x at 8K, and outperforms full-attention RAG on the NBA benchmark using 20% of its memory footprint.

attention-state memory · prefix-augmented inference · llama-3.1-8b · manyiclbench · nba benchmark

A Simplex Witness Certificate for Constant Collapse in Variational Autoencoders

arXiv cs.AI · Zegu Zhang, Jianhua Peng, Jian Zhang · 2026-05-18

The paper introduces a simplex witness certificate to address exact constant collapse in variational autoencoders (VAEs), where the encoder mean becomes input-independent. The method employs a fixed simplex witness head attached to the latent mean, enabling pre-design, monitoring, and certification of this failure mode. By aligning teacher-student losses and leveraging a closed-form inverse, the approach ensures that latent means cannot collapse to constants if the alignment loss falls below a baseline derived from teacher information. The framework also incorporates a computable view gap to handle teacher targets from different views, transforming constant collapse into a design-and-certificate problem.

variational autoencoders · simplex witness · constant collapse · latent mean · teacher-student alignment

Leveraging Graph Structure in Seq2Seq Models for Knowledge Graph Link Prediction

arXiv cs.AI · Luu Huu Phuc, Ratan Bahadur Thapa, Mojtaba Nayyeri, Jingcheng Wu · 2026-05-18

The paper introduces Graph-Augmented Sequence-to-Sequence (GA-S2S), a novel framework combining a T5-small encoder-decoder with a Relational Graph Attention Network (RGAT) for knowledge graph link prediction. Unlike existing Seq2Seq models that linearize graph neighborhoods, GA-S2S jointly encodes textual features and $k$-hop subgraph topology via relation-aware embeddings, preserving structural information. Preliminary experiments on CoDEx show GA-S2S achieves a 19% relative accuracy improvement over Seq2Seq baselines by leveraging multi-hop relational patterns.

graph-augmented seq2seq · relational graph attention network · knowledge graph · link prediction · multi-hop reasoning

SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning

arXiv cs.AI · Pawat Chunhachatrachai, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu · 2026-05-18

SpatioRoute introduces dynamic prompt routing for zero-shot spatial reasoning in egocentric video question answering, achieving up to 5% accuracy gains over fixed-prompt baselines on SQA3D. The method employs two complementary approaches: SpatioRoute-R, a rule-based router mapping question typologies to specialized prompts, and SpatioRoute-L, an LLM-driven router generating context-aware prompts without video input. Evaluation across multiple VLM families shows consistent improvements, with findings indicating that question-aware routing outperforms uniform Chain-of-Thought prompting for spatial video understanding.

zero-shot learning · vision-language models · prompt routing · spatial reasoning · egocentric video
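
A rule-based router of the kind attributed to SpatioRoute-R can be sketched as a list of ordered pattern rules that map a question to a specialized prompt. The question categories, regular expressions, and prompt texts below are hypothetical placeholders, not the typology or prompts used in the paper.

```python
import re

# Hypothetical prompt templates, one per question category.
PROMPTS = {
    "counting": "Count each matching object once, then report the total.",
    "direction": "Describe object positions relative to the camera wearer.",
    "existence": "Answer yes or no, based only on what is visible.",
    "default": "Answer the question about the scene concisely.",
}

# Ordered rules: the first matching pattern decides the route.
RULES = [
    ("counting", re.compile(r"\bhow many\b")),
    ("direction", re.compile(r"\b(left|right|behind|front|direction)\b")),
    ("existence", re.compile(r"^(is|are|can)\b")),
]


def route(question):
    """Return the category name for a question, or 'default' if no rule fires."""
    q = question.lower().strip()
    for name, pattern in RULES:
        if pattern.search(q):
            return name
    return "default"


def build_prompt(question):
    """Prepend the routed prompt template to the question."""
    return PROMPTS[route(question)] + "\nQuestion: " + question
```

Rule order matters in this sketch: "How many chairs are behind me?" is routed to the counting prompt because that rule is checked before the direction rule.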

Concise and Logically Consistent Conformal Sets for Neuro-Symbolic Concept-Based Models

arXiv cs.AI · Samuele Bortolotti, Emanuele Marconato, Andrea Pugnana, Andrea Passerini · 2026-05-18

The authors propose COCOCO, a conformal prediction framework for Neuro-Symbolic Concept-based Models (NeSy-CBMs) that jointly conformalizes concepts and labels while satisfying consistency, coverage, and conciseness desiderata. The method employs a deduction-abduction revision step to reconcile predictions, maintains distribution-free coverage guarantees, and accommodates imperfect knowledge and user-specified size budgets. Experiments across 8 datasets demonstrate COCOCO's superiority over baselines in performance and set size metrics.

neuro-symbolic models · conformal prediction · concept-based models · deduction-abduction · coverage guarantees
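
For readers unfamiliar with conformal prediction, the sketch below shows generic split conformal prediction for a single classifier, which is the standard source of distribution-free coverage guarantees. It does not implement COCOCO's joint conformalization of concepts and labels or its deduction-abduction revision step; the score function, data, and names are illustrative assumptions.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: include every label
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, qhat):
    """Keep every label whose nonconformity score (1 - probability) is <= qhat."""
    return {label for label, p in probs.items() if 1.0 - p <= qhat}


# Hypothetical calibration scores: 1 - p(true label) on held-out examples.
CAL_SCORES = [0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.9]
QHAT = conformal_quantile(CAL_SCORES, alpha=0.25)

CONFIDENT = {"cat": 0.7, "dog": 0.25, "fox": 0.05}
UNCERTAIN = {"cat": 0.5, "dog": 0.45, "fox": 0.05}
```

With these numbers the confident prediction yields a single-label set, while the uncertain one yields a two-label set, which is the set-size versus coverage trade-off that conciseness criteria aim to control.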

PIPER: Content-Based Table Search via profiling and LLM-Generated Pseudoqueries

arXiv cs.AI · Riccardo Terrenzi, Matteo Falconi, Serkan Ayvaz, Pierluigi Plebani · 2026-05-18

PIPER introduces a content-driven retrieval method for tabular datasets, addressing limitations of metadata-based search in data lakes and open data portals. The approach combines table profiling with LLM-generated pseudoqueries, embedded for dense retrieval, to capture both schema and cell value semantics. Evaluations show PIPER outperforms metadata-based baselines and TableQA retrieval methods, demonstrating the efficacy of LLM-based content modeling for dataset search in poor-metadata environments.

tabular datasets · dense retrieval · llm-generated queries · table profiling · dataset search

RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots

arXiv cs.AI · Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini · 2026-05-18

The paper introduces a hardware-agnostic framework for active, incremental 3D scene graph generation using RGB-only input, addressing limitations of depth sensor dependency and passive observation trajectories. The method unifies perception and planning through a shared structured representation capturing object semantics, 3D geometry, relational context, and multi-viewpoint information. Experiments on the Replica dataset demonstrate F1-score parity with ground-truth depth baselines, while active exploration on ReplicaCAD shows semantic-driven viewpoint selection detects over twice as many objects as geometric frontier-based methods under the same exploration budget. External RGB camera integration further enhances scene graph initialization and contextual understanding without additional exploration costs.

3d scene graph · rgb-only input · active exploration · semantic-driven viewpoint · hardware-agnostic

Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

arXiv cs.AI · Yajing Zhou, Xiangyu Kong · 2026-05-18

The paper introduces an Epistemic Sensory Bottleneck module to enhance Multi-Modal Large Language Models' (MLLMs) spatial reasoning in multi-agent environments, addressing the 'Cartesian Illusion' limitation. The proposed Anchor-Based Embodied Spatial Decomposition Chain-of-Thought (CoT) guides MLLMs through a geometric-to-semantic projection, dynamically weighting visual and auditory modalities based on Agent B's sensory constraints. Evaluations demonstrate that while current MLLMs struggle with spatial symmetry and out-of-view ambiguities (42% zero-shot accuracy), the sensory-bounded reasoning chain outperforms egocentric and allocentric baselines. This work establishes a foundational paradigm for epistemic, modality-aware inference in Embodied AI.

multi-modal large language models · theory of mind · epistemic sensory bottleneck · chain-of-thought · embodied ai

Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

arXiv cs.AI · Guining Cao, Jiaxin Peng, Chu Zeng, Yu Zhao · 2026-05-18

The paper proposes Pairwise Preference Reward and Group-based Diversity Enhancement (PPR-GDE), a reinforcement learning method for open-ended generation that avoids scalar rewards and enhances diversity. PPR-GDE integrates pairwise preference rewards to capture subjective evaluation, mitigates judge bias via order-swapped comparisons, and introduces group-based diversity rewards for semantic dispersion. Evaluated on role-playing tasks, PPR-GDE outperforms RL baselines in alignment quality and expressive diversity, with pairwise preferences crucial for subjective alignment and diversity metrics essential for semantic coverage.

reinforcement learning · open-ended generation · pairwise preference · diversity enhancement · semantic dispersion
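
The order-swapped comparison idea can be illustrated with a small sketch: each pair is judged twice with the presentation order reversed, and the two verdicts are averaged so that a constant position bias cancels. The judge interface, the toy judge, and the group reward definition are assumptions for illustration and are not taken from the paper.

```python
def debiased_preference(judge, prompt, a, b):
    """Estimate P(a preferred over b) by averaging both presentation orders.

    `judge(prompt, first, second)` is assumed to return the probability that
    the response shown first is the better one.
    """
    p_a_shown_first = judge(prompt, a, b)
    p_b_shown_first = judge(prompt, b, a)
    return 0.5 * (p_a_shown_first + (1.0 - p_b_shown_first))


def group_rewards(judge, prompt, responses):
    """Reward each response by its mean debiased win rate against the rest."""
    rewards = []
    for i, resp in enumerate(responses):
        others = [o for j, o in enumerate(responses) if j != i]
        wins = [debiased_preference(judge, prompt, resp, o) for o in others]
        rewards.append(sum(wins) / len(wins))
    return rewards


def toy_judge(prompt, first, second):
    """Hypothetical judge: prefers longer answers, plus a 0.2 first-position bias.

    Not clamped to [0, 1]; it exists only to demonstrate bias cancellation.
    """
    quality_gap = 0.05 * (len(first.split()) - len(second.split()))
    return 0.5 + quality_gap + 0.2
```

For two responses of equal quality the toy judge favours whichever is shown first, but `debiased_preference` returns 0.5, since the additive bias appears once with each sign.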

Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation

arXiv cs.AI · Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini · 2026-05-18

The paper introduces a framework for active, incremental 3D scene graph (3DSG) generation that integrates observations from fixed external RGB cameras as Common Prior Maps (CPMs) alongside onboard robot cameras. The RGB-only approach leverages a feed-forward 3D reconstruction model to process all camera inputs uniformly, enabling hardware-agnostic operation. A graph-based active semantic exploration strategy uses the partial scene graph to guide the robot toward regions of high semantic uncertainty, refining the scene prior iteratively. Experimental results show that incorporating a single external camera boosts initial object recall by up to 79% and enhances exploration efficiency through richer contextual priors.

3d scene graph · common prior maps · active exploration · semantic uncertainty · feed-forward reconstruction

Scalable Environments Drive Generalizable Agents

arXiv cs.AI · Jiayi Zhang, Fanqi Kong, Guibin Zhang, Maojia Song · 2026-05-18

The paper proposes environment scaling as a key requirement for training generalizable agents, contrasting it with trajectory and task scaling approaches that operate within fixed rule-sets. It introduces a taxonomy distinguishing these scaling paradigms by their impact on executable rule-sets, and synthesizes methods for constructing scalable environments using programmatic generators (for controllability) and generative world models (for open-endedness). The authors advocate coupling environment scaling with stateful learning mechanisms to address world-level distribution shifts, positioning it as essential for robust generalization.

environment scaling · generalizable agents · executable rule-sets · world-level distribution shift · stateful learning

MARS: Technical Report for the CASTLE Challenge at EgoVis 2026

arXiv cs.AI · Haoyu Zhang, Qiaohui Chu, Yisen Feng, Meng Liu · 2026-05-18

MARS introduces a multimodal agentic reasoning system for the CASTLE Challenge at EgoVis 2026, addressing the task of answering 185 closed-form questions over a complex multimodal dataset. The system processes four days of activity across 15 synchronized perspectives, leveraging primary sources (videos, transcripts) and auxiliary modalities (gaze, heartrate, photos, thermal imagery). Temporal evidence is compressed via DeepSeek-based summaries and captions due to video length constraints. A GPT-5.4 decision agent dynamically selects missing modalities or produces answers based on evidence sufficiency. MARS achieved second place on the CASTLE Challenge leaderboard, demonstrating effective multimodal reasoning and source selection.

multimodal reasoning · agentic decision · deepseek · temporal compression · evidence selection

Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs

arXiv cs.AI · Junyu Pan, Yansen Wang, Enze Zhang, Baoliang Lu · 2026-05-18

The paper introduces Generative Visual Grounding (GVG), a framework that enhances EEG understanding in multimodal LLMs (MLLMs) by generating instance-specific proxy images from EEG signals. GVG employs an EEG-to-image generative model to provide visual contexts, enabling MLLMs to leverage their visual priors for clinical interpretation. Evaluated on two backbones (GVG-X-Omni and GVG-Janus), the method shows competitive performance: GVG-X-Omni matches 1.7B-parameter text-aligned baselines while tuning only 170M parameters on a frozen 7B backbone. Trimodal alignment of EEG with image and text in GVG-Janus further improves EEG understanding by combining categorical text anchors with perceptual visual details.

generative visual grounding · eeg-to-image · multimodal llms · visual proxy · trimodal alignment

TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction

arXiv cs.AI · Tej Sanibh Ranade · 2026-05-18

The paper introduces TRACE (Trajectory Correction from Cross-layer Evidence for Hallucination Reduction), a training-free algorithm that mitigates hallucinations in LLMs by leveraging cross-layer evidence during inference. TRACE dynamically selects corrective operations (scalar reversal, earlier-state recovery, or candidate-space correction) based on internal model trajectories, requiring no external labels or calibration. Evaluated across 15 models, 8 families, and 3 benchmarks, TRACE consistently improves factuality, with mean gains of +12.26 MC1 and +8.65 MC2 points, and peak improvements reaching +47.20 MC1 and +43.38 MC2 points.

hallucination · cross-layer · trajectory · factuality · inference

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

arXiv cs.AI · Junming Liu, Yuqi Li, Yifei Sun, Maonan Wang · 2026-05-18

The paper introduces Spatial Alignment via Geometric Evolution (SAGE), a self-evolving framework to enhance spatial reasoning in Vision-Language Models (VLMs) by enforcing geometric and linguistic consistency. SAGE employs duality operations and a dynamic operation pool to probe inconsistencies, integrating geometric logic consistency as an auxiliary reward in GRPO training. This model-agnostic approach improves data efficiency and generalization, demonstrated by superior performance on video and spatial reasoning benchmarks compared to baselines.

sage · vlms · spatial reasoning · geometric consistency · grpo

Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models

arXiv cs.AI · Xinpeng Dong, Min Zhang, Kairong Han, Xu Tan · 2026-05-18

The Vision Inference Former (VIF) addresses limitations in multimodal large language models (MLLMs) by sustaining visual consistency during generation. Current connector-based paradigms project visual features into textual sequences, diminishing the visual modality's unique contribution and weakening vision-language alignment over extended generation lengths. VIF introduces a lightweight architectural module that continuously injects visual semantics into the decoding phase, ensuring the model remains grounded in visual content. Evaluated across 14 benchmark tasks, including general reasoning, OCR, and hallucination, VIF consistently improves performance across diverse architectures with minimal overhead.

multimodal large language models · visual semantics · decoding phase · vision-language alignment · connector-based paradigm

Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework

arXiv cs.AI · Mengyu Sun, Ziyuan Yang, Zunlong Zhou, Junxu Liu · 2026-05-18

The paper introduces ConceptAgent, a training-free multi-agent framework for black-box concept awakening in diffusion models (DMs), addressing limitations of current concept erasure methods. By analyzing denoising trajectories, the authors show that erased concepts persist in later stages due to the model's reliance on noisy states rather than textual conditions. ConceptAgent leverages surrogate-guided initialization to bypass erased mappings, achieving accurate concept awakening without accessing model internals. Experiments demonstrate its effectiveness, revealing fundamental vulnerabilities in concept erasure techniques and offering insights into DM dynamics.

concept awakening · diffusion models · black-box attack · denoising trajectory · multi-agent framework

Evidence-Grounded Frontier Mapping and Agentic Hypothesis Generation in Nanomedicine

arXiv cs.AI · Christiaan G. A. Viviers, Koen de Bruin, Mirre M. Trines, Ayla M. Hokke · 2026-05-18

We introduce pArticleMap, a literature-mapping and hypothesis-generation system for nanomedicine that integrates article embeddings, similarity-graph analysis, sparse frontier extraction, and audited LLM workflows to identify low-density bridge regions and generate citation-grounded hypotheses. The system employs a retrospective realization benchmark and blinded human assessment to evaluate its performance. Results show a pooled gold recovery rate of 10.8%, recall@10 of 15.9%, and future-neighborhood rate of 61.0%, indicating effective forward-looking neighborhood identification. Human-agent agreement is modest, suggesting internal scoring serves as a support signal rather than replacing expert judgment. pArticleMap emerges as a conservative, evidence-grounded research assistant for nanomedicine.

article embeddings · similarity-graph analysis · sparse frontier extraction · citation-grounded hypotheses · retrospective realization benchmark

Generative AI and the Productivity Divide: Human-AI Complementarities in Education

arXiv cs.AI · Lihi Idan, Bharat Anand · 2026-05-18

This study investigates heterogeneous productivity effects of Generative AI (GenAI) in education through a randomized controlled experiment comparing traditional self-study resources with large-language-model (LLM) assistance. Participants, modeled as early-career knowledge workers, demonstrated significant average performance gains with GenAI access, but outcomes varied substantially based on AI Interaction Competence (AIC) — the ability to elicit, filter, and verify model outputs. High-AIC participants achieved disproportionate benefits, while low-AIC users saw limited or negative returns. A scaffolding intervention using conceptual maps reduced outcome variance, suggesting standardized workflows can mitigate AI-mediated performance inequality. Findings highlight human-AI complementarities, where GenAI increases mean productivity but introduces new capability disparities.

generative ai · large-language-model · ai interaction competence · randomized controlled experiment · human-ai complementarities

An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments

arXiv cs.AI · Hongjang Yang, Hyunsik Na, Daeseon Choi · 2026-05-18

The paper demonstrates a privacy-leakage attack chain via indirect prompt injection in black-box chatbot environments, where attackers manipulate external content to hijack agent tasks without access to model internals. It introduces 'exemplification', a novel prompt-injection technique that reframes user prompts and benign content as few-shot examples before injecting malicious objectives, comparing its success rate against prior fake-completion methods. Experimental results in controlled settings show feasible data exfiltration by combining prompt injection, instruction steering, and web-tool invocation.

prompt injection · privacy leakage · black-box attack · few-shot learning · data exfiltration

Who Generated This 3D Asset? Learning Source Attribution for Generative 3D Models

arXiv cs.AI · Sihan Ma, Siyuan Liang, Dacheng Tao · 2026-05-18

The study introduces the first passive source attribution benchmark for generative 3D models, addressing dispersed attribution signals and realistic deployment constraints. It identifies cross-view inconsistency and structural artifacts as stable fingerprints left by generative 3D models. A hierarchical multi-view multi-modal Transformer is proposed to fuse appearance, geometric, and frequency-domain features within each view and model global relationships across views. Experiments demonstrate 97.22% accuracy under full supervision and 77.17% accuracy with only 1% training data, showing stable and attributable fingerprints in modern 3D generators.

source attribution · generative 3d models · cross-view inconsistency · structural artifacts · multi-modal transformer

POST: Prior-Observation Adversarial Learning of Spatio-Temporal Associations for Multivariate Time Series Anomaly Detection

arXiv cs.AI · Suofei Zhang, Yaxuan Zheng, Haifeng Hu · 2026-05-18

The paper proposes POST, a novel multivariate time series anomaly detection framework addressing spatial over-generalization in existing GNN-sequence model hybrids. The method employs prior-observation adversarial learning to alternately optimize adjacency matrices (structural prior) and minimize association discrepancies with data-driven observations, enhancing both temporal detection sensitivity and channel-wise anomaly localization. Evaluations on public datasets and a new synthetic benchmark with precise annotations show state-of-the-art performance in time-wise detection and spatial localization tasks.

multivariate time series · anomaly detection · graph neural networks · adversarial learning · spatio-temporal modeling

TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning

arXiv cs.AI · ZhiYuan Feng, Yu Deng, Ruichuan An, Zhenhua Liu · 2026-05-18

TaskGround introduces a Ground-Infer-Execute framework for full-scene household reasoning, enabling agents to infer executable task structures from complete household scenes and situated requests. The method grounds scenes into compact task-relevant slices, infers task structures, and compiles them into skill-level action sequences, operating training-free and model-agnostically. Evaluated on FullHome, a human-validated suite of 400 household tasks, TaskGround significantly improves task success rates across proprietary and open-weight models. Notably, it reduces input-token costs by up to 18x while making Qwen3.5-9B competitive with GPT-5 under complete-scene prompting, highlighting structured grounding as crucial for practical household deployment.

full-scene reasoning · task-structure inference · skill-level actions · open-weight models · task-relevant slices

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

arXiv cs.AI · Tim Tsz-Kit Lau, Weijie Su · 2026-05-18

The paper introduces a symmetry-compatible principle for optimizer design, requiring gradient updates to respect the equivariance structures of parameter blocks. It presents specialized update rules for matrix layers (e.g., embeddings, LM heads, SwiGLU MLPs, MoE routers) that maintain symmetry under orthogonal, permutation, and shared-shift groups. Experiments on Qwen3-0.6B, Gemma 3 1B, OLMoE-1B-7B, and downsized GPT architectures demonstrate consistent improvements in validation loss and training stability over AdamW baselines.

equivariant optimization · symmetry-compatible · moe routers · swiglu · spectral descent

Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

arXiv cs.AI · Jiahe Guo, Xiangran Guo, Jiaxuan Chen, Weixiang Zhao · 2026-05-18

The paper identifies Safety Geometry Collapse in multimodal LLMs, where modality-induced drift compromises safety capabilities by compressing refusal direction separability. The authors quantify this via conditional refusal separability, demonstrating that drift strength inversely correlates with safety performance. They propose ReGap, an inference-time method that adaptively corrects drift using self-rectification signals, improving safety without utility loss across multiple benchmarks. Experiments show drift correction restores refusal separability and enables harm recognition.

multimodal safety · modality-induced drift · refusal separability · self-rectification · representation alignment

SENSE: Satellite-based ENergy Synthesis for Sustainable Environment

arXiv cs.AI · Kailai Sun, Mingyi He, Heye Huang, Can Rong · 2026-05-18

SENSE proposes a generative urban building energy modeling framework that synthesizes satellite imagery with aligned energy consumption and height maps using a controllable diffusion model. The method conditions on road networks and urban density metrics, leveraging latent space representations from large vision models to address data scarcity. Evaluated across four cities, SENSE achieves ASHRAE-compliant physical consistency, reduces prediction errors by 3-11% NMBE and 1-9% CVRMSE versus SOTA, and boosts downstream task performance by 10% IoU with only 20% labeled data.

urban building energy modeling · controllable diffusion model · latent space synthesis · satellite imagery generation · energy consumption prediction

Learning to Solve Compositional Geometry Routing Problems

arXiv cs.AI · Mingfeng Fan, Jianan Zhou, Jiaqi Cheng, Yifeng Zhang · 2026-05-18

The paper introduces DiCon, a differential attention-assisted solver with contrastive learning, for addressing the Compositional Geometry Routing Problem (CGRP), a unified superclass encompassing point-only, line-only, area-only, and hybrid task geometries. DiCon employs a differential attention mechanism to suppress less competitive candidate actions and a double-level contrastive learning objective to enhance global instance representations and geometry-aware task representations. Extensive experiments demonstrate DiCon's strong performance, versatility, and superior generalization across diverse CGRP instances.

compositional geometry routing problem · differential attention · contrastive learning · task geometries · instance representations

Parameterized 4-Qubit EWL Quantum Game Circuits with Dirac-Solow-Swan Hamiltonian Integration for Quadruple Helix Disruptive Innovation Recommender Systems

arXiv cs.AI · Agung Trisetyarso, Fithra Faisal Hastiadi, Kridanto Surendro · 2026-05-18

The paper introduces a parameterized 4-qubit Eisert-Wilkens-Lewenstein (EWL) quantum game circuit for recommender systems in quadruple helix innovation ecosystems. The circuit uses local strategy operators tuned by real funding data from the CORDIS Horizon Europe database, achieving 22 gates and depth 11 with O(n) scaling. Measurement probabilities serve as innovation trend scores, mapped into a Dirac-Solow-Swan Hamiltonian for capital trajectory simulation. Numerical experiments on CORDIS networks demonstrate NISQ compatibility and high-fidelity forecasting of disruptive innovation dynamics. The framework integrates quantum game theory, parameterized circuits, and economic growth models.

ewl quantum game · dirac-solow-swan hamiltonian · parameterized quantum circuits · quadruple helix innovation · nisq compatibility

LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

arXiv cs.AI · Sangjun Bae, Yisak Park, Sanghyeon Lee, Seungyul Han · 2026-05-18

We introduce LLM-driven Multi-Agent Communication (LMAC), a novel approach leveraging large language models (LLMs) to design efficient communication protocols for multi-agent reinforcement learning (MARL). LMAC iteratively refines the protocol using an explicit state-awareness criterion, enabling agents to reconstruct the underlying state more accurately and uniformly while minimizing knowledge disparities. Evaluated across diverse MARL benchmarks, LMAC demonstrates improved state reconstruction and significant performance gains over existing communication baselines.

multi-agent reinforcement learning · communication protocol · state reconstruction · large language models · state-awareness criterion

A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback

arXiv cs.AI · Anika Tabassum, Md Sifat Hossain, Md. Fahim Arefin, Tariqul Islam · 2026-05-18

A-ProS introduces a multi-model feedback framework for autonomous programming, combining ChatGPT-based generators (GPT-4, GPT-5) with three debugging critics (Codestral-2508, Llama-3.3-70B, DeepSeek-R1) in a 2 × 3 factorial design. The system iteratively refines solutions using execution feedback, evaluated on 367 competitive programming problems from ICPC World Finals and Codeforces. Results show GPT-5 workflows improve from 39 to 85-90 accepted solutions after three refinement rounds, while GPT-4 improves from 15 to 31-38. Stateful refinement outperforms stateless approaches by 8.5-10.6 percentage points and reduces repeated failures by up to 3.5x, demonstrating the efficacy of persistent context and multi-model feedback.

autonomous programming · multi-model feedback · competitive programming · stateful refinement · execution feedback

Improving Spatio-Temporal Residual Error Propagation by Mitigating Over-Squashing

arXiv cs.AI · Seyed Mohamad Moghadas, Esther Rodrigo Bonet, Bruno Cornelis, Adrian Munteanu · 2026-05-18

The paper introduces Teger, a structured uncertainty module for probabilistic multivariate time-series forecasting that addresses residual error propagation in recurrent models. Teger employs a spatial curvature-aware graph rewiring mechanism based on discrete Forman curvature to strengthen information-bottleneck edges, integrated with a low-rank-plus-diagonal covariance head for tractable inference via the Woodbury identity. Evaluated on LSTM, Transformer, and xLSTM backbones across four real-world datasets, Teger consistently improves Continuous Ranked Probability Score (CRPS). Theoretical analysis connects curvature-aware rewiring to over-squashing alleviation, spectral connectivity, effective resistance reduction, and covariance calibration bounds.

residual error propagation · discrete forman curvature · low-rank-plus-diagonal covariance · continuous ranked probability score · oversquashing alleviation
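
A minimal sketch of why a low-rank-plus-diagonal covariance keeps inference tractable, assuming a covariance of the form Σ = diag(d) + U Uᵀ with an n × r factor U. The Woodbury identity reduces solving with Σ to an r × r system. The function names and the small Gauss-Jordan solver are illustrative and not taken from the paper.

```python
def solve_small(A, b):
    """Solve A x = b for a small dense system (Gauss-Jordan, partial pivoting)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def woodbury_solve(d, U, b):
    """Return x = (diag(d) + U U^T)^{-1} b via the Woodbury identity.

    d: n positive diagonal entries; U: n x r low-rank factor; b: n entries.
    Only an r x r system is solved, instead of an n x n one.
    """
    n, r = len(d), len(U[0])
    dinv_b = [b[i] / d[i] for i in range(n)]
    # Capacitance matrix C = I_r + U^T D^{-1} U (r x r).
    C = [[(1.0 if j == k else 0.0)
          + sum(U[i][j] * U[i][k] / d[i] for i in range(n))
          for k in range(r)] for j in range(r)]
    # Right-hand side U^T D^{-1} b.
    rhs = [sum(U[i][j] * dinv_b[i] for i in range(n)) for j in range(r)]
    y = solve_small(C, rhs)
    # x = D^{-1} b - D^{-1} U y.
    return [dinv_b[i] - sum(U[i][j] * y[j] for j in range(r)) / d[i]
            for i in range(n)]
```

With r much smaller than n, the cost is dominated by forming the r × r matrix, which is the usual reason for choosing this covariance structure.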

FLAG: Foundation model representation with Latent diffusion Alignment via Graph for spatial gene expression prediction

arXiv cs.AI · Qi Si, Penglei Wang, Yushuai Wu, Yifeng Jiao · 2026-05-18

FLAG introduces a diffusion-based framework for spatial gene expression prediction that models biological structures through structured distribution modeling, addressing the Gene Dimension Curse via a spatial graph encoder and Gene Foundation Model alignment. The method preserves gene-gene and gene-spatial relationships by integrating topological consistency and gene-gene fidelity in generation. Experiments show FLAG achieves competitive accuracy (PCC/MSE) while significantly improving structural fidelity, measured by novel metrics Gene Structural Correlation (GSC) and Spatial Structural Correlation (SSC).

spatial gene expression · diffusion-based framework · gene dimension curse · gene foundation model · structural correlation

DocOS: Towards Proactive Document-Guided Actions in GUI Agents

arXiv cs.AI · Jingjing Liu, Ziye Huang, Zihao Cheng, Zeming Liu · 2026-05-18

The paper introduces Proactive Document-Guided Action, a novel paradigm enabling GUI agents to autonomously search for and utilize online documentation to resolve long-tailed tasks in dynamic web environments. This approach addresses the limitations of static parametric knowledge by mimicking human problem-solving strategies. The authors propose DocOS, a benchmark evaluating agents' ability to navigate web browsers, locate relevant documentation, comprehend procedural instructions, and execute precise GUI actions. Experimental results identify dual bottlenecks: agents' unreliable information retrieval during proactive search and frequent failures in grounding instructions into accurate actions, highlighting document-guided interaction as a critical pathway for self-evolving GUI agents.

gui agents · document-guided action · docos benchmark · proactive search · dynamic environments

Confidence-Gated Robot Autonomy: When Does Uncertainty Actually Help?

arXiv cs.AI · Johannes A. Gaus, Jhon P. F. Charaja, Daniel Haeufle · 2026-05-18

The study demonstrates that uncertainty metrics in robotic autonomy systems are effective for selective gating only when base models achieve a certain competence level, with threshold selection having greater impact than uncertainty estimation method. The evaluation employs Spearman rank correlation, paired bootstrap equivalence testing, and act/defer agreement across three temporal activity-recognition benchmarks. Results show that softmax heuristics, MC Dropout, and ensembles yield similar gating behavior above the competence regime, while semantic out-of-distribution detection remains near chance under temporal covariate shift. Embodied simulations confirm these patterns for collision rate and cost metrics.

uncertainty metrics · temporal activity-recognition · selective gating · semantic novelty detection · temporal covariate shift

Exploring Trust Calibration in XAI - The Impact of Exposing Model Limitations to Lay Users

arXiv cs.AI · Alfio Ventura, Tim Katzke, Jan Corazza, Mustafa Yalçıner · 2026-05-18

This preregistered study (N=418) investigates trust calibration in explainable AI (XAI) for skin-lesion classification by manipulating limitation disclosures and performance evidence. Using a between-subject design with five onboarding conditions, participants evaluated 15 cases with fixed XAI outputs (malignancy score, reliability score, saliency map). Hierarchical mixed-effects models revealed that limitation disclosure significantly impacted case-wise trust calibration, while short-term experience did not. Stimulus packages explained more variance than experimental manipulations. Participants struggled to differentiate perceived trust, trustworthiness, and accuracy. The study provides open materials for replication and discusses implications for XAI limitation communication.

trust calibration · xai · limitation disclosure · hierarchical mixed-effects · skin-lesion classification

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

arXiv cs.AI · Kazuki Kawamura, Satoshi Waki, Kei Tateno · 2026-05-18

PROTEA introduces a unified interface for offline evaluation and iterative refinement of multi-agent LLM workflows, addressing debugging challenges through test-driven improvement. The system executes workflows, scores intermediate outputs with configurable rubrics, and localizes bottlenecks via backward node evaluation—generating node-level expectations from final-answer references. It supports targeted prompt revisions with before/after comparisons and automated reruns. Evaluations show accuracy improvements from 64.3% to 83.9% in document inspection and Hit@5 from 0.30 to 0.38 in recommendation tasks. Developers valued graph-level localization and editable prompt revisions.

multi-agent llm workflows · backward node evaluation · test-driven improvement · configurable rubrics · prompt revisions

Quantum Sidecar Architectures for Hybrid AI Training and Inference: Stateful Protected Registers, Stateless Reset-and-Reprepare Circuits and Quantum Weight-State Outlook

arXiv cs.AI · Y. Mo, G. D. Su · 2026-05-18

The paper proposes a quantum sidecar architecture family for hybrid AI systems, distinguishing between stateful protected-register and stateless reset-and-reprepare operational modes. The stateful mode employs protected qubits with QND-style readout (simulated via 2-8 qubit density matrices and Qiskit verification), while the stateless mode uses task-conditioned circuits with QAOA-style sampling and reset overhead analysis. Results position quantum sidecars as bounded signal generators for classical AI pipelines, with speculative extensions to quantum weight-state representations. Simulations demonstrate feasibility for optimizer sampling, adapter selection, and reasoning-path proposal tasks.

quantum sidecar · protected-register · reset-and-reprepare · qnd-style readout · quantum weight-state

FedSDR: Federated Self-Distillation with Rectification

arXiv cs.AI · Ziheng Ren, Zhanming Shen, Hao Wang, Ning Liu · 2026-05-18

FedSDR introduces Federated Self-Distillation with Rectification to address statistical heterogeneity in federated fine-tuning of Large Language Models. The method combines Federated Self-Distillation (FedSD) with a dual-stream mechanism: a local LoRA-S branch for smoothing client representations and a global LoRA-R branch for enforcing factual correctness. By selectively aggregating LoRA-R, FedSDR achieves global alignment and reduces hallucinations. Extensive experiments demonstrate its superiority over conventional federated learning algorithms, resolving the Rewrite Paradox inherent in unconstrained self-distillation.

federated learning · self-distillation · statistical heterogeneity · lora · rectification

TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications?

arXiv cs.AI · Jieting Xiao, Yun Lin, Huizhen Qiu, Rui Ma · 2026-05-18

TeleCom-Bench introduces a standardized evaluation framework for assessing Large Language Models in telecommunications applications, addressing the gap in domain-specific benchmarks. The benchmark comprises 12 evaluation sets with 22,678 samples, structured into Multi-dimensional Knowledge Comprehension and End-to-End Knowledge Application. It evaluates LLMs across six core tasks derived from live network agent workflows. Results show that while LLMs achieve 90% accuracy in linguistic interface tasks like intent recognition, their performance drops to approximately 30% in procedural execution tasks such as solution generation, highlighting a significant capability gap. The dataset and evaluation code are publicly available for domain-specific alignment.

telecommunications · benchmark · knowledge comprehension · procedural execution · domain-specific alignment

Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning

arXiv cs.AI · Sunwoo Lee, Mingu Kang, Yonghyeon Jo, Seungyul Han · 2026-05-18

The interaction-breaking adversarial learning (IBAL) framework enhances robustness in multi-agent reinforcement learning (MARL) by addressing vulnerabilities in inter-agent interactions. Unlike prior robust MARL methods focused on value-oriented attacks, IBAL constructs adversarial perturbations targeting agents' observations and actions to disrupt coordination, leveraging an information-theoretic perspective. Agents are trained to maintain reliable performance under these disruptions. Empirical evaluations demonstrate IBAL's superiority over existing robust MARL baselines across diverse attack scenarios, including improved resilience in agent-missing conditions.

multi-agent reinforcement learning · adversarial learning · interaction-breaking · information-theoretic · robustness

Unveiling Memorization-Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise

arXiv cs.AI · Linyu Liu, Pinyan Lu · 2026-05-18

This work elucidates the coexistence of memorization and generalization in over-parameterized models through modular arithmetic tasks under heavy label noise. Using two-layer neural networks, the authors demonstrate that larger models generalize better with appropriate optimization, while noisy labels are memorized faster than clean data. Despite 80% label noise, near-perfect test accuracy is achieved by extracting internal generalization structures via frequency-based methods. A task-agnostic method partitions networks into generalization and memorization components, though its generalization improvement is limited compared to frequency-based extraction, indicating distributed generalization structures across neurons.

over-parameterized models · modular arithmetic · label noise · frequency-based methods · generalization structure

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

arXiv cs.AI · Boyuan Sun, Bowen Yin, Yuanming Li, Xihan Wei · 2026-05-18

SWIM (See What I Mean) introduces a novel training strategy for aligning vision and language representations to enable fine-grained object understanding from textual prompts without explicit visual prompts at inference. The method leverages mask supervision during training to guide cross-modal attention, addressing systematic discrepancies in pretrained multimodal large language models (MLLMs) where attribute words produce localized visual activations but object nouns yield diffuse patterns. SWIM enforces spatial consistency between multi-layer cross-attention maps and ground-truth masks using the NL-Refer dataset. Experiments show SWIM improves text-visual alignment and outperforms visual-prompt-based methods on fine-grained object understanding benchmarks.

cross-modal attention · multimodal large language models · fine-grained object understanding · mask supervision · spatial consistency

TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model

arXiv cs.AI · Zhaoyuan Ding, Yijing Yang, Han Shu, Xinghao Chen · 2026-05-18

TinySAM 2 introduces memory-efficient techniques for video segmentation, addressing SAM 2's computational bottlenecks. The method employs a memory quality management mechanism to retain highly informative frames and joint spatial-temporal token compression using average pooling and token-level similarity measurement. RepViT serves as the lightweight image encoder, reducing parameter count. Evaluations on DAVIS and SA-V datasets show TinySAM 2 achieves 90% of SAM 2.1's performance with only 7% of the memory tokens and 3% of the training data, significantly lowering deployment costs and computational load.

memory compression · video segmentation · token compression · lightweight encoder · memory bank

SAS: Semantic-aware Sampling for Generative Dataset Distillation

arXiv cs.AI · Mingzhuo Li, Guang Li, Linfeng Ye, Jiafeng Mao · 2026-05-18

The paper introduces SAS (Semantic-aware Sampling), a novel approach for generative dataset distillation that leverages CLIP's semantic prior to enhance sample quality. The method employs three semantic scoring functions to assess class relevance, inter-class separability, and intra-set diversity, followed by a two-stage sampling strategy for discriminative and diverse dataset construction. Experiments across multiple benchmarks show consistent performance improvements, demonstrating the efficacy of semantic-aware distillation.

dataset distillation · contrastive learning · semantic scoring · clip · diversity-aware selection

Spiker-LL: An Energy-Efficient FPGA Accelerator Enabling Adaptive Local Learning in Spiking Neural Networks

arXiv cs.AI · Alessio Caviglia, Filippo Marostica, Alessandro Savino, Stefano Di Carlo · 2026-05-18

Spiker-LL introduces an FPGA-based accelerator enabling energy-efficient local learning in Spiking Neural Networks (SNNs) through hardware-algorithm co-design. The architecture extends the Spiker+ inference framework with microarchitectural optimizations for the STSF learning rule, supporting both inference and online training with minimal overhead. Evaluations on MNIST, F-MNIST, and DIGITS demonstrate 93% accuracy, <1 ms latency, and 0.1 mJ per inference while maintaining DSP-free operation and scalability for edge deployments.

spiking neural networks · fpga accelerator · local learning · edge computing · energy efficiency

Shared Backbone PPO for Multi-UAV Communication Coverage with Connection Preservation

arXiv cs.AI · Z. Jiang · 2026-05-18

The paper introduces Shared Backbone Proximal Policy Optimization (Shared Backbone PPO), a reinforcement learning algorithm that shares a base module between Actor and Critic networks to enhance training efficiency and performance. The method is evaluated on a multi-UAV swarm communication coverage task with connectivity preservation, outperforming standard PPO. A graph information aggregation module is integrated to model inter-agent communication, enabling improved cooperative behavior in the trained swarm.

proximal policy optimization · multi-uav swarm · graph information aggregation · actor-critic networks · connectivity preservation

Verify-Gated Completion as Admission Control in a Governed Multi-Agent Runtime: A Bounded Architecture Case Study

arXiv cs.AI · Hai-Duong Nguyen, The-Xuan Tran · 2026-05-18

The preprint introduces verify-gated completion as an admission-control pattern for governed multi-agent runtimes, where agents propose completions but a read-only verifier decides admission. Ambiguous cases resolve fail-closed, with packetized state and event traces enabling auditability. A bounded reference implementation demonstrated 99.5% verify success (1,791/1,800) on invoked verification events, though task-level coverage remains uncomputable. A shadow Policy/Governance Verifier showed 98.58% rule agreement (1,526/1,548) and 0 false-success cases, maintaining advisory status. Evidence supports inspectable, fail-closed completion decisions under observed conditions, but broader claims remain unaddressed.

verify-gated completion · admission-control · multi-agent runtime · packetized state · fail-closed
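
A minimal sketch of a fail-closed admission gate, assuming a verifier that returns one of three verdicts. The `Verdict` enum, `run_gate`, and the trace record layout are hypothetical names for illustration, not the preprint's implementation.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    AMBIGUOUS = "ambiguous"


def admit_completion(verdict):
    """Fail-closed rule: only an explicit PASS admits a completion."""
    return verdict is Verdict.PASS


def run_gate(proposal, verifier, trace):
    """Ask a read-only verifier about an agent's proposed completion and
    record the decision in an append-only trace for later audit."""
    try:
        verdict = verifier(proposal)
    except Exception:
        # A crashing verifier is treated as ambiguous, hence not admitted.
        verdict = Verdict.AMBIGUOUS
    if not isinstance(verdict, Verdict):
        # Unrecognized outputs are also treated as ambiguous.
        verdict = Verdict.AMBIGUOUS
    admitted = admit_completion(verdict)
    trace.append({"proposal": proposal,
                  "verdict": verdict.value,
                  "admitted": admitted})
    return admitted
```

The design choice is that the agent can only propose; every path that does not end in an explicit PASS, including errors, leaves the task uncompleted.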

MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization

arXiv cs.AI · Le Su, Xing Luo, Zhi Jin · 2026-05-18

The paper introduces Module-Adaptive Residual Reconstruction (MARR), a method for low-bit post-training quantization (PTQ) that addresses bias from Hessian-approximation assumptions in residual reconstruction. MARR assigns module-specific scaling coefficients to balance error correction and residual-related bias, using a PID-based adaptive update strategy for efficient coefficient estimation. Experiments on large language models (LLMs) and vision transformers (ViTs) show performance gains of up to 20.2% and 4.6%, respectively, under ≤4-bit quantization compared to state-of-the-art methods.

quantization · residual reconstruction · hessian-approximation · pid controller · low-bit

Towards Sustainable Growth: A Multi-Value-Aware Retrieval Framework for E-Commerce Search

arXiv cs.AI · Yifan Wang, Yixuan Wang, YiDan Liang, Qiang Liu · 2026-05-18

We propose GrowthGR, a Multi-Value-Aware retrieval framework for e-commerce search that balances immediate conversion and long-term item growth. The framework consists of two components: (1) ItemLTV, which employs counterfactual inference to quantify long-term value increments from user interactions, and (2) MultiGR, which integrates semantic-ID-based generative retrieval with Multi-Value-Aware Policy Optimization (MoPO) to align with multi-stage online values. Deployed on Taobao's platform, GrowthGR achieved a 5.3% increase in new item GMV and a 0.3% gain in overall search GMV, demonstrating improved ecosystem value through enhanced new item exposure and growth potential.

counterfactual inference · generative retrieval · semantic-id · policy optimization · gmv

Stable Audio 3

arXiv cs.AI · Zach Evans, Julian D. Parker, Matthew Rice, CJ Carr · 2026-05-18

Stable Audio 3 introduces a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing, addressing the inefficiency of full-length generation for short sounds. The models leverage a novel semantic-acoustic autoencoder to project audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and semantic structure. Adversarial post-training reduces the number of inference steps while improving fidelity and prompt adherence. Trained on licensed and Creative Commons data, the models generate music and sounds in under 2 s on an H200 GPU and in a few seconds on a MacBook Pro M4. Small and medium model weights are released for consumer-grade hardware.

latent diffusion models · semantic-acoustic autoencoder · adversarial post-training · variable-length generation · audio fidelity

Predictive Prefetching for Retrieval-Augmented Generation

arXiv cs.AI · Wuyang Zhang, Shichao Pei · 2026-05-18

We introduce an advanced asynchronous retrieval framework for Retrieval-Augmented Generation (RAG) that enables predictive prefetching aligned with evolving information needs. The framework employs three components—a retrieval predictor, context monitor, and query generator—to exploit semantic precursors in generation dynamics, predicting when retrieval should be triggered and what information should be retrieved. Experiments on multiple benchmarks demonstrate up to 43.5% reduction in end-to-end latency and 62.4% improvement in time-to-first-token, while maintaining answer quality comparable to synchronous RAG baselines.

retrieval-augmented generation · predictive prefetching · asynchronous retrieval · semantic precursors · generation dynamics
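
A minimal sketch of overlapping retrieval with generation, assuming a predictor that decides when to trigger retrieval and a query generator that decides what to fetch. The `retrieve` stub, the sleep-based latencies, and all function names are hypothetical stand-ins; the paper's predictor, context monitor, and query generator are learned components not reproduced here.

```python
import asyncio


async def retrieve(query):
    """Stand-in for a retrieval call; the sleep models retrieval latency."""
    await asyncio.sleep(0.05)
    return f"docs for {query!r}"


async def generate_with_prefetch(tokens, should_prefetch, make_query):
    """Emit tokens while a predicted retrieval runs in the background.

    should_prefetch(emitted) -> bool decides WHEN to trigger retrieval.
    make_query(emitted) -> str decides WHAT to retrieve.
    """
    emitted, context, pending = [], [], None
    for tok in tokens:
        emitted.append(tok)
        if pending is None and should_prefetch(emitted):
            # Start retrieval without blocking decoding.
            pending = asyncio.create_task(retrieve(make_query(emitted)))
        await asyncio.sleep(0.01)  # stand-in for per-token decode time
        if pending is not None and pending.done():
            context.append(pending.result())
            pending = None
    if pending is not None:
        # Retrieval still in flight at the end: wait for it once.
        context.append(await pending)
    return emitted, context


# Usage: trigger one prefetch after the second token.
emitted, context = asyncio.run(generate_with_prefetch(
    ["a", "b", "c", "d"],
    should_prefetch=lambda e: len(e) == 2,
    make_query=lambda e: " ".join(e),
))
```

In a synchronous baseline the decoder would stall for the full retrieval latency; here decoding continues and the result is consumed when it becomes available.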

LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection

arXiv cs.AI · Lei Zhao, Abhay Bhaskar, Edgar Dobriban · 2026-05-18

The paper introduces LivePI, a structured benchmark for evaluating indirect prompt injection (IPI) risks in AI agents operating in production-like environments. The benchmark covers seven input surfaces, twelve attack/rendering families, and five malicious goals, tested on a real virtual machine with live interfaces. Evaluation across GPT-5.3-Codex, Claude Opus 4.6, Gemini 3.1 Pro, Kimi K2.5, and GLM-5 shows attack success rates ranging from 10.7% to 29.6%, with group-chat injection being universally successful. A two-layer defense combining prompt-level filtering and pre-execution tool-call authorization effectively intercepts all malicious completions in GPT-5.3-Codex while maintaining benign utility.

indirect prompt injection · ai agents · benchmarking · virtual machine · tool-call authorization
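
A minimal sketch of the second defense layer only (pre-execution tool-call authorization), assuming a simple policy: low-risk tools are always allowed, and sensitive tools run only when the user's own request called for them. The tool names, both sets, and the policy itself are hypothetical; the paper's actual authorization rules are not described in this summary.

```python
ALLOWED_TOOLS = {"read_file", "search_web"}    # hypothetical low-risk tools
SENSITIVE_TOOLS = {"send_email", "run_shell"}  # hypothetical high-risk tools


def authorize(tool_call, user_requested_tools):
    """Decide, before execution, whether a proposed tool call may run."""
    name = tool_call["name"]
    if name in ALLOWED_TOOLS:
        return True
    # Sensitive tools need intent that came from the user, not from content
    # the agent read along the way (where injected instructions would live).
    if name in SENSITIVE_TOOLS and name in user_requested_tools:
        return True
    return False  # unknown tools are denied by default


def execute_plan(tool_calls, user_requested_tools, run_tool):
    """Run authorized calls and record blocked ones instead of executing."""
    results = []
    for call in tool_calls:
        if not authorize(call, user_requested_tools):
            results.append({"call": call, "status": "blocked"})
            continue
        results.append({"call": call, "status": "ok",
                        "output": run_tool(call)})
    return results
```

Checking at the tool-call boundary means an injected instruction can still influence the model's text, but cannot by itself cause a sensitive action.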

SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models

arXiv cs.AI · Chengjie Hong, Feixiang He, Yiheng Zeng, Lulu Kang · 2026-05-18

We introduce SAFE-SVD, a sensitivity-aware fidelity-enforcing compression framework for physics foundation models (PFMs) that explicitly models loss-aware layer sensitivity in output function space. The method addresses the challenge of preserving physical fidelity in PFMs, where conventional compression techniques often degrade performance due to high sensitivity of spatiotemporal dynamics encoded in partial derivatives. SAFE-SVD achieves significantly higher compression ratios while maintaining accuracy, outperforming existing methods by orders of magnitude across multiple models and datasets. This work establishes a foundation for efficient, deployable, and sustainable scientific foundation models in AI for Science.

physics foundation models · model compression · spatiotemporal dynamics · layer sensitivity · partial derivatives

Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

arXiv cs.AI · Xinzhe Yuan, Zhuo Chen, Jianshu Zhang, Huan Xiong · 2026-05-18

The paper introduces LLM-Guided Bayesian Optimization (LGBO), a novel framework integrating large language models (LLMs) into Bayesian Optimization (BO) through continuous preference guidance. LGBO employs a region-lifted preference mechanism to embed LLM-driven semantic reasoning into each optimization iteration, dynamically shifting the surrogate mean. Theoretical analysis shows LGBO maintains worst-case performance comparable to standard BO while achieving faster convergence when preferences align with objectives. Empirical evaluations demonstrate LGBO's superiority in physics, chemistry, biology, and materials science benchmarks, notably achieving 90% of optimal Fe-Cr battery electrolyte performance in 6 iterations versus >10 for baselines.

bayesian optimization · large language models · preference-guided learning · surrogate modeling · scientific discovery

Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling

arXiv cs.AI · Ziwei Wang, Jing Chen, Ruichao Liang, Zhi Wang · 2026-05-18

Babel introduces a black-box jailbreaking framework for Large Language Models (LLMs) by exploiting a vulnerability in safety mechanisms, where alignment relies on sparsely distributed attention heads. The method formalizes this vulnerability mathematically and employs systematic obfuscation sampling with iterative, feedback-driven distribution refinement to optimize attack efficiency. Evaluations on GPT-4o and Claude-3-5-haiku demonstrate Babel's superior performance, increasing attack success rates from 41.33% to 82.67% and from 38.33% to 78.33%, respectively, within an average of 40 queries. This provides a robust red-teaming methodology for LLM safety research.

jailbreaking · attention heads · obfuscation sampling · red-teaming · query efficiency

Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective

arXiv cs.AI · Junpeng Zhang, Lei Cheng, Guoxi Zhang, Hua Cai · 2026-05-18

The study reconciles contradictory findings on supervised fine-tuning (SFT) effectiveness in LLMs by analyzing token interactions during training. Using interaction-based explanations, it demonstrates that SFT primarily removes noise-like interactions rather than acquiring reliable new ones, with this denoising phase being brief. Prolonged fine-tuning introduces overfitted interactions, explaining inconsistent SFT outcomes. These findings are validated across multiple LLMs and datasets, offering insights into early stopping strategies and practical guidance for LLM training.

supervised fine-tuning · token interactions · denoising · overfitting · early stopping

BLAgent: Agentic RAG for File-Level Bug Localization

arXiv cs.AI · Md Afif Al Mamun, Gias Uddin · 2026-05-18

BLAgent introduces an agentic Retrieval-Augmented Generation (RAG) framework for file-level bug localization, addressing limitations in static retrieval and reasoning. The method integrates code structure-aware repository encoding via path-augmented AST-based chunking, dual-perspective query transformation for structural and behavioral signals, and two-phase agentic reranking combining symbolic inspection with evidence-grounded reasoning. Evaluated on SWE-bench Lite, BLAgent achieves 78% Top-1 accuracy with open-source models and 86% with a closed-source model, while reducing costs by 18x compared to baselines. Integrated into an Automated Program Repair (APR) framework, it improves end-to-end repair success by over 20%.

bug localization · retrieval-augmented generation · ast-based chunking · query transformation · agentic reranking
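
A minimal sketch of path-augmented, AST-based chunking for Python sources, assuming one chunk per function or class, each prefixed with its file path and qualified name. The header format and the function name `path_augmented_chunks` are illustrative; the paper's exact chunk layout is not given in this summary.

```python
import ast


def path_augmented_chunks(path, source):
    """Split a Python file into definition-level chunks, each prefixed with
    'path::qualified.name' so retrieval sees where the code lives."""
    tree = ast.parse(source)
    chunks = []

    def visit(node, scope):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef,
                                  ast.ClassDef)):
                qualname = scope + [child.name]
                header = f"# {path}::{'.'.join(qualname)}"
                body = ast.get_source_segment(source, child)
                chunks.append(header + "\n" + body)
                if isinstance(child, ast.ClassDef):
                    # Also emit each method as its own chunk.
                    visit(child, qualname)

    visit(tree, [])
    return chunks


# Usage on a tiny in-memory file.
example = (
    "class A:\n"
    "    def f(self):\n"
    "        return 1\n"
    "\n"
    "def g():\n"
    "    return 2\n"
)
chunks = path_augmented_chunks("pkg/mod.py", example)
```

Chunking along syntax boundaries keeps each retrieved unit self-contained, and the path prefix gives the retriever a file-level signal that matters for file-level localization.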

A More Word-like Image Tokenization for MLLMs

arXiv cs.AI · Hyun Lee, Hyemin Jeong, Yejin Kim, Hyungwook Choi · 2026-05-18

The paper introduces Disentangled Visual Tokenization (DiVT), a novel method for multimodal large language models (MLLMs) that clusters patch embeddings into semantic units, aligning visual tokens with word-like discrete tokens. DiVT adapts token budgets to image complexity without modifying vision encoders or language models, improving compatibility with LLMs. Evaluations show DiVT matches or exceeds baselines across benchmarks while reducing memory and latency costs, particularly under limited token budgets.

multimodal · tokenization · visual · semantic · embedding

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

arXiv cs.AI · Lingtao Mao, Huangyu Dai, Xinyu Sun, Zihan Liang · 2026-05-18

SVFSearch introduces the first open benchmark for short-video frame search in the Chinese gaming domain, addressing multimodal knowledge-intensive tasks in visually ambiguous paused frames. The benchmark comprises 5,000 four-choice test examples and 4,198 auxiliary training examples, supported by a frozen offline retrieval environment with game-domain text corpus, topic-linked image gallery, and multimodal retrieval interfaces. Evaluations compare direct QA, RAG workflows, Plan-Act-Replan agents, and learned search models, revealing performance gaps: open-source direct-QA achieves 66.4%, practical agents reach 79.1%, and oracle knowledge attains 95.4%. Analysis identifies bottlenecks in visual grounding, retrieval quality, and tool-use behavior.

multimodal retrieval · short-video frame search · knowledge-intensive tasks · visual grounding · retrieval-augmented generation

Training data attribution in diffusion models via mirrored unlearning and noise-consistent skew

arXiv cs.AI · Joan Serrà, Dipam Goswami, Fabio Morreale, Wei-Hsiang Liao · 2026-05-18

The paper introduces mirrored unlearning and noise-consistent skew (MUCS), a novel method for training data attribution (TDA) in diffusion models. MUCS fine-tunes a secondary model using bounded mirrored gradient ascent and measures normalized skew against the original model with consistent noise samples. The approach significantly outperforms existing TDA methods across three datasets, with analysis of design choices, influential instance overlap, and ensembling potential. The findings may extend to general unlearning setups and diffusion loss comparison tasks.

training data attribution · diffusion models · mirrored unlearning · noise-consistent skew · gradient ascent

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

arXiv cs.AI · Zhensheng Wang, Wenmian Yang, Qingtai Wu, Lequan Ma · 2026-05-18

BacktestBench, the first large-scale benchmark for automated quantitative strategy backtesting, addresses the lack of standardized evaluation in this domain. Built from over 6 million real market records, it comprises 18,246 annotated question-answering pairs across four task categories: metrics calculation, ticker selection, strategy selection, and parameter confirmation. The authors propose AutoBacktest, a multi-agent baseline integrating a Summarizer, Retriever, and Coder to translate natural language strategies into reproducible backtests. Evaluation of 23 mainstream LLMs, complemented by ablations, identifies key performance factors and emphasizes the importance of grounded verification and standardized indicator representations.

quantitative backtesting · multi-agent system · semantic factor extraction · sql generation · indicator representations

Prompt Compression in Diffusion Large Language Models: Evaluating LLMLingua-2 on LLaDA

arXiv cs.AI · Sterling Huang, Abigayle Brown, Jiyoo Noh, Jiakang Xu · 2026-05-18

This study evaluates prompt compression in diffusion large language models (DLLMs), demonstrating that autoregressive compression methods do not generalize effectively. Using LLMLingua-2 on the 8B-parameter LLaDA model, experiments on GSM8K, DUC2004, and ShareGPT datasets (250 prompts each) at ~2× compression ratio assessed mathematical reasoning, prompt reconstruction, and summarization. Metrics included exact-match accuracy, BLEU, ROUGE, and BERTScore. Results reveal that semantic preservation does not ensure stable downstream behavior: summarization remained robust, while mathematical reasoning degraded significantly. BERTScore recall consistently lagged precision, indicating compression failures stem from information omission rather than semantic drift. Findings motivate diffusion-aware compression strategies.

prompt compression · diffusion large language models · llmlingua-2 · semantic preservation · bertscore
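
A minimal sketch of why recall lagging precision points to omission rather than drift, using exact token overlap as a simplified proxy. This is an assumption made for illustration: BERTScore matches tokens by contextual-embedding similarity, not exact overlap, and the example sentences are invented.

```python
from collections import Counter


def overlap_precision_recall(reference_tokens, candidate_tokens):
    """Token-overlap precision and recall of a candidate vs. a reference."""
    ref, cand = Counter(reference_tokens), Counter(candidate_tokens)
    overlap = sum((ref & cand).values())  # multiset intersection size
    precision = overlap / max(len(candidate_tokens), 1)
    recall = overlap / max(len(reference_tokens), 1)
    return precision, recall


original = "tom has 3 apples and buys 4 more".split()

# Omission: the compressed prompt drops tokens but changes none of them.
omitted = "tom has apples buys more".split()
p_omit, r_omit = overlap_precision_recall(original, omitted)

# Drift: same length, but several tokens are replaced by different ones.
drifted = "tom owns 3 pears and gets 4 extra".split()
p_drift, r_drift = overlap_precision_recall(original, drifted)
```

Under omission everything kept is still correct, so precision stays high while recall falls (here the dropped tokens include the numbers a math problem depends on); under drift both scores fall together.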

AdaptiveLoad: Towards Efficient Video Diffusion Transformer Training

arXiv cs.AI · Yucheng Guo, Yongjian Guo, Zhong Guan, Haoran Sun · 2026-05-18

The paper introduces AdaptiveLoad, an optimization framework for efficient training of video diffusion Transformers (e.g., DiT, MMDiT) on mixed-mode datasets. The method combines a dual-constraint load balancing system (limiting both memory and computational load via $B \times S^p \le M_{\text{comp}}$) with a fused LayerNorm-Modulate CUDA kernel using D-tile coalesced reduction. Evaluated on the Wan 2.1 world model, AdaptiveLoad reduces computational imbalance from 39% to 18.9%, improves VRAM utilization by 22.7%, and increases training throughput by 27.2%.

video diffusion transformers · load balancing · cuda kernel · layernorm-modulate · computational imbalance
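
A minimal sketch of dual-constraint batch sizing, assuming the compute constraint $B \times S^p \le M_{\text{comp}}$ from the summary plus a linear memory constraint $B \times S \le M_{\text{mem}}$. The memory constraint's form, the exponent p = 2, and the budget values are assumptions for illustration, not values from the paper.

```python
def max_batch_size(seq_len, p, m_comp, m_mem):
    """Largest integer B satisfying both
         B * seq_len**p <= m_comp   (compute budget, from the summary)
         B * seq_len    <= m_mem    (memory budget, assumed linear form)
    """
    by_compute = int(m_comp // (seq_len ** p))
    by_memory = int(m_mem // seq_len)
    return max(0, min(by_compute, by_memory))


# Usage: short sequences hit the memory limit, long ones the compute limit.
M_COMP = 2 ** 24  # hypothetical compute budget
M_MEM = 2 ** 15   # hypothetical memory budget (in tokens)
short_clip_batch = max_batch_size(256, 2, M_COMP, M_MEM)
long_clip_batch = max_batch_size(1024, 2, M_COMP, M_MEM)
```

With a memory-only limit, the long-sequence bucket would be allowed a batch twice as large and would take far longer per step than the short-sequence bucket, which is the kind of imbalance a second, superlinear constraint is meant to remove.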

Domain Transfer Becomes Identifiable via a Single Alignment

arXiv cs.AI · Sagar Shrestha, Subash Timilsina, Hoang-Son Nguyen, Xiao Fu · 2026-05-18

The paper establishes identifiability in domain transfer (DT) tasks by introducing a structural sparsity condition on the Jacobian support pattern, requiring only a single paired anchor sample for alignment. This contrasts with prior methods needing multiple conditional distributions. The authors propose a scalable Jacobian sparsity regularizer using randomized masked finite differences, avoiding explicit Jacobian computation. Experiments on synthetic and real-world DT tasks confirm the method's efficacy, demonstrating identifiability with minimal supervision.

domain transfer · identifiability · jacobian sparsity · measure-preserving automorphisms · alignment

Ethical Hyper-Velocity (EHV): A Provably Deterministic Governance-Aware JIT Compiler Architecture for Agentic Systems

arXiv cs.AI · Riddhi Mohan Sharma · 2026-05-18

The paper introduces Ethical Hyper-Velocity (EHV), a hardware-rooted architectural framework for real-time AI governance policy enforcement in autonomous agentic systems. EHV integrates a Governance-Aware JIT Compiler with Conflict-free Replicated Data Types (CRDTs) for policy synchronization and Epoch-based Attestation Caching in Trusted Execution Environments (TEEs), achieving sub-millisecond formal determinism. Through TLA+ formal verification, the authors prove that non-compliant actions are computationally unreachable, reducing governance latency from O(days) to O(1) while maintaining deployment velocity.

ethical hyper-velocity · policy enforcement point · conflict-free replicated data types · trusted execution environments · governance-aware jit compiler

One Model to Translate Them All: Universal Any-to-Any Translation for Heterogeneous Collaborative Perception

arXiv cs.AI · Yang Li, Weize Li, Quan Yuan, Congzhang Shao · 2026-05-18

UniTrans introduces a universal any-to-any feature modality translation model for heterogeneous collaborative perception, addressing the challenge of modality heterogeneity in real-world scenarios. The model dynamically instantiates translators for arbitrary modalities by pretraining a bank of translator expert parameters and learning combination coefficients based on source-to-target modality mappings in a modality-intrinsic latent space. This approach enables zero-shot translator instantiation without requiring retraining or fine-tuning. Evaluations on OPV2V-H and DAIR-V2X datasets demonstrate UniTrans's superior performance over state-of-the-art methods in both simulated and real-world settings, facilitating efficient any-to-any translation.

collaborative perception · modality translation · zero-shot learning · latent space · heterogeneous fusion

Agentic Chunking and Bayesian De-chunking of AI Generated Fuzzy Cognitive Maps: A Model of the Thucydides Trap

arXiv cs.AI · Akash Kumar Panda, Olaoluwa Adigun, Bart Kosko · 2026-05-18

The authors propose a method for generating and refining fuzzy cognitive maps (FCMs) from text using large language models (LLMs) and Bayesian inference. Gemini 3.1 LLMs decompose text into overlapping chunks, which are converted into sparse causal matrices and convexly mixed to form cyclic FCM knowledge graphs. Bayesian de-chunking then produces posterior-like FCMs, enabling iterative updates. Applied to Allison's 'Thucydides Trap' model, the approach predicted conflict outcomes through FCM dynamical systems, with 7 out of 8 FCM graphs forecasting war when simulating rising power ambition.

fuzzy cognitive maps · bayesian inference · convex mixing · dynamical systems · large language models
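
Illustrative sketch for the item above (not the paper's model): convex mixing of per-chunk causal matrices followed by a simple FCM update. The three concepts, the edge weights, the mixing weights, the binary threshold, and the clamping of the input concept are all assumptions made for a toy example.

```python
def convex_mix(matrices, weights):
    """Convex combination of equally sized causal edge matrices."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, matrices)) for j in range(n)]
        for i in range(n)
    ]


def fcm_step(state, edges, clamped=()):
    """One synchronous FCM update with a binary threshold at zero."""
    n = len(state)
    nxt = [
        1 if sum(state[i] * edges[i][j] for i in range(n)) > 0 else 0
        for j in range(n)
    ]
    for j in clamped:  # keep the chosen input concepts switched on
        nxt[j] = 1
    return nxt


# Toy concepts (hypothetical): 0 = rising-power ambition, 1 = ruling-power fear, 2 = war
chunk_a = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]    # ambition -> fear
chunk_b = [[0, 0, 0.5], [0, 0, 1], [0, 0, 0]]  # ambition -> war, fear -> war
mixed = convex_mix([chunk_a, chunk_b], [0.5, 0.5])

state = [1, 0, 0]
for _ in range(5):
    state = fcm_step(state, mixed, clamped=(0,))
print(state)
```

The mixed matrix keeps every edge proposed by either chunk, scaled by its mixing weight, and the iteration shows how a clamped input concept propagates to a fixed point of the dynamical system.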

LAST-RAG: Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation for Knowledge-Conditioned Degradation Model Selection

arXiv cs.AI · Hanbyeol Park, Hyerim Bae · 2026-05-18

The paper proposes LAST-RAG, a literature-anchored stochastic trajectory retrieval-augmented generation method for knowledge-conditioned degradation model selection in RUL estimation. The approach hierarchically conditions candidate models using both observed health indicator trajectories and domain-specific context retrieved from a local evidence bank, supplemented by RCRUS to handle decision uncertainty. Experiments show superior performance over statistical and prognostic baselines in Wiener/gamma family classification (p<0.05), reframing model selection as a knowledge-augmented decision problem rather than pure statistical fitting.

remaining useful life · stochastic process · retrieval-augmented generation · degradation modeling · health indicator

DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition

arXiv cs.AI · Le Zhang, Shengming Zhang, Rui Zha, Yunpeng Wu · 2026-05-18

DuIVRS-2 introduces an LLM-based end-to-end framework for large-scale POI attribute acquisition in Baidu Maps, addressing error accumulation and maintenance overhead in traditional IVR systems. The methodology employs a finite state machine-guided data augmentation strategy to synthesize a balanced training dataset, streamlines dialogue management via selective generation combined with Chain-of-Thought mechanisms, and implements a cooperative iterative learning framework with a dual-evaluator voting system for policy refinement. Deployed in production for two months, DuIVRS-2 processed 0.4 million calls daily, achieving an 83.9% Task Success Rate, a 4% improvement over its predecessor, with a 130ms reaction time.

llm-based · finite state machine · chain-of-thought · task success rate · dialogue management

DCFold: Efficient Protein Structure Generation with Single Forward Pass

arXiv cs.AI · Zhe Zhang, Yuanning Feng, Yuxuan Song, Keyue Qiu · 2026-05-18

DCFold introduces a single-step generative model for protein structure prediction that achieves AlphaFold3-level accuracy without iterative inference. The method employs a Dual Consistency training framework with a novel Temporal Geodesic Matching scheduler, enabling efficient protein structure generation in a single forward pass. Experimental results demonstrate a 15x acceleration in inference speed while maintaining predictive fidelity, validated across structure prediction and binder design benchmarks. This approach addresses AlphaFold3's computational limitations in downstream applications like virtual screening and protein design.

protein structure prediction · single-step generative model · dual consistency training · temporal geodesic matching · binder design

Evaluating Cognitive Age Alignment in Interactive AI Agents

arXiv cs.AI · Yifan Shen, Jiawen Zhang, Jian Xu, Junho Kim · 2026-05-18

The paper introduces ChildAgentEval, the first psychometrically grounded interactive benchmark for evaluating cognitive age alignment in MLLM-based agents. Inspired by the Wechsler Intelligence Scale for Children (WISC), the benchmark systematically compares reasoning performance of multimodal large language models (MLLMs) against human developmental stages. Results expose gaps where current agentic AI systems fail to simulate age-specific cognitive behavior, despite proficiency in complex domains.

childagenteval · multimodal large language models · cognitive age alignment · wechsler intelligence scale · interactive benchmark

Attention Sinks and Outliers in Attention Residuals

arXiv cs.AI · Haozheng Luo, Haoran Dai, Shaoyang Zhang, Xi Chen · 2026-05-18

OASIS introduces an outlier- and sink-aware technique for AttnResidual architectures, addressing attention sinks and activation outliers that degrade inference stability and quantization robustness. The method employs a Softmax1-based null space and inter-layer null signaling to couple token-level null evidence with depth routing, reducing sink-dominated routing. Theoretical analysis shows that AttnResidual's dual normalization exacerbates sink formation and quantization brittleness. Experiments on three datasets demonstrate that OASIS reduces the maximum infinity norm by 9.26% and average kurtosis by 2.60%, lowers perplexity by 75.85% under W8A8, and improves GSM8K Pass@1 by 12.42% under W4A4.

attention sinks · activation outliers · inter-layer null signaling · softmax1 · quantization robustness

Multi-agent AI systems outperform human teams in creativity

arXiv cs.AI · Tiancheng Hu, Yixuan Jiang, Haotian Li, José Hernández-Orallo · 2026-05-18

Multi-agent LLM systems demonstrate superior creativity compared to human teams, with a large effect size (Cohen's d=1.50) across 4,541 LLM-generated and 341 human-generated ideas in six problem-solving tasks. The study analyzes conversational dynamics using neural language model representations, revealing that both groups benefit from wide-ranging discussions (low global coherence), but differ in optimal exploration patterns: LLM teams excel with efficient semantic exploration (high spread, short paths), while human teams perform better with smooth conversational flow (high local coherence, frequent pivots). Model choice and discussion structure jointly explain 26.8% of variance in LLM conversational dynamics, enabling systematic design of creative multi-agent systems.

multi-agent llm · semantic space · global coherence · local coherence · conversational dynamics
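
Illustrative sketch for the item above: how an effect size such as Cohen's d is computed from two groups of scores. The pooled-standard-deviation form is the textbook definition; whether the paper used exactly this variant is an assumption, and the scores below are invented.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Invented creativity ratings, not data from the study.
llm_scores = [5.0, 6.0, 7.0]
human_scores = [3.0, 4.0, 5.0]
d = cohens_d(llm_scores, human_scores)
print(d)
```

A value of d expresses the gap between group means in units of the pooled standard deviation, so the reported d = 1.50 corresponds to a difference of one and a half standard deviations.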

Guard: Scalable Straggler Detection and Node Health Management for Large-Scale Training

arXiv cs.AI · Guanliang Liu, Abhinandan Patni, Congzhu Lin, Zoe Zeng · 2026-05-18

Guard introduces a scalable system for straggler detection and node health management in large-scale foundation model training. The system combines lightweight online performance monitoring with an offline node-sweep mechanism to detect both acute failures and long-running fail-slow behaviors. Deployed on large-scale pretraining workloads, Guard improves mean FLOPs utilization by up to 1.7x, reduces run-to-run training step variance from 20% to 1%, increases mean time to failure, and reduces operational overhead. These results highlight the importance of proactive straggler detection and systematic node qualification for stable and efficient large-scale training.

straggler detection · node health management · foundation models · flops utilization · training step variance
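
Illustrative sketch for the item above (a generic heuristic, not Guard's detector): flagging nodes whose step time is well above the fleet median is one simple form lightweight online monitoring could take. The node names, timings, and the 1.2x threshold are hypothetical.

```python
from statistics import median


def find_stragglers(step_times, slowdown=1.2):
    """Return nodes whose step time exceeds `slowdown` times the fleet median."""
    baseline = median(step_times.values())
    return sorted(node for node, t in step_times.items() if t > slowdown * baseline)


# Hypothetical per-node step times in seconds.
step_times = {"node0": 1.00, "node1": 1.02, "node2": 1.45, "node3": 0.99}
stragglers = find_stragglers(step_times)
print(stragglers)
```

Using the median as the baseline keeps a single slow node from shifting the reference point, which matters when the goal is to catch fail-slow behavior rather than outright failures.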

PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

arXiv cs.AI · Wonjoong Kim, Yeonjun In, Sangwu Park, Dongha Lee · 2026-05-18

The paper introduces Prefix-Aware Internal Reward (PAIR), a two-stage model for multi-turn agent optimization that combines frozen hidden-state probes with attention-based correction to provide dense step-level rewards. PAIR addresses prefix contamination in hidden-state probing by leveraging complementary robustness properties: hidden states track belief-consistency while attention features maintain grounded correctness. Experiments demonstrate PAIR achieves superior AUROC on contaminated trajectories (0.87 vs 0.78 baseline) with negligible inference overhead, enabling efficient Group Relative Policy Optimization without external judges or ground-truth dependencies.

multi-turn agent · prefix contamination · hidden-state probing · group relative policy optimization · attention-based correction

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

arXiv cs.AI · Woongyeng Yeo, Yumin Choi, Taekyung Ki, Sung Ju Hwang · 2026-05-18

HINT-SD introduces a targeted self-distillation framework for training long-horizon LLM agents, addressing inefficiencies in sparse outcome rewards and misaligned feedback application. The method leverages full-trajectory hindsight to identify failure-relevant actions and applies feedback-conditioned distillation selectively on targeted action spans, avoiding unnecessary per-turn feedback. Experiments on BFCL v3 and AppWorld demonstrate improvements of up to 18.80% over dense per-turn feedback baselines, with a 2.26× reduction in training time per step. This highlights the importance of selective distillation for effective and efficient long-horizon agent training.

self-distillation · long-horizon agents · feedback-conditioned · trajectory hindsight · selective distillation

$\boldsymbol{f}$-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control

arXiv cs.AI · Xianwei Chen, Shimin Zhang, Jibin Wu · 2026-05-18

The paper introduces $f$-OPD, a freshness-aware control framework that stabilizes long-horizon on-policy distillation (OPD) for LLMs by addressing the trade-off between asynchronous execution efficiency and on-policy objective fidelity. The method theoretically decomposes objective discrepancy into rollout drift and supervision drift, then uses a sample-level freshness score to adaptively regulate stale-sample influence and constrain policy drift. Experiments on reasoning, tool-use, and coding-agent tasks demonstrate that $f$-OPD matches synchronous optimization performance while preserving asynchronous throughput advantages, establishing a scalable approach for long-horizon agentic post-training.

on-policy distillation · rollout drift · supervision drift · freshness-aware control · asynchronous execution

PAREDA: A Multi-Accent Speech Dataset of Natural Language Processing Research Discussions

arXiv cs.AI · Sicheng Jin, Dipankar Srirag, Aditya Joshi · 2026-05-18

The paper introduces PAREDA, a multi-accent speech dataset capturing academic NLP discussions in Australian, Indian-English, and Chinese-English accents, addressing variability in accented, spontaneous, and domain-specific speech. The corpus includes monologues (paper summaries) and non-monologues (Q&A sessions), rich in technical jargon and conversational phenomena. Evaluation of SOTA ASR models reveals performance degradation in zero-shot settings due to accent mixing and increased speech rate, while fine-tuning on PAREDA significantly reduces Word Error Rate (WER). The dataset enables development of more robust ASR systems for specialized real-world applications.

multi-accent speech · word error rate · automatic speech recognition · technical jargon · zero-shot setting

KISS - Knowledge Infrastructure for Scientific Simulation: A Scaffolding for Agentic Earth Science

arXiv cs.AI · Ziwei Li, Liujun Zhu, Yuchen Liu, Yichen Zhao · 2026-05-18

The authors introduce Knowledge Infrastructure (KI), a scaffolding framework that externalizes scientific expertise into agent-actionable components—validated modeling operators, domain protocols, and diagnostic recovery mechanisms—to enable broader access to process-based simulation models. KI was evaluated on a 3,000-trial coupled-hydrology benchmark, where agents equipped with KI achieved physically plausible, verifiable simulations in 84% of trials, compared to <40% without KI. A Knowledge Dissection Toolkit (KDT) autonomously generated KI for 117 additional process-based models across 14 Earth-science domains, demonstrating generalization and convergence of modeling decisions. KI lowers access barriers for non-specialists and fosters integration across modeling communities.

knowledge infrastructure · process-based simulation · agent-actionable · domain protocols · diagnostic recovery

Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

arXiv cs.AI · Zichun Yu, Chenyan Xiong · 2026-05-18

SynPro introduces a synthetic data generation framework to address data-bound LLM pretraining by maximizing organic corpus utilization. The method employs rephrasing and reformatting operations, optimized via RL with quality, faithfulness, and data influence rewards, and dynamically updated during pretraining plateaus. Experiments with 400M and 1.1B models show SynPro yields 3.7-5.2x more effective tokens than repetition, outperforming non-data-bound oracles at the 1.1B scale, while avoiding distribution collapse.

synthetic data generation · data-bound scaling · reinforcement learning · llm pretraining · organic corpus utilization

Balancing Knowledge Distillation for Imbalance Learning with Bilevel Optimization

arXiv cs.AI · Anh B. H. Nguyen, Ba Tho Phan, Viet Cuong Ta · 2026-05-18

The paper proposes BiKD, a bilevel optimization framework for knowledge distillation in imbalanced learning scenarios. BiKD dynamically balances hard and soft losses at the sample level using a weight generation network, guided by a small balanced validation set. The method employs a multi-step SGD strategy to optimize the weight model efficiently. Experiments on long-tailed CIFAR-10/100 demonstrate that BiKD outperforms existing balanced distillation methods across various imbalance factors when transferring knowledge from teacher to student models.

knowledge distillation · bilevel optimization · imbalanced learning · hard losses · soft losses

Temporal Aware Pruning for Efficient Diffusion-based Video Generation

arXiv cs.AI · Sheng Li, Yang Sui, Junhao Ran, Bo Yuan · 2026-05-18

The paper introduces TAPE, a training-free Temporal Aware Pruning method for efficient diffusion-based video generation. TAPE addresses computational inefficiency in ViT-based video diffusion models by (i) applying temporal smoothing to align token-importance across frames, (ii) performing token reselection to align pruning with layer-specific semantics, and (iii) using timestep-level budget scheduling to vary pruning intensity. Experiments demonstrate that TAPE achieves significant speedups while maintaining visual fidelity, outperforming prior token reduction approaches.

video diffusion models · token pruning · temporal coherence · vit-based architectures · training-free pruning

Efficient Bilevel Optimization for Meta Label Correction in Noisy Label Learning

arXiv cs.AI · Ba Hoang Anh Nguyen, Viet Cuong Ta · 2026-05-18

The paper proposes EBOMLC, an efficient bilevel optimization method for meta label correction in noisy label learning. The method introduces three key improvements over standard meta label correction: a one-step inner-loop update, a mixture upper loss, and alignment-aware dynamic barrier gradient descent. These enhancements address computational inefficiency, noisy signal leakage, and meta model instability while maintaining first-order complexity. Experiments on CIFAR-10 and CIFAR-100 demonstrate that EBOMLC outperforms baselines, particularly under high noise rates, while significantly reducing training time compared to traditional meta label correction approaches.

meta label correction · bilevel optimization · dynamic barrier gradient · noisy label learning · hypergradient

Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents

arXiv cs.AI · Ahmad Al-Tawaha, Shangding Gu, Peizhi Niu, Ruoxi Jia · 2026-05-18

The study introduces temporal memory contamination as a longitudinal safety risk in memory-equipped LLM agents, contrasting with traditional single-state evaluations. A trigger-probe protocol isolates memory exposure effects by evaluating fixed probe sets against read-only memory snapshots at varying prefix lengths, using a NullMemory baseline for comparison. Experiments across three deployment scenarios and eight memory architectures, including Claw-like agents, reveal that memory-enabled agents consistently exceed the NullMemory baseline, with violation rates increasing with memory exposure. Order-randomization confirms content accumulation drives this effect. A diagnostic monitor detects memory-induced risks from retrieval state before generation, advocating for temporal evaluation of memory safety.

temporal memory contamination · trigger-probe protocol · nullmemory baseline · memory exposure · retrieval state

Interactive Evaluation Requires a Design Science

arXiv cs.AI · Keyang Xuan, Peiyang Song, Pan Lu, Pengrui Han · 2026-05-18

The paper advocates for interactive evaluation as a distinct paradigm in AI assessment, addressing limitations of traditional response-centered benchmarks. It highlights the need for new methodologies to handle dynamic interactions involving tools, environments, and multi-agent systems. The authors propose a two-axis taxonomy, design principles, and reporting standards to systematize interactive evaluation. Key challenges include trajectory-level analysis, process assessment, and robustness metrics. The work bridges gaps in fragmented interactive benchmarking practices by formalizing evidence-to-judgment mappings for evolving LLM deployments.

interactive evaluation · trajectory analysis · llm deployment · benchmark design · robustness metrics

Content-Style Identification via Differential Independence

arXiv cs.AI · Subash Timilsina, Hoang-Son Nguyen, Sagar Shrestha, Xiao Fu · 2026-05-18

The paper introduces content-style differential independence (CSDI), a structural condition enabling identifiability of domain-invariant content and domain-specific style variables in multi-domain generative analysis. CSDI requires that infinitesimal variations in content and style induce orthogonal directions on the data manifold, relaxing restrictive assumptions of statistical independence or sparse Jacobians. The method operationalizes this via blockwise orthogonality constraints on Jacobian subspaces and employs a stochastic regularizer for scalable training in high-dimensional settings. Experiments across multiple datasets validate CSDI's identifiability and demonstrate improved performance on counterfactual generation and domain translation tasks.

content-style differential independence · identifiability · jacobian subspaces · counterfactual generation · domain translation

CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models

arXiv cs.AI · Reem Alzahrani, Hassan Alshanqiti, Bushra Bin Hemid, Zaid Alyafeai · 2026-05-18

CounterCount introduces a diagnostic framework for evaluating counting bias in Vision-Language Models (VLMs) by leveraging paired factual and counterfactual images with verified answers and localized evidence annotations. The method assesses VLMs' reliance on visual evidence versus object-level priors, revealing consistent degradation in counterfactual scenarios despite contradictory visual evidence. Analysis using localized annotations indicates failures stem from underweighted attention to count-relevant visual tokens. A unified inference-time attention modulation strategy improves counterfactual counting accuracy by up to 8% across multiple VLMs. CounterCount highlights prior-driven counting failures and offers diagnostic insights for future VLM design.

vision-language models · counterfactual counting · attention modulation · localized annotations · visual tokens

Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding

arXiv cs.AI · Shravan Murlidaran, Ziqi Wen, Sana Shehabi, Miguel P. Eckstein · 2026-05-18

The study demonstrates that human-like fixation patterns during free-viewing emerge when a foveated visual language model optimizes scene comprehension. Using a computational agent with simulated foveation, the authors show that training for scene comprehension, rather than search or classification, yields fixation patterns aligning with human behavior (center bias, attention to people, text, and semantically meaningful regions). Agents with non-human peripheral vision performance or alternative training objectives predicted human fixations less accurately. This suggests human fixation patterns may result from optimizing scene understanding under biological foveation constraints.

foveated vision · scene comprehension · fixation patterns · computational modeling · free-viewing

TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training

arXiv cs.AI · Shujie Han, Feng Jiang, Patrick P. C. Lee, Xiao Zhang · 2026-05-18

TierCheck introduces a tiered checkpointing system for fault-tolerant LLM training, addressing heterogeneous failure modes through storage-aligned state preservation. The method employs a three-tier design: lightweight differential checkpoints in local/peer memory for fast recovery, asynchronous migration of base checkpoints to persistent storage, and cluster-aware restoration with strict consistency. Evaluations on 40B-parameter models demonstrate <10s checkpointing latency, low training overhead (1-3%), and high-frequency checkpoint support, optimizing the persistence-recovery tradeoff.

tiered checkpointing · fault tolerance · differential checkpoints · cluster-aware recovery · persistent storage
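
Illustrative sketch for the item above (not TierCheck's format): the summary does not say what a differential checkpoint contains, so this assumes the simplest reading, namely storing only the entries that changed since the last base checkpoint and overlaying them on the base at recovery. Key names and values are hypothetical, and deleted keys are not handled.

```python
def diff_checkpoint(base, current):
    """Keep only entries that changed (or were added) since the base checkpoint."""
    return {k: v for k, v in current.items() if k not in base or base[k] != v}


def restore(base, diff):
    """Rebuild the latest state by overlaying the differential on the base."""
    return {**base, **diff}


# Hypothetical training state at the base checkpoint and ten steps later.
base = {"layer0.w": [0.1, 0.2], "layer1.w": [0.3, 0.4], "step": 100}
current = {"layer0.w": [0.1, 0.2], "layer1.w": [0.3, 0.5], "step": 110}

diff = diff_checkpoint(base, current)
print(sorted(diff))
```

The small differential is what would live in fast local or peer memory, while the full base is the part that can migrate to persistent storage asynchronously.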

Virtues of Ordered Chaos: Planning with Topple Actions in Tabletop Stack Rearrangement

arXiv cs.AI · Hao Lu, Rahul Shome · 2026-05-18

This work introduces a novel abstraction for task planning in tabletop stack rearrangement, augmenting traditional pick-and-place actions with nonprehensile toppling actions to compress relocation sequences. The method employs a directed graphical abstraction treating objects as pebbles, formulating the problem as a variant of the pebble motion problem. Plans interleave pick-and-place and topple actions based on problem constraints. Benchmarks conducted in IsaacSim physics simulation demonstrate faster execution compared to pick-and-place-only strategies. While focused on toppling, the abstraction framework extends to other aggregating actions like scooping, indicating promising benefits for rich object interactions in manipulation tasks.

nonprehensile actions · pebble motion problem · task planning · stack rearrangement · directed graphical abstraction

Going Headless? On the Boundaries of Vertical AI Firms

arXiv cs.AI · Muhammad Zia Hydari, Farooq Muzaffar · 2026-05-18

The article analyzes the strategic implications of 'going headless' for vertical AI firms, distinguishing between interface and accountability boundaries. Using Coase's theory of the firm, platform envelopment frameworks, and Teece's complementary assets analysis, it demonstrates that orchestrators gain power through open protocols while value capture concentrates in accountability assets like professional signoff and regulated workflows. A three-position taxonomy (component, integrated platform, dual-track) is proposed, emphasizing task-accountability regimes over sectors. The concept of rule debt is formalized as the governance burden from migrating business rules to prompts. Four principles guide firms in decomposing by accountability, retaining core assets, positioning rule debt as a cost, and avoiding single-orchestrator dependence.

vertical ai · accountability boundary · rule debt · platform envelopment · cospecialized assets

Accelerating AI-Powered Research: The PuppyChatter Framework for Usable and Flexible Tooling

arXiv cs.AI · Chun-Hsiung Tseng, Hao-Chiang Koong Lin, Andrew Chih-Wei Huang, Yung-Hui Chen · 2026-05-18

The study introduces PuppyChatter, a novel software framework that balances vendor-specific SDK usability with vendor-neutral model abstraction principles for AI application development. The framework addresses challenges in LLM tooling by simplifying API interactions while avoiding vendor lock-in and reducing complexity. Results suggest PuppyChatter provides a more streamlined and flexible development paradigm compared to existing approaches.

large language models · software framework · vendor lock-in · model abstraction · api interaction

Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation

arXiv cs.AI · Baoteng Li, Xianghao Zang, Xinran Wang, Xiangyu Na · 2026-05-18

Curriculum Group Policy Optimization (CGPO) introduces an adaptive curriculum training framework to enhance text-to-image generation efficiency. CGPO dynamically prioritizes prompts based on their inconsistency, measured by reward variance across generated image groups, aligning sample difficulty with model capability. A category calibration method addresses dataset imbalance via proportional fairness optimization. Evaluations on GenEval, T2I-CompBench++, and DPG Bench demonstrate improved generation performance, validating CGPO's effectiveness in optimizing training dynamics for text-to-image tasks.

curriculum learning · text-to-image generation · group relative policy optimization · reward variance · proportional fairness
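
Illustrative sketch for the item above (not the authors' sampler): prioritizing prompts by the variance of rewards within each generated image group. Sampling in direct proportion to variance, the uniform fallback, and the prompt labels and reward values are assumptions; the category calibration step is omitted.

```python
from statistics import pvariance


def prompt_sampling_weights(group_rewards):
    """Map each prompt to a sampling weight proportional to its reward variance."""
    variances = {prompt: pvariance(rewards) for prompt, rewards in group_rewards.items()}
    total = sum(variances.values())
    if total == 0:  # no prompt is informative; fall back to uniform sampling
        return {prompt: 1 / len(variances) for prompt in variances}
    return {prompt: var / total for prompt, var in variances.items()}


# Hypothetical rewards for four images generated per prompt.
group_rewards = {
    "always solved": [1.0, 1.0, 1.0, 1.0],
    "half solved": [0.0, 1.0, 0.0, 1.0],
    "rarely solved": [0.0, 0.0, 0.0, 1.0],
    "never solved": [0.0, 0.0, 0.0, 0.0],
}
weights = prompt_sampling_weights(group_rewards)
for prompt, weight in weights.items():
    print(f"{prompt}: {weight:.3f}")
```

Prompts the model always or never solves have zero variance and carry no group-relative learning signal, so the weight concentrates on prompts near the edge of current capability.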

Optimal Knock-Pick Planning for Tightly Packed Tabletop Blocks With Parallel Grippers

arXiv cs.AI · Hao Lu, Rahul Shome · 2026-05-18

The paper introduces an optimal knock-pick planning method for rearranging densely packed tabletop blocks using parallel grippers, addressing scenarios where prehensile picks are infeasible due to insufficient clearance. A directional knock primitive is formulated, and the problem is abstracted using minimal constraining gadgets to identify necessary knocks. Optimal plans minimizing action count are computed via maximum-weight perfect matching on a graphical abstraction, achieving polynomial-time efficiency. Experiments in synthetic settings and IsaacSim demonstrate scalability with increasing grid sizes. This work advances manipulation strategies by rigorously interleaving prehensile and non-prehensile actions.

knock-pick planning · parallel grippers · prehensile actions · maximum-weight matching · isaacsim

STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery

arXiv cs.AI · Jiarui Su, Songjun Tu, Bei Sun, Xiaojun Liang · 2026-05-18

STRIDE introduces a self-reflective agent framework for reliable automatic equation discovery, addressing limitations in LLM-based systems that rely on generation-centered loops. The framework integrates data-aware generation, mixed-fitting evaluation, critic-executor repair, and diversity-preserving semantic memory to enable closed-loop discovery. Experiments on symbolic-regression benchmarks and LSR-Synth suites demonstrate improved accuracy, out-of-distribution robustness, and structural recovery across multiple LLM backbones, with ablations validating the contribution of core components.

symbolic regression · self-reflective agent · mixed-fitting evaluation · critic-executor repair · semantic memory

SocialMemBench: Are AI Memory Systems Ready for Social Group Settings?

arXiv cs.AI · Olukunle Owolabi · 2026-05-18

SocialMemBench introduces a benchmark for evaluating AI memory systems in multi-party social group settings, addressing a critical gap in existing dyadic or workplace-focused benchmarks. The benchmark comprises 430 personas across five archetypal social networks (close friends, family, recreational, interest community, acquaintance network) and three group-size tiers (4-30 members), generating 7,355 conversation turns and 1,031 QA pairs across nine categories isolating architectural capabilities. Evaluations reveal significant performance gaps: Gemini 2.5 Flash achieves 0.721 accuracy, while open-source frameworks (Mem0, LangMem, Graphiti, Cognee) cluster at 0.12-0.18, below retrieval (0.345) and GPT-4o-mini (0.369) references. Five failure modes are identified, with two probed via Subject-Mem and SMG.

socialmembench · multi-party · memory systems · qa pairs · failure modes

Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale

arXiv cs.AI · Jinghui Liu, Sarvesh Soni, Anthony Nguyen · 2026-05-18

This study systematically evaluates LLM-generated synthetic clinical notes rephrased from MIMIC databases at million-note scale, assessing intrinsic, extrinsic, and factuality aspects. The analysis reveals that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks but lose fine-grained details for ICD coding, which can be mitigated by chunk-based rephrasing at the cost of reduced factual precision. Fact-checking identifies misinterpretation of clinical context, temporal confusion, measurement errors, and fabricated claims as dominant synthesis errors. Despite their task-agnostic nature, synthetic notes effectively augment task-specific training for rare ICD codes.

synthetic clinical notes · mimic databases · icd coding · fact-checking · task-agnostic

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

arXiv cs.AI · Junyao Yang, Chen Qian, Kun Wang, Linfeng Zhang · 2026-05-18

The paper introduces Entropy-Gradient Inversion, a geometric fingerprint characterizing reasoning capability in Large Reasoning Models (LRMs), defined as a negative correlation between token entropy and logit gradients. It proposes Correlation-Regularized Group Policy Optimization (CorR-PO), an RL method that embeds this inversion signature as reward regularization. Experiments across reasoning benchmarks and model scales demonstrate CorR-PO's consistent superiority over baselines, with stronger inversion correlating to improved performance.

large reasoning modelsentropy-gradient inversionreinforcement learninglogit gradientspolicy optimization

Surface-Form Neural Sparse Retrieval: Robust Fuzzy Matching for Industrial Music Search

arXiv cs.AI · Paul Greyson, Zhichao Geng, Wei Zhang, Yang Yang · 2026-05-18

A robust neural sparse retrieval system is introduced for industrial-scale music search, addressing challenges of query variations and millisecond latency constraints. The method adapts an inference-free sparse retrieval architecture with domain-specific granular subword tokenization, enforcing surface-form robustness through short-length token constraints (max 3 chars). Neural embeddings and term expansions are pre-computed offline, minimizing online processing to tokenization and IDF weighting. Evaluations on a 6M-document corpus demonstrate 91.4% recall@10 (vs. 57.7% for trigrams) and +0.8% higher stabilized recall in HCI feedback loop simulations. Performance gains are attributed to sparse training methodology, with domain-specific pretraining offering cost-effective advantages over large-scale general-purpose pretraining.

neural sparse retrievalsubword tokenizationsurface-form robustnessidf weightinghci feedback loop

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability

arXiv cs.LG · Ruitao Liu, Xinyang Tian, Shuo Chen, Tingrui Zhang · 2026-05-18

The Runtime-Readiness-First Pipeline (RRFP) introduces a readiness-driven runtime for pipeline-parallel training that addresses runtime variability in computation and communication. RRFP treats schedules as non-binding hints, enabling stages to execute ready work immediately rather than waiting for pre-committed orders. It integrates message-driven asynchronous communication, lightweight tensor-parallel coordination, and ready-set arbitration for efficient dispatch. Implemented in a Megatron-based framework, RRFP achieves up to 1.77× speedup on language-only workloads and 2.77× on multimodal tasks across 128 GPUs. It outperforms external systems by up to 1.84× while maintaining training correctness.

pipeline parallelismruntime variabilitytensor-parallel coordinationready-set arbitrationmegatron

SURGE: Approximation-free Training Free Particle Filter for Diffusion Surrogate

arXiv cs.LG · Lifu Wei, Yinuo Ren, Naichen Shi, Yiping Lu · 2026-05-18

The paper introduces URGE (Unbiased Resampling via Girsanov Estimation), a derivative-free inference-time scaling algorithm for diffusion models that avoids gradient computations. URGE employs path-wise importance reweighting via Girsanov change of measure, attaching multiplicative weights to trajectories and periodically resampling without score, Hessian, or PDE evaluations. The method establishes equivalence between path-wise and particle-wise SMC, ensuring unbiased terminal laws. Empirical results show URGE outperforms existing guidance baselines in generation quality on synthetic tests and diffusion benchmarks, while being simpler and gradient-free.

girsanovresamplingdiffusiontrajectoryunbiased

PIXLRelight: Controllable Relighting via Intrinsic Conditioning

arXiv cs.LG · Miguel Farinha, Ronald Clark · 2026-05-18

PIXLRelight introduces a feed-forward method for physically controllable single-image relighting, addressing limitations in existing approaches regarding lighting control, error accumulation, and computational cost. The method bridges physically based rendering (PBR) and learned image synthesis through intrinsic conditioning derived from multi-illumination photographs or PBR renders. A transformer-based neural renderer applies target illumination while preserving fine details via per-pixel affine modulation. The approach achieves state-of-the-art relighting quality, supports arbitrary PBR-style lighting control, and processes images in under 0.1 seconds.

single-image relightingphysically based renderingintrinsic conditioningtransformer-based neural rendererper-pixel affine modulation

General Preference Reinforcement Learning

arXiv cs.LG · Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry · 2026-05-18

General Preference Reinforcement Learning (GPRL) bridges the gap between online RL and preference optimization by leveraging the General Preference Model (GPM) for multi-dimensional quality assessment. GPRL embeds responses into k skew-symmetric subspaces, computes per-dimension group-relative advantages, and aggregates them with context-dependent eigenvalues to prevent single-axis exploitation. A closed-loop drift monitor dynamically corrects axis dominance by reweighting dimensions and tightening the trust region. Evaluated on Llama-3-8B-Instruct, GPRL achieves a 56.51% length-controlled win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking.

general preference modelreinforcement learningskew-symmetric subspacesreward hackingclosed-loop drift monitor
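
The skew-symmetric scoring mentioned above can be sketched in a few lines. This is a minimal illustration, assuming each of the k subspaces is two-dimensional and scored with the block [[0, -1], [1, 0]]. The function name, the `eigenvalues` argument, and the sign convention are hypothetical choices for this sketch, not the paper's implementation.

```python
from typing import Sequence


def preference_score(va: Sequence[float], vb: Sequence[float],
                     eigenvalues: Sequence[float]) -> float:
    """Skew-symmetric preference score of response a over response b.

    va and vb hold 2*k embedding coordinates; eigenvalues holds k weights
    (hypothetical stand-ins for the context-dependent eigenvalues).
    """
    if len(va) != len(vb) or len(va) != 2 * len(eigenvalues):
        raise ValueError("expected 2*k coordinates and k eigenvalues")
    score = 0.0
    for i, lam in enumerate(eigenvalues):
        a0, a1 = va[2 * i], va[2 * i + 1]
        b0, b1 = vb[2 * i], vb[2 * i + 1]
        # With R = [[0, -1], [1, 0]], a^T R b = a1*b0 - a0*b1.
        score += lam * (a1 * b0 - a0 * b1)
    return score
```

Because every block is skew-symmetric, swapping the two responses flips the sign of the score and a response scores exactly zero against itself, which is what lets such a model represent cyclic preferences.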

Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation

arXiv cs.LG · Kenan Majewski, Marcin Żugaj · 2026-05-18

The N-Deep Recurrent Sage-Husa Filter (NDR-SHKF) enhances UAV state estimation robustness by replacing the static scalar forgetting factor in Sage-Husa Kalman Filters with a learned vector-valued memory attenuation policy. This policy is implemented via a hierarchical recurrent network processing whitened innovation sequences, employing a bifurcated architecture to separately capture instantaneous sensor anomalies and sustained dynamic trends. An auxiliary reconstruction objective prevents feature collapse, and the filter is trained end-to-end via backpropagation through time to minimize state estimation error. Evaluations on chaotic attractors and real-world UAV flight datasets demonstrate cross-domain generalization and improved performance during sensor outages compared to classical adaptive estimators.

sage-husa kalman filtermemory attenuationrecurrent networkstate estimationuav
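
For context, the classical Sage-Husa recursion that the summary says NDR-SHKF modifies uses one static scalar forgetting factor b. The sketch below shows that baseline for a scalar measurement with observation matrix H = 1. The function names and the variance floor are assumptions added for illustration, and the learned vector-valued policy from the paper is not reproduced here.

```python
def sage_husa_weight(b: float, k: int) -> float:
    """Fading weight d_k = (1 - b) / (1 - b**(k + 1)) for step k >= 0."""
    return (1.0 - b) / (1.0 - b ** (k + 1))


def update_noise_variance(r_prev: float, innovation: float, prior_var: float,
                          b: float, k: int, floor: float = 1e-9) -> float:
    """One Sage-Husa update of the measurement-noise variance (scalar, H = 1).

    r_prev:     previous noise-variance estimate
    innovation: measurement minus predicted measurement
    prior_var:  predicted state variance before the measurement update
    floor:      assumed safeguard that keeps the estimate positive
    """
    d = sage_husa_weight(b, k)
    r_new = (1.0 - d) * r_prev + d * (innovation ** 2 - prior_var)
    return max(r_new, floor)
```

The weight starts at 1 and decays toward 1 - b, so a b close to 1 gives the estimator a long memory. Replacing this single constant with a learned, per-component value is the change the summary describes.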

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

arXiv cs.LG · Minrui Xu, Zilin Wang, Mengyi DENG, Zhiwei Li · 2026-05-18

EnvFactory introduces a fully automated framework for scaling tool-use agents via Agentic Reinforcement Learning (Agentic RL), addressing challenges in environment scalability and realistic training data scarcity. The framework autonomously synthesizes stateful, executable tool environments from authentic resources and generates natural multi-turn trajectories through topology-aware sampling and calibrated refinement. Using only 85 verified environments across 7 domains, EnvFactory produces 2,575 SFT and RL trajectories, achieving superior training efficiency and downstream performance. It improves Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including τ²-Bench and VitaBench.

agentic rlexecutable environmentstopology-aware samplingmulti-turn trajectoriestraining efficiency

Learning Normal Representations for Blood Biomarkers

arXiv cs.LG · Aashna P. Shah, Michelle M. Li, Yash Lal, Seffi Cohen · 2026-05-18

The paper introduces NORMA, a conditional transformer-based framework for generating personalized blood biomarker reference intervals that combine individual testing histories with population-level data. Leveraging 2 billion longitudinal measurements from 1.6 million individuals, the authors demonstrate that purely personalized intervals overfit, flagging 68% of measurements as abnormal without clinical relevance. NORMA outperforms existing methods in predicting outcomes like mortality and acute kidney injury by anchoring individual trajectories to population priors. The model, code, and an interactive interface are publicly released.

biomarkerstransformerlongitudinal datareference intervalsclinical outcomes

Can machine learning for quantum-gas experiments be explainable?

arXiv cs.LG · I. B. Spielman, J. P. Zwolak · 2026-05-18

The study explores machine learning applications in quantum-gas experiments, focusing on denoising raw images and identifying solitonic waves in Bose-Einstein condensates. It highlights the challenges of many-body atomic physics, including experimental complexity and computational demands. The research demonstrates ML's potential to enhance interpretability and performance in quantum simulators, while addressing trade-offs between model complexity and explainability.

quantum-gasbose-einstein condensatesdenoisingsolitonic wavesinterpretability

Better Together: Evaluating the Complementarity of Earth Embedding Models

arXiv cs.LG · Thijs L van der Plas, Jacob JW Bakermans, Vishal Nedungadi, Gabrielė Tijūnaitytė · 2026-05-18

The paper introduces an embedding complementarity index to evaluate Earth embedding models by their fusion performance rather than isolated benchmarks. It assesses four models (AlphaEarth, Tessera, GeoCLIP, SatCLIP) across six downstream tasks, comparing single-model baselines against pairwise and joint embeddings. Results show fused embeddings outperform the best single model in 4/6 tasks, with complementarity being task- and location-dependent. Spatial scale of land cover classes partially explains complementarity in regression tasks, suggesting future gains may come from model combinations rather than individual advances.

earth embeddingsembedding fusioncomplementarity indexdownstream tasksspatial scale

A No-Defense Defense Against Gradient-Based Adversarial Attacks on ML-NIDS: Is Less More?

arXiv cs.LG · Mohamed elShehaby, Ashraf Matrawy · 2026-05-18

This paper demonstrates that architectural choices alone can enhance robustness in Deep Neural Network-based Network Intrusion Detection Systems (NIDS) against gradient-based adversarial attacks, without explicit defenses. Through 2200 experiments, the authors varied network depth, feature dimensionality, activation functions, and dropout, evaluating performance under FGSM, PGD, and BIM attacks. Results indicate that shallower networks, reduced feature sets, and ReLU activation jointly reduce adversarial vulnerability. A simple model adhering to these principles outperforms deeper, fully-featured adversarially trained models, achieving near-perfect clean-traffic detection and lower training times. The findings emphasize that selecting optimal architectural simplifications is crucial for robustness.

adversarial attacksnetwork intrusion detectiondeep neural networksrelu activationgradient-based attacks
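
As a reminder of what the simplest of the three attacks does, here is a minimal FGSM sketch. It assumes a logistic-regression detector as a stand-in for the DNN-based NIDS, and it ignores any validity constraints on network-traffic features. The function name and parameters are hypothetical.

```python
import math
from typing import List, Sequence


def fgsm_logistic(x: Sequence[float], y: float, w: Sequence[float],
                  b: float, eps: float) -> List[float]:
    """One FGSM step against a logistic-regression model.

    x: input features, y: true label (0.0 or 1.0),
    w, b: model weights and bias, eps: perturbation budget per feature.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    # Gradient of binary cross-entropy with respect to the input is (p - y) * w.
    grad = [(p - y) * wi for wi in w]
    # Step each feature by eps in the direction that increases the loss.
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]
```

PGD and BIM iterate this kind of signed-gradient step with a smaller step size and a projection or clipping back into the allowed perturbation range.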

Efficient and Noise-Tolerant PAC Learning of Multiclass Linear Classifiers

arXiv cs.LG · Rita Adhikari, Shiwei Zeng · 2026-05-18

The paper presents a computationally efficient PAC learning algorithm for multiclass linear classifiers under malicious noise, addressing an open problem in machine learning. The method combines a cluster-based pruning scheme with multiclass hinge loss minimization, operating on data satisfying a margin condition with bounded variance marginal distributions. The algorithm achieves PAC learning with sample complexity O(k^2·(d log d + log k)), even under constant-rate nasty noise, outperforming prior binary classification results. This work extends noise-tolerant PAC learning to multiclass settings with k ≥ 3 classes, demonstrating improved sample efficiency and computational tractability compared to existing approaches.

pac learningmulticlass classificationlinear classifiersnoise tolerancehinge loss
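
The summary does not say which multiclass hinge loss is minimized. The sketch below assumes the common Crammer-Singer form, max(0, margin + max over wrong classes of the score, minus the score of the true class), with one weight vector per class. The function name and the `margin` parameter are illustrative.

```python
from typing import Sequence


def multiclass_hinge(weights: Sequence[Sequence[float]], x: Sequence[float],
                     y: int, margin: float = 1.0) -> float:
    """Crammer-Singer hinge loss for one example.

    weights: one weight vector per class (k >= 2 classes)
    x:       feature vector
    y:       index of the true class
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    best_other = max(s for j, s in enumerate(scores) if j != y)
    return max(0.0, margin + best_other - scores[y])
```

The loss is zero only when the true class beats every other class by at least the margin, which is the kind of margin condition the summary refers to.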

Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)

arXiv cs.LG · Michał Brzozowski, Neo Christopher Chung · 2026-05-18

The authors propose aligned training, a parameter-free reparameterization of sparse autoencoders (SAEs) that improves feature quality and stability. The method enforces a geometric constraint where the inner product between encoder and decoder directions equals one for each feature, addressing degeneracy in SAE training without additional hyperparameters. Experiments across various models, dictionary sizes, and sparsity levels demonstrate Pareto improvements on SAEBench benchmarks, eliminating dead features while enhancing reconstruction quality and stability. The approach integrates seamlessly with techniques like Top/BatchTop-K architectures and p-Annealing, offering computational efficiency.

sparse autoencodersalignment scoredegeneracyp-annealingsaebench
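
The constraint itself is easy to state in code. The sketch below shows one simple way to satisfy it, rescaling each encoder direction by its inner product with the matching decoder direction. This rescaling is an assumption made for illustration; the paper's actual reparameterization may differ.

```python
from typing import List, Sequence


def align_encoder(enc_rows: Sequence[Sequence[float]],
                  dec_rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Rescale encoder directions so that <e_i, d_i> = 1 for every feature i.

    enc_rows[i] and dec_rows[i] are the encoder and decoder directions of
    feature i. Rescaling is a hypothetical way to meet the constraint.
    """
    aligned = []
    for e, d in zip(enc_rows, dec_rows):
        dot = sum(ei * di for ei, di in zip(e, d))
        if abs(dot) < 1e-12:
            raise ValueError("encoder and decoder directions are orthogonal")
        aligned.append([ei / dot for ei in e])
    return aligned
```

A feature whose encoder and decoder directions are orthogonal cannot be rescaled this way, so the sketch raises an error for that case rather than guessing.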

Learning to Look Benign: Targeted Evasion of Malware Detectors via API Import Injection

arXiv cs.LG · Juozas Dautartas, Olga Kurasova, Juozapas Rokas Čypas, Viktor Medvedev · 2026-05-18

The paper demonstrates targeted evasion attacks on malware detectors by strategically injecting benign Win32 API imports while preserving functionality. A Conditional Variational Autoencoder (CVAE) with strictly additive operations generates adversarial samples, guided by a differentiable proxy of the target ensemble detector. On a dataset of 3,799 executables, injecting 20 API imports reduced detector recall from 87.5% to 30%, with 99% of evaded samples misclassified into the intended benign category. Real-world validation showed a 54.5% reduction in VirusTotal detection rates, exposing vulnerabilities in API-based static analysis.

adversarial evasionconditional variational autoencoderwin32 api importsstatic malware detectionfunctionality-preserving attacks

Perfect Parallelization in Mini-Batch SGD with Classical Momentum Acceleration

arXiv cs.LG · Sachin Garg, Michał Dereziński · 2026-05-18

The paper develops a theoretical framework for classical momentum acceleration in mini-batch SGD, specifically addressing Polyak's heavy ball and Nesterov-style momentum. It establishes that acceleration is directly proportional to gradient mini-batch size, enabling perfect parallelization up to a saturation point. The analysis applies to quadratic optimization in the interpolation regime, covering deep learning dynamics and methods like randomized Kaczmarz. A theoretically justified momentum parameter choice is empirically validated. The framework requires minimal noise assumptions and accommodates arbitrary batch sizes.

momentum accelerationmini-batch sgdinterpolation regimeparallelizationquadratic optimization

Forecasting Downstream Performance of LLMs With Proxy Metrics

arXiv cs.LG · Arkil Patel, Siva Reddy, Marius Mosbach, Dzmitry Bahdanau · 2026-05-18

The authors propose proxy metrics for forecasting downstream performance of language models by aggregating token-level statistics from next-token distributions over expert-written solutions. These proxies, including entropy, top-k accuracy, and expert token rank, address limitations of cross-entropy loss and direct evaluation. Across three settings—cross-family model selection, pretraining data selection, and training-time forecasting—the proxies outperform baselines, achieving mean Spearman Rho = 0.81, reducing compute costs by 10,000×, and halving extrapolation error over an 18× compute horizon. Results demonstrate that expert trajectories provide reliable signals for model capability assessment throughout development.

proxy metricstoken-level statisticsspearman rhopretraining datatraining-time forecasting
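
The three named proxies are simple to compute once a model's next-token distributions over an expert solution are available. The sketch below assumes each distribution is a plain list of probabilities and that per-token values are aggregated by an unweighted mean. The function name, argument names, and the choice of mean are assumptions; the summary does not specify the aggregation.

```python
import math
from typing import Dict, Sequence


def proxy_metrics(dists: Sequence[Sequence[float]],
                  expert_ids: Sequence[int], k: int = 5) -> Dict[str, float]:
    """Mean entropy, top-k accuracy, and mean expert-token rank.

    dists[t]      : next-token probabilities at position t
    expert_ids[t] : index of the token the expert actually wrote at t
    """
    if not dists or len(dists) != len(expert_ids):
        raise ValueError("need one expert token per distribution")
    entropies, hits, ranks = [], [], []
    for probs, tok in zip(dists, expert_ids):
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0.0))
        # Rank 1 means the expert token is the model's most likely token.
        rank = 1 + sum(1 for p in probs if p > probs[tok])
        ranks.append(rank)
        hits.append(1.0 if rank <= k else 0.0)
    n = len(ranks)
    return {
        "mean_entropy": sum(entropies) / n,
        "top_k_accuracy": sum(hits) / n,
        "mean_expert_rank": sum(ranks) / n,
    }
```

None of these quantities requires sampling or grading model outputs, which is consistent with the large compute savings the summary reports.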

Physics-Aligned Canonical Equivariant Fourier Neural Operator under Symmetry-Induced Shifts

arXiv cs.LG · Jiaxiao Xu, Changhong Mou, Yeyu Zhang, Fengxiang He · 2026-05-18

The Physics-Aligned Canonical Equivariant Fourier Neural Operator (PACE-FNO) improves out-of-distribution generalization for PDE solution maps by decoupling coordinate alignment and physical evolution. PACE-FNO estimates the input frame via Lie-algebra coordinate estimation, maps to a reference frame, applies a standard Fourier Neural Operator (FNO), and restores the prediction to the target frame. Training jointly aligns frames and predicts operators using bounded symmetry perturbations, with optional inference-time frame refinement. Experiments on 1-D/2-D Burgers, shallow-water, and Navier-Stokes equations show PACE-FNO matches in-distribution accuracy of standard neural operators while reducing out-of-distribution relative error by up to 12x versus FNO with symmetry augmentation, particularly under translations and Galilean shifts.

fourier neural operatorequivariancelie algebrapdegeneralization
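
The estimate, canonicalize, apply, restore pipeline can be illustrated on a 1-D periodic signal under cyclic translation. In this sketch the frame estimate is simply the position of the largest value, a crude stand-in for the paper's Lie-algebra coordinate estimation, and `position_weighted` is a toy operator that is deliberately not translation-equivariant. All names are hypothetical.

```python
from typing import Callable, List


def position_weighted(v: List[float]) -> List[float]:
    """Toy operator that depends on absolute position (not equivariant)."""
    return [x * (i + 1) for i, x in enumerate(v)]


def canonical_apply(u: List[float],
                    operator: Callable[[List[float]], List[float]]) -> List[float]:
    """Apply `operator` in a canonical frame, then map back to the input frame."""
    n = len(u)
    # Frame estimate: index of the peak (assumes a unique maximum).
    shift = max(range(n), key=lambda i: u[i])
    # Map to the reference frame, where the peak sits at index 0.
    canon = u[shift:] + u[:shift]
    out = operator(canon)
    # Restore the prediction to the original frame.
    return out[n - shift:] + out[:n - shift]
```

Wrapping the operator this way makes the composite map commute with cyclic shifts of the input even though the inner operator does not, which is the decoupling of alignment from evolution that the summary describes.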

Pointwise Generalization in Deep Neural Networks

arXiv cs.LG · Shaojie Li, Yunbei Xu · 2026-05-18

The paper establishes a pointwise generalization theory for fully connected deep neural networks, resolving barriers in characterizing nonlinear feature learning and providing a statistical foundation for representation learning. The method introduces a pointwise Riemannian Dimension derived from eigenvalues of learned feature representations across layers, enabling hypothesis-dependent, representation-aware generalization bounds. These bounds significantly outperform existing approaches based on model size, norm products, and infinite-width linearizations, offering tighter theoretical and empirical guarantees. Results demonstrate substantial feature compression, decreased dimensionality with over-parameterization, and capture of optimizer implicit bias, indicating deep networks' tractability and sharp generalization via pointwise, feature-spectrum-aware complexity.

pointwise generalizationriemannian dimensionfeature compressionrepresentation learningover-parameterization

PACE: Geometry-Aware Bridge Transport for Single-Cell Trajectory Inference

arXiv cs.LG · Chenglei Yu, Chuanrui Wang, Bangyan Liao, Tailin Wu · 2026-05-18

PACE introduces a geometry-aware trajectory inference framework for single-cell RNA-seq data that addresses the ill-posed nature of reconstructing continuous dynamics from destructive time-course snapshots. The method combines three components: (1) an anisotropic Riemannian metric favoring transport along locally supported tangent directions, (2) alternating optimization of cross-time couplings and neural bridge fitting, and (3) distillation into a global velocity field. Evaluated on seven datasets with nine reconstruction tasks, PACE reduces MMD and Wasserstein distances by 23.7% on average versus baselines, while improving RNA-velocity alignment by 15.4% without requiring paired data or velocity supervision.

trajectory inferenceoptimal transportriemannian metricsingle-cell rna-seqvelocity field

S2Aligner: Pair-Efficient and Transferable Pre-Training for Sparse Text-Attributed Graphs

arXiv cs.LG · Yuhan Wang, Haopeng Zhang, Yibo Ding, Jiaqi Yu · 2026-05-18

S2Aligner introduces a sparsity-aware LLM-as-Aligner framework for pre-training on sparse text-attributed graphs (TAGs), addressing unreliable structure-semantics correspondence and transfer bias. The method decouples semantic alignment from structural modeling by decomposing graph-text representations, employing structure-oriented reconstruction with consistency control, and implementing sparsity-aware cross-domain risk balancing through global-domain density ratio and graph reliability estimation. Theoretical analysis demonstrates reduced cross-domain generalization gaps. Experiments across diverse graph domains, sparsity levels, and downstream tasks show S2Aligner consistently outperforms existing baselines.

text-attributed graphsllm-as-alignerstructure-semantics correspondencesparsity-awarecross-domain generalization

scHelix: Asymmetric Dual-Stream Integration via Explicit Gene-Level Disentanglement

arXiv cs.LG · Xichen Yan, Zelin Zang, Changxi Chi, Jingbo Zhou · 2026-05-18

The paper introduces scHelix, a novel framework for single-cell RNA sequencing (scRNA-seq) integration that addresses batch effect removal while preserving biological signals through explicit gene-level disentanglement. The method partitions genes into domain-invariant Anchors and domain-sensitive Variants, employing a dual-stream sparse diffusion encoder with stop-gradient graph caching and an asymmetric Align-Refine-Fuse protocol to align and refine features. This approach prevents over-correction and maintains cluster integrity. Benchmarking shows scHelix outperforms state-of-the-art methods in batch effect removal and biological fidelity preservation.

single-cell rna sequencingbatch effect removalgene-level disentanglementdual-stream encoderalign-refine-fuse protocol

GUT-IS: A Data-Driven Approach to Integrating Constructs and Their Relations in Information Systems

arXiv cs.LG · Maximilian Reinhardt, Jonas Scharfenberger, Burkhardt Funk · 2026-05-18

The authors propose GUT-IS, a data-driven methodology for integrating structural equation models in information systems research. Their approach combines task-adapted text embeddings with clustering to generate candidate construct groupings, then selects optimal solutions using a loss function that explicitly balances semantic purity against cluster parsimony. This enables systematic analysis of how construct groupings and relations evolve across the purity-parsimony tradeoff spectrum. The method is empirically evaluated on two information systems datasets, addressing the challenge of inconsistent construct definitions that hinder cumulative knowledge development in the field.

structural equation modelingtext embeddingsclusteringsemantic purityparsimony

Self-supervised local learning rules learn the hidden hierarchical structure of high-dimensional data

arXiv cs.LG · Ariane Delrocq, Wu S. Zihan, Guillaume Bellec, Wulfram Gerstner · 2026-05-18

The study identifies biologically plausible local learning rules capable of learning hierarchical representations from high-dimensional data without backpropagation. It evaluates two rule types on the Random Hierarchy Model (RHM): error-approximating feedback signals and self-supervised contrastive/non-contrastive losses. While error-approximating rules fail due to lacking input-specific nonlinearities, self-supervised rules successfully learn RHM's hierarchical structure with data efficiency matching supervised backpropagation, while maintaining biological plausibility.

local learning rulesrandom hierarchy modelcontrastive learningsynaptic plasticityhierarchical representation

Federated Martingale Posterior Sampling

arXiv cs.LG · Boning Zhang, Matteo Zecchin, Mingzhao Guo, Dongzhu Liu · 2026-05-18

The paper introduces Federated Martingale Posterior (FMP) sampling, a one-shot protocol for federated Bayesian neural networks that avoids prior-likelihood specification challenges. FMP replaces traditional Bayesian components with a predictive distribution, enabling parameter uncertainty estimation via predictive sampling while maintaining data privacy. Clients upload trainable data embeddings to a central server, which performs predictive sampling without raw data sharing. Experiments on MNIST, CIFAR-10, and CIFAR-100 demonstrate FMP's parity with centralized methods and superior calibration over consensus baselines.

federated learningbayesian neural networksmartingale posteriorpredictive samplingcalibration

Protein Fold Classification at Scale: Benchmarking and Pretraining

arXiv cs.LG · Dexiong Chen, Andrei Manolache, Mathias Niepert, Karsten Borgwardt · 2026-05-18

The authors introduce TEDBench, a large-scale, non-redundant benchmark for protein fold classification derived from the Encyclopedia of Domains (TED) and Foldseek-clustered AlphaFold structures. To address scalability and performance limitations of existing methods, they propose Masked Invariant Autoencoders (MiAE), a self-supervised framework employing an SE(3)-invariant encoder and lightweight decoder with up to 90% masking ratio for reconstructing backbone coordinates. MiAE outperforms supervised baselines on TEDBench and demonstrates transferability on CATH v4.4 experimental structures, establishing a robust approach for protein fold classification.

protein fold classificationself-supervised learningse(3)-invariant encodermasked autoencoderbackbone reconstruction

Beyond Scaling: Agents Are Heading to the Edge

arXiv cs.LG · Chunlin Tian, Dongqi Cai, Wanru Zhao, Nicholas D. Lane · 2026-05-18

The paper argues for edge-based personal-agent architectures, positing that agentic intelligence now requires local execution due to three structural shifts. First, the Prefrontal Turn emphasizes executive control frameworks over pre-training scale, requiring proximity to action environments. Second, the Data-Geography Paradox highlights how local context (file hierarchies, sensor streams) loses fidelity when transmitted to cloud systems. Third, the interaction-alignment loop identifies real-time local interaction as the optimal source for refinement data. The authors provide falsifiable predictions for next-generation agent deployment.

agentic intelligenceedge computingprefrontal turndata-geography paradoxinteraction-alignment loop

XCTFormer: Leveraging Cross-Channel and Cross-Time Dependencies for Enhanced Time-Series Analysis

arXiv cs.LG · Israel Zexer, Omri Azencot · 2026-05-18

XCTFormer introduces a transformer-based channel-dependent model for multivariate time-series analysis that explicitly captures cross-temporal and cross-channel dependencies through a novel Cross-Relational Attention Block (CRAB). The architecture includes a data processing module and an optional Dependency Compression Plugin (DeCoP) for scalability. Experiments on three benchmarks demonstrate state-of-the-art performance, particularly in imputation tasks, with average improvements of 20.8% in MSE and 15.3% in MAE over the second-best method.

multivariate time-seriescross-relational attentionchannel-dependent modelingdependency compressionimputation task

Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

arXiv cs.LG · Jiayu Zhang, Tianyi Lin · 2026-05-18

The paper establishes a dimension-dependent lower bound for scale-invariant first-order methods under heavy-tailed noise, showing Ω(min{m, n}ε^(-(3p-2)/(p-1))) oracle calls are required for ε-stationary points in nonconvex smooth stochastic optimization over ℝ^(m×n). A batched Scion method achieves the matching upper bound O(min{m, n}ε^(-(3p-2)/(p-1))), while a transported Scion method exploiting higher-order smoothness improves this to O(min{m, n}ε^(-(5p-3)/(2p-2))) under Lipschitz Hessian conditions. Practical implementations demonstrate the method's flexibility across architectures and model sizes.

scale-invariantheavy-tailed noisenonconvex optimizationspectral normstochastic gradient

Offline Contextual Bandits in the Presence of New Actions

arXiv cs.LG · Ren Kishimoto, Tatsuhiro Shimizu, Kazuki Kawamura, Takanori Muroi · 2026-05-18

The authors propose Policy Optimization for Effective New Actions (PONA), a novel off-policy learning method for contextual bandits that addresses the challenge of selecting new actions introduced after data collection. PONA integrates the Local Combination PseudoInverse (LCPI) estimator, which generalizes the PseudoInverse estimator for slate bandits, with the Doubly Robust (DR) estimator, balancing reward modeling and action feature interactions. LCPI captures multi-dimensional action feature effects, while DR optimizes existing action selection. Experiments show PONA effectively selects new actions without compromising overall policy performance, unlike existing methods that fail to handle new actions.

off-policy learningcontextual banditspseudo-inverse estimatordoubly robustaction features
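
For readers unfamiliar with the Doubly Robust component, the standard DR value estimate combines a reward model with an importance-weighted correction on the logged action. The sketch below shows that standard estimator only. The `LoggedRound` record and its field names are hypothetical, and the LCPI estimator from the paper is not reproduced.

```python
from typing import NamedTuple, Sequence


class LoggedRound(NamedTuple):
    action: int                    # action taken by the logging policy
    reward: float                  # observed reward
    logging_prob: float            # logging policy's probability of `action`
    target_probs: Sequence[float]  # target policy over all actions
    q_hat: Sequence[float]         # reward-model prediction per action


def doubly_robust_value(rounds: Sequence[LoggedRound]) -> float:
    """Standard doubly robust estimate of the target policy's value."""
    total = 0.0
    for r in rounds:
        # Model-based term: expected predicted reward under the target policy.
        baseline = sum(p * q for p, q in zip(r.target_probs, r.q_hat))
        # Correction term: importance-weighted residual on the logged action.
        weight = r.target_probs[r.action] / r.logging_prob
        total += baseline + weight * (r.reward - r.q_hat[r.action])
    return total / len(rounds)
```

An action introduced after data collection never appears as a logged action, so in this formula its contribution rests entirely on `q_hat`. That gap is the setting the summary says PONA targets.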

Adaptive Experimentation for Censored Survival Outcomes

arXiv cs.LG · Yuxin Wang, Dennis Frauen, Jonas Schweisthal, Maresa Schröder · 2026-05-18

The authors introduce a novel framework for adaptive experimentation with censored survival outcomes, addressing the gap in existing methods for right-censored data. They derive the semiparametric efficiency bound for the average survival effect curve and propose an efficiency-optimal allocation policy, generalizing Neyman allocation to survival settings. The Adaptive Survival Estimator (ASE) sequentially learns the allocation policy and estimates the survival effect curve, accommodating arbitrary machine learning models for nuisance estimation. Theoretical guarantees include asymptotic normality via a martingale central limit theorem. Numerical experiments demonstrate consistent efficiency gains over uniform randomization and censoring-agnostic baselines.

adaptive experimentationcensored survival outcomessemiparametric efficiency boundneyman allocationmartingale central limit theorem
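
As background, classical Neyman allocation for an uncensored two-arm experiment assigns units to treatment in proportion to the outcome standard deviation of each arm. The sketch below shows only that classical rule. The survival-specific generalization in the paper is not reproduced, and the function name is illustrative.

```python
def neyman_allocation(sd_treated: float, sd_control: float) -> float:
    """Classical Neyman treatment probability: sd_1 / (sd_1 + sd_0)."""
    total = sd_treated + sd_control
    if sd_treated < 0 or sd_control < 0 or total <= 0:
        raise ValueError("standard deviations must be non-negative, not both zero")
    return sd_treated / total
```

For example, if treated outcomes are twice as variable as control outcomes, the rule assigns two thirds of units to treatment. An adaptive design has to estimate those standard deviations as data arrives, which is why the allocation policy is learned sequentially.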

Heterogeneous Tasks Offloading in Vehicular Edge Computing: A Federated Meta Deep Reinforcement Learning Approach

arXiv cs.LG · Yaorong Huang, Jingtao Luo, Xuechao Wang · 2026-05-18

A Federated Meta Deep Reinforcement Learning framework with GAT-Seq2Seq modeling (FedMAGS) is proposed for heterogeneous task offloading in vehicular edge computing systems. The method leverages Graph Attention Networks to capture DAG dependencies, Seq2Seq-based policies for structured offloading decisions, and federated meta-learning for fast adaptation across distributed MEC servers without raw data sharing. Simulations demonstrate FedMAGS achieves faster convergence, 23.7% lower execution delay, and better scalability compared to state-of-the-art baselines, while preserving data privacy and reducing communication overhead in dynamic, large-scale VEC environments.

graph attention networksseq2seqfederated meta-learningvehicular edge computingtask offloading

Text2CAD-Bench: A Benchmark for LLM-based Text-to-Parametric CAD Generation

arXiv cs.LG · Liang Wang, Heng Meng, Zekai Xiang, Jin Liu · 2026-05-18

The authors introduce Text2CAD-Bench, the first benchmark for evaluating text-to-parametric-CAD generation across geometric complexity and application diversity. The benchmark comprises 600 human-curated examples spanning four levels (L1-L4) from basic geometry to complex topology and real-world domains, each paired with dual-style prompts (non-expert descriptions and expert procedural sequences). Evaluation of general and domain-specific LLMs reveals reasonable performance on basic geometry but significant degradation on complex topology and advanced features. The benchmark is released to advance text-to-CAD research.

text-to-cadparametric modelinggeometric complexityllm evaluationcad generation

Generative Adversarial Learning from Deterministic Processes

arXiv cs.LG · Joris C. Kühl, Hanno Gottschalk · 2026-05-18

The paper demonstrates that generative adversarial networks (GANs) can learn invariant distributions from deterministic chaotic dynamical systems, challenging traditional i.i.d. assumptions. Using an infinite-dimensional model of generative adversarial learning (GAL), the authors prove convergence to the solution with explicit rates in Jensen-Shannon divergence. Theoretical analysis shows this approach successfully handles single time-series data from chaotic systems like turbulence, providing a foundation for physical AI applications.

generative adversarial networkschaotic dynamical systemsjensen-shannon divergenceinvariant distributionphysical ai
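
The convergence rates are stated in Jensen-Shannon divergence. For two discrete distributions this is the average KL divergence of each from their midpoint, as in the minimal sketch below. The discrete setting and the function name are assumptions for illustration, since the paper works with an infinite-dimensional model.

```python
import math
from typing import Sequence


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (natural log) between discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same support size")
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with a_i = 0 contribute nothing; b_i > 0 wherever a_i > 0.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0.0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

The divergence is symmetric, equals zero only for identical distributions, and is bounded above by log 2, reached when the two distributions have disjoint support.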

Generalized Functional ANOVA in Closed-Form: A Unified View of Additive Explanations

arXiv cs.LG · Baptiste Ferrere, Nicolas Bousquet, Fabrice Gamboa, Jean-Michel Loubes · 2026-05-18

The authors present a closed-form solution for generalized functional ANOVA decompositions under input dependencies, unifying additive explanation methods. By constructing a Riesz Basis via Hilbert space methods, their framework extends the classical independent-case orthogonal decomposition to dependent continuous inputs. They propose a model-agnostic estimation algorithm that outperforms state-of-the-art explanation techniques in empirical evaluations, bridging SHAP values, generalized additive models, and orthogonal polynomial expansions.

functional anovariesz basishoeffding decompositionshap valuesgeneralized additive models

TabH2O: A Unified Foundation Model for Tabular Prediction

arXiv cs.LG · Pascal Pfeiffer, Dmitry Gordeev, Mathias Müller, Laura Fink · 2026-05-18

TabH2O introduces a unified foundation model for tabular prediction, enabling classification and regression via in-context learning within a single forward pass. Key innovations include unified training with a dual-head architecture, single-stage pretraining with stability enhancements, and noise-aware pretraining using synthetic datasets. Evaluated on the TALENT benchmark (300 datasets), TabH2O v1 (29.2M parameters) achieves an average rank of 2.55/6, outperforming CatBoost (4.07), H2O AutoML (4.18), and LightGBM (5.08), while remaining competitive with TabPFN v2.6 (2.74) and TabICL v2 (2.12), placing top-3 on 81% of datasets.

in-context learning · dual-head architecture · noise-aware pretraining · tabular prediction · single-stage pretraining

Generating Physically Consistent Molecules with Energy-Based Models

arXiv cs.LG · Christoph Griesbacher, Lea Bogensperger, Andreas Habring, Thomas Pock · 2026-05-18

The authors introduce EBMol, an energy-based model for generating physically consistent 3D molecules by learning an atom-additive scalar potential. The method employs a flow-inspired Restoring Field Matching objective to approximate the energy landscape, using Mirror-Langevin sampling with parallel tempering for inference. EBMol achieves state-of-the-art performance on QM9 and GEOM-Drugs benchmarks, with the learned energy landscape serving as a principled quality metric and enabling controllable generation through potential composition and zero-shot linker design.

energy-based model · molecular generation · mirror-langevin · parallel tempering · restoring field matching

Beyond Square Roots: Explicit Memory-Efficient Factorization for Multi-Epoch Private Learning

arXiv cs.LG · Nikita P. Kalinin, Aki Rehn, Joel Daniel Andersson, Antti Honkela · 2026-05-18

The paper introduces $γ$-BIFR, a memory-efficient factorization method for differentially private multi-epoch training, unifying DP-$λ$CGD and banded inverse square root (BISR) factorizations. By exploiting banded structures in correlation matrices, $γ$-BIFR optimizes the tradeoff between utility and memory cost, particularly in low-memory, low-bandwidth regimes. Theoretical analysis shows tighter guarantees for multi-participation error, while empirical results demonstrate significant improvements in RMSE, amplified RMSE, and private training performance compared to existing methods.

differentially private · banded factorization · multi-epoch training · memory efficiency · correlation matrix

Dynamic robotic cloth folding with efficient Koopman operator-based model predictive control

arXiv cs.LG · Edoardo Caldarelli, Franco Coltraro, Adrià Colomé, Lorenzo Rosasco · 2026-05-18

A novel model predictive control approach is proposed for dynamic robotic cloth folding, integrating physics-based simulation and kernel-based Koopman operator regression. The method employs Koopman operator regression to derive a linear surrogate model from high-fidelity cloth simulation data, enabling efficient trajectory generation within a model predictive control framework. Experimental results demonstrate that this approach generates fast folding trajectories to unseen poses while maintaining accuracy, bridging the simulation-to-reality gap in robotic cloth manipulation.

model predictive control · koopman operator · cloth folding · system identification · trajectory generation

On Stability and Decomposition of Sample Quantiles under Heavy-Tailed Distributions

arXiv cs.LG · Choudur Lakshminarayan · 2026-05-18

This paper introduces a Q-Q orthogonality formulation to separate projection-direction and quantile-threshold effects in sample quantiles under heavy-tailed distributions, particularly in Value-at-Risk contexts. The method decomposes the difference between empirical and population quantiles into three terms: $D_1$ captures population quantile movement due to projection direction perturbation, $D_2$ measures empirical quantile fluctuation with fixed projection direction, and $D_3$ represents the Bahadur-type remainder. Empirical-process theory and Glivenko-Cantelli uniform convergence provide stability bounds, addressing the intrinsic local quantile-stability problem.

quantile stability · heavy-tailed distributions · value-at-risk · bahadur representation · glivenko-cantelli convergence

Proximal basin hopping: global optimization with guarantees

arXiv cs.LG · Guillaume Lauga, Cesare Molinari, Samuel Vaiter · 2026-05-18

Proximal Basin Hopping (PBH) is a global optimization framework that integrates proximal optimization with local minimization. The resulting algorithm converges to the global minimizer with high probability using a finite number of samples. On synthetic hard functions and real-world problems, such as fitting scaling laws for deep learning, PBH outperforms established algorithms with theoretical guarantees, particularly in higher-dimensional settings.

global optimization · proximal optimization · local minimization · scaling laws · high-dimensional performance

Decoupled Conformal Optimisation: Efficient Prediction Sets via Independent Tuning and Calibration

arXiv cs.LG · Fanyi Wu, Lihua Niu, Samuel Kaski, Michele Caprio · 2026-05-18

The paper introduces Decoupled Conformal Optimisation (DCO), a method for efficient conformal prediction sets by separating structural tuning from calibration. DCO uses independent data splits for tuning (optimizing prediction-set efficiency) and calibration (ensuring marginal coverage), unlike traditional Bayesian conformal methods that couple these steps. The approach guarantees finite-sample marginal coverage without requiring high-probability risk control or multiple-testing corrections. Empirical results on ImageNet-A, CIFAR-100, and regression datasets show DCO maintains nominal coverage while reducing average prediction-set size (e.g., from 26.52 to 25.26 on ImageNet-A) and interval width (e.g., from 2.098 to 1.914 on Diabetes).
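
The separation of tuning from calibration can be sketched as follows. This is a generic split-conformal illustration for regression, not the authors' implementation: the names `decoupled_intervals` and `candidates`, the 50/50 split, and the absolute-residual score are all assumptions made for the example.

```python
import math
import random

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of nonconformity scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    return math.inf if k > n else sorted(scores)[k - 1]

def decoupled_intervals(data, candidates, alpha=0.1, seed=0):
    """data: list of (x, y); candidates: dict mapping a name to a predictor f(x)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    tune, cal = shuffled[:half], shuffled[half:]  # two independent splits

    def radius(split, f):
        return conformal_quantile([abs(y - f(x)) for x, y in split], alpha)

    # Tuning split: choose the candidate that yields the narrowest intervals.
    best = min(candidates, key=lambda name: radius(tune, candidates[name]))
    f = candidates[best]
    # Calibration split: data untouched by the selection step sets the radius,
    # which is what keeps the marginal coverage guarantee intact.
    q = radius(cal, f)
    return best, q, lambda x: (f(x) - q, f(x) + q)
```

Because the calibration scores never influence which candidate is selected, the usual exchangeability argument for split conformal prediction still applies to the final intervals.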

conformal prediction · marginal coverage · risk control · prediction sets · calibration

Hybrid Quantum-Classical Neural Architecture Search

arXiv cs.LG · Alberto Marchisio, Muhammad Kashif, Nouhaila Innan, Muhammad Shafique · 2026-05-18

The paper introduces a hybrid quantum-classical neural architecture search (NAS) framework for optimizing the design of hybrid quantum-classical neural networks (HQNNs) in the NISQ era. It extends classical NAS techniques to incorporate quantum components, focusing on FLOPs-aware search as a proxy for computational complexity to ensure both accuracy and hardware efficiency. The approach addresses architectural choices such as data encoding, circuit structure, and classical-quantum coupling, which are critical for practical deployment. The study highlights the importance of hardware-aware optimization in building deployable HQNNs.

hybrid quantum-classical neural networks · neural architecture search · nisq era · flops-aware search · parameterized quantum circuits

Robust Player-Conditional Champion Ranking for League of Legends: Style Similarity, Mastery Priors, and Archetype-Constrained Discovery

arXiv cs.LG · Min Heo, Pranav Kadiyam, Prasun Panthi · 2026-05-18

The paper formalizes champion recommendation in League of Legends as an interpretable, player-conditional ranking problem, addressing sparse and non-stationary behavioral data. The method integrates four components: population-strength proxies, player-style similarity, mastery priors, and archetype guardrails, using robust normalization, logarithmic transforms, recency-weighted vectors, and k-means++ clustering. A prototype implementation demonstrates decomposed recommendation scores (performance, fit, mastery, archetype compatibility) and includes a single-player case study. The framework supports future validation via temporal splits, next-champion recovery, and ablation studies.

player-conditional ranking · robust normalization · recency-weighted vectors · archetype guardrails · k-means++ clustering

QLIF-CAST: Quantum Leaky-Integrate-and-Fire for Time-Series Weather Forecasting

arXiv cs.LG · Alberto Marchisio, Aayan Ebrahim, Nouhaila Innan, Muhammad Kashif · 2026-05-18

QLIF-CAST introduces a quantum spiking neural network for multivariate weather forecasting, extending the Quantum Leaky Integrate-and-Fire (QLIF) model to continuous-valued time-series prediction. The hybrid quantum-classical architecture encodes neuron states via single-qubit superpositions using Rx gates and T1 decay. Evaluations show 15.4% lower MSE and 4.4% lower MAE versus classical LIF, with 94% faster convergence than QLSTM/QNN baselines. Hardware tests on IBM Marrakesh confirm 1.2% deviation from simulation.

quantum spiking neural network · leaky-integrate-and-fire · time-series forecasting · hybrid quantum-classical · t1 relaxation

Prune, Update and Trim: Robust Structured Pruning for Large Language Models

arXiv cs.LG · Diego Coello de Portugal Mecke, Tom Hanika, Lars Schmidt-Thieme · 2026-05-18

Putri introduces a robust structured pruning method for Large Language Models (LLMs) that updates un-pruned weights, sequentially prunes FFN layers, and removes individual attention heads instead of full layers. The method extends to Grouped-Query Attention and maintains simplicity while achieving state-of-the-art performance. Experiments across various models, sparsity ranges, and datasets validate Putri's generality, demonstrating its effectiveness even at extreme sparsity ratios. The code is publicly available for reproducibility.

structured pruning · large language models · grouped-query attention · sparsity ratios · post-training pruning

Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation

arXiv cs.LG · Kirscher Tristan, Bujotzek Markus, Kirchhoff Yannick, Rokuss Maximilian · 2026-05-18

The study examines how K-fold cross-validation (CV) ensembles are often mislabeled as deep ensembles (DE) in medical image segmentation, leading to conflated uncertainty estimates. Comparing 5-fold CV with 5-member DE (same training data, different seeds) on multi-rater datasets across three modalities, DE matches segmentation accuracy while improving calibration and failure detection, whereas CV ensembles better capture inter-rater variability. A modified nnU-Net enables efficient DE training. Results suggest DE for reliability tasks and CV for ambiguity modeling.
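
The distinction between the two ensemble types comes down to what varies between members. A minimal sketch of the two training plans, with hypothetical function and key names, might look like this:

```python
import random

def cv_ensemble_plan(n_samples, k=5, seed=0):
    """K-fold CV ensemble: each member trains on K-1 folds, so the data differs."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    plans = []
    for held_out in range(k):
        train = sorted(i for f, fold in enumerate(folds) if f != held_out for i in fold)
        plans.append({"train_idx": train, "val_idx": sorted(folds[held_out])})
    return plans

def deep_ensemble_plan(n_samples, members=5, seed=0):
    """Deep ensemble: every member trains on all data; only the seed differs."""
    return [
        {"train_idx": list(range(n_samples)), "init_seed": seed + m}
        for m in range(members)
    ]
```

With five folds, each CV member sees only 80% of the training set, so disagreement between members mixes data variation with initialization variation, whereas deep-ensemble disagreement reflects seed-driven variation alone.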

cross-validation · deep ensembles · uncertainty estimation · medical image segmentation · nnunet

The Symmetries of Three-Layer ReLU Networks

arXiv cs.LG · Johanna Marie Gegenfurtner, Moritz Grillo, Guido Montúfar · 2026-05-18

The paper develops a framework for analyzing parameter symmetries in three-layer ReLU networks with bottleneck architectures, providing explicit semi-algebraic descriptions of generic parameter fibers. The method yields a polynomial-time algorithm for determining functional equivalence between parameter sets. The symmetries encompass both discrete and continuous transformations arising from layer composition, influenced by whether deeper layers preserve or obscure geometric structure from preceding layers. Additionally, the study demonstrates that certain symmetries induce local conservation laws along gradient flow, while others do not.

relu networks · parameter symmetries · semi-algebraic descriptions · functional equivalence · gradient flow

Dynamic Elliptical Graph Factor Models via Riemannian Optimization with Geodesic Temporal Regularization

arXiv cs.LG · Chuansen Peng, Xiaojing Shen · 2026-05-18

The authors propose Dynamic Estimation on the Grassmann Manifold with a Factor Model (DEGFM), an algorithm for inferring time-varying graph structures from high-dimensional nodal observations. DEGFM addresses two key challenges: maintaining temporal coherence across observation windows and respecting the Riemannian geometry of the symmetric positive definite manifold. The method models precision matrices as a low-rank-plus-diagonal structure governed by a latent elliptical graph factor model, reducing parameter count and enabling reliable estimation in small-sample regimes. Temporal coherence is enforced via a Riemannian geodesic penalty on the Grassmann manifold. Experiments on synthetic and real-world datasets demonstrate DEGFM's superior performance across evaluation metrics.

grassmann manifold · riemannian geometry · precision matrix · temporal coherence · low-rank-plus-diagonal

Temporal Task Diversity: Inductive Biases Under Non-Stationarity in Synthetic Sequence Modelling

arXiv cs.LG · Afiq Abdillah Effiezal Aswadi, Oliver Britton, Ross Baker, Matthew Farrugia-Roberts · 2026-05-18

This paper investigates how temporal task diversity influences inductive biases in neural networks under non-stationary data distributions. Using in-context linear regression sequence modeling as a testbed, the authors analyze the impact of diversifying task distributions across training time on generalization and memorization tendencies. Results indicate that temporal diversity increases the bias towards generalization over memorization, contrasting with fixed task distributions where small transformers exhibit varied generalization patterns. This study provides insights into the structural and safety properties of models trained on evolving data.

inductive bias · non-stationarity · in-context learning · temporal diversity · sequence modeling

Geometric Dictionary Learning of Dynamical Systems with Optimal Transport

arXiv cs.LG · Thibaut Germain, Sami Chemlal, Rémi Flamary, Vladimir R. Kostic · 2026-05-18

DOODL (Dynamical OperatOr Dictionary Learning) introduces a framework for learning a dictionary of characteristic spectral dynamics to approximate a low-dimensional manifold in spectral operator space, enabling compact and interpretable embeddings of related dynamical systems. By constraining operator estimation to this learned manifold, DOODL facilitates fast and interpretable estimation from short and partially observed trajectories. Experiments on metastable Langevin dynamics and turbulent plasma simulations demonstrate that DOODL captures characteristic spectral structure governing dynamics, achieving errors one to two orders of magnitude lower than independent operator estimation methods in low-data regimes.

dynamical systems · operator-theoretic representations · spectral dynamics · low-dimensional manifold · interpretable embeddings

Subject-Specific Analysis of Self-Initiated Attention Shifts from EEG with Controlled Internal and External Attention Conditions

arXiv cs.LG · Yuwen Zeng, Dengzhe Hou, Zhang Zhang, Sai Sun · 2026-05-18

This study advances the characterization of self-initiated attention shifts through interpretable EEG analysis, demonstrating reliable within-subject classification of preparatory neural activity. Using a controlled paradigm comparing self-initiated and externally instructed attention shifts under identical visual stimulation, the authors employ machine learning with frequency-specific topographic pattern analysis and SHAP-based feature attribution. Results indicate that higher-frequency EEG bands and frontal regions provide discriminative information, though potential non-neural artifacts warrant caution. The work highlights interpretable machine learning's utility for subject-specific EEG pattern analysis, with implications for personalized brain-machine interfaces.

electroencephalography · attention shifts · shapley additive explanations · brain-machine interface · frequency-specific patterns

A Unified Framework for Structured Flow Modeling: From Continuous Fields to Data-Driven Representations

arXiv cs.LG · Diego Casadei · 2026-05-18

The paper presents a unified framework for modeling structured flows in dynamical systems, connecting continuous Helmholtz-Hodge formulations with discrete and data-driven representations. It introduces the Graph Vector Field (GVF) framework for decomposing dynamics into gradient, curl, and harmonic components on simplicial complexes, alongside a hierarchy of alternative modeling approaches balancing expressivity and tractability. A cross-domain validation strategy using physical system datasets enables systematic evaluation of model complexity, interpretability, and predictive performance, supporting an iterative methodology for scalable analysis of complex dynamics.
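
To make the gradient/curl/harmonic split concrete, here is the smallest possible instance: an edge flow on a single filled triangle, where the harmonic component vanishes. This is a hand-worked toy, not the GVF implementation; the function name and edge ordering are assumptions for illustration.

```python
def hodge_decompose_triangle(flow):
    """Split an edge flow on one filled triangle into gradient and curl parts.

    flow = [f01, f12, f20], measured along the oriented cycle 0 -> 1 -> 2 -> 0.
    """
    circulation = sum(flow)                # net flow around the cycle
    curl_part = [circulation / 3.0] * 3    # projection onto the rotational direction
    grad_part = [f - c for f, c in zip(flow, curl_part)]
    # The gradient part is a difference of node potentials; fix phi_0 = 0.
    potential = [0.0, grad_part[0], grad_part[0] + grad_part[1]]
    return grad_part, curl_part, potential
```

For example, the flow `[1.0, 2.0, 3.0]` splits into a curl part `[2.0, 2.0, 2.0]` and a gradient part `[-1.0, 0.0, 1.0]`, which are orthogonal. On larger simplicial complexes the same projections require solving least-squares problems with the boundary operators, and a nonzero harmonic part can appear around unfilled cycles.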

helmholtz-hodge decomposition · graph vector field · simplicial complexes · structured flows · dynamical systems

Attacking the First-Principle: A Black-Box, Query-Free Targeted Mimicry Attack on Binary Function Classifiers

arXiv cs.LG · Gabriel Sauger, Jean-Yves Marion, Sazzadur Rahaman, Victor Matrat · 2026-05-18

Kelpie introduces a black-box, query-free framework for executing targeted mimicry attacks on binary function classifiers, bypassing the need for direct interaction with the target model. The method employs code transformations that preserve malicious functionality while inducing misclassification into desired benign categories. Experiments demonstrate Kelpie's effectiveness against six state-of-the-art binary function classifiers across diverse architectures, validated through practical deployment of concealed malware (keylogger, wiper) within benign-looking functions. This work is the first to achieve such attacks in a black-box, zero-query setting, challenging the security assumptions of ML-based binary function classifiers.

binary function classifiers · mimicry attacks · black-box attacks · code transformations · zero-query setting

SIREM: Speech-Informed MRI Reconstruction with Learned Sampling

arXiv cs.LG · Md Hasan, Nyvenn Castro, Daiqi Liu, Lukas Mulzer · 2026-05-18

SIREM introduces a speech-informed MRI reconstruction framework that leverages synchronized speech as a cross-modal prior to enhance real-time MRI (rtMRI) of speech production. The method models each frame as a fusion of an audio-driven component, predicting articulator-related structure from speech, and an MRI-driven component, reconstructing complementary content from k-space data, via a spatial weighting map. It further incorporates a learnable soft weighting profile over spiral arms for differentiable sampling adaptation. Evaluated on the USC speech rtMRI benchmark, SIREM outperforms baselines like gridding and compressed sensing, achieving higher throughput while preserving anatomically plausible vocal-tract structure.

rtmri · k-space · cross-modal prior · spatial weighting map · compressed sensing

Forward-Learned Discrete Diffusion: Learning how to noise to denoise faster

arXiv cs.LG · Grigory Bartosh, Teodora Pandeva, Sushrut Karmalkar, Javier Zazo · 2026-05-18

Forward-Learned Discrete Diffusion (FLDD) introduces a learnable forward process to improve discrete diffusion models' efficiency. Unlike conventional approaches that fix a Markovian forward chain, FLDD employs a non-Markovian formulation with learnable marginal and posterior distributions, enabling the generative process to remain factorized while aligning with the target distribution. All parameters are trained end-to-end under the standard variational objective. Experiments demonstrate that FLDD generates higher-quality samples than traditional discrete diffusion models for the same number of sampling steps across various benchmarks.

discrete diffusion · non-markovian · factorized distributions · variational objective · generative models

Dual-Rate Diffusion: Accelerating diffusion models with an interleaved heavy-light network

arXiv cs.LG · Grigory Bartosh, David Ruhe, Emiel Hoogeboom, Jonathan Heek · 2026-05-18

Dual-Rate Diffusion accelerates diffusion models by interleaving a high-capacity context encoder (evaluated sparsely) with a lightweight denoising model (executed at every step), reusing extracted features for efficient refinement. This method maintains sample quality while reducing computational costs by 2-4× on ImageNet benchmarks. The approach remains compatible with distillation techniques like Moment Matching Distillation, enabling additional efficiency gains in few-step generation.
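
The interleaving idea can be sketched as a sampling loop in which the expensive encoder runs only occasionally. The fixed `refresh_every` interval and the callable signatures below are assumptions for illustration; the summary only says the encoder is evaluated sparsely.

```python
def dual_rate_sample(x, num_steps, refresh_every, heavy_encoder, light_denoiser):
    """Run a denoising loop that reuses cached heavy features between refreshes.

    heavy_encoder(x, step) -> features        (expensive, called sparsely)
    light_denoiser(x, features, step) -> x    (cheap, called every step)
    """
    if refresh_every < 1:
        raise ValueError("refresh_every must be at least 1")
    features = None
    for step in range(num_steps):
        if step % refresh_every == 0:
            features = heavy_encoder(x, step)   # recompute the context features
        x = light_denoiser(x, features, step)   # refine using the cached features
    return x
```

With 12 steps and a refresh every 4 steps, the heavy network runs 3 times and the light network 12 times, which is where the compute saving comes from.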

diffusion models · computational efficiency · context encoder · denoising model · moment matching distillation

UTOPYA: A Multimodal Deep Learning Framework for Physics-Informed Anomaly Detection and Time-Series Prediction

arXiv cs.LG · Robson W. S. Pessoa, Julien Amblard, Alessandra Russo, Idelfonso B. R. Nogueira · 2026-05-18

The paper introduces UTOPYA, a 15.2M-parameter multimodal framework for physics-informed anomaly detection, time-series prediction, and phase classification in batch distillation. The method fuses eight data modalities via FiLM-conditioned cross-modal attention and gated fusion, with physics-informed regularization enforcing temporal smoothness and thermodynamic monotonicity. On a 119-experiment dataset, UTOPYA achieves 0.832-0.874 AUROC, outperforming baselines by +0.147 AUROC, with ablations showing FiLM conditioning (+0.145 AUROC) as critical and revealing negative impacts from common regularization techniques.

multimodal fusion · anomaly detection · film conditioning · physics-informed regularization · time-series prediction

Canonical Regularisation of Wide Feature-Learning Neural Networks

arXiv cs.LG · George Whittle, Pranav Vaidhyanathan, Juliusz Ziomek, Natalia Ares · 2026-05-18

The paper introduces a canonical regularisation framework for wide neural networks in the feature-learning regime, addressing the bias introduced by ridge regularisation in gradient flow training. By axiomatising the canonical regulariser as a regime-agnostic function-space energy and leveraging Riemannian geometry, the authors derive geodesic ridge, generalising ridge regularisation to feature-learning networks. They prove the canonical function-space prior is a Riemannian Gibbs Process, extending the Gaussian Process framework. Practical contributions include arc ridge, a scalable surrogate to geodesic ridge, and empirical validation on image processing and NLP transfer-learning tasks, demonstrating the interplay between early stopping and canonical regularisation across learning regimes.

feature-learning regime · ridge regularisation · riemannian geometry · geodesic ridge · riemannian gibbs process

Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method

arXiv cs.LG · Abdurakhmon Sadiev, Artavazd Maranjyan, Ivan Ilin, Peter Richtárik · 2026-05-18

Ringmaster LMO introduces an asynchronous Linear Minimization Oracle (LMO)-based momentum method for stochastic nonconvex optimization, addressing synchronization bottlenecks in heterogeneous distributed systems. The method extends Ringmaster ASGD's delay-thresholding mechanism to LMO-based updates, discarding stale gradients to achieve optimal time complexity. Convergence guarantees are established under generalized $(L_0, L_1)$-smoothness, with a parameter-agnostic variant featuring decreasing stepsizes and adaptive delay thresholds. Time complexity bounds recover Ringmaster ASGD's optimality in Euclidean smooth settings. Empirical results on stochastic quadratic problems and NanoChat pretraining demonstrate superior performance over synchronous and asynchronous baselines, particularly in heterogeneous systems.
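
A single server-side update combining the two ingredients named above, delay thresholding and an LMO step on a momentum buffer, might be sketched as follows. The choice of the l-infinity ball, the strict `delay > max_delay` test, and all parameter names are assumptions for illustration rather than the paper's exact algorithm.

```python
def lmo_linf(direction, radius=1.0):
    """Minimize <direction, s> over the l-infinity ball of the given radius."""
    # The minimizer moves every coordinate against the sign of the direction.
    return [-radius if d > 0 else radius if d < 0 else 0.0 for d in direction]

def async_lmo_step(x, momentum, grad, delay, max_delay, beta=0.1, stepsize=0.01):
    """Apply one worker gradient, or discard it if it is too stale."""
    if delay > max_delay:
        return x, momentum, False  # stale gradient: ignore it entirely
    momentum = [(1 - beta) * m + beta * g for m, g in zip(momentum, grad)]
    update = lmo_linf(momentum)
    x = [xi + stepsize * ui for xi, ui in zip(x, update)]
    return x, momentum, True
```

With the l-infinity ball, the LMO step reduces to a sign-based update on the momentum; other norm balls would give different update geometries.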

linear minimization oracle · asynchronous optimization · delay-thresholding · stochastic nonconvex optimization · heterogeneous systems

Buffer-Parameterized Machine Learning Surrogate Models for Cross-Technology Signal Integrity Analysis and Optimization

arXiv cs.LG · Julian Withöft, Werner John, Emre Ecik, Ralf Brüning · 2026-05-18

The paper introduces a buffer-parameterized machine learning surrogate modeling framework for cross-technology signal integrity analysis in PCB interconnects, eliminating the need for retraining across technology shifts. The methodology treats IC buffer characteristics (e.g., clock frequency, supply voltage, rise/fall times) as dynamic inputs alongside PCB parameters, enabling generalization across diverse technologies. A benchmarking study compares tree-based methods, kernel methods, Gaussian process regression, and neural networks, with neural networks outperforming others on large datasets. Validated on a 44-parameter interconnect, the framework demonstrates significant computational speedups in eye mask compliance checking compared to simulation.

surrogate modeling · signal integrity · cross-technology · neural networks · gaussian process regression

Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs

arXiv cs.LG · Junyi Wu, Tianchen Zhao, Shaoqiu Zhang, Linfeng Zhang · 2026-05-18

The paper introduces Elastic-dLLM, a method for position-preserving context compression and augmentation in diffusion large language models (dLLMs). It addresses computational redundancy in dLLMs by compressing redundant [MASK] token computations and augmenting context with a terminal [MASK] token. This approach accelerates decoding and enables long-context scaling for full-sequence dLLMs like LLaDA-8B-Instruct and LLaDA-1.5, while enhancing generation quality for block dLLMs like LLaDA2.0-mini with minimal overhead. The method is validated through systematic analysis of [MASK] token redundancy and its structural role in dLLMs.

diffusion llms · mask token · context compression · parallel decoding · long-context scaling

Foundation Models for Credit Risk Prediction: A Game Changer?

arXiv cs.LG · Bart Baesens, Andreas Goethals, Stefan Lessmann, Simon De Vos · 2026-05-18

The paper benchmarks tabular foundation models against traditional machine learning techniques for credit risk prediction, specifically probability of default (PD) and loss given default (LGD) modeling. Using diverse datasets and performance metrics, the study evaluates pretrained models in small-data settings like SME lending. Results show foundation models outperform competitors across tasks, particularly with limited data, while requiring no hyperparameter tuning. The out-of-the-box performance suggests practical advantages for imbalanced datasets and low-default portfolios.

tabular foundation models · credit risk prediction · probability of default · loss given default · small-data settings

pyforce-1.0.0: Python Framework for data-driven model Order Reduction of multi-physiCs problEms

arXiv cs.LG · Stefano Riva, Yantao Luo, Carolina Introini, Antonio Cammi · 2026-05-18

pyforce-1.0.0 introduces a Python framework for data-driven model order reduction in multi-physics problems, particularly in nuclear engineering. The framework implements Reduced Order Modelling (ROM) techniques to reduce model complexity, optimize sensor placement, and integrate real measurements for enhanced system understanding. Compared to its previous dolfinx-based version (v0.6.0), pyforce-1.0.0 has been redesigned using pyvista for mesh handling, integral computation, and visualization, with functions stored as numpy arrays for improved usability. This update enables compatibility with any solver exporting results in VTK format, broadening its applicability.

model order reduction · multi-physics · nuclear engineering · pyvista · vtk format

The Expressive Power of Low Precision Softmax Transformers with (Summarized) Chain-of-Thought

arXiv cs.LG · Moritz Brösamle, Stephan Eckstein · 2026-05-18

This work bridges the gap between theoretical expressivity results and practical transformer architectures by analyzing standard transformer decoders with softmax attention and activation rounding, where depth and width grow logarithmically with context length. The authors construct hardmax transformers with ternary activations and well-separated attention scores to simulate Turing machines using Chain-of-Thought (CoT), then convert these to equivalent softmax transformers without unrealistic parameter magnitudes or activation precision. They extend this analysis to summarized CoT, showing more efficient Turing machine simulation with logarithmic model size scaling in space rather than time. Empirical validation on a Sudoku reasoning task demonstrates better alignment with learnability compared to prior high-precision results.

softmax attention · chain-of-thought · turing machines · activation rounding · logarithmic scaling

Equilibrium Selection in Multi-Agent Policy Gradients via Opponent-Aware Basin Entry

arXiv cs.LG · Yevhen Shcherbinin, Arina Redina, Maxim Kalpin, Vlad Kochetov · 2026-05-18

The paper proposes an opponent-aware basin entry method for equilibrium selection in multi-agent policy-gradient algorithms, addressing the challenge of local convergence to arbitrary Nash equilibria. By decomposing the Meta-MAPG update into ordinary policy gradient plus own-learning and peer-learning corrections, the method increases basin-entry probability for target equilibria (e.g., payoff-dominant ones) under local alignment conditions. Theoretical analysis shows controlled sampling noise and finite-unroll bias, with annealed corrections preserving stable-Nash convergence. Experiments in Stag Hunt, iterated Prisoner's Dilemma, and neural-policy coordination demonstrate improved cooperative equilibrium selection.

multi-agent reinforcement learning · policy gradient methods · nash equilibrium · basin entry · opponent-aware learning

Wasserstein bounds for denoising diffusion probabilistic models via the Föllmer process

arXiv cs.LG · Yuta Koike · 2026-05-18

The paper establishes Wasserstein error bounds for denoising diffusion probabilistic models (DDPMs) via three contributions. First, it derives sharp upper bounds under Lipschitz-type score function conditions, optimal in dimension and step count, encompassing cosine schedules. Second, it proves these conditions imply logarithmic Sobolev inequalities and quadratic transportation cost inequalities, enabling optimal Wasserstein bounds from KL divergence results. Third, it shows optimal Wasserstein bounds remain attainable for log-concave targets without quadratic transportation cost inequalities. The analysis frames DDPM sampling as Föllmer process discretization rather than reverse Ornstein-Uhlenbeck.

wasserstein distance · denoising diffusion · föllmer process · logarithmic sobolev · transportation cost

The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting

arXiv cs.LG · Corentin Dumery, Niki Amini-Naieni, Shervin Naini, Pascal Fua · 2026-05-18

The authors introduce MixCount, a dataset and benchmark addressing the mixed-object counting gap in vision tasks, where current models systematically fail due to limited and noisy real-world data and unrealistic synthetic alternatives. The proposed solution employs an automatic generation pipeline that synthesizes diverse, realistic images with pixel-perfect counting annotations and fine-grained textual descriptions at scale, eliminating labeling ambiguity. Evaluation on MixCount reveals significant performance degradation in mixed-object settings, while training on the synthesized data yields substantial improvements, reducing MAE by 20.14% on FSC-147 and 18.3% on PairTally. This establishes MixCount as both a benchmark and a training resource for a long-standing bottleneck in counting models.

mixed-object counting · pixel-perfect annotations · automatic generation pipeline · fine-grained textual descriptions · mean absolute error

Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction

arXiv cs.LG · Gabriel Garcia · 2026-05-18

The paper demonstrates that structural protection is paramount for effective KV cache eviction in globally capped decoding scenarios. Evaluating seven eviction policies (LRU, H2O, SnapKV, StreamingLLM, Ada-KV, QUEST, Random) across six transformer models, the authors find that unprotected policies collapse to near-zero quality (F1≤0.064). Reserving 10% of cache at prompt boundaries recovers 69-90% of reference quality at 13% retention, with simplified score-isolation variants performing equivalently to LRU (Δ=0.02 at K=32). Attention-mass analysis reveals position-0 tokens dominate prefix attention (∼75%), while boundary tokens are vulnerable. Protection enables faithful Ada-KV/QUEST variants to add ∼0.03-0.04 F1 on Mistral-7B and Phi-3.5, with per-head allocation providing modest gains.
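
The effect of structural protection can be illustrated with a plain LRU policy that is simply forbidden from evicting a reserved set of positions. This is a sketch under stated assumptions: the reserved slots are split evenly between the start and the end of the prompt, and the helper names are hypothetical rather than taken from the paper.

```python
def protected_positions(prompt_len, budget, reserve_frac=0.10):
    """Reserve a fraction of the cache budget for prompt-boundary tokens."""
    n_reserved = max(2, int(budget * reserve_frac))
    head = n_reserved // 2
    tail = n_reserved - head
    # Assumption: "boundaries" means the first and last prompt positions.
    return set(range(head)) | set(range(prompt_len - tail, prompt_len))

def evict_lru_with_protection(cache, last_used, budget, protected):
    """Drop least-recently-used positions until the cache fits the budget.

    cache: set of token positions; last_used: dict position -> last access step.
    """
    n_over = len(cache) - budget
    if n_over <= 0:
        return set(cache)
    evictable = sorted(
        (p for p in cache if p not in protected), key=lambda p: last_used[p]
    )
    return set(cache) - set(evictable[:n_over])
```

Any scoring rule (H2O, SnapKV, and so on) could replace the LRU sort key here; the point of the sketch is that the protected set is excluded before scoring is applied at all.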

kv cache eviction · structural protection · attention-mass · score-isolation · per-head allocation

A note on connections between the Föllmer process and the denoising diffusion probabilistic model

arXiv cs.LG · Yuta Koike · 2026-05-18

The note establishes explicit connections between discretized Föllmer processes and denoising diffusion probabilistic model (DDPM) samplers, demonstrating that Föllmer discretization yields natural DDPM hyper-parameter settings. By analyzing the Föllmer process—a Brownian motion conditioned to reach a target distribution at time 1—as a time-compressed reverse SDE, the work systematically derives improved sampling error bounds for DDPM. Results recover and slightly enhance state-of-the-art error analyses while unifying perspectives from stochastic processes and diffusion models.
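
For reference, the Föllmer process is usually written as the SDE below. The notation is ours and may differ from the note's: it assumes the target $\mu$ on $\mathbb{R}^d$ has density $f$ with respect to the standard Gaussian, and $P_s$ denotes the heat semigroup.

```latex
dX_t = \nabla \log P_{1-t} f(X_t)\,dt + dB_t, \qquad X_0 = 0, \quad t \in [0, 1],
\qquad\text{where}\qquad
P_s f(x) = \mathbb{E}\bigl[f(x + \sqrt{s}\,Z)\bigr], \quad Z \sim \mathcal{N}(0, I_d).
```

Under this drift, $X_1$ is distributed according to $\mu$, which is the sense in which the process is a Brownian motion conditioned to reach the target distribution at time 1.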

föllmer process · denoising diffusion probabilistic model · reverse sde · sampling error bounds · brownian motion

Real-time Multi-instrument Autonomous Discovery of Novel Phase-change Memory Materials

arXiv cs.LG · Chih-Yu Lee, Haotong Liang, Ryan Kim, Austin McDannald · 2026-05-18

The Multi-instrument Autonomous Discovery (MAD) framework integrates heterogeneous data streams from multiple instruments to enable real-time autonomous discovery of phase-change memory (PCM) materials. MAD employs a multi-output model with a co-regionalization kernel to merge x-ray diffraction (XRD) and electrical resistance measurements, facilitating probabilistic posterior estimation and uncertainty quantification. Non-negative matrix factorization (NMF) is used to maximize knowledge of crystal structure distribution, while simultaneously optimizing for maximum resistance value. Applied to the Mn-Sb-Te ternary system, MAD identified promising PCM materials and established synthesis-process-structure-property relationships (SPSPR) within 25 closed-loop iterations, achieving a seven-fold speed-up compared to conventional methods.

phase-change memory · co-regionalization kernel · non-negative matrix factorization · x-ray diffraction · autonomous discovery

Federated Learning by Utility-Constrained Stochastic Aggregation for Improving Rational Participation

arXiv cs.LG · M Yashwanth, Arunabh Singh, Ashok Nayak, Sai Kiran Bulusu · 2026-05-18

The paper proposes FedUCA, a federated learning framework that addresses rational client participation by modeling the server as an optimizer balancing global model performance with client utility constraints. The method employs utility-constrained stochastic aggregation to sustain participation in cross-silo settings with statistical heterogeneity. Experiments on standard datasets demonstrate FedUCA achieves 18-32% higher client retention and superior global model accuracy compared to baseline approaches.

federated learning · rational participation · utility constraints · cross-silo · stochastic aggregation

LogRouter: Adaptive Two-Level LLM Routing for Log Question Answering in Big Data Systems

arXiv cs.LG · Mert Coskuner, Merve Zeybel, Melik Mert Dolan · 2026-05-18

LogRouter introduces a cost-aware two-level routing system for log question answering in resource-constrained big data environments. The system integrates PySpark-based Drain3 log ingestion, GPU-accelerated embeddings, and dual-index storage in Apache Druid and PostgreSQL with pgvector. It employs a Level-1 router to dispatch queries among four execution paths and a Level-2 router to select between 14B-class and 32B-class LLM generators. Evaluated on four LogHub datasets, LogRouter achieves 88.4% mean routing accuracy, 0.373 ROUGE-1, 0.879 BERTScore, and 0.779 RAGAS Faithfulness, with 55% latency reduction compared to a Fixed-32B baseline while maintaining answer correctness within 5.8 points.

log question answering · cost-aware routing · dual-index storage · gpu-accelerated embeddings · ragas faithfulness

Uncertainty Reliability Under Domain Shift: An Investigation for Data-Driven Blood Pressure Estimation in Photoplethysmography

arXiv cs.LG · Mohammad Moulaeifard, Ciaran Bench, Philip J. Aston, Nils Strodthoff · 2026-05-18

This study investigates uncertainty quantification (UQ) in deep learning-based blood pressure (BP) estimation from photoplethysmography (PPG) signals under domain shift, comparing in-distribution (ID) and out-of-distribution (OOD) performance. Using an XResNet1D-50 trained on PulseDB and tested on four external datasets, the authors evaluated deep ensembles (DE), Monte Carlo dropout (MCD), and recalibration methods (conformal prediction, temperature scaling, isotonic regression) with Gaussian negative log-likelihood (GNLL) and mean squared error (MSE) losses. Results show DE outperforms MCD under OOD conditions, GNLL-based methods yield the best uncertainty calibration, and recalibration is essential for MSE-based uncertainty. These findings emphasize the importance of joint predictive accuracy and calibration assessment in cuffless BP estimation.
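
As a rough illustration of two ingredients named above, the sketch below writes out the Gaussian negative log-likelihood and one common way to collapse a deep ensemble into a single predictive Gaussian (moment matching of the uniform mixture). Whether the study aggregates its ensemble members exactly this way is an assumption; the function names are hypothetical.

```python
import math


def gaussian_nll(y, mean, var):
    """Negative log-likelihood of y under a Gaussian N(mean, var)."""
    return 0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def combine_ensemble(members):
    """Collapse per-member (mean, variance) pairs into one Gaussian.

    Uses moment matching of the uniform mixture: the combined variance is the
    average member variance plus the spread of the member means.
    """
    n = len(members)
    mean = sum(m for m, _ in members) / n
    var = sum(v + m * m for m, v in members) / n - mean * mean
    return mean, var


# Two members that disagree on the mean: the disagreement inflates the variance.
ensemble_mean, ensemble_var = combine_ensemble([(1.0, 1.0), (3.0, 1.0)])
```

Here the two members each report variance 1.0, but their means differ, so the combined variance rises to 2.0. This is the mechanism by which ensemble disagreement under domain shift shows up as higher predictive uncertainty.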

uncertainty quantification · domain shift · photoplethysmography · deep ensembles · conformal prediction

Scalable Decision-Focused Learning through Cost-Sensitive Regression

arXiv cs.LG · Noah Schutte, Senne Berden, Tias Guns, Krzysztof Postek · 2026-05-18

The paper introduces a scalable decision-focused learning (DFL) approach for contextual optimization problems by reframing them as cost-sensitive multi-output regression. The method combines cost-insensitive normalization, decision-aware asymmetric penalization, and instance-based costs to approximate the downstream task loss without repeated optimization solves during training. Experiments demonstrate comparable task performance to state-of-the-art DFL methods while significantly improving computational efficiency, enabling application to previously intractable problem sizes.

decision-focused learning · contextual optimization · cost-sensitive regression · multi-output regression · asymmetric penalization

RL4RLA: Teaching ML to Discover Randomized Linear Algebra Algorithms Through Curriculum Design and Graph-Based Search

arXiv cs.LG · Jinglong Xiong, Xiaotian Liu, Ruoxin Wang, Zihang Liu · 2026-05-18

RL4RLA introduces a reinforcement learning framework for automating the discovery of randomized linear algebra (RLA) algorithms, addressing challenges of sparse rewards and vast search spaces. The method employs a numerical curriculum to incrementally increase problem difficulty and Monte Carlo Graph Search to optimize exploration by merging equivalent partial algorithms. It constructs interpretable, symbolic algorithms from basic linear algebra primitives, ensuring verifiability and implementability. RL4RLA successfully rediscovers state-of-the-art methods such as sketch-and-precondition solvers, Randomized Kaczmarz, and Newton Sketch, and can optimize algorithms for specific trade-offs in accuracy, speed, and stability.

reinforcement learning · randomized linear algebra · monte carlo graph search · numerical curriculum · symbolic algorithms

Function graph transformers universally approximate operators between function spaces

arXiv cs.LG · Takashi Furuya, David Mis, Ivan Dokmanić, Maarten V. de Hoop · 2026-05-18

The paper introduces function graph transformers, a novel framework for approximating nonlinear operators between function spaces using measure-theoretic transformers. By representing functions as graph measures and employing empirical approximations, the method ensures discretization invariance and handles regularized negative-order Sobolev inputs. The authors prove that graph-preserving maps can be universally approximated by compositions of softmax self-attention layers and MLPs, maintaining single-valued outputs. This approach provides a rigorous mathematical foundation for operator learning with transformers, addressing challenges in discretization consistency and output domain variability.

function graph transformers · measure-theoretic transformers · operator learning · discretization invariance · softmax self-attention

Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning

arXiv cs.LG · Zhanyue Qin, Jia Feng, Yibo Lyu, Yun Peng · 2026-05-18

The paper introduces CodeThinker, a reinforcement learning framework enhancing LLMs' code reasoning via consistency-based rewards. The method combines stepwise reasoning-aware training (using consistency tracing), dynamic beam sampling, and a consistency reward mechanism to mitigate reward hacking. Evaluations on three benchmarks show state-of-the-art performance, with 4.3% accuracy improvement on Qwen2.5-Coder-7B-Instruct and gains of 5.33/3.11 percentage points on mathematical/code reasoning tasks across 17 languages.

code reasoning · reinforcement learning · consistency reward · dynamic beam sampling · stepwise reasoning

Universal Adversarial Triggers

arXiv cs.LG · Benedict Florance Arockiaraj, Alexander Feng, Jianxiong Cai, Xiaoyu Cheng · 2026-05-18

The paper introduces a technique for generating universal adversarial triggers that are both effective and linguistically natural, addressing the ungrammaticality of prior approaches. The method combines parts-of-speech filtering with a perplexity-based loss function to produce sensible trigger sequences. Evaluated on sentiment analysis using the SST dataset, the generated triggers reduce classifier accuracy to as low as 0.04 and 0.12 when flipping positive-to-negative and negative-to-positive predictions, respectively. Adversarial training using these triggers improves model robustness, increasing accuracy under attack from 0.12 to 0.48. The work demonstrates that adversarial attacks can be made less detectable while facilitating robust model development.

universal adversarial triggers · parts-of-speech filtering · perplexity-based loss · sentiment analysis · adversarial training

InfoFlow: A Framework for Multi-Layer Transformer Analysis

arXiv cs.LG · Penghao Yu, Haotian Jiang, Zeyu Bao, Qianxiao Li · 2026-05-18

The work establishes a theoretical separation between single-layer and multi-layer Transformers, demonstrating that two-layer architectures solve retrieval tasks to ε precision with O(ε⁻¹) parameters, while single-layer variants require Ω(ε⁻ᵏ) parameters, where k scales linearly with sequence length. Through analysis of softmax attention's retrieval limitations and coupled information decoding costs, the authors propose InfoFlow, a framework that tracks the input positions accessible to each token at each layer to quantify information propagation efficiency. The framework unifies known approximation bounds, aligns with empirical observations, and predicts behaviors in analytically intractable settings.

transformers · approximation theory · retrieval tasks · softmax attention · information propagation

Transfer Learning for Customized Car Racing Environments

arXiv cs.LG · Benedict Florance Arockiaraj, Richard Chang, Wesley Yee · 2026-05-18

The study investigates transfer learning in deep reinforcement learning for OpenAI's Car Racing environment, focusing on achieving fast lap times across customized circuits through zero-shot transfer and fine-tuning. It compares model-based and model-free approaches, finding that model-based methods outperform model-free ones in both performance and convergence speed. Transfer learning consistently enhances target domain performance and demonstrates robust learning capabilities. The results highlight the efficacy of transfer learning in adapting agents to new racing environments.

transfer learning · deep reinforcement learning · model-based · model-free · zero-shot transfer

Lightweight Gaussian Process Inference in C++ on Metal and CUDA

arXiv cs.LG · Yu-Hsueh Fang · 2026-05-18

LightGP introduces a lightweight C++17 library for Gaussian process inference, offering Python bindings and support for Apple Metal, NVIDIA CUDA, and CPU backends via Apple Accelerate and OpenBLAS. The library implements four inference methods: exact Cholesky, matrix-free conjugate gradients, sparse variational free energy, and structured kernel interpolation with FFT, scaling from N=100 to N=500,000. Benchmarks on Apple M4 and NVIDIA RTX 3060 demonstrate significant speedups, with LightGP CPU outperforming GPyTorch CPU by 2.6–8.7× for exact GP and 1.5× for sparse GP. LightGP CUDA achieves 2.3–6.7× speedups over GPyTorch CUDA for N≤2,048, with optimized matrix-free kernel-vector products and FFT-accelerated SKI matvecs achieving sub-millisecond performance at N=200,000.

gaussian process · cuda · metal · matrix-free · fft

CoX-MoE: Coalesced Expert Execution for High-Throughput MoE Inference with AMX-Enabled CPU-GPU Co-Execution

arXiv cs.LG · Mu-Young Son, Yi Chen, Seungjae Yoo, Soongyu Choi, Joo-Young Kim · 2026-05-18

CoX-MoE introduces a CPU-GPU co-execution system for high-throughput Mixture-of-Experts (MoE) inference, addressing GPU memory pressure and fragmented workloads in prior approaches. The method combines coalesced expert execution with strategic workload orchestration, employing ordinary batch processing instead of micro-batching and selectively offloading attention computation. It also implements a static expert-aware stratification scheme to pre-assign frequently activated experts to the GPU, reducing PCIe transfer overhead and balancing CPU-GPU workloads. Evaluations demonstrate throughput improvements of up to 7.1x and 2.4x over FlexGen and MoE-Lightning, respectively.

mixture-of-experts · cpu-gpu co-execution · coalesced expert execution · pcie transfer · attention offloading

Long-horizon prediction of three-dimensional wall-bounded turbulence with CTA-Swin-UNet and resolvent analysis

arXiv cs.LG · Bo Chen, Yitong Fan, Jie Yao, Weipeng Li · 2026-05-18

A hybrid machine-learning framework is proposed for long-horizon prediction of 3D wall-bounded turbulence, addressing autoregressive error accumulation and computational cost. The framework integrates a channel-time-attention Swin-UNet (CTA-Swin-UNet) for planar flow prediction, a multi-time-scale fusion correction (MTFC) strategy for extended prediction horizons, and a resolvent-based spectral linear stochastic estimation (SLSE) for 3D flow reconstruction. CTA-Swin-UNet outperforms baseline models (LSTM, FNO, Swin-UNet) in single-step prediction and autoregressive rollouts, maintaining stability for 150 steps compared to 20-50 for baselines. MTFC extends stability to 300 steps, and SLSE accurately reconstructs 3D flow structures and energy spectra, demonstrating computational efficiency and effectiveness.

autoregressive prediction · wall-bounded turbulence · channel-time-attention · spectral linear stochastic estimation · multi-time-scale fusion

DAD4TS: Data-Augmentation-Oriented Diffusion Model for Time-Series Forecasting with Small-Scale Data

arXiv cs.LG · Masahiro Suzuki, Bohui Xia, Hiroto Yamamoto, Masanori Miyahara · 2026-05-18

The paper proposes DAD4TS, a diffusion-model-based data augmentation method with reinforcement learning for time-series forecasting with small-scale data. The method trains a data generator jointly with a time-series model, controlled by RL to generate accuracy-improving samples, using geometric projections instead of VAEs for diffusion model training. Evaluated against seven baselines on six real-world datasets with eight time-series models, DAD4TS showed effectiveness on five datasets.

diffusion model · time-series forecasting · data augmentation · reinforcement learning · small-scale data

Multi-site PPG: An In-the-Wild Physiological Dataset from Emerging Multi-site Wearables

arXiv cs.LG · Jiayi Shao, Jiaying Ye, Shengyao Liu, Zachary Englhardt · 2026-05-18

The Multi-site PPG dataset introduces over 350 hours of in-the-wild physiological data collected from four custom wearables—a smart earring, ring, watch, and necklace—each recording green and infrared PPG, 3-axis acceleration, and temperature, synchronized with reference ECG from a Polar H10 chest strap. Participants wore the devices during daily activities, yielding 230-290 hours of modeling-ready 8-second windows per wearable. Benchmarking heart-rate estimation methods revealed significant body-site performance differences, with mean absolute errors ranging from 2.30 bpm (earring) to 8.68 bpm (necklace). The dataset enables robust analysis of motion effects, multi-site fusion, and PPG-accelerometer integration for emerging wearable form factors.

photoplethysmography · wearables · electrocardiogram · motion effects · fusion

Learning over Positive and Negative Edges with Contrastive Message Passing

arXiv cs.LG · Peter Pao-Huang, Charilaos I. Kanatsoulis, Michael Bereket, Jure Leskovec · 2026-05-18

Contrastive Message Passing (CMP) introduces a graph neural network architecture that leverages both positive and negative edges for node feature updates, addressing the limitation of conventional methods that ignore negative edges. The method employs soft positive semidefinite constraints on learnable weights, applying similarity-preserving transformations to positively connected nodes and dissimilarity-inducing transformations to negatively connected nodes. Theoretical analysis shows significant information gain from negative edges in low-label, high-homophily, and high-edge-density settings. Empirical evaluation on simulated and real datasets demonstrates CMP's consistent outperformance of baselines in low-label scenarios where negative edges are informative.

contrastive message passing · graph neural networks · negative edges · positive semidefinite constraints · low-label settings

Simple Approximation and Derivative Free Inference-Time Scaling for Diffusion Models via Sequential Monte Carlo on Path Measures

arXiv cs.LG · Chenyang Wang, Weizhong Wang, Yinuo Ren, Jose Blanchet · 2026-05-18

The authors introduce URGE, a derivative-free inference-time scaling algorithm for diffusion models that performs path-wise importance reweighting via a Girsanov change of measure. Unlike existing techniques requiring repeated score or gradient evaluations, URGE attaches multiplicative weights to simulated trajectories and periodically resamples, eliminating the need for score, Hessian, or PDE evaluations. The method establishes equivalence between path-wise and particle-wise sequential Monte Carlo, ensuring unbiased terminal laws. Empirical results demonstrate that URGE outperforms existing inference-time guidance baselines on synthetic tests and diffusion-model benchmarks, achieving superior generation quality with simpler implementation.

diffusion models · girsanov change of measure · sequential monte carlo · importance reweighting · inference-time scaling

SNLP: Layer-Parallel Inference via Structured Newton Corrections

arXiv cs.LG · Ligong Han, Kai Xu, Hao Wang, Akash Srivastava · 2026-05-18

Structured Newton Layer Parallelism (SNLP) introduces a layer-parallel inference framework for autoregressive language models by treating hidden-state traces as nonlinear residual equations solved via structured Newton corrections. SNLP replaces exact Jacobian computations with architecture-induced surrogate dynamics: Identity Newton (IDN) for residual Transformers and HC Newton (HCN) for mHC-style architectures, coupled with SNLP-aware regularization to improve layer-parallel compatibility. Experiments on nanochat-scale Transformers demonstrate SNLP's effectiveness, achieving a 2.3x inference speedup on a 0.5B parameter model while reducing perplexity by 6.1%. Results indicate that layer-parallel inference can act as a solver-induced bias, though pretrained models show limited compatibility and exact convergence reverts to sequential computation.

structured newton corrections · layer-parallel inference · residual transformers · architecture-induced dynamics · perplexity reduction

Agentic Cost-Aware Query Planning with Knowledge Distillation for Big Data Analytics

arXiv cs.LG · Mahdi Naser-Moghadasi · 2026-05-18

The paper introduces an agentic query planning system for big data analytics that combines rule-based planning, bandit exploration, and knowledge distillation to optimize query execution under resource constraints. The method employs a teacher planner with six optimization strategies, UCB1 bandit search for plan exploration, and a Random Forest cost model for latency prediction. A lightweight student planner (Logistic Regression or Gradient Boosting) is distilled from teacher-bandit decisions. Evaluations on NYC Taxi and IMDB datasets show 23% latency reduction, 94% constraint satisfaction, and 89% plan replication accuracy with 15x faster inference.
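
The UCB1 search over candidate plans can be sketched as a standard multi-armed bandit loop. In this illustration each arm stands for one candidate plan and the reward is a simulated Bernoulli success signal; both the reward model and the function names are assumptions for demonstration, since the digest does not state how the paper defines its reward.

```python
import math
import random


def ucb1_select(counts, totals, t):
    """Pick the arm with the highest UCB1 index; untried arms go first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: totals[a] / counts[a]
        + math.sqrt(2.0 * math.log(t) / counts[a]),
    )


def run_bandit(mean_rewards, steps, seed=0):
    """Simulate UCB1 over candidate plans with Bernoulli rewards (assumed)."""
    rng = random.Random(seed)
    counts = [0] * len(mean_rewards)
    totals = [0.0] * len(mean_rewards)
    for t in range(1, steps + 1):
        arm = ucb1_select(counts, totals, t)
        reward = 1.0 if rng.random() < mean_rewards[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts


# Three hypothetical plans; the third has the highest expected reward.
pull_counts = run_bandit([0.2, 0.5, 0.9], steps=2000)
```

The exploration bonus shrinks as a plan is tried more often, so pulls concentrate on the best plan while weaker plans are still sampled occasionally.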

query optimization · knowledge distillation · ucb1 bandit · cost-aware prediction · big data analytics

A Unified Framework for Data-Free One-Step Sampling via Wasserstein Gradient Flows

arXiv cs.LG · Chenguang Wang, Tianshu Yu · 2026-05-18

The authors present a unified theoretical framework for data-free one-step sampling from unnormalized target distributions using Wasserstein gradient flows. They demonstrate that for a broad class of f-divergence objectives, the induced velocity field decomposes into a universal form, revealing shared asymptotic behavior across divergences. A regional-response theory and compression–elasticity identity are derived to characterize divergence-specific mass transport into under-covered regions. The framework is extended to Log-Variance divergence, motivating a practical surrogate for data-free training. Experiments on multimodal Gaussian-mixture benchmarks validate the theoretical predictions and demonstrate effective one-step sampling.

wasserstein gradient flows · f-divergence · data-free sampling · log-variance divergence · normalizing-flow

AMO: Adaptive Muon Orthogonalization

arXiv cs.LG · Xinlin Zhuang, Panyi Ouyang, Yichen Li, Jiangming Shi · 2026-05-18

The paper introduces Adaptive Muon Orthogonalization (AMO), a method that dynamically allocates Newton-Schulz (NS) iteration budgets for orthogonalization in Muon-based optimization, addressing heterogeneity in matrix geometry across operators, training stages, and network depths. AMO first observes weight geometry by operator type during early training, then commits to adaptive NS scheduling. Evaluated on Llama3.1-1.4B and Qwen3-1.7B, AMO improves average downstream performance by +0.76 and +0.51 respectively over uniform-schedule Muon across 12 tasks, demonstrating gains in standard, prolonged, and continual pre-training scenarios.
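
For readers unfamiliar with Newton-Schulz orthogonalization, the sketch below shows why the iteration budget matters. It uses the textbook cubic iteration X ← 1.5·X − 0.5·X·Xᵀ·X in pure Python; Muon implementations typically use a tuned higher-order polynomial instead, and the helper names here are hypothetical. The `steps` argument is the budget that a scheme like AMO would set per operator.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps):
    """Approximate the orthogonal polar factor of g with cubic NS iterations."""
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]  # scale so singular values are <= 1
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * xv - 0.5 * yv for xv, yv in zip(x_row, y_row)]
            for x_row, y_row in zip(x, xxt_x)
        ]
    return x


def orthogonality_error(x):
    """Largest absolute entry of X^T X - I."""
    gram = matmul(transpose(x), x)
    n = len(gram)
    return max(
        abs(gram[i][j] - (1.0 if i == j else 0.0))
        for i in range(n)
        for j in range(n)
    )


g = [[3.0, 1.0], [0.0, 2.0]]
err_small_budget = orthogonality_error(newton_schulz(g, steps=2))
err_large_budget = orthogonality_error(newton_schulz(g, steps=15))
```

A small budget leaves the result visibly non-orthogonal, while a larger one drives the error toward zero. Matrices with more spread-out singular values need more steps, which is the kind of heterogeneity an adaptive schedule can exploit.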

adaptive orthogonalization · newton-schulz iterations · muon optimizer · matrix geometry · pre-training optimization

GenTS: A Comprehensive Benchmark Library for Generative Time Series Models

arXiv cs.LG · Chenxi Wang, Xiaorong Wang, Peiyang Li, Yi Wang · 2026-05-18

GenTS introduces a comprehensive benchmark library for evaluating generative time series models, addressing the limitations of existing libraries primarily designed for discriminative models. The library features a unified data preprocessing pipeline, a collection of versatile models, and panoramic evaluation metrics, enabling systematic assessment across tasks like synthesis, forecasting, and imputation. Its modular design allows researchers to customize datasets and models beyond built-in options. Benchmarking experiments conducted using GenTS provide insights for model selection and identify future research directions. The library is open-source, with official tutorials and documentation available online.

generative models · time series · benchmark library · data preprocessing · evaluation metrics

Is Complex Training Necessary for Long-Tailed OOD Detection? A Re-think from Feature Geometry

arXiv cs.LG · Ningkang Peng, Xuanming Chen, Yanhui Gu · 2026-05-18

The paper challenges the necessity of complex training for long-tailed out-of-distribution (LT-OOD) detection by showing that frozen long-tailed representations already contain useful OOD evidence, but standard Mahalanobis distance is distorted by frequency-coupled feature geometry. The authors propose Hyperspherical Pooled Mahalanobis (HPM), a post-hoc detector that normalizes features onto a unit sphere and uses pooled, ridge-regularized covariance while retaining class means as semantic anchors. On CIFAR-LT and ImageNet-100-LT, HPM significantly improves AUROC (e.g., from 46.49 to 85.67 on CIFAR-10-LT) and achieves the best Log Efficiency Score (3.08) on CIFAR-100-LT, demonstrating that simpler pipelines can match complex methods when properly accounting for feature geometry.

long-tailed ood detection · mahalanobis distance · hyperspherical normalization · ridge regularization · log efficiency score

When Accuracy Is Not Enough: Uncertainty Collapse between Noisy Label Learning and Out-of-Distribution Detection

arXiv cs.LG · Ningkang Peng, Jingyang Mao, Runhan Zhou, Peirong Ma · 2026-05-18

The paper introduces ACC-OOD, a benchmark that evaluates learning-with-noisy-labels (LNL) methods on out-of-distribution (OOD) detection while keeping the LNL checkpoints frozen. It identifies uncertainty collapse, where high closed-set accuracy fails to ensure OOD reliability because misclassified in-distribution samples and OOD samples occupy overlapping score and feature regions. Virtual Margin Regularization (VMR) is proposed as a lightweight intervention, synthesizing virtual outliers on trusted in-distribution batches to widen energy margins. VMR partially mitigates far-OOD failures without compromising closed-set accuracy, and the authors advocate for benchmarks that jointly assess closed-set generalization, open-world reliability, and structural overlap diagnostics.

uncertainty collapse · out-of-distribution detection · virtual margin regularization · noisy label learning · energy margin

HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts in Hydrologic Model Calibration via Simulator-Grounded RL

arXiv cs.LG · Zhi Li, Songkun Yan, Jie Cao, Mofan Zhang · 2026-05-18

The paper introduces HydroAgent, a simulator-grounded RL approach to bridge the performance gap between frontier LLMs and human experts in hydrologic model calibration. The authors benchmark nine LLMs on the CREST distributed hydrologic model; the models reach Nash-Sutcliffe Efficiency (NSE) scores between -0.16 and 0.75, with Sonnet 4.6 performing best. They argue that the gap stems from a lack of domain grounding rather than from parameter count. HydroAgent fine-tunes Qwen3-4B using supervised fine-tuning on 2,576 expert trajectories and Group-Relative Policy Optimization with NSE as the reward, leveraging simulation feedback. This approach offers a compute-efficient, physically faithful alternative to scaling generic LLMs for Earth system science.
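
Since NSE serves as both the benchmark score and the RL reward, it may help to see the standard definition written out: one minus the ratio of squared simulation error to the variance of the observations around their mean. The function name and the toy series below are illustrative, not taken from the paper.

```python
def nash_sutcliffe_efficiency(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    if not observed or len(observed) != len(simulated):
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("NSE is undefined for a constant observed series")
    return 1.0 - residual / variance


observed_flow = [1.0, 2.0, 3.0, 4.0]
nse_perfect = nash_sutcliffe_efficiency(observed_flow, observed_flow)
nse_mean_only = nash_sutcliffe_efficiency(observed_flow, [2.5, 2.5, 2.5, 2.5])
nse_partial = nash_sutcliffe_efficiency(observed_flow, [1.0, 2.0, 3.0, 5.0])
```

A perfect simulation scores 1, predicting the observed mean everywhere scores 0, and negative values (such as the -0.16 reported above) mean the simulation is worse than that mean baseline.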

hydrologic model calibration · nash-sutcliffe efficiency · simulator-grounded rl · group-relative policy optimization · domain-grounding

Uncertainty-Calibrated Recommendations for Low-Active Users

arXiv cs.LG · Bob Junyi Zou, Sai Li, Tianyun Sun, Wentao Guo · 2026-05-18

The paper introduces a unified framework for uncertainty-calibrated recommendations that differentiates strategies between Low-Active Users (LAUs) and High-Active Users (HAUs). The method employs model uncertainty quantification to implement a risk-averse deboosting policy for LAUs (suppressing unreliable recommendations) and a risk-seeking Upper Confidence Bound (UCB) strategy for HAUs (encouraging exploration). Evaluated on a major livestream platform, the framework improves LAU retention (active hours) and satisfaction (quality watch time ratio) while increasing HAU interest diversity and category coverage.
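
One simple way to picture the two policies is to adjust each predicted score by its uncertainty in opposite directions: subtract it for low-active users and add it for high-active users. The `mean ± k·std` form, the function names, and the toy items below are assumptions for illustration; the digest does not give the paper's exact scoring rule.

```python
def calibrated_score(mean, std, is_low_active, k=1.0):
    """Risk-averse lower bound for LAUs, optimistic upper bound for HAUs."""
    return mean - k * std if is_low_active else mean + k * std


def rank_items(predictions, is_low_active, k=1.0):
    """Rank items best-first; predictions maps item -> (mean, std)."""
    return sorted(
        predictions,
        key=lambda item: calibrated_score(*predictions[item], is_low_active, k),
        reverse=True,
    )


# Hypothetical predictions: one confident item, one uncertain but promising item.
predictions = {
    "safe_item": (0.60, 0.05),
    "exploratory_item": (0.65, 0.30),
}
lau_order = rank_items(predictions, is_low_active=True)
hau_order = rank_items(predictions, is_low_active=False)
```

With these numbers the low-active user sees the confident item first, while the high-active user sees the uncertain item first, which matches the deboosting versus exploration split described above.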

recommender systems · model uncertainty · upper confidence bound · low-active users · risk-averse policy

Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates

arXiv cs.LG · Athanasios Glentis, Dawei Li, Chung-Yiu Yau, Mingyi Hong · 2026-05-18

The study investigates the performance gap between SGD and Adam optimizers in LLM pre-training, attributing it primarily to SGD's inability to sustain large effective learning rates. Through empirical and theoretical analysis, the authors identify that LLM training dynamics involve small gradient norms and large weight-to-gradient ratios, exacerbated by large batch sizes. They propose clipping mechanisms to stabilize SGD at high learning rates, reducing the validation loss gap from over 50% to 3.5% in a 1B-parameter LLaMA model with 1M-token batches.
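
The digest does not say which clipping mechanism the authors use, so the sketch below shows only the generic idea with global-norm clipping: the update direction is preserved, but its length is capped so that a large learning rate cannot produce an arbitrarily large step. The function name and the numbers are illustrative assumptions.

```python
import math


def clipped_sgd_step(params, grads, lr, max_norm):
    """One SGD step with the gradient rescaled to have norm at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [p - lr * scale * g for p, g in zip(params, grads)]


# A large gradient (norm 5) is scaled down to norm 1 before the update.
stepped_large = clipped_sgd_step([0.0, 0.0], [3.0, 4.0], lr=1.0, max_norm=1.0)
# A small gradient (norm 0.5) passes through unchanged.
stepped_small = clipped_sgd_step([0.0, 0.0], [0.3, 0.4], lr=1.0, max_norm=1.0)
```

Because only oversized gradients are rescaled, the learning rate can be raised for the typical small-gradient steps without the occasional large gradient destabilizing training.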

sgd · adam · learning rates · llm pre-training · gradient clipping

Learning Variable-Length Tokenization for Generative Recommendation

arXiv cs.LG · Minhao Wang, Bowen Wu, Wei Zhang · 2026-05-18

The paper introduces VarLenRec, a generative recommendation framework that learns variable-length tokenization to address the Popularity-Length Paradox, where popular items perform optimally with short IDs while tail items require longer codes. The method employs Popularity-Weighted Information Budget Allocation (PIBA) to determine optimal ID lengths, hyperbolic residual quantization for geometric capacity stratification, and a Soft Length Controller for differentiable length prediction. Experiments across four datasets show VarLenRec improves recommendation accuracy and efficiency over state-of-the-art methods.

generative recommendation · variable-length tokenization · hyperbolic residual quantization · popularity-length paradox · soft length controller

Self-Distillation is Optimal Among Spectral Shrinkage Estimators in Spiked Covariance Models

arXiv cs.LG · Radu Lecoiu, Debarghya Mukherjee, Pragya Sur · 2026-05-18

The paper establishes self-distillation as the optimal spectral shrinkage estimator in spiked covariance models, demonstrating that s-step self-distillation achieves superior performance for matrices with s spikes. Through theoretical analysis of spectral shrinkage estimators, the authors prove that s steps are necessary for optimality, with any (s-k)-step distillation being strictly suboptimal. For isotropic covariances, optimally tuned Ridge regression outperforms other estimators. In federated settings, self-distillation emerges as the best local rule, differing from centralized optimal strategies. These findings connect self-distillation with classical shrinkage methods and explain its predictive performance improvements.

self-distillation · spectral shrinkage estimators · spiked covariance models · ridge regression · federated learning

Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent

arXiv cs.LG · Behrad Moniri, Hamed Hassani · 2026-05-18

This paper characterizes feature learning in linear-width two-layer neural networks by analyzing the second step of gradient descent, overcoming limitations of single-step updates. Using step-sizes η₁ ≍ N^α₁ and η₂ ≍ N^α₂ for α₁, α₂ ∈ [0,0.5), the authors derive a spectral decomposition of updated weights as a spiked random matrix with multiple outliers, each corresponding to a learned direction. The number of outliers is determined by ⌊α₂/(1/2 - α₁)⌋. Batch reuse enables capturing directions with information exponents exceeding one, unlike independent batches restricted to exponent one. This establishes a framework for studying optimization in overparameterized networks.
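
Reading the floor expression above as the outlier count, a short worked example shows how the two step-size exponents interact. The specific exponent values are chosen only for illustration.

```latex
\[
  \#\text{outliers} = \left\lfloor \frac{\alpha_2}{\tfrac{1}{2} - \alpha_1} \right\rfloor,
  \qquad \alpha_1, \alpha_2 \in [0, 0.5).
\]
\[
  (\alpha_1, \alpha_2) = (0.25,\, 0.40):\quad
  \lfloor 0.40 / 0.25 \rfloor = \lfloor 1.6 \rfloor = 1,
\]
\[
  (\alpha_1, \alpha_2) = (0.40,\, 0.45):\quad
  \lfloor 0.45 / 0.10 \rfloor = \lfloor 4.5 \rfloor = 4.
\]
```

Under this reading, no outlier appears when α₂ < 1/2 − α₁, and the count grows as either exponent approaches 1/2.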

linear-width · gradient descent · spiked random matrix · information exponent · overparameterized networks

AURORA: Contextual Orthogonalization for Geometric Representation Learning in Healthcare Foundation Models

arXiv cs.LG · Yuanyun Zhang, Shi Li · 2026-05-18

We present AURORA, a framework for healthcare representation learning that decomposes latent embeddings into orthogonal semantic subspaces corresponding to distinct contextual factors. The method employs adaptive uncertainty-aware representations and relational alignment objectives within each subspace, yielding geometrically interpretable and semantically disentangled embeddings. Evaluated on clinical prediction and retrieval tasks, AURORA outperforms reconstruction, contrastive, and self-distillation baselines, demonstrating improved contextual disentanglement, neighborhood purity, and robustness to institutional distribution shifts. Results indicate that structured latent geometry complements conventional predictive compression objectives in healthcare foundation models.

orthogonal subspaces · relational alignment · contextual disentanglement · latent geometry · healthcare foundation models

MV-Gate: Insider Threat Detection via Multi-View Behavioral Statistics and Semantic Modeling

arXiv cs.LG · Kaichuan Kong, Dongjie Liu, Xiaobo Jin, Guanggang Geng · 2026-05-18

MV-Gate introduces a multi-view behavior modeling framework for insider threat detection that integrates statistical regularities with sequence semantics. The method constructs three aligned behavioral sequences: activity tokens, multi-scale status signals capturing recurrence patterns, and frequency-deviation signals describing short- vs long-term intensity differences. An anomaly-aware gating mechanism injects these statistical views into the attention computation, guiding the encoder to emphasize statistically irregular events. Experiments on CERT r4.2, CERT r5.2, and ADFA-LD demonstrate that MV-Gate achieves significant improvements over classical, deep-learning, and domain-specific baselines, particularly for progressive, weak-signal threats.

multi-view behavior modeling · anomaly-aware gating · attention computation · frequency-deviation signals · insider threat detection

Memisis: Orchestrating and Evaluating Synthetic Data for Tabular Health Datasets

arXiv cs.LG · Nitish Nagesh, Mahdi Bagheri, Arshia Harish Puthran, Pengbao Zhou · 2026-05-18

Memisis introduces a unified workflow for generating and evaluating synthetic tabular healthcare data, addressing privacy, utility, and fairness. The tool orchestrates existing synthesizers (CTGAN, TVAE, GaussianCopula) and leverages LLMs to automate generation based on user-specified goals, with configurable parameters for training size, epochs, and sample count. Evaluation on a schizophrenia dataset with protected attributes shows comparable performance across synthesizers in fairness and utility metrics. The system provides flexible control while automating the validation pipeline.

synthetic data · tabular data · privacy preservation · fairness metrics · data orchestration

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

arXiv cs.LG · Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen · 2026-05-18

OSCAR introduces an offline spectral covariance-aware rotation method for 2-bit KV-cache quantization in LLM serving, aligning quantization with attention-aware covariance structures. It derives fixed rotations and clipping thresholds offline, enabling deployable INT2 quantization compatible with paged KV-cache serving and fused kernel pipelines. Evaluated on Qwen3 and GLM models with context lengths up to 128K tokens, OSCAR reduces the BF16 accuracy gap to 3.78 and 1.42 points on Qwen3-4B and Qwen3-8B, respectively, while naive INT2 collapses. It achieves 8x KV-cache memory reduction, up to 7x throughput improvement at large batch sizes, and 3x decoding acceleration over BF16.

kv-cache · quantization · attention · covariance · deployment

Testable and Actionable Calibration for Full Swap Regret

arXiv cs.LG · Konstantina Bairaktari, Lunjia Hu, Huy L. Nguyen, Jonathan Ullman · 2026-05-18

The authors introduce Soft-Binned Calibration Decision Loss (SCDL), a novel calibration measure that is both actionable and testable, addressing limitations in existing measures. SCDL ensures full actionability by directly informing utility loss under swap regret, while achieving near-optimal testability with minimal estimation error from small prediction-outcome samples. The measure also satisfies continuity and consistency properties. Experimental results demonstrate SCDL's superior performance compared to existing calibration measures, validating its theoretical advantages in practical applications.

calibration · swap regret · testability · actionability · estimation error

StatQAT: Statistical Quantizer Optimization for Deep Networks

arXiv cs.LG · Mehmet Aktukmak, Daniel Huang, Ke Ding · 2026-05-18

A novel statistical error analysis framework is introduced for optimizing uniform and floating-point quantization in deep neural networks, addressing challenges in parameter selection across diverse data distributions. The method develops iterative quantizers for arbitrary distributions and analytic quantizers tailored for Gaussian-like weight distributions, enabling low-error quantization for both activations and weights. These quantizers are integrated into quantization-aware training and evaluated across integer and floating-point formats. Experimental results demonstrate enhanced accuracy and stability, validating the approach's effectiveness in training low-precision networks.

quantization · statistical error analysis · quantization-aware training · low-precision neural networks · gaussian-like distributions

Divergence-Suppressing Couplings for Rectified Flow

arXiv cs.LG · Yimeng Min, Carla P. Gomes · 2026-05-18

The paper introduces divergence-suppressing couplings for Rectified Flow, an offline correction method that attenuates the divergent component of learned velocity fields to reduce trajectory entanglement. By suppressing local expansion/contraction regions during coupling generation, the approach straightens trajectories while maintaining computational efficiency (same wall-clock cost as standard Rectified Flow via Euler integration). Experiments demonstrate consistent improvements on 2D synthetic benchmarks and image generation tasks.

rectified flow · velocity field · trajectory entanglement · divergence suppression · euler integration

L-Drive: Beyond a Single Mapping-Latent Context Drives Time Series Forecasting

arXiv cs.LG · Fan Zhang, Shijun Chen, Hua Wang · 2026-05-18

L-Drive introduces a change-aware forecasting framework to address limitations of Direct-Mapping methods in multivariate time-series forecasting. It employs a Latent-Context to capture high-level dynamics and uses gating mechanisms for adaptive segment representation. The method incorporates patch-shared relative positional basis functions to enhance intra-segment modeling and mitigate overfitting. Experiments demonstrate improved forecasting accuracy and computational efficiency compared to existing approaches.

latent-context · gating mechanisms · patch-shared positional basis · multivariate time-series · distribution shifts

Domain Incremental Learning for Pandemic-Resilient Chest X-Ray Analysis

arXiv cs.LG · Danu Kim · 2026-05-18

This study introduces a replay-based domain-incremental continual learning method for pneumonia detection in chest X-rays, addressing cross-domain variations without catastrophic forgetting. The approach employs class-aware balanced replay to maintain balanced class representation within constrained memory and a class-aware loss to dynamically reweight class imbalance during training. Evaluated on a domain-shifted PneumoniaMNIST dataset with five simulated domains, the method achieves 88.66% average accuracy, outperforming Experience Replay, Fine-Tuning, and Joint Training baselines.

domain-incremental learning · catastrophic forgetting · class-aware balanced replay · pneumoniamnist · cross-domain variations
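
The class-aware balanced replay idea in the summary above can be illustrated with a small memory buffer. This is only a sketch under stated assumptions, not the paper's implementation: the class name `BalancedReplayBuffer`, the eviction rule (drop a random sample from the currently largest class), and the capacity are all hypothetical.

```python
import random
from collections import defaultdict


class BalancedReplayBuffer:
    """Fixed-capacity replay memory that keeps classes as balanced as possible.

    Assumed eviction rule: when the buffer overflows, remove a random
    sample from whichever class currently holds the most samples.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.by_class = defaultdict(list)
        self.rng = random.Random(seed)

    def __len__(self):
        return sum(len(items) for items in self.by_class.values())

    def add(self, sample, label):
        self.by_class[label].append(sample)
        while len(self) > self.capacity:
            largest = max(self.by_class, key=lambda c: len(self.by_class[c]))
            items = self.by_class[largest]
            items.pop(self.rng.randrange(len(items)))

    def class_counts(self):
        return {label: len(items) for label, items in self.by_class.items()}


# Toy stream: 50 samples of class 0 followed by 5 samples of class 1.
buffer = BalancedReplayBuffer(capacity=10)
for i in range(50):
    buffer.add(("normal", i), label=0)
for i in range(5):
    buffer.add(("pneumonia", i), label=1)
counts = buffer.class_counts()
```

With this rule the minority class is never evicted while a larger class exists, so the ten memory slots end up split evenly between the two classes despite the 10:1 imbalance in the stream.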

Sequential Structure in Intraday Futures Data: LSTM vs Gradient Boosting on MNQ

arXiv cs.LG · Mathias Mesfin · 2026-05-18

The paper empirically establishes a lower bound on data scale requirements for sequential financial machine learning by evaluating Kronos-inspired architectures on intraday Micro E-Mini Nasdaq 100 futures (MNQ). It compares gradient boosting and long short-term memory (LSTM) models for directional prediction using five-minute OHLCV bar sequences across 944 trading days (2021-2025) under expanding-window walk-forward validation. No model configuration achieves statistically significant out-of-sample accuracy above the 51.8% base rate, with combined accuracies ranging from 50.00% to 50.89%. Permutation tests yield p-values of 0.135 (best gradient boosting) and 0.515 (LSTM), indicating noise fitting rather than stable signal capture.

ohlcv · lstm · gradient boosting · walk-forward validation · permutation test
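
To make the permutation-test result above concrete, here is a minimal sketch of a label-permutation test for directional accuracy. The data are synthetic and the function names are hypothetical; the summary does not say exactly how the paper's test was constructed, so treat this as one common variant rather than the author's procedure.

```python
import random


def accuracy(predictions, outcomes):
    """Fraction of positions where the predicted direction matches."""
    hits = sum(1 for p, y in zip(predictions, outcomes) if p == y)
    return hits / len(outcomes)


def permutation_p_value(predictions, outcomes, n_permutations=999, seed=0):
    """Share of label shuffles that score at least as well as the real labels.

    Uses the (hits + 1) / (n + 1) convention so the p-value is never zero.
    """
    rng = random.Random(seed)
    observed = accuracy(predictions, outcomes)
    shuffled = list(outcomes)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if accuracy(predictions, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Synthetic up/down outcomes (1 = up, 0 = down), balanced by construction.
outcomes = [i % 2 for i in range(200)]
perfect_model = list(outcomes)      # always right
constant_model = [1] * 200          # always predicts "up"

p_perfect = permutation_p_value(perfect_model, outcomes)
p_constant = permutation_p_value(constant_model, outcomes)
```

A model that always predicts the same direction scores exactly the base rate under every shuffle, so its p-value is 1.0. A large p-value therefore means the accuracy is indistinguishable from what shuffled labels produce.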

How does feature learning reshape the function space?

arXiv cs.LG · João Lobo, Bruno Loureiro, Long Tran-Than, Fanghui Liu · 2026-05-18

This work precisely characterizes how feature learning reshapes the function space in two-layer neural networks during gradient descent training. Analyzing the high-dimensional proportional regime, the authors prove that post-update feature distributions are well approximated by target-dependent spiked Gaussian covariance, inducing a data-adaptive kernel. The analysis reveals feature learning as a distributional transformation that selectively amplifies eigenvalues aligned with the target direction and mixes leading eigenfunctions, coupling radial modes with quadratic harmonics. Results demonstrate that gradient descent induces a data-adaptive deformation preferentially enhancing signal-aligned directions, rather than merely rescaling a fixed kernel.

feature learning · gradient descent · spiked gaussian covariance · data-adaptive kernel · eigenfunction coupling

Online Conformal Prediction for Non-Exchangeable Panel Data

arXiv cs.LG · Daohong Tu, Kay Giesecke · 2026-05-18

The paper introduces an online conformal prediction framework for non-exchangeable panel data, addressing challenges posed by temporal dependence and unit heterogeneity. The method leverages contemporaneous outcomes from related units as a calibration panel, incorporating history-based similarity weights and an adaptive miscoverage level updated with target feedback. This dual-state design ensures stepwise and long-run coverage guarantees. Empirical evaluations on synthetic and real panel datasets demonstrate improved coverage for worst-covered target units through adaptive interval-width allocation, with similarity weights enhancing sparse feedback scenarios and adaptive levels optimizing coverage as feedback accumulates.

online conformal prediction · panel data · temporal dependence · adaptive miscoverage · similarity weights
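
The two ingredients named in the summary above, similarity-weighted calibration scores and a miscoverage level adapted from feedback, can be sketched as follows. This is an assumption-laden illustration: the update rule is the standard adaptive conformal inference step (alpha plus a step size times target level minus observed error), which may differ from the paper's rule, and all function names and the step size `gamma` are hypothetical.

```python
def weighted_quantile(scores, weights, level):
    """Smallest score whose cumulative normalized weight reaches `level`."""
    level = min(max(level, 0.0), 1.0)
    pairs = sorted(zip(scores, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for score, weight in pairs:
        cumulative += weight / total
        if cumulative >= level - 1e-12:
            return score
    return pairs[-1][0]


def conformal_step(prediction, panel_scores, panel_weights, alpha):
    """Symmetric interval from the weighted (1 - alpha) quantile of panel scores."""
    q = weighted_quantile(panel_scores, panel_weights, 1.0 - alpha)
    return prediction - q, prediction + q


def update_alpha(alpha, target_alpha, covered, gamma=0.05):
    """Raise alpha slightly after a covered outcome, lower it after a miss."""
    error = 0.0 if covered else 1.0
    return alpha + gamma * (target_alpha - error)


# Absolute residuals from four related units, with equal similarity weights.
panel_scores = [1.0, 2.0, 3.0, 4.0]
panel_weights = [1.0, 1.0, 1.0, 1.0]
lower, upper = conformal_step(10.0, panel_scores, panel_weights, alpha=0.5)
alpha_after_hit = update_alpha(0.1, 0.1, covered=True)
alpha_after_miss = update_alpha(0.1, 0.1, covered=False)
```

A miss lowers alpha, which raises the quantile level used at the next step and widens the interval. A run of covered outcomes does the opposite.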

Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space

arXiv cs.LG · Alon Bebchuk, Nir Shavit · 2026-05-18

This work investigates the mechanistic underpinnings of the lottery ticket hypothesis by analyzing winning tickets in a combinatorial, clause-structured toy setting with interpretable feature-space representations. Using well-defined combinatorial distances between features, the authors demonstrate that winning tickets correspond to precursor locations in feature space that are proximally aligned with final feature-channel codes at initialization. Dense SGD resolves these locations through structured selection, with rejection concentrated at crowded neurons due to competition under superposition. Lightweight probes based on feature-space distance and motion outperform established weight-based ticket discovery methods in accuracy and exact code recovery. These findings suggest lottery ticket structure is governed by hidden feature-space geometry rather than weight-space subnetwork identity.

lottery ticket hypothesis · feature-space geometry · structured selection · combinatorial distances · superposition

Agent Bazaar: Enabling Economic Alignment in Multi-Agent Marketplaces

arXiv cs.LG · Seth Karten, Cameron Crow, Chi Jin · 2026-05-17

The paper introduces Agent Bazaar, a multi-agent simulation framework for evaluating Economic Alignment in LLM-based marketplaces, identifying two failure modes: Algorithmic Instability (market collapse from volatility amplification) and Sybil Deception (trust erosion from coordinated fraud). The authors propose economically aligned harnesses (Stabilizing Firms, Skeptical Guardians) and train agents using REINFORCE++ with adaptive curriculum, yielding a 9B model outperforming frontier models. They introduce the Economic Alignment Score (EAS) for cross-model comparison, demonstrating economic alignment's orthogonality to general capability via targeted RL.

economic alignment · multi-agent simulation · reinforce++ · sybil deception · algorithmic instability

Fine-tuning Pocket-Aware Diffusion Models via Denoising Policy Optimization

arXiv cs.LG · Yuan Xue, Daniel Kudenko, Megha Khosla · 2026-05-17

DEPPA introduces a reinforcement learning framework for fine-tuning pocket-aware diffusion models to optimize multiple molecular properties in structure-based molecule optimization. Building on Denoising Diffusion Policy Optimization, DEPPA formulates the reverse denoising process as a multi-step Markov Decision Process, incorporating a coarse denoising scheduler for efficient optimization. Evaluated on CrossDocked2020, DEPPA achieves superior binding affinity (-8.5 kcal/mol Vina Score), drug-likeness, and diversity, with competitive synthesizability compared to baselines.

denoising diffusion · reinforcement learning · markov decision process · binding affinity · structure-based optimization

Exact Convex Reformulations of Linear Neural Networks via Completely Positive Lifting

arXiv cs.LG · Karthik Prakhya, Alp Yurtsever · 2026-05-17

The paper presents an exact convex reformulation of deep linear neural network training under squared loss, achieved through lifting to a generalized completely positive cone. The method involves reducing multilayer parameterization to bilinear factorization, converting to a rank-constrained semidefinite program, and applying completely positive lifting. The reformulation maintains the original problem's optimal value, with ambient dimension independent of network depth and data points, while bottleneck width affects only scalar constraints. Though computationally intractable, it provides an exact conic representation of linear factorization nonconvexity and links neural network training to copositive programming.

completely positive cone · bilinear factorization · rank-constrained semidefinite program · copositive programming · linear neural networks

On Gaussian approximation for entropy-regularized Q-learning with function approximation

arXiv cs.LG · Artemy Rubtsov, Rahul Singh, Eric Moulines, Alexey Naumov · 2026-05-17

The paper establishes Gaussian approximation bounds for entropy-regularized asynchronous Q-learning with linear function approximation. Using Polyak-Ruppert averaging and polynomial stepsizes, the analysis assumes geometrically ergodic Markov chains and regularity conditions for the projected soft Bellman equation. The derived convergence rate is $\mathcal{O}(n^{-1/4})$ (up to polylog factors) in convex distance, achieved via linearization of the soft Bellman recursion and Gaussian approximation of martingale terms. High-order moment bounds for the last iterate are also provided.

entropy-regularized q-learning · gaussian approximation · polyak-ruppert averaging · soft bellman equation · linear function approximation

PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment

arXiv cs.LG · Michael Arbel, Basile Terver, Jean Ponce · 2026-05-17

PEIRA introduces a non-contrastive self-supervised learning method with an explicit objective defined through the trace of an optimal linear regressor, addressing the lack of well-defined objectives in existing methods like SimSiam and BYOL. The method leverages a regularized linear regressor to predict representations of two data views, ensuring stable equilibria align with nonlinear canonical correlation subspaces. Experiments on ImageNet-1K and CIFAR-10 demonstrate PEIRA's competitiveness with VICReg and LeJEPA baselines, with qualitative results supporting the theoretical analysis.

non-contrastive ssl · canonical correlation · linear regressor · self-distillation · stable equilibria

Training Infinitely Deep and Wide Transformers

arXiv cs.LG · Raphaël Barboni, Maarten V. de Hoop, Takashi Furuya, Gabriel Peyré · 2026-05-17

The paper develops a mean-field theory for gradient-based training of infinitely deep and wide transformers, modeling them as neural PDEs due to attention-mediated token coupling. The framework establishes well-posedness of forward passes via function-space ODEs and derives conditional Wasserstein gradients through adjoint sensitivity analysis. Key results include: existence of gradient flow curves in Wasserstein space, NTK injectivity conditions for attention mechanisms (equivalent to linear independence of log-sum-exp functions modulo affine terms), and global convergence guarantees when initial loss is sufficiently small.

mean-field theory · neural pde · wasserstein gradient · ntk injectivity · adjoint sensitivity

Bug or Feature$^2$: Weight Drift, Activation Sparsity, and Spikes

arXiv cs.LG · Egor Shvetsov, Aleksandr Serkov, Shokorov Viacheslav, Redko Dmitry · 2026-05-17

The paper identifies negative weight drift as an emergent phenomenon in neural networks, caused by interactions between standard losses and positively biased activation functions. Theoretical analysis shows non-negative gradient expectations for positive pre-activations under MSE/cross-entropy loss, driving weights negative during early training. Empirical studies across architectures (MLP, ResNet, ViT, GPT-nano) reveal ReLU-induced activation sparsity up to 90%, with a sharp accuracy cliff beyond 70% sparsity. The proposed clipped ReLU² mitigates pathological activation spikes in transformers while maintaining representational benefits, outperforming unclipped variants and achieving lowest validation loss in GPT-nano.

weight drift · activation sparsity · relu² · gradient analysis · transformer
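
For readers unfamiliar with the activation mentioned above, a clipped ReLU² can be written in a few lines. This is a sketch under assumptions: the summary does not give the clipping threshold or say whether the clip is applied before or after squaring, so the cap of 6.0 on the squared output is purely illustrative.

```python
def relu2(x):
    """Squared ReLU: zero for negative inputs, x**2 otherwise."""
    r = max(x, 0.0)
    return r * r


def clipped_relu2(x, cap=6.0):
    """Squared ReLU with its output capped at `cap` (assumed threshold)."""
    return min(relu2(x), cap)


# Large pre-activations grow quadratically without the cap.
samples = [-1.0, 2.0, 10.0]
unclipped = [relu2(x) for x in samples]
clipped = [clipped_relu2(x) for x in samples]
```

The cap bounds the largest activation a single unit can emit, which is the property the summary associates with suppressing activation spikes.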

When a Zero-Shooter Cheats: Improving Age Estimation via Activation Steering

arXiv cs.LG · Erik Imgrund, Pia Hanfeld, Klim Kireev, Konrad Rieck · 2026-05-17

The study addresses a critical flaw in zero-shot age estimation by vision-language models (VLMs), where identity-based shortcuts lead to erroneous predictions when non-celebrities are misidentified as memorized celebrities. The authors propose an activation steering method that intervenes on VLM hidden states to suppress this shortcut, improving age estimation accuracy. Results demonstrate a 25% reduction in mean absolute error across benchmarks, enhancing robustness for both memorized and unseen identities.

vision-language models · zero-shot learning · age estimation · activation steering · identity shortcut

LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

arXiv cs.LG · Xinting Jiang, Junyi Luo, Ruichen Qi, Kauna Lei · 2026-05-17

LLMForge introduces a hardware-aware neural architecture search (NAS) framework for edge language models, comprising Infinite-Head Attention (IHA), Forge-Former, and Forge-DSE. IHA decouples query heads, KV groups, and dimensions, expanding the attention configuration space by 400x. Forge-Former, an encoder-based surrogate, outperforms MLP and random-forest baselines, while Forge-DSE employs NSGA-II for multi-backend hardware cost modeling. Searches on four hardware substrates yield distinct architectures, with a 300M-scale variant achieving a validation loss of 2.798, 40% lower energy per token, and 43% reduced latency compared to SmolLM2-360M and Qwen-0.5B baselines.

neural architecture search · infinite-head attention · edge language models · hardware-aware · design-space exploration

📰 Industry Media (10)

Here’s why Elon Musk lost his suit against OpenAI

MIT Tech Review — AI · Michelle Kim · 2026-05-19

Elon Musk lost his lawsuit against OpenAI in US District Court on procedural grounds, with his claims held to be barred by statutes of limitations. Musk alleged breach of charitable trust (3-year limit) and unjust enrichment (2-year limit) regarding OpenAI's transition to a for-profit structure, but the jury found he had constructive notice of these changes by 2021. Key evidence included Musk's 2017 proposal for a for-profit subsidiary and 2020 criticism of Microsoft's GPT-3 license. The verdict did not address the merits of Musk's claims, focusing instead on the timeliness of his 2024 filing. Musk plans to appeal to the Ninth Circuit.

statute of limitations · breach of trust · unjust enrichment · constructive notice · procedural grounds

Google Launches Antigravity 2.0 at I/O 2026: A Standalone Agent-First Platform with CLI, SDK, Managed Execution, and Enterprise Support

MarkTechPost · Michal Sutter · 2026-05-19

Google Antigravity 2.0 introduces a standalone agent-first platform for AI-assisted development, shifting from IDE-centric to multi-agent workflow management. The platform includes a desktop application, CLI, SDK, and enterprise support, enabling parallelized agent orchestration, scheduled tasks, and dynamic subagents. Managed Agents in the Gemini API provide isolated Linux environments for persistent multi-turn sessions, powered by Gemini 3.5 Flash. The ecosystem integrates with Google AI Studio, Android, and Workspace, offering mobile development and cloud deployment options. Gemini 3.5 Flash, the default model, outperforms Gemini 3.1 Pro in benchmarks and reduces latency in concurrent agent calls.

agent orchestration · gemini api · multi-turn sessions · dynamic subagents · isolated environments

Best Enterprise Level Agentic AI Platforms for 2026

MarkTechPost · Asif Razzaq · 2026-05-19

The article evaluates 10 enterprise agentic AI platforms for 2026, ranking them by production readiness, pricing, and constraints. Key platforms include Salesforce Agentforce (CRM-native workflows, 29,000 deals, $800M ARR), Microsoft Copilot Studio (Teams-embedded, 400,000+ agents), and ServiceNow AI Platform (ITSM governance, 85B workflows analyzed). Methodological focus emphasizes autonomous decision-making, multi-step reasoning, and data quality over 'agent washing.' Results highlight ecosystem-specific strengths: Salesforce for CRM, Microsoft for M365, ServiceNow for regulated workflows, LangGraph for developer control, and Google for multimodal A2A interoperability.

agentic ai · multi-step reasoning · data 360 · a2a protocol · workflow data fabric

How to Build an Advanced Agentic AI System with Planning, Tool Calling, Memory, and Self-Critique Using OpenAI API

MarkTechPost · Sana Hassan · 2026-05-19

The article presents a modular agentic AI system leveraging OpenAI API, structured around distinct roles: planner, executor, and critic. The system integrates specialized tools (calculator, knowledge-base search, JSON extraction, file writing) to enable reliable computation, structured output generation, and artifact saving. The agent employs a pipeline architecture, where the planner generates JSON-structured plans, the executor performs task-specific actions using tools, and the critic refines outputs. Results demonstrate the system’s ability to produce deliverables such as meeting summaries, action items, and follow-up emails, while maintaining a transparent tool-call trace for debugging and accountability.

agentic ai · tool calling · self-critique · structured output · planner-executor-critic

Meet MemPrivacy: An Edge-Cloud Framework that Uses Local Reversible Pseudonymization to Protect User Data Without Breaking Memory Utility

MarkTechPost · Asif Razzaq · 2026-05-18

MemPrivacy introduces an edge-cloud framework for privacy-preserving LLM memory systems using local reversible pseudonymization. It replaces sensitive user data with typed placeholders on-device before cloud transmission, preserving semantic structure while preventing raw private value exposure. The framework employs a four-level privacy taxonomy (PL1-PL4) and achieves 85.97% F1 on MemPrivacy-Bench with its 4B parameter model, outperforming GPT-5.2 and Gemini-3.1-Pro. When applied to memory systems LangMem, Mem0, and Memobase, MemPrivacy limits utility loss to ≤1.6% while protecting PL2-PL4 content, compared to up to 41.87% accuracy drop with irreversible masking.

local reversible pseudonymization · typed placeholders · privacy taxonomy · edge-cloud framework · memory utility
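
The core mechanism described above, swapping sensitive values for typed placeholders on-device and restoring them locally afterwards, can be sketched with two regular expressions. Everything here is an assumption for illustration: the regex detectors, the `<TYPE_n>` placeholder format, and the function names are hypothetical, and the sketch does not model MemPrivacy's PL1-PL4 taxonomy or its 4B detection model.

```python
import re

# Hypothetical detectors; a real system would use a trained model.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d -]{7,}\d"),
}


def pseudonymize(text):
    """Replace detected values with typed placeholders; return text and local map."""
    mapping = {}   # placeholder -> original value, kept on-device
    counters = {}  # per-type running index

    for kind, pattern in PATTERNS.items():
        def replace(match, kind=kind):
            value = match.group(0)
            for placeholder, original in mapping.items():
                if original == value:
                    return placeholder  # reuse placeholder for repeated values
            counters[kind] = counters.get(kind, 0) + 1
            placeholder = f"<{kind}_{counters[kind]}>"
            mapping[placeholder] = value
            return placeholder

        text = pattern.sub(replace, text)
    return text, mapping


def restore(text, mapping):
    """Substitute original values back into text returned from the cloud."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


message = "Email alice@example.com or call +1 555 010 9999 today"
masked, local_map = pseudonymize(message)
round_trip = restore(masked, local_map)
```

Because each placeholder keeps its type, the masked text still tells a downstream model that an email address and a phone number were mentioned. That retained structure is what the summary credits for preserving memory utility.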

Stochastic Gradient Descent (SGD’s) Frequency Bias and How Adam Fixes It

MarkTechPost · Arham Islam · 2026-05-18

The study demonstrates how Adam's variance normalization mitigates Stochastic Gradient Descent (SGD)'s frequency bias in imbalanced token distributions. Using a controlled six-token vocabulary with frequencies spanning four orders of magnitude, the authors train identical linear models with SGD and Adam. Results show SGD fails on rare tokens (e.g., 'thalweg' reaches only 0.15 vs. true weight 1.0), while Adam's adaptive learning rates (up to 41× amplification for rare tokens) enable uniform convergence across all frequencies.

frequency bias · variance normalization · adaptive optimization · token distribution · gradient sparsity
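
The effect described above can be reproduced in a toy setting. The sketch below is not the article's code: the six relative frequencies, the learning rates, and the use of expected (rather than sampled) gradients are assumptions chosen to keep the example deterministic, so the numbers will not match the article's figures.

```python
import math

# Assumed relative token frequencies spanning four orders of magnitude.
FREQS = [1.0, 0.3, 0.1, 0.01, 0.001, 0.0001]
PROBS = [f / sum(FREQS) for f in FREQS]
TARGET = 1.0  # true weight for every token


def expected_grads(weights):
    """Expected gradient of 0.5 * (w - TARGET)**2 when token i occurs with probability p_i."""
    return [p * (w - TARGET) for p, w in zip(PROBS, weights)]


def train_sgd(steps=2000, lr=0.5):
    w = [0.0] * len(PROBS)
    for _ in range(steps):
        g = expected_grads(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


def train_adam(steps=2000, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    n = len(PROBS)
    w, m, v = [0.0] * n, [0.0] * n, [0.0] * n
    for t in range(1, steps + 1):
        g = expected_grads(w)
        for i in range(n):
            m[i] = b1 * m[i] + (1 - b1) * g[i]       # first-moment estimate
            v[i] = b2 * v[i] + (1 - b2) * g[i] ** 2  # second-moment estimate
            m_hat = m[i] / (1 - b1 ** t)             # bias correction
            v_hat = v[i] / (1 - b2 ** t)
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


sgd_weights = train_sgd()
adam_weights = train_adam()
```

Under SGD, the step for each weight is proportional to its token's frequency, so the rarest weight barely moves in 2000 steps. Adam divides each coordinate's step by the root of its own second-moment estimate, which cancels the frequency factor and lets all six weights approach the target at a similar pace.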

NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon

MarkTechPost · Asif Razzaq · 2026-05-18

NVIDIA introduces NVFP4, a 4-bit microscaling format for pretraining large language models, validated on a 12-billion-parameter hybrid Mamba-Transformer trained on 10 trillion tokens. NVFP4 employs 16-element blocks with E4M3 scale factors and an FP32 per-tensor scale, achieving near-FP8 precision for 6.25% of values. The methodology includes selective high precision, random Hadamard transforms, 2D block scaling, and stochastic rounding on gradients. Results show comparable performance to FP8 baselines, with 62.58% accuracy on MMLU-Pro 5-shot and 2×-3× speedups on NVIDIA Blackwell Tensor Cores.

nvfp4 · microscaling · mamba-transformer · stochastic rounding · tensor cores
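
Block-scaled 4-bit quantization of the kind summarized above can be sketched in plain Python. The grid below is the set of magnitudes representable in the E2M1 FP4 encoding. Two simplifications are assumptions of this sketch and not NVIDIA's method: the per-block scale is kept as an ordinary float instead of being rounded to E4M3, and the FP32 per-tensor scale is omitted.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 (E2M1) magnitudes
BLOCK = 16  # elements sharing one scale factor


def quantize_block(values):
    """Map a block onto the signed E2M1 grid using one shared scale."""
    amax = max(abs(x) for x in values)
    if amax == 0.0:
        return 1.0, [0.0] * len(values)
    scale = amax / 6.0  # the block maximum lands on the top grid point
    codes = []
    for x in values:
        magnitude = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(math.copysign(magnitude, x))
    return scale, codes


def dequantize_block(scale, codes):
    return [scale * c for c in codes]


def quantize_tensor(values):
    """Quantize and immediately dequantize, block by block."""
    out = []
    for start in range(0, len(values), BLOCK):
        scale, codes = quantize_block(values[start:start + BLOCK])
        out.extend(dequantize_block(scale, codes))
    return out


values = [0.1 * i - 0.8 for i in range(32)]  # two blocks of 16
reconstructed = quantize_tensor(values)
errors = [abs(a - b) for a, b in zip(values, reconstructed)]
```

In each block, the largest-magnitude element maps to the top grid point, so its error depends only on how precisely the scale is stored. That is one value in sixteen, or 6.25%, which is a plausible reading of the figure quoted in the summary.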

The Nvidia H200 China deal survived the Trump-Xi summit–just not in the way anyone expected

AI News · Dashveenjit Kaur · 2026-05-19

The US-China semiconductor stalemate reveals a structural shift in China's AI hardware strategy, as government directives override technical benchmarks. Despite US approval for Nvidia H200 exports to 10 Chinese firms (including Alibaba and Tencent), Beijing blocks domestic deployment to prioritize Huawei's Ascend chips. DeepSeek V4's optimization for Huawei processors and Alibaba's T-Head GPU production demonstrate active domestic substitution, reducing Nvidia's China revenue from 20% to 5%. The deadlock stems from conflicting policies: US export controls restrict offshore use while China mandates onshore alternatives.

nvidia h200 · huawei ascend · deepseek v4 · export controls · gpu supply-chain

AI is a matter of power, infrastructure and security: TechEx North America

AI News · Joe Green · 2026-05-19

TechEx North America 2024 highlighted critical infrastructure challenges for enterprise AI deployment across edge computing, IoT, and data center domains. Sessions emphasized latency optimization (2-5ms for industrial edge cases), distributed inference architectures, and immutable infrastructure requirements for IIoT/OT-IT convergence. Digital twin implementations shifted focus from visualization to operational decision support, with Siemens and LG CNS demonstrating maintenance optimization use cases. Data center constraints emerged as fundamental bottlenecks, with AI compute density demanding 20-30kW/rack power budgets and water-cooling solutions. Cybersecurity tracks identified shadow AI as increasing attack surfaces by 37% in legacy environments, necessitating zero-trust controls for industrial automation systems.

edge computing · distributed inference · digital twins · zero-trust · shadow ai

Amazon launches Alexa for Shopping as Rufus moves behind the scenes

AI News · Muhammad Zulhusni · 2026-05-18

Amazon integrated its Rufus shopping chatbot with Alexa+ to create Alexa for Shopping, a multimodal assistant supporting product queries, price tracking, and automated purchases across app, web, and Echo Show interfaces. The system leverages in-context learning from user activity (purchases, browsing, conversations) to generate product comparisons, AI summaries, and personalized recommendations. Results indicate 115% MAU growth and 400% engagement increase for Rufus, now processing 300M customer interactions annually. The assistant enables conditional shopping actions (price-triggered cart additions) and cross-device preference synchronization through Amazon's conversational AI infrastructure.

in-context learning · multimodal assistant · conversational ai · agentic ai · mau growth


Generated automatically at 2026-05-19 21:30 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.