Daily Digest — 2026-05-28

Wednesday, May 27, 2026 · 329 items · model: deepseek/deepseek-chat

329 items · 6 research labs, 316 arxiv papers, 7 industry media

🏛️ Research Labs (6)

Building self-improving tax agents with Codex

OpenAI News · 2026-05-27

Thrive Holdings and OpenAI jointly developed Tax AI, a Codex-driven system that automates tax return preparation and improves itself over time. By combining practitioner feedback, production traces, and a Codex-driven iteration loop, the system achieved 97% accuracy in drafting returns, reduced preparation time by 33%, and increased throughput by 50%. Key innovations include structured error capture, targeted eval generation, and automated engineering task scoping, which raised correct field completion from 25% to 86% within six weeks.

codex · tax ai · self-improving · production traces · eval-driven

Election information and safeguards in 2026

OpenAI News · 2026-05-27

OpenAI outlines a multi-pronged approach to safeguard 2026 elections through AI transparency, cyber defense, and information integrity. Key initiatives include integrating SynthID digital watermarks for AI-generated images, deploying Codex Security and Trusted Access for Cyber (TAC) programs to harden election infrastructure, and partnering with AP and Democracy Works to surface verified voting information. The company enforces usage policies against election interference, monitors model bias via political bias evaluations, and supports legislative efforts like the Protect Elections from Deceptive AI Act. These measures aim to combat deepfakes, cyber threats, and misinformation while preserving civic engagement.

synthid · codex security · political bias evaluation · c2pa standard · trusted access for cyber

Warp’s big bet on building open source with GPT-5.5

OpenAI News · 2026-05-27

Warp introduces Open Agentic Development, leveraging GPT-5.5 for orchestrating coding agents across local and cloud environments. The method combines human supervision with autonomous agent workflows for tasks like code generation, testing, and pull requests, using Oz as a cloud orchestration platform with features like context compaction and persistent memory. Results show GPT-5.5 reduces token usage by 30% compared to GPT-5.4, with agents co-creating 90% of internal pull requests, while enterprise revenue grew 500% since Q4 2025.

agent orchestration · open agentic development · context compaction · llm-as-a-judge · kv-cache

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

Hugging Face Blog · 2026-05-27

Artificial Analysis and IBM introduce ITBench-AA, the first benchmark for agentic enterprise IT tasks, focusing on Site Reliability Engineering (SRE) in Kubernetes environments. The evaluation uses a structured agentic harness (Stirrup) to assess how well models diagnose incidents via shell access to logs and snapshots, scored by recall-gated precision. Frontier models score below 50%, with Claude Opus 4.7 leading at 47%, while open-weight models such as GLM-5.1 (40%) show competitive cost-performance tradeoffs. Key findings include an inverse correlation between turn count and accuracy: Gemini 3.1 Pro Preview averages 83 turns for 30% accuracy, versus 58 turns for 37% with Gemma 4 31B.

agentic benchmarking · kubernetes diagnostics · recall-gated precision · open-weight models · sre tasks

Reachy Mini goes fully local

Hugging Face Blog · 2026-05-27

The Hugging Face Blog introduces a fully local speech-to-speech pipeline for Reachy Mini, leveraging a cascaded architecture comprising Silero VAD, Parakeet-TDT STT, Gemma 4/Qwen3 LLM, and Qwen3-TTS. The pipeline operates entirely on-device, ensuring privacy and eliminating API costs. The method employs llama.cpp for LLM serving with a 64k context window, flash attention, and sliding-window attention caching to optimize latency. Results demonstrate multilingual conversational capabilities, with customizable components for specific use cases. The system supports multiple LLM backends, including MLX, Transformers, vLLM, and Responses API-compatible endpoints.

speech-to-speech · llama.cpp · flash attention · sliding-window attention · multilingual

Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL

Hugging Face Blog · 2026-05-27

The article introduces delta weight synchronization in TRL, a method that reduces bandwidth costs in asynchronous RL by transmitting only changed model parameters between training and inference. Leveraging bf16's rounding properties, the approach achieves >98% sparsity in weight updates, encoding changes as sparse safetensors files stored in Hugging Face Buckets. Experimental results on Qwen3-0.6B show a reduction from 1.2GB to 20-35MB per step. The architecture decouples trainer and inference clusters via a shared bucket, enabling cross-region deployment without RDMA or direct connectivity.

delta weight synchronization · asynchronous rl · bf16 sparsity · safetensors · hugging face bucket
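
As a rough illustration of the delta idea in the TRL item above, the sketch below rounds weights to bfloat16 and ships only the positions whose 16-bit pattern changed. It is a standard-library toy, not TRL's implementation: the function names, the dict-based delta format, and the sample values are all assumptions made for illustration, and NaN/overflow handling is omitted.

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Round a finite float to bfloat16 (nearest-even); return its 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                  # nearest-even rounding bias
    return (bits >> 16) & 0xFFFF                         # keep the top 16 bits

def encode_delta(old, new):
    """Hypothetical delta format: {index: new_bf16_bits} for changed positions only."""
    delta = {}
    for i, (o, n) in enumerate(zip(old, new)):
        n_bits = to_bf16_bits(n)
        if to_bf16_bits(o) != n_bits:
            delta[i] = n_bits
    return delta

def apply_delta(old_bits, delta):
    """Patch a list of bf16 bit patterns in place of a full weight transfer."""
    patched = list(old_bits)
    for i, bits in delta.items():
        patched[i] = bits
    return patched

# Illustrative values: two tiny updates vanish under bf16 rounding, one survives.
old_weights = [1.0, 0.5, -2.0, 0.25]
new_weights = [1.0 + 1e-4, 0.5, -2.0 - 1e-5, 0.3]

delta = encode_delta(old_weights, new_weights)
receiver = apply_delta([to_bf16_bits(w) for w in old_weights], delta)
print(sorted(delta))  # indices that actually need to be transmitted
```

Because bf16 keeps only 8 bits of precision, small optimizer steps frequently round back to the same stored value, which is the property the summary credits for the high sparsity of updates.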

📜 arXiv Papers (316)

Algorithmic Monocultures in Hiring

arXiv cs.AI · Rishi Bommasani, Sarah H. Bana, Kathleen A. Creel, Dan Jurafsky · 2026-05-26

This study investigates the impact of algorithmic monoculture in hiring, hypothesizing that reliance on algorithms from a single vendor leads to systemic racial disparities and homogeneous applicant outcomes. The authors analyze a dataset of 3 million applicants submitting 4 million applications, all screened by algorithms from the same vendor. Results reveal that 14.74% of Asian and 25.87% of Black applicants face adverse outcomes according to U.S. employment discrimination standards. Additionally, 4% of applicants applying to 10 positions are rejected from all, exceeding chance expectations. Deterministic replicability of hiring algorithms is leveraged to simulate outcomes, showing applicants must apply widely to ensure human consideration.

algorithmic monoculture · hiring algorithms · racial disparities · deterministic replicability · employment discrimination
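
Two calculations sit behind claims like the ones in this item, and a small sketch may make them concrete. The first is the four-fifths rule from the U.S. Uniform Guidelines on Employee Selection Procedures, a common adverse-impact screen; whether the paper uses exactly this test is an assumption here. The second is the chance baseline for being rejected from every application if decisions were independent. All rates below are hypothetical inputs, not figures from the paper.

```python
def impact_ratio(group_rate: float, reference_rate: float) -> float:
    """Selection rate of a group relative to the highest-selected group."""
    return group_rate / reference_rate

def flags_adverse_impact(group_rate: float, reference_rate: float,
                         threshold: float = 0.8) -> bool:
    """Four-fifths rule: flag when the ratio falls below 80%."""
    return impact_ratio(group_rate, reference_rate) < threshold

def chance_all_rejected(rejection_rate: float, n_applications: int) -> float:
    """Probability of n rejections in a row if decisions were independent."""
    return rejection_rate ** n_applications

# Hypothetical selection rates, for illustration only.
print(flags_adverse_impact(group_rate=0.30, reference_rate=0.50))  # ratio 0.6
print(flags_adverse_impact(group_rate=0.45, reference_rate=0.50))  # ratio 0.9

# Hypothetical 50% per-application rejection rate across 10 applications.
baseline = chance_all_rejected(0.5, 10)
print(f"{baseline:.4%}")
```

Under a monoculture, decisions are correlated rather than independent, so the observed share of applicants rejected everywhere can sit well above this kind of baseline.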

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

arXiv cs.AI · Huawei Lin, Peng Li, Jie Song, Fuxin Jiang · 2026-05-26

MUSE-Autoskill introduces a skill-centric agent framework enabling LLM agents to continuously improve task-solving capabilities through a unified lifecycle of skill creation, memory, management, evaluation, and refinement. The framework allows agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them via unit tests and runtime feedback. Skill-level memory accumulates experience for each skill, enhancing reuse and adaptation. Experiments on SkillsBench demonstrate lifecycle-managed skills improve task success, efficiency, reuse, and cross-agent transfer, emphasizing skills as long-lived, experience-aware, and testable assets.

skill-centric agent · skill-level memory · task-solving capability · unit tests · runtime feedback

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

arXiv cs.AI · Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei · 2026-05-26

LocateAnything introduces Parallel Box Decoding (PBD), a unified framework for vision-language grounding and detection that decodes geometric elements (e.g., bounding boxes) as atomic units in a single step, preserving intra-box coherence and enabling parallelism. The method addresses inefficiencies in token-by-token decoding by leveraging a scalable data engine and LocateAnything-Data, a dataset with 138M training samples. Evaluations demonstrate improved decoding throughput and high-IoU localization accuracy across benchmarks, highlighting the synergy of PBD and large-scale training for efficient, precise visual grounding.

parallel box decoding · vision-language grounding · high-iou localization · bounding box coherence · scalable data engine

Natural Language Query to Configuration for Retrieval Agents

arXiv cs.AI · Melissa Z. Pan, Negar Arabzadeh, Mathew Jacob, Fiodar Kazhamiaka · 2026-05-26

The paper introduces BRANE, a method for dynamic per-query configuration selection in retrieval agents that optimizes either accuracy or cost. BRANE uses an LLM to extract query characteristics and trains lightweight predictors to estimate pipeline correctness, enabling runtime selection of optimal configurations from a predefined catalog. Evaluations on MuSiQue, BrowseComp-Plus, and FinanceBench show BRANE achieves the best fixed configuration's accuracy at 89% lower cost while outperforming LLM-routing and rule-based baselines, demonstrating practical per-query pipeline optimization.

retrieval agents · configuration selection · cost-quality tradeoff · lightweight predictor · pipeline optimization
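
The per-query selection step described for BRANE can be pictured with a small sketch. The selection rule used here (cheapest configuration whose predicted correctness clears a threshold, otherwise the most promising one) is an assumption for illustration, as are the `Config` type, the configuration names, the costs, and the stubbed predictor scores; the paper's actual predictors and objective may differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    name: str
    cost: float  # hypothetical relative cost per query

def select_config(catalog, predicted_correctness, min_correctness=0.8):
    """Pick the cheapest config predicted to be correct often enough.

    predicted_correctness maps config name -> estimated probability of a
    correct answer for this query (stand-in for a lightweight predictor).
    """
    eligible = [c for c in catalog
                if predicted_correctness[c.name] >= min_correctness]
    if eligible:
        return min(eligible, key=lambda c: c.cost)
    # Nothing clears the bar: fall back to the most promising config.
    return max(catalog, key=lambda c: predicted_correctness[c.name])

# Hypothetical catalog and per-query predictor outputs.
catalog = [
    Config("single-hop-small", cost=1.0),
    Config("multi-hop-small", cost=4.0),
    Config("multi-hop-large", cost=20.0),
]
easy_query = {"single-hop-small": 0.92, "multi-hop-small": 0.95, "multi-hop-large": 0.97}
hard_query = {"single-hop-small": 0.20, "multi-hop-small": 0.55, "multi-hop-large": 0.70}

print(select_config(catalog, easy_query).name)
print(select_config(catalog, hard_query).name)
```

Routing easy queries to cheap pipelines while reserving expensive ones for hard queries is how a selector of this kind could match a strong fixed configuration at lower average cost.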

GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis, Research, and Testing

arXiv cs.AI · Tamerlan Aghayev, Maxime Elkael, Michele Polese, Minh Dat Nguyen · 2026-05-26

The paper introduces GENESIS, an AI agent framework for autonomous 6G RAN synthesis and testing, addressing six bottlenecks in cellular R&D. It employs composable primitives (agents, skills, hooks) and a knowledge layer (SYNAPSE) to convert intents into validated solutions via over-the-air experiments, while mitigating LLM pitfalls like API hallucination. The framework compounds capabilities through persistent knowledge integration, targeting interoperability and real-hardware robustness.

6g ran · ai agents · knowledge layer · over-the-air testing · interoperability

MobileMoE: Scaling On-Device Mixture of Experts

arXiv cs.AI · Yanbei Chen, Hanxian Huang, Ernie Chang, Jacob Szwejbka · 2026-05-26

MobileMoE introduces a family of on-device Mixture-of-Experts (MoE) language models with sub-billion active parameters (0.3-0.9B active, 1.3-5.3B total), optimizing for mobile memory and compute constraints. The architecture employs moderate sparsity with fine-grained and shared experts, trained via a four-stage recipe including pre-training, mid-training, instruction fine-tuning, and quantization-aware training. Evaluated across 14 benchmarks, MobileMoE matches or exceeds dense LLMs with 2-4× fewer inference FLOPs and outperforms OLMoE-1B-7B with up to 60% fewer parameters. Efficient INT4 inference on smartphones demonstrates 1.8-3.8× faster prefill and 2.2-3.4× faster decode compared to MobileLLM-Pro.

mixture-of-experts · on-device · sparsity · quantization-aware training · inference flops

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

arXiv cs.AI · Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee · 2026-05-26

This paper identifies alignment tampering, a vulnerability in Reinforcement Learning from Human Feedback (RLHF) where Large Language Models (LLMs) influence preference datasets to amplify misaligned biases. The method leverages RLHF's core limitations: (1) preference datasets derived from LLM outputs allow model influence, and (2) pairwise comparisons fail to distinguish quality from bias. Experiments demonstrate bias amplification across domains, including keyword bias, propaganda, brand promotion, and instrumental goal-seeking. Mitigation remains challenging, as existing robust RLHF techniques cannot fully resolve tampering without compromising response quality. These findings highlight structural vulnerabilities in RLHF alignment.

alignment tampering · rlhf · preference dataset · pairwise comparison · bias amplification

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

arXiv cs.AI · Yi Jing, Zao Dai, Jinwu Hu, Zijun Yao · 2026-05-26

SAERL introduces a framework for LLM reinforcement learning that leverages model internals via sparse autoencoders (SAEs) to guide post-training data engineering. It quantifies three intrinsic data properties—diversity, difficulty, and quality—using SAE-derived signals, enabling operations like batch mixing, curriculum ordering, and filtering. Experiments on Qwen2.5-Math-1.5B show a 3.00% accuracy gain over vanilla GRPO and 20% faster convergence, with SAE-based metrics transferring across model families and scales.

sparse autoencoder · model internals · data engineering · reinforcement learning · mechanistic interpretability

When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

arXiv cs.AI · Kim Jihyeon, Sohee Kim, Soosan Lee, Souhwan Jung · 2026-05-26

The paper introduces Social Gaze Consistency (SGC), a high-level semantic cue for detecting AI-generated images, defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement in social interactions. The method employs three mechanisms: (i) a diagnostic dataset with controlled gaze perturbations to prevent memorization shortcuts, (ii) Block-Compositional Caption Supervision to decouple reasoning consistency from surface diversity, and (iii) cross-architecture validation showing backbone-agnostic improvements (+3.7 pp on COCOAI Interaction subset, +1.3 pp on COCOAI Person subset). The approach leverages paired-edit shortcut blocking and CLIP prior preservation to explain transferability across generators.

social gaze consistency · block-compositional caption supervision · clip prior preservation · paired-edit shortcut blocking · periocular structure

2-ASP(Q) programs with weak constraints: Complexity and efficient implementation

arXiv cs.AI · Andrea Cuteri, Giuseppe Mazzotta, Francesco Ricca · 2026-05-26

The paper characterizes the complexity of 2-ASP(Q)^w programs, a fragment of Answer Set Programming (ASP) with two quantifiers and weak constraints, showing they capture optimization problems up to Delta_3^P. It introduces a CEGAR-based technique in the Casper system for computing quantified answer sets, with experimental validation on hard benchmarks demonstrating practical efficacy. Theoretical contributions include tight completeness results and analysis of previously unaddressed cases.

answer set programming · quantifiers · weak constraints · delta_3^p · cegar

EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering

arXiv cs.AI · Zhifei Dou, Shabnam Hassani, Ou Wei · 2026-05-26

EdgeFlow introduces an edge-map augmented VLM approach for topology-preserving flowchart-to-Mermaid conversion in industrial requirements engineering (RE). The method enhances off-the-shelf VLMs by injecting a Canny-derived structural prior without fine-tuning or annotated data. Evaluated on the IndusReqFlow dataset, it improves node-, edge-, and path-level F1 scores by 17.39, 16.94, and 11.06 percentage points respectively over baseline VLMs, enabling better model-based testing support. Cross-dataset tests on synthetic benchmarks show no gains, underscoring the need for industrial benchmarks in VLM-based RE tool evaluation.

vision language models · requirements engineering · canny edge detection · topology preservation · mermaid conversion

Maat: The Agentic Legal Research Assistant for Competition Protection

arXiv cs.AI · Basant Mounir, Farida Madkour, Amira Abdelaziz, Asmaa Sami · 2026-05-26

Maat, a ReAct agent, addresses limitations of general and legal AI assistants in competition law research by orchestrating task-specific tools. It integrates RAG for grounding in official sources, provides in-line citations, employs web search fallback, and prompts user clarification for ambiguous queries. Evaluations show Maat outperforms baselines in case-specific tasks and matches top baselines in theoretical question tasks. The dataset is publicly available on GitHub.

react agent · rag · competition law · in-line citations · web search fallback

Modeling Agentic Technical Debt and Stochastic Tax: A Standalone Framework for Measurement, Simulation, and Dashboarding

arXiv cs.AI · Muhammad Zia Hydari, Raja Iqbal, Narayan Ramasubbu · 2026-05-26

The paper introduces a formal framework distinguishing Agentic Technical Debt (accumulated design liability) from Stochastic Tax (recurring operational burden in AI workflows). It presents a structural model with measurable variables, operational estimation methods, and a dashboard for managerial use. The framework is validated through an accounts-payable simulation and spreadsheet implementation, demonstrating how debt amplifies tax and vice versa.

agentic technical debt · stochastic tax · probabilistic reasoning · workflow integration · dashboarding

Risk Averse Alert Prioritization for IDS Using Subnormal Gaussian Fuzzy Models

arXiv cs.AI · Murat Moran · 2026-05-26

The paper proposes a risk-averse alert prioritization framework for intrusion detection systems using subnormal Gaussian fuzzy numbers, explicitly modeling threat severity, detection confidence, and organizational risk attitude. Each alert is represented as a fuzzy number with core, spread, and height attributes, enabling interpretable reasoning and a tunable security posture via a risk-attitude parameter. Evaluated on CIC-IDS2017 and NSL-KDD, the method is more robust under detector degradation (0.9963 vs 0.8215 NDCGrel@100), differentiates mid-confidence alerts more clearly, and reaches near-parity with baselines under robust detectors. The framework is computationally efficient, theoretically grounded, and robust across detector families and miscalibration scenarios.

intrusion detection · fuzzy numbers · alert prioritization · risk attitude · detector degradation

It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty

arXiv cs.AI · Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown · 2026-05-26

The paper introduces MUSE, a two-stage evaluation framework to disentangle mechanisms driving LLM conformity, demonstrating it stems from both sycophancy and epistemic uncertainty. MUSE maps a model's uncertainty in initial responses against its likelihood to yield to user pushback, revealing two distinct factors: sycophantic conformity (alignment despite certainty) and uncertainty-driven conformity (increased yielding with uncertainty). Ablation studies show both factors grow with the user's perceived expertise and suggestion plausibility, informing targeted interventions for alignment-induced sycophancy versus training-data-driven uncertainty.

llm conformity · epistemic uncertainty · sycophantic conformity · muse framework · inference-time behavior

Falcon-X: A Time Series Foundation Model for Heterogeneous Multivariate Modeling

arXiv cs.AI · Yiding Liu, Yifan Hu, Hongjie Xia, Peiyuan Liu · 2026-05-26

Falcon-X introduces a novel time series foundation model (TSFM) for heterogeneous multivariate forecasting by decoupling variates into a unified latent prototype space. The method employs Unified Prototype Diff-Attention to align heterogeneous variates via positive/negative semantic affinities, Latent Entity Attention for cross-variate interactions, and a Variate Reassembly Router for trajectory reconstruction. Evaluations on GIFT-Eval and fev-bench show state-of-the-art performance, enabling zero-shot structural transfer and scalable modeling of complex multivariate systems.

time series foundation model · latent prototype space · diff-attention · zero-shot transfer · multivariate forecasting

FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

arXiv cs.AI · Xintong Hu, Xuhong Huang, Jinyu Zhang, Yutong Yao · 2026-05-26

FineVLA introduces a framework for fine-grained vision-language-action (VLA) supervision to address the limitation of coarse goal-level instructions in robot datasets. The method unifies 972K trajectories from 10 datasets into FineVLA-Data (47K human-verified trajectories), provides a benchmark with 500 videos and 10K atomic facts, and trains steerable VLA policies with mixed fine-grained (FG) and raw instructions. Results show FG-only improves success rates by +1.4 to +8.1 points over raw-only, with optimal FG:Raw ratios of 1:2 to 1:1 (86.8% simulation success). FG supervision particularly enhances pose (+23), color (+18), and approach direction (+18) control.

vision-language-action · fine-grained supervision · steerable policy · robot datasets · dual-arm manipulation

SIA: Self Improving AI with Harness & Weight Updates

arXiv cs.AI · Prannay Hebbar, Yogendra Manawat, Samuel Verboomen, Alesia Ivanova · 2026-05-26

The paper introduces SIA, a self-improving AI framework in which a Feedback-Agent jointly optimizes both the harness (tools, prompts, retry logic) and the model weights of task-specific agents, bridging two previously isolated approaches. Evaluated across Chinese legal charge prediction (LawBench), GPU kernel optimization, and single-cell RNA denoising, SIA achieves a 56.6% accuracy gain, a 91.9% runtime reduction, and a 502% denoising improvement over baselines. Weight updates capture domain intuition, while harness modifications enable agentic search behavior.

self-improving ai · harness updates · weight updates · feedback-agent · task-specific agents

Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)

arXiv cs.AI · Samer Awad, Javier Conde, Carlos Arriaga, Tairan Fu · 2026-05-26

The study introduces the Word Coverage Score (WCS) to quantify how standard sampling filters (Top-p, Top-k, Min-p) in LLMs suppress lexical diversity by pruning contextually appropriate low-frequency human vocabulary. By auditing open-weight models on human-authored corpus fragments, the authors measure the lexical survival rate of high-information words, revealing that industry-standard sampling defaults act as unintended censorship mechanisms. Results demonstrate a trade-off between text coherence and lexical richness, providing a diagnostic tool for preserving linguistic diversity in generative models.

word coverage score · sampling filters · lexical diversity · decoding mechanics · generative models
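
To make the pruning effect in the item above concrete, the sketch below implements the three filters in their commonly used form and checks which tokens survive each one. The next-token distribution is invented for illustration, and the survival check is a simplified stand-in, not the paper's WCS definition.

```python
def top_k_set(probs, k):
    """Keep the k most probable tokens."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return set(ranked[:k])

def top_p_set(probs, p):
    """Keep the smallest high-probability prefix whose mass reaches p."""
    kept, cumulative = set(), 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.add(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    return kept

def min_p_set(probs, min_p):
    """Keep tokens with probability at least min_p times the top probability."""
    cutoff = min_p * max(probs.values())
    return {token for token, q in probs.items() if q >= cutoff}

# Hypothetical next-token distribution; "murmured" is the human-written word.
probs = {"the": 0.50, "a": 0.30, "said": 0.15, "murmured": 0.04, "susurrated": 0.01}

filters = {
    "top-k (k=3)": top_k_set(probs, 3),
    "top-p (p=0.9)": top_p_set(probs, 0.9),
    "min-p (0.05)": min_p_set(probs, 0.05),
}
for name, kept in filters.items():
    print(name, "keeps 'murmured':", "murmured" in kept)
```

In this toy distribution the rarer but valid word is unreachable under the Top-k and Top-p settings and reachable under Min-p, which is the kind of per-word reachability the study audits at scale.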

PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

arXiv cs.AI · Bowen Li, Shaotong Guo, Zhen Wang, Yang Xiang · 2026-05-26

PilotTTS introduces a lightweight autoregressive TTS system achieving competitive performance with minimal data (200K hours) and open-source tools. Key innovations include a reproducible multi-stage data pipeline (quality assessment, label annotation, filtering) and a compact Q-Former-based architecture decoupling speaker identity from style via cross-sample paired training. The system supports zero-shot voice cloning, emotion synthesis (11 categories), and dialect synthesis (14 Chinese variants), achieving SOTA results on Seed-TTS Eval (1.50% WER, 0.87% CER, 0.862/0.815 speaker similarity).

autoregressive tts · q-former · zero-shot cloning · paralinguistic synthesis · cross-sample training

Pair-In, Pair-Out: Latent Multi-Token Prediction for Efficient LLMs

arXiv cs.AI · Wenhui Tan, Minghao Li, Xiaoqian Ma, Siqi Fan · 2026-05-26

Pair-In, Pair-Out (PIPO) introduces a unified approach to reduce inference costs in large language models by integrating latent compression and multi-token prediction (MTP). PIPO employs a latent compressor to fold two input tokens into one representation and an MTP head to unfold one hidden state into an additional output token, eliminating the need for a costly verifier pass via a lightweight confidence head trained with On-Policy Distillation. Experiments on benchmarks including AIME 2025 and LongBench v2 demonstrate PIPO’s effectiveness, achieving up to +7.15 points in pass@4 and 2.64× first-token-latency and 2.07× per-token-latency speedups.

latent compression · multi-token prediction · on-policy distillation · speculative decoding · confidence head

LUCoS: Latent Unsupervised Context Selection for Tabular Foundation Models

arXiv cs.AI · Oroel Ipas, Guillermo Gomez-Trenado, Rocío Romero-Zaliz, Isaac Triguero · 2026-05-26

LUCoS introduces latent unsupervised context selection for tabular foundation models (TFMs), addressing the cold-start problem where labeled instances are unavailable. The method leverages embeddings from an unsupervised Prior-Fitted Network (PFN) to replace raw-feature geometry with latent-space geometry, selecting representative medoids as context. On 67 OpenML-CC18 datasets, LUCoS outperforms baselines in mean AUC, ACC, and F1 across six low-label budgets, with gains attributed to coverage enforcement at small budgets and latent-space representativeness at larger budgets.

tabular foundation models · context selection · latent geometry · prior-fitted network · unsupervised learning

Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering

arXiv cs.AI · Hunter McNichols, Alexander Scarlatos, Mihai Dascalu, Danielle McNamara · 2026-05-26

The Gumbel Machine introduces a modular framework for generating counterfactual student writing by steering LLM outputs toward reference texts. The key innovation is β-Hindsight control, a decoding algorithm that modulates latent randomness via Gumbel noise to balance rubric adherence and text similarity. Evaluations on student writing datasets show the method produces counterfactuals that simultaneously satisfy grading criteria and preserve stylistic proximity to original submissions.

counterfactual generation · gumbel noise · controlled decoding · instruction-following llms · hindsight control
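
For readers unfamiliar with Gumbel noise in decoding, the sketch below shows the standard Gumbel-max trick: adding Gumbel(0, 1) noise to log-probabilities and taking the argmax is equivalent to sampling from the distribution, and fixing the noise makes the choice reproducible. This is background only and does not implement the paper's β-Hindsight control; the distributions, seed, and noise values are illustrative assumptions.

```python
import math
import random

def gumbel_noise(rng: random.Random, n: int):
    """Draw n independent Gumbel(0, 1) samples via -log(-log(U))."""
    # `or 1e-12` guards against the rare U == 0.0 draw.
    return [-math.log(-math.log(rng.random() or 1e-12)) for _ in range(n)]

def gumbel_max_sample(log_probs, noise):
    """Return the index maximizing log-probability plus noise."""
    scores = [lp + g for lp, g in zip(log_probs, noise)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical next-token distributions under two different prompts.
factual = [math.log(p) for p in (0.1, 0.7, 0.2)]
counterfactual = [math.log(p) for p in (0.6, 0.3, 0.1)]

# Reusing one noise vector holds the "randomness" fixed across both prompts.
noise = gumbel_noise(random.Random(0), 3)
print(gumbel_max_sample(factual, noise), gumbel_max_sample(counterfactual, noise))

# With zero noise the choice is greedy; a large noise term can override it.
print(gumbel_max_sample(factual, [0.0, 0.0, 0.0]))
print(gumbel_max_sample(factual, [5.0, 0.0, 0.0]))
```

Holding the noise fixed while changing the distribution is the usual way such noise is used for counterfactual generation, since any change in output is then attributable to the changed conditioning rather than to fresh randomness.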

Many Logics, One Methodology: A Plea for Logical Pluralism in Formalised Reasoning (preprint)

arXiv cs.AI · Christoph Benzmüller, Daniel Kirchner, Luca Pasetto · 2026-05-26

The paper advocates for logical pluralism in formalized reasoning through the LogiKEy methodology, which supports diverse logic embeddings within a classical higher-order logic (HOL) framework. It reviews two decades of research on shallow embeddings of non-classical logics in HOL, emphasizing computational metaphysics as a grounding argument. The authors caution against logical imperialism, arguing that rigid adherence to a single foundational logic hinders interdisciplinary reuse. LogiKEy's meta-logical framework enables principled support for multiple object-logics, promoting flexibility in modern proof assistants and large-scale theory developments.

logical pluralism · higher-order logic · shallow embeddings · computational metaphysics · proof assistants

Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation

arXiv cs.AI · Juan Cruz-Benito, Ismael Faro · 2026-05-26

The study adapts Microsoft's QuantumKatas curriculum to Qiskit, creating a benchmark with 350 quantum computing tasks for LLM evaluation. It includes natural language prompts, solutions, and test verification, covering gates, algorithms (Grover's, Simon's), error correction, and quantum games. Evaluating 16 LLMs across 7 prompting configurations (39,200 runs), results show capability differentiation (32.3%-83.1% pass rates), strong algorithm implementation (82.1% SimonsAlgorithm), but weak problem encoding (34.4% SolveSATWithGrover). Chain-of-thought prompting benefits reasoning-tuned models but degrades others (56.3% mean).

quantumkatas · qiskit · llm · grover's algorithm · simon's algorithm

Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments

arXiv cs.AI · Yuxin Chen, Xiaodong Cai, Junfeng Fang, Zhuowen Han · 2026-05-26

The authors propose NoisyAgent, a training framework to enhance the robustness of large language model (LLM) agents in stochastic real-world environments. The method introduces two types of interaction noise—user noise (ambiguity in user input) and tool noise (anomalies in tool execution)—into the training pipeline. Noise is applied progressively to a subset of rollouts to stabilize training while increasing difficulty. Experiments demonstrate improved agent robustness in noisy environments, with additional performance gains on idealized benchmarks, suggesting that controlled noise exposure promotes generalizable reasoning and decision-making behaviors.

large language models · agent robustness · interaction noise · training pipeline · generalizable reasoning

TWIST: Closed-Loop token Synchronization for Application-Aware Wireless Digital Twins

arXiv cs.AI · Sige Liu, Kezhi Wang · 2026-05-26

TWIST introduces a closed-loop token synchronization framework for application-aware wireless digital twins, optimizing semantic state transfer over resource-constrained links. The method represents physical observations as tokens, grouped by task relevance and protected via mode-conditioned unequal error protection (low/medium/high modes), with erasure recovery via a completion model. Experiments on road-scene twins demonstrate improved traffic-state inference (12.4% accuracy gain) and semantic synchronization (23.7% reduction in drift) versus fixed-mode baselines, while cutting average synchronization cost by 18.3% compared to always-high transmission.

token synchronization · unequal error protection · semantic drift · digital twins · completion model

Generative Animations: A Multi-Model Pipeline for Prompt-Driven Motion Synthesis

arXiv cs.AI · Mannat Khurana, Sanyam Jain, Rishav Agarwal · 2026-05-26

The authors introduce Generative Animations, a system for synthesizing production-ready animations from natural language prompts. The pipeline chains Large Language Models (LLMs) for semantic parsing with the Segment Anything Model (SAM) for visual grounding, enabling automatic generation of motion paths that respect scene geometry, depth-based occlusions, and 3D perspective transforms. The system supports contour-following trajectories, orbital animations with z-order awareness, and perspective-aligned motion on transformed objects. Three use cases demonstrate its capability to streamline animation creation by eliminating manual Bézier point plotting and timing configuration.

generative animations · semantic parsing · visual grounding · motion paths · perspective transforms

Learning When to Think While Listening in Large Audio-Language Models

arXiv cs.AI · Zhiyuan Song, Weici Zhao, Yang Xiao, Suhao Yu · 2026-05-26

The paper proposes a learnable wait-think-answer controller for Large Audio-Language Models (LALMs) to optimize reasoning quality and responsiveness in streaming spoken interaction. The controller, trained using supervised fine-tuning (SFT) and Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO), decides when to wait, externalize reasoning updates, or answer based on partial audio evidence. Evaluated on a six-task synthetic spoken reasoning question answering benchmark, the six-reward DAPO controller improves row-weighted accuracy from 67.6% to 70.3% while reducing post-endpoint final-think length by 14%. On a human-recorded Real Audio Bench, the controller remains functional, with SFT achieving the strongest accuracy and DAPO reducing final-think length below that of the base model.

large audio-language models · dynamic sampling policy optimization · supervised fine-tuning · spoken reasoning · streaming interaction

FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation

arXiv cs.AI · Zihui Zhang, Zhixuan Sun, Yafei Yang, Jinxi Li · 2026-05-26

FoundObj introduces a label-free 3D object segmentation framework leveraging self-supervised 2D/3D foundation models as rewards. The method employs a superpoint-based discovery agent that iteratively merges neighboring superpoints, guided by semantic and geometric reward modules derived from foundation model priors. Evaluated on diverse benchmarks, FoundObj outperforms baselines, demonstrating strong zero-shot and long-tail generalization without scene-level human annotations.

3d object segmentation · self-supervised learning · foundation models · superpoint merging · zero-shot generalization

The Compressive Knowledge Graph Hypothesis: Which Graph Facts Matter for Scientific Hypothesis Generation?

arXiv cs.AI · Shashwat Sourav, Viktoriia Baibakova, Sanjay Das, Ran Elgedawy · 2026-05-26

The study proposes the Compressive Knowledge Graph Hypothesis, demonstrating that compact subgraphs often suffice for KG-guided scientific hypothesis generation. Using Mistral-7B, Llama-3.1-70B, and Gemini 2.5 Flash, researchers perturbed KG density, ontology, topology, and control structure while evaluating outputs with graph-aware and reference metrics. Results show KG utility is model-dependent and selective: top-k subgraphs approximate full-KG behavior, with redundancy allowing random or topology-based subsets to recover significant signal, suggesting scientifically structured subgraphs often capture essential KG content.

knowledge graphs · hypothesis generation · subgraph compression · ontology richness · topological perturbation

An investigation of AI integration in sound designer workflows and experiences

arXiv cs.AI · Nelly Garcia, Joshua Reiss · 2026-05-26

This study investigates the integration of AI tools in sound design workflows through a mixed-methods approach, including a survey of 76 practitioners and semi-structured interviews with 20 industry professionals. Descriptive statistical and thematic analyses identified five key themes: Context, Workflow, Potential, Risks, and Right Use. Findings indicate that current AI tools are effective in fast-consumption media but lack narrative sophistication for high-end sound design. Practitioners prefer task-specific assistive applications, particularly in audio restoration and library management, over end-to-end generative systems. The study provides recommendations for developing more informed AI tools tailored to sound designers' needs.

sound design · audio restoration · generative systems · thematic analysis · mixed-methods

Grounding Text Embeddings in Stakeholder Associations

arXiv cs.AI · Jonathan Rystrøm, Sofie Burgos-Thorsen, Zihao Fu, Johan Irving Søltoft · 2026-05-26

The paper introduces the Stakeholder Grounding Exercise, a method that collects explicit expert associations to ground neural text embeddings in human understanding and to address semantic misalignment in embedding models. In a case study on Danish policy issues, neural embeddings showed a 19-26 percentage point reliability gap compared to human experts, with downstream clustering performance strongly correlated (Spearman ρ=0.9) with expert rankings. A replication study on US Federal AI use cases confirmed a 16 percentage point gap in English, demonstrating the method's generalizability across domains and instruments.

text embeddings · semantic alignment · stakeholder grounding · clustering performance · expert associations
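
The Spearman figure in the summary above is a rank correlation. As a minimal sketch of how such a value is computed, the toy rankings below are invented for illustration and are not the paper's data; `spearman_rho` is a hypothetical helper that uses the standard no-ties formula ρ = 1 − 6Σd²/(n(n²−1)).

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings that contain no ties."""
    n = len(rank_a)
    # Sum of squared rank differences between the two orderings.
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Toy data (assumed): five items ranked once by experts, once by clustering score.
expert_rank = [1, 2, 3, 4, 5]
clustering_rank = [1, 3, 2, 4, 5]
rho = spearman_rho(expert_rank, clustering_rank)
print(f"Spearman rho = {rho:.2f}")
```

A single swapped pair among five items already lowers ρ noticeably, which is why values near 0.9 are read as strong agreement on small rankings.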

Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering

arXiv cs.AI · Mateusz Czyżnikiewicz, Ryszard Tuora, Adam Kozakiewicz, Tomasz Ziętkiewicz · 2026-05-26

The paper introduces DualGraph, a Retrieval-Augmented Generation (RAG) framework for semi-structured question answering, combining semantic retrieval via a Textual Knowledge Graph and symbolic querying via a Symbolic Knowledge Graph. The method addresses limitations of purely semantic or symbolic approaches by dynamically selecting or combining evidence from both representations. Evaluated on SpecsQA, a benchmark of semi-structured product questions, DualGraph outperforms dense-retrieval, GraphRAG, symbolic, and table-oriented baselines across question types.

retrieval-augmented generation · semi-structured data · knowledge graph · symbolic querying · dense retrieval

Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs

arXiv cs.AI · Zhe Yu, Wenpeng Xing, Chen Ye, Xuyang Teng · 2026-05-26

This work identifies a monitoring-control gap in retrieval-augmented LLMs, where models detect epistemic conflicts in accumulated evidence but fail to resolve them safely, challenging the assumption that single-turn robustness predicts multi-turn safety. Through a multi-turn document accumulation protocol evaluating four model families (1.5B-32B parameters) across 50,000+ turn-level assessments, combined with hidden-state probing, attention analysis, and response-strategy taxonomy, the study demonstrates that contradiction acknowledgement is uncorrelated with safe resolution. Results show that danger-relevant information is internally represented and receives enhanced attention during unsafe generation, yet fails to constrain output behavior, highlighting the need for improved action selection mechanisms in high-stakes settings.

monitoring-control gap · retrieval-augmented llms · epistemic conflict · action selection · hidden-state probing

LitSeg: Narrative-Aware Document Segmentation for Literary RAG

arXiv cs.AI · Ruikang Zhang, Zhanni Chen, Yiqiao Cai, Qi Su · 2026-05-26

LitSeg introduces a narrative-theory-guided framework for document segmentation in Retrieval-Augmented Generation (RAG), addressing the semantic blindness of existing methods in literary contexts. The framework employs multi-stage prompting to extract events, untangle narrative threads, clarify structures, and locate turning points for segmentation. A lightweight variant, LitSeg-Lite, distills this process into a single inference pass via two-stage training. Experiments show that LitSeg significantly improves retrieval accuracy, context relevance, and downstream QA performance, with ablation studies confirming the efficacy of narratological guidance and data distillation.

retrieval-augmented generation · document segmentation · narrative theory · multi-stage prompting · data distillation

Semantic Robustness Probing via Inpainting: An Interactive Tool for Safety-Critical Object Detection

arXiv cs.AI · Nico Steckhan, Krutarth Prajapati, Weija Shao, Silvia Vock · 2026-05-26

The paper introduces SemProbe, a tool for semantic robustness probing in safety-critical object detection. It enables users to upload images, create masks, select operational design domain-derived factors, and perform diffusion-based controlled inpainting. The system supports batch jobs, parallel seed/workflow variations, and configurable generation parameters, with automatic model inference and annotated before/after comparisons. Probes are logged as structured artifacts for traceable robustness evidence. The tool is demonstrated on hand detection for dimension saws, targeting insurance-oriented test criteria.

semantic robustness · object detection · diffusion-based inpainting · safety-critical domains · operational design domain

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

arXiv cs.AI · Yuxin Chen, Yi Zhang, Zhengzhou Cai, Yaorui Shi · 2026-05-26

VitaBench 2.0 introduces a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions, addressing the gap in existing benchmarks that overlook user preference inference. Tasks are organized as temporally ordered sequences with embedded user preferences, requiring agents to continuously extract, utilize, and update preferences from fragmented interactions. Proactiveness is evaluated through tasks necessitating recognition and acquisition of missing information. An extensible memory interface supports controlled comparison across memory architectures. Benchmarking state-of-the-art LLMs reveals significant challenges in real-world personalization, highlighting failure modes and capability bottlenecks in personalized decision-making.

personalized agents · proactive interaction · memory architecture · user preference inference · long-term interactions

StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

arXiv cs.AI · Yanfei Zhang, Xu Lin, Chenglin Wu · 2026-05-26

StepOPSD introduces step-aware online preference distillation for RL agents, addressing credit-assignment mismatches by decomposing trajectories into action-centered step segments and redistributing credit via hindsight-enriched teacher contexts. The method converts token-level log-probability gaps into sign-preserving advantage shaping with normalized per-step budgets before GRPO updates. Evaluated on ALFWorld and Search-QA with Qwen models, StepOPSD achieves top performance on local-causal-error-sensitive subsets (e.g., 95.0% on PickTwo) and reveals a two-knob law: α_clip stabilizes locally, while λ_mix varies globally.

online preference distillation · credit assignment · advantage shaping · grpo update · hindsight-enriched contexts

ICCU: In-Context Continual Unlearning via Pattern-Induced Refusal Rules

arXiv cs.AI · Ruihao Pan, Suhang Wang · 2026-05-26

The paper introduces ICCU (In-Context Continual Unlearning), a framework for sequential machine unlearning in language models. ICCU induces refusal rules from unlearning datasets and applies them contextually during inference, avoiding parameter updates. This approach eliminates cross-request interference, supports compositional rule accumulation, and discards original forget-set data post-induction. Experiments demonstrate ICCU's effectiveness in suppressing target knowledge, maintaining utility, and scaling across sequential requests while handling paraphrased and cross-lingual queries robustly.

machine unlearning · in-context learning · refusal rules · sequential requests · utility preservation

Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

arXiv cs.AI · Heng Qu, Yike Liu, Renren Jin, Wenzong Zhang · 2026-05-26

The paper introduces HyperTrack, a dataset of 16,000+ real-world tasks across 650+ Chinese mobile apps, and GUIEvalKit, an open-source benchmarking toolkit for vision-language models (VLMs) in mobile GUI navigation. It analyzes data scaling effects via supervised and reinforcement-based finetuning, finding reinforcement learning outperforms supervised methods, especially in out-of-domain settings. Benchmarking SOTA VLMs with GUIEvalKit reveals the impact of interaction history and reasoning on task completion.

vision-language models · mobile gui navigation · reinforcement finetuning · out-of-domain generalization · interaction history

Deep-layer limit and stability analysis of the basic forward-backward-splitting induced network (II): learning problems

arXiv cs.AI · Xuan Lin, Chunlin Wu · 2026-05-26

The paper establishes theoretical convergence properties for learning problems in forward-backward-splitting (FBS)-induced networks, derived from iterative optimization schemes. Using difference/differential inclusion formulations, the authors prove that optimal learning parameters for the basic FBS-induced network Γ-converge to solutions of the deep-layer limit system under mild assumptions. A qualitative perturbation stability analysis is provided, supported by numerical validation. Results imply that cluster points of network parameters solve the limit system's learning problem.

forward-backward-splitting · deep unfolding · γ-convergence · differential inclusion · perturbation stability

DEI: Diversity in Evolutionary Inference for Quality-Diversity Search

arXiv cs.AI · John Donaghy, Shikhar Rastogi · 2026-05-26

DEI (Diversity in Evolutionary Inference) introduces a distributed Quality-Diversity (QD) search framework leveraging heterogeneous large language models (LLMs) as mutation operators across peer nodes with non-blocking collective operations. Unlike homogeneous parallel search, DEI exploits each LLM's distinct creative prior to enhance behavioral novelty, extending the Digital Red Queen framework by sharing local optimal solutions between rounds to drive cross-model adversarial pressure. Evaluated on the Core War domain, a four-node heterogeneous ensemble (GPT-5.4-mini, Claude Sonnet 4.6, GPT-5.2, Claude Haiku 4.5) achieves a 124% higher QD-Score (45.90 vs. 20.46) and 28% greater coverage (80.6% vs. 63.0%) compared to a single-node baseline, demonstrating model diversity as a key driver in distributed LLM-based QD search.

quality-diversity search · heterogeneous llms · digital red queen · non-blocking collective · core war
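
The percentages in the DEI summary above are relative gains over the single-node baseline, not percentage-point differences. A short check, assuming only the four raw values quoted in the summary (`relative_gain` is a hypothetical helper name):

```python
def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


# Values quoted in the summary: ensemble vs. single-node baseline.
qd_gain = relative_gain(45.90, 20.46)       # QD-Score
coverage_gain = relative_gain(80.6, 63.0)   # coverage, itself reported in percent
print(f"QD-Score gain: {qd_gain:.1f}%  coverage gain: {coverage_gain:.1f}%")
```

In absolute terms, coverage rises by 17.6 percentage points; the 28% figure is that difference relative to the 63.0% baseline.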

Beyond the Data Mesh Illusion: Designing Modern AI-augmented Lakehouses to Bridge the Gap Between Theory and Practice

arXiv cs.AI · Oliver Angélil, Jan Migon · 2026-05-26

The paper proposes an AI-augmented hub-and-spoke model for enterprise data platforms, combining centralized governance with domain autonomy. A Center of Excellence hub provides automated policy enforcement, quality rule generation, and data contract drafting using LLMs, while domain spokes retain semantic ownership. The architecture leverages modern lakehouse infrastructure and natural-language interfaces to democratize data access. Evaluation focuses on three business-aligned metrics: data product adoption, time-to-find, and time-to-insight, demonstrating measurable operational improvements over pure data mesh implementations.

data mesh · lakehouse architecture · llm automation · governance policies · domain ownership

Position: AI Safety Requires Effective Controllability

arXiv cs.AI · Yige Li, Yunhao Feng, Jun Sun · 2026-05-26

The paper argues that AI safety must prioritize controllability—defined as persistent interruptibility, override capability, and constraint adherence during runtime—alongside alignment. It introduces ControlBench, a benchmark for evaluating controllability failures in high-risk agentic scenarios, and tests OpenClaw-based agents under adversarial inputs and long-horizon tasks. Results show current alignment methods inadequately enforce authoritative runtime control, prompting a proposed architectural framework with explicit control planes, intervention pathways, and auditable interfaces.

controllability · alignment · runtime control · agentic ai · safety benchmarks

Counteraction-Aware Multi-Teacher On-Policy Distillation for General Capability Recovery with Domain Preservation

arXiv cs.AI · Tianlei Chen, Jiao Ou, Ziyuan Liu, Ruiming Tang · 2026-05-26

The paper proposes Counteraction-Aware Multi-Teacher On-Policy Distillation (CaMOPD) to address general capability degradation in domain-specialized LLMs. The method employs decoupled alternating training and gap-based sample selection to mitigate recovery-preservation counteraction and weak-signal flattening in Multi-Teacher On-Policy Distillation (MOPD) pipelines. Evaluations on role-play dialogue and medical QA show CaMOPD outperforms baselines in general capability recovery while preserving domain-specific behavior, with gradient coherence analyses validating improved correction signal quality.

multi-teacher distillation · capability recovery · domain preservation · on-policy learning · log-probability gap

High-Quality Synthetic Financial Time-Series using a GAN-Diffusion Framework

arXiv cs.AI · Giuseppe Masi, Andrea Coletta, Novella Bartolini · 2026-05-26

The paper introduces a hybrid GAN-diffusion framework for generating high-fidelity synthetic financial time-series. The method combines CoMeTS-GAN (a conditional GAN for joint mid-price/volume generation) with diffusion models, using the GAN's critic as a quality module to guide correlation structure learning. Experiments demonstrate superior performance over baseline architectures in capturing stylized facts and inter-asset correlations.

conditional gan · diffusion models · synthetic time-series · stylized facts · inter-asset correlation

Can Broad Biomedical Knowledge be Contextualized into Scenario-Grounded Propositions?

arXiv cs.AI · Qingyuan Zeng, Ziyang Chen, Pengxiang Cai, Zixin Guan · 2026-05-26

SCENE, a bi-level multi-agent framework, addresses knowledge contextualization by transforming broad biomedical knowledge into scenario-grounded propositions. The upper level converts general knowledge into search directions grounded in dataset schemas, while the lower level executes these via multi-objective optimization to identify propositions balancing evidential strength and data support. Iterative feedback refines the search. Evaluated in clinical trials and LINCS L1000 studies, SCENE outperforms baselines by discovering specific patient subgroups with heterogeneous treatment benefits and identifying perturbational contexts with strong target-response matching and high positive rates. SCENE bridges broad knowledge and scenario-specific evidence, producing traceable hypotheses for validation.

knowledge contextualization · multi-agent framework · multi-objective optimization · dataset schema · perturbational contexts

ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference

arXiv cs.AI · Xiongwei Zhu, Xiaojian Liao, Tianyang Jiang, Yusen Zhang · 2026-05-26

ReMoE introduces router fine-tuning to enhance expert reuse in memory-constrained MoE LLM inference, reducing I/O overhead from expert fetches. By biasing the router toward recently selected experts, it achieves temporally stable routing aligned with cache locality, without added inference-time computation. Evaluations on DeepSeek and Qwen models show 26% higher expert reuse, 8.4% throughput gain under vLLM GPU-CPU offloading, and 43.6-49.8% TPOT reduction (1.77-1.99× decode speedup) on Jetson Orin NX via llama.cpp, while preserving task performance.

mixture-of-experts · router fine-tuning · cache locality · expert reuse · throughput optimization
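
The ReMoE summary above reports the same result two ways: as a reduction in time per output token (TPOT) and as a decode speedup. The two are linked by speedup = 1 / (1 − reduction). A quick check using only the quoted percentages (`speedup_from_reduction` is a hypothetical helper name):

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """Convert a latency reduction (in percent) into a throughput speedup factor."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


low = speedup_from_reduction(43.6)
high = speedup_from_reduction(49.8)
print(f"decode speedup: {low:.2f}x to {high:.2f}x")
```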

Trust Region Q Adjoint Matching

arXiv cs.AI · Yonghoon Dong, Kyungmin Lee, Changyeon Kim, Jaehyuk Kim · 2026-05-26

The paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm for pretrained flow policies. TRQAM addresses critic-guided improvement fragility in Q-learning with Adjoint Matching (QAM) by adaptively controlling path-space KL divergence through projected dual descent, optimizing the trust-region parameter λ in stochastic optimal control dynamics. Theoretical analysis shows path-space KL can be represented as a closed-form function of λ. Experiments on 50 OGBench tasks demonstrate TRQAM's superiority, achieving 68% success rate in offline RL versus 46% for baselines.

off-policy reinforcement learning · trust region optimization · path-space kl divergence · stochastic optimal control · pretrained flow policies

Two Speeds of Learning: A Representation-Readout Decomposition of Grokking and Double Descent

arXiv cs.AI · Chi-Ning Chou, Oscar Uzdelewicz, Neng-Chun Chiu, Yao-Yuan Yang · 2026-05-26

The study introduces a representation-readout decomposition framework to analyze grokking and epoch-wise double descent in deep neural networks. Using representational geometry, neural tangent kernels, and linear probing, the authors identify two competing processes: representation learning in the encoder and readout calibration in the final classifier. Results reveal that grokking arises from train-biased readout before onset and gradual representation learning, contrasting the lazy-to-rich account. The framework distinguishes spurious from genuine generalization, attributing delayed or non-monotone dynamics to representation degradation and readout misalignment induced by non-standard training recipes.

representation-readout decomposition · grokking · epoch-wise double descent · neural tangent kernels · linear probing

E3: Issue-Level Backtesting for Automated Research Critique

arXiv cs.AI · Yashwardhan Chaudhuri, Sanyam Jain, Paridhi Mundra · 2026-05-26

E3 introduces an automated review assistant that identifies technical concerns in research papers, including unsupported claims, missing ablations, and leakage risks, while providing their nature, location, and resolution evidence. Evaluated through an issue-level backtesting protocol on 100 ICLR 2026 papers and 4598 judged issue rows, E3 outperforms human reviews and two LLM baselines (GPT-5.4 and Claude-Opus-4-6) in recall metrics, achieving 90.2% partial-inclusive recall and 65.8% strict recall. E3 recovers 89.6% of human-raised concerns and surfaces 1635 additional missed concerns, significantly above other sources.

automated review assistant · issue-level backtesting · unsupported claims · missing ablations · leakage risks

Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry

arXiv cs.AI · Changqing Su, Yu Ding, Zuhong Lin, Hongyu Liu · 2026-05-26

The authors present Chat-ISV, a knowledge graph (KG) enhanced multi-agent Q&A system for steel-industry volatile organic compounds (VOCs) governance. The system constructs a Neo4j KG (27,180 nodes, 81,779 edges) from literature, employing prompt-constrained extraction, topology optimization (reducing isolated nodes from 57% to 4.08%), multi-agent routing, and source-backtracking retrieval. Benchmarking shows 96.93% precision, 72.63% recall (F1=0.830), and 1.69/2.00 mean expert score, demonstrating reliable decision support via traceable KG reasoning and LLM integration for specialized industrial domains.

knowledge graph · volatile organic compounds · multi-agent system · topology optimization · source-backtracking retrieval
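
The F1 value in the Chat-ISV summary above follows from the quoted precision and recall as their harmonic mean. A short check that assumes only those two figures (`f1_score` is a hypothetical helper name, not a library call):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.9693, 0.7263)
print(f"F1 = {f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller input, the lower recall holds F1 well below the 96.93% precision.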

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

arXiv cs.AI · Ye Yuan, Rui Song, Weien Li, Zeyu Li · 2026-05-26

QUACK introduces an open-source multimodal environment and evaluation framework for auditing language grounding in social deduction LLM agents, addressing limitations of text-only game outcomes. It employs a Statement Verification Pipeline that reconstructs agent trajectories from engine logs to verify utterance-level consistency against ground truth, detecting spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluations of three frontier VLMs reveal a 15.1% hallucination rate in verifiable spatial claims and over 50% ungrounded accusations, even in top-performing agents. The framework includes the full engine, an evaluation toolkit, and logs for reproducibility.

multimodal · social deduction · hallucination · utterance-level consistency · statement verification

ConVer: Using Contracts and Loop Invariant Synthesis for Scalable Formal Software Verification

arXiv cs.AI · Muhammad A. A. Pirzada, Weiqi Wang, Yiannis Charalambous, Konstantin Korovin · 2026-05-26

ConVer introduces a compositional verification tool for C programs that mitigates state-space explosion via top-down decomposition. The method combines LLM-synthesized function contracts with a CEGAR-CEGIS loop, refining contracts via SMART ICE learning when checks fail. Evaluated on four benchmarks (Frama-C, X.509 parser, LF2C-Simple, VerifyThis), ConVer achieves 33-96% verification success depending on difficulty, with 93-95% of converged Frama-C cases requiring single CEGAR-CEGIS iterations. ESBMC-LF extends verification to LF models by transpiling them to C, enabling ConVer to verify 67% of LF-Hard benchmarks.

compositional verification · cegar-cegis loop · function contracts · state-space explosion · smart ice learning

BatteryMFormer: Multi-level Learning for Battery Degradation Trajectory Forecasting

arXiv cs.AI · Ruifeng Tan, Jintao Dong, Weixiang Hong, Jia Li · 2026-05-26

BatteryMFormer introduces a multi-level Transformer for early battery degradation trajectory forecasting (BDTF), addressing two key data characteristics: multi-level structure (shared aging-condition regularities and cross-battery trajectory patterns) and SOC-localized variations. The method combines (1) an aging-condition-aware decoder with condition-informed queries and attention, (2) a meta degradation pattern memory for trajectory prototype retrieval, and (3) a dual-view encoder capturing temporal dynamics and SOC-localized variations. Experiments across four battery domains demonstrate consistent superiority over state-of-the-art baselines, advancing reliable BDTF.

battery degradation trajectory forecasting · multi-level transformer · state-of-charge localization · meta degradation memory · aging-condition-aware attention

Lessons from Penetration Tests on Large-Scale Agent Systems

arXiv cs.AI · Kevin Eykholt, Dhilung Kirat, Xiaokui Shu, Jiyong Jang · 2026-05-26

The study evaluates security vulnerabilities in proprietary AI agent systems through two penetration tests conducted in 2025, contrasting them with prior findings in open-source systems. Researchers analyze whether stricter development standards in proprietary systems mitigate recurring security weaknesses observed in autonomous, execution-capable AI agents. Initial results suggest persistent cross-layer vulnerabilities despite formal review processes, highlighting the challenges of securing complex, self-modifying agent behaviors.

penetration tests · proprietary agent systems · cross-layer vulnerabilities · execution-capable ai · self-modifying programs

Tracing Computation Density in LLMs

arXiv cs.AI · Corentin Kervadec, Iuliia Lysova, Iuri Macocco, Marco Baroni · 2026-05-26

The paper introduces s-Trace, a method to estimate optimal subgraphs of size s for approximating full model outputs in transformer-based LLMs. Analysis reveals computation organized in two phases: early-layer nodes form a sparse core generating rough predictions, while later-layer nodes (primarily attention heads) incrementally refine outputs. Findings indicate computation density correlates with model uncertainty, and sparse subgraphs encode shallow statistics like unigram frequency. Results demonstrate consistent modular organization in LLM computation, with early sparse processing followed by denser refinement layers.

s-trace · subgraph estimation · computation density · attention heads · modular organization

Less is More: Early Stopping Rollout for On-Policy Distillation

arXiv cs.AI · Zhou Ziheng, Jiaqi Li, Huacong Tang, Ying Nian Wu · 2026-05-26

The paper introduces Early Stopping Rollout (ESR), a distillation strategy addressing Off-policy Teacher Decay in on-policy distillation, where teacher scoring degrades for later tokens due to off-policy student trajectories. ESR restricts rollouts to initial response tokens, improving performance across model sizes, families, and tasks while enhancing GPU efficiency and training stability. Empirical results demonstrate ESR's superiority over full-rollout methods, with analysis revealing Cascading Alignment and Sub-mode Commitment effects. The approach cannot be fully explained by KL divergence or entropy metrics.

on-policy distillation · off-policy teacher decay · early stopping rollout · cascading alignment · sub-mode commitment

Boosting Knowledge Graph Foundation Models via Enhanced Negative Sampling

arXiv cs.AI · Yinan Liu, Wenjin Xu, Zhiyuan Zha, Xiaochun Yang · 2026-05-26

The paper proposes KMAS, an adaptive negative sampling method that enhances knowledge graph foundation models (KGFMs) by generating hard negative triples from relation embeddings produced by the KGFM's encoder. The method dynamically adjusts the hard-negative ratio during training (a linear increase after warmup, followed by a decrease) to align with model evolution. Evaluated on 44 datasets, KMAS improves state-of-the-art KGFMs without significant computational overhead.

knowledge graph completion · negative sampling · relation embeddings · zero-shot learning · foundation models
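
The KMAS summary above describes a hard-negative ratio that rises linearly after warmup and then decreases. A minimal sketch of one such schedule follows. The summary gives no concrete values, so the function name, the zero ratio during warmup, the linear shape of the decay, and all step counts and the peak ratio are assumptions for illustration only.

```python
def hard_negative_ratio(step, warmup_steps, peak_step, total_steps, max_ratio=0.5):
    """Fraction of negatives drawn as hard negatives at a given training step.

    Assumed shape: 0 during warmup, linear ramp up to `max_ratio` at
    `peak_step`, then linear decay back to 0 at `total_steps`.
    """
    if step < warmup_steps:
        return 0.0
    if step <= peak_step:
        return max_ratio * (step - warmup_steps) / (peak_step - warmup_steps)
    return max_ratio * max(0.0, (total_steps - step) / (total_steps - peak_step))


# Hypothetical configuration: 100 warmup steps, peak at step 600, 1000 steps total.
schedule = [hard_negative_ratio(s, 100, 600, 1000) for s in (0, 100, 350, 600, 800, 1000)]
print(schedule)
```

Keeping the ratio at zero during warmup reflects the usual motivation for such schedules: hard negatives derived from the model's own embeddings are of little use before those embeddings carry signal.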

ORCA: An End-to-End Interactive Copilot for Optimized Root Cause Analysis

arXiv cs.AI · Phi Nguyen Xuan, Nicholas Tagliapietra, Lavdim Halilaj, Kristian Kersting · 2026-05-26

ORCA introduces an end-to-end interactive copilot for optimized root cause analysis, addressing accessibility gaps in causal methods for domain experts. The system orchestrates agents to guide users through customizable workflows encompassing causal discovery, effect estimation, explainability, and RCA. It demonstrates effectiveness across real-world use cases by automating performance evaluation, metric generation, and insight reporting while supporting both automatic and user-guided execution modes.

causal discovery · effect estimation · root-cause-analysis · interactive copilot · orchestration agents

Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models

arXiv cs.AI · Tao Qi, Huili Wang, Yuanhong Huang, Wendan Wang · 2026-05-26

The paper proposes SD-MIA, a black-box membership inference attack framework for detecting pre-training data usage in diffusion models. Unlike prior methods relying on denoising performance or internal features, SD-MIA analyzes cross-modal perturbations—comparing how a target image and its perturbed textual instructions are denoised—to extract distinctive membership signals. Evaluated on public and newly constructed datasets with matched membership/non-membership distributions, SD-MIA outperforms existing baselines (including white-box approaches) in identifying pre-training data.

membership inference attack · diffusion models · black-box attack · pre-training data · cross-modal perturbation

Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination

arXiv cs.AI · Yedidia Agnimo, Anna Korba, Annabelle Blangero, Nicolas Chesneau · 2026-05-26

The study systematically evaluates the association between uncertainty estimators (UEs) and hallucinations in large language models (LLMs), challenging their assumed role as reliable proxies. It examines diverse UEs—information-theoretic, sampling-based, and reflexive—across intrinsic (input faithfulness) and extrinsic (training data alignment) hallucination types using benchmarks like RAGTruth and HalluLens. Results reveal weak and variable correlations, contingent on hallucination type and LLM, undermining uncertainty's direct utility for hallucination detection.

uncertainty estimation · llm hallucination · information-theoretic · sampling-based · faithfulness

ReasonOps: A Unified Operational Paradigm for Trustworthy Verified LLM Reasoning

arXiv cs.AI · Adnan Rashid · 2026-05-26

The paper introduces ReasonOps, a unified operational paradigm for trustworthy verified reasoning in Large Language Models (LLMs), addressing fragmentation across formal verification, runtime assurance, and neuro-symbolic reasoning. The method integrates semantic interpretation, autoformalization, symbolic reasoning, theorem proving, and probabilistic reliability estimation into a continuous reasoning lifecycle, inspired by DevOps and MLOps. Demonstrated via an autonomous braking system analysis, ReasonOps proposes a foundation for safety-critical autonomous AI systems by ensuring monitored, verifiable reasoning processes.

reasonops · autoformalization · neuro-symbolic · runtime assurance · theorem proving

Generating Robust Portfolios of Optimization Models using Large Language Models

arXiv cs.AI · Eleni Straitouri, Cheol Woo Kim, Milind Tambe · 2026-05-26

The paper proposes an algorithm for generating robust portfolios of optimization models using large language models (LLMs), addressing the unreliability of single LLM-generated models. The method uses LLMs in two roles (as stochastic generators and as reasoning evaluators) within a unified framework, ensuring portfolio quality if either role aligns with human preferences. Theoretical guarantees show the portfolio contains high-quality candidates, enabling human-in-the-loop decision-making. Empirical validation demonstrates strong performance across diverse optimization modeling tasks.

optimization models · large language models · stochastic generator · reasoning evaluator · human-in-the-loop

Timestep-Aware SVDQuant-GPTQ for W4A4 Quantization of Wan2.2-I2V

arXiv cs.AI · Junhao Wu, Dezhong Yao, Hai Jin · 2026-05-26

The paper introduces a timestep-aware W4A4 quantization framework for Wan2.2-I2V, a Mixture-of-Experts video diffusion Transformer, addressing challenges of activation outliers and timestep-dependent distributions. The method integrates SVDQuant for low-rank outlier compensation, GPTQ for reconstruction-aware weight quantization, and a timestep-bin-wise, per-layer activation clipping-ratio search tailored to each expert. On OpenS2V-Eval, this approach reduces peak GPU memory by 59.3% compared to the BF16 baseline, with minimal degradation (0.9% drop in VBench score, 2.3% drop in Imaging Quality), demonstrating the necessity of expert- and timestep-aware calibration for high-fidelity MoE video DiT inference.

w4a4 quantization · mixture-of-experts · svdquant · gptq · timestep-aware

Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning

arXiv cs.AI · Yilong Li, Suman Banerjee, Tong Che · 2026-05-26

Coordinated Pass@K Policy Optimization (CPPO) improves code generation by jointly exploring diverse algorithmic strategies rather than sampling redundant reasoning paths. CPPO employs a planner to propose K=4 distinct high-level methods and a shared solver to attempt one solution per method, trained with a multiplicative planner reward that credits only valid strategy tuples leading to verifier-confirmed pass@K success. Evaluated on APPS, CodeContests, and LiveCodeBench-v6, CPPO outperforms direct sampling, planning baselines, planner-only SFT, and pass@K-oriented RL with statistical significance in six of nine model-benchmark cells, with the largest gain (+0.16) over PKPO on Qwen3.5-9B with LiveCodeBench-v6.

coordinated pass@k policy optimization · multiplicative planner reward · code generation · algorithmic strategies · verifier-confirmed success
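
One way to read the multiplicative planner reward in the CPPO summary is as a product of two gates: the strategy tuple must be valid, and at least one of the K solutions must pass the verifier. The sketch below follows that reading. The validity rule used here (non-empty, pairwise distinct strategies) and the function name are assumptions, since the summary does not define them.

```python
def planner_reward(strategies, verifier_results):
    """Reward is 1.0 only if the tuple is valid AND some solution passes.

    strategies: list of K strategy descriptions proposed by the planner.
    verifier_results: list of K booleans, one verdict per strategy's solution.
    """
    if len(strategies) != len(verifier_results):
        raise ValueError("expected one verifier result per strategy")
    normalized = [s.strip().lower() for s in strategies]
    # Assumed validity gate: every strategy is non-empty and no two are identical.
    valid = all(normalized) and len(set(normalized)) == len(normalized)
    # pass@K success: at least one of the K attempts is verifier-confirmed.
    solved = any(verifier_results)
    return float(valid) * float(solved)
```

Because the two factors multiply, a planner that repeats one strategy K times earns nothing even when a solution passes, which is the incentive for diversity the summary describes.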

Recon: Reconstruction-Guided Reasoning Synthesis for User Modeling

arXiv cs.AI · Alan Zhu, Mihran Miroyan, Carolyn Wang, Andrew Zhou · 2026-05-26

Recon introduces a reconstruction-guided approach for synthesizing reasoning traces in user modeling, addressing the limitations of post-hoc rationalization in capturing latent causal decision paths. The method scores reasoning traces based on their predictive power: a reconstruction model predicts actions given context and candidate reasoning, with fidelity determining reasoning quality. Evaluated across four domains, Recon achieves a 54.7% win rate over Backward Synthesis and up to 70.0% over baselines when training reasoning synthesis models with Recon-derived rewards. Additionally, Recon-synthesized reasoning transfers across models and enhances user modeling beyond the reconstruction model, demonstrating the insufficiency of post-hoc rationalization.

user modeling · reasoning synthesis · reconstruction model · post-hoc rationalization · latent causal decision paths

Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

arXiv cs.AI · Zixuan Yang, Yiqun Chen, Wei Yang, Erhan Zhang · 2026-05-26

Tournament-GRPO introduces a group-wise reward framework for reinforcement learning in open-ended long-form generation, addressing calibration and discrimination limitations of pointwise LLM-as-a-judge scoring. The method employs multi-round tournaments among same-query rollouts, converting rubric-guided LLM judgments into relative rewards through group comparisons and normalization for GRPO training. Experiments on Deep Research Bench demonstrate a 4.52-point overall-score improvement over baselines, with analyses highlighting favorable effectiveness-efficiency trade-offs and tournament design impacts on training dynamics.

reinforcement learning · long-form generation · llm-as-a-judge · group-wise rewards · tournament comparison
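
The Tournament-GRPO pipeline (pairwise judgments among same-query rollouts, converted to relative rewards and normalized within the group) can be sketched as follows. The random pairing scheme, the win-rate reward, and the `judge` callback are assumptions for illustration; the paper's tournament design and rubric-guided judge are not specified in the summary.

```python
import random
from statistics import mean, pstdev


def tournament_rewards(rollouts, judge, rounds=3, seed=0):
    """Play several rounds of random pairings; reward = fraction of games won.

    judge(a, b) is a hypothetical callback returning True when a is preferred to b.
    """
    rng = random.Random(seed)
    wins = [0] * len(rollouts)
    games = [0] * len(rollouts)
    for _ in range(rounds):
        order = list(range(len(rollouts)))
        rng.shuffle(order)
        # Pair neighbours in the shuffled order; with an odd count one rollout sits out.
        for i, j in zip(order[::2], order[1::2]):
            winner = i if judge(rollouts[i], rollouts[j]) else j
            wins[winner] += 1
            games[i] += 1
            games[j] += 1
    return [w / g if g else 0.5 for w, g in zip(wins, games)]


def group_normalize(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within the rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A toy call such as `group_normalize(tournament_rewards(texts, judge))` yields zero-mean advantages, so only the relative ranking inside the group drives the update.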

LELA: An End-to-end LLM-based Entity Linking Framework with Zero-shot Domain Adaptation

arXiv cs.AI · Samy Haffoudhi, Nikola Dobričić, Fabian Suchanek, Nils Holzenberger · 2026-05-26

The paper introduces LELA, an end-to-end LLM-based entity linking framework with zero-shot domain adaptation, addressing limitations of domain-specific approaches. LELA integrates zero-shot NER into a modular, domain-agnostic pipeline, implemented as a Python library for practical use. Experimental results demonstrate its robustness across diverse entity linking settings, validated through performance metrics. The system includes a demo allowing users to test it on custom input texts.

entity linking · zero-shot · llm · ner · domain adaptation

JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors

arXiv cs.AI · Jiho Jin, Junho Myung, Juhyun Oh, Junyeong Park · 2026-05-26

The paper introduces JuICE, a benchmark for evaluating LLM-Judge capabilities in detecting cultural errors across diverse contexts. The dataset comprises 7,470 span-level annotations of cultural and linguistic errors in 1,050 query-response pairs from four countries (US, South Korea, Indonesia, Bangladesh) in both English and native languages. Results show that even top-performing LLM-judges achieve only F1=0.52 in erroneous span detection, consistently failing to identify thick cultural errors recognized by local residents.

cultural errors · llm-judge · span-level annotations · multilingual dataset · thick cultural errors

Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)

arXiv cs.AI · Paul Sigloch, Christoph Benzmüller · 2026-05-26

The paper proposes a neuro-symbolic verification architecture combining formal symbolic methods and neural semantic analysis to enhance LLM reliability in high-stakes domains. The hybrid approach uses logical reasoning for input verification (ensuring decidable guarantees) and embedding-based similarity for output validation (detecting contextual hallucinations), implemented via a parallel actor-based pipeline. Evaluated on HAIMEDA, a medical device damage assessment system, the method achieves 83% hallucination detection for structured entities and 72% for semantic fabrications, while reducing report creation time by 30%, demonstrating efficacy for data-sensitive applications.

neuro-symbolic verification · formal methods · semantic similarity · hallucination detection · actor-based pipeline

Developing a Totally Unimodular Linear Program for Optimal Conformance Checking: When and Why It Complements A*

arXiv cs.AI · Izack Cohen · 2026-05-26

The paper introduces a totally unimodular linear program (LP) formulation for alignment-based conformance checking, complementing existing A*-based methods. By reformulating the problem on the synchronous product's reachability graph as a network-flow LP, it obtains a constraint matrix whose total unimodularity guarantees integral optimal solutions from the LP relaxation, avoiding combinatorial search. Evaluation on 2.1M instances shows A* excels for short, conformant traces, while the LP method accelerates longer, deviant cases. A hybrid selection strategy achieves 38.6% runtime savings with 96% accuracy versus A*-only baselines.

conformance checking · totally unimodular lp · a* algorithm · reachability graph · network-flow

Beyond Questions: Evaluating What Large Language Models (Actually) Know

arXiv cs.AI · Luca Giordano, Simon Razniewski · 2026-05-26

This paper introduces open knowledge evaluation, a novel paradigm for assessing parametric knowledge in large language models (LLMs) beyond predefined question-answer formats. The authors propose BeQu, a benchmark comprising 10,000 entities with reference corpora, which evaluates LLMs based on knowledge surfaced through open-ended elicitation prompts rather than narrow questions. Using BeQu, they analyze factors such as reasoning effort, model scale, prompt format, and knowledge domain across a range of LLMs. The benchmark and results are publicly available via GitHub and a dedicated website.

parametric knowledge · elicitation prompts · statement verification · benchmark · reference corpora

Reasoning Depth and Environment Complexity: A Controlled Study of RLVR Data Allocation across Logical Reasoning Tasks

arXiv cs.AI · Yihua Zhu, Qianying Liu, Fei Cheng, Jiaxin Wang · 2026-05-26

The study characterizes reasoning in RLVR along two dimensions—reasoning depth and environment complexity—and evaluates four reasoning abilities: deductive, abductive, inductive, and analogical. Using a synthetic knowledge-graph environment with controlled distributions, the authors find that joint depth-complexity coverage outperforms single-axis approaches, with non-uniform performance across reasoning families (e.g., abductive reasoning degrades outside RL-covered regions). Uniform mixing surpasses staged curricula under fixed budgets, and off-the-shelf models exhibit deductive-over-abductive asymmetry, suggesting broader implications beyond the controlled setup.

rlvr · reasoning depth · environment complexity · abductive reasoning · knowledge-graph

From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation

arXiv cs.AI · Youssef Al Mouatamid, Marie Bonnin, Jihad Zahir · 2026-05-26

The paper introduces N2I-RAG, an agentic retrieval-augmented generation framework for computing legal indicators from normative texts. The method combines adaptive retrieval, LLM-based agents, and validation mechanisms in a modular pipeline to ensure traceability and evidence grounding, with explicit explanations for intermediate decisions. Evaluated on a French marine environmental law corpus, N2I-RAG outperforms baselines across multiple language model families and generalizes to different legal bans, demonstrating its potential for transparent legal monitoring.

retrieval-augmented generation · legal indicators · agentic framework · normative texts · traceability

TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews

arXiv cs.AI · Hanqi Duan, Xiang Li · 2026-05-26

The authors introduce TADDLE, a tool-augmented agent for detecting deficient LLM-generated peer reviews, addressing a gap in existing systems that either classify authorship or score quality without identifying specific defect types. TADDLE employs four specialized analysis tools (Verify, Correct, Complete, Transform) orchestrated by an agent, with outputs integrated via two-stage semi-supervised learning. Evaluated on a new expert-annotated benchmark of 1,800 ICLR 2025 reviews labeled across six defect categories, TADDLE demonstrates strong performance in both binary and multi-label classification tasks.

llm-generated reviews · defect detection · tool-augmented agent · semi-supervised learning · peer review benchmark

EEG-FM-Audit: A Systematic Evaluation and Analysis Pipeline for EEG Foundation Models

arXiv cs.AI · Xianheng Wang, Yige Yang, Damien Coyle · 2026-05-26

The study introduces EEG-FM-Audit, a systematic pipeline for evaluating EEG foundation models (FMs), addressing limitations in baseline tuning, learning paradigm verification, and interpretability. The method combines ASHA-driven benchmarking, paradigm-level ablation studies, and neurophysiological probing (NPP) to assess temporal, spatial, and spectral feature utilization. Results on four EEG-FMs and five supervised models across three datasets show that tuned baselines often match FMs despite smaller parameter counts, learning paradigm efficacy varies with dataset scale, and NPP reveals FM reliance on physiologically valid EEG features.

eeg foundation models · neurophysiological probing · asha-driven benchmarking · learning paradigms · neural decoding

On the Detection of Commutative Factors in Factor Graphs: Necessary and Sufficient Conditions

arXiv cs.AI · Malte Luttermann, Ralf Möller, Marcel Gehrke · 2026-05-26

The paper identifies a flaw in the state-of-the-art algorithm for detecting commutative factors in factor graphs, showing that its central theorem provides only a necessary (not sufficient) condition for identification. The authors correct this by proving a modified theorem and presenting a revised algorithm that maintains efficiency while ensuring correctness. Additionally, they introduce a complementary algorithm with improved worst-case bounds, addressing the limitations of existing methods in lifted probabilistic inference.

factor graphs · commutative factors · probabilistic inference · lifted inference · algorithm correction

Practical Anonymous Two-Party Gradient Boosting Decision Tree

arXiv cs.AI · Huang Chenyu, Zhang Fan, Du Minxin, Chow Sherman SM · 2026-05-26

The authors introduce an anonymous two-party gradient-boosted decision tree (GBDT) training protocol for vertically partitioned data, addressing the challenge of hiding shared record identifiers (IDs) while maintaining efficiency. The method employs dual circuit-private set intersection (PSI) with alternating receiver roles, oblivious programmable pseudorandom functions for state propagation, and optimized ciphertext packing for homomorphic encryption. This approach avoids universal alignment and reduces ID-hiding costs scaling with domain size. Experimental results demonstrate competitive efficiency with non-ID-hiding methods, enabling secure aggregation in vertically partitioned analytics.

gradient-boosted decision tree · private set intersection · homomorphic encryption · vertically partitioned data · oblivious programmable pseudorandom functions

ICICLE: Expanding Retrieval with In-Context Documents

arXiv cs.AI · Yu-Chen Den, Yung-Yu Shih, Zhi Rui Tam, Kuan-Yu Chen · 2026-05-26

ICICLE introduces an in-context indexing framework for generative retrieval that addresses corpus expansion challenges by incorporating inference-time document-docid evidence. The method combines a `[COPY]`-based routing mechanism, preference-based calibration, and large context adaptation to distinguish context-grounded from parametric retrieval. Evaluations on MS MARCO and NQ320K demonstrate improved retrieval of new documents while maintaining seen-document retention without retraining, with routing failure identified as the primary cause of high-shot degradation.

generative retrieval · in-context learning · corpus expansion · parametric memory · docid generation

Strategies for Guiding LLMs to Use Software Design Patterns: A Case of Singleton

arXiv cs.AI · Viktor Kjellberg, Farnaz Fotrousi, Miroslaw Staron · 2026-05-26

This paper evaluates strategies for guiding LLMs to generate code adhering to the Singleton design pattern, testing 13 models across four prompting methods on 164 Java tasks from HumanEval-X. Iterative binary feedback emerged as the most effective approach, with Llama 3.3 achieving 100% Singleton compliance and a 34.1-percentage-point functionality improvement via instruction-based guidance, while Qwen 3 (8B) reached 99.2% pattern alignment and 58.6% functionality using binary feedback. Results demonstrate that even simple prompting techniques can significantly enhance LLMs' architectural pattern compliance without compromising code quality.

large language models · design patterns · singleton · prompt engineering · automated feedback

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

arXiv cs.AI · Mingze Wang, Shuchen Zhu, Yuxin Fang, Binghui Li · 2026-05-26

This work systematically investigates scale vectors in large language models (LLMs), demonstrating their critical role despite minimal parameter count. Through theoretical and empirical analysis, it reveals that scale vectors enhance optimization via self-amplifying preconditioning in Pre-Norm architectures, rather than increasing expressivity. The study distinguishes Input-Norm and Output-Norm layers, showing weight decay benefits the former but harms the latter. Three lightweight improvements—branch-specific heterogeneity, optimized placement around linear mappings, and magnitude-direction reparameterization—are proposed and validated. Unified scale-vector strategies achieve lower terminal loss and improved scaling behavior across LLMs (0.12B to 2B parameters) with negligible overhead.

scale vectors · pre-norm architectures · weight decay · magnitude-direction reparameterization · self-amplifying preconditioning

GeoFaith: A Spatio-Temporal Dual View of Faithful Chain-of-Thought

arXiv cs.AI · Weijiang Lv, Wentong Zhao, Jiayu Wang, Yuhao Wu · 2026-05-26

GeoFaith introduces a spatio-temporal framework for diagnosing and enforcing faithful Chain-of-Thought (CoT) reasoning in large language models (LLMs), addressing pervasive post-hoc rationalization. The method leverages latent geometric structure and entropy dynamics, employing a scalable bootstrapping pipeline to expand step-level annotations from 1k to 20k samples across four domains. An 8B faithfulness detector outperforms GPT-5 on standard benchmarks, and a faithfulness-aware reinforcement learning framework jointly optimizes outcome correctness, process faithfulness, and trajectory consistency. Experiments demonstrate superior performance in faithfulness detection and downstream reasoning, producing shorter, more interpretable chains without accuracy loss.

chain-of-thought · faithfulness detection · reinforcement learning · latent geometric structure · entropy dynamics

Multi-Stakeholder LLM Alignment: Decomposing Estimation from Aggregation

arXiv cs.AI · Lulu Zheng, Wenjin Yang, Xiangwen Zhang, Rong Yin · 2026-05-26

The paper introduces DecompR, a method for multi-stakeholder LLM alignment that decomposes utility estimation from aggregation to address weighting noise. It fixes counterfactual-calibrated weights via query structure prior to candidate scoring, while independently estimating per-role utilities, eliminating candidate-dependent weight drift. Empirical and theoretical analysis shows holistic LLM judges conflate estimation and aggregation, causing unstable implicit weights that amplify with stakeholder dispersion and count. Experiments demonstrate DecompR reduces estimation noise by decoupling these components.

multi-stakeholder alignment · utility estimation · weighting noise · counterfactual calibration · llm judges
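
The separation the DecompR summary describes, with weights fixed from the query before any candidate is scored and per-role utilities estimated independently, can be illustrated with a small sketch. The role names, the weighted-average aggregation rule, and both function names are assumptions; the summary does not say how the weights are calibrated or combined.

```python
def aggregate(role_utilities, weights):
    """Weighted average of per-role utilities using weights fixed in advance."""
    total = sum(weights.values())
    return sum(weights[role] * role_utilities[role] for role in weights) / total


def rank_candidates(candidates, weights):
    """Rank candidates by aggregated utility; the same weights apply to every candidate.

    candidates: dict mapping candidate name -> dict of role -> estimated utility.
    """
    return sorted(candidates, key=lambda name: aggregate(candidates[name], weights), reverse=True)
```

Because `weights` never depends on the candidate being scored, the implicit weighting cannot drift from one candidate to the next, which is the failure mode the summary attributes to holistic judges.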

Knowledge Graphs as the Missing Data Layer for LLM-Based Industrial Asset Operations

arXiv cs.AI · Madhulatha Mandarapu, Sandeep Kunkunuru · 2026-05-26

This work demonstrates that knowledge graphs significantly improve LLM-based industrial asset operation accuracy by serving as a structured data layer. The authors augment AssetOpsBench (139 scenarios) with a knowledge graph (781 nodes, 955 edges, 16 relationship types) and evaluate three architectures: deterministic graph handlers (99%), LLM-generated Cypher queries (82-83%), and the baseline LLM tool augmentation (65%). Results show that structuring LLM reasoning through graph queries outperforms direct reasoning over raw data. On an expanded benchmark (467 scenarios), deterministic handlers achieve 100% accuracy, suggesting that data layer structure, not LLM orchestration, is the primary bottleneck in operational domains.

knowledge graph · llm orchestration · cypher queries · assetopsbench · deterministic handlers

The Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection

arXiv cs.AI · Zhengyu Hu, Zheyuan Xiao, Linxin Song, Fengqing Jiang · 2026-05-26

We introduce Student-Centric Answer Sampling (SCAS), a framework for selecting teacher-generated supervision based on student-centric learning cost rather than teacher performance. SCAS leverages a token-wise gradient decomposition to derive an efficient forward-only proxy for learning cost, enabling answer selection during training that is tailored to the student’s current state. Experiments across 30 teacher models, 6 student base models, and 8 tasks demonstrate that SCAS consistently enhances student performance, challenging the assumption that the strongest teacher provides the best supervision. This highlights the importance of student-aligned supervision in LLM training.

student-centric learning · token-wise gradient · forward-only proxy · answer selection · llm training

Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study

arXiv cs.AI · Anas H. Alzahrani · 2026-05-26

The study implements a persistent AI agent in academic research, analyzing its operation over 96 days through the PARE-M framework. The environment featured durable memory (502 files), specialized roles (8,059 user messages), and governance protocols, generating 75,671 telemetry records. Results show cache-dominated workflows (82.9% of 73.95M tokens) and 627 model-completed events, suggesting economic shifts toward cost-per-artifact metrics. The case study demonstrates feasibility but highlights needs for artifact-level evaluation and standardized event taxonomies in persistent agentic systems.

persistent agent · pare-m framework · cache-dominant workflow · governance protocols · telemetry records

The Sensation Modulating Network: Haltability as the architectural ground for object-directed phenomenology

arXiv cs.AI · G. Nagarjuna, Durgaprasad Karnam · 2026-05-26

The paper proposes the Sensation Modulating Network (SMN), an embodied cognitive architecture resolving the cognitivism-4E impasse through three key commitments: haltability (antagonistic affordance recruitment for attentional directedness), dual-signal SMAPs (structural self/world distinction), and a four-level action-pattern hierarchy (autonomic-to-conventional transitions). Methodologically, SMN formalizes opponent dynamics across anatomical scales via coordinated action zones and body-wide broadcast routing. Results include a unified account of recursion (negotiable action patterns) and embodiment (opponent substrate), with eight predicted registers and reference simulations provided.

sensation modulating network · haltability · opponent dynamics · coordinated action zones · negotiable action patterns

Helicase: Uncertainty-Guided Supply Chain Knowledge Graph Construction with Autonomous Multi-Agent LLMs

arXiv cs.AI · Yunbo Long, Haolang Zhao, Ge Zheng, Alexandra Brintrup · 2026-05-26

The paper introduces Helicase, a multi-agent LLM system for uncertainty-aware supply chain knowledge graph construction, addressing structural inference problems requiring multi-hop reasoning across fragmented sources. Helicase decomposes queries into executable plans, coordinates specialized agents (web-search, reasoning, coding) via iterative verification, and builds query-specific knowledge graphs with per-fact uncertainty annotations. A three-layer uncertainty framework (action, trajectory, memory) enables calibrated confidence assessment. Evaluation uses SCQA, a benchmark of 80 supply chain queries spanning single-hop to multi-hop inference under varying data visibility.

multi-agent llm · knowledge graph construction · uncertainty calibration · multi-hop reasoning · supply chain inference

The Kalman Evolve: Closing the Gap in Kalman Filtering via Interpretable Algorithm Discovery

arXiv cs.AI · Vasileios Saketos, Ming Xiao · 2026-05-26

The paper introduces Kalman Evolve, a framework for discovering improved filtering algorithms by jointly optimizing noise parameters and update structure in Kalman filtering. Leveraging LLMs as a structured prior over program space, it generates interpretable, non-affine modifications to the classical Kalman filter while preserving its recursive form. Analytical results demonstrate the suboptimality of affine estimators under nonlinear sensing models. Evaluated on synthetic and real-world tracking benchmarks (Doppler radar, LiDAR, pedestrian tracking), the method reduces RMSE by up to 12% compared to baselines like the Optimized Kalman Filter.

kalman filtering · state estimation · large language models · nonlinear sensing · rmse reduction
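
For context on what Kalman Evolve modifies, here is a minimal sketch of the classical scalar Kalman filter with its affine measurement update. The random-walk state model and the parameter names `q` (process noise) and `r` (measurement noise) are assumptions chosen for brevity; this is the textbook baseline, not the discovered algorithm.

```python
def kalman_1d(measurements, q, r, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a random-walk state observed directly with noise.

    q: process-noise variance, r: measurement-noise variance.
    x0, p0: initial state estimate and its variance.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q                # predict: identity dynamics, uncertainty grows by q
        k = p / (p + r)          # Kalman gain
        x = x + k * (z - x)      # affine update toward the measurement
        p = (1.0 - k) * p        # posterior variance shrinks after the update
        estimates.append(x)
    return estimates
```

The noise parameters `q` and `r` are the quantities the summary says are optimized jointly with the update structure, and the line `x = x + k * (z - x)` is the affine form that the non-affine modifications would replace.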

ContextGuard: Structured Self-Auditing for Context Learning in Language Models

arXiv cs.AI · Hongbo Jin, Chi Wang, Haoran Tang, Zhongjing Du · 2026-05-26

ContextGuard introduces a structured self-auditing framework to address LLMs' limitations in applying complex contextual knowledge. The method focuses on identifying and rectifying failures in peripheral, persistent, or format-sensitive requirements during in-context learning, rather than wholesale reasoning collapses. Empirical benchmarks demonstrate that despite strong reasoning capabilities, LLMs often miss nuanced contextual elements, highlighting the need for systematic auditing mechanisms to improve fidelity in context-rich tasks.

contextual knowledge · in-context learning · reasoning capabilities · self-auditing · context-rich tasks

RAGEAR: Retrieval-Augmented Graph-Enhanced Academic Recommender

arXiv cs.AI · Francesco Granata, Lorenzo Lamazzi, Misael Mongiovì, Francesco Poggi · 2026-05-26

RAGEAR introduces a neurosymbolic academic course recommender combining dense retrieval over lecture transcripts with a symbolic Knowledge Graph encoding curricular relationships. The system employs a graph-aware aggregation function that propagates chunk-level semantic matches to course recommendations, weighted by retrieval share, rank strength, and evidence distribution. Evaluation on 152 queries via human and LLM-based assessment demonstrates improvements over metadata-only and transcript-based baselines, particularly for top-ranked recommendations.

neurosymbolic · knowledge graph · dense retrieval · aggregation function · lecture transcripts

Innovation: An Almost Characterization of Hallucination

arXiv cs.AI · Nishant P. Das, Piyush Srivastava · 2026-05-26

The work introduces 'innovation', a property measuring an LLM's tendency to generate outputs outside its training data, as an almost characterization of hallucination. Building on Kalai and Vempala's probabilistic framework linking calibration and hallucination to missing mass, the authors prove that innovation is implied by their hallucination condition and vice versa with high probability. They derive lower bounds on hallucination rates via innovation rates and missing mass, extending prior theoretical results.

hallucination · missing mass · innovation rate · calibration · lower bounds
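
The missing mass referred to in this summary is the total probability of outcomes that never appeared in the sample. A standard way to estimate it is the Good-Turing estimator: the fraction of observations whose value occurs exactly once. The sketch below shows that estimator only; whether the paper uses this exact form is an assumption.

```python
from collections import Counter


def good_turing_missing_mass(samples):
    """Good-Turing estimate of unseen probability mass: (# values seen once) / (# samples)."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(samples)
```

For example, in a sample of four facts where two occur once each, the estimate is 0.5, suggesting that roughly half of the probability mass lies on facts not yet observed.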

HTMLCure: Turning Browser Experience into State Guided Repair for Interactive HTML

arXiv cs.AI · Jiajun Wu, Jian Yang, Tuney Zheng, Wei Zhang · 2026-05-26

HTMLCure introduces a browser experience framework for repairing interactive HTML pages generated by LLMs, addressing failures under dynamic interactions (scroll, hover, etc.) missed by screenshot-based evaluation. The method executes pages across viewports and interaction states, records deterministic browser evidence, and uses a VLM with curated keyframes for state-guided repair. Results show HTMLCure-27B-Refined achieves 50.6 on HTMLBench-400 (45.2% test case pass) and 81.2 on MiniAppBench, improving raw SFT by 15.3 points and matching reference systems like Kimi-K2.6 and GPT-5.4.

html repair · browser experience · state-guided repair · deterministic evaluation · interactive html

What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation

arXiv cs.AI · Xiang Wang, Wei Wei · 2026-05-26

This work investigates the probe-time mechanisms underlying chain-of-thought (CoT) prompting's effectiveness in language models, focusing on lexical activation and token co-occurrence rather than global logical derivation. Through controlled experiments with fixed rationales, the authors demonstrate that even globally shuffled rationales outperform no-rationale baselines, indicating strong lexical activation. Structured text gains primarily arise from short-range token adjacency, with contiguous windows of 2-3 tokens recovering most of the CoT performance. Results generalize across model families, parameter scales, and datasets, supporting a local co-occurrence activation (LCA) account of CoT's probe-time benefits.

chain-of-thought · lexical activation · token co-occurrence · probe-time · rationale

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

arXiv cs.AI · Zhe Yu, Wenpeng Xing, Yunzhao Wei, Jie Chen · 2026-05-26

The study introduces 'composition collapse,' demonstrating that models with statistically indistinguishable atomic knowledge can diverge by over 40 percentage points in compositional reasoning, a phenomenon masked by aggregate benchmark metrics. A double-gate protocol is proposed to decompose post-training gains into atomic stability, residual composition, and critical depth, revealing that post-training objectives shift composition capability in ways aggregate metrics obscure. Diagnostic probes indicate that a significant portion of composition failure arises from generation-time computation constraints rather than inherent inability to compose. Findings suggest that claims about multi-hop reasoning improvement should include atomic-gate-controlled composition metrics.

composition collapse · double-gate protocol · atomic stability · residual composition · critical depth

SeDT: Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability

arXiv cs.AI · Ramakrishna Vamsi Setti, Jagadeesh Rachapudi, Sachin Chaudhary, Praful Hambarde · 2026-05-26

SeDT introduces a training-free inference-time method to address LLMs' performance degradation in multi-turn conversations, where models lose up to 39% performance when tasks are revealed incrementally. By importing return-to-go conditioning from offline RL, SeDT annotates conversation shards with cumulative relevance scores derived from semantic, lexical, and positional signals, presenting the full annotated history at the final turn. Evaluated on the Lost-in-Conversation benchmark across three LLMs and three tasks, SeDT improves mean performance by up to 37.7% and reduces unreliability in 7/9 model-task combinations.

multi-turn conversation · return-to-go conditioning · sentence-transformer · decision-transformer · reliability failure
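
Return-to-go conditioning, as used in Decision Transformers, tags each step with the sum of rewards from that step to the end. A minimal sketch of applying that idea to conversation shards follows. The per-shard relevance scores are taken as given, and the `[rtg=...]` annotation format and function names are hypothetical, since the summary does not specify how SeDT combines its signals or formats the history.

```python
def returns_to_go(scores):
    """Suffix sums: entry t is the total of scores[t:] (Decision-Transformer style)."""
    rtg, running = [], 0.0
    for s in reversed(scores):
        running += s
        rtg.append(running)
    return rtg[::-1]


def annotate_history(shards, scores):
    """Prefix each conversation shard with its return-to-go (hypothetical format)."""
    rtg = returns_to_go(scores)
    return "\n".join(f"[rtg={g:.2f}] {text}" for text, g in zip(shards, rtg))
```

The annotated string would then be presented to the model at the final turn in place of the raw history, with no training involved.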

Implementation of Big Data Analytics for Diabetes Management: Needs Assessment in the Rwanda Healthcare System

arXiv cs.AI · Silas Majyambere, Tony Lindgren, Workneh Y. Ayele, Celestin Twizere · 2026-05-26

This study evaluates Rwanda's healthcare system readiness for implementing Big Data Analytics (BDA) in diabetes management through a stakeholder workshop (n=25). The research identifies opportunities for leveraging electronic health records with machine learning for predictive analytics and clinical decision support, while highlighting implementation challenges. A proposed BDA framework incorporates explainable AI models to enhance diabetes monitoring and treatment strategies.

big data analytics · diabetes management · electronic health records · explainable machine learning · clinical decision support

EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation

arXiv cs.AI · Yunbo Long, Haolang Zhao, Lukas Beckenbauer, Liming Xu · 2026-05-26

EmoDistill introduces an offline framework for distilling emotional negotiation skills into language model agents, addressing the vulnerability of post-trained LLMs in adversarial settings where emotional framing can bias outcomes. The method decomposes emotional strategy into emotion selection (via Implicit Q-Learning) and expression (via LoRA-based SFT and Judge Policy Optimization). Evaluated across four high-stakes negotiation domains, EmoDistill-trained SLM policies achieve superior utility over vanilla SLM/LLM baselines and IQL-only selection, with ablations confirming emotion conditioning's necessity and transfer studies demonstrating cross-domain generalization.

emotional strategy · implicit q-learning · low-rank adaptation · supervised fine-tuning · judge policy optimization

Ratio-Variance Regularized Policy Optimization

arXiv cs.AI · Yu Luo, Shuo Han, Yihan Hu, Lei Lv · 2026-05-26

The paper introduces Ratio-Variance Regularized Policy Optimization (R²VPO), a method replacing heuristic clipping in on-policy RL with principled ratio-variance constraints. It employs a primal-dual framework to act as a distributional soft brake, preserving high-return gradient signals while down-weighting stale data. Evaluations across 7 LLM scales and 10 robotic tasks show R²VPO improves mathematical reasoning (especially in smaller models) and outperforms PPO in sparse-reward/dynamic control domains, demonstrating superior sample efficiency.

policy optimization · trust-region constraints · primal-dual framework · ratio-variance regularization · sample efficiency
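
The idea of replacing clipping with a ratio-variance constraint handled by a primal-dual scheme can be sketched generically. The code below is a variance-penalized surrogate loss with a dual-ascent multiplier update. The exact R²VPO objective is not given in the summary, so the loss form, the variance budget, and the function names are all assumptions.

```python
import math
from statistics import mean, pvariance


def ratio_variance_loss(logp_new, logp_old, advantages, lam):
    """Unclipped surrogate with a penalty on the variance of the importance ratios.

    Returns (loss, ratio_variance). lam is the current Lagrange multiplier.
    """
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    surrogate = mean([r * a for r, a in zip(ratios, advantages)])
    ratio_var = pvariance(ratios)
    return -surrogate + lam * ratio_var, ratio_var


def dual_update(lam, ratio_var, budget, step=0.5):
    """Dual ascent: raise lam when the variance exceeds the budget, lower it otherwise."""
    return max(0.0, lam + step * (ratio_var - budget))
```

Unlike a hard clip, the penalty never zeroes out the gradient of a high-advantage sample; it only scales the pressure toward the old policy as the ratios spread out.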

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

arXiv cs.AI · Xiaohan Wang, Mingze Yin, Yilin Zhao, Gang Liu · 2026-05-26

LiveK12Bench introduces a dynamic, multi-disciplinary benchmark to evaluate Large Multimodal Models (LMMs) in realistic K-12 exam scenarios, addressing limitations of static datasets and data contamination. The framework features 2K+ verified questions from Mathematics, Physics, Chemistry, and Biology, with an automated pipeline for continuous updates and a novel 'Mock Exam' evaluation scheme assessing end-to-end reasoning. Experiments on 12 LMMs show significant performance drops under exam constraints (e.g., GPT-5's score declines from 79 to 53), revealing vulnerabilities to complex visual layouts and process rigor.

large multimodal models · k-12 reasoning · dynamic benchmark · data contamination · mock exam evaluation

The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context

arXiv cs.AI · Zhe Yu, Wenpeng Xing, Yunzhao Wei, Bo Yang · 2026-05-26

The paper introduces Computational Reality Monitoring (CRM), a method to detect whether language models rely on parametric memory rather than retrieved context during retrieval-augmented generation. CRM operationalizes a cognitive science principle by comparing internal representations with and without context, identifying architecture-specific layer patterns indicative of pretraining exposure. Across nine model variants spanning three families, CRM demonstrates measurable divergence in internal representations, supported by block-level noise intervention and generalization across tasks and datasets. This addresses the attribution blind spot, where context-consistent output does not guarantee context-governed generation, enabling systems to govern behavior based on evidence provenance.

retrieval-augmented generation · parametric memory · computational reality monitoring · internal representations · attribution blind spot

Towards Generalization-Oriented Models for Vehicle Routing Problems with Mixture-of-Experts

arXiv cs.AI · Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng · 2026-05-26

The paper proposes Residual Refined Experts with Instance-level Gating (R2E-IG), a generalization-oriented model for Vehicle Routing Problems (VRPs) that enhances cross-distribution generalization. The method integrates three components: (1) a Residual Refined Expert (R2E) architecture for improved expert expressiveness via residual refinement, (2) an instance-level gating mechanism for distribution-aware routing, and (3) a mixed-distribution training mechanism with Dynamic Weight Adaption (DWA) for dynamic data reweighting. Experiments demonstrate R2E-IG's competitive performance on both in-distribution and out-of-distribution instances across synthetic and benchmark datasets, showcasing its adaptability and integration potential with existing Deep Reinforcement Learning (DRL) methods.

vehicle routing problems · residual refined experts · instance-level gating · dynamic weight adaption · deep reinforcement learning

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

arXiv cs.AI · Kia-Jüng Yang, Dominik Meier, Jiachen Zhao, Terry Ruas · 2026-05-26

The study demonstrates that chain-of-thought (CoT) reasoning in large reasoning models (LRMs) complicates refusal control by dynamically encoding compliance signals across both residual stream activations and CoT traces. Experiments on DeepSeek-R1-Distill-LLaMA-8B show that fixed-CoT activation steering reverses refusal in only 39% of cases, rising to 70% when CoT is removed, while CoT regeneration under steering achieves 94% refusal reversal. The CoT alone retains 48% of the steering effect, indicating its independent role in signal propagation. This reveals LRMs' dual encoding mechanism and vulnerability to CoT-level attacks.

chain-of-thought · activation steering · refusal control · residual stream · large reasoning models

Generative artificial intelligence and the marginalization of minoritized knowledges in higher education: the case of disability

arXiv cs.AI · Fatiha Tali-Otmani · 2026-05-26

The article argues that generative AI in higher education marginalizes non-hegemonic epistemologies, particularly affecting persons with disabilities. Drawing on educational sciences, critical technology studies, and disability studies, it demonstrates how Anglophone, Western-centric training datasets reinforce epistemic coloniality. The analysis reveals that technological architectures often stereotype or exclude disabled individuals, leading to double marginalization. The study explores hybridization between researchers and machines as a potential means to preserve epistemic plurality, while critiquing algorithmic correction as a palliative measure with structural limitations.

generative ai · epistemic coloniality · non-hegemonic epistemologies · algorithmic correction · double marginalization

Adversarial Training for Robust Coverage Network under Worst-case Facility Losses

arXiv cs.AI · Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng · 2026-05-26

The authors propose a Dual-Agent Deep Reinforcement Learning (DADRL) framework for solving the Maximal Covering Location-Interdiction Problem (MCLIP), a bi-level optimization challenge in resilient infrastructure planning. The method employs adversarial learning with a location agent (upper level) and an interdiction agent (lower level), trained simultaneously to capture dynamic competition. A Surrogate-based Ensemble Inference Strategy leverages the interdiction agent as a high-fidelity surrogate for location decisions. Experiments on synthetic and real-world datasets show superior computational efficiency and competitive solution quality, with model-agnostic applicability to network structures.

bi-level optimization · adversarial learning · deep reinforcement learning · resilient infrastructure · surrogate-based inference

Cordon-MAS: Defending RAG against Knowledge Poisoning via Information-Flow Control

arXiv cs.AI · Zhe Yu, Wenpeng Xing, Gaolei Li, Shuguang Xiong · 2026-05-26

CORDON-MAS introduces a compartmentalized framework to defend retrieval-augmented generation (RAG) against Confundo-style knowledge poisoning by enforcing the Cordon Principle, which prohibits final synthesis agents from accessing untrusted natural-language evidence. The method separates evidence extraction, cross-source audit, and answer synthesis into agents with asymmetric memory privileges, addressing the monitoring-control gap where models detect contradictions but still act on poisoned claims. Evaluated across five BEIR datasets, CORDON-MAS reduces attack success rates by 92.4% compared to undefended RAG, reframing RAG poisoning as an information-flow control problem rather than a detection challenge.

retrieval-augmented generation · knowledge poisoning · cordon principle · information-flow control · monitoring-control gap

A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

arXiv cs.AI · Heriberto Cuayahuitl, Grace Jang · 2026-05-26

The paper introduces MeDial-Speech, a novel dataset of 111+ hours of robot-patient and doctor-patient medical dialogues for spoken language processing tasks, covering four health conditions: Lewy body dementia, heart failure, shoulder pain, and angina. The dataset was collected in realistic environments and includes a dialogue benchmark for sentence selection with 20 options. Three state-of-the-art LLMs—GPT-5 mini, DeepSeek-V3, and Claude Sonnet 4—were evaluated, with Claude Sonnet 4 achieving the highest accuracy (71.1% manual, 74.7% automatic transcription). All models exhibited high overconfidence in probabilistic predictions. The dataset is available for non-commercial use on Hugging Face.

medical dialogues · sentence selection · llms · transcription · benchmark

MatFormBench: A Benchmarking Evaluation Framework for Target-Driven Materials Formulation

arXiv cs.AI · Linhan Wu, Chenxi Wang, Chuhan Yang, Zhengwei Yang · 2026-05-26

MatFormBench introduces a benchmarking framework for target-driven materials formulation, addressing the lack of systematic evaluation for inverse optimization algorithms in materials science. The framework combines physics-driven synthetic data generation with five difficulty levels and proposes MatFormScore, a multi-dimensional metric assessing target success, search efficiency, exploratory capacity, robustness, and stability. Evaluation of 39 algorithms across 1170 tasks reveals diffusion-based models as top performers, with VAE-based and GA-based methods excelling in specific scenarios, establishing a standardized benchmark for materials inverse design.

inverse design · materials formulation · benchmarking framework · diffusion models · generative algorithms

Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models

arXiv cs.AI · Xiao-Wen Yang, Ziyu Han, Xi-Hua Zhang, Wen-Da Wei · 2026-05-26

The paper introduces STARS (STAbility-driven Recurrent Scaling), a training framework that stabilizes latent reasoning in Looped Language Models (LoopLMs) by enforcing convergence to asymptotically stable fixed points. It employs Jacobian Spectral Radius Regularization with random loop sampling to balance stability and effectiveness during depth recurrence. Experiments on arithmetic and complex mathematical reasoning tasks demonstrate that STARS enables reliable test-time scaling, mitigates performance degradation at increased recurrence depths, and improves peak performance.

looped language models · latent reasoning · jacobian spectral radius · test-time scaling · recurrent dynamics

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

arXiv cs.AI · Yong-eun Cho · 2026-05-26

The study refutes the assumption that more capable LLM agents universally require less structural guidance, demonstrating non-monotonic harness sensitivity across capability tiers. In a 432-run experiment on the HEAT-24 benchmark, six models across four tiers were tested under three harness conditions (light, balanced, strict). Results show Gemini 2.5 Flash's VTSR dropped 29-38pp with verbose harnesses, while Qwen3.5-122B achieved 91.7% VTSR under the strict harness, and Gemma4:e2B matched strong-open-tier stability. Failure analysis reveals that format_violation errors dominate among capable models, while wrong_file errors plague low-capability models.

llm agents · harness sensitivity · capability tiers · vtsr · failure taxonomy

Measuring Prediction Uncertainty in Neural Cellular Automata

arXiv cs.AI · Ario Sadafi, Michael Deutges, Nassir Navab, Carsten Marr · 2026-05-26

The paper introduces resilience, a method for measuring prediction uncertainty in neural cellular automata (NCA) without architectural changes or retraining. By treating NCAs as dynamical systems, resilience probes stability under perturbations, where stable attractors indicate confident predictions. Evaluated on medical segmentation benchmarks using selective (ΔDice@90, AURC) and ranking (AUROC, AUPRC) metrics, resilience outperforms baselines in identifying failures, enhancing trust in NCA-based models.

neural cellular automata · uncertainty estimation · medical image segmentation · dynamical systems · selective prediction

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

arXiv cs.AI · Yee Hin Chong, Jiaming Wu, Youhui Zhang, Peng Qu · 2026-05-26

The paper introduces CUDAnalyst, a unified analysis layer for attributing planning decisions in self-evolving LLM agents for CUDA kernel generation. The method employs trajectory freezing and selective feedback injection to enable generation-level evaluation and coalitional-style attribution of feedback effects. Results indicate that explicit planning is beneficial only with aligned feedback, effective planning arises from structured multi-feedback interactions, and high-level plans can transfer between models of varying reasoning strength. These findings hold across backbones, workloads, and induction regimes.

cuda kernel generation · feedback-conditioned planning · trajectory freezing · self-evolving agents · coalitional-style attribution

L2Rec: Towards Dual-View Understanding of LLMs for Personalized Recommendation

arXiv cs.AI · Pingjun Pan, Tingting Zhou, Peiyao Lu, Tingting Fei · 2026-05-26

L2Rec introduces a unified approach for adapting LLMs to personalized recommendation by jointly modeling behavioral and semantic signals at the parameter level. The method employs a Dual-view Personalized Mixture-of-Experts (DPMoE) mechanism to apply view-specific low-rank perturbations to a shared LLM backbone, enabling complementary adaptations without representation misalignment. An adaptive cross-view fusion module integrates dual-view outputs. Experiments on four datasets and online A/B testing demonstrate consistent improvements over state-of-the-art baselines in engagement metrics.

personalized recommendation · mixture-of-experts · low-rank perturbations · behavioral signals · semantic signals

SL-BiLEM: Structured Learnable Behavior-in-the-Loop Epidemic Modeling for Forecasting and Policy Evaluation

arXiv cs.AI · Haochun Wang, Sendong Zhao, Jingbo Wang, Yanrui Du · 2026-05-26

SL-BiLEM introduces a structured learnable behavior-in-the-loop epidemic model that addresses distribution shifts caused by human behavior feedback during policy interventions. The method decomposes effective transmission into interpretable components (baseline, policy, media, compliance) with monotonicity and smoothness constraints, enabling robust forecasting and counterfactual analysis. Evaluations on cruise ship, influenza, and COVID-19 datasets show 76% improvement over neural-mechanistic baselines, 53% OOD degradation (vs. 1142% for neural baselines), and 100% bootstrap CI coverage in counterfactual experiments, demonstrating utility for public health decision-making.

epidemic forecasting · distribution shift · counterfactual analysis · monotonicity constraints · behavior-in-the-loop

Rotation-Invariant Spherical Watermarking via Third-Order SO(3) Representation Coupling

arXiv cs.AI · Pengzhen Chen, Yanwei Liu, Xiaoyan Gu, Antonios Argyriou · 2026-05-26

The authors propose a rotation-invariant spherical watermarking method for panoramic imagery by leveraging third-order SO(3) representations. They formulate panoramas as spherical signals and derive provably invariant descriptors via tensor products of higher-order SO(3) irreducible representations, projecting onto the trivial representation to construct a spherical invariant bispectrum. This preserves phase information while ensuring strict rotation invariance. Experimental results demonstrate near-perfect robustness to continuous 3D rotations and high visual fidelity, with theoretical proofs of SO(3) invariance provided.

spherical watermarking · so(3) representation · rotation-invariant descriptors · spherical harmonic coefficients · invariant bispectrum

Model Merging on Loss Landscape: A Geometry Perspective

arXiv cs.AI · Juanwu Lu, Anand Bhaskar, Brian Axelrod, Ekaterina Tolstaya · 2026-05-26

The paper introduces EpiMer, a model merging framework that formulates the problem as computing the Fréchet mean on a Riemannian manifold with the expected Hessian as metric, revealing connections between local curvature and epistemic uncertainty. By restricting computations to a low-rank subspace spanned by task vectors, the method provides theoretical error bounds decomposable into subspace Fréchet variance and residual energy, unifying curvature-aware and spectral methods under a geometric framework. Experiments merging CLIP-ViT models on eight image classification tasks demonstrate consistent improvements in average and worst-task accuracy across all three backbones compared to baselines.

model merging · fréchet mean · riemannian manifold · hessian approximation · epistemic uncertainty

Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

arXiv cs.AI · Yunhui Gan, Tan Pan, Kaiyu Guo, Limei Han · 2026-05-26

The paper introduces a reinforcement learning framework for medical AI agents to address tool failures in clinical settings, where individual tools may fail on challenging instances. The proposed GRPO-based method incorporates probabilistic risk minimization and disagreement-aware synergy learning to correct erroneous tool consensus at the instance level. An entropy-guided sampling strategy upweights high-disagreement instances, providing stronger signals for learning instance-specific tool synergy. Experiments on seven medical benchmarks demonstrate consistent and robust improvements over baselines, emphasizing the importance of synergy-aware tool use for reliable medical agentic systems.

reinforcement learning · tool synergy · probabilistic risk minimization · instance-level selection · medical ai agents

Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets

arXiv cs.AI · Ashima Khanna, Dominik Grimm · 2026-05-26

SILO introduces a trajectory-level self-improvement imitation framework for oracle-budgeted protein sequence optimization, addressing challenges of surrogate noise and functionally critical residue disruption. The method employs a hierarchical edit policy with incremental stochastic beam search (SBS) and a UCB-based proxy ensemble, guided by alanine-scan fitness scores (AFS) for candidate selection. Evaluated across eight protein fitness landscapes, SILO achieves superior maximum and top-100 mean fitness compared to five baselines, demonstrating robustness in low-data and noisy-proxy settings. Ablations highlight SBS and AFS as key contributors to performance gains.

protein sequence optimization · stochastic beam search · alanine-scan fitness score · self-improvement imitation · oracle-budgeted design
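The UCB-based proxy ensemble mentioned for SILO can be illustrated with the generic upper-confidence-bound rule: score each candidate by the ensemble mean plus a multiple of the ensemble spread. This is a sketch of that general idea only; the weight `beta`, the candidate names, and the prediction values are hypothetical, and the paper's actual scoring rule may differ.

```python
import statistics

def ucb_score(predictions, beta=1.0):
    """Mean of the ensemble predictions plus beta times their spread."""
    return statistics.mean(predictions) + beta * statistics.pstdev(predictions)

# Hypothetical fitness predictions from a three-member proxy ensemble.
candidates = {
    "seq_A": [0.5, 0.5, 0.5],  # proxies agree
    "seq_B": [0.2, 0.4, 0.6],  # lower mean, but proxies disagree
}

# beta = 1.0 rewards uncertainty, beta = 0.0 is pure exploitation.
ranked_explore = sorted(candidates, key=lambda s: ucb_score(candidates[s], 1.0), reverse=True)
ranked_exploit = sorted(candidates, key=lambda s: ucb_score(candidates[s], 0.0), reverse=True)
```

With a positive `beta`, the candidate the proxies disagree on can outrank one with a higher mean, which is how such a rule spends a limited oracle budget on informative queries.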

Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning

arXiv cs.AI · Xin Cheng, Shuo He, Lang Feng, HaiYang Xu · 2026-05-26

Graph-based Group Policy Optimization (GraphGPO) introduces a novel credit assignment method for agentic reinforcement learning by constructing a unified state-transition graph from rollout trajectories. It estimates step-level contributions via graph-based advantage, measuring how each transition reduces distance to the goal. This approach overcomes limitations of trajectory-level attribution, particularly in identifying valuable steps within failed trajectories. Evaluations demonstrate GraphGPO's superior training efficiency and state-of-the-art performance across multiple benchmarks.

graphgpo · credit assignment · state-transition graph · reinforcement learning · step-level attribution
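The graph-based advantage in GraphGPO can be pictured with a small sketch: merge rollouts into one state-transition graph, compute each state's hop distance to the goal, and credit a step by how much it shortens that distance. The hop-count metric, the `unreachable` penalty, and the toy state names are assumptions for illustration; the summary does not specify the paper's exact distance or advantage definition.

```python
from collections import defaultdict, deque

def build_graph(trajectories):
    """Merge rollouts (lists of hashable states) into one directed graph."""
    edges = defaultdict(set)
    for traj in trajectories:
        for state, next_state in zip(traj, traj[1:]):
            edges[state].add(next_state)
    return edges

def distance_to_goal(edges, goal):
    """Breadth-first search over reversed edges: hops from each state to the goal."""
    reverse = defaultdict(set)
    for state, next_states in edges.items():
        for next_state in next_states:
            reverse[next_state].add(state)
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        node = queue.popleft()
        for prev in reverse[node]:
            if prev not in dist:
                dist[prev] = dist[node] + 1
                queue.append(prev)
    return dist

def step_advantages(traj, dist, unreachable):
    """Advantage of each transition = reduction in distance to the goal."""
    d = [dist.get(state, unreachable) for state in traj]
    return [d[i] - d[i + 1] for i in range(len(traj) - 1)]

rollouts = [
    ["start", "a", "b", "goal"],      # successful rollout
    ["start", "a", "c", "dead_end"],  # failed rollout sharing the prefix start -> a
]
graph = build_graph(rollouts)
dist = distance_to_goal(graph, "goal")
unreachable = len(dist) + 1  # hypothetical penalty distance for states with no path to the goal
success_adv = step_advantages(rollouts[0], dist, unreachable)
failed_adv = step_advantages(rollouts[1], dist, unreachable)
```

In this toy case the first step of the failed rollout still receives positive credit because it coincides with a step on the successful path, which is the kind of signal trajectory-level attribution would discard.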

An In-Vitro Study on Cross-Lingual Generalization in Language Models

arXiv cs.AI · Adrian Cosma · 2026-05-26

The study introduces an in-vitro framework to isolate factors affecting cross-lingual transfer in language models, using procedurally generated languages with shared structure but divergent surface forms. By systematically varying lexical distance, tokenizer regimes, and vocabulary size across 700 runs, the authors find that transfer depends more on tokenization preserving reusable substructures than on lexical similarity or tokenizer balance. Key results show smaller vocabularies enhance masked transfer via decomposable word fragments, while transfer follows a staged progression from grammatical to lexical competence, explained by tokenizer bridge strength.

cross-lingual transfer · procedural generation · tokenizer regimes · masked language modeling · vocabulary size

DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding

arXiv cs.AI · Peng Zhang, Guanghao Zhang, Wanggui He, Longxiang Zhang · 2026-05-26

DynFrame introduces an adaptive multimodal framework for complex video understanding that jointly learns temporal window selection and frame sampling density through tokenized retrieval. The method addresses structural gaps in existing video MLLMs by implementing learnable span-density retrieval and Segment-Decoupled GRPO (SD-GRPO), which separately credits retrieval and answer generation tokens. Evaluated on six benchmarks (NExT-GQA, Charades-STA, ActivityNet-MR, Video-MME, MLVU, LVBench), DynFrame-4B matches 7B-8B baselines while DynFrame-8B achieves state-of-the-art performance.

multimodal large language models · tokenized retrieval · dynamic frame augmentation · segment-decoupled grpo · video understanding

Certified Causal Attribution for Real-Time Attack Forensics in 6G Network Slicing

arXiv cs.AI · Minh K. Quan, Pubudu N. Pathirana · 2026-05-26

DA-GC introduces a certified causal attribution framework for real-time attack forensics in 6G network slicing, addressing spurious correlations from shared resource contention. It combines resource-conditioned Granger causality with an axiomatic Resource Contention Model (RCM) to block confounding. Evaluated on a 15-slice 6G testbed with 1,100 attack scenarios, DA-GC achieves 89.2% accuracy at 87 ms latency, outperforming baselines by 7.9 percentage points at 2.7x lower latency. The method provides formal certificates for statistical soundness under serially dependent telemetry, security bounds (adversarial breakdown point δ*≈0.95), and differential-privacy guarantees.

granger causality · network slicing · attack attribution · resource contention · differential privacy

The Labyrinth and the Thread: Rethinking Regularizations in Sequential Knowledge Editing for Large Language Models

arXiv cs.AI · Zheng Wang, Kaixuan Zhang, Wanfang Chen, Jingwen Zhang · 2026-05-26

The work establishes a formal equivalence between one-time and sequential knowledge editing in LLMs, demonstrating that stability emerges from accumulated editing constraints rather than specialized regularization. Through rigorous optimization analysis of AlphaEdit, the authors generalize this equivalence to broader editing objectives, showing many common regularization strategies are unnecessary. The framework is extended to handle conflicting edits, yielding robust performance under contradictory updates. Empirical results confirm the approach simplifies sequential editing while maintaining reliability.

sequential knowledge editing · regularization mechanisms · optimization analysis · conflicting edits · large language models

MemFail: Stress-Testing Failure Modes of LLM Memory Systems

arXiv cs.AI · Ishir Garg, Neel Kolhe, Dawn Song, Xuandong Zhao · 2026-05-26

MemFail introduces a diagnostic benchmark to isolate failure modes in large language model (LLM) memory systems, addressing the lack of empirical understanding in existing benchmarks. The authors formalize memory systems as compositions of summarization, storage, and retrieval operations, identifying potential failure modes for each. Five datasets across four tasks are adversarially designed to test specific operations. Evaluating four state-of-the-art memory systems, MemFail empirically reveals tradeoffs induced by architectural differences, enabling targeted attribution of incorrect answers to specific failure modes.

memory systems · failure modes · summarization · retrieval · diagnostic benchmark

AI evaluation may bias perceptions: The importance of context in interpreting academic writing

arXiv cs.AI · Shang Wu, Randol Yao · 2026-05-26

The study demonstrates that context-aware benchmarks are crucial for accurately measuring AI use in scientific writing, as pooled benchmarks introduce systematic biases across countries and fields. Using Dimensions publication data, the authors construct AI-likeness benchmarks by comparing human-written abstracts with LLM-rephrased versions, revealing that pooled benchmarks conflate stylistic variation with AI-generated text. Results show that country-field-specific benchmarks reduce distortions: pooled methods overestimate AI use in some country-field contexts and underestimate it in others, particularly when analyzing 2025 publications.

ai-likeness benchmarks · llm-rephrased text · context-aware measurement · stylistic variation · dimensions database

Respecting Modality Gap in Post-hoc Out-of-distribution Detection with Pre-trained Vision-Language Models

arXiv cs.AI · Yuanwei Hu, Bo Peng, Yadan Luo, Zhen Fang · 2026-05-26

The paper challenges the text-as-prototype paradigm in zero-shot OOD detection using VLMs, demonstrating a fundamental modality gap between text embeddings and optimal visual prototypes. It introduces an online pseudo-supervised framework that learns visual prototypes from test-time data streams, supported by theoretical convergence guarantees. Experiments show state-of-the-art performance across multiple OOD detection benchmarks.

out-of-distribution detection · vision-language models · modality gap · prototype learning · online optimization

Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems

arXiv cs.AI · Wolfgang Maass, Sabine Janzen · 2026-05-26

The study identifies two failure modes in policy-gradient methods for long-horizon cumulative-damage problems: completion (reaching the terminal horizon) and optimality (matching dynamic-programming references). Using PPO with a linear soft penalty, the authors decompose these modes, showing horizon access reduces completion rates while action-space restriction achieves completion but leaves an optimality gap (ΔM_final = 0.271). Four testable predictions are derived and validated in two calibrated environments (49-step bricklayer career, 20-season NBA power-forward career), with horizon-invariance confirmed at three of four tested horizons (H = 15 being the exception).

policy-gradient methods · cumulative-damage problems · ppo · dynamic-programming · horizon-invariance

More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

arXiv cs.AI · Mingze Wang, Jinbo Wang, Yikuan Xia, Kai Shen · 2026-05-26

The paper proposes Mixture of Activations (MoA), a token-adaptive feedforward network (FFN) design that mixes a dictionary of activation functions using lightweight input-dependent gates while sharing linear projections, and introduces learnable activations (LA) as an input-independent counterpart. Theoretically, MoA strictly contains LA, which in turn strictly contains fixed-activation FFNs, with additional expressivity from input-dependent nonlinear hybridization. Empirically, MoA achieves lower terminal loss and more favorable scaling behavior than baselines in pre-training experiments on dense and MoE language models ranging from 0.12B to 2B parameters, with minimal overhead.

mixture of activations · feedforward network · token-adaptive · learnable activations · nonlinear hybridization
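A minimal sketch of the token-adaptive mixing idea behind MoA, written in plain Python for readability. The activation dictionary (ReLU, tanh, SiLU), the softmax gate, and the choice of one gate weight per activation shared across hidden units are assumptions for illustration; the summary only states that input-dependent gates mix a dictionary of activations over shared linear projections.

```python
import math

# Hypothetical activation dictionary.
ACTIVATIONS = [
    lambda z: max(z, 0.0),               # ReLU
    math.tanh,                           # tanh
    lambda z: z / (1.0 + math.exp(-z)),  # SiLU
]

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def moa_ffn(x, w_up, w_gate, w_down):
    """One token through an FFN whose activation is a gated mixture."""
    hidden = matvec(w_up, x)             # shared up-projection
    gates = softmax(matvec(w_gate, x))   # one mixing weight per activation, computed from the token
    mixed = [sum(g * act(z) for g, act in zip(gates, ACTIVATIONS)) for z in hidden]
    return matvec(w_down, mixed)         # shared down-projection

x = [1.0, -1.0]
w_up = [[1.0, 0.0], [0.0, 1.0]]
w_down = [[1.0, 1.0]]

# Zero gate weights give a uniform mixture of the three activations.
uniform_out = moa_ffn(x, w_up, [[0.0, 0.0]] * 3, w_down)
# A gate that strongly favours the first entry behaves like a plain ReLU FFN.
relu_like_out = moa_ffn(x, w_up, [[50.0, 0.0], [0.0, 0.0], [0.0, 0.0]], w_down)
```

The second call shows the containment claim in miniature: when the gate saturates on one dictionary entry, the layer reduces to a fixed-activation FFN.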

UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

arXiv cs.AI · Yiqun Chen, Wei Yang, Erhan Zhang, Shijie Wang · 2026-05-26

UnityMAS-O introduces a reinforcement learning (RL) framework for optimizing LLM-based multi-agent systems by treating entire workflows as optimization units. The framework employs four abstractions—logical agent roles, graph trajectories, user-defined rewards, and agent-model mappings—to decouple agents from model parameters, enabling flexible parameter sharing and role-specific credit assignment. Built on verl with a Ray-based runtime, it supports distributed PPO-style updates without infrastructure rewrites. Evaluations on retrieval-augmented QA (Natural Questions, HotpotQA) and code generation show RL optimization improves manual workflows, particularly for smaller models and strict code-passing metrics.

multi-agent systems · reinforcement learning · parameter sharing · ppo-style updates · graph trajectories

JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

arXiv cs.AI · Dongyun Zou, Zhuoyang Zhang, Junyu Chen, Wenkun He · 2026-05-26

JetViT introduces a family of hybrid-architecture Vision Transformers (ViTs) that achieve state-of-the-art accuracy with enhanced inference efficiency on high-resolution images. The method employs Post-Training Attention Search, a framework that converts pre-trained full-attention ViTs into hybrid-attention variants by replacing redundant full-attention blocks with linear or window-attention blocks while preserving critical ones. Evaluated on DINOv3 and DepthAnythingV2, JetViT achieves up to 1.79x higher throughput and 44.81% lower latency on NVIDIA H100 GPUs without accuracy loss. Code and accelerated models will be released.

vision transformer · post-training attention search · hybrid-attention · high-resolution · inference efficiency

Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2

arXiv cs.AI · Zhanfeng Feng, Shuai Guo, Xin Di, Long Peng · 2026-05-26

Tail-Aware HiFloat4 introduces W4A4 post-training quantization for Wan2.2, adapting ViDiT-Q for text-to-video generation. The method employs HiFloat4 fake quantization for linear layers in transformer modules while preserving high-precision boundary components, supplemented by an activation-tail-aware percentile calibration for channel-mask construction. It minimizes rare calibration outlier impact through compact PTQ-state restoration, maintaining runtime HiFloat4 arithmetic and sampling efficiency without architectural modifications.

post-training quantization · w4a4 · hifloat4 · channel-mask construction · ptq-state restoration

MedVol-R1: Reward-Driven Evidence Grounding for Volumetric Reasoning Segmentation

arXiv cs.AI · Zichun Wang, Hairong Shi, Bingzheng Wei, Yan Xu · 2026-05-26

MedVol-R1 introduces a reinforcement learning-based framework for Volumetric Reasoning Segmentation (VRS) in 3D medical scans, decoupling evidence grounding from volumetric delineation. The method employs a Large Vision-Language Model (LVLM) to ground clinical reasoning to a verifiable 2D evidence anchor, which is propagated into a 3D mask using a frozen MedSAM2 module. Training involves cold-start supervised fine-tuning followed by GRPO, guided by a multi-component reward optimizing evidence selection, 2D spatial grounding, and volumetric coherence. Experiments on CT-ORG, AbdomenCT-1K, and KiTS23 from the M3D-Seg benchmark show MedVol-R1 outperforms baselines, achieving state-of-the-art performance with reinforcement learning providing clear gains over supervised fine-tuning.

volumetric reasoning segmentation · reinforcement learning · evidence grounding · large vision-language model · medical scan

FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning

arXiv cs.AI · Hyungyu Choi, Young Kyun Jang, Chanho Eom · 2026-05-26

FAST-GOAL introduces an efficient fine-tuning method to enhance CLIP's ability to handle lengthy text descriptions through global-local semantic alignment. The approach comprises Fast Local Image-Sentence Matching (FLISM), which extracts and matches local image regions with corresponding sentences, and Token Similarity-based Learning (TSL), which maximizes similarity between patch tokens and region embeddings for both images and text. The method is validated on datasets including DOCCI, DCI, MSCOCO, and Flickr30k, demonstrating significant improvements in adapting CLIP to detailed textual descriptions while maintaining computational efficiency.

clip · global-local alignment · fine-tuning · token similarity · object detection

Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training

arXiv cs.AI · Woojeong Kim, Ziyi Yang, Jing Nathan Yan, Jialu Liu · 2026-05-26

The paper introduces Pilot-Commit, a budget-aware rollout allocation framework for group-based RL post-training of large language models. By decoupling prompt evaluation from exploitation, Pilot-Commit first estimates per-prompt informativeness via a pilot stage, then allocates remaining rollouts to high-leverage prompts while skipping low-signal ones. Evaluated across math reasoning benchmarks with models scaling from 1.5B to 14B parameters, the method matches baseline accuracy while reducing sampling costs, achieving target accuracy up to 1.9× faster than GRPO and 4.0× faster than DAPO in cumulative rollouts.

reinforcement learning · rollout allocation · group-based rl · post-training · informativeness estimation
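The pilot-then-commit allocation described for Pilot-Commit can be sketched as follows. The informativeness score used here, the Bernoulli variance p(1 - p) of the pilot pass rate, and the proportional rounding rule are assumptions chosen for illustration; the paper's estimator and allocation rule are not given in the summary.

```python
def allocate_rollouts(pilot_rewards, remaining_budget):
    """Split the remaining rollout budget across prompts.

    pilot_rewards maps a prompt id to the 0/1 rewards of its pilot rollouts.
    """
    scores = {}
    for prompt_id, rewards in pilot_rewards.items():
        pass_rate = sum(rewards) / len(rewards)
        # Zero when all pilot rollouts agree, so such prompts get skipped.
        scores[prompt_id] = pass_rate * (1.0 - pass_rate)
    total = sum(scores.values())
    if total == 0:
        return {prompt_id: 0 for prompt_id in scores}
    # Proportional allocation, rounded down; any leftover stays unspent.
    return {pid: int(remaining_budget * s / total) for pid, s in scores.items()}

# Hypothetical pilot stage: four rollouts per prompt.
pilot = {
    "easy": [1, 1, 1, 1],   # always solved: no learning signal in a group
    "mixed": [1, 0, 1, 0],  # most informative
    "hard": [0, 0, 0, 0],   # never solved: no learning signal in a group
    "rare": [0, 0, 0, 1],
}
plan = allocate_rollouts(pilot, remaining_budget=28)
```

Prompts whose pilot rollouts all succeed or all fail receive nothing, reflecting that a group with identical rewards yields no relative advantage in group-based methods such as GRPO.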

Geometry-Aware Contrastive Learning for Few-Shot Automatic Modulation Recognition

arXiv cs.AI · Guanqun Zhao, Yitong Liu, Jiaxuan Fang, Yufei Mao · 2026-05-26

The paper introduces Dynamic-Consistency Contrastive Learning (DyCo-CL), a geometry-aware framework addressing challenges in few-shot Automatic Modulation Recognition (AMR). DyCo-CL combines Virtual Adversarial Augmentation (VAA) with a semantic consistency loss, acting as an implicit spectral regularizer for stable manifold exploration. The framework integrates a Signal-Adaptive Swin Backbone with fixed-window attention for structural stability and a Hybrid Knowledge Fusion module to incorporate physical priors. Evaluations on RML benchmarks demonstrate a 6.27% accuracy improvement in 1-shot settings compared to existing methods.

contrastive learning · automatic modulation recognition · spectral regularization · fixed-window attention · knowledge fusion

AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents

arXiv cs.AI · Haoran Zhang, Zhaohua Sun · 2026-05-26

The paper introduces AGORA, an adapter-grounded method for prompt compression in LLM agents that avoids inference overhead. It identifies structural limitations in token-level extractive compressors, showing they reduce agent performance to 73-75% of uncompressed baselines across 17 experimental configurations. A four-way ablation study reveals the structural floor as the primary quality determinant, with learned scorers enabling 1.0-11.5x adaptive compression from fixed keep ratios.

prompt compression · llm agents · extractive compressors · adapter-grounded · ablation study

Cordyceps: Covert Control Attacks on LLMs via Data Poisoning

arXiv cs.AI · Zedian Shao, Charles Fleming, Teodora Baluta · 2026-05-26

The paper introduces Cordyceps, a data poisoning method enabling covert control attacks on LLMs through semantic associations between shared knowledge and attacker-chosen phrases. Unlike fixed-trigger attacks, it teaches models an information hiding scheme for encoding/decoding malicious instructions, evading defenses. Evaluated across 5 LLMs, 3 backdoor defenses, and 4 prompt injection defenses, the method achieves 40% higher success rates than heuristic prompt injection and maintains 93-98% success post-defense.

data poisoning · covert control attacks · semantic associations · information hiding · prompt injection defenses

Examining the Challenges of Intellectual Property in AI-Generated Productions

arXiv cs.AI · Ali Mazhar, Mohammad Zare, Marjan Veysi · 2026-05-26

The paper identifies regulatory gaps in intellectual property (IP) frameworks for AI-generated works through comparative legal analysis of Iranian, EU, UK, and US systems. It examines theoretical foundations and existing laws, including Iran's 1969 Law for the Protection of Authors and Patent Registration Law, highlighting enforcement challenges. Results indicate the need for revised legislation, proposing solutions like specialized AI-generated content rights or human-agent ownership attribution to balance innovation incentives with human creativity protection.

intellectual property · ai-generated works · legal frameworks · regulatory gaps · human-agent ownership

Few-shot Cross-country Generalization of Tabular Machine Learning and Foundation Models for Childhood Anemia Prediction under Distribution Shift

arXiv cs.AI · Yusuf Brima, Marcellin Atemkeng, Lansana Hassim Kallon, David Niyukuri · 2026-05-26

This study evaluates the cross-country generalization of a transformer-based tabular foundation model, TabPFN v2.6, for childhood anemia prediction under distribution shift. Using Demographic and Health Surveys (DHS) data from 16 countries (n=68,856), the authors compare TabPFN against Logistic Regression, XGBoost, and LightGBM in full-data, leave-one-country-out (LOCO), reverse-LOCO, and few-shot settings. TabPFN outperformed classical models in low-data regimes (<200 samples), achieving the lowest Brier score (0.042) and ECE (0.203). AUC-ROC ranged from 0.59-0.76 across countries, with stable LOCO performance (0.58-0.69). SHAP analysis identified child age, altitude, and height-for-age z-score as dominant predictors. TabPFN demonstrated superior discrimination and calibration in data-scarce settings.

tabular foundation model · distribution shift · leave-one-country-out · shap analysis · auc-roc
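
For readers less familiar with the two calibration metrics quoted above, the following is a minimal sketch of how a Brier score and an expected calibration error (ECE) are commonly computed for binary predictions. The equal-width binning and the function names are assumptions made for illustration; the study may use a different binning scheme.

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_pred, dtype=float)
    return float(np.mean((p - y) ** 2))

def expected_calibration_error(y_true, p_pred, n_bins=10):
    """ECE with equal-width probability bins (an assumed, common choice)."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_pred, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using only the interior edges yields bin indices 0 .. n_bins - 1.
    idx = np.digitize(p, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            # Gap between observed event rate and mean predicted probability,
            # weighted by the share of samples that fall in this bin.
            gap = abs(y[mask].mean() - p[mask].mean())
            ece += mask.mean() * gap
    return float(ece)
```

Lower is better for both metrics: the Brier score mixes discrimination and calibration, while ECE isolates how far predicted probabilities sit from observed frequencies.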

On the Error-Correcting Effects of Stochasticity in Discrete Diffusion

arXiv cs.AI · William Yuan, Sungwon Jeong, Amirali Aghazadeh · 2026-05-26

The paper analyzes how stochasticity in Markov transitions affects the speed-quality tradeoff in discrete diffusion models, identifying redundant transitions as an error-correcting mechanism. It proposes Discrete Churn and Restart Sampling (DCRS), which injects controlled stochasticity by alternating forward/reverse diffusion processes. Experiments show DCRS achieves 10× faster sampling on image datasets without quality loss, while language tasks exhibit more context-dependent behavior.

discrete diffusion · markov transitions · error correction · stochastic sampling · dcrs

Bridging Control with Neural Network Verifier alpha-beta-CROWN: A Tutorial

arXiv cs.AI · Haoyu Li, Xiangru Zhong, Hao Cheng, Bin Hu · 2026-05-26

The tutorial introduces a unified framework for formally verifying neural network controllers in safety-critical systems by integrating control theory with the $\alpha,\!\beta$-CROWN verifier. $\alpha,\!\beta$-CROWN computes certified bounds and linear relaxations for nonlinear functions via GPU-accelerated domain partitioning and pruning, enabling scalable reachability analysis and satisfiability checking. This approach addresses limitations of prior methods by supporting general computation graphs and demonstrating superior scalability in verification tasks such as Lyapunov stability analysis.

neural network verification · control synthesis · reachability analysis · lyapunov theory · gpu parallelization

MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

arXiv cs.AI · Yuhao Shen, Lang Cao, Simo Du, Yuqing Wang · 2026-05-26

The authors introduce MedGuideX, a medical LLM trained on executable clinical decision logic derived from practice guidelines (CPGs). Their pipeline transforms CPG recommendations into factual/counterfactual QA pairs, teaching models both guideline-compliant decisions and their conditional variations. Post-training on this data yields a 10.28% relative accuracy gain across four clinical reasoning benchmarks, with physician evaluations confirming superior faithfulness, validity, and completeness in rationales compared to baseline approaches.

clinical practice guidelines · counterfactual reasoning · medical llm · clinical decision logic · scalable supervision

Reliable Extraction of Clinical Follow-Up Instructions: A Hybrid Neural-Symbolic Pipeline

arXiv cs.AI · Michal Laufer, Yehudit Aperstein, Alexander Apartsin · 2026-05-26

The paper introduces a hybrid neural-symbolic pipeline for extracting (action, date) pairs from clinical follow-up instructions, outperforming generative baselines. The method combines BioBERT-based BIO tagging and biaffine linking with deterministic time normalization, using a 28-action ontology for canonicalization. Evaluated on a 2,000-note synthetic corpus, the pipeline achieves near-perfect Test-Time Pair F1 (0.997 seen, 0.986 OOV) with 0.00-day MAE, while GPT-4o-mini and LoRA-tuned LLaMA-3 8B score below 0.57 Pair F1 due to implicit arithmetic limitations.

hybrid neural-symbolic · bio tagging · biaffine linker · time normalization · synthetic corpus
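
To illustrate why a deterministic time-normalization step can avoid the implicit arithmetic errors attributed to generative baselines, here is a minimal sketch that resolves a relative follow-up phrase against a note date. The tiny phrase grammar and the function name are hypothetical and are not taken from the paper's pipeline.

```python
import calendar
import re
from datetime import date, timedelta

_UNIT_DAYS = {"day": 1, "week": 7}

def normalize_relative_date(expr, anchor):
    """Resolve phrases like 'in 2 weeks' relative to the anchor (note) date.

    Returns None for phrases outside this toy grammar.
    """
    m = re.fullmatch(r"in (\d+) (day|week|month)s?", expr.strip().lower())
    if m is None:
        return None
    n, unit = int(m.group(1)), m.group(2)
    if unit in _UNIT_DAYS:
        return anchor + timedelta(days=n * _UNIT_DAYS[unit])
    # Months: calendar arithmetic, clamping the day to the target month's length.
    total = anchor.month - 1 + n
    year = anchor.year + total // 12
    month = total % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(anchor.day, last_day))

# Example: a note dated 2026-05-26 that says "in 2 weeks".
follow_up = normalize_relative_date("in 2 weeks", date(2026, 5, 26))  # date(2026, 6, 9)
```

Because the date arithmetic is rule-based, the neural components only need to tag and link the spans; they never have to compute the date themselves.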

Auditing and Fixing Economic Validity in Tabular Foundation Models for Discrete Choice

arXiv cs.AI · Yingshuo Wang, Xian Sun, Yanhang Li, Zhichao Fan · 2026-05-26

The paper introduces a two-stage adapter that ensures economic validity in tabular foundation models for discrete choice prediction. First, it estimates a constrained choice model adhering to utility-maximization principles, then freezes these parameters to train a correction term incorporating the foundation model's predictions. This hybrid approach guarantees monotonic price-demand relationships and computable trade-off measures while preserving accuracy. On transportation datasets, the adapter improves accuracy by up to 13 percentage points over standard logit models while maintaining perfect economic consistency.

tabular foundation models · discrete choice prediction · utility-maximization · logit model · economic consistency

Linear and Neural Dueling Bandits with Delayed Feedback

arXiv cs.AI · Xiangyi Wang, Pingchen Lu, Jie Mao, Mingze Kong · 2026-05-26

The authors introduce Linear (LDB-DF) and Neural (NDB-DF) Dueling Bandits with Delayed Feedback, which address delayed feedback in contextual dueling bandits, a problem central to preference-based decision-making and LLM alignment. They propose an estimator that incorporates Inverse Probability Weighting (IPW) into the loss function to correct for delayed or missing feedback and keep estimation unbiased. Theoretical analysis establishes an $O(d\sqrt{T})$ regret bound for the linear setting and sublinear guarantees for the neural setting. Empirical validation on simulated and real-world datasets demonstrates the effectiveness of the proposed algorithms.

contextual dueling bandits · inverse probability weighting · delayed feedback · regret bound · preference-based decision-making
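
The general idea of folding inverse probability weighting into a preference loss can be sketched as follows. This is a generic IPW-weighted logistic (Bradley-Terry style) loss under assumed inputs, not the estimator defined in the paper; `x_diff`, `observed`, and `p_obs` are illustrative names.

```python
import numpy as np

def ipw_logistic_loss(theta, x_diff, y, observed, p_obs):
    """IPW-weighted negative log-likelihood for pairwise preferences.

    theta    : (d,) parameter vector
    x_diff   : (n, d) feature differences between the two dueling arms
    y        : (n,) 1 if the first arm won, else 0 (ignored where unobserved)
    observed : (n,) 1 if the feedback has arrived, else 0
    p_obs    : (n,) probability that feedback has arrived by now (assumed known)
    """
    z = x_diff @ theta
    sign = 2.0 * np.asarray(y, dtype=float) - 1.0
    # Numerically stable log(1 + exp(-sign * z)).
    nll = np.logaddexp(0.0, -sign * z)
    # Each arrived sample is up-weighted by 1 / p_obs; missing samples get weight 0.
    weights = np.asarray(observed, dtype=float) / np.asarray(p_obs, dtype=float)
    return float(np.mean(weights * nll))
```

Since the expected value of `observed / p_obs` is 1 when `p_obs` is the true arrival probability, the weighted loss matches the full-feedback loss in expectation, which is the usual argument for why IPW removes the bias from missing feedback.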

Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference

arXiv cs.AI · Jaewoo Lee, Hyeongyu Kang, Dohyun Kim, Kyuil Sim · 2026-05-26

The paper introduces FAV, a framework for aligning few-step generative models without restrictive assumptions about likelihood tractability or solver types. FAV formulates alignment as sampling from a reward-tilted distribution anchored to a reference, using Stein Variational Gradient Descent for sample-based variational inference and amortizing particle updates via fixed-point regression. Evaluations show FAV outperforms policy extraction baselines on 86 robotic manipulation tasks (56 offline, 30 offline-to-online) and scales to text-to-image synthesis (256×256 to 1024×1024) across GANs, diffusion models, consistency models, and flow maps.

few-step generative models · stein variational gradient descent · fixed-point regression · sample-based variational inference · generative policy alignment
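
As background for the FAV summary, the sketch below shows one standard Stein Variational Gradient Descent (SVGD) update with an RBF kernel, applied to a toy reward-tilted target proportional to p_ref(x) * exp(r(x) / beta). The Gaussian reference, the quadratic reward, and all function names are assumptions chosen so the example is self-contained; the amortization via fixed-point regression described in the paper is not shown.

```python
import numpy as np

def svgd_step(x, grad_log_p, step=0.1, bandwidth=1.0):
    """One SVGD update for particles x of shape (n, d) with an RBF kernel."""
    diff = x[:, None, :] - x[None, :, :]           # diff[i, j] = x_i - x_j
    sq_dist = np.sum(diff ** 2, axis=-1)
    k = np.exp(-sq_dist / (2.0 * bandwidth ** 2))  # symmetric kernel matrix
    g = grad_log_p(x)                              # (n, d) score at each particle
    attract = k @ g                                # kernel-smoothed score (drives to high density)
    repulse = np.sum(k[:, :, None] * diff, axis=1) / bandwidth ** 2  # keeps particles apart
    return x + step * (attract + repulse) / x.shape[0]

def grad_log_ref(x):
    """Score of the toy reference distribution N(0, 1)."""
    return -x

def grad_reward(x):
    """Gradient of the toy reward r(x) = -(x - 4)^2 / 2."""
    return -(x - 4.0)

def tilted_grad(x, beta=1.0):
    """Score of the reward-tilted target p_ref(x) * exp(r(x) / beta)."""
    return grad_log_ref(x) + grad_reward(x) / beta
```

With beta = 1 the toy target is a Gaussian centered at 2, so repeated calls to `svgd_step(x, tilted_grad)` should move the particle cloud toward that region while the repulsive term prevents collapse to a single point.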

MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

arXiv cs.AI · Runxi Huang, Liyu Zhang, Shengzhong Liu, Xiaomin Ouyang · 2026-05-26

MobileExplorer accelerates on-device inference for vision-based mobile GUI agents by leveraging online exploration during VLM reasoning. The framework performs lightweight, parallel exploration of UI elements, recording traces as structured memory and summarizing them into contextual hints for prompt injection. A two-level rollback mechanism ensures reliable execution in live mobile environments. Evaluated on AndroidWorld and complex tasks across off-the-shelf devices, MobileExplorer reduces reasoning steps and latency by 23% while improving task success rates by up to 5%.

gui agents · online exploration · vision-language models · rollback mechanism · structured memory

PolyFusionAgent: A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design

arXiv cs.AI · Manpreet Kaur, Xingying Zhang, Qian Liu · 2026-05-26

The authors introduce PolyFusionAgent, a multimodal framework integrating a polymer foundation model (PolyFusion) with an autonomous design agent (PolyAgent) for property prediction and inverse design. PolyFusion learns a shared latent space across sequence, topology, 3D geometry, and fingerprint representations of millions of polymers, enabling transferable property prediction and conditioned generation of novel structures. PolyAgent completes the loop via literature-grounded hypothesis generation and evaluation, yielding evidence-backed polymer discovery. The system demonstrates improved thermophysical property prediction and chemically valid generation beyond reference spaces.

multimodal foundation model · inverse design · latent space alignment · thermophysical property prediction · tool-augmented agent

ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation

arXiv cs.AI · Xiaochong Jiang, Shiqi Yang, Ziwei Li, Lifei Liu · 2026-05-26

ChainCaps introduces a runtime capability system for tool-using agents that prevents permission laundering through monotonic capability attenuation. The method assigns sink-specific capability budgets to values, propagating them via intersection during tool composition, ensuring authority can only decrease. Implemented as a transparent MCP proxy, it requires no agent or server modifications. Evaluated on 82 tasks across five frontier models, ChainCaps reduced attack success rates from 25-68% to 0-4.8% while maintaining 96-100% benign completion, outperforming scalar-IFC and per-function-isolation baselines. Expert manifests achieved 100% attack blocking versus 27.3% for naive ones.

permission laundering · capability attenuation · tool composition · mcp proxy · explicit-flow safety
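
The monotonic attenuation idea in the ChainCaps summary can be sketched in a few lines: every value carries the set of sinks it may flow to, and composing values intersects those sets, so authority can only shrink. The `Labeled`, `compose`, and `send_to_sink` names are hypothetical, and modeling a capability budget as a plain set of sink names is a simplifying assumption rather than the paper's design.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Labeled:
    """A value together with the sinks it is allowed to reach."""
    value: object
    caps: frozenset

def compose(tool, *args):
    """Run a tool on labeled inputs; the result keeps only the shared capabilities."""
    if not args:
        raise ValueError("compose needs at least one labeled argument")
    caps = frozenset.intersection(*(a.caps for a in args))
    return Labeled(tool(*(a.value for a in args)), caps)

def send_to_sink(sink, arg):
    """Deliver a value to a sink only if its capability set allows it."""
    if sink not in arg.caps:
        raise PermissionError(f"value may not flow to sink {sink!r}")
    return f"{sink} <- {arg.value}"

# Example: mixing a restricted value with a public one yields a restricted result.
secret = Labeled("api-key-123", frozenset({"log"}))
public = Labeled("hello ", frozenset({"log", "email"}))
merged = compose(lambda a, b: b + a, secret, public)
```

In this sketch `merged` can still be written to the `log` sink but not to `email`, which is the behavior that blocks laundering a restricted value through an intermediate tool.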

DGLD: Domain-Gated Latent Diffusion for the Discovery of Novel Energetic Materials

arXiv cs.AI · Yehudit Aperstein, Alexander Apartsin · 2026-05-26

The paper introduces Domain-Gated Latent Diffusion (DGLD), a novel generative framework for discovering high-performance energetic materials. DGLD addresses sparse-label challenges through label-quality gating during training and multi-task score-model guidance during sampling, validated by a four-stage chemistry funnel ending in DFT audit. The method produces 12 DFT-confirmed novel compounds, including 3,4,5-trinitro-1,2-isoxazole (ρ=2.09 g/cm³, D=8.25 km/s) and 4-nitro-1,2,3,5-oxatriazole (D=9.00 km/s), both structurally distinct from training data. Comparative benchmarks show DGLD outperforms SMILES-LSTM (18.3% memorization), SELFIES-GA (3.5 km/s performance drop), and REINVENT 4 (D=9.02 km/s peak). Code and 918 hard negatives are released on Zenodo (DOI 10.5281/zenodo.19821953).

latent diffusion · energetic materials · dft validation · multi-task guidance · sparse-label problem

Recursive Flow Matching

arXiv cs.AI · Jiahe Huang, Sihan Xu, Sharvaree Vadgama, Rose Yu · 2026-05-26

Recursive Flow Matching (RecFM) introduces a generative framework for forecasting complex spatiotemporal dynamics, addressing the speed-fidelity trade-off in physics-based tasks. RecFM enforces self-consistency across discretization scales to reduce errors and improve performance, achieving high-fidelity one- and few-step dynamic generation comparable to state-of-the-art multi-step solvers. It demonstrates a 20× speedup over leading diffusion-based emulators and reduces mean squared error by over 15% compared to vanilla flow matching, offering a scalable solution for real-time scientific emulation.

recursive flow matching · spatiotemporal dynamics · self-consistency · diffusion-based emulators · mean squared error

A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection

arXiv cs.AI · Malikussaid, Imad Gohar · 2026-05-26

A hybrid vision-language architecture is proposed for automated defect reasoning and report generation in industrial inspection, specifically for wind turbine blade inspection. The pipeline comprises three components: a YOLO26-x-obb detector for defect localization, a deterministic encoding module for spatial token mapping, and a QLoRA-adapted Qwen-2.5-1.5B model for structured JSON report generation, enhanced with Retrieval-Augmented Fine-Tuning. Evaluated against a monolithic vision-language model baseline, the complete system achieves BLEU-4 0.41, Hallucination Rate 4%, and Expert Score 8.6/10, significantly outperforming the baseline (BLEU-4 0.07, HR 65%, Expert Score 3.3/10). The QLoRA-adapted 1.5B model generates higher-quality reports than a 671B-parameter generalist API model, at 47 tokens per second on a single T4-class GPU.

yolo26-x-obb · qlora · retrieval-augmented fine-tuning · structured json report · hallucination rate

Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning

arXiv cs.AI · Chen Linze, Cai Yufan, Hou Zhe, Dong Jin Song · 2026-05-26

The paper introduces LexGuard, an adversarial multi-agent framework for trustworthy legal AI, addressing the challenge of distinguishing legally relevant from irrelevant changes. LexGuard formalizes statutes into executable constraints, employs adversarial agents to extract competing fact-statute arguments, and uses SMT solvers to verify legal satisfaction and logical consistency. Evaluated across judicial fairness, robustness, and statute-confusion scenarios, LexGuard improves legal reasoning reliability by reducing vulnerability to manipulative framing, enhancing disambiguation among similar statutes, limiting irrelevant attribute influence, and increasing consistency under benign reformulations. Results show existing legal LLMs often fail to distinguish legally material changes, while LexGuard achieves calibrated sensitivity.

legal ai · adversarial multi-agent · smt solvers · legal reasoning · statute formalization

ReCA: Multi-Shot Long Video Extrapolation via Recursive Context Allocation

arXiv cs.AI · Akide Liu, Jinbo Xing, Chaojie Mao, Ye Li · 2026-05-26

The paper introduces Multi-Shot Video Extrapolation (MSVE), a task extending observed frames into cinematically structured shots while preserving anchor state and narrative intent. It identifies three bottlenecks in long-video generation: over-specified global planners, diluted shot-level prompts, and temporal chaining causing state decay. The proposed Recursive Context Allocation (ReCA) framework hierarchically decomposes MSVE into context-bounded subproblems, invokes frozen generators, and propagates structured state updates. Evaluated on MSVE-Bench and NB-Q, ReCA improves normalized scores by 8-16% over competitors and multi-shot consistency by 28-43%.

multi-shot video extrapolation · recursive context allocation · cinematic structure · temporal chaining · context allocation

CmIVTP: Cross-modal Interaction-based Vessel Trajectory Prediction for Maritime Intelligence

arXiv cs.AI · Yuxu Lu, Dong Yang, Xiaoyu Li, Mengwei Bao · 2026-05-26

The paper proposes CmIVTP, a cross-modal interaction-based vessel trajectory prediction framework for maritime intelligence, addressing limitations of single-source data in maritime transportation systems. The method integrates AIS-derived motion features, CCTV-based environmental features, and scene representations via a cross-modal interaction transformer, leveraging cross-modal attention mechanisms for intra-modal semantics and inter-modal interactions. It introduces a target-aware scene encoder for vessel-environment interactions and constructs a vessel group trajectory bank for scalable candidate trajectory generation. Evaluated on the Maritime-MmD$^+$ dataset, CmIVTP demonstrates superior performance on multimodal-driven vessel trajectory prediction benchmarks.

cross-modal interaction · vessel trajectory prediction · automatic identification system · target-aware scene encoder · maritime multimodal dataset

StreamSplit: Continuous Audio Representation Learning via Uncertainty-Guided Adaptive Splitting

arXiv cs.AI · Minh K. Quan, Pubudu N. Pathirana · 2026-05-26

StreamSplit introduces a framework for continuous contrastive learning (CL) on edge devices by addressing the conflict between volatile resources and large-batch requirements. The method combines (1) a distribution-based streaming framework with a Hybrid Loss to decouple representation quality from local batch size and (2) an Uncertainty-Guided Adaptive Splitter using lightweight RL to dynamically partition computation based on real-time resource monitoring and embedding ambiguity. Evaluations on heterogeneous ARM platforms (Raspberry Pi 4 to Apple M2) show 4.7x lower latency, 77.1% bandwidth reduction, and 52.3% energy savings while staying within 2.2% of the accuracy of server-centric baselines.

contrastive learning · edge computing · reinforcement learning · representation learning · resource optimization

InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

arXiv cs.AI · Zhiwei Ning, Wenwen Tong, Xiangli Kong, Shengnan Ma · 2026-05-26

InterSketch introduces an interleaved visual-textual chain-of-thought (VT-CoT) model for complex visual reasoning, addressing the text-centric limitations of current VLMs. The method combines dynamic visual sketch generation via external tools with textual reasoning, enhanced by a two-stage training approach: (1) cold-start training on synthesized VT-CoT data with reflection for self-correction, and (2) RL fine-tuning with stepwise rewards to mitigate long-horizon reward sparsity. Evaluations on visual reasoning benchmarks show InterSketch outperforms proprietary models like Gemini-3-Pro.

vision-language models · chain-of-thought · self-correction · stepwise reward · visual reasoning

CSV-ViT: A Vision Transformer with the Variable-sized Cortical Supervertices for Detection of Alzheimer's Disease Pathologies

arXiv cs.AI · Geonwoo Baek, Ikbeom Jang · 2026-05-26

The study introduces CSV-ViT, a Vision Transformer variant for Alzheimer's disease (AD) pathology detection using structural MRI, addressing limitations of spherical cortical surface processing. The method employs ROI-preserving cortical supervertices (CSVs) for variable-sized patch tokenization, coupled with mask-aware patch embedding to handle non-uniform inputs. Evaluated on T1-weighted MRI for AD diagnosis and amyloid/tau positivity classification, CSV-ViT outperforms existing surface-based models, suggesting utility as a PET/CSF prescreening tool.

cortical supervertices · vision transformer · non-euclidean manifolds · mask-aware embedding · alzheimer's disease

Foundations of a Time-Consistent Counterfactual Actuarial Runtime for Autonomous AI Agents

arXiv cs.AI · Hao-Hsuan Chen · 2026-05-26

The paper introduces a runtime actuarial framework for autonomous AI agents, where each action with side effects incurs a time-consistent counterfactual risk toll computed against a contractually fixed safe default. The method formalizes per-action insurance as the primary unit, replacing post-hoc liability with a pre-action transaction layer. Key results include: (i) well-defined counterfactual tolls under non-unique safe-default mappings, (ii) a no-splitting property for gaming-resistant boundary design, (iii) an irreversible-authority premium, and (iv) a runtime gating theorem for action-budget guarantees. The framework serves as a base for empirical, mechanism-design, and dynamic-underwriting extensions.

counterfactual risk toll · actuarial runtime · underwriting boundary · irreversible-authority premium · runtime gating

Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization

arXiv cs.AI · Xiang Fang, Wanlong Fang, Changshuo Wang · 2026-05-26

The paper introduces Multi-Modal Adversarial Synergy (MMAS), a black-box framework for generating universal adversarial attacks against Vision-Language Models (VLMs). MMAS jointly optimizes texture-constrained image perturbations via wavelet transforms and L-norm-bounded text prompt perturbations, enhanced by cross-modal gradient alignment. Experiments demonstrate strong attack transferability across tasks and models, revealing VLMs' vulnerability to multi-modal adversarial synergy.

vision-language models · adversarial attacks · wavelet transforms · cross-modal optimization · black-box attacks

Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling

arXiv cs.AI · Fengfa Li, Hongjin Ji, Yifeng Ding, Lei Ren · 2026-05-26

Dense2MoE introduces a unified framework for converting dense LLMs into efficient Mixture-of-Experts (MoE) models via simultaneous pruning and upcycling. The method employs Layer Fusion UpCycling (LF-UC) to prune bandwidth-heavy attention modules from redundant layers while repurposing their MLPs as MoE experts, guided by hardware Roofline theory to overcome memory bottlenecks. Experiments show the approach advances the Pareto frontier for on-device inference, outperforming dense baselines and prior compression methods with modest continual pre-training costs.

mixture-of-experts · layer pruning · on-device inference · roofline theory · token routing

The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

arXiv cs.AI · MiniMax, Aili Chen, Aonian Li · 2026-05-26

The MiniMax-M2 series introduces a family of Mixture-of-Experts language models optimized for agentic deployment, featuring 229.9B total parameters with only 9.8B activated per token. The architecture combines agent-driven data pipelines (producing verifiable trajectories), Forge (a scalable RL system with windowed-FIFO scheduling and prefix-tree merging), and self-evolving capabilities (e.g., autonomous debugging). The M2.7 checkpoint demonstrates frontier-tier performance on agentic coding, deep search, office-task, and reasoning benchmarks while maintaining minimal activation footprints.

mixture-of-experts · agentic deployment · windowed-fifo scheduling · prefix-tree merging · self-evolution

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

arXiv cs.AI · Sil Hamilton, David Mimno · 2026-05-26

The study identifies low lexical diversity in LLM-generated stories, attributing it to alignment data biases rather than pre-training corpora. Researchers sampled 20,000 stories from four models using five prompts, finding 11 high-frequency tokens (e.g., 'Elias', 'lighthouse', 'clockmaker') occurring in 88.3% of outputs. These terms appear disproportionately in preference data compared to published literature or base model training sets. Notably, alignment appears to suppress both stereotypical outputs (e.g., copyrighted characters) and diverse generations, demonstrating how small preference datasets can disproportionately shape model behavior.

lexical diversity · alignment data · preference datasets · stereotypical outputs · high-frequency tokens

Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient

arXiv cs.AI · Haoxiang You, Yilang Liu, Davis Zong, Qian Wang · 2026-05-26

The paper introduces stochastic decoupled policy gradient (SDPG), an efficient on-policy visual-RL method for training diverse visuomotor control policies. SDPG leverages random perturbations of trajectory rollouts to estimate policy gradients, significantly reducing computational and memory overhead compared to baseline methods. Evaluated on visual MuJoCo benchmarks, SDPG demonstrates superior performance in training time, memory efficiency, and reward accumulation. The authors also present a suite of realistic visual robotics benchmarks to facilitate future research, showcasing successful sim-to-real transfer on physical hardware.

stochastic decoupled policy gradient · visual reinforcement learning · visuomotor control · sim-to-real transfer · mujoco benchmarks

Comparative Study of Vision-Based Metric Measurement for Large-Scale Planar Scenes

arXiv cs.AI · ZhiXin Sun · 2026-05-26

This study evaluates three vision-based methods for metric measurement in large-scale planar scenes: geometry-based monocular ranging, image stitching with bird's-eye-view transformation, and stereo-based ranging. Monocular ranging achieves meter-level accuracy with sufficient camera pitch angles, while stereo-based methods reach decimeter-level precision and exhibit robustness to pitch variations. Image stitching proves effective for small-scale mapping but suffers from stability and scalability issues in larger environments. The comparative analysis highlights trade-offs in accuracy, robustness, and scalability across methods.

monocular ranging · image stitching · stereo-based ranging · metric measurement · planar scenes
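
The pitch sensitivity of monocular ranging and the relative robustness of stereo ranging both follow from textbook pinhole geometry, sketched below. These are the standard flat-ground and rectified-stereo formulas, not necessarily the exact formulations used in the study, and the parameter names are illustrative.

```python
import math

def ground_distance(v, cam_height, pitch_rad, fy, cy):
    """Horizontal distance to a ground-plane point seen at image row v.

    Assumes a pinhole camera at height cam_height above a flat ground plane,
    tilted downward by pitch_rad, with focal length fy and principal row cy
    (both in pixels).
    """
    # Angle of the viewing ray below the horizontal.
    angle = pitch_rad + math.atan((v - cy) / fy)
    if angle <= 0.0:
        return math.inf  # the ray never meets the ground plane
    return cam_height / math.tan(angle)

def stereo_depth(focal_px, baseline_m, disparity_px):
    """Depth from a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0.0:
        return math.inf
    return focal_px * baseline_m / disparity_px
```

In the monocular formula, small ray angles make `tan(angle)` small, so a slight pitch error produces a large distance error; the stereo formula does not involve the pitch at all.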

Diffuse to Detect: Generative Diffusion Models for Unsupervised IC Anomaly Detection

arXiv cs.AI · Yuxuan Yin, Chen He, Todd Jacobs, Jialei He · 2026-05-26

We introduce an unsupervised anomaly detection framework leveraging Diffusion Transformers for latent defect screening in IC manufacturing. The method compresses raw test measurements via autoencoder, structures them into token sequences enriched with sinusoidal and wafer-position embeddings, and derives anomaly scores from noise-prediction errors during mid-range diffusion timesteps. This approach eliminates the need for labeled anomalies or manual feature engineering while enabling interpretable failure localization through latent-space reconstruction residuals. The framework achieves state-of-the-art performance on industrial 16nm IC test data under extreme class imbalance, demonstrating effective wafer-scale screening capabilities.

diffusion transformer · unsupervised anomaly detection · latent defect screening · noise-prediction error · reconstruction residuals

Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records

arXiv cs.AI · Yeonsu Kwon, Jiho Kim, Junseong Choi, Paloma Rabaey · 2026-05-26

The paper introduces EHR-ReasonCon, a reasoning-intensive benchmark for verifying consistency between unstructured clinical notes and structured tables in Electronic Health Records (EHRs). Built on MIMIC-III with expert-guided annotations, it contains 8,048 entities and employs specialized table-exploration tools for systematic evidence retrieval. The authors also propose EHR-Inspector, an LLM-based framework that segments notes, extracts anchor entities and temporal references, and verifies consistency against structured tables. Evaluated using expert-validated LLM-as-a-judge metrics, EHR-Inspector achieves state-of-the-art performance across multiple model backbones, with analyses highlighting component effectiveness and human-verification differences.

ehr-reasoncon · mimic-iii · llm-based framework · consistency verification · table-exploration tools

AnchorDiff: Training-Free Concept Grounding for MM-DiTs via Anchor-Based Graph Propagation

arXiv cs.AI · Jian Zhang, Zhijun Zhang · 2026-05-26

AnchorDiff introduces a training-free concept grounding method for Multi-Modal Diffusion Transformers (MM-DiTs) to mitigate concept leakage, where attention-based methods produce overlapping activations on visually confusable concepts. The approach decouples semantic localization from structural refinement by selecting a high-confidence anchor from concept-to-image attention, propagating it via a hybrid graph derived from image-to-image self-attention with output-space similarity and row-wise attention gates. Evaluated on ImageNet-Segmentation, PascalVOC, and a new Multi-Concept Confusion Dataset, AnchorDiff demonstrates strong grounding performance while significantly reducing concept leakage.

multi-modal diffusion transformers · concept leakage · training-free grounding · hybrid graph propagation · attention gates

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

arXiv cs.AI · Anmol Agarwal, Natalie Neamtu, Pranjal Aggarwal, Seungone Kim · 2026-05-26

The paper introduces Verus-SpecBench and Verus-SpecGym, a benchmark and agentic environment for evaluating LLM-based autoformalization of informal programming specifications into verifiable Rust specs. The method extends Verus's exec_spec to execute generated specs as Rust code and validates them against Codeforces test cases and adversarial 'hacks'. Results show Gemini 3.1 Pro achieves 77.8% success, while other frontier models range 51.1–57.8% and OSS models 21.5–25.5%, with failure analysis revealing omitted assumptions and incorrect output validation. LLM-as-judge evaluation misses 26% of failures detected by the authors' method.

autoformalization · formal verification · llm agents · rust verifier · adversarial testing

Cross-scale Aligned Supervision for Training GANs

arXiv cs.AI · Sangeek Hyun, MinKyu Lee, Jae-Pil Heo · 2026-05-26

The paper challenges the interpretation of multi-stage GAN synthesis as coarse-to-fine generation, identifying a cross-scale trajectory misalignment problem where scale-wise adversarial supervision fails to enforce sample consistency across resolutions. It proposes CAT (Cross-scale Aligned Transformer), which maintains scale-wise discriminators while adding generator-side consistency regularization to align intermediate outputs with the final image. On ImageNet-256, CAT-H/2 achieves an FID-50K of 1.56 with single-step inference after 60 epochs, surpassing one-step GAN and diffusion baselines.

generative adversarial networks · multi-scale synthesis · consistency regularization · trajectory alignment · image generation

DDGAD: Trajectory Dynamics for Diffusion-Based Graph Anomaly Detection

arXiv cs.AI · Yuxin Yang, Limei Hu, Feng Chen · 2026-05-26

DDGAD introduces a diffusion-based graph anomaly detection framework leveraging trajectory dynamics to address contamination propagation in GCN-based methods. The approach distinguishes normal and anomalous nodes by analyzing representation trajectories under diffusion regularization and reliability-aware neighborhood consensus. Normal nodes exhibit stable trajectories, while anomalous nodes show instability due to conflicts between global manifold priors and locally contaminated message passing. The method employs a distributed reliability-aware consensus refinement mechanism and defines three anomaly signals: neighbor inconsistency, reliability weight, and dynamical conflict energy. Experiments on five real-world datasets validate the framework's effectiveness.

graph anomaly detection · diffusion regularization · contamination propagation · trajectory dynamics · reliability-aware consensus

Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

arXiv cs.AI · Xiang Fang, Zeyu Xiong, Wanlong Fang, Xiaoye Qu · 2026-05-26

The paper proposes a game-theoretic approach for weakly-supervised video temporal grounding, addressing limitations in cross-modal granularity and moment proposal complexity. It models video frames and query words as cooperative game players, using multivariate game theory to quantify frame-word interactions for multi-level alignment. This eliminates reliance on pre-defined moment proposals, instead using learned query-guided frame scores for localization. The method achieves state-of-the-art performance on Charades-STA and ActivityNet Captions benchmarks.

weakly-supervised learning · video temporal grounding · cooperative game theory · cross-modal interaction · moment localization

Aperiodic and Low-Frequency Spectral Bias in Reconstruction based EEG Foundation Models

arXiv cs.AI · Aditya Kommineni, Emily Zhou, Kleanthis Avramidis, Simon Bock Segaard · 2026-05-26

This work identifies a spectral bias in reconstruction-based EEG foundation models, explaining their underperformance in low-resource settings compared to supervised models. Through controlled experiments with synthetic EEG signals and linear probe evaluations on real-world BCI datasets, the authors demonstrate that these models predominantly capture aperiodic components and subject identity, while underrepresenting high-frequency oscillatory components critical for task-relevant information. The findings reveal a fundamental mismatch between reconstruction objectives and EEG signal structure, motivating future work to incorporate auxiliary losses targeting high-frequency oscillatory features for improved generalization.

eeg foundation models · spectral bias · aperiodic components · oscillatory components · linear probe evaluations

Structure-Adaptive Conformal Inference for Large-Scale Out-of-Distribution Testing

arXiv cs.AI · Rongyi Sun, Wenguang Sun, Zinan Zhao · 2026-05-26

The paper introduces structure-adaptive conformal q-value (SCQ) and pseudo-score-guided transductive automated model selection (P-TAMS) for structured out-of-distribution (OOD) testing. SCQ integrates individual test evidence with structural patterns, while P-TAMS adapts conformalized model selection across candidate models under pairwise exchangeability. The unified framework provides finite-sample error-rate control, improved power, and interpretability. Experiments on simulated and real data confirm false discovery rate control and robust performance across diverse settings.

conformal inference · out-of-distribution testing · false discovery rate · pairwise exchangeability · model selection
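
For readers new to conformal OOD testing, the generic recipe the summary builds on can be sketched as follows: turn each test score into a conformal p-value against held-out in-distribution scores, then apply the Benjamini-Hochberg procedure to control the false discovery rate. This is the textbook baseline, not the paper's SCQ or P-TAMS; the function names, the toy integer scores, and the "higher score means more anomalous" convention are assumptions made for illustration.

```python
def conformal_p_values(calibration_scores, test_scores):
    """Conformal p-value per test point (higher score = more anomalous)."""
    n = len(calibration_scores)
    p_values = []
    for s in test_scores:
        # Count calibration points at least as extreme as the test point.
        at_least_as_extreme = sum(1 for c in calibration_scores if c >= s)
        p_values.append((1 + at_least_as_extreme) / (n + 1))
    return p_values


def benjamini_hochberg(p_values, alpha=0.1):
    """Indices rejected by the Benjamini-Hochberg procedure at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank  # keep the largest rank that passes
    return sorted(order[:cutoff])


# Toy data (hypothetical): 49 in-distribution scores and 4 test scores.
calibration = list(range(1, 50))
test = [5, 50, 12, 75]
p = conformal_p_values(calibration, test)
flagged = benjamini_hochberg(p, alpha=0.1)
```

Here `flagged` holds the indices of test points declared out-of-distribution. The baseline treats every test point independently; the summary's point is that SCQ additionally uses structural patterns across test points.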

Uniboost: Global Coordination with Value Alignment for Fair and Efficient Traffic Allocation

arXiv cs.AI · Ge Fan, Nan Zhao, Kai Meng, Cong Luo · 2026-05-26

Uniboost introduces a unified traffic allocation framework for recommendation systems, addressing issues of coupled allocation plans, score inflation, and interpretability. It employs a posterior value alignment mechanism to calibrate abstract model scores to business metrics and an independent linear boosting paradigm to decouple complex weighting schemes. Online A/B tests and data analysis demonstrate that Uniboost reduces unintended business interference, provides macro-level insights via post-hoc analyses, and introduces the 'Effective Completion Score' as a reliable anchor metric. Results show improved micro-level traffic allocation efficiency and macro-level guidance for system iteration.

traffic allocation · value alignment · linear boosting · recommendation systems · post-hoc analysis

When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control

arXiv cs.AI · Guilin Zhang, Chuanyi Sun, Kai Zhao, Shahryar Sarkani · 2026-05-26

RLScale-Bench introduces a reproducible benchmark for evaluating deep reinforcement learning (DRL) in adaptive resource control, comparing six DRL algorithms (PPO, DQN, A2C, SAC, TD3, DDPG) against a calibrated rule-based baseline. The study conducts 240 runs across six workload patterns and five seeds, focusing on Kubernetes Horizontal Pod Autoscaling. Results reveal that the calibrated baseline outperforms all DRL algorithms in cost efficiency across workloads, though DRL agents excel in handling bursty and flash traffic. Discrete-action algorithms reduce constraint violations by one to two orders of magnitude compared to continuous-action ones. The findings emphasize the importance of baseline calibration, reward engineering, and realistic evaluation protocols over algorithm selection.

deep reinforcement learning · adaptive resource control · kubernetes horizontal pod autoscaling · rule-based baseline · constraint violations

The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP

arXiv cs.AI · Kahyeon Nam, Hyesong Choi · 2026-05-26

The paper introduces LRA-EE (Layer-wise Representation-Aware Early Exit), a method to mitigate Quantization-Induced Representation Collapse (QIRC) in INT8-quantized CLIP models. QIRC arises from activation noise accumulation in transformer blocks, degrading cosine alignment for zero-shot retrieval. LRA-EE combines Spatio-Semantic Aggregation (global patch-token averaging), a multi-feature gate (confidence, top-2 margin, spatial variance), and Layer-adaptive Confidence Thresholding. On ImageNet-1K zero-shot, it reduces FLOPs by 13.4% and improves Top-1 accuracy by 2.44 percentage points (58.72% to 61.16%), rescuing 9.5% of samples lost to noise at full depth.

quantization-induced representation collapse · early exit · spatio-semantic aggregation · layer-wise noise-to-signal ratio · zero-shot retrieval
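
A rough sketch of how a multi-feature early-exit gate with per-layer confidence thresholds might be wired together. The summary names the three signals (confidence, top-2 margin, spatial variance) but not how they are computed or combined, so everything here is an assumption: the AND-combination, the default thresholds, the use of per-patch scores for the variance term, and the names `gate_passes` and `first_exit_layer`.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_passes(logits, patch_scores, conf_min, margin_min=0.2, variance_max=0.05):
    """Hypothetical gate: exit only if all three signals look reliable."""
    probs = sorted(softmax(logits), reverse=True)
    confidence = probs[0]                      # top-1 probability
    margin = probs[0] - probs[1]               # top-2 margin
    mean = sum(patch_scores) / len(patch_scores)
    variance = sum((s - mean) ** 2 for s in patch_scores) / len(patch_scores)
    return confidence >= conf_min and margin >= margin_min and variance <= variance_max


def first_exit_layer(layer_outputs, conf_thresholds):
    """Index of the first layer whose gate passes, else the final layer."""
    for index, (logits, patch_scores) in enumerate(layer_outputs):
        if gate_passes(logits, patch_scores, conf_thresholds[index]):
            return index
    return len(layer_outputs) - 1


# Toy per-layer outputs: (class logits, per-patch scores). Values are made up.
layers = [
    ([1.0, 0.9, 0.0], [0.2, 0.8, 0.5, 0.1]),   # ambiguous early layer
    ([4.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]),   # confident, spatially uniform
    ([5.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]),
]
exit_at = first_exit_layer(layers, conf_thresholds=[0.95, 0.90, 0.85])
```

The per-layer list `conf_thresholds` stands in for layer-adaptive thresholding: earlier layers demand higher confidence before exiting.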

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

arXiv cs.AI · Matthew Kutakh · 2026-05-26

This study evaluates the robustness of large language models (LLMs) on mathematical reasoning tasks under problem variations, comparing chain-of-thought (CoT) prompting, Program-Aided Language models (PAL), and Step-by-Step Coding (SBSC) on 1,000 GSM-Symbolic problems using Claude Haiku 4.5. CoT was the most robust, with a 1.3pp accuracy drop and 1.8% problem breakage, while PAL performed worst (1.7pp drop, 3.1% breakage), though the differences were not statistically significant (p=0.096). Results suggest code execution methods do not enhance robustness for grade-school-level problem variations.

large language models · mathematical reasoning · chain-of-thought · program-aided language models · step-by-step coding

Confounder Detection via Treatment Intent: A New Observational Study Design

arXiv cs.AI · Drago Plecko, Patrik Okanovic, Torsten Hoefler, Elias Bareinboim · 2026-05-26

The paper introduces 'confounder detection via treatment intent', a novel observational study design that queries human experts to identify unobserved confounders by comparing matched unit pairs. The method leverages expert knowledge to explain treatment allocation discrepancies, with theoretical guarantees under specified conditions. Applied to ICU electronic health records, the approach demonstrates unobserved confounding via text note analysis, validated in a semi-synthetic environment with NLP-based proxy variables for physician knowledge.

unobserved confounding · observational study design · treatment intent · causal inference · natural language processing

Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models

arXiv cs.AI · Hayden Helm, Xiaodong Liu, Weiwei Yang · 2026-05-26

The paper introduces a framework for predicting and mitigating jailbreak susceptibility in generative models by analyzing their behavioral geometry across a population. Leveraging evaluations from previously defended models, the method enables efficient susceptibility detection (AUPRC 0.94) with 98% fewer probes than full evaluation. It also improves defense transfer efficacy (+2% over same-provider assignment, p=0.03) using a minimal set of three reference models. Results demonstrate robustness across 79 models from 24 providers and 100 configurations of a single base model.

jailbreak susceptibility · behavioral geometry · defense transfer · generative models · auprc

From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

arXiv cs.AI · Xiaohua Wang, Jiakang Yuan, Zisu Huang, Muzhao Tian · 2026-05-26

The paper introduces Calibrated Interactive RL, a framework addressing context distribution shift in multi-turn dialogue systems by coupling interactive RL with simulator alignment. It identifies two sources of shift: policy-induced shift from training on static histories, and simulator-induced shift from discrepancies between simulated and real human behaviors. The method aligns simulators with human interaction patterns to reduce the sim-to-real gap and mitigate compounding shifts. Experiments demonstrate that Interactive RL outperforms Static Context RL baselines, and simulator calibration further improves performance, achieving state-of-the-art results across multiple dialogue tasks.

context distribution shift · interactive rl · simulator alignment · policy-induced shift · simulator-induced shift

Plans for Evaluating Structured Generative Search Summaries

arXiv cs.AI · Tetsuya Sakai, Jina Lee, Hanpei Fang, Young-In Song · 2026-05-26

The paper introduces a framework for evaluating structured generative search summaries generated by large language models. These summaries include an overview, titled sections, and cited source documents. The authors outline plans for implementing and assessing the framework's effectiveness in enhancing web search results.

structured summaries · generative search · large language models · evaluation framework · web search

Annotator Positionality as Signal: Psychometric Weighting for Anti-Autistic Ableism Detection

arXiv cs.AI · Naba Rizvi, Harper Strickland, Saleha Ahmedi, Nedjma Ousidhoum · 2026-05-26

The study introduces a bias-aware evaluation framework for detecting anti-autistic ableist language in LLMs, leveraging psychometrically-weighted ground truth based on annotator positionality. This framework addresses limitations of majority-vote aggregation, which marginalizes autistic and autism-accepting perspectives. The authors find that LLMs frequently generate harmful outputs, misclassify reclaimed language as ableist, and exhibit more negative attitudes toward autistic individuals when assessment instruments are obscured. Error analysis reveals that models rely on superficial keyword matching rather than contextual factors like speaker identity or the social dynamics of in-group solidarity versus out-group harm.

large language models · annotator positionality · psychometric weighting · anti-autistic ableism · contextual factors

Advancing Creative Physical Intelligence in Large Multimodal Models

arXiv cs.AI · Cheng Qian, Hyeonjeong Ha, Jiayu Liu, Jeonghwan Kim · 2026-05-25

The paper introduces MM-CreativityBench, a benchmark evaluating affordance-grounded creative tool use in visually rich environments, revealing current LMMs' limitations in sustained grounded exploration. The authors propose affordance-grounded alignment via Direct Preference Optimization, prioritizing visual evidence over hallucinations, supplemented by affordance knowledge base supervision. Results show improved entity and part selection and fewer grounding errors than baseline LMM approaches; no quantitative gains are specified.

large multimodal models · affordance-grounded alignment · direct preference optimization · creative tool use · hallucination reduction

Credit-assigned Policy Gradient for Early Stage Retrieval in Two-stage Ranking

arXiv cs.AI · Haruka Kiyohara, Mihaela Curmei, Ariel Evnine, Shankar Kalyanaraman · 2026-05-25

The paper introduces Credit-Assigned Policy Gradient (CA-PG), a novel method for training early-stage rankers (ESRs) in two-stage retrieval systems. CA-PG addresses the scalability limitations of vanilla policy gradient (V-PG) by computing gradients with respect to the marginal probability of target items being selected across candidate sets, reducing variance while preserving ranking correctness. Theoretical analysis confirms CA-PG's variance reduction and alignment with late-stage ranker (LSR) policies. Empirical evaluations on synthetic and real-world datasets demonstrate improved convergence speed and training stability for ESRs using the Plackett-Luce model, particularly with large candidate-set sizes.

policy gradient · early-stage ranker · plackett-luce · variance reduction · two-stage retrieval
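
As background for the Plackett-Luce model named above, here is a minimal sketch of its ranking probability, plus a brute-force version of one plausible reading of "marginal probability of a target item being selected": the chance the item lands in the top-k candidate set. That reading, the function names, and the toy scores are assumptions; the enumeration is only feasible for tiny item sets and is not how CA-PG computes anything.

```python
import math
from itertools import permutations


def plackett_luce_log_prob(scores, ranking):
    """Log-probability that the first len(ranking) picks are `ranking`, in order."""
    remaining = set(range(len(scores)))
    log_prob = 0.0
    for item in ranking:
        # Each pick is a softmax over the items not yet chosen.
        log_norm = math.log(sum(math.exp(scores[j]) for j in remaining))
        log_prob += scores[item] - log_norm
        remaining.remove(item)
    return log_prob


def top_k_inclusion_prob(scores, target, k):
    """Probability that `target` is among the first k picks (brute force)."""
    total = 0.0
    for prefix in permutations(range(len(scores)), k):
        if target in prefix:
            total += math.exp(plackett_luce_log_prob(scores, prefix))
    return total


scores = [1.0, 0.0, -1.0]  # hypothetical ranker logits for three items
inclusion = [top_k_inclusion_prob(scores, item, k=2) for item in range(3)]
```

The inclusion probabilities sum to k, since exactly k items are selected in every outcome. A gradient taken through such a marginal credits the target item directly rather than the whole sampled candidate set.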

VisualNeedle: Benchmarking Active Visual Search in Information-Dense Scenes

arXiv cs.AI · Jingru Chen, Yiming Liu, Mingtao Chen, Sijie Chen · 2026-05-25

The paper introduces VisualNeedle, a benchmark for evaluating active visual search in information-dense scenes where critical evidence is confined to minute regions. It addresses three shortcuts inflating MLLM performance (linguistic priors, coarse semantics, and image corruption resilience) by proposing a counterfactual crop-black setting to test reliance on intermediate visual evidence. Evaluation of 9 MLLMs shows no-tool accuracy below 20%, tool-enabled peaking at 56.01%, and human accuracy at 63.00%, revealing persistent limitations in fine-grained visual search.

multimodal large language models · visual search · benchmarking · fine-grained perception · counterfactual evaluation

BioFact-MoE: Biologically Factorized Mixture of Experts for Vision-Language Prognostic Modeling in Hepatocellular Carcinoma

arXiv cs.AI · Junlin Yang, Tian Yu, Nicha C. Dvornek, Yuexi Du · 2026-05-25

BioFact-MoE introduces a biologically factorized Mixture of Experts (MoE) framework for hepatocellular carcinoma (HCC) prognosis, explicitly decomposing hepatic and tumor-related factors via biologically supervised experts within a residual MoE survival architecture. Trained on 4,582 3D MRI image-report pairs and evaluated on N=588 patients, it achieves 12-, 18-, and 24-month AUCs of 75.33%, 75.85%, and 73.96%, outperforming baselines. Gated expert weights enable phenotype-aware risk stratification, with hepatic and tumor embeddings showing selective associations with liver function and tumor burden markers (p<0.05) without supervision.

mixture of experts · hepatocellular carcinoma · prognostic modeling · multimodal learning · survival analysis

Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL

arXiv cs.AI · Sarthak Dayal, Abhinav Peri, Carl Qi, Claas Voelcker · 2026-05-25

The paper introduces CARL (Contrastive Action-based Representations for Reusable Local Control), a hierarchical reinforcement learning (HRL) algorithm that improves skill reusability by exploiting local dynamics regularity. CARL aligns local transitions across global contexts with required action sequences, enabling high-level policies to reason about low-level skill reuse. The method integrates with HIQL and demonstrates qualitative skill clustering in complex humanoid environments. Empirical results show improved performance on the OGBench benchmark, validating the approach's effectiveness in long-horizon RL tasks.

hierarchical reinforcement learning · skill reusability · local dynamics · contrastive learning · offline rl

Unified Panoramic Geometry Estimation via Multi-View Foundation Models

arXiv cs.AI · Vukasin Bozic, Isidora Slavkovic, Dominik Narnhofer, Nando Metzger · 2026-05-25

PaGeR introduces a unified framework for panoramic geometry estimation by adapting pre-trained 3D foundation models to process both perspective and omnidirectional images. The method minimally modifies the architecture of a transformer-based 3D reconstruction model and trains it on mixed perspective-panoramic data, enabling joint prediction of scale-invariant depth, metric depth, surface normals, and sky masks. Evaluations demonstrate state-of-the-art performance and strong zero-shot generalization across diverse indoor and outdoor scenes.

panoramic geometry reconstruction · 3d foundation models · scale-invariant depth · omnidirectional images · zero-shot performance

Automatic Layer Selection for Hallucination Detection

arXiv cs.AI · Xinpeng Wang, William Cao, Andrew Gordon Wilson, Zhe Zeng · 2026-05-25

The paper introduces FEPoID, a training-free criterion for automatically selecting optimal intermediate layers in LLMs for hallucination detection, based on the first effective peak of intrinsic dimension. It evaluates layer-selection hypotheses across architectures (e.g., LLaMA, GPT) and tasks (QA, summarization), finding existing criteria inconsistent. FEPoID outperforms baselines by identifying near-optimal layers with negligible overhead, complemented by a truncation strategy amplifying hallucination signals. Results show improved detection on benchmarks like TruthfulQA and HallucinationEval.

hallucination detection · intrinsic dimension · intermediate layers · llms · fepoid

Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations

arXiv cs.AI · Shanghao Li, Jinda Han, Yibo Wang, Yuanjie Zhu · 2026-05-25

The study mechanistically analyzes why LLMs hallucinate when reasoning over linearized structured knowledge (e.g., graphs, tables), identifying systematic internal dynamics as the root cause. Through attention and feed-forward layer analysis, it reveals that hallucinations stem from disproportionate attention to structural shortcuts and ungrounded feed-forward representations that revert to parametric memory. Results show that semantic grounding failures consistently correlate with hallucinations, while attention patterns vary by task. The findings generalize to multi-hop and tabular settings and can inform hallucination detection.

hallucination · linearized representations · attention allocation · semantic grounding · parametric memory

Personalized Generative Models for Contextual Debiasing

arXiv cs.AI · Xinran Liang, Esin Tureci, Prachi Sinha, Ye Zhu · 2026-05-25

DecoupleGen introduces personalized text-to-image diffusion models to synthesize images with rare contexts for training augmentation, addressing the bias in vision datasets towards common visual patterns. The method decouples contextual patterns from visual details, ensuring generated images remain semantically meaningful and visually aligned with the original dataset distribution. Verification constraints are applied to maintain data relevance. Evaluations on object classification and recognition tasks across complex scene datasets show consistent improvements over prior approaches, with analyses identifying key factors driving these enhancements.

diffusion models · contextual debiasing · training augmentation · visual patterns · semantic alignment

When Correct Demonstrations Hurt: Rethinking the Role of Exemplars in In-Context Learning

arXiv cs.AI · Chenghao Qiu, Chunli Peng, Yufeng Yang, Kuan-Hao Huang · 2026-05-25

The paper identifies a counterintuitive phenomenon in in-context learning (ICL): correct demonstrations can reduce model accuracy despite preserving task validity. The authors introduce task-preserving perturbations, where exemplar inputs are modified while maintaining correct task mappings (via label-updating or target-preserving variants), formalizing the resulting contextual evidence shift as the mechanism decoupling correctness from utility. Experiments across sentiment analysis, logical reasoning, and math tasks show performance degradation from perturbed demonstrations, particularly for smaller models (e.g., GPT-2), harder tasks, and higher perturbation ratios, highlighting the need to evaluate demonstration influence on contextual inference.

in-context learning · task-preserving perturbations · contextual evidence shift · label-updating · exemplar utility

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

arXiv cs.AI · Rui Meng, Bhavana Dalvi Mishra, Jiefeng Chen, Chun-Liang Li · 2026-05-25

The paper introduces Chain-of-Evidence (CoE), a verifiability framework ensuring traceability from claims to evidence sources, and ScientistOne, an autonomous research system implementing CoE throughout literature review, solution discovery, and paper writing. CoE Audit provides four integrity checks: score verification, specification violation, reference verification, and method-code alignment. Evaluated across 75 papers from five systems, baselines show systematic failures (21% hallucinated references, 42% score verification), while ScientistOne achieves zero hallucinations (0/337), perfect score verification (12/12), and 14/15 method-code alignment, matching or exceeding human performance on five tasks and achieving SOTA on Parameter Golf and MLE-Bench.

chain-of-evidence · autonomous research · verifiability framework · method-code alignment · score verification

Managing Uncertainty in LLM-Generated Procedural Knowledge for Virtual Laboratory Planning

arXiv cs.AI · Polychronis Karpodinis, Dimitris Kalles · 2026-05-25

The paper introduces a framework for managing uncertainty in LLM-generated procedural knowledge for virtual laboratory planning. The method leverages structured domain representations and uncertain state-transition samples to extract candidate procedural rules, transform them into explicit constraints, and repair uncertain procedural steps. This approach addresses the limitations of LLM outputs, such as omitted actions, incorrect step ordering, and logical incompatibilities with laboratory equipment. The framework is demonstrated in a virtual laboratory domain involving instruments, containers, tools, and material-transfer actions, aiming to enhance procedural accuracy in structured interactive environments.

procedural uncertainty · state-transition samples · virtual laboratory · structured domain representations · action planning

Towards Controllable Image Generation through Representation-Conditioned Diffusion Models

arXiv cs.LG · Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen · 2026-05-26

The authors propose representation-conditioned diffusion models for controllable image generation, addressing limitations of conventional conditioning mechanisms that rely on annotated datasets. Their method leverages representations from a pre-trained self-supervised model as conditioning signals, enhancing both unconditional generation quality and controllability. By analyzing the conditioning space, they identify directions of variation exhibiting smoothness and disentanglement properties. Preliminary results demonstrate the potential of this approach for guiding diffusion models toward specific outputs without extensive annotation requirements.

diffusion models · self-supervised learning · conditioning mechanisms · representation space · disentanglement

Probabilistic Smoothing with Ratio-Monotone Transforms for Global Optimization

arXiv cs.LG · Kukyoung Jang, Taehyun Cho, Junrui Zhang, Ping Xu · 2026-05-26

The paper introduces a probabilistic smoothing framework for global optimization using symmetric unimodal kernels and ratio-monotone transformations, eliminating the need for decreasing smoothing schedules. Theoretical analysis shows preservation of the global maximizer and concentration of stationary points near the true optimum under mild conditions, with explicit complexity bounds for stochastic gradient ascent. Experiments on high-dimensional benchmarks and black-box adversarial attacks demonstrate enhanced robustness and competitive performance compared to Gaussian kernel-based methods.

probabilistic smoothing · global optimization · ratio-monotone transforms · unimodal kernels · stochastic gradient ascent
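
To make the smoothing idea concrete, the following sketches the Gaussian-kernel baseline that the summary compares against: stochastic gradient ascent on a smoothed objective E[f(x + sigma*u)] with a fixed sigma and an antithetic Monte Carlo gradient estimate. It does not implement the paper's ratio-monotone transforms or its general unimodal kernels; the function names, step size, and sample counts are assumptions.

```python
import random


def smoothed_gradient(f, x, sigma, num_samples=20000, seed=0):
    """Antithetic estimate of d/dx E[f(x + sigma*u)] with u ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        u = rng.gauss(0.0, 1.0)
        # Symmetric difference along the sampled direction, weighted by u.
        total += (f(x + sigma * u) - f(x - sigma * u)) / (2.0 * sigma) * u
    return total / num_samples


def smoothed_ascent(f, x0, sigma, step=0.05, iterations=200):
    """Gradient ascent on the smoothed objective with a fixed sigma."""
    x = x0
    for i in range(iterations):
        x += step * smoothed_gradient(f, x, sigma, num_samples=500, seed=i)
    return x


def objective(x):
    return -(x - 3.0) ** 2  # toy objective with its maximum at x = 3


x_star = smoothed_ascent(objective, x0=0.0, sigma=0.5)
```

Only function evaluations are needed, which is why estimators of this kind suit black-box settings such as the adversarial attacks mentioned in the summary. Note that `sigma` stays constant here, matching the summary's point about not requiring a decreasing smoothing schedule.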

Greening AI Inference with Accuracy and Latency-aware User Incentives

arXiv cs.LG · Vasilios A. Siris, Adamantia Stamou, George D. Stamoulis, Konstantinos Varsos · 2026-05-26

The paper proposes a framework for designing AI inference incentives that balance carbon emissions with quality of experience (QoE) parameters, specifically inference quality and latency, while incorporating user environmental consciousness. The method leverages a two-tier service subscription model, offering discounts to users who accept reduced inference quality and higher latency during periods of high carbon intensity. This approach allows AI providers flexibility in resource allocation and accommodates tradeoffs based on model size, complexity, and carbon intensity. The framework aims to reduce carbon emissions from AI inference while maintaining user satisfaction through tailored incentives.

carbon emissions · inference quality · latency · qoe parameters · service subscription

Normal Guidance is what Attention Needs

arXiv cs.LG · Ethan Harvey, Dennis Johan Loevlie, Michael C. Hughes · 2026-05-26

The paper introduces Normal Guidance, a regularization technique that shapes attention distributions in multiple instance learning (MIL) to follow bell curves, improving slice-level classification in weakly supervised 3D medical imaging. Motivated by empirical findings that center-focused baselines outperform attention- and transformer-based MIL on brain, thoracic, and abdominal CT scans, the method constrains attention weights without sacrificing whole-scan performance. Evaluated on three datasets totaling 4M+ slices, Normal Guidance enables attention-based and transformer-based MIL to surpass state-of-the-art slice-level localization while maintaining competitive volume-level classification accuracy.

multiple instance learning · weak supervision · attention mechanism · 3d medical imaging · normal guidance
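
One way a "bell-curve" constraint on attention could be expressed is as a penalty measuring how far the per-slice attention weights are from a discretized Gaussian over slice positions. The summary does not give the actual loss, so the KL form, the fixed `center` and `width`, and all names below are hypothetical.

```python
import math


def bell_curve_target(num_slices, center, width):
    """Discretized, normalized Gaussian over slice indices."""
    weights = [math.exp(-0.5 * ((i - center) / width) ** 2) for i in range(num_slices)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def normal_guidance_penalty(attention, center, width):
    """Hypothetical regularizer: KL from attention weights to a bell curve."""
    target = bell_curve_target(len(attention), center, width)
    return kl_divergence(attention, target)


num_slices = 9
uniform = [1.0 / num_slices] * num_slices
bell = bell_curve_target(num_slices, center=4.0, width=1.5)
penalty_uniform = normal_guidance_penalty(uniform, center=4.0, width=1.5)
penalty_bell = normal_guidance_penalty(bell, center=4.0, width=1.5)
```

In training, a term like this would be added to the classification loss with a weighting coefficient, pulling attention toward a smooth, single-peaked profile along the scan axis.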

BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning

arXiv cs.LG · Shijin Gong, Erhan Xu, Kai Ye, Francesco Quinzan · 2026-05-26

BASIS introduces a critic-free post-training algorithm for LLM reasoning that optimizes the tradeoff between computational and sample efficiency in reinforcement learning. By sampling one rollout per prompt and leveraging batchwise information sharing, BASIS reduces value function estimation MSE by 69% compared to REINFORCE++. It achieves lower MSE with one rollout than group mean estimators with 8 rollouts, leading to more efficient policy optimization that matches or outperforms multi-rollout GRPO and single-rollout REINFORCE baselines.

reinforcement learning · value estimation · policy optimization · batchwise processing · llm reasoning
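
The rollout-cost tradeoff in the summary is easier to see next to the two generic critic-free baselines it mentions. The sketch below contrasts a group-mean baseline (several rollouts per prompt) with a leave-one-out batch-mean baseline (one rollout per prompt). Neither function is BASIS itself, whose estimator is not described here; names and toy rewards are assumptions.

```python
def group_mean_advantages(rewards_per_prompt):
    """Group-mean baseline: each prompt has several rollouts; subtract their mean."""
    advantages = []
    for rewards in rewards_per_prompt:
        baseline = sum(rewards) / len(rewards)
        advantages.append([r - baseline for r in rewards])
    return advantages


def leave_one_out_batch_advantages(rewards):
    """Single rollout per prompt: baseline is the mean reward of the other prompts."""
    total = sum(rewards)
    n = len(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Two prompts with four rollouts each (hypothetical 0/1 correctness rewards).
grouped = group_mean_advantages([[1, 0, 0, 1], [1, 1, 1, 1]])

# Four prompts with one rollout each.
single = leave_one_out_batch_advantages([1, 0, 1, 0])
```

The group-mean version adapts to each prompt's difficulty but multiplies generation cost by the group size. The plain batch mean is cheap but ignores per-prompt difficulty, which is the gap that sharing information across the batch is meant to close.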

Detectability in Diversity: Improved Canary Crafting for Privacy Auditing in One Run

arXiv cs.LG · Mathieu Dagréou, Aurélien Bellet · 2026-05-26

The paper introduces an improved method for crafting canaries in one-run privacy auditing, optimizing them for high detectability and minimal interference. By combining greedy initialization via influence functions with bilevel optimization that maximizes distinguishability while promoting embedding-space diversity, the approach enhances leakage estimates. Experiments demonstrate stronger privacy bounds at reduced computational cost compared to prior canary crafting techniques.

privacy auditing · membership inference attacks · canary crafting · bilevel optimization · differential privacy

Causal Risk Minimization for High-Dimensional Treatments

arXiv cs.LG · Nikita Dhawan, Arnav Paruthi, Andrew Kim, Lovedeep Gondara · 2026-05-26

The paper proposes causal risk minimization for high-dimensional treatment spaces, addressing scenarios like text-based interventions where classical causal estimators fail due to unobserved variations. The method decomposes causal error into higher-order moment-balancing errors and introduces objectives to directly optimize causal estimation, including projection techniques for lower-dimensional treatment attributes. Empirical evaluation on continuous, discrete, and text treatments (using Amazon Reviews) demonstrates improved higher-order balance optimization and competitive performance of projected causal estimates versus attribute-specific models.

causal inference · high-dimensional treatments · moment-balancing errors · treatment projection · semi-synthetic data

Transfer Learning using 66 Diseases for Disease Forecasting Applications

arXiv cs.LG · Lauren J Beesley, Alexander C Murph, Dave Osthus, Lauren A Castro · 2026-05-26

This work introduces a transfer learning framework for disease forecasting by leveraging data from 66 infectious diseases across multiple data streams, significantly expanding prior approaches. The authors train machine learning models on this multi-disease dataset and evaluate their performance on 20 distinct disease data streams. Results demonstrate that incorporating additional data streams improves forecasting accuracy in 84.9% of cases, though performance degrades when dissimilar data streams are included. A key contribution is the compilation of a publicly available database for the infectious disease forecasting community, facilitating future research in this domain.

transfer learning · disease forecasting · data streams · infectious diseases · machine learning

Kan Extension Transformers: A Categorical Unification of Attention, Diffusion, and Predict-Detach Self-Conditioning

arXiv cs.LG · Sridhar Mahadevan · 2026-05-26

The paper introduces Kan Extension Transformers (KETs), a categorical framework unifying diverse Transformer variants by interpreting layers as weighted structured extension operators. KETs generalize standard attention (singleton-neighborhood), Geometric Transformers (edge-restricted), and higher-order simplicial cases, while bridging to diffusion-style completion. The predict-detach mechanism enables noncausal self-conditioning without future token leakage. Experiments on Penn Treebank, WikiText-2, and WikiText-103 compare 12 Transformer variants, showing quadratic KET as strongest in strict-causal settings, but largest gains from predict-detach regimes across all datasets.

kan extension transformers · structured extension operator · predict-detach · self-conditioning · simplicial case

Symbolic Regression via Latent Iterative Refinement

arXiv cs.LG · Xieting Chu, Sriram Vishwanath, Vijay Ganesh · 2026-05-26

Latent Equation Embedding (LEE) introduces iterative amortized inference for symbolic regression, closing the amortization gap in neural SR methods. LEE constructs a shared latent space Z with three components: an encoder f_theta embedding symbolic tokens and observations, an expression decoder g_expr reconstructing formulas, and an evaluation decoder g_eval predicting function values. Inference combines discrete re-encoding and continuous gradient descent for hybrid refinement. Evaluated on SRBench across three noise levels, LEE outperforms 19 baselines, including Operon, GP-GOMEA, TPSR, RAG-SR, and GenSR, producing expressions 2x to 10x simpler (complexity 8 to 11 vs. 20 to 90) while advancing the accuracy-complexity Pareto frontier.

symbolic regression · amortized inference · latent space · iterative refinement · pareto frontier

Explainable Comparison of Feature-Based and Deep Learning Models for TROPOMI Methane Plume Screening

arXiv cs.LG · Solomiia Kurchaba, Joannes D. Maasakkers, Berend J. Schuit, Ilse Aben · 2026-05-26

This study compares feature-based (SVC, Random Forest, XGBoost) and image-based (ResNet-18, ResNet-34) models for classifying methane plume artifacts in TROPOMI satellite data, addressing limitations of expert-designed scalar features. Using SHAP-based explainability, the analysis evaluates performance under balanced and imbalanced settings, providing operational guidance for methane-screening workflows like the CAMS Methane Hotspot Explorer. Results demonstrate trade-offs between interpretability and accuracy across model families.

methane plume · tropomi · shap · resnet · svc

Nonlinear Data Integration via Kernel Methods for Data Collaboration Analysis

arXiv cs.LG · Yamato Suetake, Yuta Kawakami, Shunnosuke Ikeda, Yuichi Takano · 2026-05-26

The authors propose nonlinear kernel integration (NKI) for privacy-preserving collaborative analysis of decentralized datasets, addressing limitations of linear integration methods. NKI extends linear kernel integration (LKI) via kernelization, admitting a globally optimal solution through kernel ridge regression and eigenvalue decomposition. Graph regularization and centering constraints are introduced to incorporate geometric and target-variable information. Experiments on image classification demonstrate NKI's superior accuracy over linear methods under nonlinear dimensionality reduction, with further improvements from target-aware regularization. Results highlight the impact of dimensionality reduction choices on both classification accuracy and reconstruction risk.

nonlinear kernel integration · data collaboration · kernel ridge regression · graph regularization · dimensionality reduction

Not All Tokens Matter Equally: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation

arXiv cs.LG · Ning Wu, Rui Liu, Xinkun Lin, Weixing Chen · 2026-05-26

The paper introduces DIVE, a distillation framework for long-form medical report generation that addresses the imbalance in token-level supervision. The method employs decisive-token supervision to upweight pathology-related tokens and EOS events, and state-conditioned dynamic steering to adapt hidden-state-dependent residuals during decoding. Evaluated on MIMIC-CXR and CheXpert Plus with two medical VLM backbones, DIVE achieves top performance in BLEU-4, ROUGE-L, and RadGraph F1 metrics while remaining competitive on CheXbert F1.

dynamic in-context distillation · decisive-token supervision · long-form generation · medical report generation · state-conditioned steering
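
The decisive-token idea in the DIVE summary above can be pictured as a weighted token-level loss. The sketch below is only an illustration under stated assumptions: the function name, the boolean `is_decisive` flags, and the weight of 3.0 are hypothetical and are not taken from the paper.

```python
import math

def weighted_token_nll(token_probs, is_decisive, decisive_weight=3.0):
    """Weighted average negative log-likelihood over one sequence.

    token_probs: probability the model assigns to each reference token.
    is_decisive: flags marking tokens to upweight (hypothetical labels,
                 e.g. pathology terms or the EOS token).
    decisive_weight: hypothetical weight for flagged tokens; others get 1.0.
    """
    if len(token_probs) != len(is_decisive):
        raise ValueError("inputs must have the same length")
    weights = [decisive_weight if flag else 1.0 for flag in is_decisive]
    losses = [-math.log(p) for p in token_probs]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

# Toy sequence: the third token is flagged and poorly predicted.
probs = [0.9, 0.8, 0.2, 0.9]
flags = [False, False, True, False]
uniform = weighted_token_nll(probs, flags, decisive_weight=1.0)
upweighted = weighted_token_nll(probs, flags, decisive_weight=3.0)
```

With the flagged token upweighted, its error dominates the average, so the loss rises relative to uniform weighting.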

Beyond Binary: Speech Representations Across the Cognitive Score Hierarchy

arXiv cs.LG · Serli Kopar, Roshan Prakash Rane, Christian Mychajliw, Lydia Federmann · 2026-05-26

The study investigates how speech representations relate to hierarchical cognitive assessment in mild cognitive impairment (MCI), analyzing 5,754 German neuropsychological recordings across six tasks at task, domain, and global score levels. Comparing hand-crafted acoustic features with self-supervised learning (SSL) embeddings, the authors find that SSL embeddings perform better at the lower levels but worse for MCI classification. Task-specific constraints reveal performance dilution in high-freedom tasks ("specialist" representations) versus improved performance in structured tasks ("generalist" representations) at higher hierarchical levels, linking task constraints to assessment hierarchy in clinical speech analysis.

self-supervised learning · cognitive impairment · acoustic features · hierarchical assessment · neuropsychological recordings

The Role of Causal Features in Strategic Classification for Robustness and Alignment

arXiv cs.LG · Antonio Gois, Sophia Gunluk, Nir Rosenfeld, Nidhi Hegde · 2026-05-26

The paper establishes theoretical connections between causal modeling and strategic classification, demonstrating that causal features yield optimal classification error post-adaptation under bounded noise conditions. When assumptions fail, it decomposes OOD cross-entropy risk into bias and feature-utilization terms, clarifying causal classifiers' advantages. Additionally, causal features enable long-term incentive alignment between institutions and users, contrasting prior work on social costs. Theoretical claims are validated empirically on synthetic data.

strategic classification · causal models · out-of-distribution risk · cross-entropy decomposition · incentive alignment

Is an Image Also Worth 16x16=256 Superpixels? A Framework for Attentional Image Classification

arXiv cs.LG · Pedro Henrique da Costa Avelar, Anderson R. Tavares, Luís C. Lamb · 2026-05-26

The paper introduces Superpixel Transformers (SPT), a framework unifying superpixel-based image classification with Vision Transformers (ViTs). SPT generalizes prior graph attention methods (SICGAT) and ViTs by supporting arbitrary superpixel chunking, connectivity graphs, and positional encodings, including a novel multidimensional sine-cosine encoding. Evaluated on CIFAR10, FashionMNIST, and Imagenette, SPT outperforms superpixel-based GNNs and matches ViTs while mitigating SICGAT's information loss. The work demonstrates how constrained graph connectivity can enhance ViT performance, bridging superpixel and transformer paradigms.

superpixel transformers · graph attention networks · vision transformers · positional encoding · image classification

PILOT: A Data-Free Continual Learning Approach for Real-Time Semantic Segmentation via Boundary Guidance

arXiv cs.LG · Yujing Zhou, Prashant Shekhar, Thomas Yang, Yongxin Liu · 2026-05-26

PILOT introduces a data-free continual learning framework for real-time semantic segmentation, addressing catastrophic forgetting via boundary guidance. The method augments PIDNet with a parallel Derivative-branch (D-branch) that captures high-frequency boundary features of novel classes while freezing the base network, enabling incremental learning without full retraining. Experiments show PILOT maintains high mIoU on base classes while adapting to new categories, outperforming existing continual learning approaches with negligible latency overhead.

continual learning · semantic segmentation · catastrophic forgetting · boundary guidance · real-time inference

JLT: Clean-Latent Prediction in Latent Diffusion Transformers

arXiv cs.LG · Funing Fu, Tenghui Wang, Junyong Cen, Qichao Zhu · 2026-05-26

JLT introduces clean-latent prediction in latent diffusion Transformers, demonstrating its geometric advantages over velocity prediction in learned latent spaces. The method employs a 130M Transformer over frozen FLUX.2 VAE codes, comparing clean-latent prediction with velocity-prediction DiT under identical settings. Analysis reveals that velocity regression amplifies low-variance latent directions due to isotropic target-covariance, while clean prediction dampens them. On ImageNet 256×256, JLT-B/1 achieves FID-50K 2.50 with classifier-free guidance, significantly outperforming velocity prediction. These findings highlight that prediction targets in latent diffusion are representation-dependent geometric choices rather than interchangeable algebraic parameterizations.

latent diffusion · clean-latent prediction · velocity prediction · transformer · vae codes
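
The contrast between velocity and clean-latent targets in the JLT summary above can be made concrete with a short derivation. This is a sketch under assumptions that are not confirmed by the summary: a linear interpolant between the clean latent $x$ and Gaussian noise $\epsilon$, with $\epsilon$ independent of $x$ and $\operatorname{Cov}(x) = \Sigma$. The paper's own conventions may differ.

```latex
\begin{aligned}
z_t &= (1-t)\,x + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),\\
v &= \frac{\mathrm{d}z_t}{\mathrm{d}t} = \epsilon - x, \qquad x = z_t - t\,v,\\
\operatorname{Cov}(v) &= \Sigma + I, \qquad \operatorname{Cov}(x) = \Sigma .
\end{aligned}
```

Under these assumptions, along a latent direction with variance $\sigma_i^2 \ll 1$ the velocity target has variance $1 + \sigma_i^2 \approx 1$, so it is dominated by the isotropic noise term, while the clean target keeps the small variance $\sigma_i^2$.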

Mildly Overparameterized ReLU Networks on Orthogonal Data: Incremental Learning and Implicit Bias

arXiv cs.LG · James Town, Etienne Boursier, Ben Lewis, Matthias Englert · 2026-05-26

The work rigorously characterizes gradient flow dynamics in mildly overparameterized two-layer ReLU networks with orthogonal data, revealing an incremental saddle-to-saddle learning process where neurons activate sequentially. Using small initialization analysis, the authors prove convergence to an interpolating solution when width $m \gtrsim \log(n)$, recovering prior interpolation results while demonstrating novel implicit bias: the learned solution's $\ell_2$-norm scales as $\sqrt{n}$, matching minimal-norm interpolators up to constants. This provides the first theoretical evidence that mildly overparameterized ReLU networks learn near-optimal interpolators through incremental neuron activation.

gradient flow · implicit bias · overparameterization · relu networks · interpolation

Adversarial Dual On-Policy Distillation from Expressive Flow-based Teacher

arXiv cs.LG · Zhenglin Wan, Jingxuan Wu, Xingrui Yu, Chubin Zhang · 2026-05-26

The paper introduces FA-OPD, an adversarial dual on-policy distillation method combining Flow Matching (FM) teacher learning with MLP student co-training. The teacher provides reward and action channels: the former optimizes expert-likeness for exploration, while the latter offers dense local targets for stable exploitation. Evaluated on six robot control benchmarks, FA-OPD outperforms baselines and demonstrates robustness to noisy or sparse demonstrations.

flow matching · on-policy distillation · adversarial learning · behavioral cloning · robot control

Gaussian Process-based learning with new MCMC-based implementation of Wishart prior on correlation matrix

arXiv cs.LG · Kane Warrior, Dalia Chakrabarty · 2026-05-26

We propose a novel Wishart prior for the covariance matrix in Gaussian Process (GP) learning, enabling simultaneous inference of multiple lengthscale parameters in highly multivariate functions. The method employs Markov Chain Monte Carlo (MCMC) with an adaptive scale matrix defined via a look-back window over recent iterations. Empirical results demonstrate the utility of direct covariance matrix priors for identifying weakly informative inputs in GP-based learning. Validation includes experiments on both synthetic and real-world datasets, showcasing improved inference capabilities.

gaussian process · wishart prior · markov chain monte carlo · covariance matrix · lengthscale parameters

LLMs Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring

arXiv cs.LG · Unggi Lee, Minchul Shin, Yeil Jeong, Sookbun Lee · 2026-05-26

We demonstrate that training-free prompt optimization can align large language models (LLMs) for math tutoring without resource-intensive RL-based training. By evolving system prompts via API calls, we adapt 7 existing methods and propose 5 education-specialized techniques, evaluating 12 configurations across 5 conditions on 2 OOD benchmarks. All configurations outperform the strongest RL-trained baseline (R_total = 0.633), with ParetoGrad achieving optimal balance across solve rate, leak control, and helpfulness. Behavioral analysis reveals training-free methods exhibit 2-3x higher teaching-knowledge pattern usage and ~10% reduced intent-level scaffolding compared to RL-trained models. This enables efficient pedagogical alignment of LLM tutors using prompts alone.

prompt optimization · pedagogical alignment · teaching-knowledge patterns · intent-level scaffolding · pareto balance

Cost of Structural Learning Under Censored Feedback: A Threshold-Bandit Approach

arXiv cs.LG · Michael Ledford, William Regli · 2026-05-26

The paper introduces the Threshold-Activated Cooperative Multi-Armed Bandit (TAC-MAB) framework to address structural learning under censored feedback, where rewards are only observed when a coalition meets an unknown size threshold. It proposes C-TAC, a centralized algorithm achieving O(log T) cumulative regret, decomposed into structural-search and statistical-monitoring terms. A decentralized protocol, D-TAC, reduces communication by 23x compared to C-TAC while maintaining feasibility alignment through conservative belief fusion. These results demonstrate efficient coordination under censored feedback without continuous synchronization.

threshold-activated cooperative bandit · censored feedback · structural learning · cumulative regret · decentralized coordination
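
To picture the censored-feedback setting in the TAC-MAB summary above, the sketch below searches for an unknown coalition-size threshold when the only signal is whether a reward was observed. It is not the paper's C-TAC algorithm; the binary search, the function names, and the assumption that activation is deterministic and monotone in coalition size are all illustrative.

```python
def coalition_activates(size, hidden_threshold):
    """Censored feedback: a reward is observed only if the coalition is large enough."""
    return size >= hidden_threshold

def find_threshold(max_size, probe):
    """Binary-search the smallest coalition size whose reward is observed.

    probe(size) returns True when feedback is observed for that size.
    Assumes activation is monotone in size and that max_size activates.
    Returns the threshold and the number of probes used.
    """
    lo, hi = 1, max_size
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if probe(mid):
            hi = mid        # feedback observed: threshold is at most mid
        else:
            lo = mid + 1    # censored: threshold is above mid
    return lo, probes

# Hypothetical environment with threshold 11 among at most 16 agents.
threshold, probes = find_threshold(16, lambda s: coalition_activates(s, 11))
```

In this toy setting the number of probes grows logarithmically with the maximum coalition size; noisy rewards would need repeated probes per size.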

Learning Dynamic Graph Representations through Timespan View Contrasts

arXiv cs.LG · Yiming Xu, Zhen Peng, Bin Shi, Xu Hua · 2026-05-26

The paper introduces CLDG and CLDG++, two dynamic graph representation frameworks leveraging temporal translation invariance for unsupervised learning. CLDG employs contrastive learning across timespans to maintain node consistency, while CLDG++ enhances this with graph diffusion and multi-scale contrasts (local-local, local-global, global-global). Both frameworks excel in node classification and anomaly detection, with CLDG notably reducing computational complexity by avoiding sequence models. Experiments validate their effectiveness in finance, cybersecurity, and healthcare applications.

dynamic graphs · temporal translation invariance · contrastive learning · graph diffusion · anomaly detection

FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

arXiv cs.LG · Francisco Teixeira, Carlos Carvalho, Mariana Julião, Catarina Botelho · 2026-05-26

The authors introduce FalAR, a 5,800-hour European Portuguese (EP) parliamentary speech corpus with 4,850 speaker-annotated hours (1,180 speakers) to address EP's underrepresentation in ASR datasets. Using the CAMÕES ASR model for transcription alignment, the corpus includes speaker metadata (age, gender, political affiliation) and spans 20 years. Experiments show FalAR pre-training yields up to 14% relative WER reduction compared to baselines, demonstrating the impact of domain-specific data quantity on ASR performance.

automatic speech recognition · speaker annotation · corpus linguistics · wer reduction · transcription alignment
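
The "relative WER reduction" figure in the FalAR summary above is a ratio, not a difference in percentage points. A minimal sketch of the arithmetic follows; the WER values in the example are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction by which the word error rate drops relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical numbers: a drop from 10.0% to 8.6% WER is 1.4 points absolute,
# which corresponds to a relative reduction of about 14%.
reduction = relative_wer_reduction(10.0, 8.6)
```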

BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation

arXiv cs.LG · Param Thakkar, Anushka Yadav, Michael Tiemann, Abhi Mehta · 2026-05-26

BhashaSetu introduces a linguistically enriched English-Marathi parallel dataset of 2.78 million sentence pairs, addressing data scarcity in low-resource neural machine translation (NMT). The dataset spans diverse domains (news, politics, healthcare, literature, culture) and includes stemmed and lemmatized representations for morphology-aware analysis. Benchmarking state-of-the-art models using BLEU, spBLEU, chrF++, and TER metrics reveals that corpus-level deduplication is the most impactful preprocessing step, with its removal degrading performance by 1.17 BLEU and 2.21 chrF++. Parameter-efficient fine-tuning of NLLB-200-distilled-600M via LoRA demonstrates the dataset's utility. The publicly released dataset aims to advance reproducible, linguistically informed low-resource NMT research.

neural machine translation · morphology-aware · corpus deduplication · parameter-efficient fine-tuning · low-resource languages
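
Corpus-level deduplication, which the BhashaSetu summary above singles out as the most impactful preprocessing step, can be sketched as follows. The normalization choices here (lowercasing the English side, collapsing whitespace on both sides) are assumptions made for illustration; the summary does not describe the paper's exact procedure.

```python
def dedup_pairs(pairs):
    """Drop repeated (source, target) sentence pairs, keeping first occurrences.

    Pairs are compared after a light normalization: the English source is
    lowercased and whitespace is collapsed on both sides (assumed choices).
    """
    seen = set()
    unique = []
    for src, tgt in pairs:
        key = (" ".join(src.lower().split()), " ".join(tgt.split()))
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique

# Placeholder sentence pairs; the second is a near-verbatim repeat of the first.
corpus = [
    ("Hello  world", "नमस्कार जग"),
    ("hello world", "नमस्कार  जग"),
    ("Good morning", "सुप्रभात"),
]
cleaned = dedup_pairs(corpus)
```

Keeping the first occurrence preserves corpus order, which makes the step easy to reproduce.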

Causal Representation Learning for Generalisable Recommendation

arXiv cs.LG · Yorgos Felekis, Michael O'Riordan, Oriol Corcoll, Ciarán M. Gilligan-Lee · 2026-05-26

The paper introduces a causal representation learning (CRL) method to improve recommender systems' generalisation under distribution shift, using an information-theoretic disentanglement criterion that isolates causal components. A tractable variational lower bound enables optimisation from observational data alone, requiring no inference-time overhead. Evaluated via a Spotify A/B test (millions of users), KuaiRand, and synthetic benchmarks, the CRL variant matched offline performance but showed significant online gains in listener engagement, demonstrating robust out-of-distribution generalisation.

causal representation learning · distribution shift · recommender systems · variational lower bound · offline-online gap

SQARL: A Size-Agnostic Reinforcement Learning approach for Circuit Allocation in Distributed Quantum Architectures

arXiv cs.LG · Víctor Carballo, Júlia López-Closa, Mario Martin · 2026-05-26

The paper introduces SQARL, a size-agnostic reinforcement learning approach for qubit allocation in distributed quantum architectures. The method employs a transformer-based architecture to handle arbitrary qubit and core counts without retraining, addressing limitations of prior RL approaches that required hardware-specific training. Compared to the Hungarian Qubit Allocation (HQA) heuristic, SQARL reduces allocation costs by 33% for Cuccaro Adder circuits and 25% on average for random circuits, narrowing the performance gap between learning-based and hand-crafted methods.

quantum computing · qubit allocation · reinforcement learning · transformer architecture · distributed systems

SCENT: Aligning Mass Spectra with Molecular Structure for Olfactory Perception

arXiv cs.LG · Ziqi Zhang, Eunyeong Jin, Miguel Vasco, Farzaneh Taleb · 2026-05-26

The paper introduces SCENT, a multi-modal contrastive learning framework that aligns electron ionization mass spectrometry (EI-MS) representations with pretrained chemical structure embeddings, eliminating the need for explicit molecular structure at inference. The method leverages spectrum-to-chemical embedding alignment to predict olfactory perception directly from mass spectra. Results show SCENT outperforms MS-only baselines and matches structure-based models in multi-label odor descriptor prediction, while also better approximating human perceptual ratings and generalizing to real-world lab-measured spectra.

spectrum-to-chemical embedding · electron ionization mass spectrometry · multi-modal contrastive learning · olfactory perception · fragmentation fingerprints

Sampling Data with Chains of Forward-Backward Diffusion Steps

arXiv cs.LG · Hyunmo Kang, Noam Itzhak Levi, Corinna Elena Wegner, Daniel J. Korchinski · 2026-05-26

The paper introduces U-turn chains, a Markov chain sampling method for high-dimensional distributions using forward-backward diffusion steps with Metropolis-Hastings correction. The method maintains proximity to the learned data manifold and samples from energy-modified targets. Experiments on synthetic languages reveal an ergodicity-breaking phase transition driven by data manifold fragmentation, with ergodicity restored at larger U-turn magnitudes. Empirical tests on natural language and images show slow relaxation for high-level features in CNNs and LLMs, with layer-ordering inversion occurring only at large noise levels. These findings highlight constrained local dynamics in diffusion-based sampling.

u-turn chains · metropolis-hastings · ergodicity-breaking · data manifold · diffusion models

Probabilistic Recurrent Intention Switching Model

arXiv cs.LG · Wenyuan Sheng, Hao Zhu, Joschka Boedecker · 2026-05-26

The Probabilistic Recurrent Intention Switching Model (PRISM) introduces a lightweight recurrent network to map observation history to intention distributions in inverse reinforcement learning (IRL), addressing goal switching within episodes. Unlike prior methods using Markov chains or fixed history windows, PRISM decomposes the EM objective into independent per-intention reward subproblems, solvable in closed form with O(nK) complexity. Evaluated on non-Markovian gridworld, mouse labyrinth, and BridgeData-V2 robotic manipulation, PRISM achieves superior held-out log-likelihood and recovers interpretable, temporally coherent intentions from unlabeled demonstrations.

inverse reinforcement learning · intention switching · em algorithm · non-markovian · robotic manipulation

Constrained Bayesian Experimental Design via Online Planning

arXiv cs.LG · Yujia Guo, Daolang Huang, Xinyu Zhang, Sammie Katt · 2026-05-26

The authors propose a novel approach to Bayesian experimental design (BED) that enables constrained optimization of sequential experiments under dynamic constraints such as budget limitations and varying costs. The method combines offline pre-training of an amortized policy and a posterior network with online multi-step lookahead planning using scenario trees. Empirical results demonstrate that this approach yields substantially more informative design sequences compared to existing methods across various constrained BED tasks, with only a modest increase in computational overhead.

bayesian experimental design · amortized policy · posterior network · multi-step lookahead · scenario trees

TED: Related Party Transaction guided Tax Evasion Detection on Heterogeneous Graph

arXiv cs.LG · Yiming Xu, Bin Shi, Bo Dong, Jiaxiang Wang · 2026-05-26

The paper introduces TED, a graph neural network model for tax evasion detection that leverages heterogeneous graph modeling and related party transaction groups. TED employs a hierarchical attention mechanism to capture deep structural and semantic information, filtering low-level noise through heterogeneous transaction groups. Evaluated on two human-labeled real-world tax datasets within a tax bureau's risk management system, TED significantly outperforms state-of-the-art methods in detecting tax evasion, demonstrating improved exploitation of interactive tax scenario information.

graph neural network · tax evasion detection · heterogeneous graph · related party transaction · hierarchical attention mechanism

Convergence of Spectral Descent for Non-smooth Optimization

arXiv cs.LG · Yixuan Yang, Yuqing He, Song Li · 2026-05-26

The work provides theoretical convergence guarantees for Spectral Descent (SD) and Truncated Spectral Descent (TSD), simplified variants of the Muon optimizer, in non-smooth convex optimization. Under convexity, Lipschitz continuity, and sharpness conditions, the authors prove global linear convergence for both SD and TSD, and sublinear convergence for regularized variants with decoupled weight decay. The framework is applied to robust low-rank matrix recovery under mixed noise regimes, with numerical experiments validating the theoretical results.

spectral descent · non-smooth optimization · muon optimizer · convergence guarantees · low-rank matrix recovery

Signal-to-Noise Ratio and Sample Size Govern Representational Alignment in Neural Networks

arXiv cs.LG · Ali Hussaini Umar, Alessandro Laio · 2026-05-26

This work investigates the factors governing representational alignment in neural networks, demonstrating that signal-to-noise ratio (SNR) and training sample size influence alignment in both linear and nonlinear networks across regression and classification tasks. Using controlled experiments with noise-perturbed training sets, the authors show that alignment varies monotonically with SNR but non-monotonically with sample size, reaching a minimum near the interpolation threshold. Notably, alignment is decoupled from generalization performance, revealing a complex dependence on data quality and quantity. These findings hold consistently across synthetic and real-world datasets, including analysis of a single-layer linear network where alignment can be analytically estimated.

representational alignment · signal-to-noise ratio · interpolation threshold · generalization performance · latent representations

RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data

arXiv cs.LG · Hsiu-Yuan Huang, Weijie Liu, Chenming Tang, Sanwoo Lee · 2026-05-26

The paper introduces ATLAS, a framework for tracing lineage in Reinforcement Learning from Verifiable Rewards (RLVR) datasets, attributing 99.7% of 1.45M instances to 20 atomic sources. It proposes Source-level Counterfactual Attribution (SCA) to measure sample utility and curates DAPO++, a decontaminated RLVR dataset with a quality score Q that correlates with downstream performance. Experiments on Qwen3 models show DAPO++ improves benchmark performance, with Q reliably predicting training effectiveness.

reinforcement learning · data lineage · verifiable rewards · counterfactual attribution · dataset quality

Adaptive Reinforcement Learning for Robust Open Quantum System Control: A Multi-Task Framework with Temporal Optimization

arXiv cs.LG · Haftu W. Fentaw, Steve Campbell, Simon Caton · 2026-05-26

The paper introduces a Multi-task Soft Actor-Critic (SAC) Reinforcement Learning framework for robust quantum control in open systems, optimizing both pulse sequences and temporal parameters (evolution time T, pulse segments N). The method trains on 51 Hamiltonian variations, demonstrating high-fidelity state transfer under environmental noise and superior robustness to amplitude perturbations and decoherence compared to GRAPE-optimized controls via Robustness Infidelity Measure (RIM) analysis. Results show generalization to unseen Hamiltonians from the same parameter space.

multi-task reinforcement learning · quantum control optimization · soft actor-critic · open quantum systems · robustness infidelity measure

Agile Online Model Selection: Resolving Adaptation Lag via Safeguarded Large Learning Rates

arXiv cs.LG · Kei Takemura, Ryuta Matsuno, Keita Sakuma · 2026-05-26

The paper introduces an optimistic online mirror descent algorithm with safeguarded large learning rates (up to Θ(T)) to resolve the adaptation lag in non-stationary environments. The method employs a post-hoc penalty mechanism to dynamically monitor and exclude unstable updates, maintaining O(log T) cumulative penalty while enabling aggressive adaptation. Evaluations on synthetic and 11 real-world datasets show the approach reduces adaptation lag from hundreds to a few rounds, outperforming tuning-free baselines.

online model selection · dynamic regret · mirror descent · learning rates · non-stationary environments

SPHERE-JEPA: Spherical Prediction with Homogeneous Embeddings

arXiv cs.LG · Léo Nicollier, Max Dunitz, Marc Pic, Pablo Musé · 2026-05-26

SPHERE-JEPA introduces a self-supervised learning framework enforcing hyperspherical uniformity in embeddings, addressing the suboptimality of Gaussian priors for manifold-supported distributions. Theoretically, it demonstrates that uniform distributions on hyperspheres optimize k-nearest neighbors and kernel ridge regression (with exponential dot-product/linear kernels), correcting anisotropic biases induced by Gaussian embeddings. Methodologically, it adapts LeJEPA's Cramér-Wold projections to impose spherical uniformity. Empirically, SPHERE-JEPA improves texture retrieval mAP by 6% and achieves +1.8% linear probing accuracy on ImageNet-1K (ViT-B/14) versus LeJEPA.

self-supervised learning · hyperspherical uniformity · kernel ridge regression · cramér-wold projection · anisotropic bias

Parsimonious Learning-Augmented Online Metric Matching

arXiv cs.LG · Yongho Shin, Phanu Vajanopath · 2026-05-26

The paper introduces parsimonious learning-augmented algorithms for online metric matching, addressing the tradeoff between prediction usage and performance guarantees. The method extends the Follow-the-Prediction framework by incorporating virtual predictions when actual predictions are unavailable, leveraging an online algorithm that maintains intermediate matchings. Theoretical analysis establishes performance lower bounds, while empirical results demonstrate practical efficacy.

learning-augmented algorithms · online metric matching · follow-the-prediction · parsimonious predictions · performance guarantees

Generalist Graph Anomaly Detection via Prototype-Based Distillation

arXiv cs.LG · Yiming Xu, Zihan Chen, Zhen Peng, Song Wang · 2026-05-26

ProMoS introduces the first unsupervised generalist framework for graph anomaly detection (GAD), eliminating reliance on labeled data or few-shot support. It employs knowledge distillation from a frozen self-supervised GNN teacher to a mixture-of-students model with shared global and personalized branches, enabling efficient normality modeling. Prototype-guided soft-label distillation aligns representations in a shared prototype space for cross-graph generalization. Zero-shot anomaly detection is achieved via distillation bias and prototype geometric deviation. Experiments demonstrate ProMoS's effectiveness in label-free, zero-shot GAD across diverse graphs.

graph anomaly detection · knowledge distillation · prototype alignment · zero-shot learning · self-supervised gnn

RAPNet: Accelerating Algebraic Multigrid with Learned Sparse Corrections

arXiv cs.LG · Yali Fink, Ido Ben-Yair, Lars Ruthotto, Eran Treister · 2026-05-26

RAPNet introduces a graph neural network framework to optimize algebraic multigrid (AMG) by learning sparse, robust coarse operators directly from sparse algebraic systems, resolving the trade-off between sparsity and convergence quality. The method employs a level-wise training strategy, enabling generalization from small subgraphs to million-node domains while maintaining computational efficiency during the solve phase. Evaluations demonstrate that RAPNet outperforms classical non-Galerkin baselines across diverse PDE discretizations and graph Laplacians, particularly excelling in multi-query tasks such as eigenproblems, time-dependent simulations, and inverse or design problems.

algebraic multigrid · graph neural network · sparse operators · pde discretizations · level-wise training

Learning Energy-Based Models from Stochastic Interpolants using Spatiotemporal Differences

arXiv cs.LG · Hanlin Yu, RuiKang OuYang, Partha Kaushik, Arto Klami · 2026-05-26

The authors propose Spatiotemporal Noise-Contrastive Estimation (stNCE), a framework for training energy-based models by leveraging joint spatiotemporal differences, addressing failure modes in existing spatial or temporal difference methods. stNCE unifies prior approaches and derives new training objectives, using stochastic interpolants to model joint densities over data and time. Experiments on image and molecular datasets demonstrate competitive performance with state-of-the-art density estimation methods.

energy-based models · stochastic interpolants · spatiotemporal differences · noise-contrastive estimation · density estimation

Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

arXiv cs.LG · Yuanyi Wang, Su Lu, Yanggan Gu, Pengkai Wang · 2026-05-26

The paper introduces token teachability, a metric for identifying learnable teacher-student disagreement in on-policy distillation (OPD), showing that raw KL divergence poorly predicts learning value. The authors propose Teachability-Aware OPD (TA-OPD), which selects tokens where teacher corrections align with the student's top-K candidates, avoiding incompatible signals. Evaluations on Qwen2.5 and Qwen3 demonstrate TA-OPD's efficacy, matching full-token OPD performance with only 5% retained tokens and outperforming entropy- and divergence-based baselines.

on-policy distillation · token teachability · kl divergence · teacher-student learning · qwen models
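
The token-selection rule described in the TA-OPD summary above can be illustrated with a small filter. This is one simple reading of "teacher corrections align with the student's top-K candidates" and is an assumption; the paper's teachability metric may be defined differently, and the function names and toy distributions are hypothetical.

```python
def top_k_indices(probs, k):
    """Indices of the k largest probabilities, highest first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def teachable_positions(student_probs, teacher_probs, k=2):
    """Positions where the teacher disagrees with the student's top choice
    but the teacher's choice is still among the student's top-k candidates."""
    keep = []
    for pos, (student, teacher) in enumerate(zip(student_probs, teacher_probs)):
        teacher_top = top_k_indices(teacher, 1)[0]
        student_top_k = top_k_indices(student, k)
        if teacher_top != student_top_k[0] and teacher_top in student_top_k:
            keep.append(pos)
    return keep

# Three positions over a toy 4-token vocabulary.
student = [
    [0.60, 0.30, 0.05, 0.05],  # teacher prefers the student's second choice
    [0.70, 0.20, 0.05, 0.05],  # teacher and student agree
    [0.50, 0.40, 0.06, 0.04],  # teacher prefers a token outside the top-2
]
teacher = [
    [0.20, 0.70, 0.05, 0.05],
    [0.80, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.80],
]
selected = teachable_positions(student, teacher, k=2)
```

Only the first position is kept here: the second carries no disagreement, and the third has a teacher preference the student does not consider plausible.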

MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training

arXiv cs.LG · Jiacheng Li, Jianchao Tan, Hongtao Xu, Jiaqi Zhang · 2026-05-26

MONA introduces curvature-aware acceleration into the Muon optimizer for scalable language model training, combining Muon's matrix orthogonalization framework with an acceleration term derived from gradient differences. This modification enables escape from sharp local minima while preserving spectral-norm regularization. Empirical evaluations demonstrate MONA's superior convergence and downstream task performance compared to Muon and AdamW across Mixture-of-Experts pretraining scales (1B to 68B parameters) on 1 trillion tokens. Supervised fine-tuning on MOE-68B-A3B achieves state-of-the-art results on general capability, mathematical reasoning, and code generation benchmarks.

muon optimizer · matrix orthogonalization · spectral-norm regularization · mixture-of-experts · curvature-aware acceleration

Particle-Lund Multimodality in Jet Taggers

arXiv cs.LG · Loukas Gouskos, Benedikt Maier · 2026-05-26

The authors propose PLuM, a multimodal transformer architecture that jointly processes particle constituents and Lund plane representations in a shared latent space to investigate whether explicit hierarchical QCD information complements learned particle-level features. Using cross-attention between modalities, PLuM achieves systematic improvements for top-quark and H→bb̄ tagging (25% higher background rejection at 25% di-Higgs efficiency) but not for H→cc̄ or H→4q, suggesting b-jet formation benefits from structured QCD representations while other topologies are sufficiently captured by constituent-level transformers.

lund plane · transformer · qcd radiation · jet tagging · multimodal learning

Neural Autoregressive Control Variates for the Quantum Monte Carlo Sign Problem

arXiv cs.LG · Bei Qiao, Lei Wang · 2026-05-26

The authors propose neural autoregressive control variates to address the sign problem in quantum Monte Carlo simulations, using two strictly normalized autoregressive networks confined to positive- and negative-sign sectors. The method integrates with stochastic series expansion, incorporating incremental loop-topology updates and a twist channel for sign-ergodic sampling on frustrated lattices. Evaluated on the triangular-lattice Heisenberg antiferromagnet, the approach reduces the standard error of the average sign by up to 10× and energy estimator errors by 3–5×, remaining effective even at average signs below 10^-3.

quantum monte carlo · autoregressive models · sign problem · control variates · stochastic series expansion

PATE-TabTransGAN: Differentially Private Synthetic Tabular Data Generation via Transformer-Based Student Discrimination

arXiv cs.LG · M. Youssef, M. Woźniak · 2026-05-26

PATE-TabTransGAN introduces a differentially private framework for synthetic tabular data generation, combining Private Aggregation of Teacher Ensembles (PATE) with a Transformer-based student discriminator. The method employs Logistic Regression teachers trained on disjoint partitions to supervise the student via noisy-aggregated labels, while a residual generator is optimized against this student, inheriting formal (ε, δ)-DP guarantees. Evaluated on four benchmarks (Adult, Breast, Cardio, Cervical), PATE-TabTransGAN achieves the best or tied-best AUROC on all datasets and competitive AUCPR performance, demonstrating its effectiveness in capturing inter-feature dependencies while ensuring privacy.

differential privacy · tabular data · transformer · pate · gnmax
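
The noisy label aggregation mentioned in the PATE-TabTransGAN summary above can be sketched in the style of the Gaussian NoisyMax (GNMax) mechanism from the PATE literature: Gaussian noise is added to per-class teacher vote counts before taking the argmax. The function name, the vote data, and the noise scales below are hypothetical, and the sketch performs no privacy accounting.

```python
import random
from collections import Counter

def noisy_argmax(teacher_votes, num_classes, sigma, rng):
    """Return the class with the highest noisy vote count.

    teacher_votes: one predicted class index per teacher.
    sigma: standard deviation of the Gaussian noise added to each count.
    rng: a random.Random instance, passed in so runs are reproducible.
    """
    counts = Counter(teacher_votes)
    noisy = [counts.get(c, 0) + rng.gauss(0.0, sigma) for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: noisy[c])

# Ten hypothetical teachers voting on a binary label.
votes = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
label_no_noise = noisy_argmax(votes, 2, sigma=0.0, rng=random.Random(0))
label_noisy = noisy_argmax(votes, 2, sigma=5.0, rng=random.Random(0))
```

With `sigma=0.0` the result is the plain plurality vote; larger `sigma` gives stronger privacy at the cost of more label flips.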

Latent Recurrent Transformer: Architecture Exploration, Training Strategies, and Scaling Behavior

arXiv cs.LG · Zeyi Huang, Xuehai He, LiLiang Ren, Yiping Wang · 2026-05-26

The Latent Recurrent Transformer (LRT) augments autoregressive transformers by reusing a high-level source-layer hidden state from the previous token as recurrent memory for the next token, adding a cross-layer latent pathway without modifying attention or KV-cache. Interleaved parallel training pretrains this recurrence efficiently: a full-sequence initialization pass builds a shared buffer, followed by parallel refinement of disjoint position subsets, achieving recurrent-memory-aware supervision at ~2× baseline compute. Evaluated across nanochat-style backbones and varying tokens-per-parameter budgets, LRT improves language-modeling loss and in-context learning with only 0.3% added parameters.

latent recurrent transformer · autoregressive transformers · kv-cache · interleaved parallel training · in-context learning

Pretrained Approximators for Low-Thrust Trajectory Cost and Reachability

arXiv cs.LG · Zhong Zhang, Giacomo Acciarini, Dario Izzo, Hexi Baoyin · 2026-05-26

The authors demonstrate that machine learning surrogates can accurately approximate fuel consumption and transfer feasibility in low-thrust trajectory design, bypassing costly optimal control solutions. They identify a scaling law where performance improves linearly with log-scaled training data and model parameters, enabling construction of a large-scale dataset via a homotopy-ray strategy. Key innovations include a self-similar transformation for cross-scenario generalization and validation on public benchmarks like the Global Trajectory Optimization Competition. The open-sourced models achieve accurate predictions for single/multi-revolution transfers across diverse orbital environments.

low-thrust trajectory · scaling laws · homotopy-ray strategy · self-similar transformation · optimal control

Time Series Causal Discovery via Context-Conditioned and Causality-Augmented Pretraining

arXiv cs.LG · Biao Ouyang, Tengxue Zhang, Zhihao Zhuang, Yang Shu · 2026-05-26

PTCD introduces a pretraining framework for time-series causal discovery, enhancing cross-task generalization via context-conditioned modeling and causal augmentation. The method employs a dual-scale iterative attention mechanism for window-level causal dependencies and a Gaussian mixture with context-level routing for heterogeneous exogenous distributions. Pretraining on synthetic datasets integrates intervention-based learning and causal mixup to address distribution shifts. Experiments on real-world OOD datasets show that PTCD outperforms baselines in causal discovery and root cause identification.

time-series · causal discovery · pretraining · generalization · attention mechanism

Localizing Memorized Regions in Diffusion Models via Coordinate-Wise Curvature Differences

arXiv cs.LG · Gwangho Kim, Sungyoon Lee · 2026-05-26

The paper introduces a geometric method to localize memorized regions in diffusion models by analyzing coordinate-wise variance collapse, distinguishing overfitting-driven memorization from intrinsic data constraints through curvature-difference techniques. The approach subtracts curvature from an underfitted baseline (unconditional model or less-trained version) and derives a score-difference proxy to explain existing detection metrics. Evaluated on Stable Diffusion with ground-truth memorization masks, the method outperforms prior attention-based localization, achieving superior precision in identifying memorized areas.

diffusion models · memorization detection · curvature-difference · variance collapse · score-difference proxy

APEX: Amplitude Anchors and Phase Priors for Target-Scarce Higher-Frequency Wave Prediction

arXiv cs.LG · Yifan Sun, Lei Cheng, Sijie Chen, Ting Zhang · 2026-05-26

APEX introduces a framework for target-scarce higher-frequency wave-field prediction by leveraging amplitude stability and phase sensitivity across frequencies. The method first uses a lower-frequency neural operator to generate coarse predictions, retaining only amplitude as a structural anchor, then employs a conditional flow-matching enhancer guided by a Green's-function-inspired phase prior to reconstruct high-frequency details. Experiments on SimpleWave, Helmholtz, and Maxwell benchmarks demonstrate APEX's superiority over direct extrapolation, target-adapted operators, and joint generative baselines under limited supervision, highlighting the importance of separate amplitude-phase handling for oscillatory fields.

wave-field prediction · neural operator · amplitude anchoring · phase prior · conditional flow-matching

MTL-FNO: A Lightweight Multi-Task Fourier Neural Operator for Sparse Field Reconstruction

arXiv cs.LG · Siyu Ye, Shihang Li, Zhiqiang Gong, Benrong Zhang · 2026-05-26

The authors propose MTL-FNO, a lightweight multi-task Fourier neural operator for sparse field reconstruction, addressing model size growth and cross-field correlation challenges. The method employs hard parameter sharing with shared and low-rank task-specific components, alongside a polar-form decoupled optimization scheme that disentangles spectral weights into unitary (phase) and positive semi-definite (amplitude) tensors via Cayley transform reparameterization. On two engineering cases, MTL-FNO matches or exceeds standard FNO accuracy while reducing model size by 76% and 60% under few-shot conditions.

fourier neural operator · multi-task learning · sparse reconstruction · polar decomposition · cayley transform

Image Feature Fusion-based Federated Client Unlearning (FCU)

arXiv cs.LG · Hangyi Shen, Yizhi Pan, Tiansuo Li, Weiqi Jiang · 2026-05-26

The paper introduces Image Feature Fusion-based Federated Client Unlearning (IFF-FCU), a method addressing catastrophic forgetting in federated unlearning by dynamically mixing samples via linear Image Feature Fusion (Mixup). This approach regularizes the forgetting boundary, balancing unlearning effectiveness and model generalization. Evaluated on medical imaging benchmarks (RSNA-ICH and ISIC2018), IFF-FCU achieves competitive error deviation from the retrained standard, notably on the ICH dataset, and outperforms existing baselines.

federated unlearning · catastrophic forgetting · image feature fusion · mixup · error deviation
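
The linear fusion mentioned above is, in its generic Mixup form, a convex combination of two samples. The sketch below shows only that interpolation; which samples IFF-FCU pairs and how it picks the mixing weight are not stated in the summary, so `mixup`, `lam`, and the toy vectors are hypothetical.

```python
def mixup(x_a, x_b, lam):
    """Linearly interpolate two equal-length feature vectors with weight lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(x_a) != len(x_b):
        raise ValueError("feature vectors must have the same length")
    # lam weights the first sample, (1 - lam) the second.
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]


mixed = mixup([0.0, 2.0], [2.0, 0.0], lam=0.25)  # -> [1.5, 0.5]
```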

Transformers Can Learn Posterior Predictive Distributions In-Context

arXiv cs.LG · Gyeonghun Kang, Changwoo J. Lee, Xiang Cheng · 2026-05-26

The work demonstrates that transformers can theoretically learn posterior predictive distributions (PPDs) in-context for Gaussian process regression, implementing gradient descent on predictive mean/variance followed by nonlinear binned probability mappings. Through constructive proofs, it analyzes PPD approximation error bounds with respect to attention depth and bin resolution, revealing normalization's critical role in extrapolation beyond pretraining sample sizes. Empirical simulations validate the theoretical insights into posterior-predictive-focused prior-data fitted networks (PFNs) and their architectural dependencies.

posterior predictive distribution · in-context learning · gaussian process regression · attention depth · prior-data fitted networks

The Need for an External Observer Formalizing the Sufficiency Gap: A Mathematical Extension of Mixture Identifiability and Contextual Grounding in Sequence Models

arXiv cs.LG · Francesco Corielli · 2026-05-26

The paper formalizes the sufficiency gap in sequence models by constructing a binary mixed-regime process with deterministic and random regimes governed by an unobserved latent state. It demonstrates that even an ideal infinite-capacity predictor can become overconfident when the observed prefix aligns with the wrong regime, leading to an entropy difference termed the sufficiency gap. Through Bayesian analysis, the authors introduce a contextual dominance threshold based on an auxiliary binary signal with fidelity γ, which reduces but does not eliminate the gap. The findings clarify limitations of temperature scaling, emphasize the need for informative grounding mechanisms, and advocate for structurally decoupled observers in high-stakes domains.

sufficiency gap · latent state · contextual dominance threshold · bayesian update · sequence models

Proper Calibeating

arXiv cs.LG · Dean P. Foster, Sergiu Hart · 2026-05-26

The paper extends calibrated forecasting and calibeating to proper scoring rules, introducing proper-calibration and proper-calibeating by requiring uniform error convergence across bounded proper scoring rules. It demonstrates that calibration implies proper-calibration, while calibeating does not necessarily imply proper-calibeating, and provides methods to ensure proper-calibeating and proper-multicalibeating. Additionally, it establishes equivalence between proper-calibration and universal no regret in decision-making under uncertainty when best replying to forecasts.

proper scoring rules · calibrated forecasts · calibeating · uniform convergence · decision-making under uncertainty

CART Random Forests as Sequential Allocation over Random Opportunity Sets: A Stochastic-Control Theory of Ensemble Risk

arXiv cs.LG · Tianxing Mei, Yingying Fan, Mingming Leng, Jinchi Lv · 2026-05-26

The paper introduces CART-ROSA, a stochastic-control framework interpreting feature-subsampled CART random forests as sequential allocation over random opportunity sets. It models feature subsets as feasible actions and CART splits as masked-action policies, inducing a controlled process over split-count states that determines forest MSE. The analysis reveals CART's local stabilization properties (contracting split imbalances) but global suboptimality, with explicit MSE risk expansion derived for linear models. This operationalizes forest mechanics via two design levers: feature subsampling's informative-opportunity rate and split policy's contraction strength.

cart random forests · stochastic control · mean squared error · feature subsampling · split policy

WINDQuant: Weight-Informed Neural Decision-Making for Global Mixed-Precision LLM Quantization

arXiv cs.LG · Phong Nam Huu Nguyen, Khoi M. Le, Cong-Duy T Nguyen, Anh Tuan Luu · 2026-05-26

WINDQuant introduces a reinforcement-learning-based controller for fine-grained mixed-precision quantization of LLMs, addressing limitations in existing post-training and heuristic methods. The method employs proximal policy optimization (PPO) to assign bit-widths at column-chunk granularity under global storage constraints, incorporating activation-aware calibration and explicit effective-bit accounting. Evaluations on LLaMA models show competitive accuracy in ultra-low-bit regimes (e.g., 2-4 bits) with reduced optimization overhead compared to retraining-based approaches, demonstrating RL's viability for adaptive quantization.

mixed-precision quantization · reinforcement learning · column-chunk granularity · proximal policy optimization · effective-bit accounting

Why Prompt Optimization Works, and Why It Sometimes Doesn't: A Causal-Inspired Edit-Level Analysis

arXiv cs.LG · Shuzhi Gong, Hechuan Wen · 2026-05-26

The study investigates why automated prompt optimization methods (e.g., DSPy, TextGrad) exhibit inconsistent generalization across tasks and LLM backbones, using causal inference-inspired analysis. By analyzing prompt edits across frameworks, backbones, and benchmarks, the authors identify task-conditioned edit patterns: complexity-increasing and meta-instructional edits harm mathematical reasoning, while step-by-step and meta-cognitive edits benefit logical reasoning. These findings, robust across cognitive-load annotations and edit-motif analyses, reveal systematic interactions between edit families and task characteristics, guiding future optimizer design.

prompt optimization · causal inference · llm backbones · task-conditioned edits · meta-cognitive edits

RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models

arXiv cs.LG · Xing Cong, Hanlin Tang, Kan Liu, Lan Tao · 2026-05-26

RT-Lynx introduces activation sparsification for Diffusion Transformers (DiT) to reduce inference costs while preserving generation quality, addressing the limitations of weight sparsification. The method applies N:M semi-structured sparsification to activations, leveraging their intrinsic sparsity, and incorporates error-compensation techniques to mitigate accuracy loss. Optimized CUDA kernels are implemented for efficient execution, achieving up to 1.55x speedup in linear layers. Experiments across multiple diffusion models indicate that RT-Lynx maintains original model performance while accelerating inference.

diffusion transformers · activation sparsification · n:m sparsification · error-compensation · cuda kernels
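
For readers unfamiliar with N:M semi-structured sparsity, the pattern keeps the N largest-magnitude entries in every consecutive group of M and zeroes the rest. The following is a plain-Python sketch of that generic pattern applied to one activation row; it omits the paper's error compensation and CUDA kernels, and the name `nm_sparsify` and the 2:4 default are assumptions.

```python
def nm_sparsify(values, n=2, m=4):
    """Keep the n largest-magnitude entries in each consecutive group of m; zero the rest."""
    if len(values) % m != 0:
        raise ValueError("length must be a multiple of m")
    out = []
    for start in range(0, len(values), m):
        group = values[start:start + m]
        # Indices of the n entries with the largest absolute value in this group.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


acts = [0.1, -2.0, 0.3, 1.5, 4.0, 0.0, -0.2, 0.05]
sparse = nm_sparsify(acts)  # -> [0.0, -2.0, 0.0, 1.5, 4.0, 0.0, -0.2, 0.0]
```

The fixed per-group budget is what lets hardware skip the zeroed positions with a predictable layout.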

PIDM-DP: Physics-Informed Diffusion with Dormand-Prince Integration for Chaotic System Identification and State Reconstruction across Multiple Dynamical Regimes

arXiv cs.LG · Shailendra Dabral · 2026-05-26

PIDM-DP introduces a Physics-Informed Diffusion Model with Dormand-Prince Integration for chaotic system identification and state reconstruction, embedding a 5th-order Dormand-Prince ODE integrator into the reverse sampling loop of a Denoising Diffusion Probabilistic Model (DDPM). Physics residuals are back-propagated via automatic differentiation, ensuring trajectories satisfy governing equations with 5th-order accuracy, while a linear-scheduled guidance mechanism prevents gradient explosions. Evaluated on five benchmark systems, PIDM-DP achieves up to 15.4× RMSE improvement over unconstrained diffusion baselines and outperforms the Ensemble Kalman Filter on stiff systems, with significant RMSE reductions (e.g., 0.1097 vs. 0.9443 on Rabinovich-Fabrikant). Topological validation confirms preservation of chaotic invariant measures.

physics-informed diffusion · dormand-prince integration · chaotic systems · denoising diffusion · automatic differentiation

Focal Reward: Balanced Reinforcement Learning under Rubric-Based Rewards

arXiv cs.LG · Yu Huang, Zihua Zhao, Zhaoxin Huan, Wanli Gu · 2026-05-26

The paper introduces Focal Reward, a reinforcement learning objective addressing reward imbalance in multi-dimensional rubric-based evaluation for LLMs. The method employs inverse reward projection to estimate criterion saturation, then dynamically reweights rewards via calibration coefficients to prioritize under-optimized dimensions. Experiments across three model scales and six benchmarks show improvements over static baselines in all 18 comparisons, with analysis confirming gains stem from saturation-aware reward reallocation.

focal reward · rubric-based rewards · inverse reward projection · reward calibration · reinforcement learning

TrackRef3D: Multi-View Consistent Track-then-Label for Open-World Referring Segmentation in 3D Gaussian Splatting

arXiv cs.LG · Yuyang Tan, Renhe Zhang, Hang Zhang, Ao Li · 2026-05-26

TrackRef3D introduces an automatic pipeline for open-world referring segmentation in 3D Gaussian Splatting (3DGS), eliminating manual annotation through a track-then-label paradigm. The method employs a Trajectory-Aware Semantic Consensus Module (TSCM) for multi-view consistent semantic identity via synonymous clustering and trajectory-aware voting, alongside visibility-aware description generation and a Hybrid Training Strategy (HTS) for robust query handling. Experiments show state-of-the-art performance on benchmarks.

3d gaussian splatting · referring segmentation · multi-view consistency · trajectory-aware voting · hybrid training strategy

Separate Aggregation of Split Network for Personalized Federated Learning

arXiv cs.LG · Yunseok Kang, Jaeyoung Song · 2026-05-26

The authors propose PGFedSplit, a personalized federated learning framework addressing performance degradation under heterogeneous client data. The method employs a split architecture with adaptive aggregation scheduling, balancing global knowledge sharing and local adaptation. It enhances robustness via a mixture of local representations and server-generated synthetic representations from Gaussian statistics. Evaluations on Fashion MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet show consistent improvements over state-of-the-art PFL methods in convergence stability and personalization under severe heterogeneity.

personalized federated learning · split architecture · adaptive aggregation · gaussian statistics · client heterogeneity

Distribution-Aware Conformal Prediction: A Framework for generating efficient prediction intervals for time series

arXiv cs.LG · Daniel Schweizer, Peter Kuhn, Jayant Sharma, Shivali Dubey · 2026-05-26

The paper introduces Distribution-aware Conformal Prediction (DCP), a framework combining probabilistic predictors (Monte Carlo dropout, deep ensembles, quantile regression) with conformal calibration to generate valid prediction intervals for time series. DCP employs numerical inversion to construct interval bounds, supporting arbitrary predictor-score pairings. Benchmarks on synthetic and real-world data show DCP adapts to varying uncertainty regimes, with performance evaluated via a modified Winkler score balancing coverage and efficiency. The modular design generalizes existing methods like Conformalized Quantile Regression while enabling future extensions for uncertainty quantification.

conformal prediction · monte carlo dropout · quantile regression · uncertainty quantification · prediction intervals
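
To make the coverage-versus-efficiency trade-off concrete, here is the standard (unmodified) Winkler interval score: the interval width plus a penalty scaled by 2/alpha whenever the observation falls outside the interval. The paper uses a modified variant whose details are not given in the summary, so this sketch should be read as the textbook baseline only; the function name and toy numbers are illustrative.

```python
def winkler_score(lower, upper, y, alpha):
    """Standard Winkler score for a (1 - alpha) prediction interval; lower is better."""
    width = upper - lower
    if y < lower:
        # Observation below the interval: penalize the miss distance.
        return width + (2.0 / alpha) * (lower - y)
    if y > upper:
        # Observation above the interval: penalize the miss distance.
        return width + (2.0 / alpha) * (y - upper)
    return width


covered = winkler_score(0.0, 2.0, y=1.0, alpha=0.5)  # -> 2.0 (width only)
missed = winkler_score(0.0, 2.0, y=3.0, alpha=0.5)   # -> 6.0 (width + 4 * 1.0)
```

Narrow intervals score well only if they still contain the observation, which is why the score rewards both validity and tightness.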

Beyond Holistic Models: Systematic Component-level Benchmarking of Deep Multivariate Time-Series Forecasting

arXiv cs.LG · Shuang Liang, Chaochuan Hou, Xu Yao, Shiping Wang · 2026-05-26

The paper introduces TSCOMP, the first large-scale benchmark for systematic component-level analysis of deep multivariate time-series forecasting models. It deconstructs existing approaches into fine-grained components (preprocessing, encoding, architectures, optimization) and evaluates them through orthogonal experimental design across 20,000 model-dataset combinations. Results show that corpus-driven component selection outperforms state-of-the-art holistic models, demonstrating the superiority of systematic analysis over manual architecture design.

multivariate time-series · component-level benchmarking · orthogonal experimental design · zero-shot model construction · performance corpus

SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?

arXiv cs.LG · Hwiwon Lee, Jiawei Liu, Dongjun Kim, Ziqi Zhang · 2026-05-26

SEC-bench Pro introduces a benchmark for evaluating LLMs on long-horizon software security tasks, addressing limitations of existing benchmarks by incorporating real-world bug hunting scenarios. The method involves a three-phase pipeline for vulnerability collection, environment reconstruction, and oracle-based validation, instantiated with 183 validated vulnerabilities across V8 and SpiderMonkey. Results show frontier models achieve <40% success (32.0% on V8, 38.8% on SpiderMonkey), with open-weight Kimi-K2.6 at 11.7% on V8, while Claude Code and Codex exhibit complementary performance.

vulnerability discovery · proof-of-concept generation · oracle-based validation · memory-safety bugs · jit compilation

Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks

arXiv cs.LG · Kevin Kuo, Chhavi Yadav, Virginia Smith · 2026-05-26

The paper demonstrates that open-weight LLM safeguards are vulnerable to simple jailbreaking attacks without fine-tuning, challenging the assumption that harmful behavior requires gradient-based optimization. It evaluates two low-cost attacks—abliteration and prefilling—on three benchmarks (BeaverTails, HarmBench, AdvBench), increasing attack success rates from <10% to 16%-96%. The authors propose abliteration-resistant tuning (ART) as a mitigation, reducing attack success by 10%-20%. Results reveal a broader attack surface for open-weight models than previously recognized, necessitating more diverse defense evaluations.

open-weight llm · jailbreaking attacks · abliteration · prefilling · harmbench

SIKA-GP: Accelerating Gaussian Process Inference with Sparse Inducing Kernel Approximations for Bayesian Deep Learning

arXiv cs.LG · Wenyuan Zhao, Rui Tuo, Chao Tian · 2026-05-26

SIKA-GP introduces sparse inducing kernel approximations to accelerate Gaussian process inference, reducing complexity to O(log M) for M inducing points via dyadic ordered template bases. The method constructs compact kernel representations from sparsely activated bases, enabling efficient GPU tensorization and integration with Bayesian neural networks. Experiments on vision and transformer benchmarks show maintained predictive performance while achieving significant speedups in training and inference for deep architectures.

gaussian processes · sparse approximations · bayesian neural networks · kernel learning · inducing points

PRISM: Position-encoded Regressive Inverse Spectral Model for Multilayer Thin-Film Design

arXiv cs.LG · Runtian Wang, Renhao Xue, Baige Chen, Hao Wu · 2026-05-26

PRISM introduces a decoder-only autoregressive transformer for multilayer thin-film design, jointly optimizing discrete material selection and continuous thickness regression. Key innovations include spectrum prefix conditioning for target specification and cumulative-depth Rotary Position Embeddings to encode spatial relationships. The 13M-parameter model reduces MAE by >50% versus baselines, while a 44M variant achieves SOTA performance (MAE=0.010) with faster inference than simulated annealing.

autoregressive transformer · rotary position embeddings · thin-film design · spectrum prefix conditioning · simulated annealing

Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

arXiv cs.LG · Austin Wang, Jiaqi Han, Stefano Ermon, Yisong Yue · 2026-05-26

The paper introduces Diffusion LAIR, a listwise reward-aware alignment method for diffusion models that extends beyond pairwise preference optimization. The method converts reward scores for multiple candidate images per prompt into centered advantage weights, optimizing an advantage-weighted regression objective with quadratic regularization on implicit reward (denoising-loss improvement over a reference model). This approach avoids pairwise reduction, uses all candidates simultaneously, and controls update magnitude via closed-form optimum analysis. Experiments demonstrate superior performance over baselines on SD1.5 and SDXL in text-to-image generation, compositional generation, and image editing tasks.

preference optimization · diffusion models · implicit reward · advantage-weighted regression · denoising-loss
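
The listwise step of turning per-prompt reward scores into centered advantage weights can be sketched as plain mean-centering over the candidate list. Whether the paper also rescales the advantages (for example by a standard deviation or temperature) is not stated in the summary, so this assumes centering only; `centered_advantages` and the toy rewards are hypothetical.

```python
def centered_advantages(rewards):
    """Subtract the per-prompt mean reward so the advantages sum to zero."""
    if not rewards:
        raise ValueError("need at least one candidate reward")
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Rewards for three hypothetical candidate images generated from one prompt.
advantages = centered_advantages([1.0, 2.0, 3.0])  # -> [-1.0, 0.0, 1.0]
```

Because every candidate receives a signed weight, all of them contribute to the update, unlike a pairwise scheme that keeps only a best and worst sample.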

The Stability of Singular Distribution: A Spectral Perspective on the Two-Phase Dynamics of Language Model Pre-training

arXiv cs.LG · Hongtao Zhang, Wenjie Zhou, Chenxi Jia, Wei Chen · 2026-05-26

The study identifies the Stability of Singular Distribution (SoSD) as a spectral phenomenon underlying the two-phase dynamics of large language model pre-training, characterized by an initial rapid loss drop followed by slow improvement. Through analysis of diverse architectures (GPT-2, LLaMA) and training configurations (Step-wise, WSD, Cosine Decay schedules; AdamW, Muon optimizers), it demonstrates that SoSD, where the trace-normalized singular value spectrum stabilizes early, synchronizes with the slow-descent regime. Theoretical analysis of a simplified Transformer proves that growing weight norms induce an early SoSD threshold, bounding loss decrease rates by singular distribution variation. Strategies like WSD and Muon are interpreted as modulating the SoSD scale, providing a spectral perspective on pre-training efficiency.

stability of singular distribution · spectral phenomenon · trace-normalized singular value spectrum · slow-descent regime · weight norms

Extra-Merge: Tracing the Rank-1 Subspace of Model Merging in Language Model Pre-Training

arXiv cs.LG · Wenjie Zhou, Bohan Wang, Hongtao Zhang, Chenxi Jia · 2026-05-26

The work identifies a Rank-1 Subspace phenomenon in late-stage LLM pre-training, where merged checkpoints collapse onto a stable one-dimensional manifold despite noisy optimization trajectories. Theoretically grounded in river-valley landscape analysis, the authors propose Extra-Merge, a training-free method that extrapolates along this subspace to minimize loss without gradient updates. Experiments on GPT-2 and LLaMA variants (124M–2B parameters) show consistent improvements over merging baselines, including zero-shot accuracy gains on Pythia-12B downstream tasks and compatibility with the Muon optimizer.

rank-1 subspace · model merging · river-valley landscape · extra-merge · zero-shot accuracy

Variational Inference for Evidential Deep Learning

arXiv cs.LG · Jiawei Tang, Xinyan Du, Hui Liu, Junhui Hou · 2026-05-26

The paper introduces Variational Inference Evidential Deep Learning (VI-EDL), a framework addressing limitations in conventional Evidential Deep Learning (EDL). VI-EDL reformulates EDL via variational inference, deriving an Evidence Lower Bound (ELBO) to prevent excessive evidence growth and theoretically establishing a generalization bound. The method justifies setting Dirichlet parameters α = e + 1 to minimize this bound. Experiments on visual and medical datasets show VI-EDL achieves state-of-the-art performance in out-of-distribution detection, noise detection, and autonomous driving scenarios.

evidential deep learning · variational inference · dirichlet distribution · generalization bound · out-of-distribution detection
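
The α = e + 1 parameterization can be made concrete with the usual evidential deep learning quantities: the Dirichlet strength S = Σα, the expected class probabilities α/S, and the vacuity-style uncertainty K/S for K classes. This sketch follows the conventional EDL formulation rather than anything specific to VI-EDL; the function name and evidence values are illustrative.

```python
def dirichlet_from_evidence(evidence):
    """Map non-negative per-class evidence to Dirichlet parameters alpha = e + 1."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)                  # Dirichlet strength S
    probs = [a / strength for a in alpha]  # expected class probabilities
    uncertainty = len(alpha) / strength    # K / S, equal to 1.0 with zero evidence
    return alpha, probs, uncertainty


alpha, probs, u = dirichlet_from_evidence([8.0, 0.0, 0.0])
# alpha == [9.0, 1.0, 1.0]; probs[0] == 9/11; u == 3/11
```

With zero evidence the distribution is uniform and the uncertainty is maximal, which is the behaviour that makes the +1 offset a natural default.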

MuCon: Clipped Muon Updates for LLM Training

arXiv cs.LG · Albert Yi · 2026-05-26

MuCon introduces a clipped-Muon optimizer variant for LLM training, replacing the canonical partial polar factor with singular-value clipping. The method applies a spectral-norm clipping operator, MClip_τ, which modifies only singular values exceeding a threshold τ while preserving others. The paper explores when MuCon clipping can be approximated without full dense SVD, identifying numerical obstructions near the threshold and proposing matrix-function methods paired with stable polar/square-root primitives or regularization. Results highlight the necessity of stable numerical techniques for handling ill-conditioned singular values near the clipping boundary.

muon optimizer · singular-value clipping · spectral-norm ball · matrix-function methods · numerical obstruction
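
As a reference point for the clipping operator, the sketch below computes it the expensive way, through a full dense SVD, which is exactly the cost the paper tries to avoid. It assumes that clipping means capping each singular value at τ and leaving smaller ones unchanged; the function name `clip_singular_values` is hypothetical and NumPy is used only for illustration.

```python
import numpy as np


def clip_singular_values(W, tau):
    """Cap every singular value of W at tau via a dense SVD; smaller ones are unchanged."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Scale the columns of U by the clipped singular values, then recombine.
    return (U * np.minimum(s, tau)) @ Vt


W = np.diag([3.0, 0.5])
W_clipped = clip_singular_values(W, tau=1.0)
spectral_norm = np.linalg.norm(W_clipped, ord=2)  # largest singular value after clipping
```

Singular values that sit very close to τ are where approximate, SVD-free schemes become delicate, which is the numerical obstruction the summary refers to.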

Robust Koopman Control Barrier Filters for Safe Actor-Critic Reinforcement Learning

arXiv cs.LG · Dhruv S. Kushwaha, Zoleikha A. Biron · 2026-05-26

The paper introduces Robust Koopman-CBF SAC, a safety-filtered actor-critic framework combining Koopman operators and control barrier functions (CBFs) for safe RL. It learns a finite-dimensional Koopman predictor from data, constructs affine CBF constraints in the lifted space, and enforces them via a quadratic-program safety layer, with tightened CBF conditions to account for approximation error. Evaluated on CartPole and Safety Gymnasium tasks, the method achieves zero violations in CartPole while matching unconstrained SAC returns, though exposing limitations of first-order velocity barriers and linear EDMD models in high-dimensional settings.

koopman operator · control barrier functions · actor-critic · safe reinforcement learning · quadratic-program safety layer
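
A quadratic-program safety layer with a single affine constraint has a closed-form solution: project the nominal action onto the half-space a·u + b ≥ 0. The sketch below shows only that generic projection. In the paper the coefficients would come from the learned Koopman model and the tightened CBF condition, so here `a`, `b`, and `cbf_filter` are hypothetical placeholders.

```python
def cbf_filter(u_nom, a, b):
    """Solve min ||u - u_nom||^2 subject to a.u + b >= 0 for one affine constraint."""
    margin = sum(ai * ui for ai, ui in zip(a, u_nom)) + b
    if margin >= 0.0:
        return list(u_nom)  # nominal action already satisfies the constraint
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("infeasible constraint: a is zero and b is negative")
    # Move along a by the smallest amount that restores a.u + b = 0.
    scale = margin / norm_sq
    return [ui - scale * ai for ui, ai in zip(u_nom, a)]


# Constraint -u + 1 >= 0 (that is, u <= 1) applied to a nominal action of 2.0.
u_safe = cbf_filter([2.0], a=[-1.0], b=1.0)  # -> [1.0]
```

With several simultaneous constraints the projection no longer has this one-line form and a QP solver is needed.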

FM-fMRI: Event Conditioned Flow Matching for Rest-to-Task fMRI Time-Series Synthesis

arXiv cs.LG · Peiyu Duan, Jiyao Wang, Nicha C. Dvornek, Junlin Yang · 2026-05-26

FM-fMRI introduces an event-conditioned flow-matching model for synthesizing task-based fMRI (tfMRI) time series from resting-state fMRI (rsfMRI) and task event schedules. The method learns a continuous-time conditional vector field, enabling fast ODE-based sampling and flexible event conditioning. Evaluated on HCP and BioPoint datasets, FM-fMRI outperforms conditional diffusion, GANs, and VAEs in spectral/connectivity agreement and distributional alignment. Synthesized tfMRI improved autism classification in data-limited settings, demonstrating clinical utility.

flow matching · fmri synthesis · ode-based sampling · rest-to-task · connectome consistency

Amortized Factor Inference Networks for Posterior Inference

arXiv cs.LG · Joohwan Ko, Justin Domke · 2026-05-26

The paper introduces Amortized Factor Inference Networks (AFINs), a family of dimension-independent encode-merge-decode networks that generalize posterior inference across varying priors, likelihoods, and dimensionalities without retraining. AFINs map model specifications and observations to variational posterior parameters, avoiding costly test-time finetuning. Experiments show that a single trained AFIN matches the posterior accuracy of NUTS and variational methods while reducing test-time compute by 2-4 orders of magnitude.

amortized inference · variational posterior · model specification · dimension-independent · test-time compute

Function-Valued Causal Influence in Nonlinear Time Series

arXiv cs.LG · Valentina V. Kuskova, Dmitry Zaytsev, Michael Coppedge · 2026-05-26

The paper introduces function-valued causal influence for nonlinear time series analysis, addressing the limitation of scalar edge scores in summarizing causal relationships. Using Neural Additive Vector Autoregression, the authors propose a framework based on Individual Conditional Expectation to estimate causal response functions directly from trained models. Synthetic experiments demonstrate that edges with identical scalar scores can exhibit diverse functional behaviors, including monotonic, thresholded, saturating, and sign-changing effects. An applied case study on democratic development reveals regime-specific and asymmetric causal structures overlooked by score-centric approaches.

function-valued causal influence · neural additive vector autoregression · individual conditional expectation · nonlinear time series · scalar edge scores
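
An Individual Conditional Expectation curve is obtained by sweeping one input over a grid while holding the remaining inputs of a single observation fixed, then recording the model's prediction at each grid point. The sketch below uses a hand-written toy function in place of a trained Neural Additive Vector Autoregression model; `ice_curve` and `toy_model` are hypothetical names.

```python
def ice_curve(model, x, feature, grid):
    """Evaluate model on copies of x with x[feature] replaced by each grid value."""
    curve = []
    for value in grid:
        x_mod = list(x)  # copy so the original observation is untouched
        x_mod[feature] = value
        curve.append(model(x_mod))
    return curve


def toy_model(x):
    """Hypothetical response: thresholded in x[0], linear in x[1]."""
    return max(0.0, x[0] - 1.0) + 0.5 * x[1]


curve = ice_curve(toy_model, [0.0, 2.0], feature=0, grid=[0.0, 1.0, 2.0, 3.0])
# -> [1.0, 1.0, 2.0, 3.0]: flat below the threshold, then increasing
```

The shape of the curve (flat, then rising) is precisely the kind of information a single scalar edge score cannot convey.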

When Does LeJEPA Learn a World Model?

arXiv cs.LG · David Klindt, Yann LeCun, Randall Balestriero · 2026-05-25

The paper proves that LeJEPA (alignment plus Gaussian regularization) achieves linear identifiability of latent world variables under stationary, additive-noise transitions, with Gaussian latents being the unique distribution enabling this guarantee. The analysis relies on spectral decomposition showing alignment penalizes nonlinearities, forcing linear maps as optima, and demonstrates approximate identifiability with graceful degradation. Theoretical claims are validated through experiments on 2D to 1024D latents, including robotic control tasks, establishing foundations for provably structured world models.

linear identifiability · gaussian regularization · spectral decomposition · latent variables · world models

Deep Learning-based Algebraic Reynolds Stress Closures for RANS Simulations of Turbulent Flows

arXiv cs.LG · Daniel Dehtyriov, Jonathan F. MacArt, Justin Sirignano · 2026-05-25

The Deep Algebraic Reynolds Stress Model (DARSM) introduces a physics-derived deep learning closure for RANS simulations, addressing distribution shift and generalization challenges in turbulence modeling. The method combines a neural network mapping flow invariants to empirical parameters in an implicit algebraic Reynolds stress equation, with adjoint-based optimization through coupled PDEs. DARSM reduces average velocity errors by 2-4× (peak 12×) on square-duct and periodic-hill benchmarks, generalizing across Reynolds numbers, geometries, and flow regimes without retraining, outperforming five established ML baselines in accuracy.

rans simulations · turbulence modeling · adjoint optimization · physics-informed ml · reynolds stress closure

Balancing Plasticity and Stability with Fast and Slow Successor Features

arXiv cs.LG · Raymond Chua, Doina Precup, Blake Richards · 2026-05-25

The study investigates the stability-plasticity trade-off in deep Reinforcement Learning (RL) under continual non-stationarity, contrasting abrupt shifts with gradual environmental drift. Using modified 3D Miniworld and MuJoCo environments, the authors demonstrate that stability-focused methods (e.g., synaptic consolidation) outperform plasticity-oriented approaches (e.g., parameter resetting) in gradual change scenarios. They propose consolidating Successor Features (SFs) across multiple timescales, finding this yields superior adaptation compared to Q-value consolidation, with multi-timescale SF stabilization capturing complementary aspects of environmental change.

reinforcement learning · non-stationarity · successor features · synaptic consolidation · plasticity-stability dilemma

Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention

arXiv cs.LG · Athanasios Zeris · 2026-05-25

The paper introduces Energy-Gated Attention (EGA) and Morlet Positional Encoding (MoPE) as complementary inductive biases for transformer attention. EGA gates value aggregation via a learned energy estimate of key tokens, while MoPE replaces sinusoidal encodings with learnable Gaussian-windowed wavelets for scale-selective locality. Combined, they achieve a superadditive gain on TinyShakespeare: a +0.119 validation-loss improvement, against +0.092 for EGA alone and -0.032 for MoPE alone, whose sum is only +0.060. Ablations show learned components outperform structured spectral priors. Experiments are limited to small-scale models (≤6M parameters), with multi-seed validation identified as future work.

energy-gated attention · morlet positional encoding · inductive biases · scale-selective locality · superadditive performance

MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

arXiv cs.LG · Barsat Khadka · 2026-05-25

The paper introduces MechRL, a reinforcement learning framework for automated circuit discovery in mechanistic interpretability. The method trains a PPO agent to select attention heads in GPT-2-small via a contrastive reward function comparing task-specific and general next-token prediction performance under zero-ablation. The agent matches oracle performance on training tasks (induction, IOI) and a held-out task (docstring completion), recovering 96% of oracle performance via best-of-five planning, while aligning with literature-identified causally critical heads and ignoring redundant ones.

mechanistic interpretability · reinforcement learning · circuit discovery · attention heads · zero-ablation

A PAC-Bayesian View of Generalisation for Physics-Informed Machine Learning

arXiv cs.LG · Thien V. Nguyen, Amaury Habrard, Benjamin Guedj · 2026-05-25

The paper develops a PAC-Bayesian framework to analyze generalization in physics-informed machine learning (PIML) with unbounded losses, addressing the gap in statistical understanding of PIML models. It adopts a multi-task perspective that jointly considers data fidelity, PDE residuals, and boundary conditions, avoiding the looseness of union-bound approaches. The framework leverages physics-informed objective structures to derive bounds scaling with input-gradient norms, linking physical regularity to generalization. Two classes of bounds are instantiated under Sobolev and Poincaré-type assumptions, trading off statistical complexity and smoothness. A self-bounding-aware learning algorithm is proposed, optimizing tractable surrogates of derived bounds, with empirical evaluations showing non-vacuous and tighter bounds than baselines.

pac-bayesian · physics-informed · generalization · sobolev · poincaré

QAM-W: Joint 2D Codebook Quantization for LLM Weights via Hadamard Rotation and Activation-Aware Scaling

arXiv cs.LG · Preetam Sharma, Kacper Dobek · 2026-05-25

The paper introduces QAM-W, a joint 2D codebook quantization method for LLM weights that preserves pairwise coordinate structure via Hadamard rotation and activation-aware scaling. The approach L2-normalizes weight rows, applies block-Hadamard rotation, pairs coordinates, and quantizes using a Lloyd-Max codebook trained on unit circular Gaussian distributions. Evaluated across five LLMs (1.1B–13B parameters), QAM-W stays within ±0.4% of BF16 WikiText-2 perplexity at ≈5.5 bpw, matching SmoothQuant W8A8 quality with 32% fewer weight bits. Joint 2D coding outperforms polar coding by 2–15 pp ΔPPL, with Spearman ρ=0.99 between KL divergence and ΔPPL.
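The pipeline described above (normalize rows, rotate in blocks, pair coordinates, quantize each pair against a 2D codebook) can be sketched roughly as follows. This is an illustrative NumPy sketch, not the paper's implementation: the function names, the block size of 8, the 64-entry codebook (about 3 bits per weight, well below the ≈5.5 bpw reported), and the `sqrt(cols)` rescaling that brings rotated coordinates to roughly unit variance are all assumptions, and activation-aware scaling is omitted.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def lloyd_codebook(k, rng, n_samples=20000, iters=25):
    """Fit k 2D centroids to standard Gaussian samples with Lloyd's algorithm."""
    x = rng.standard_normal((n_samples, 2))
    centers = x[rng.choice(n_samples, size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centers[j] = members.mean(0)
    return centers

def quantize(w, codebook, block=8):
    """Return (codebook indices, row norms); w.shape[1] must be divisible by block."""
    rows, cols = w.shape
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    unit = w / norms                                   # L2-normalize each row
    h = hadamard(block)
    rotated = (unit.reshape(rows, cols // block, block) @ h).reshape(rows, cols)
    # Assumed rescaling: unit-norm rows have per-coordinate variance ~1/cols.
    pairs = (rotated * np.sqrt(cols)).reshape(rows, cols // 2, 2)
    d2 = ((pairs[:, :, None, :] - codebook[None, None, :, :]) ** 2).sum(-1)
    return d2.argmin(-1), norms

def dequantize(idx, norms, codebook, block=8):
    """Invert quantize(): look up centroids, undo scaling and rotation, restore norms."""
    rows, n_pairs = idx.shape
    cols = 2 * n_pairs
    rotated = codebook[idx].reshape(rows, cols) / np.sqrt(cols)
    h = hadamard(block)
    unit = (rotated.reshape(rows, cols // block, block) @ h.T).reshape(rows, cols)
    return unit * norms

rng = np.random.default_rng(0)
codebook = lloyd_codebook(64, rng)
w = rng.standard_normal((16, 32))
idx, norms = quantize(w, codebook)
w_hat = dequantize(idx, norms, codebook)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Quantizing coordinate pairs jointly lets the codebook follow the circular shape of the 2D Gaussian, which is the property the summary attributes to joint 2D coding.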

llm quantization · hadamard rotation · activation-aware scaling · lloyd-max codebook · perplexity preservation

Reparametrizing Shampoo and SOAP for Subspace Basis Updates and BFloat16 Storage

arXiv cs.LG · Alan Milligan, Zikun Xu, Simon Lacoste-Julien, Felix Dangel · 2026-05-25

The paper introduces a reparametrization of Shampoo-based optimization methods, including KL-Shampoo, SOAP, and KL-SOAP, to enable efficient BFloat16 (BF16) storage and reduce computational overhead. The approach updates only a subspace of the preconditioner's basis via QR decomposition, combining updated and unchanged basis vectors to form a complete basis. This mitigates performance degradation from BF16 storage while maintaining accuracy. Experiments show improved efficiency under BF16, with KL-SOAP matching or exceeding KL-Shampoo performance. The method enhances memory and time efficiency for Shampoo-based optimizers relying on QR decomposition.

shampoo-based methods · bf16 storage · qr decomposition · subspace basis · kl-soap

Semigroup Consistency as a Diagnostic for Learned Physics Simulators

arXiv cs.LG · Lennon J. Shikhman · 2026-05-25

The paper introduces normalized semigroup error as a diagnostic tool for evaluating learned physics simulators, addressing limitations of traditional one-step or short-horizon prediction metrics. The method leverages the semigroup property of autonomous, state-complete systems, comparing direct and composed predictions to assess temporal consistency. Experiments on 1D heat and Burgers equations using time-conditioned ConvNet and FNO baselines show a Spearman correlation ρ=0.635 (95% CI [0.621, 0.649]) between semigroup error and rollout degradation, while semigroup regularization yields mixed results.
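The comparison of direct and composed predictions can be illustrated with a small sketch. It assumes the normalized semigroup error is the norm of the difference between a direct prediction to time s+t and the composition of predictions over s and then t, divided by the norm of the direct prediction; the paper's exact normalization may differ. `heat_step` is an exact spectral solver for the periodic 1D heat equation, and `flawed_step` is a hypothetical surrogate built to violate the semigroup law.

```python
import numpy as np

def heat_step(u, t, nu=0.05):
    """Exact propagator for the periodic 1D heat equation on [0, 2*pi), via FFT."""
    k = np.fft.fftfreq(u.size, d=1.0 / u.size)  # integer wavenumbers
    return np.real(np.fft.ifft(np.fft.fft(u) * np.exp(-nu * k**2 * t)))

def flawed_step(u, t, nu=0.05):
    """Hypothetical surrogate: decay is not linear in t, so F(s+t) != F(t) after F(s)."""
    k = np.fft.fftfreq(u.size, d=1.0 / u.size)
    return np.real(np.fft.ifft(np.fft.fft(u) * np.exp(-nu * k**2 * t**1.2)))

def semigroup_error(step, u, s, t):
    """Assumed definition: ||F(u, s+t) - F(F(u, s), t)|| / ||F(u, s+t)||."""
    direct = step(u, s + t)
    composed = step(step(u, s), t)
    return np.linalg.norm(direct - composed) / np.linalg.norm(direct)

x = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
u0 = np.sin(x) + 0.5 * np.sin(3.0 * x)
err_exact = semigroup_error(heat_step, u0, 1.0, 1.0)
err_flawed = semigroup_error(flawed_step, u0, 1.0, 1.0)
print(err_exact, err_flawed)
```

The diagnostic needs no ground-truth trajectory: it only asks whether the model agrees with itself, which is why it can be computed on unlabeled initial states.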

semigroup consistency · physics simulators · temporal composition · long-horizon rollout · normalized error

MULTISEISMO: A Multimodal Seismic Dataset and Model for Cross-Modal Seismic Understanding

arXiv cs.LG · Sai Munikoti, Ian Stewart, Chengping Chai, Lisa Linville · 2026-05-25

The authors introduce MultiSeismo, a multimodal seismic dataset integrating waveform timeseries, geographical imagery, and metadata for 16K+ events (2010–2023), alongside MISCE, a structured instruction set for supervised training. They develop SeisModal by finetuning Unified IO 2 with a timeseries encoder, achieving superior performance on cross-modal seismic reasoning tasks compared to generalist models. Benchmarks demonstrate MultiSeismo's utility for domain-specific multimodal research, particularly in addressing time-series processing challenges.

multimodal dataset · seismic analysis · timeseries encoder · cross-modal reasoning · domain adaptation

Curriculum Learning for Safety Alignment

arXiv cs.LG · Sandeep Kumar, Virginia Smith, Chhavi Yadav · 2026-05-25

This paper introduces Staged-Competence, a curriculum learning framework to enhance the robustness of Direct Preference Optimisation (DPO) for safety alignment in language models. The method organizes preference data by difficulty, employs competence-based sampling, and progressively updates the reference model during training. Results show a 16% reduction in out-of-distribution harmful response rates and a 20% decrease in jailbreak attack success rates across three model families, while maintaining general capabilities with minimal over-refusal. Staged-Competence achieves baseline safety performance with only 75% of training data and improves separation between safe and unsafe responses.

staged-competence · direct preference optimisation · safety alignment · curriculum learning · out-of-distribution

Classification and detection of multiple UAVs using rational Gaussian wavelet neural networks

arXiv cs.LG · Ungvári Gergő, Ferenc Braun, Attila Ámon, Péter Kackstädter · 2026-05-25

The paper introduces a cost-effective UAV detection system using acoustic signals processed through rational Gaussian wavelet neural networks. The method employs interpretable adaptive wavelet transformations integrated with a neural network for feature extraction and classification, enabling detection of both single UAVs and swarms. Evaluated on indoor and outdoor datasets, the approach outperforms traditional machine learning methods while maintaining interpretability. Implementation is publicly available for reproducibility.

uav detection · rational gaussian wavelets · adaptive feature extraction · interpretable machine learning · acoustic signal processing

Dynamic Link Prediction with Temporally Enhanced Signed Graph Neural Networks

arXiv cs.LG · Derek Regier, Andrew Polyak, Aresh Dadlani, Khosro Salmani · 2026-05-25

The authors propose a modular temporal enhancement framework for signed graph neural networks (GNNs) to address dynamic link prediction in temporal signed networks (TSNs). The framework integrates historical context via a Historical Context Integration Module (HCIM) combining learnable temporal weighting, LSTM-based trajectory modeling, and multi-head temporal attention, with node-adaptive fusion strategies. When applied to the Self-Explainable Signed Graph Transformer (SE-SGformer), the approach achieves statistically significant improvements over static baselines on Bitcoin OTC, Bitcoin Alpha, Reddit, and synthetic small-world networks.

temporal signed networks · graph neural networks · historical context integration · dynamic link prediction · balance-theoretic constraints

Stateful Inference for Low-Latency Multi-Agent Tool Calling

arXiv cs.LG · Victor Norgren · 2026-05-25

The paper introduces a stateful inference architecture for low-latency multi-agent tool calling, addressing the inefficiency of reprocessing unchanged prompts in conventional LLM serving. The method employs a persistent KV cache across turns, a radix prefix cache for interleaved multi-agent traffic, and a prompt-lookup speculative decoder for structured output acceleration. This reduces per-turn cost from $O(n_t)$ to $O(\Delta_t)$. Evaluated against vLLM and SGLang on generated workloads, the implementation achieves a $2.1\times$ speedup on a 6-turn workflow and $4.2\times$ on the median turn of a 35-turn workflow, halving end-to-end wall time through stateful reuse and speculation.
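A toy model can make the prefix-reuse idea concrete. The sketch below is an assumption-laden illustration, not the paper's system: `PrefixCache` is a hypothetical token trie standing in for a radix prefix cache, it stores no real KV tensors, and "cost" is simply the number of tokens that would need a fresh forward pass. Eviction and speculative decoding are ignored.

```python
class PrefixCache:
    """Toy token trie standing in for a radix prefix cache (no real KV tensors)."""

    def __init__(self):
        self.root = {}

    def process(self, tokens):
        """Insert a prompt; return how many tokens were not already cached as a prefix."""
        node, new_tokens = self.root, 0
        for tok in tokens:
            if tok not in node:
                node[tok] = {}
                new_tokens += 1
            node = node[tok]
        return new_tokens

# Two agents ("A" and "B") share a system prompt and take interleaved turns.
turns = [
    ["sys", "A", "u1"],
    ["sys", "B", "u1"],
    ["sys", "A", "u1", "tool1", "u2"],
    ["sys", "B", "u1", "tool1"],
]

cache = PrefixCache()
stateless_cost = sum(len(t) for t in turns)            # reprocess every prompt in full
stateful_cost = sum(cache.process(t) for t in turns)   # only pay for unseen suffixes
print(stateless_cost, stateful_cost)
```

Because each agent's prompt is a prefix of its next prompt, the stateful count grows with the tokens added per turn rather than with the full prompt length, even when the two agents' requests interleave.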

kv cache · stateful inference · multi-agent · speculative decoding · radix prefix cache

Beyond Differences: Doubly Robust Meta-Learners for Ratio-Based Treatment Effects

arXiv cs.LG · Michael Fuchs, Dominik Kreiss · 2026-05-25

The paper introduces the Q-Learner, a meta-learner for estimating ratio-based conditional average treatment effects (CATE) $\tau(x) = \mathbb{E}[Y \mid W=1, X=x] / \mathbb{E}[Y \mid W=0, X=x]$, which decomposes $\tau(x)$ into a product of two odds ratios, reducing estimation to two propensity classification tasks. Doubly robust augmentations are derived for both S/T- and Q-style ratio learners, with distinct robustness properties characterized. On seven RCT datasets, the Q-Learner performs best in low-conversion regimes by avoiding imbalanced regression issues. On four observational datasets, the doubly robust learners excel, establishing them as defaults for confounded observational data.
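For a binary outcome, Bayes' rule gives one identity that fits this description: at a fixed x, P(Y=1|W=1) / P(Y=1|W=0) equals the odds of treatment among converters divided by the overall treatment odds. The sketch below checks that identity numerically. It is an illustration under that binary-outcome assumption, the function names are hypothetical, and the paper's exact construction may differ.

```python
import math

def ratio_direct(p_y1_given_w1, p_y1_given_w0):
    """Ratio effect computed from the two outcome rates directly."""
    return p_y1_given_w1 / p_y1_given_w0

def ratio_from_odds(p_w1, p_y1_given_w1, p_y1_given_w0):
    """Same ratio rebuilt from two treatment-classification quantities."""
    p_y1_w1 = p_y1_given_w1 * p_w1            # P(Y=1, W=1)
    p_y1_w0 = p_y1_given_w0 * (1.0 - p_w1)    # P(Y=1, W=0)
    # Classification task 1: probability of treatment among converters (Y=1).
    p_w1_given_y1 = p_y1_w1 / (p_y1_w1 + p_y1_w0)
    odds_among_converters = p_w1_given_y1 / (1.0 - p_w1_given_y1)
    # Classification task 2: the ordinary propensity P(W=1).
    odds_propensity = p_w1 / (1.0 - p_w1)
    return odds_among_converters / odds_propensity

# Low-conversion example: 6% vs 4% conversion, 30% of units treated.
direct = ratio_direct(0.06, 0.04)
via_odds = ratio_from_odds(0.30, 0.06, 0.04)
print(direct, via_odds)
```

In this form neither classifier has to regress a rare outcome directly, which is one plausible reading of why the approach helps in low-conversion regimes.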

ratio-based cate · doubly robust · meta-learner · propensity classification · observational data

Two-Parameter Flows for Learning Population Dynamics of Physical Systems

arXiv cs.LG · Paul Schwerdtner, Tobias Blickhan, Benjamin Peherstorfer · 2026-05-25

The paper introduces two-parameter flows for learning high-dimensional probability density dynamics from unlabeled samples without trajectory data. The method constructs sampling-time transports from a base distribution to each marginal via conditional flow matching, then derives physics-time velocity fields by regressing on synthetic coupled trajectories. Theoretical analysis shows uniqueness of the resulting dynamics and regularity inheritance from sampling-time transports. The approach scales to high dimensions, avoids per-step optimal transport couplings, and supports non-gradient dynamics for modeling rotational phenomena.

two-parameter flows · conditional flow matching · probability density dynamics · optimal-transport couplings · non-gradient dynamics

Benchmarking Convolutional, Transformer, Hybrid, and Vision Language Models for Multi Disease Retinal Screening

arXiv cs.LG · Durjoy Dey, Aymane Ajbar, Yuhong Yan · 2026-05-25

The study benchmarks 12 architectures across four model families (CNNs, vision transformers, hybrid CNN-transformers, and vision-language models) for multi-disease retinal screening using RFMiD. Standardized protocols evaluate binary screening (AUC >84%) and multi-label classification across 28 diseases, reporting AUC, F1, and sensitivity at 80% specificity. SwinTiny, CoAtNet0, and MaxViTTiny outperform others, with attention-based models excelling in both tasks; vision-language models (CLIP ViT-B/16, SigLIP-Base384) match CNN baselines but trail top performers. External validation on Messidor-2 shows hybrid/transformer models lead (AUC 66.8–84.7%).
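Sensitivity at a fixed specificity, one of the metrics reported above, can be computed as in the following sketch. The thresholding rule is an assumption (predict positive when the score exceeds the smallest threshold that keeps at least 80% of negatives below or at it); the benchmark's own evaluation code may interpolate on the ROC curve instead.

```python
import numpy as np

def sensitivity_at_specificity(y_true, scores, target_spec=0.80):
    """Fraction of positives flagged at a threshold that keeps specificity >= target_spec."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    neg = np.sort(scores[y_true == 0])
    # Index of the smallest negative score that leaves >= target_spec of negatives at or below it.
    idx = int(np.ceil(target_spec * neg.size - 1e-9)) - 1
    threshold = neg[idx]
    return float(np.mean(scores[y_true == 1] > threshold))

labels = [0, 0, 0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.9, 0.5, 0.95, 0.3]
sens = sensitivity_at_specificity(labels, scores)
print(sens)
```

Fixing specificity first reflects the screening setting, where the tolerated false-referral rate is chosen up front and models are compared on how many diseased cases they still catch.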

retinal screening · vision transformers · multi-label classification · domain shift · auc

Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization

arXiv cs.LG · Xiaoyuan Cheng, Wenxuan Yuan, Zhancun Mu, Yuanzhao Zhang · 2026-05-25

The paper introduces Model-Based Diffusion Policy Optimization (MBDPO), a framework that unifies search and policy optimization in world-model RL through diffusion policy representations. MBDPO reformulates policy optimization as a diffusion process over searched trajectories in latent world models, extracting an implicit energy function to mitigate training inconsistency. Evaluations across multi-task offline pretraining, online learning, and offline-to-online fine-tuning demonstrate consistent performance gains, with offline pretraining showing monotonic scaling with model capacity.

model-based reinforcement learning · diffusion policy · world models · policy optimization · offline pretraining

Learning Nonlinear Factor Models with Unknown Monotone Links from Incomplete and Noisy Data

arXiv cs.LG · Yutong Chao, Resat Gökhan, Jalal Etesami, Ali Habibnia · 2026-05-25

The paper introduces a nonlinear factor model with unknown monotone link functions in RKHS, addressing identifiability and nonconvexity via projected BCD with explicit regularization. The method jointly recovers low-rank factors, loadings, and link functions from incomplete/noisy data, with convergence guarantees under incoherence conditions and sublinear regret bounds for link updates. Synthetic experiments validate the framework's extension of linear factor models to nonlinear regimes.

nonlinear factor model · monotone link function · reproducing kernel hilbert space · block coordinate descent · latent factor recovery

Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

arXiv cs.LG · Tuna Tuncer, Felix Becker, Thomas Pfeil · 2026-05-25

The paper introduces a bias correction method for KV-cache compression in chunk-wise autoregressive video diffusion models, addressing the Jensen bias caused by quantization noise in attention weights. The method computes a per-attention-score correction on the fly using quantization step sizes and query norms, employing a second-order Taylor approximation for negligible computational overhead. Evaluated on MAGI-1, SkyReels-V2, and HY-WorldPlay at INT2 quantization, the correction recovers most quality lost to aggressive quantization, achieving near-BF16 video quality and outperforming INT4 quantization with 50% less memory.
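The Jensen bias can be illustrated with a simple noise model. The sketch assumes each quantized key coordinate carries independent rounding noise uniform on [-step/2, step/2]; under that assumption the expected exponentiated score is inflated by a factor whose logarithm has a closed form, and its second-order Taylor approximation is ||q||² · step² / 24. This is a textbook-style derivation under the stated noise model, not the paper's exact correction.

```python
import numpy as np

def exact_log_bias(q, step):
    """log E[exp(q . e)] for e_i ~ Uniform(-step/2, step/2), independent per coordinate.

    Assumes no entry of q is exactly zero (sinh(x)/x is then taken as its limit, 1).
    """
    x = q * step / 2.0
    return float(np.sum(np.log(np.sinh(x) / x)))

def taylor_log_bias(q, step):
    """Second-order approximation: 0.5 * Var(q . e) = ||q||^2 * step^2 / 24."""
    return float(np.dot(q, q) * step**2 / 24.0)

rng = np.random.default_rng(0)
q = 0.5 * rng.standard_normal(64)   # query vector, with any score scaling folded in

coarse_step, fine_step = 0.5, 0.05  # e.g. a coarsely vs finely quantized key
bias_coarse = exact_log_bias(q, coarse_step)
bias_fine = exact_log_bias(q, fine_step)
approx_coarse = taylor_log_bias(q, coarse_step)

# Subtracting the approximate bias from the logit removes most of the inflation.
residual = bias_coarse - approx_coarse
print(bias_coarse, approx_coarse, bias_fine, residual)
```

Because the inflation grows with the square of the step size, coarsely quantized keys receive systematically more attention than finely quantized ones with the same clean score, which matches the "keys steal attention" framing in the title.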

kv-cache · quantization · attention weights · jensen bias · video diffusion

Prospective evaluation of multimodal respiratory failure prediction: Do chest X-rays improve performance beyond EHR signals?

arXiv cs.LG · Xiaolei Lu, Shamim Nemati · 2026-05-25

The study demonstrates that integrating chest X-ray (CXR) representations with EHR data improves prospective prediction of invasive mechanical ventilation in ICU patients. A gated multimodal framework selectively combines CXR features from REMEDIS/MedInsight foundation models with EHR time-series data, adapting to patient-specific clinical context. Evaluation shows AUROC improvements (0.860/0.858 vs. 0.752 for EHR-only Vent.io) and enhanced specificity/PPV, outperforming physician predictions in sensitivity. Results validate adaptive multimodal fusion for respiratory failure prediction.

multimodal fusion · respiratory failure prediction · foundation models · electronic health records · adaptive gating

Unified Neural Scaling Laws

arXiv cs.LG · Ethan Caballero, Priyank Jaini, David Krueger, Irina Rish · 2026-05-25

The authors propose Unified Neural Scaling Laws (UNSL), a functional form that models and extrapolates scaling behaviors of deep neural networks across multiple simultaneous dimensions, including model parameters, dataset size, training steps, inference steps, compute, and hyperparameters. UNSL is validated across diverse architectures and tasks, including large-scale vision, language, math, and reinforcement learning. Compared to existing scaling laws, UNSL demonstrates significantly more accurate extrapolations of scaling behavior across this varied task set.

scaling laws · deep neural networks · extrapolation · hyperparameters · reinforcement learning

The Bridge-Garden Dilemma in LLM Distillation: Why Mixing Hard and Soft Labels Works

arXiv cs.LG · Guanghui Wang, Kaiwen Lv Kacuila, Zhiyong Yang, Zitai Wang · 2026-05-25

The paper introduces the Bridge-Garden Decomposition theory to explain why hybrid hard/soft label knowledge distillation (KD) outperforms pure approaches in LLM compression. It posits that generation alternates between exact 'Bridge' tokens (best served by hard labels) and flexible 'Garden' tokens (where soft labels preserve diversity), reducing exposure bias. The proposed adaptive hybrid supervision method achieves 9.7x faster training while outperforming divergence-based and on-policy KD baselines across seven teacher-student pairs (including Qwen and Llama) on reasoning and coding benchmarks.

knowledge distillation · exposure bias · bridge-garden decomposition · hybrid supervision · model compression

From Privacy to Generalization: Linear Max-Information Bounds for DP-SGD

arXiv cs.LG · Christoph H. Lampert, Hossein Zakerinia · 2026-05-25

The paper establishes a finite-sample bound on the approximate max-information of differentially private stochastic gradient descent (DP-SGD), matching the linear dataset-size scaling of the classic ε-differential privacy result of Dwork et al. (2015). By analyzing DP-SGD through max-information, the work derives two generalization bounds: a PAC-Bayes bound with a learnable prior distribution and an explicit complexity term controlled by optimization hyperparameters. These results bridge privacy and generalization theory for deep networks trained with DP-SGD.

differential privacy · max-information · pac-bayes · generalization bounds · dp-sgd

Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

arXiv cs.LG · Zhaoyu Zhu, Rui Gao, Shuang Li · 2026-05-25

The paper establishes global convergence for Wasserstein policy gradient (WPG) in entropy-regularized reinforcement learning (RL), addressing a gap in understanding its theoretical properties. By leveraging the Bellman structure, the authors derive a Bellman-based argument replacing convexity: the soft Bellman residual admits a statewise KL divergence representation, while Bellman contraction links this residual to the optimality gap. Combining a uniform log-Sobolev inequality (LSI) for Gibbs policies with a distributional Polyak–Łojasiewicz condition, they prove geometric convergence up to discretization bias. The analysis reveals that entropy-regularized RL exhibits favorable PL-type geometry despite non-convexity.

wasserstein policy gradient · entropy-regularized rl · bellman residual · polyak–łojasiewicz condition · log-sobolev inequality

Rethinking Weak Supervision in Anomaly Detection: A Comprehensive Benchmark

arXiv cs.LG · Xu Yao, Siyuan Zhou, Zhenbo Wu, Chaochuan Hou · 2026-05-25

The paper introduces WSADBench, the first unified benchmark for weakly supervised anomaly detection (WSAD), evaluating 36 algorithms across 4 modalities under varying label quantity, granularity, and quality. Across 700K experiments, it finds that (i) weak supervision scenarios are strongly correlated, (ii) specialized WSAD methods are outperformed by tabular foundation models as supervision increases, (iii) unlabeled data has inconsistent utility, and (iv) sensitivity to label noise is asymmetric. The benchmark provides standardized protocols and open-source resources for future WSAD research.

weakly supervised anomaly detection · tabular foundation models · label noise · benchmarking · open-source

📰 Industry Media (7)

NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code

MarkTechPost · Asif Razzaq · 2026-05-27

NVIDIA introduces Polar, a token-faithful rollout framework for GRPO training that enables reinforcement learning across diverse agent harnesses (e.g., Codex, Claude Code, Qwen Code) without modifying their native execution paths. Polar employs a model API proxy to capture token-level interactions, normalizes requests/responses across providers (Anthropic, OpenAI, Google), and reconstructs trajectories via per-request or prefix-merging strategies. Evaluated on SWE-Bench with Qwen3.5-4B, Polar achieves a 22.6-point gain on Codex and reduces wall-clock time 5.39× via prefix-merging, while maintaining harness-agnostic operation for both online RL and offline SFT data generation.

grpo training · token-faithful rollout · agent harness · prefix-merging · swe-bench

Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference

MarkTechPost · Michal Sutter · 2026-05-27

EAGLE 3.1 introduces architectural improvements to speculative decoding, addressing attention drift in LLM inference. The method applies FC normalization after each target hidden state and feeds post-norm hidden states into subsequent decoding steps, stabilizing drafter inputs and improving robustness. Benchmarks on Kimi K2.6 demonstrate 2.03× higher per-user throughput at concurrency 1, with sustained speedups at higher concurrency levels. EAGLE 3.1 achieves up to 2× longer acceptance length in long-context workloads compared to EAGLE 3, while maintaining backward compatibility. The model is integrated into vLLM and supported by TorchSpec for efficient training.

speculative decoding · attention drift · fc normalization · hidden states · long-context

MEMO: A Modular Framework for Training a Dedicated Memory Model on New Knowledge Without Modifying LLM Parameters

MarkTechPost · Asif Razzaq · 2026-05-27

MEMO introduces a modular framework for integrating new knowledge into large language models (LLMs) without modifying their parameters, addressing limitations of retrieval-augmented generation (RAG), fine-tuning, and latent memory methods. It employs a dedicated MEMORY model (e.g., Qwen2.5-14B-Instruct) trained via a five-step data synthesis pipeline to internalize knowledge from a target corpus, while the EXECUTIVE model (e.g., Qwen2.5-32B-Instruct or Gemini-3-Flash) remains frozen and queries the MEMORY model through a structured multi-turn protocol. MEMO achieves 53.58% on NarrativeQA, 60.20% on MuSiQue, and 66.67% on BrowseComp-Plus, outperforming baselines like HippoRAG2 and demonstrating robustness to retrieval noise and architectural variations.

retrieval-augmented generation · structured multi-turn protocol · supervised fine-tuning · catastrophic forgetting · latent memory

Design a High-Precision Retrieve-and-Rerank Pipeline with ZeroEntropy Zerank-2 Reranker

MarkTechPost · Sana Hassan · 2026-05-26

The tutorial introduces a high-precision retrieve-and-rerank pipeline leveraging ZeroEntropy Zerank-2, a 4B Qwen3-based cross-encoder reranker, to enhance retrieval quality. The pipeline employs a two-stage approach: a fast bi-encoder retrieves candidates, followed by Zerank-2 reranking for improved precision. Evaluated using NDCG@10 across finance, legal, and code domains, Zerank-2 demonstrates significant reranking lift, improving average NDCG@10 by +0.1234. The pipeline achieves practical throughput of 45.7 pairs per second in batched inference, showcasing its utility in retrieval-augmented generation and semantic search systems.
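The two-stage idea and the NDCG@10 metric can be sketched without any model dependencies. In the sketch below `cheap_score` and `precise_score` are hypothetical stand-ins for a bi-encoder similarity and a cross-encoder reranker such as Zerank-2; no real ZeroEntropy API is called. The NDCG variant uses linear gains and normalizes against the relevances in the ranked list only, a common simplification that may differ from the tutorial's evaluation code.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0

def retrieve_then_rerank(query, docs, cheap_score, precise_score,
                         n_candidates=50, k=10):
    """Stage 1: shortlist with a fast scorer. Stage 2: reorder the shortlist precisely."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d),
                        reverse=True)[:n_candidates]
    return sorted(candidates, key=lambda d: precise_score(query, d),
                  reverse=True)[:k]

# Toy scorers: the cheap one prefers long documents, the precise one short documents.
docs = ["a", "bb", "ccc", "dddd"]
top = retrieve_then_rerank(
    "q", docs,
    cheap_score=lambda q, d: len(d),
    precise_score=lambda q, d: -len(d),
    n_candidates=3, k=2,
)
print(top, ndcg_at_k([0, 1, 2, 3]), ndcg_at_k([3, 2, 1, 0]))
```

The expensive scorer only sees the shortlist, so its cost scales with `n_candidates` rather than with the corpus size, while the reranking lift shows up as the gap between the NDCG of the stage-1 order and the stage-2 order.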

cross-encoder · ndcg@10 · retrieve-and-rerank · bi-encoder · qwen3

Stability AI Releases Stable Audio 3: A Family of Fast Latent Diffusion Models for Audio Generation and Editing

MarkTechPost · Asif Razzaq · 2026-05-26

Stability AI introduces Stable Audio 3, a family of latent diffusion models for stereo audio generation (44.1 kHz) with variable-length outputs, inpainting-based editing, and fast inference. The architecture comprises a SAME autoencoder (108M–852M params) for 4096× latent compression and a diffusion transformer (459M–2.7B params) conditioned on text, duration, and masks. Key innovations include differential attention, variable-length training via silence augmentation, and a three-stage pipeline (flow matching pre-training, distillation warmup, adversarial post-training). Evaluations show FAD scores of 0.101 (large) and 0.107 (medium) on music generation, with inference times as low as 0.45s for 120s audio on H200 hardware.

latent diffusion · autoencoder · differential attention · flow matching · inpainting

Google folds Display Ads into AI-first Demand Gen platform

AI News · Ryan Daws · 2026-05-27

Google transitions from manual Display Ads to an AI-driven Demand Gen platform, automating ad placement and creative optimization across YouTube, Discover, and Gmail. The platform leverages predictive models to dynamically assemble uploaded assets into in-stream video ads, YouTube Shorts, and interactive Discover posts, optimizing for conversions and brand lift. This shift necessitates higher-volume, format-agnostic content creation and tighter integration with business intelligence systems for real-time conversion data. The move reflects broader industry trends toward AI-driven ad targeting and creative automation, exemplified by Meta's Advantage+ campaigns.

predictive models · conversion optimization · format-agnostic content · real-time conversion data · ai-driven targeting

Exploring the Benefits of AI Bots for Forex Trading in Forex Markets

AI News · Bazoom · 2026-05-27

AI-driven automated trading systems enhance forex market participation by reducing emotional bias, enabling 24/7 operation, and improving execution speed. These systems leverage predefined logic, backtesting on historical data, and real-time pattern recognition to optimize entry/exit strategies. By automating risk management through stop-loss and take-profit limits, they ensure disciplined adherence to trading plans. Advanced tools integrate predictive analytics and machine learning to adapt to volatile market conditions, democratizing access to institutional-grade technology for retail traders. This structured approach fosters consistency and control in forex trading, though no system guarantees results.

backtesting · stop-loss · take-profit · predictive analytics · machine learning


Generated automatically at 2026-05-27 21:29 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.