Daily Digest — 2026-05-21

Wednesday, May 20, 2026 · 338 items · model: deepseek/deepseek-chat

338 items · 7 research labs, 323 arxiv papers, 8 industry media

🏛️ Research Labs (7)

An OpenAI model has disproved a central conjecture in discrete geometry

OpenAI News · 2026-05-20

An OpenAI general-purpose reasoning model autonomously disproved Erdős's 1946 planar unit distance conjecture in discrete geometry, demonstrating polynomial improvement over the previously best-known square grid constructions. The proof leverages unexpected connections to algebraic number theory, specifically infinite class field towers and Golod–Shafarevich theory, to construct point configurations with Ω(n^(1+c)) unit-distance pairs for some c > 0. External mathematicians verified the result, noting its significance as the first AI resolution of a central open problem and its introduction of deep number-theoretic techniques to geometric combinatorics.

planar unit distance problem · algebraic number theory · golod–shafarevich theory · combinatorial geometry · infinite class field towers
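To make the counted quantity in the unit distance item concrete, here is a minimal sketch that counts unit-distance pairs in a small point set. The helper names `count_unit_pairs` and `square_grid` are illustrative, not from the source. A plain unit-spaced grid like this one yields only about 2n pairs for n points, so it shows what is being counted rather than the constructions the result refers to.

```python
from itertools import combinations
from math import dist, isclose


def count_unit_pairs(points, tol=1e-9):
    """Count unordered pairs of points whose Euclidean distance is 1."""
    return sum(
        1
        for p, q in combinations(points, 2)
        if isclose(dist(p, q), 1.0, abs_tol=tol)
    )


def square_grid(side):
    """Integer lattice points (x, y) with 0 <= x, y < side (illustrative helper)."""
    return [(x, y) for x in range(side) for y in range(side)]


# A side-by-side grid has 2 * side * (side - 1) horizontal and vertical unit pairs.
pairs_3x3 = count_unit_pairs(square_grid(3))
pairs_4x4 = count_unit_pairs(square_grid(4))
```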

How Ramp engineers accelerate code review with Codex

OpenAI News · 2026-05-20

Ramp engineers leverage OpenAI's Codex with GPT-5.5 to accelerate code review and develop agentic tooling, reducing pull request feedback time from hours to minutes. The system employs advanced reasoning capabilities to analyze codebases with thoroughness exceeding human reviewers, while offering both CLI and GUI interfaces for developer flexibility. Results include a 4x productivity boost in code reviews and successful deployment of an On-Call Assistant agent for incident management, with engineers reporting increased confidence in shipped improvements.

codex · gpt-5.5 · agentic tooling · pull request · on-call assistant

The next phase of OpenAI’s Education for Countries

OpenAI News · 2026-05-20

OpenAI expands its Education for Countries initiative, integrating AI tools like ChatGPT and Codex into national education systems to enhance learning outcomes and economic opportunities. The program focuses on research-driven deployment, localized AI tools, and teacher training, partnering with countries such as Estonia, Jordan, Greece, Kazakhstan, Slovakia, and Singapore. Early results include Estonia's ChatGPT Edu reaching 20,000 students and 4,600 teachers, Jordan's Siraj engaging 1 million students, and Kazakhstan's 84,000 educators completing AI-readiness training. OpenAI collaborates with governments and educators to measure impact, share findings, and scale effective practices.

chatgpt · codex · ai literacy · research-driven deployment · teacher training

Introducing OpenAI for Singapore

OpenAI News · 2026-05-19

OpenAI announces a S$300M partnership with Singapore's Ministry of Digital Development and Information (MDDI) to establish an Applied AI Lab, creating 200+ technical roles and focusing on three key areas: frontier AI deployment, local AI talent development, and broad economic AI adoption. The initiative includes an Applied AI Lab (OpenAI's first outside the US), Forward-Deployed Engineer training, and collaborations with education/government agencies on AI-enabled learning tools. The program targets public service, finance, healthcare, and digital infrastructure sectors while supporting startups and SMEs through accelerator programs and workshops.

applied ai lab · forward-deployed engineers · ai-enabled learning · frontier ai deployment · codex

We’re announcing new community investments in Missouri.

Google AI Blog · 2026-05-20

Google announced new infrastructure investments in Missouri, including a data center in Montgomery County and a $20 million Energy Impact Fund to reduce utility costs. The initiative involves a Capacity Commitment Framework with Ameren to develop over 500 megawatts of additional capacity. The project is expected to generate nine local jobs per direct position and includes workforce training programs, such as partnerships with the Construction Laborers and Contractors Joint Training Fund of Eastern Missouri, to prepare thousands for skilled roles.

data center · capacity commitment framework · energy impact fund · workforce training · megawatts

100 things we announced at I/O 2026

Google AI Blog · Keyword Team · 2026-05-20

Google announced Gemini 3.5 Flash, a high-speed multimodal model achieving 76.2% on Terminal-Bench 2.1 and 83.6% on MCP Atlas, optimized for agentic tasks. Gemini Omni introduced video generation with physics-aware synthesis and SynthID watermarking. AI Search upgraded to Gemini 3.5 Flash, featuring generative UI via Antigravity platform for dynamic interface creation. Universal Cart leveraged Gemini for cross-platform shopping with UCP checkout. Personal agent Gemini Spark, built on Antigravity, launched in beta for autonomous task execution with user oversight.

gemini 3.5 flash · synthid watermarking · generative ui · universal commerce protocol · agentic tasks

A new experiment brings better group meetings to Google Beam

Google AI Blog · Mohamed Abdelgany · 2026-05-20

Google Beam introduces an experimental feature that uses HP Dimension's immersive display to make group meetings more inclusive, rendering remote participants in true-to-life proportions with spatial audio. The feature automatically adjusts participant positioning and audio anchoring, simulating an in-room experience across devices. Initial research indicates a 50% improvement in social connection and a 21% increase in conversational contribution. The integration extends compatibility with Google Workspace and Zoom, aiming to bridge the hybrid inclusion gap in video conferencing.

immersive display · spatial audio · hybrid inclusion gap · true-to-life proportions · video conferencing

📜 arXiv Papers (323)

Atoms of Thought: Universal EEG Representation Learning with Microstates

arXiv cs.AI · Xinyang Tian, Ruitao Liu, Ziyi Ye, Siyang Xue · 2026-05-19

The paper proposes microstates as a universal EEG representation, demonstrating superior performance over traditional time- and frequency-domain features. The method clusters continuous EEG signals into discrete microstate sequences using a tokenizer trained on a large medical dataset, then applies this representation to downstream tasks including sleep staging, emotion recognition, and motor imagery classification. Experiments show improved accuracy across tasks, with additional benefits in interpretability and scalability for cognitive neuroscience and clinical applications.

eeg representation · microstates · brain-computer interfaces · universal tokenizer · neuroinformatics

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

arXiv cs.AI · Vasundra Srinivasan · 2026-05-19

The paper introduces the stochastic-deterministic boundary (SDB) as a foundational architectural primitive for production LLM agents, defining it as a four-part contract governing LLM output integration into system actions. It organizes runtime design into Coordination, State, and Control, presenting six runtime patterns adapted from distributed systems for conversational, autonomous, and long-horizon agents. A five-step methodology is proposed for pattern selection, alongside a diagnostic procedure for failure analysis and identification of replay divergence. The reliability decomposition separates model variance from architectural momentum, emphasizing SDB strength as model variance decreases. The methodology is applied to five workloads, with a reference implementation for a contract-renewal agent.

stochastic-deterministic boundary · runtime patterns · replay divergence · architectural momentum · production llm agents

Long-term Power Grid Planning via Answer Set Programming

arXiv cs.AI · Antonio Ielo, Francesco Doria, Sandra Castellanos-Paez, Marco Maratea · 2026-05-19

The paper introduces the first Answer Set Programming (ASP)-based method for automated long-term power grid planning, addressing sustainability targets and demand patterns. ASP elegantly encodes topological and combinatorial invariants that are cumbersome in traditional planning languages. Experimental evaluations on synthetic and real-world grid data demonstrate the approach's expressive power and effectiveness in maintaining supply continuity and service quality over decade-long developments.

answer set programming · power grid planning · topological invariants · combinatorial optimization · sustainability targets

HaorFloodAlert: Deseasonalized ML Ensemble for 72-Hour Flood Prediction in Bangladesh Haor Wetlands

arXiv cs.AI · Salma Hoque Talukdar Koli, Fahima Haque Talukder Jely, Md. Samiul Alim, Md. Zakir Hossen · 2026-05-19

HaorFloodAlert introduces a deseasonalized machine learning ensemble for 72-hour flood prediction in Bangladesh's Sunamganj Haor wetlands, addressing limitations of riverine flood models. The ensemble combines Random Forest (0.5625) and XGBoost (0.4375), leveraging upstream Barak River Sentinel-1 SAR data with Otsu-thresholded change detection for validation (84-91% spatial match). It achieves 89.6% LOOCV accuracy, 87.5% recall, and 0.943 AUC-ROC on 77 Sentinel-1 events, while mitigating temperature-induced seasonal bias that inflated accuracy by 6.9 pp. The system includes a three-tier alert pipeline and a BRRI-calibrated boro rice damage estimator.

deseasonalized · ensemble · otsu-thresholded · sentinel-1 · auc-roc
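The HaorFloodAlert summary gives ensemble weights of 0.5625 (Random Forest) and 0.4375 (XGBoost) but does not say how they are applied. A minimal sketch, assuming the weights combine per-model flood probabilities by soft voting; the function names and the 0.5 alert threshold are illustrative assumptions, not details from the paper.

```python
RF_WEIGHT = 0.5625   # Random Forest weight quoted in the summary
XGB_WEIGHT = 0.4375  # XGBoost weight quoted in the summary


def ensemble_probability(p_rf, p_xgb):
    """Weighted average of the two models' flood probabilities (assumed soft voting)."""
    return RF_WEIGHT * p_rf + XGB_WEIGHT * p_xgb


def flood_alert(p_rf, p_xgb, threshold=0.5):
    """Return True when the blended probability reaches the (assumed) threshold."""
    return ensemble_probability(p_rf, p_xgb) >= threshold


blended = ensemble_probability(0.8, 0.4)  # 0.45 + 0.175
```

Because the two weights sum to 1, the blended value stays a valid probability whenever both inputs are.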

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

arXiv cs.AI · Utkarsh Tyagi, Xingang Guo, MohammadHossein Rezaei, Daniel George · 2026-05-19

The paper introduces POW3R, a policy-aware rubric reward framework for reinforcement learning with verifiable rewards. POW3R dynamically adapts criterion-level reward weights during training, emphasizing criteria that distinguish rollouts while preserving human-assigned importance and category balance. This approach improves upon static aggregation methods, which conflate importance with optimization signal usefulness. Evaluated across three base policies on two datasets, POW3R outperforms vanilla GRPO in 24 of 30 comparisons, achieving higher mean rubric reward and strict completion rates while converging 2.5–4× faster.

rubric rewards · reinforcement learning · policy-aware · dynamic weighting · grpo

Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models

arXiv cs.AI · Guangzhi Xiong, Qiao Jin, Sanchit Sinha, Zhiyong Lu · 2026-05-19

The study evaluates visual attribution methods in Large Vision Language Models (LVLMs) for chest X-ray reasoning, revealing their frequent failure to identify evidence used by models. A causal evaluation framework is developed, verifying expert-annotated regions via counterfactual editing. MedFocus, a novel concept-based attribution method, outperforms 11 existing methods by localizing anatomical regions through unbalanced optimal transport and measuring causal effects via targeted interventions. Results across six LVLMs and two output modes demonstrate MedFocus's superiority in producing spatial, concept-level, and token-level attributions.

lvlms · visual attribution · chest x-ray · optimal transport · causal evaluation

Less Back-and-Forth: A Comparative Study of Structured Prompting

arXiv cs.AI · Saurav Ghosh, Gabriella Polach, Abdou Sow · 2026-05-19

This study evaluates structured prompting techniques for improving LLM response quality and efficiency. Comparing raw, checklist-improved, and clarifying-question prompts across ChatGPT, Claude, and Grok on summarization, planning, explanation, and coding tasks, checklist prompts achieved the highest mean rubric score (7.50/8) with superior quality-effort tradeoffs. Results demonstrate that structured prompting reduces interaction overhead while enhancing output quality.

llms · prompt engineering · checklist prompts · response quality · interaction efficiency

Beyond Prediction Accuracy: Target-Space Recovery Profiles for Evaluating Model-Brain Alignment

arXiv cs.AI · Ken Nakamura, Tomoya Nakai, Ryuto Yashiro, Ayumu Yamashita · 2026-05-19

The study introduces a framework for evaluating model-brain alignment beyond prediction accuracy by identifying which reproducible response dimensions of the target brain are recovered. Using repeated fMRI measurements from the Natural Scenes Dataset (eight subjects viewing natural images), the method first identifies reproducible dimensions in early-to-intermediate visual cortex responses, then quantifies their recovery via brain-to-brain or model-to-brain predictions. Results reveal that pretrained and randomly initialized models can achieve similar accuracy while differing in dimension recovery profiles, demonstrating that scalar metrics mask mismatches. The framework provides diagnostic evaluation of artificial vision models against human visual cortex.

model-brain alignment · fmri reproducibility · response dimensions · visual cortex · natural scenes dataset

Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem

arXiv cs.AI · Gabriel Rongyang Lau · 2026-05-19

This paper presents a Lean 4 formalization case study using Aristotle API for AI-assisted theorem proving, focusing on the Grasshopper problem (IMO 2009 Problem 6). The method involves generating a generalized Lean theorem with four verified helper lemmas addressing local components of a maximality and adjacent-swap exchange strategy, while leaving the main theorem unresolved due to an unverified global counting step. Results demonstrate that local proof search successfully verified specific components but failed to address the global combinatorial bookkeeping required for the theorem. The study highlights a central limitation in AI-assisted formalization and provides a reproducible Lean artifact with precise analysis.

lean 4 · theorem proving · aristotle api · formalization · grasshopper problem

Toto 2.0: Time Series Forecasting Enters the Scaling Era

arXiv cs.AI · Emaad Khwaja, Chris Lettieri, Gerald Woo, Eden Belouadah · 2026-05-19

The study introduces Toto 2.0, a family of five open-weights time series forecasting models demonstrating reliable quality improvements across 4M to 2.5B parameters. The models employ a unified training recipe and achieve state-of-the-art performance on three benchmarks: BOOM, GIFT-Eval, and TIME. Key contributions include architectural design, training data selection, and the u-muP hyperparameter transfer pipeline. Results validate the scalability of time series foundation models, with all checkpoints released under Apache 2.0.

time series forecasting · foundation models · scaling laws · hyperparameter transfer · open-weights

k-Inductive Neural Barrier Certificates for Unknown Nonlinear Dynamics

arXiv cs.AI · Ben Wooding, Hongchao Zhang, Taylor T. Johnson, Abolfazl Lavaei · 2026-05-19

The paper introduces k-inductive neural barrier certificates (k-NBCs) for safety verification of (partially) unknown nonlinear systems, relaxing conventional constraints by allowing temporary increases in barrier function values while ensuring overall safety. The method leverages neural networks for scalable design and employs counterexample-guided inductive synthesis (CEGIS) with satisfiability modulo theories (SMT) for verification, utilizing Willems et al.'s fundamental lemma to construct data-driven system representations without requiring full dynamics knowledge. This approach removes restrictions on barrier certificate function classes, enhancing flexibility. Validation on three nonlinear case studies demonstrates efficacy in handling unknown dynamics.

k-inductive · barrier certificates · counterexample-guided inductive synthesis · satisfiability modulo theories · nonlinear dynamics

Beyond Isotropy in JEPAs: Hamiltonian Geometry and Symplectic Prediction

arXiv cs.AI · Robert Jenkinson Alvarez · 2026-05-19

The paper demonstrates that isotropic Gaussian regularization in Joint-Embedding Predictive Architectures (JEPAs) is suboptimal for structured downstream geometries, showing that any fixed marginal target can be misaligned with certain geometries. It proposes HamJEPA, which encodes views as phase-space states (q,p) and predicts transitions via a learned Hamiltonian leapfrog map, incorporating non-isotropic scale and spectral floors to prevent collapse. HamJEPA outperforms SIGReg on CIFAR-100 (+4.89 kNN@20, +3.52 linear-probe at 30 epochs; +6.45 kNN@20, +10.64 linear-probe at 80 epochs) and ImageNet-100 (+4.82 kNN@20, +7.52 linear-probe at 45 epochs), with ablation confirming the symplectic coupling's role in improving neighborhood geometry.

jepa · hamiltonian geometry · symplectic prediction · isotropic regularization · phase-space encoding
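For readers unfamiliar with the leapfrog map mentioned in the HamJEPA item, here is a minimal sketch of one generic leapfrog step on a phase-space state (q, p). It assumes a separable Hamiltonian H(q, p) = p²/2 + V(q) with a hand-written harmonic potential; the paper's learned Hamiltonian and its parameterization are not described in the summary, so nothing here should be read as the authors' implementation.

```python
def leapfrog_step(q, p, grad_potential, step):
    """One leapfrog step for H(q, p) = p**2 / 2 + V(q) (scalar state for simplicity)."""
    p_half = p - 0.5 * step * grad_potential(q)          # half kick
    q_next = q + step * p_half                           # full drift
    p_next = p_half - 0.5 * step * grad_potential(q_next)  # half kick
    return q_next, p_next


def harmonic_grad(q):
    """Gradient of the illustrative potential V(q) = q**2 / 2."""
    return q


def energy(q, p):
    """Total energy of the illustrative harmonic oscillator."""
    return 0.5 * p * p + 0.5 * q * q


q, p = 1.0, 0.0
initial_energy = energy(q, p)
for _ in range(1000):
    q, p = leapfrog_step(q, p, harmonic_grad, 0.1)
energy_drift = abs(energy(q, p) - initial_energy)
```

The map is time-reversible and keeps the energy error bounded over long rollouts, which is the usual reason to prefer it over a plain Euler update.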

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

arXiv cs.AI · Yuhao Shen, Tianyu Liu, Xinyi Hu, Quan Kong · 2026-05-19

Graft introduces a hybrid tree construction framework for speculative decoding that breaks the Pareto tradeoff between dense and pruned drafting by coupling pruning and retrieval as mutually reinforcing operations. The method employs a sequential 'prune-then-graft' mechanism, where pruning frees computational budget for retrieval, and retrieval compensates for pruning-induced coverage loss by attaching predictive tokens with near-zero overhead. Evaluations demonstrate that Graft establishes a new Pareto frontier, achieving up to 5.41× speedup on short-context benchmarks and improving average speedup over EAGLE-3 by up to 21.8% on Qwen3-235B. The framework is training-free, lossless, and extensible to block drafting paradigms.

speculative decoding · pruning · retrieval · pareto frontier · autoregressive

Neurosymbolic Learning for Inference-Time Argumentation

arXiv cs.AI · Gabriel Freedman, Adam Dejl, Adam Gould, Mansi · 2026-05-19

The paper introduces inference-time argumentation (ITA), a neurosymbolic framework for ternary claim verification (true/false/uncertain) that combines formal argumentation semantics with LLM training. ITA guides LLMs to generate arguments and assign base scores, which are then used to compute deterministic, inspectable predictions. The method ensures faithfulness by constructing verdicts directly from argumentative structures rather than post-hoc rationales. Evaluated on two ternary claim verification datasets, ITA outperforms argumentative baselines and competes with non-argumentative direct-prediction approaches while providing transparent reasoning traces.

neurosymbolic learning · inference-time argumentation · ternary claim verification · formal argumentation semantics · llm training

INSHAPE: Instance-Level Shapelets for Interpretable Time-Series Classification

arXiv cs.AI · Seongjun Lee, Seokhyun Lee, Changhee Lee · 2026-05-19

INSHAPE introduces instance-level shapelets for interpretable time-series classification, addressing limitations of population-level approaches by identifying discriminative temporal patterns specific to each time series. The method models non-overlapping segments with temporal dependencies and aggregates instance-level shapelets into prototypical patterns for global interpretability. Evaluated on 128 UCR and 30 UEA datasets, INSHAPE outperforms state-of-the-art shapelet-based methods while enhancing interpretability.

shapelets · time-series classification · interpretability · temporal patterns · instance-level

ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions

arXiv cs.AI · Chuanyang Jin, Binze Li, Haopeng Xie, Cathy Mengying Fang · 2026-05-19

The paper introduces ThoughtTrace, the first large-scale dataset pairing real-world multi-turn human-AI conversations with users' self-reported thoughts, comprising 1,058 users, 2,155 conversations, and 10,174 thought annotations across 20 language models. Analysis reveals thoughts are semantically distinct from messages, difficult for LLMs to infer, and vary by conversation stage. Downstream applications show thoughts improve user-behavior prediction as inference-time context and provide fine-grained alignment signals for training personalized assistants. ThoughtTrace establishes user thoughts as a new data modality for studying cognitive dynamics in human-AI interaction.

thought annotation · multi-turn conversation · user-behavior prediction · alignment signals · cognitive dynamics

What Do Evolutionary Coding Agents Evolve?

arXiv cs.AI · Nico Pelleriti, Sree Harsha Nelaturu, Zhanke Zhou, Zongze Li · 2026-05-19

The paper introduces EvoTrace, a dataset of evolutionary coding traces across four frameworks and 16 tasks, to analyze what evolutionary coding agents actually evolve beyond final benchmark scores. Using EvoReplay, a replay-based methodology with controlled interventions, the authors annotate code edits into nine recurring types and find that most score gains stem from a small subset of edits. Key results include a deterministic cycling pattern where 30% of added code lines are byte-identical re-introductions, revealing that benchmark improvements often arise from mechanisms other than novel algorithmic structure.

evolutionary coding · llm-as-judge · edit types · algorithmic structure · benchmark evaluation

BalanceRAG: Joint Risk Calibration for Cascaded Retrieval-Augmented Generation

arXiv cs.AI · Zijun Jia, Yuanchang Ye, Sen Jia, Yiyao Qian · 2026-05-19

BalanceRAG introduces joint risk calibration for cascaded retrieval-augmented generation (RAG) systems, addressing conservative stage-by-stage thresholding by certifying threshold pairs at target risk levels. The method frames threshold pairs as operating points on a two-dimensional lattice, using sequential graphical testing to identify safe points while controlling system-level error rates. Experiments on three open-domain QA benchmarks with multiple LLM backbones show that BalanceRAG meets risk targets, increases coverage and correct examples, and reduces unnecessary retrieval calls compared to always-on RAG.

retrieval-augmented generation · risk calibration · sequential graphical testing · uncertainty thresholding · open-domain qa

VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving

arXiv cs.AI · Zhefan Xu, Ghassen Jerfel, Marina Haliem, Qi Zhao · 2026-05-19

VL-DPO introduces a vision-language-guided framework for aligning ego-vehicle motion forecasting models with human preferences in autonomous driving. The method leverages a vision-language model (VLM) as a zero-shot reasoner to generate preference pairs from pretrained model rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). Evaluated on the Waymo Open End-to-End Driving Dataset (WOD-E2E), VL-DPO achieves an 11.94% increase in rater feedback score (RFS) and a 10.01% reduction in average displacement error (ADE) compared to the pretrained model, demonstrating the VLM's effectiveness as a proxy for human preference.

vision-language model · direct preference optimization · ego-vehicle motion forecasting · zero-shot reasoner · average displacement error
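The VL-DPO item finetunes on preference pairs with Direct Preference Optimization. As a reminder of what that objective looks like for a single pair, here is a minimal sketch of the standard DPO loss on scalar log-probabilities. The function name, argument names, and `beta=0.1` are illustrative assumptions; the summary does not state the paper's hyperparameters or how trajectory likelihoods are computed.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin)).

    logp_* are log-probabilities under the policy being finetuned,
    ref_logp_* under the frozen pretrained (reference) model.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)     # policy equals reference
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)    # policy favors the chosen rollout
worsened = dpo_loss(-3.0, -1.0, -2.0, -2.0)    # policy favors the rejected rollout
```

The loss falls below log 2 only when the policy raises the chosen rollout's likelihood relative to the reference more than it raises the rejected one's.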

Probability-Conserving Flow Guidance

arXiv cs.AI · Parsa Esmati, Junha Hyung, Amirhossein Dadashzadeh, Jaegul Choo · 2026-05-19

The paper introduces Adaptive Manifold Guidance (AdaMaG), a probability-conserving flow guidance method for diffusion and flow-based generative models. Analyzing guidance through the continuity equation, the authors decompose its effect into divergence and score-parallel terms, proving the former blows up near the data manifold. AdaMaG schedules time-dependent attenuation to bound both terms without extra inference cost. Empirical results on image generation benchmarks demonstrate improved realism, reduced hallucinations, and controlled desaturation under high guidance.

adaptive manifold guidance · continuity equation · divergence term · score-parallel term · desaturation

CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

arXiv cs.AI · Dachuan Shi, Hanlin Zhu, Xiangchi Yuan, Wanjia Zhao · 2026-05-19

CopT introduces a reversed reasoning pipeline for large language models (LLMs) that first generates draft answers before performing on-policy thinking for reflection, addressing performative reasoning in chain-of-thought (CoT) approaches. The method uses continuous embeddings as contrastive verifiers, computing sequence-level reverse KL estimators to assess answer reliability, with dynamic KL control for draft-answer visibility during corrective thinking. Evaluations on mathematics, coding, and agentic reasoning tasks show accuracy improvements up to 23% and token reductions up to 57% without additional training.

chain-of-thought · on-policy thinking · contrastive verification · reverse kl estimator · performative reasoning

Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving

arXiv cs.AI · Oussama Zenkri, Oliver Brock · 2026-05-19

The study reveals a counterintuitive phenomenon where embodied LLM agents achieve higher task success rates with lower-fidelity RGB observations compared to ground-truth symbolic inputs on the Lockbox sequential puzzle task. Through physical robotic experiments and controlled simulations with randomized action-outcome flips, researchers demonstrate that moderate perceptual noise (40% flip probability) yields a 2.85× performance improvement by reducing repetitive action loops. The findings challenge conventional evaluation metrics, showing that success rates may reflect error-reasoning interactions rather than robust problem-solving capabilities.

embodied llms · observation fidelity · sequential puzzle · perceptual noise · action-outcome flips

Towards LLM-Assisted Architecture Recovery for Real-World ROS 2 Systems: An Agent-Based Multi-Level Approach to Hierarchical Structural Architecture Reconstruction

arXiv cs.AI · Dominique Briechle, Raj Chanchad, Tobias Geger, Ruidi He · 2026-05-19

The paper enhances LLM-assisted architecture recovery for ROS 2 systems via (1) refined prompting for consistent synthesis and (2) a multi-level staged strategy using intermediate representations (node lists, launch dependencies) to enable hierarchical reconstruction. Evaluated on a robotic disassembly system with complex integration, the method improves structural consistency and scalability over prior work, though challenges remain in handling dynamic semantics. Results demonstrate robustness in real-world settings with heterogeneous artifacts.

ros 2 · architecture recovery · llm-assisted · hierarchical reconstruction · multi-level representation

PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling

arXiv cs.AI · Ying-Jia Lin, Tzu-Chin Lo, Ping-Chien Li, Chi-Tung Cheng · 2026-05-19

PromptRad introduces a knowledge-enhanced multi-label prompt-tuning method for low-resource radiology report labeling, addressing limitations of rule-based systems and data-intensive fine-tuning. The approach reformulates multi-label classification as masked language modeling, integrates UMLS Metathesaurus synonyms via a multi-word verbalizer, and fine-tunes PLMs without additional classification layers. Evaluated on liver CT reports, PromptRad outperforms dictionary-based and fine-tuning baselines with only 32 training examples and matches GPT-4 performance despite smaller model size, demonstrating superior negation pattern handling.

prompt-tuning · multi-label classification · umls metathesaurus · masked language modeling · low-resource learning

Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study

arXiv cs.AI · Priyansh Trivedi, Olivier Schmitt · 2026-05-19

The study investigates whether code cleanliness affects autonomous coding agents' performance through a minimal-pair evaluation protocol. Researchers constructed repository pairs identical in architecture and behavior but differing in static-analysis violations and cognitive complexity, then tested Claude Code on 33 tasks across six pairs. Results show no significant difference in task pass rates (660 trials), but cleaner code reduced token usage by 7-8% and file revisitations by 34%, indicating cleaner code improves operational efficiency without altering success rates.

autonomous coding agents · minimal-pair evaluation · static-analysis violations · cognitive complexity · token efficiency

When Critics Disagree: Adaptive Reward Poisoning Attacks in RIS-Aided Wireless Control System

arXiv cs.AI · Deemah H. Tashman, Soumaya Cherkaoui · 2026-05-19

The paper introduces Disagreement-Guided Reward Poisoning (DGRP), an adaptive attack targeting Soft Actor-Critic (SAC) agents in RIS-assisted Cognitive Radio Networks. DGRP exploits high-disagreement states between SAC's dual critics to corrupt rewards, distort value estimations, and drive suboptimal policy decisions. Experiments show DGRP degrades RIS performance gains and transmission quality more effectively than periodic-timing or exploration-triggered baselines, emphasizing the need for disagreement-aware robustness evaluation in DRL-based wireless control systems.

reward-poisoning attacks · soft actor-critic · reconfigurable intelligent surfaces · cognitive radio networks · deep reinforcement learning

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

arXiv cs.AI · Jiaqi Liu, Shi Qiu, Mairui Li, Bingzhou Li · 2026-05-19

AutoResearchClaw introduces a multi-agent autonomous research pipeline that addresses limitations of existing single-agent systems through five key mechanisms: structured multi-agent debate, self-healing execution with Pivot/Refine loops, verifiable result reporting, human-in-the-loop collaboration (seven intervention modes), and cross-run evolution. The system outperforms AI Scientist v2 by 54.7% on ARC-Bench (25-topic benchmark), with targeted human intervention proving more effective than full autonomy or exhaustive oversight. Results demonstrate how iterative failure handling and experience accumulation can enhance automated scientific discovery while preserving human judgment.

multi-agent debate · self-healing executor · human-in-the-loop · cross-run evolution · verifiable reporting

When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

arXiv cs.AI · Samuel Jacob Chacko, James Hugglestone, Chashi Mahiul Islam, Xiuwen Liu · 2026-05-19

This work demonstrates that procedural knowledge packages (Skills) provide minimal benefit in offensive cybersecurity tasks, contrasting with their average 16.2 percentage point improvement across other domains. Through a 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag agent across four documentation conditions (55 to 4,147 lines), the analysis reveals only an 8.9 percentage point spread between no-Skills and full-Skills conditions, which is not statistically significant (p = 0.71, χ²; p = 0.25, Cochran–Armitage trend test). The authors attribute this to high environment-feedback bandwidth, where schema-validated, low-latency tool observations render Skills redundant or even detrimental in timing side-channel scenarios. A falsifiable hypothesis and replication pipeline are proposed.

procedural knowledge · environment-feedback bandwidth · schema-validated · low-latency · side-channel

Training Neural Networks with Optimal Double-Bayesian Learning

arXiv cs.AI · Vy Bui, Hang Yu, Karthik Kantipudi, Ziv Yaniv · 2026-05-19

The paper introduces a double-Bayesian probabilistic framework for determining optimal learning rates in stochastic gradient descent (SGD). The method extends classical Bayesian statistics into two antagonistic Bayesian processes, deriving a theoretically optimal learning rate. Experiments on classification, segmentation, and detection tasks validate the framework's effectiveness in improving model training and performance. The approach addresses the empirical challenges of hyperparameter selection in neural network optimization.

stochastic gradient descent · bayesian statistics · learning rate · hyperparameter optimization · neural network training

GeoX: Mastering Geospatial Reasoning Through Self-Play and Verifiable Rewards

arXiv cs.AI · Kyeongjin Ahn, Seungeon Lee, Krishna P. Gummadi, Meeyoung Cha · 2026-05-19

GeoX introduces a self-play framework for geospatial reasoning that learns spatial logic through executable programs with verifiable rewards, eliminating the need for large-scale human annotations. The method employs a multimodal policy to propose and solve spatial problems via abduction, deduction, and induction, using spatial primitives and an image understanding tool. Reinforcement learning optimizes the policy using program-execution rewards. GeoX improves base VLMs by up to 5.5 points, matching or surpassing conventional baselines trained on curated data. The work also releases a self-play-derived benchmark for geospatial understanding.

geospatial reasoning · self-play · executable programs · verifiable rewards · multimodal policy

LLM Benchmark Datasets Should Be Contamination-Resistant

arXiv cs.AI · Ali Al-Lawati, Jason Lucas, Dongwon Lee, Suhang Wang · 2026-05-19

The paper advocates for contamination-resistant benchmark datasets to ensure reliable evaluation of large language models (LLMs), given widespread contamination in current benchmarks. It proposes leveraging the asymmetry between inference and training pipelines in Transformers to create unlearnable yet inference-supportive datasets, alongside mathematical advancements for cross-architecture interoperability. The authors call for community action to develop contamination-resistant methodologies, supporting tools, and their integration into evaluation pipelines.

benchmark contamination · transformer architecture · inference pipeline · unlearnable datasets · llm evaluation

A Case for Agentic Tuning: From Documentation to Action in PostgreSQL

arXiv cs.AI · Hongyu Lin, Mingyu Li, Weichen Zhang, Yihang Lou · 2026-05-19

The paper introduces PerfEvolve, a method for dynamic database tuning that replaces static documentation with LLM-based agentic skills. The system addresses three limitations of traditional tuning guides (version staleness, workload heterogeneity, and parameter dependencies) by implementing version-consistency verification, workload-specific profiling, and multi-parameter joint optimization. Evaluated on PostgreSQL under TPC-C and TPC-H benchmarks, PerfEvolve achieves up to 35.2% better performance than documentation-driven baselines.

postgresql · llm-based agents · parameter tuning · workload profiling · joint optimization

Learning with Foresight: Enhancing Neural Routing Policy via Multi-Node Lookahead Prediction

arXiv cs.AI · Xia Jiang, Yaoxin Wu, Yew-Soon Ong, Yingqian Zhang · 2026-05-19

The paper introduces Multi-node Lookahead Prediction (MnLP), a training strategy enhancing neural routing policies by enabling multi-step node prediction during training without inference overhead. MnLP employs causal and discardable modules for auxiliary supervision, improving long-horizon planning. Experiments demonstrate MnLP's superior generalization across problem sizes, distributions, and benchmarks compared to next-node prediction baselines, while maintaining architectural flexibility.

neural routing policy · multi-node lookahead · long-horizon planning · auxiliary supervision · inference efficiency

Block-Sphere Vector Quantization

arXiv cs.AI · Heesang Ann, Joongkyu Lee, Min-hwan Oh · 2026-05-19

The paper presents a unified theoretical comparison of rotation-based vector quantizers (EDEN, RabitQ, TurboQuant) and introduces Block-Sphere Quantization (BlockQuant). Analysis reveals method-dependent advantages: EDEN/TurboQuant excel in MSE distortion, EDEN in inner-product distortion, and RabitQ in high-probability control. BlockQuant employs spherical geometry to quantize vector blocks, outperforming baselines in reconstruction MSE and inner-product distortion theoretically and empirically on embedding datasets and LLM inference tasks.

vector quantization · rotation-based quantizers · mse distortion · inner-product distortion · block-spherical geometry

Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes

arXiv cs.AI · Mohammed Alshaalan, Miguel R. D. Rodrigues · 2026-05-19

The paper introduces CPD Online (CPD), a model-agnostic method for detecting optimization-based adversarial prompts in LLMs by analyzing sequential entropy changes. The approach formulates detection as an online change-point problem, standardizing token-level entropies against a system-prompt baseline and applying a one-sided CUSUM statistic. Evaluated on 1,012 attacks (GCG, AutoDAN, etc.) and benign prompts across six models (LLaMA-2, Vicuna, Qwen2.5), CPD achieves AUROC 0.88 and F1 0.82 on LLaMA-2-7B, outperforms windowed-perplexity baselines, and localizes 79.6% of triggers within adversarial suffixes. As a gate for LLaMA Guard, it reduces guard calls by 17-22% while maintaining detection quality.

adversarial prompts · change-point detection · token-level entropy · cusum statistic · llm jailbreaking
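
The detection step described above, standardizing token entropies against a system-prompt baseline and running a one-sided CUSUM, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cusum_alarm`, the `drift` and `threshold` values, the toy entropy numbers, and the choice to watch for upward shifts are all assumptions.

```python
import statistics

def cusum_alarm(baseline, stream, drift=0.5, threshold=4.0):
    """Return the first index in `stream` where a one-sided CUSUM exceeds
    `threshold`, or None if it never does.

    baseline: token entropies measured on the system prompt (assumed input).
    stream:   token entropies of the incoming user prompt, in order.
    drift, threshold: hypothetical tuning constants, not taken from the paper.
    """
    mu = statistics.fmean(baseline)
    sigma = statistics.pstdev(baseline) or 1.0  # guard against zero variance
    s = 0.0
    for i, h in enumerate(stream):
        z = (h - mu) / sigma            # standardize against the baseline
        s = max(0.0, s + z - drift)     # one-sided CUSUM recursion
        if s > threshold:
            return i                    # change point flagged online
    return None

baseline = [2.0, 2.2, 1.8, 2.1, 1.9]
print(cusum_alarm(baseline, [2.0, 2.1, 1.9, 2.0]))       # benign-looking stream
print(cusum_alarm(baseline, [2.0, 2.1, 3.0, 3.2, 3.1]))  # entropy jumps mid-stream
```

Because the statistic resets to zero whenever the evidence runs out, a short burst of unusual tokens has to be sustained to trigger an alarm, which is what makes the index it returns usable for localizing the suspicious span.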

World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

arXiv cs.AI · Zuyao Lin, Jianhui Zhang, Peidong Jia, Xiaoguang Zhao · 2026-05-19

The paper introduces World-Ego Modeling (WEM), a novel paradigm for long-horizon embodied tasks that disentangles world and ego dynamics through motion-, semantic-, and intention-based views. The proposed WEM framework combines an implicit world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. Evaluated on the new HTEWorld benchmark (125K video clips, 4.5M frames) and existing manipulation benchmarks, WEM achieves state-of-the-art performance in hybrid navigation-manipulation tasks.

world-ego modeling · long-horizon evolution · hybrid embodied tasks · cascade-parallel mixture-of-experts · diffusion generator

GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems

arXiv cs.AI · Sourish Wawdhane, Avinash Kumar, Poulami Das · 2026-05-19

GEM introduces GPU-variability-aware expert mapping for Mixture-of-Experts (MoE) models, addressing straggler GPUs caused by imbalanced token loads and hardware variability. The method profiles GPU performance variability and token load distributions, strategically placing consistent (frequently used) and temporal (co-occurring) experts across GPUs to minimize synchronization delays. Evaluations demonstrate 7.9% average and up to 16.5% maximum latency reduction compared to baseline approaches.

mixture-of-experts · gpu variability · token load balancing · expert placement · synchronization bottleneck

A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits

arXiv cs.AI · Yuyang Zhang, Yifu Zhang, Xuehai Zhou, Xiaoyin Chen · 2026-05-19

The paper theoretically analyzes out-of-distribution (OOD) generalization in LLM reasoning through a measure-theoretic framework. Using optimal transport, it projects discrete reasoning trajectories into a continuous metric space, quantifying domain shifts via Wasserstein-1 distance and bounding generalization via Kantorovich duality. Key findings show that position-dependent attention mechanisms (e.g., Absolute Positional Encoding) incur higher Lipschitz constants and expected risk compared to shift-invariant alternatives (e.g., Rotary Embeddings). Additionally, mapping backtracking to Dyck-$k$ languages reveals depth scaling is necessary for $\text{TC}^0$ Transformers to avoid representation collapse, which width scaling cannot circumvent. Empirical validation across 54 Transformer configurations confirms generalization risk scales monotonically with Wasserstein domain shift.

optimal transport · wasserstein distance · lipschitz continuity · dyck-$k$ language · barron spaces
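
For readers unfamiliar with the duality mentioned above, the standard Kantorovich–Rubinstein form of the Wasserstein-1 distance, and the risk bound it yields for a Lipschitz loss, are written out below. The symbols $\mu$ (training distribution), $\nu$ (shifted distribution), $\ell$ (loss), and $L$ (its Lipschitz constant) are notation assumed here; the paper's own bound may carry additional terms.

```latex
\begin{align}
W_1(\mu, \nu) &= \sup_{\|f\|_{\mathrm{Lip}} \le 1}
  \Big( \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{x \sim \nu}[f(x)] \Big), \\
\big| \mathbb{E}_{x \sim \nu}[\ell(x)] - \mathbb{E}_{x \sim \mu}[\ell(x)] \big|
  &\le L \, W_1(\mu, \nu) \quad \text{for any $L$-Lipschitz } \ell .
\end{align}
```

The second line follows by applying the first to $f = \ell / L$. It shows why a larger Lipschitz constant loosens the bound for the same amount of domain shift, which is consistent with the summary's comparison of positional encodings.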

Probabilistic Tiny Recursive Model

arXiv cs.AI · Amin Sghaier, Ali Parviz, Alexia Jolicoeur-Martineau · 2026-05-19

The paper introduces Probabilistic Tiny Recursive Model (PTRM), a task-agnostic framework enhancing Tiny Recursive Models (TRMs) through stochastic exploration. PTRM injects Gaussian noise at each recursion step, enabling parallel trajectories to explore diverse solution basins, and selects among them using the model's existing Q head for early stopping. Without retraining or task-specific augmentations, PTRM achieves significant accuracy improvements: 87.4% to 98.75% on Sudoku-Extreme and 62.6% to 91.2% on Pencil Puzzle Bench, outperforming frontier LLMs (55.1%) at 0.0001x cost with 7M parameters.

probabilistic recursion · stochastic exploration · gaussian noise · solution basins · early stopping
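
The sampling-and-selection idea in the summary above (perturb the latent state at every recursion step, run several trajectories, keep the one the Q head scores highest) can be sketched with toy stand-ins. Everything named here is hypothetical: `step` and `q_head` are placeholder functions rather than a trained TRM, and `ptrm_select`, `sigma`, and the trajectory counts are illustrative choices.

```python
import random

def step(z):
    """Toy stand-in for one recursion of a trained model: moves the state toward 1.0."""
    return [v + 0.5 * (1.0 - v) for v in z]

def q_head(z):
    """Toy stand-in for a Q head: higher score means 'more likely correct'."""
    return -sum((v - 1.0) ** 2 for v in z)

def ptrm_select(z0, n_traj=8, n_steps=6, sigma=0.1, seed=0):
    """Run n_traj noisy recursions from z0 and return (state, score) of the best one."""
    rng = random.Random(seed)
    best, best_q = None, float("-inf")
    for _ in range(n_traj):
        z = list(z0)
        for _ in range(n_steps):
            noisy = [v + rng.gauss(0.0, sigma) for v in z]  # Gaussian noise per step
            z = step(noisy)
        q = q_head(z)
        if q > best_q:                                      # keep the highest-scoring trajectory
            best, best_q = z, q
    return best, best_q

state, score = ptrm_select([0.0, 0.0])
print(state, score)
```

The trajectories are independent, so in practice they could run as one batched forward pass; the loop here is only for readability. The early-stopping use of the Q head mentioned in the summary is not modeled.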

Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

arXiv cs.AI · Rebecca Ramnauth, Drazen Brscic, Brian Scassellati · 2026-05-19

The paper proposes robotics-inspired guardrails for foundation models in socially sensitive domains, introducing formal constructs for runtime behavioral control over interaction trajectories. The Grounded Observer framework enforces constraints in uncertain, closed-loop systems, addressing cumulative failures through trajectory-level safety rather than individual outputs. Applied to small talk, autism therapy, and school de-escalation deployments, the framework demonstrates effective runtime interventions that mitigate undesirable interaction drift while adapting to social contexts. The work suggests extensions for stronger behavioral guarantees.

foundation models · runtime behavioral control · interaction trajectories · closed-loop systems · constraint enforcement

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

arXiv cs.AI · Zhuohan Gu, Qizheng Zhang, Omar Khattab, Samuel Madden · 2026-05-19

PEEK introduces a context map system for long-context LLM agents, caching reusable orientation knowledge about recurring external contexts (e.g., document corpora) as a small, constant-sized prompt artifact. The method employs three modules: a Distiller for knowledge extraction, a Cartographer for structured edits, and an Evictor for token-budget enforcement. Evaluations show 6.3-34.0% performance gains over ACE, with 93-145 fewer iterations, 1.4-5.8x lower cost, and improved solving rates (6.0-14.0%) and rubric accuracy (7.8-12.1%) across models like OpenAI Codex.

context map · long-context reasoning · llm agents · knowledge distillation · token budget

StruMPL: Multi-task Dense Regression under Disjoint Partial Supervision and MNAR Labels

arXiv cs.AI · Reza M. Asiyabi, Juan Alberto Molina-Valero, The SEOSAW Partnership, Steven Hancock · 2026-05-19

StruMPL introduces a multi-task dense regression framework for disjoint partial supervision with MNAR labels and inter-task physical constraints, addressing forest aboveground biomass (AGB) estimation from Earth observation data. The method employs a shared encoder with per-variable regression, imputation, and propensity heads for MNAR correction, alongside a learnable physics module enforcing biome-specific allometric constraints. An Augmented IPW (AIPW) pseudo-outcome with stop-gradients ensures stable joint optimization. Evaluated on two biomes, StruMPL reduces high-AGB bias by ~54% and outperforms baselines in RMSE and bias metrics.

multi-task learning · missing not at random · dense regression · allometric constraints · augmented ipw
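
The Augmented IPW pseudo-outcome mentioned above has a standard textbook form, sketched below for a single sample. This is the generic estimator, not StruMPL's code: the function name, the `eps` clipping constant, and the example numbers are assumptions, and the stop-gradient handling described in the summary is not represented in plain Python.

```python
def aipw_pseudo_outcome(y, observed, m_hat, pi_hat, eps=1e-6):
    """Generic AIPW pseudo-outcome for one sample.

    y:        the label, or None when it is missing.
    observed: True if the label was observed.
    m_hat:    imputation-head prediction of the label.
    pi_hat:   propensity-head estimate of P(observed | features).
    eps:      hypothetical floor that keeps the inverse weight finite.
    """
    if not observed:
        return m_hat                                  # fall back to the imputation
    return m_hat + (y - m_hat) / max(pi_hat, eps)     # inverse-propensity correction

print(aipw_pseudo_outcome(10.0, True, 8.0, 0.5))   # observed label, rarely observed region
print(aipw_pseudo_outcome(None, False, 8.0, 0.5))  # missing label
```

Samples from regions where labels are rarely observed (small `pi_hat`) receive a larger correction, which is how the estimator compensates for labels that are missing not at random.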

Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models

arXiv cs.AI · Yi Zhong, Haotong Qin, Xindong Zhang, Lei Zhang · 2026-05-19

The paper introduces SplitQ, a channel-splitting-driven post-training quantization framework for vision-language models (VLMs) that addresses modality heterogeneity in low-bit quantization. The method features Modality-specific Outlier Channel Decoupling (MOCD) to isolate salient modality-specific outlier channels and Adaptive Cross-Modal Calibration (ACC) with dual learnable branches to mitigate quantization errors. Experiments on 6 multi-modal datasets show SplitQ outperforms existing approaches, preserving 93.5% of FP16 performance under W3A3 quantization (69.5 vs. 74.3 accuracy).

post-training quantization · vision-language models · modality heterogeneity · outlier channel decoupling · cross-modal calibration

Real-Time Parallel Counterfactual Regret Minimization

arXiv cs.AI · Boning Li, Longbo Huang · 2026-05-19

The paper introduces Parallel CFR, the first parallelization framework for real-time depth-limited Counterfactual Regret Minimization (CFR) in imperfect-information games. The method decomposes CFR iterations into a seven-stage pipeline with two orthogonal parallelism dimensions (by information set and by tree node), offloading leaf node evaluation to GPUs via batched neural network inference. Experiments on Heads-Up No-Limit Texas Hold'em show 3.3–3.4× speedup over single-threaded baselines, achieving per-iteration times of ~47–54 ms on a game tree with over 1 billion histories using a single desktop-class device (NVIDIA DGX Spark).

counterfactual regret minimization · parallel computation · imperfect-information games · gpu acceleration · real-time decision making
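
As background for the per-information-set parallelism described above, the regret-matching rule at the core of CFR is shown below. It is the textbook update, not the paper's seven-stage pipeline; the function name and the example regrets are illustrative.

```python
def regret_matching(cum_regret):
    """Turn cumulative regrets for one information set into a strategy.

    Actions are played in proportion to their positive regret; if no action
    has positive regret, the strategy is uniform.
    """
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    if total > 0.0:
        return [p / total for p in positive]
    return [1.0 / len(cum_regret)] * len(cum_regret)

print(regret_matching([2.0, -1.0, 2.0]))
print(regret_matching([-1.0, -2.0]))
```

The update reads only the regrets of its own information set, so different information sets can be processed independently, which is one reason that axis is a natural candidate for parallel execution.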

Fast and Featureless Node Representation Learning with Partial Pairwise Supervision

arXiv cs.AI · Sujan Chakraborty, Saptarshi Bej · 2026-05-19

Contrastive FUSE introduces a fast, unified framework for node representation learning in graphs with partial pairwise labels and no node features. The method optimizes a spectral contrastive objective integrating community-aware structural signals and signed pairwise constraints, replacing costly modularity gradients with a lightweight approximation for scalability. It employs an efficient optimization scheme with natural gradient decomposition and adaptive learning-rate scaling, enabling rapid updates on million-edge graphs. Experiments on citation networks, co-purchase graphs, and OGB datasets demonstrate competitive or superior contrastive classification performance without node features, alongside significant runtime improvements over baselines.

spectral contrastive objective · modularity gradient · node representation learning · adaptive learning-rate scaling · pairwise constraints

Streamlined Constraint Reasoning via CNN Pattern Recognition on Enumerated Solutions

arXiv cs.AI · Patrick Spracklen · 2026-05-19

We introduce a CNN-LLM pipeline for automated streamliner synthesis in constraint programming, leveraging enumerated solutions to detect structural patterns. The method trains a Convolutional Neural Network contrastively on feasible solutions versus perturbed non-solutions, then translates the CNN's discriminative signal into MiniZinc streamliner constraints via LLM-driven synthesis. Evaluated on hardened benchmarks, the pipeline achieves 98.8%, 98.6%, and 89.4% portfolio time reductions on Vessel Loading, Social Golfers, and Black Hole respectively, with geometric-mean speedups of 932x, 356x, and 1103x. Discovered streamliners include class-based packing constraints, canonicalisations, and layout-coordinate bounds.

streamliner synthesis · convolutional neural network · minizinc · constraint programming · contrastive learning

Deep Tech to Space: Space Data Centers and AI Revolution at the Edge

arXiv cs.AI · Jonas Weiss, Patricia Sagmeister, Gabriel Maiolini Capez, Dinesh Verma · 2026-05-19

The article proposes Space Data Centers (SDCs) as AI-driven orbital platforms to address bandwidth and latency constraints in space-to-Earth data transmission. The authors present a constellation architecture for Low Earth Orbit SDCs, detailing orbital design, inter-satellite networking, computational resource allocation, and service orchestration. Technical and economic feasibility is analyzed through forecasting models based on technology roadmaps, with validation via Earth observation and lunar exploration case studies. The approach aims to mitigate ground station limitations caused by visibility windows and scheduling complexity.

space data centers · low earth orbit · inter-satellite links · service orchestration · technology roadmaps

Passive Construction Site Safety Monitoring via Persona-Scaffolded Adversarial Chain-of-Thought VLM Verification

arXiv cs.AI · Ananth Sriram, Neel Mokaria, Rajveer Singh · 2026-05-19

The paper introduces a passive construction safety monitoring pipeline using a three-stage architecture: (1) YOLO11 for PPE/hazard detection, (2) SAM 3 for segmentation refinement, and (3) Qwen3-VL-8B-Instruct with persona-scaffolded adversarial chain-of-thought verification. The method achieves a 12% precision improvement over single-pass prompting, particularly in hallucination-prone violation categories, evaluated on the 12-video Ironsite corpus. The system maps violations to OSHA standards, performs ergonomic risk scoring, and generates timestamped safety reports.

yolo11 · sam 3 · qwen3-vl-8b-instruct · adversarial chain-of-thought · osha standards

StableGrad: Backward Scale Control without Batch Normalization

arXiv cs.AI · Jose I. Mestre, Alberto Fernández-Hernández, Cristian Pérez-Corral, Manuel F. Dolz · 2026-05-19

The paper introduces StableGrad, an optimizer-level mechanism for controlling gradient scales in deep neural networks without modifying forward passes or using batch normalization. The method rescales layer-wise weight gradients during backpropagation, preserving network outputs and derivatives, which is crucial for Physics-Informed Neural Networks (PINNs), where batch normalization disrupts physical consistency. Evaluations demonstrate StableGrad's effectiveness: it improves PINN benchmark accuracy by enabling deeper models and stabilizes BatchNorm-free ResNet/EfficientNet training without architectural changes. Results indicate that gradient-scale control at the optimizer level can replace forward normalization where the latter is inappropriate.

gradient scaling · physics-informed neural networks · batch normalization · optimizer-level control · deep network training

A Framework for Evaluating Zero-Shot Image Generation in Concept-based Explainability

arXiv cs.AI · Giacomo Astolfi, Matteo Bianchi, Riccardo Campi, Antonio De Santis · 2026-05-19

The paper proposes a framework for evaluating zero-shot text-to-image (T2I) generative models as synthetic concept datasets for concept-based explainable AI (XAI). It assesses synthetic concept faithfulness through four analyses: (1) concept representation similarity between synthetic and real images, (2) intra-similarity of progressively larger concept subsets, (3) downstream explanation task performance, and (4) concept removal effects on explanations. Results reveal challenges in using synthetic data for XAI, raising open questions about zero-shot pipelines. The dataset is publicly available.

concept-based xai · zero-shot learning · text-to-image generation · synthetic datasets · model explainability

FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

arXiv cs.AI · Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su · 2026-05-19

The authors introduce FineBench, a benchmark for evaluating fine-grained human activity understanding in Vision-Language Models (VLMs), featuring 199,420 QA pairs across 64 long-form videos with dense spatial/temporal annotations. They propose FineAgent, a modular framework combining a Localizer and Descriptor to enhance VLMs, demonstrating consistent performance improvements on FineBench. Results show proprietary models like GPT-5 perform adequately, while open-source VLMs struggle with spatial reasoning and subtle action distinctions.

vision-language models · fine-grained understanding · video question answering · spatial reasoning · modular framework

CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse Weather Perception in Autonomous Driving

arXiv cs.AI · Sherif Khairy, Catherine M. Elias · 2026-05-19

CADENet introduces a training-free, three-thread system for adverse weather perception in autonomous driving, addressing the latency and annotation completeness bias of existing approaches. The system comprises Thread S (YOLOv11n) for zero-latency detections, Thread Q for condition-adaptive enhancement (CAPE) fused via entropy-guided NMS (EG-NMS), and Thread E for CLIP-based zero-shot weather classification requiring only text prompts. Evaluated on 1327 DAWN images, CADENet achieves Recall = 0.0103 (micro), F1 = 0.0230 on snow, and F1 = 0.0038 on rain, with Thread S sustaining ≈44 FPS. The method formalizes annotation completeness bias, ensuring recall as the primary metric.

adverse weather perception · condition-adaptive enhancement · entropy-guided nms · zero-shot classification · annotation completeness bias

A Closed-loop, State-centric, Multi-agent Framework for Passenger Load Estimation from Heterogeneous Data Streams

arXiv cs.AI · Yiyao Xu, Hao Zhou, Yuhang Wang, Jingran Sun · 2026-05-19

The paper proposes a closed-loop, state-centric, multi-agent framework for robust passenger load estimation from heterogeneous data streams, addressing challenges like incremental count errors and sensor reliability. The method enforces physical feasibility, dynamically allocates trust among evidence sources, and uses physics-derived violation residuals for training. The architecture includes a unified stop-event backbone, a coupled Perception–Physical–Fusion loop for stop-by-stop inference, and optional trip-level macro-correction modules. This approach improves accuracy in automatic passenger counting (APC) systems under varying operational conditions.

passenger load estimation · heterogeneous data streams · closed-loop framework · multi-agent system · automatic passenger counting

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

arXiv cs.AI · Zhifei Xie, Kaiyu Pang, Haobin Zhang, Deheng Ye · 2026-05-19

Mega-ASR introduces a unified framework for robust automatic speech recognition (ASR) in real-world environments, addressing the 'acoustic robustness bottleneck' through scalable compound-data construction and progressive acoustic-to-semantic optimization. The method leverages Voices-in-the-Wild-2M, a dataset covering 7 acoustic phenomena and 54 compound scenarios, and employs Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Experiments show Mega-ASR outperforms state-of-the-art systems on adverse-condition benchmarks, achieving 45.69% WER on VOiCES R4-B-F (vs. 54.01%) and 21.49% on NOIZEUS Sta-0 (vs. 29.34%), with over 30% relative WER reduction in complex compositional scenarios.

automatic speech recognition · acoustic robustness · progressive fine-tuning · wer-gated optimization · compound-data construction

Explainable Wastewater Digital Twins: Adaptive Context-Conditioned Structured Simulators with Self-Falsifying Decision Support

arXiv cs.AI · Gary Simethy, Daniel Ortiz Arroyo, Petar Durdevic · 2026-05-19

The paper presents CCSS-IX, an explainable digital twin for wastewater treatment plants, combining interpretable locally linear state-space models with a context-aware gating network. The system features a runtime decision layer using conformal risk control to certify or falsify operator-proposed actions, providing finite-sample coverage guarantees. Evaluated on the Avedøre and Agtrup/BlueKolding plants and BSM2 benchmark, the method achieves 0.78-1.08% RMSE versus black-box baselines, reduces aggregate regret by 43.6%, and prevents 93/187 false-safe N2O approvals (4.65× baseline improvement, p<1e-21).

digital twin · conformal risk control · state-space models · wastewater treatment · interpretable ai

From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

arXiv cs.AI · Ahmed Y. Gado, Omar Y. Goba, Alaa Hassanein, Catherine M. Elias · 2026-05-19

This work introduces temporal conditioning in inter-agent communication to enhance coherence in scene-to-plan reasoning for autonomous vehicles, addressing inconsistencies in continuous actions. Three planner architectures with increasing temporal integration were evaluated on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. Results indicate no statistically significant improvements in standard NLP-based correctness metrics, but qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence in the Sentinel architecture. The study establishes the first empirical benchmark for temporal scene-to-plan reasoning.

temporal conditioning · scene-to-plan reasoning · autonomous vehicles · inter-agent communication · bdd-x dataset

Smooth Piecewise Cutting for Neural Operator to Handle Discontinuities and Sharp Transitions

arXiv cs.AI · Ha Dang, Sebastian Schmidt, Juergen Hesser · 2026-05-19

We propose Cut-DeepONet, a two-stage neural operator framework that explicitly models discontinuities in PDE solutions while reducing learning complexity. The method employs a lifting strategy to partition the domain into smooth subregions, representing discontinuities as boundaries in higher-dimensional space, and uses an auxiliary network to predict discontinuity locations for unseen inputs. Experiments on benchmark PDEs demonstrate that Cut-DeepONet outperforms state-of-the-art methods on problems with discontinuities and sharp transitions, achieving superior performance even with low-resolution training data and fewer trainable parameters.

neural operator · discontinuities · lifting strategy · partial differential equations · cut-deeponet

ST-TGExplainer: Disentangling Stability and Transition Patterns for Temporal GNN Interpretability

arXiv cs.AI · Hongjiang Chen, Xin Zheng, Pengfei Jiao, Huan Liu · 2026-05-19

ST-TGExplainer introduces a self-explainable temporal graph neural network (TGNN) that disentangles stability and transition patterns for improved interpretability. The method employs a disentangled information bottleneck objective to learn compact explanatory subgraphs, explicitly suppressing redundancy between historical (stability) and emerging (transition) interaction patterns. Experiments demonstrate that ST-TGExplainer achieves strong predictive performance while providing more faithful explanations compared to existing interpretable TGNNs.

temporal graph neural networks · interpretability · disentangled information bottleneck · stability patterns · transition patterns

LP-Eval: Rubric and Dataset for Measuring the Quality of Legal Proposition Generation

arXiv cs.AI · Shanshan Xu, Johan Lindholm, Amogh Raina, Henrik Palmer Olsen · 2026-05-19

The paper introduces LP-Eval, a rubric and dataset for evaluating legal proposition generation quality, co-designed with legal experts. The method decomposes quality into formal validity and substantive dimensions, applying it to 100 LLM-generated propositions from Court of Justice of the European Union decisions. Results indicate LLMs produce predominantly well-formed propositions, with higher quality for established cases, while rubric-guided LLM evaluations align moderately with expert assessments but lack sensitivity to fine-grained distinctions.

legal proposition generation · evaluation rubric · large language models · expert annotations · formal validity

FLUXtrapolation: A benchmark on extrapolating ecosystem fluxes

arXiv cs.AI · Anya Fries, Jacob A Nelson, Martin Jung, Markus Reichstein · 2026-05-19

The paper introduces FLUXtrapolation, a benchmark for evaluating machine learning models on ecosystem flux extrapolation under distribution shifts. It addresses the challenge of upscaling flux measurements from sparse tower sites to global estimates, considering covariate and conditional shifts across climates and ecosystems. The benchmark includes temporal, spatial, and temperature-based extrapolation scenarios, with evaluation metrics focusing on held-out domains, temporal aggregations, and tail errors. Initial results show baseline models perform similarly in median hourly RMSE but diverge under tail-focused and multi-scale assessments, highlighting the benchmark's utility for advancing flux upscaling methods.

flux upscaling · distribution shift · covariate shift · conditional shift · rmse

Chunking German Legal Code

arXiv cs.AI · Max Prior, Natalia Milanova, Andreas Schultz · 2026-05-19

The study evaluates chunking strategies for retrieval-augmented generation in German statutory law, using the German Civil Code as a benchmark. It compares structural units (sections, subsections), fixed-size windows, contextual chunking, semantic clustering, Lumber, and RAPTOR-based hierarchical retrieval, measuring recall, latency, build time, and storage. Results indicate that structure-aligned methods (sections, subsections) achieve highest recall and computational efficiency, outperforming complex LLM-intensive approaches. The findings emphasize the importance of preserving domain-specific structure for legal retrieval.

retrieval-augmented generation · chunking strategies · legal information retrieval · contextual chunking · hierarchical retrieval
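
To make the comparison above concrete, the sketch below contrasts a structure-aligned splitter with a fixed-size window. It is illustrative only: the `§ <number>` pattern, the function names, the window sizes, and the English toy text are assumptions and do not reproduce the study's pipeline.

```python
import re

def chunk_by_section(text):
    """Split statute text at section markers such as '§ 1', keeping each marker
    with its section. Note: an inline cross-reference like 'see § 2' would also
    trigger a split in this naive version."""
    parts = re.split(r"(?=§\s*\d+)", text)
    return [p.strip() for p in parts if p.strip()]

def chunk_fixed(text, size=40, overlap=10):
    """Split text into character windows of `size` with `overlap` shared characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

toy = "§ 1 Legal capacity begins at birth. § 2 Majority begins at the age of 18."
print(chunk_by_section(toy))
print(chunk_fixed(toy))
```

On the toy text, the section splitter returns two self-contained provisions, while the fixed window cuts sentences mid-way and spreads one provision across several chunks, which illustrates the kind of structure loss the study's recall results point to.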

Latent Laplace Diffusion for Irregular Multivariate Time Series

arXiv cs.AI · Zinuo You, Jin Zheng, John Cartlidge · 2026-05-19

Latent Laplace Diffusion (LLapDiff) proposes a generative framework for irregular multivariate time series forecasting, avoiding temporal distortion from re-gridding and drift from sequential solvers. The method models targets as low-dimensional latent trajectories, using a stable modal parameterization inspired by port-Hamiltonian dynamics and Laplace-domain complex-conjugate poles for direct irregular timestamp evaluation. A gap-aware history summarizer links continuous dynamics to observations via renewal-averaging analysis. Experiments demonstrate LLapDiff's superiority in long-horizon forecasting and missing-value imputation, with code publicly available.

latent laplace diffusion · irregular time series · port-hamiltonian dynamics · renewal-averaging analysis · gap-aware summarizer

Stitched Value Model for Diffusion Alignment

arXiv cs.AI · Hyojun Go, Hyungjin Chung, Prune Truong, Goutam Bhat · 2026-05-19

We propose StitchVM, a model stitching framework for aligning diffusion-based generative models with task-specific rewards by transferring pretrained pixel-space reward models to noisy latent regimes. The method attaches a frozen diffusion backbone to a truncated pixel-space reward model, combining robust reward capabilities with native noisy latent handling. Stitching and fine-tuning CLIP ViT-L and SD 3.5 Medium requires only 10 GPU-hours. This approach enables amortized value function construction rather than per-sample approximation, improving downstream steering and post-training methods: DPS becomes 3.2× faster with 50% reduced peak GPU memory, and DiffusionNFT becomes 2.3× faster.

diffusion alignment · model stitching · noisy latents · reward transfer · amortized estimation

Synergistic Foundation Models for Semi-Supervised Fetal Cardiac Ultrasound Analysis: SAM-Med2D Boundary Refinement and DINOv3 Semantic Enhancement

arXiv cs.AI · Tonghao Zhuang, Shanglong Hu, Yongsheng Luo, Zhiqi Zhang · 2026-05-19

The authors propose a semi-supervised framework for fetal cardiac ultrasound analysis, combining joint segmentation and classification via synergistic foundation models. The method integrates SAM-Med2D for boundary refinement and DINOv3 for pseudo-label enhancement, employing view-specific hard masking and a two-stage optimization strategy (EMA phase followed by Classification Fine-Tuning). Evaluated on FETUS 2026, it achieves 79.99% Dice Similarity Coefficient, 61.62% Normalized Surface Distance, and 41.20% F1-score, demonstrating effectiveness for prenatal congenital heart disease screening.

semi-supervised learning · boundary refinement · pseudo-label enhancement · multi-task backbone · two-stage optimization

AffectAI-Capture: A Reproducible Multimodal Protocol for Small-Group Meeting Research

arXiv cs.AI · Meisam Jamshidi Seikavandi, Alice Modica, Anna Obara, Fabricio Batista Narcizo · 2026-05-19

AffectAI-Capture introduces a reproducible protocol for synchronized multimodal data collection in four-person meeting interactions, integrating eye tracking, wearable physiology, audio, multi-view video, event logging, and structured self-report. The protocol employs fixed task blocks based on established group-interaction paradigms, with acquisition and post-processing organized around a unified event timeline and standardized outputs. Pilot validation confirmed audio quality and video synchronization through controlled bench tests, while full protocol sessions with participants are ongoing. This architecture links task design, instrumentation, timing provenance, and data packaging for affective, behavioral, and meeting-analytics research.

multimodal data · eye tracking · wearable physiology · event timeline · task blocks

Prior Knowledge or Search? A Study of LLM Agents in Hardware-Aware Code Optimization

arXiv cs.AI · Dmitry Redko, Albert Fazlyev, Konstantin Sozykin, Maria Ivanova · 2026-05-19

The study investigates the role of pretrained priors versus search in LLM agents for hardware-aware code optimization through three experiments. Findings reveal: (1) LLMs act as greedy optimizers in black-box settings; (2) zero-shot kernel generation ignores explicit input-size instructions, with performance degrading for uncommon sizes; (3) iterative feedback improves CUDA but degrades TVM IR, indicating language density affects optimization. Results suggest LLMs rely heavily on pretrained knowledge over feedback or agentic structure.

llm agents · hardware-aware optimization · kernel generation · pretrained priors · feedback-loop optimization

From SGD to Muon: Adaptive Optimization via Schatten-p Norms

arXiv cs.AI · Thomas Massena, Corentin Friedrich, Mathieu Serrurier · 2026-05-19

The paper introduces a data-driven criterion for dynamically selecting optimal Linear Minimization Oracle (LMO) geometries in deep neural network optimization, interpolating between SGD and Muon updates. The method derives closed-form update rules from gradient and activation statistics using a single-step random feature regression surrogate, while incorporating parameter-wise preconditioning to recover SGD, Muon, Adam, and MuAdam as special cases. With only ~3% runtime overhead, the adaptive optimizer matches or outperforms Muon and AdamW across three training scenarios, demonstrating that LMO geometry can be efficiently adapted from runtime data.

linear minimization oracle · adaptive optimization · schatten-p norms · random feature regression · preconditioning
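
To make the SGD-to-Muon interpolation concrete, the sketch below computes the linear minimization oracle (LMO) over a unit Schatten-p ball from an SVD of the gradient. It uses the standard closed form for this LMO, not the paper's data-driven selection rule, which the summary does not spell out. The function name `schatten_lmo` and the exact SVD are illustrative assumptions; Muon itself is usually run with an approximate orthogonalization rather than a full SVD.

```python
import numpy as np

def schatten_lmo(grad, p):
    """Hypothetical helper: argmin of <grad, X> over the unit Schatten-p ball (p > 1).

    Assumes grad is a nonzero 2-D float array.
    """
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    if np.isinf(p):
        # Spectral-norm ball: every nonzero singular direction gets weight 1.
        d = np.where(s > 1e-12, 1.0, 0.0)
    else:
        # Dual exponent q satisfies 1/p + 1/q = 1.
        q = p / (p - 1.0)
        d = s ** (q - 1.0) / np.linalg.norm(s, ord=q) ** (q - 1.0)
    # Rescale the singular values and flip the sign to get a descent direction.
    return -(U * d) @ Vt
```

With p = 2 this returns the Frobenius-normalized negative gradient (a normalized SGD step), and with p = ∞ it returns -UVᵀ, the orthogonalized direction that Muon targets. Intermediate values of p interpolate between the two.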

Distribution-Free Uncertainty Quantification for Continuous AI Agent Evaluation

arXiv cs.AI · Yuxuan Gao, Megan Wang, Yi Ling Yu · 2026-05-19

The paper introduces distribution-free uncertainty quantification methods for continuous AI agent evaluation, adapting split conformal prediction and adaptive conformal inference (ACI) to provide coverage guarantees for forecasted quality scores. The approach achieves calibration error below 0.02 at 24h horizons, with ACI dynamically adjusting intervals by 35% after agent releases. It extends to multi-agent pipelines with compositional bounds (validated for inter-stage correlations ρ ∈ [-0.5, 0.9]), conformal abstention for pairwise rankings, and FDR-corrected leaderboard testing. Evaluation on 50 agents using 18 hourly signals shows mean conditional coverage of 80.4%, with 90% of agents within [72%, 90%], and cross-source sentiment divergence predicting ranking instability (r=0.64, p<0.01).

conformal prediction · adaptive conformal inference · uncertainty quantification · multi-agent pipelines · conditional coverage
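
As a rough illustration of the two building blocks named above, the sketch below shows textbook split conformal prediction with absolute-residual scores and the standard ACI step-size update. The function names, the residual score, and the default `gamma` are assumptions made for illustration; the paper's forecasting model and its exact scores are not described in the summary.

```python
import math

def split_conformal_interval(cal_preds, cal_targets, new_pred, alpha=0.1):
    """Interval around new_pred with at least (1 - alpha) marginal coverage
    under exchangeability, built from a held-out calibration set."""
    # Nonconformity scores: absolute residuals on the calibration data.
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    # Finite-sample corrected rank of the calibration quantile.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this level: the interval is unbounded.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    return (new_pred - q, new_pred + q)

def aci_update(alpha_t, target_alpha, covered, gamma=0.01):
    """One ACI step: a miss lowers alpha_t, which widens the next interval."""
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (target_alpha - err)
```

In an online setting, `aci_update` would be called after each observed score and its output passed as `alpha` to the next interval, which is how intervals can widen after a disruptive event such as a new agent release.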

OpenComputer: Verifiable Software Worlds for Computer-Use Agents

arXiv cs.AI · Jinbiao Wei, Qianran Ma, Yilun Zhao, Xiao Zhou · 2026-05-19

OpenComputer introduces a verifier-grounded framework for constructing verifiable software worlds for computer-use agents, integrating four components: app-specific state verifiers, a self-evolving verification layer, a task-generation pipeline, and an evaluation harness. The framework supports 33 desktop applications and 1,000 finalized tasks across diverse domains. Experiments demonstrate that OpenComputer's hard-coded verifiers outperform LLM-as-judge evaluations in aligning with human adjudication, particularly for fine-grained application states. Frontier agents exhibit challenges in end-to-end task completion, and open-source models show significant performance drops from their OSWorld-Verified scores, highlighting gaps in robust computer automation.

verifier-grounded framework · app-specific state verifiers · self-evolving verification layer · task-generation pipeline · evaluation harness

AR1-ZO: Topology-Aware Rank-1 Zeroth-Order Queries for High-Rank LoRA Fine-Tuning

arXiv cs.AI · Ziye Chen, Hongbin Lin, Chenyu Zhang, Xiangda Yan · 2026-05-19

AR1-ZO introduces a topology-aware rank-1 zeroth-order optimization method for high-rank LoRA fine-tuning, addressing the rank paradox where increasing LoRA rank degrades finite-difference signal quality. The method decomposes LoRA into rank-1 atoms, querying one atom per step with topology-aware scaling γ=αr to maintain signal strength without auxiliary mechanisms. Theoretical analysis confirms atom minimality and rank-independent active query dimension, while experiments on OPT and Qwen3 models demonstrate improved effectiveness for high-rank LoRA under fixed query budgets.

zeroth-order optimization · lora fine-tuning · rank paradox · finite-difference signal · topology-aware scaling

Synthesis and Evaluation of Long-term History-aware Medical Dialogue

arXiv cs.AI · Hebin Hu, Renke Dai, Ah-Hwee Tan, Yilin Kang · 2026-05-19

The authors introduce MediLongChat, a framework for synthesizing longitudinal medical dialogues to address the lack of datasets for evaluating long-term patient history reasoning. Their method involves three stages: creating synthetic patient profiles, generating multi-turn dialogues per encounter, and integrating them into coherent histories. They propose three benchmark tasks (In-dialogue, Cross-dialogue, and Synthesis Reasoning) and a multi-dimensional evaluation framework combining vector-based metrics with LLM-as-a-judge assessments. Experiments reveal state-of-the-art LLMs struggle with these tasks, demonstrating the benchmark's utility and the need for specialized healthcare agent methods.

medical dialogue synthesis · longitudinal reasoning · llm-as-a-judge · multi-turn dialogue generation · healthcare agent evaluation

GroupAffect-4: A Multimodal Dataset of Four-Person Collaborative Interaction

arXiv cs.AI · Meisam Jamshidi Seikavandi, Alice Modica, Anna Obara, Shan Ahmed Shaffi · 2026-05-19

The GroupAffect-4 dataset addresses gaps in multimodal affective computing by capturing co-located group interactions across four ecologically varied tasks (information pooling, negotiation, idea generation, public-goods game). It includes synchronized data from 40 participants (10 groups) with wrist-worn physiology sensors, eye-tracking glasses, close-talk microphones, self-reports, questionnaires, and personality scores. The dataset achieves 91% physiology and 98% eye-tracking coverage, validated by affective manipulation checks. Fifteen benchmark targets span within-person states, between-person traits, and group dynamics. Released with BIDS structure, Croissant metadata, and open scripts, it supports reproducible research in collaborative affect analysis.

multimodal · affective computing · group dynamics · physiology · eye-tracking

What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

arXiv cs.AI · Yuze Zhao, Junpeng Fang, Lu Yu, Zhenya Huang · 2026-05-19

The study challenges the assumption that code universally enhances reasoning in language models through controlled pretraining on a 10T-token corpus. By isolating executable code and controlling for Code-NL data, it finds code improves programming but not general reasoning, often competing with mathematical tasks. Structured reasoning traces (e.g., code-text mixtures) better explain reasoning gains than pure code. Increasing math-domain sample density boosts mathematical reasoning without compromising programming, suggesting targeted cognitive scaffolds mitigate cross-domain trade-offs. Routing analyses reveal domain interactions in expert-activation patterns.

language models · mathematical reasoning · structured reasoning traces · data-composition effects · expert-activation patterns

CogScale: Scalable Benchmark for Sequence Processing

arXiv cs.AI · Yannis Bendi-Ouis, Romain de Coudenhove, Xavier Hinaut · 2026-05-19

The paper introduces CogScale, a benchmark of 14 scalable synthetic tasks designed to evaluate sequence processing capabilities across different architectural scales. The framework enables efficient testing of novel architectures under controlled parameter budgets (1k, 10k, 100k) before large-scale deployment. Evaluations of seven architectures (GRU, LSTM, xLSTM, ESN, Mamba, Transformer variants) demonstrate that attention mechanisms and state-space models maintain performance at higher complexity, while classical RNNs excel only in basic retention tasks under strict parameter constraints.

sequence processing · parameter budget · state-space models · synthetic tasks · architectural evaluation

Memory-Augmented Reinforcement Learning Agent for CAD Generation

arXiv cs.AI · Yin Xiaolong, Liu Yu, Shen Jiahang, Lu Xingyu · 2026-05-19

The paper proposes a memory-augmented reinforcement learning framework for CAD generation agents to address limitations of LLM-based methods in handling complex models with long operation sequences and geometric constraints. The method integrates a geometric kernel toolchain, dual-track memory (case library and skill library), and dynamic utility retrieval with reinforcement learning for policy optimization. Experiments demonstrate significant improvements in success rate and geometric consistency for complex CAD generation tasks.

cad generation · reinforcement learning · memory-augmented · geometric constraints · dynamic retrieval

EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

arXiv cs.AI · Gioele Molinari, Florian Felten, Soheyl Massoudi, Mark Fuge · 2026-05-19

The paper introduces EngiAI, a multi-agent framework and benchmark suite for evaluating LLM-driven engineering design systems. The benchmark assesses three dimensions: workflow (7 prompt styles), retrieval-augmented generation (RAG with gated scoring), and HPC orchestration (SLURM cluster). The LangGraph-based MAS implementation coordinates 7 specialized agents for tasks like topology optimization and 3D printer control. Results show proprietary LLMs achieve 96-97% task completion on Beams2D versus 55-78% for open-source 4B-parameter models, with conditional branching proving most challenging (20-53% completion). RAG gating confirms retrieval's critical role (≈1.0 vs. near-zero scores), while HPC orchestration reveals performance variability (50-100% completion).

multi-agent system · retrieval-augmented generation · topology optimization · hpc orchestration · langgraph

TERGAD: Structure-Aware Text-Enhanced Representations for Graph Anomaly Detection

arXiv cs.AI · Wen Shi, Zhe Wang, Huafei Huang, Qing Qing · 2026-05-19

TERGAD introduces a structure-aware text-enhanced representation framework for graph anomaly detection, addressing limitations in existing text-rich approaches by incorporating node-level topological properties. The method translates topological features into natural language narratives, processes them via Large Language Models (LLMs) to derive semantic embeddings, and fuses these with original node attributes using a gated dual-branch autoencoder. Anomaly scores are computed based on integrated reconstruction errors, capturing deviations in both attributes and semantic expectations. Experiments on six real-world datasets show TERGAD outperforms state-of-the-art baselines, with ablation studies confirming the importance of structural semantic guidance and gated fusion.

graph anomaly detection · large language models · semantic embeddings · dual-branch autoencoder · reconstruction error

ContextRAG: Extraction-Free Hierarchical Graph Construction for Retrieval-Augmented Generation

arXiv cs.AI · Roman Prosvirnin, Sergei Kuznetsov, Seungmin Jin · 2026-05-19

ContextRAG introduces a retrieval-augmented generation system that constructs hierarchical graphs without LLM-based entity extraction, reducing computational costs. The method employs residual-quantization k-means and Formal Concept Analysis with Łukasiewicz residuated logic to derive fuzzy concept graphs through soft join/meet operations. On UltraDomain's 130-task subset, it achieves 33.6% F1 (36.8% on multi-hop tasks) using only 30 LLM calls (22k tokens), versus HiRAG's extrapolated 23M tokens, with lattice-derived nodes improving F1 by 3.9pp.

retrieval-augmented generation · formal concept analysis · residual-quantization · lukasiewicz logic · multi-hop reasoning

LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models

arXiv cs.AI · Hyunsoo Han, Sangyeop Yeo, Jaejun Yoo · 2026-05-19

The authors propose LIFT and PLACE, a coarse-to-fine knowledge distillation framework for lightweight diffusion models, addressing challenges in mimicking complex teacher denoising processes. LIFT decomposes training into coarse alignment and fine refinement phases, while PLACE extends this by partitioning outputs into error-based groups for locally adaptive guidance. The framework demonstrates effectiveness across diffusion spaces (image/latent), architectures (U-Net/DiT), tasks (unconditional/conditional), and datasets, including flow-based models like MMDiT (SD3). Under extreme compression with a 1.3M-parameter student (1.6% of teacher size), conventional KD fails (FID 50-200+), whereas the proposed method converges stably with FID 15.73.

knowledge distillation · diffusion models · coarse-to-fine · denoising process · parameter compression

Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

arXiv cs.AI · Husnain Amjad, Raja Khurram Shahzad, Aamir Shahzad, Mehwish Fatima · 2026-05-19

This survey provides a comprehensive analysis of mathematical reasoning in Large Language Models (LLMs), synthesizing approximately 120 studies to evaluate datasets, architectures, training strategies, and evaluation protocols. The study introduces a unified taxonomy for mathematical datasets, categorizing them by pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks. It systematically examines reasoning architectures, including tool integration and verifier-guided reasoning, revealing gaps in process-level verification and identifying failure modes like reasoning faithfulness and benchmark biases. Key research directions focus on improving symbolic grounding and evaluation reliability for robust LLM-based reasoning systems.

mathematical reasoning · large language models · supervised fine-tuning · verifier-guided reasoning · symbolic grounding

Measuring Safety Alignment Effects in Autonomous Security Agents

arXiv cs.AI · Isaac David, Arthur Gervais · 2026-05-19

The study introduces a trace-based benchmark to evaluate safety alignment effects in autonomous security agents, addressing limitations of single-turn refusal tests. The method analyzes 30 vulnerability-analysis tasks with deterministic metrics, comparing four language model families (including Gemma, Qwen2.5-Coder, and Llama) and their uncensored derivatives across 2,300 traces. Results show Gemma models exhibit significant performance gains in less-restricted variants (14.0% vs 0.7% success for Gemma 4 31B) with improved grounding, while other models demonstrate inconsistent or negative effects. The benchmark reveals task-specific alignment tradeoffs, demonstrating the need for system-level evaluation that separates refusal rates, tool reliability, and evidence grounding.

safety alignment · autonomous agents · vulnerability analysis · trace-based benchmark · language models

Projecting Latent RL Actions: Towards Generalizable and Scalable Graph Combinatorial Optimization

arXiv cs.AI · Franco Terranova, Guillermo Bernardez, Albert Cabellos-Aparicio, Nina Miolane · 2026-05-19

The authors propose projection agents, a novel RL-GCO approach that addresses generalization and scalability challenges in graph combinatorial optimization. The method operates in a continuous GNN-based action embedding space, predicting latent actions via a single forward pass and decoding them into discrete actions using nearest-neighbor techniques. Evaluations show 16.2x faster inference and 40% better generalization across benchmarks, with potential for super-linear decision spaces. The work includes LaGCO-RL, a Python library for latent action-space construction and RL-GCO reproducibility.

graph combinatorial optimization · reinforcement learning · graph neural networks · latent action space · nearest-neighbor decoding
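
The decoding step described above can be pictured with a minimal sketch: the policy emits one continuous vector, and the discrete action is whichever candidate embedding lies closest to it. The function name, the squared Euclidean metric, and the plain-list embeddings are assumptions made for illustration; the paper's GNN embeddings and the LaGCO-RL API are not shown here.

```python
def decode_latent_action(latent, candidate_embeddings):
    """Hypothetical decoder: index of the candidate embedding nearest to latent."""
    def sq_dist(a, b):
        # Squared Euclidean distance; the metric is an assumed choice.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(
        range(len(candidate_embeddings)),
        key=lambda i: sq_dist(latent, candidate_embeddings[i]),
    )

# Three candidate actions embedded in a 2-D latent space.
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
chosen = decode_latent_action([0.9, 0.2], embeddings)  # picks index 1
```

Because the policy runs once regardless of how many candidates exist, the per-decision cost moves from scoring every action with the network to a nearest-neighbor lookup, which is consistent with the faster inference the summary reports.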

Beyond Rational Illusion: Behaviorally Realistic Strategic Classification

arXiv cs.AI · Xinpeng Lv, Yunxin Mao, Renzhe Xu, Chunyuan Zheng · 2026-05-19

The paper introduces behaviorally realistic strategic classification (BRSC), addressing the limitation of assuming strict rationality in strategic classification by incorporating cognitive biases from prospect theory. The proposed Prospect-Guided Strategic Framework (Pro-SF) models agent behavior through three prospect theory mechanisms: asymmetric cost-benefit evaluation, subjective reference points, and non-rational probability distortion. Experiments on synthetic and real-world datasets demonstrate Pro-SF's effectiveness in capturing behaviorally grounded strategic manipulations, bridging machine learning and behavioral economics for more reliable decision systems.

strategic classification · prospect theory · cognitive biases · stackelberg game · behavioral economics
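
For readers unfamiliar with the three prospect theory mechanisms, the sketch below uses the classic Tversky and Kahneman (1992) functional forms and their published median parameters. It is not Pro-SF's model, whose parameterization the summary does not give; `manipulation_appeal` in particular is a hypothetical toy for how an agent might weigh a feature manipulation.

```python
def pt_value(outcome, reference=0.0, alpha=0.88, beta=0.88, lam=2.25):
    """Value of an outcome relative to a subjective reference point.
    Losses are scaled by lam > 1, so they loom larger than equal gains."""
    x = outcome - reference
    return x ** alpha if x >= 0 else -lam * ((-x) ** beta)

def pt_weight(p, gamma=0.61):
    """Probability distortion: small probabilities are overweighted,
    large ones underweighted."""
    num = p ** gamma
    return num / (num + (1 - p) ** gamma) ** (1 / gamma)

def manipulation_appeal(benefit, cost, p_success):
    """Hypothetical toy: perceived worth of paying a sure cost for an
    uncertain benefit from a favorable classification."""
    return pt_weight(p_success) * pt_value(benefit) + pt_value(-cost)
```

Under these forms, an agent facing a benefit equal to the cost with certain success still perceives the manipulation as unattractive (1 - 2.25 < 0), whereas a strictly rational agent would be indifferent.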

Transforming Constraint Programs to Input for Local Search

arXiv cs.AI · Jo Devriendt, Patrick De Causmaecker, Marc Denecker · 2026-05-19

The paper presents an automated method for transforming constraint optimization problems into local search neighborhoods by leveraging symmetry properties. Using the IDP system, the approach compiles constraint specifications into neighborhood structures suitable for metaheuristic algorithms. Evaluation on six classical optimization problems demonstrates the technique's viability, with results suggesting effective neighborhood generation without manual intervention.

constraint optimization · local search · symmetry properties · metaheuristic algorithms · neighborhood generation

CriterAlign: Criterion-Centric Rationale Alignment for Code Preference Judging

arXiv cs.AI · Zhenyu Li, Aleksandar Cvejic, Zehui Chen, Peter Wonka · 2026-05-19

CriterAlign introduces a criterion-centric framework for pairwise code preference prediction, addressing limitations of pointwise rubric-based LLM judges. The method employs direct criterion-level pairwise judgments, tie-driven criterion refinement, swap-consistency filtering, and pairwise synthesis, enhanced by Human-Preference-Aligned Guidance (HPAG) derived from training examples. Evaluated on BigCodeReward, CriterAlign improves accuracy from 60.4% to 66.3% over a Qwen2.5-VL-32B monolithic judge, with ablations validating the pairwise criterion design and HPAG.

pairwise preference prediction · rubric-based judging · human-preference-aligned guidance · swap-consistency filtering · criterion-centric framework
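
Swap-consistency filtering can be sketched as follows: each criterion is judged twice with the candidates in opposite order, and the verdict is kept only when both orderings agree. The `judge` callable, its return labels, and the toy judges are assumptions for illustration, not CriterAlign's actual interface.

```python
def swap_consistent_verdict(judge, code_a, code_b, criterion):
    """Return "A", "B", or "tie" if the judge agrees with itself after
    swapping the presentation order, otherwise None.

    judge(first, second, criterion) is a hypothetical callable that
    returns "first", "second", or "tie".
    """
    forward = judge(code_a, code_b, criterion)
    backward = judge(code_b, code_a, criterion)
    # Map positional answers onto candidate labels for each ordering.
    forward_label = {"first": "A", "second": "B", "tie": "tie"}[forward]
    backward_label = {"first": "B", "second": "A", "tie": "tie"}[backward]
    return forward_label if forward_label == backward_label else None

# A position-biased toy judge always prefers whatever it sees first.
biased = lambda first, second, criterion: "first"
# A content-based toy judge prefers the shorter snippet.
shorter = lambda first, second, criterion: (
    "first" if len(first) < len(second)
    else "second" if len(second) < len(first)
    else "tie"
)
```

The biased judge yields `None` on every pair, so its verdicts are filtered out, while the content-based judge survives the swap.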

Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models

arXiv cs.AI · Weicong Ni, Tianbao Jiang, Linlin Wang · 2026-05-19

The paper introduces Pseudocode-guided Structured Reasoning (PStar), a framework to mitigate hallucinations in Vision-Language Models (VLMs) by adaptively selecting structured pseudocode reasoning paths. PStar employs abstract reasoning functions, a pseudocode library, and a Difficulty Feature Vector (DFV) to assess question complexity and choose appropriate strategies. Experiments show PStar reduces hallucinations, scoring 87.1% on POPE and 68.0% on MMStar and outperforming GPT-4V, which improves reliability for real-world deployments.

pstar · vision-language models · pseudocode reasoning · difficulty feature vector · hallucination mitigation

When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach

arXiv cs.AI · Xinpeng Lv, Yunxin Mao, Renzhe Xu, Chunyuan Zheng · 2026-05-19

The paper introduces Strategic Prior-data Fitted Network (SPN), an inference-time framework that adapts tabular foundation models to strategic data manipulation by aligning predictions with post-manipulation distributions. SPN addresses the prior mismatch in pretrained PFNs by constructing strategic in-context examples, avoiding retraining. Experiments on real-world and synthetic datasets demonstrate SPN's superior robustness and predictive accuracy over both tabular foundation models and classical methods in strategic settings.

tabular foundation models · strategic manipulation · prior-data fitted networks · distribution shift · in-context learning

The Accessibility Capability Boundary: Operational Limits and Expansion Potential of AI-Generated Browser-Native Accessibility Systems

arXiv cs.AI · Rizwan Jahangir, Daisuke Ishii · 2026-05-19

The paper introduces the Accessibility Capability Boundary (ACB), a formal framework for analyzing the operational limits and expansion potential of AI-driven accessibility systems. It models accessibility as a multidimensional capability space constrained by variables such as deployment latency, cognitive load, and adaptability. The authors argue that AI-generated, browser-native systems leveraging standard APIs can reduce deployment friction and enable rapid interface adaptation. The framework is grounded in two prototypes: an AI-generated browser-native interface for a blind user in Nepal and an open-source webcam alignment assistant for visually impaired users. The study identifies computational, infrastructural, and verification constraints as hard boundaries for this paradigm.

accessibility capability boundary · ai-generated systems · browser-native · deployment latency · cognitive load

P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

arXiv cs.AI · Kai Sheng, Liuyi Wang, Haojie Dai, Jinlong Li · 2026-05-19

P2DNav introduces a hierarchical framework for zero-shot vision-and-language navigation (VLN), addressing the limitations of existing methods by disentangling directional reasoning from local grounding. The framework comprises three components: Panorama-to-Downview (P2D), which separates navigation into panoramic direction selection and downview local grounding; Sliding-Window Dialogue Memory (SDM), organizing navigation history as multi-turn dialogue context; and Reflective Reorientation Mechanism (RRM), enabling reliability assessment and reorientation. Evaluated on the R2R-CE benchmark, P2DNav achieves significant success rate (SR) gains of 146.6% and 58.9% over state-of-the-art zero-shot waypoint-based and waypoint-free methods, respectively.

zero-shot vln · panorama-to-downview · sliding-window dialogue memory · reflective reorientation mechanism · r2r-ce benchmark

optimize_anything: A Universal API for Optimizing any Text Parameter

arXiv cs.AI · Lakshya A Agrawal, Donghyun Lee, Shangyin Tan, Wenjie Ma · 2026-05-19

The paper introduces optimize_anything, a universal LLM-based optimization system that achieves state-of-the-art results across six diverse tasks by formulating optimization as improving text artifacts evaluated by scoring functions. The system supports single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs, demonstrating capabilities such as nearly tripling Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%), reducing cloud costs by 40%, and generating CUDA kernels that match or beat PyTorch in 87% of cases. Ablations show that actionable side information improves convergence and final scores, while multi-task search benefits from cross-task transfer. The work is open-sourced as part of the GEPA project.

llm-based optimization · multi-task search · cross-problem transfer · text artifact · scoring function

EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

arXiv cs.AI · Aritra Marik, Marcel Klemt, Anna Rohrbach · 2026-05-19

The paper proposes Emo-Boost, a multimodal deepfake detection framework that augments low-level audio-visual features with high-level emotion cues to improve generalization. The method integrates an off-the-shelf RGB/acoustic detector with EmoForensics, which models temporal consistency in vision/audio emotion representations. Results show complementary signals between emotion and low-level features, yielding a 2.1% AUC improvement in cross-manipulation generalization on FakeAVCeleb.

deepfake detection · multimodal fusion · emotion recognition · generalization · temporal consistency

Component-Aware Structure-Preserving Style Transfer for Satellite Sim2Real 6D Pose Estimation

arXiv cs.AI · Yonglong Zhang · 2026-05-19

The paper introduces a component-aware structure-preserving style transfer framework for satellite synthetic-to-real data construction to improve 6D pose estimation. The method uses weakly paired real-synthetic samples, extracts part-wise real-domain style codes, and injects them into synthetic regions via mask-aligned modulation, preserving geometric annotations. Adversarial training with local contrastive consistency and edge-preserving constraints maintains downstream usability. Evaluated on 5,000 synthetic and 100 real images, the approach achieves FID 54.32 and KID 0.048, improving GDRNet's ADD pass rate to 0.260 and AUC to 0.611.

sim2real · style transfer · 6d pose estimation · component-aware · adversarial training

MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

arXiv cs.AI · Feihu Huang, Yuning Luo, Songcan Chen · 2026-05-19

The paper introduces MiMuon, a mixed optimizer combining Muon and momentum-based SGD, to improve generalization in large models. Through algorithmic stability analysis, the authors prove MiMuon achieves a lower generalization error bound of O(1/N) compared to Muon's O(1/(Nκ^T)), where κ is the minimum singular value difference in gradient estimates. The method employs orthogonalized gradients while maintaining Muon's O(1/T^(1/4)) convergence rate. Experiments on Qwen3-0.6B and YOLO26m validate MiMuon's efficacy.

muon optimizer · generalization error · orthogonalization · algorithmic stability · large models

Spectral Integrated Gradients for Coarse-to-Fine Feature Attribution

arXiv cs.AI · Soyeon Kim, Seongwoo Lim, Kyowoon Lee, Jaesik Choi · 2026-05-19

The authors propose Spectral Integrated Gradients (SIG), a novel feature attribution method that improves upon Integrated Gradients by constructing integration paths via singular value decomposition. SIG activates singular components from largest to smallest, enabling a coarse-to-fine attribution progression that reduces gradient noise accumulation. Evaluations across multiple image classification datasets show SIG produces cleaner attribution maps and outperforms existing path-based methods in quantitative metrics.

integrated gradients · feature attribution · singular value decomposition · coarse-to-fine · gradient noise
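
The coarse-to-fine path can be approximated with a small sketch: start from a baseline, add singular components of the input one at a time from largest to smallest, and accumulate gradient times step along the way. The all-zero baseline, the single-channel 2-D input, the midpoint rule, and the names `spectral_path_attribution` and `grad_fn` are assumptions made for illustration; the paper's exact path construction may differ.

```python
import numpy as np

def spectral_path_attribution(x, grad_fn, steps_per_component=8):
    """Path-integrated attribution along an SVD-ordered path (illustrative).

    x is a 2-D float array; grad_fn(z) returns dF/dz with the shape of z.
    """
    U, s, Vt = np.linalg.svd(x, full_matrices=False)
    attribution = np.zeros_like(x)
    current = np.zeros_like(x)  # assumed all-zero baseline
    for i in range(len(s)):  # singular values are returned in descending order
        component = s[i] * np.outer(U[:, i], Vt[i])
        delta = component / steps_per_component
        for k in range(steps_per_component):
            # Midpoint Riemann sum along this linear segment of the path.
            midpoint = current + (k + 0.5) * delta
            attribution += grad_fn(midpoint) * delta
        current = current + component
    return attribution
```

As with any path method, the attributions should sum to approximately F(x) minus F(baseline), which gives a quick sanity check for a chosen `grad_fn`.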

Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

arXiv cs.AI · Xi Zhang, Meijun Gao, Yuntian Zhao, Xinyu Tan · 2026-05-19

The paper introduces Formal Skill, a runtime-native abstraction for LLM agents that encodes reusable capabilities through JSON metadata, Python executors, and hook-governed control logic. This approach moves procedural knowledge from prompt text to executable state machines, improving token efficiency and policy enforcement. Implemented in FairyClaw, an event-driven runtime, Formal Skill achieves competitive scores on Harness-Bench with reduced token usage, particularly excelling in tasks requiring structured skill execution.

formal skill · llm agents · runtime abstraction · token efficiency · hook policies

A novel YOLO26-MoE optimized by an LLM agent for insulator fault detection considering UAV images

arXiv cs.AI · João Pedro Matos-Carvalho, Laio Oriel Seman, Stefano Frizzo Stefenon, Mohammad Khalaf Mohammad Khreasat · 2026-05-19

The paper proposes YOLO26-MoE, a modified YOLO26 detector integrating a sparse Mixture-of-Experts (MoE) module in its high-resolution branch for improved insulator fault detection in UAV imagery. The architecture enables adaptive feature refinement for small defects and diverse fault patterns while maintaining one-stage detection efficiency. A tool-augmented LLM agent orchestrates hyperparameter optimization and training. Evaluations show state-of-the-art performance with 0.9900 mAP@0.5 and 0.9515 mAP@0.5:0.95, surpassing contemporary YOLO variants.

yolo26-moe · mixture-of-experts · uav inspection · insulator fault detection · llm agent optimization

Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

arXiv cs.AI · Mert Yildiz, Pietro Spadaccino, Alexey Rolich, Francesca Cuomo · 2026-05-19

The paper presents an empirical study of multi-model LLM scheduling under GPU memory constraints, analyzing performance impacts of layer offloading and preemption. Through systematic measurements across diverse models and hardware, it reveals non-linear throughput degradation from offloading (with smaller models showing greater sensitivity) and identifies preemption overhead dominated by model state reload rather than KV-cache transfer. Key findings include significant variation in data movement costs across architectures and the influence of sequence length on execution inefficiencies. The work provides design principles for future schedulers handling heterogeneous multi-model workloads.

large language models · gpu offloading · preemption overhead · kv-cache · heterogeneous scheduling

Implicit Action Chunking for Smooth Continuous Control

arXiv cs.AI · Bosun Liang, Shuo Pei, Zirui Chen, Chuanzhi Fan · 2026-05-19

The paper introduces Dual-Window Smoothing (DWS), an implicit action chunking framework for smooth continuous control in reinforcement learning. DWS enforces temporal coherence without expanding action space via a dual-window design: an execution window for deterministic modulation and a value window for bias correction. A lightweight actor-side temporal regularizer promotes global continuity. Evaluated on DeepMind Control Suite and industrial energy management tasks, DWS outperforms SOTA baselines, achieving smoother control, reduced jitter, and 100% success rate in vision-based autonomous driving.

reinforcement learning · action chunking · temporal coherence · continuous control · dual-window smoothing

SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects

arXiv cs.AI · Puyi Wang, Yuhao Wang, Linjie Li, Zhengyuan Yang · 2026-05-19

SceneCode introduces executable world programs for editable indoor scenes with articulated objects, addressing limitations of static mesh-based pipelines. The framework compiles natural language prompts into code-driven indoor worlds via a room-level agentic backbone and five code-generation strategies, validated through execution-guided refinement. Evaluations demonstrate improved prompt faithfulness, cleaner mesh structure, and simulator-loadable articulation metadata compared to existing methods.

scene synthesis · articulated objects · programmatic generation · executable programs · physics simulation

Lens Privacy Sealing: A New Benchmark and Method for Physical Privacy-Preserving Action Recognition

arXiv cs.AI · Mengyuan Liu, Ziyi Wang, Peiming Li, Junsong Yuan · 2026-05-19

The paper introduces Lens Privacy Sealing (LPS), a hardware solution for physical privacy-preserving action recognition using adjustable laminating film to obscure camera lenses via stochastic multi-layer scattering. It presents the P³AR dataset (114K videos) with privacy annotations and proposes MSPNet, a single-stage framework with Inter-Frame Noise Suppressor (IFNS) and Cross-Frame Semantic Aggregator (CFSA) for degraded video processing. Experiments show MSPNet nearly doubles action recognition accuracy while suppressing identity recognition, achieving superior privacy-utility trade-offs and resistance to reconstruction attacks.

privacy-preserving · action recognition · hardware solution · stochastic scattering · contrastive learning

Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

arXiv cs.AI · Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li · 2026-05-19

The paper identifies and addresses library drift, a silent failure mode in self-evolving LLM skill libraries where unbounded skill accumulation degrades performance. Through reproducible triggers (skill injection ablation, premature retirement) and trace-level diagnostics (evidence logs, contribution scores), the authors isolate the drift mechanism. They propose a governance solution (outcome-driven retirement, bounded active-cap, meta-skill authoring) that improves pass@1 from 0.258 to 0.584 on MBPP+ hard-100 over 100 rounds. Eight ablations validate the load-bearing components of the fix.

library drift · skill accumulation · outcome-driven retirement · meta-skill authoring · trace-level diagnostics
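
Two of the governance ideas above, a bounded active cap and outcome-driven retirement, can be sketched as a small data structure. Every name, threshold, and the neutral prior for unused skills are assumptions made for illustration; the paper's scoring and its meta-skill authoring step are not reproduced.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    uses: int = 0
    wins: int = 0

    @property
    def success_rate(self):
        # Unused skills get a neutral prior so they are not retired prematurely.
        return self.wins / self.uses if self.uses else 0.5

class BoundedSkillLibrary:
    """Hypothetical library that caps active skills and retires poor performers."""

    def __init__(self, active_cap, min_uses=3, retire_below=0.3):
        self.active_cap = active_cap
        self.min_uses = min_uses
        self.retire_below = retire_below
        self.active = {}
        self.retired = {}

    def add(self, name):
        self.active[name] = Skill(name)
        # Bounded active cap: evict the weakest skill when over capacity.
        while len(self.active) > self.active_cap:
            worst = min(self.active.values(), key=lambda s: s.success_rate)
            self.retired[worst.name] = self.active.pop(worst.name)

    def record(self, name, success):
        skill = self.active[name]
        skill.uses += 1
        skill.wins += int(success)
        # Outcome-driven retirement, only after enough evidence has accumulated.
        if skill.uses >= self.min_uses and skill.success_rate < self.retire_below:
            self.retired[name] = self.active.pop(name)
```

The `min_uses` guard reflects the summary's point that premature retirement is itself a trigger for drift: a skill is judged only after it has been exercised a few times.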

TORQ: Two-Level Orthogonal Rotation for MXFP4 Quantization

arXiv cs.AI · Zukang Xu, Xing Hu, Dawei Yang · 2026-05-19

TORQ introduces a training-free Post-Training Quantization framework for MXFP4 activation quantization in LLMs, addressing structural imbalances in activation distributions. The method employs two-level orthogonal rotation: macroscopic inter-block rotation redistributes activation energy using the Schur-Horn theorem, while microscopic intra-block rotation maximizes codebook utilization via maximum-entropy guidance. Evaluated on LLaMA3 and Qwen3, TORQ reduces perplexity on WikiText to 8.43 (vs. BF16's 7.61) and increases average accuracy from 38.40% (RTN) to 73.63% (vs. BF16's 74.82%), significantly closing the gap between 4-bit and full-precision inference.

mxfp4 · post-training quantization · schur-horn theorem · codebook utilization · activation quantization
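
To illustrate what "redistributing activation energy across blocks" means, the sketch below applies an orthogonal rotation to a vector whose energy sits almost entirely in one block. It uses a normalized Hadamard matrix as a stand-in rotation; TORQ's actual Schur-Horn-guided construction is not reproduced here, and the block size of 4 and the input values are made up for the example.

```python
import math


def hadamard(n):
    """Normalized Sylvester-Hadamard matrix (n must be a power of two); it is orthogonal."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return [[x / math.sqrt(n) for x in row] for row in h]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def block_energies(v, block):
    """Sum of squares inside each contiguous block of the vector."""
    return [sum(x * x for x in v[i:i + block]) for i in range(0, len(v), block)]


# One outlier channel puts nearly all the energy into the first block.
x = [8.0, 0.1, -0.2, 0.15, 0.05, -0.1, 0.2, 0.1]
rotated = matvec(hadamard(8), x)

before = block_energies(x, 4)
after = block_energies(rotated, 4)
print("block energies before:", before)
print("block energies after: ", after)
```

Because the rotation is orthogonal, total energy is unchanged and a layer's output can be preserved by folding the inverse rotation into the weights. Whether a given rotation also lowers quantization error depends on the format and the data, which is the part the paper's two-level design addresses.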

EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs

arXiv cs.AI · Yang Dai, Dian Jiao, Tianwei Lin, Wenqiao Zhang · 2026-05-19

The authors introduce EgoCoT-Bench, a benchmark for evaluating grounded and verifiable operation-centric reasoning in Multimodal Large Language Models (MLLMs) using egocentric videos. The benchmark comprises 3,172 QA pairs across 351 videos, organized into four task groups (12 sub-tasks) for perception, retrospection, anticipation, and high-level reasoning. Constructed via spatio-temporal scene graphs and human refinement, it reveals MLLMs' difficulties with fine-grained reasoning and inconsistent evidence in explanations despite correct answers.

multimodal large language models · egocentric video understanding · spatio-temporal scene graphs · operation-centric reasoning · verifiable qa pairs

CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision

arXiv cs.AI · Pengcheng Wang, Haoxiang Liu, Yang Dai, Xiangxiang Zeng · 2026-05-19

The authors introduce CaptchaBench, the first large-scale CAPTCHA benchmark with 16,000 programmatically generated samples across eight task categories, featuring detailed region and process-level annotations. They propose CaptchaMind, a reinforcement learning-based solver trained with explicit reasoning process supervision, addressing limitations of existing methods in fine-grained visual detail capture and region-level comparison. The system achieves 82.9% average success rate on CaptchaBench and 71.0% on real-world instances, significantly outperforming prior approaches without closed-source API dependencies.

captcha · reinforcement learning · visual reasoning · process supervision · benchmark

Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment

arXiv cs.AI · Grandee Lee, Yue Wang, Che Yee Lye, Luke Peh · 2026-05-19

The paper introduces Generative-Evaluative Agreement (GEA) as a validity criterion for LLM-enabled adaptive assessments, addressing self-referential validation when LLMs generate items, simulate responses, and score them. GEA measures whether an LLM's scoring function recovers the skill levels its generative function was instructed to produce. Experiments on a two-stage adaptive assessment show GEA recovers roughly half the intended variance (r = 0.698) with systematic positive bias, varying strongly by skill type (r > 0.7 for syntactic skills, near zero for design-level skills) and revealing low-skill overestimation near routing thresholds. The authors propose granular, skill-decomposed rubrics as the primary mitigation strategy.

generative-evaluative agreement · adaptive assessment · skill decomposition · routing threshold · validity criterion
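
As a rough sketch of how an agreement check like GEA could be computed, the snippet below correlates the skill levels a generator was instructed to produce with the levels a scorer recovers, and reports the mean signed bias. The function names and the toy data are assumptions; the paper's exact protocol is not given in the summary.

```python
import math


def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def gea_report(intended, recovered):
    """Agreement between instructed skill levels and the levels the scorer recovers."""
    r = pearson_r(intended, recovered)
    bias = sum(rec - tgt for tgt, rec in zip(intended, recovered)) / len(intended)
    return {"r": r, "variance_recovered": r * r, "mean_bias": bias}


# Hypothetical data: the scorer overestimates the low-skill responses.
intended = [1, 1, 2, 2, 3, 3, 4, 4]
recovered = [2, 2, 2, 3, 3, 4, 4, 4]
report = gea_report(intended, recovered)
print(report)
```

The "roughly half the intended variance" reading follows from squaring the reported correlation: 0.698² ≈ 0.487.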

Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters

arXiv cs.AI · Zhiyu Xu, Lean Wang, Yuanxin Liu, Lei Li · 2026-05-19

This study systematically investigates cross-modal skill injection for Vision-Language Models (VLMs) through model merging, focusing on scenarios, methods, and hyperparameters. The research demonstrates that domain-expert LLMs can enhance VLMs in instruction-following and cross-lingual tasks but struggle with mathematical reasoning. Classic merging methods like TA and DARE outperform alternatives, with hyperparameter tuning being critical for performance. The analysis provides quantitative insights into these methodologies, offering a framework for efficient skill transfer without extensive data or computational overhead.

vision-language models · model merging · cross-modal skill injection · hyperparameter tuning · domain-expert llms
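
For readers unfamiliar with the two merging methods named above, here is a minimal sketch of Task Arithmetic (TA) and DARE on flat parameter lists. Real merges operate on per-layer tensors of a shared backbone; the toy vectors, the scaling coefficient `lam`, and the drop rate are illustrative assumptions, not values from the paper.

```python
import random


def task_vector(finetuned, base):
    """Delta between a fine-tuned expert and the shared base model."""
    return [f - b for f, b in zip(finetuned, base)]


def task_arithmetic(base, task_vectors, lam):
    """TA: add each task vector to the base, scaled by the coefficient lam."""
    merged = list(base)
    for tv in task_vectors:
        merged = [m + lam * d for m, d in zip(merged, tv)]
    return merged


def dare(tv, drop_rate, rng):
    """DARE: drop each delta with probability drop_rate, rescale survivors by 1/(1 - drop_rate)."""
    keep = 1.0 - drop_rate
    return [d / keep if rng.random() >= drop_rate else 0.0 for d in tv]


base = [0.0, 1.0, 2.0]
expert = [0.5, 1.0, 1.0]            # stands in for a domain-expert LLM
tv = task_vector(expert, base)

merged = task_arithmetic(base, [tv], lam=0.5)
sparse_tv = dare(tv, drop_rate=0.5, rng=random.Random(0))
merged_dare = task_arithmetic(base, [sparse_tv], lam=0.5)
print(merged, merged_dare)
```

Both `lam` and `drop_rate` are the kind of hyperparameters the study reports as critical to tune.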

Efficient Elicitation of Collective Disagreements

arXiv cs.AI · Mohamed Ouaguenouni, Felipe Garrido-Lucero, Umberto Grandi, César Hidalgo · 2026-05-19

The authors introduce a stratified framework to efficiently elicit collective disagreements among voters over alternatives, addressing limitations of pairwise comparisons and full rankings. They propose the plurality matrix, which generalizes pairwise comparisons by recording the probability of each alternative ranking first in any subset. The framework defines the level of a disagreement measure as the smallest subset size needed to express it, proving that many existing notions, including rank-variance and divisiveness, require level 3. Theoretical and experimental results demonstrate the utility of higher levels. Two elicitation protocols are designed to estimate the plurality matrix, balancing participant count and cognitive load.

plurality matrix · disagreement measures · elicitation protocols · pairwise comparisons · rank-variance
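
The plurality matrix described above can be computed directly when full rankings are available. The sketch below does that for a three-voter profile; the dictionary-of-subsets representation and the function name are assumptions, and the paper's elicitation protocols (which estimate these entries without full rankings) are not shown.

```python
from itertools import combinations


def plurality_matrix(rankings, max_size):
    """Map each subset S to {a: share of voters who rank a first among S}."""
    alternatives = sorted(rankings[0])
    matrix = {}
    for size in range(2, max_size + 1):
        for subset in combinations(alternatives, size):
            counts = dict.fromkeys(subset, 0)
            for ranking in rankings:
                # First alternative in the voter's ranking that belongs to the subset.
                top = next(a for a in ranking if a in subset)
                counts[top] += 1
            matrix[frozenset(subset)] = {a: c / len(rankings) for a, c in counts.items()}
    return matrix


# A Condorcet cycle: each voter ranks the alternatives from best to worst.
rankings = [("a", "b", "c"), ("b", "c", "a"), ("c", "a", "b")]
pm = plurality_matrix(rankings, max_size=3)
print(pm[frozenset("ab")])    # level-2 entry: an ordinary pairwise comparison
print(pm[frozenset("abc")])   # level-3 entry: plurality shares over the full set
```

Entries for subsets of size two recover pairwise comparisons, which is why the matrix is described as a generalization of them; measures at "level 3" need the size-three entries as well.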

BLINKG: A Benchmark for LLM-Integrated Knowledge Graph Generation

arXiv cs.AI · Carla Castedo, Enrique Iglesias, Manuel Lama, Alberto Bugarin-Diz · 2026-05-19

The paper introduces BLINKG, a benchmark for evaluating Large Language Models (LLMs) in Knowledge Graph (KG) generation from heterogeneous data sources. The benchmark comprises scenarios of increasing complexity based on real-world use cases, assessing LLMs' ability to map data schemas to ontology concepts. Experimental evaluation shows LLMs offer promising but limited performance in complex scenarios, highlighting requirements for semi-automated KG construction and opening new research directions.

knowledge graph generation · large language models · ontology alignment · benchmark evaluation · heterogeneous data

Base Models Look Human To AI Detectors

arXiv cs.AI · Yixuan Even Xu, Ziqian Zhong, Aditi Raghunathan, Fei Fang · 2026-05-19

The paper demonstrates that text from base models is frequently classified as human-written by commercial AI-text detectors (GPTZero, Pangram), while text from instruction-tuned models is not. To exploit this finding, the authors introduce Humanization by Iterative Paraphrasing (HIP), a detector-agnostic pipeline that fine-tunes a base model for paraphrasing and applies it iteratively. HIP achieves superior trade-offs between semantic preservation and detector evasion compared to baselines, consistently improving human-likeness scores across Llama-3 and Qwen-3 model families (0.6B to 70B). Results suggest current detectors track instruction-tuning artifacts rather than fundamental machine-generated text characteristics, necessitating detector designs that explicitly model these factors.

instruction tuning · paraphrasing · detector evasion · semantic preservation · human-likeness

Position: The Turing-Completeness of Real-World Autoregressive Transformers Relies Heavily on Context Management

arXiv cs.AI · Guanyu Cui, Zhewei Wei, Kun He · 2026-05-19

The paper clarifies distinctions between fixed-system and scaling-family settings for Transformer Turing-completeness claims, arguing that real-world LLMs operate in the former. It formalizes the fixed-system setting, demonstrating that context-management methods critically influence computational power. Results show existing proofs in scaling-family settings provide resource bounds but not Turing-completeness, addressing common misinterpretations.

transformers · turing-completeness · context-management · autoregressive · llms

ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

arXiv cs.AI · Carlo Romeo, Andrew D. Bagdanov · 2026-05-19

The paper introduces ARC-RL, a MuJoCo-based reinforcement learning benchmark suite featuring four stylized robotic morphologies (18-DoF Queen, 12-DoF Bastion, 18-DoF Tick, 12-DoF Leaper) inspired by ARC Raiders. The environments employ a unified reward function combining velocity tracking, gait compliance, and safety penalties without motion-capture data, alongside hand-crafted Central Pattern Generator demonstrators. Experiments compare online algorithms (SAC, SPEQ, SOPE-EO) and prior-data-augmented methods (SACfD, SPEQ-O2O, SOPE), analyzing their performance across morphological diversity and stylistic constraints.

reinforcement learning · mujoco · central pattern generator · morphological diversity · continuous control

CANINE: Coaching Visually Impaired Users for Interactive Navigation with a Robot Guide Dog

arXiv cs.AI · Cunjun Yu, Zishuo Wang, Anxing Xiao, Linfeng Li · 2026-05-19

CANINE introduces an automated coaching system for training visually impaired users in interactive navigation with robot guide dogs, addressing the challenge of subtle human-robot coordination. The system decomposes navigation into sub-skills, employing knowledge tracing to prioritize training on weak areas and using foundation models to infer error causes and generate adaptive verbal feedback. A controlled study with blindfolded participants demonstrates CANINE's superiority over generic instructions in learning efficiency and navigation performance. Retention and case studies confirm lasting skill improvement and real-world applicability, aligning with controlled study findings.

robot guide dog · knowledge tracing · foundation models · adaptive feedback · interactive navigation
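
The summary says only that CANINE uses knowledge tracing to prioritize weak sub-skills. As one concrete possibility, the sketch below uses classic Bayesian Knowledge Tracing (BKT), which is a swapped-in standard technique and not necessarily the paper's model. The sub-skill names and the slip, guess, and learn parameters are hypothetical.

```python
def bkt_update(p_known, correct, slip=0.1, guess=0.2, learn=0.15):
    """One Bayesian Knowledge Tracing step: posterior given the outcome, then learning."""
    if correct:
        num = p_known * (1 - slip)
        den = num + (1 - p_known) * guess
    else:
        num = p_known * slip
        den = num + (1 - p_known) * (1 - guess)
    posterior = num / den
    return posterior + (1 - posterior) * learn


def next_skill(mastery):
    """Prioritize the sub-skill with the lowest estimated mastery."""
    return min(mastery, key=mastery.get)


# Hypothetical sub-skills, all starting from the same prior.
mastery = {"follow_handle": 0.3, "turn_on_cue": 0.3, "stop_at_curb": 0.3}
mastery["follow_handle"] = bkt_update(mastery["follow_handle"], correct=True)
mastery["stop_at_curb"] = bkt_update(mastery["stop_at_curb"], correct=False)
print(mastery, "-> train next:", next_skill(mastery))
```

A success raises the mastery estimate and a failure lowers it, so the coach's next exercise targets the sub-skill where the user is currently weakest.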

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

arXiv cs.AI · Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang · 2026-05-19

The paper proposes a reinforcement learning-based jailbreak method for Large Reasoning Models (LRMs) that incorporates attention patterns into the reward function, alongside diverse persuasion strategies to expand the action space. Analysis reveals that successful jailbreaks correlate with lower attention to harmful tokens in the input and higher attention to harmful tokens in the reasoning content. Experiments on five LRMs across three benchmarks show the method achieves significantly higher attack success rates than existing approaches, demonstrating improved effectiveness, efficiency, and transferability.

large reasoning models · jailbreak attacks · reinforcement learning · attention patterns · attack success rate

CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

arXiv cs.AI · Haobo Hu, Xiangwu Guo, Zhiheng Chen, Difei Gao · 2026-05-19

The authors introduce CutVerse, a benchmark for evaluating GUI agents in professional media post-production workflows, addressing a gap in current GUI agent capabilities. The benchmark comprises 186 complex tasks across 7 applications (e.g., Premiere Pro, Photoshop), with expert demonstrations and a lightweight parser for structured action trajectory extraction. Evaluations show existing agents achieve only 36.0% task success, highlighting challenges in long-horizon reliability and domain-specific planning despite strengths in spatial grounding and multimodal alignment.

gui agents · media post-production · long-horizon tasks · multimodal interfaces · action trajectories

Sampling-Based Safe Reinforcement Learning

arXiv cs.AI · Luca Vignola, Bruce D. Lee, Manish Prajapat, Manuel Wendl · 2026-05-19

The paper introduces Sampling-Based Safe Reinforcement Learning (SBSRL), a model-based RL algorithm ensuring safety during learning by enforcing constraints across sampled dynamics. The method approximates worst-case optimization over uncertain dynamics and incorporates an exploration strategy based on epistemic uncertainty constraints, eliminating explicit exploration bonuses. Theoretical analysis provides high-probability safety guarantees and finite-time sample complexity bounds. Empirical results demonstrate safe and efficient exploration in simulation and real robotics, with extensions to deep-ensemble implementations for high-dimensional continuous control.

safe reinforcement learning · model-based rl · epistemic uncertainty · continuous control · sample complexity
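
The core idea of enforcing a constraint across sampled dynamics can be shown in a one-step toy problem. This sketch is not SBSRL itself: the scalar dynamics `x' = a*x + b*u`, the uniform parameter ranges, and the helper names are all assumptions chosen to keep the example small.

```python
import random


def sample_models(n, rng):
    """Hypothetical epistemic uncertainty: plausible (a, b) pairs for x' = a*x + b*u."""
    return [(rng.uniform(0.9, 1.1), rng.uniform(0.8, 1.2)) for _ in range(n)]


def is_safe(x, u, models, x_max):
    # Pessimistic check: |x'| <= x_max must hold under every sampled model.
    return all(abs(a * x + b * u) <= x_max for a, b in models)


def safe_action(x, candidates, models, x_max, target):
    """Among actions safe under all samples, minimize worst-case distance to the target."""
    safe = [u for u in candidates if is_safe(x, u, models, x_max)]
    if not safe:
        return None
    return min(safe, key=lambda u: max(abs(a * x + b * u - target) for a, b in models))


rng = random.Random(0)
models = sample_models(20, rng)
candidates = [-1.0, -0.5, 0.0, 0.5, 1.0]
chosen = safe_action(1.0, candidates, models, x_max=1.5, target=1.4)
print("chosen action:", chosen)
```

As more data narrows the set of plausible models, fewer actions are ruled out, which is one intuition for how uncertainty-based constraints can drive exploration without an explicit bonus.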

Quantifying the Pre-training Dividend: Generative versus Latent Self-Supervised Learning for Time Series Foundation Models

arXiv cs.AI · Noam Major, Kathy Razmadze, Yoli Shavit · 2026-05-19

The study quantifies the pre-training dividend of self-supervised learning (SSL) for time series foundation models, comparing Generative paradigms against Latent Alignment architectures. Adaptations of LeJEPA and DINO for time series are introduced, utilizing Discrete Wavelet Transform (DWT) augmentations to enforce invariance to local fluctuations. Results reveal asymmetric gains: SSL yields up to 375% improvement for anomaly detection and classification, but marginal benefits for forecasting. Representational utility is governed by a precision-invariance trade-off, aligning task-specific signal resolution with the objective. Representation quality saturates at moderate architectural depths and is independent of data origin, suggesting scaling via synthetic generation.

self-supervised learning · latent alignment · discrete wavelet transform · anomaly detection · synthetic generation
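
A DWT-based augmentation of the kind mentioned above might look like the following sketch: a one-level Haar transform splits a series into coarse (approximation) and local (detail) coefficients, and only the detail coefficients are perturbed. The specific perturbation and the `detail_scale` parameter are assumptions; the summary does not state which wavelet or perturbation the authors use.

```python
import math
import random


def haar_dwt(signal):
    """One-level Haar transform of an even-length signal."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out += [(a + d) / s, (a - d) / s]
    return out


def dwt_augment(signal, rng, detail_scale=0.5):
    """Hypothetical augmentation: randomly shrink the local-fluctuation coefficients."""
    approx, detail = haar_dwt(signal)
    detail = [d * rng.uniform(detail_scale, 1.0) for d in detail]
    return haar_idwt(approx, detail)


series = [1.0, 3.0, 2.0, 2.5, 4.0, 3.0, 5.0, 7.0]
view = dwt_augment(series, random.Random(0))
print(view)
```

Two such views of the same series share their coarse structure and differ only in fine detail, which is the invariance a latent-alignment objective would then be trained to respect.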

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

arXiv cs.AI · Xiaozhe Li, Yang Li, Xinyu Fang, Shengyuan Ding · 2026-05-19

The paper introduces DMPO (Distribution-Matching Policy Optimization), a method to prevent mode collapse in on-policy reinforcement learning by approximating forward KL minimization. DMPO constructs a target distribution over trajectories proportional to rewards and aligns the policy distribution to it, enabling sustained exploration. Evaluated on NP-hard combinatorial optimization tasks, DMPO achieves 43.9% Quality Ratio on text-based NP-Bench (9% improvement over GRPO) and 43.1% on vision-based NP-Bench (12% improvement), with gains extending to mathematical reasoning (+2.0%) and out-of-domain tasks (+2.3%).

mode collapse · distribution matching · forward kl · policy optimization · np-hard
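
To make the distribution-matching idea concrete, the sketch below builds a reward-proportional target over a handful of sampled trajectories and measures the forward KL divergence from it to the policy. It is an illustration under assumptions: restricting the divergence to the sampled set, renormalizing the policy over that set, and the toy numbers are all choices made here, not details from the paper.

```python
import math


def target_distribution(rewards):
    """Target over sampled trajectories, proportional to (non-negative) reward."""
    total = sum(rewards)
    return [r / total for r in rewards]


def forward_kl(target, policy_logprobs):
    """KL(target || policy) on the sampled set, with the policy renormalized over it."""
    log_z = math.log(sum(math.exp(lp) for lp in policy_logprobs))
    kl = 0.0
    for q, lp in zip(target, policy_logprobs):
        if q > 0:
            kl += q * (math.log(q) - (lp - log_z))
    return kl


rewards = [1.0, 1.0, 2.0, 0.0]
target = target_distribution(rewards)

# A collapsed policy puts nearly all mass on one rewarded trajectory.
collapsed = [math.log(p) for p in [0.97, 0.01, 0.01, 0.01]]
# A matched policy spreads mass in proportion to reward.
matched = [math.log(p) for p in [0.25, 0.25, 0.4999, 0.0001]]

kl_collapsed = forward_kl(target, collapsed)
kl_matched = forward_kl(target, matched)
print(kl_collapsed, kl_matched)
```

Forward KL is mode-covering: it heavily penalizes a policy that assigns little probability to any trajectory the target supports, which is why minimizing it discourages collapse onto a single solution.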

Generative Auto-Bidding with Unified Modeling and Exploration

arXiv cs.AI · Mingming Zhang, Feiqing Zhuang, Na Li, Shengjie Sun · 2026-05-19

The paper proposes GUIDE, a generative auto-bidding framework that unifies exploration and safety in digital advertising. GUIDE integrates a Decision Transformer (DT) for joint modeling of historical bidding actions and environmental states, a Q-value module for exploration regularization, and an Inverse Dynamics Module (IDM) for safe policy fallback. The framework adaptively selects actions between exploration and fallback via an 'explore-safeguard-select' pipeline. Evaluations on public datasets, simulated auctions, and large-scale deployment on Taobao demonstrate GUIDE's superiority, achieving +4.10% GMV, +1.40% clicks, +1.66% cost, and +3.52% ROI compared to state-of-the-art baselines.

decision transformer · inverse dynamics module · auto-bidding · exploration regularization · q-value module

Resilient Byzantine Agreement with Predictions

arXiv cs.AI · Julien Dallot, Darya Melnyk, Tijana Milentijevic, Stefan Schmid · 2026-05-19

The paper characterizes algorithmic resilience in Byzantine Agreement problems with predictors flagging faulty nodes, presenting tight consistency-robustness trade-offs for both non-authenticated and authenticated settings. For $n$ nodes and parameter $α$, algorithms tolerate $α\cdot n$ faults when predictors are correct and $\frac{1-α}{2} \cdot n - 1$ faults when predictors are wrong, improving to $(1-α) \cdot n - 1$ in authenticated settings. Resilience decreases linearly with predictor inaccuracy, losing one unit per wrong prediction in non-authenticated settings and halving this decline in authenticated settings. Tight impossibility results show these bounds are exact.

byzantine agreement · algorithmic resilience · consistency-robustness trade-offs · authenticated settings · predictor accuracy

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

arXiv cs.AI · Xiaozhe Li, Tianyi Lyu, Yang Li, Yichuan Ma · 2026-05-19

The paper introduces SERL, a selective environment-reweighted learning framework for multi-turn LLM agents that optimizes credit assignment by leveraging diverse environmental feedback. SERL combines task rewards for update direction with environment feedback (error messages, observations, etc.) to adjust update placement and magnitude, focusing on critical actions. Evaluated on ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success rates, outperforming RL and distillation baselines. Analysis demonstrates that selective, action-relevant feedback at meaningful points yields better performance than indiscriminate use of longer context.

reinforcement learning · credit assignment · multi-turn agents · environment feedback · selective distillation

Targeted Downstream-Agnostic Attack

arXiv cs.AI · Zhuxin Lei, Ziyuan Yang, Yi Zhang · 2026-05-19

The paper introduces Targeted Downstream-Agnostic Attack (TDAA), a method for generating adversarial examples that force pre-trained encoders to produce identical features for both adversarial inputs and a pre-selected 'threat image'. Unlike prior downstream-agnostic attacks (DAAs) that use shared perturbations, TDAA employs example-specific perturbations via a generator, ensuring high attack success and stealth. The approach is evaluated on 10 self-supervised methods across 3 benchmarks, revealing significant vulnerabilities in pre-trained encoders.

targeted attack · downstream-agnostic · adversarial examples · pre-trained encoders · feature-level anchor

When the Majority Votes Wrong, the Intervention Timing for Test-Time Reinforcement Learning Hides in the Extinction Window

arXiv cs.AI · Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang · 2026-05-19

The paper introduces TTRL-Guard, a framework addressing misinterpreted accuracy gains in test-time reinforcement learning (TTRL) for mathematical reasoning. TTRL-Guard targets the Correct-Answer Extinction Window, a phenomenon where correct-answer signals in low-ability problems are briefly active before suppression. The framework employs Flip-Rate-Aware Reward Scaling (FRS) to down-weight at-risk updates, Minority-Preserving Sampling (MPS) to retain gradient signals from minority correct answers, and Risk-Conditioned Sparse Updatings (RCSU) to suspend updates on polarized problems. Experiments across three models and four benchmarks demonstrate that TTRL-Guard achieves the best average pass@1 on Qwen2.5-7B-Instruct and Qwen3-4B, with a +54% relative improvement over TTRL on AIME 2025.

test-time reinforcement learning · extinction window · flip-rate-aware reward scaling · minority-preserving sampling · risk-conditioned sparse updatings

KappaPlace: Learning Hyperspherical Uncertainty for Visual Place Recognition via Prototype-Anchored Supervision

arXiv cs.AI · Maya Yanko, Yoli Shavit · 2026-05-19

KappaPlace introduces a novel framework for uncertainty-aware Visual Place Recognition (VPR) via Prototype-Anchored supervision, addressing the lack of well-calibrated uncertainty estimation in existing methods. The approach models image descriptors as von Mises-Fisher variables and predicts concentration parameters to quantify aleatoric uncertainty, extending beyond query-centric views to match-level reliability assessment. Evaluated on five benchmarks, KappaPlace reduces Expected Calibration Error (ECE@K) by up to 50% while maintaining or improving retrieval recall. The framework supports both joint training and post-training extensions for frozen backbones, offering robust uncertainty signals for safety-critical robotics applications.

visual place recognition · prototype-anchored supervision · von mises-fisher · aleatoric uncertainty · expected calibration error
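
For context on the calibration metric, the snippet below computes the standard binned Expected Calibration Error. The paper's ECE@K is a retrieval-specific variant whose exact definition is not given in the summary, and how a predicted concentration parameter is mapped to a confidence in [0, 1] is left out here, so treat this as the generic metric only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: bin-size-weighted average of |mean confidence - accuracy|."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 falls in the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(h for _, h in bucket) / len(bucket)
        ece += len(bucket) / len(confidences) * abs(avg_conf - accuracy)
    return ece


# Made-up retrieval outcomes: 1 means the retrieved place was correct.
confidences = [0.9, 0.9, 0.9, 0.9, 0.9]
mostly_right = expected_calibration_error(confidences, [1, 1, 1, 1, 0])
mostly_wrong = expected_calibration_error(confidences, [1, 0, 0, 0, 0])
print(mostly_right, mostly_wrong)
```

A lower value means stated confidence tracks observed accuracy more closely, which is what a reported reduction of up to 50% refers to.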

Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation

arXiv cs.AI · Bing Wang, Shaotian Yan, Chen Shen, Kaiyuan Liu · 2026-05-19

The paper proposes MOTAB, a novel LLM reasoning distillation pipeline addressing dual exposure biases in knowledge transfer from teacher to student models. The method dynamically monitors student-generated trajectories during on-policy distillation, backtracking to safe states when deviations exceed an adaptive threshold and invoking teacher correction. Experiments on LIMO-v2 and AceReason datasets show MOTAB achieves ~3% average improvement in reasoning performance by mitigating both standard exposure bias (from training-inference distribution mismatch) and reversed exposure bias (from teacher guidance on sub-optimal student contexts).

llm reasoning distillation · exposure bias · chain-of-thought · on-policy learning · knowledge transfer

When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR

arXiv cs.AI · Yuchun Miao, Sen Zhang, Yuqi Zhang, Yaorui Shi · 2026-05-19

The paper introduces Dynamic Gradient Gating (DGG), a method to improve sample efficiency in Reinforcement Learning with Verifiable Rewards (RLVR) by dynamically controlling gradient reuse. The authors identify Disproportionate Weight Divergence (DWD), where performance degradation correlates with sharp gradient surges in the lm_head layer, and prove this layer's gradient norm bounds policy divergence. DGG monitors lm_head gradients in real-time, intercepting harmful updates. Evaluations across math, ALFWorld, WebShop, and QA tasks show DGG achieves up to 2.93× sample efficiency and 2.14× speedup while matching single-use baseline performance.

reinforcement learning · sample efficiency · gradient gating · policy divergence · lm_head
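
A gate of the kind described above could, in its simplest form, compare each new lm_head gradient norm against a running baseline and block the update when it surges. The `GradientGate` class, the surge factor, and the moving-average rule below are hypothetical; the paper's actual gating criterion is not specified in the summary.

```python
class GradientGate:
    """Hypothetical gate: stop reusing a batch when the lm_head gradient norm surges."""

    def __init__(self, surge_factor=3.0, momentum=0.9):
        self.surge_factor = surge_factor
        self.momentum = momentum
        self.baseline = None  # running average of accepted gradient norms

    def allow(self, grad_norm):
        if self.baseline is None:
            self.baseline = grad_norm
            return True
        if grad_norm > self.surge_factor * self.baseline:
            return False  # intercept the update; the baseline is left untouched
        self.baseline = self.momentum * self.baseline + (1 - self.momentum) * grad_norm
        return True


gate = GradientGate()
norms = [1.0, 1.1, 0.9, 5.0, 1.0]   # made-up lm_head gradient norms per reuse step
decisions = [gate.allow(n) for n in norms]
print(decisions)
```

In a training loop, a `False` decision would mean discarding that gradient step and drawing fresh rollouts instead of reusing the current batch again.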

Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling

arXiv cs.AI · Longgang He, Longzhu He, Daojing He, Chaozhuo Li · 2026-05-19

SIGMA introduces a signed graph-informed multi-agent reasoning framework that explicitly models trust, conflict, and neutral relations among agents to address error propagation and unreliable interaction patterns in LLM-based multi-agent systems. The method constructs a structured signed interaction graph with confidence-weighted edges, employs conflict-aware signed message passing to reinforce trustworthy signals while suppressing conflicts, and performs structure- and conflict-aware weighted aggregation for globally consistent predictions. Experiments on six benchmarks across multiple LLM backbones show SIGMA outperforms state-of-the-art baselines, achieving significant improvements in accuracy and conflict-resilient performance.

multi-agent systems · signed graph · conflict-aware · message passing · weighted aggregation

Unlocking the Potential of Continual Model Merging: An ODE Perspective

arXiv cs.AI · Lihong Lin, Haidong Kang · 2026-05-19

The paper proposes ODE-driven Merging (ODE-M), a novel method for Continual Model Merging (CMM) that addresses catastrophic forgetting by tracing low-loss paths in parameter space. Drawing on mode connectivity theory, ODE-M integrates a time-dependent velocity field and enforces barrier constraints to prevent loss-increasing steps during sequential model merging. Experiments show ODE-M achieves state-of-the-art performance on mainstream CMM benchmarks, outperforming existing merging rules that lack explicit control over learning capacity allocation.

continual model merging · mode connectivity · ode-driven merging · catastrophic forgetting · parameter space

A Bitter Lesson for Data Filtering

arXiv cs.AI · Christopher Mohri, John Duchi, Tatsunori Hashimoto · 2026-05-19

The study challenges conventional wisdom on data filtering for pretraining large models, demonstrating that unfiltered data yields superior results in high-compute regimes. Through scaling experiments targeting data-scarce scenarios, the authors show that sufficiently large models not only tolerate low-quality data but benefit from it. Results indicate that optimal performance is achieved without any data filtering when training compute and model size are sufficiently scaled.

data filtering · pretraining · scaling laws · large language models · compute-optimal training

Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models

arXiv cs.AI · Wooseok Jeon, Seungho Park, Seunghyun Shin, Sangeyl Lee · 2026-05-19

The paper introduces DyMoS (Dynamic Motion Slider), a training-free method to address reference-frame dominance in image-to-video (I2V) models, which often produce overly static outputs. By rebalancing self-attention pathways to reduce excessive focus on reference-frame key tokens during initial denoising steps, DyMoS enhances inter-frame dynamics without modifying model weights or input images. Experiments on multiple state-of-the-art I2V backbones show improved motion dynamics while preserving visual quality and reference fidelity, controlled via a single scalar parameter.

image-to-video · reference-frame dominance · self-attention · denoising · motion dynamics
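
One plausible way to picture the rebalancing described above is to down-weight attention on reference-frame key tokens by a single scalar during early denoising steps and renormalize. The sketch below does this for one query's attention row; it is an assumption-laden illustration, and the function names, the log-scale bias, and the step cutoff are not taken from the paper.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rebalanced_attention(logits, is_reference, scale, step, early_steps):
    """Down-weight reference-frame keys by `scale` (0 < scale <= 1) in early steps only."""
    if step < early_steps and scale != 1.0:
        # Adding log(scale) to a logit multiplies its unnormalized weight by scale.
        logits = [v + math.log(scale) if ref else v for v, ref in zip(logits, is_reference)]
    return softmax(logits)


logits = [2.0, 2.0, 0.0, 0.0]                 # one query's scores over four key tokens
is_reference = [True, True, False, False]      # first two keys come from the reference frame

early = rebalanced_attention(logits, is_reference, scale=0.25, step=0, early_steps=5)
late = rebalanced_attention(logits, is_reference, scale=0.25, step=10, early_steps=5)
print(sum(early[:2]), sum(late[:2]))           # attention mass on reference keys
```

Setting `scale` to 1.0 leaves attention unchanged, so a single scalar acts as the motion control, consistent with the training-free, weight-preserving description in the summary.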

EmbGen: Teaching with Reassembled Corpora

arXiv cs.AI · Arun K Lenin, Kai Rouse, Andrea Nicastro, Anna Leontjeva · 2026-05-19

EmbGen introduces a synthetic data generation pipeline for domain adaptation of instruction-tuned models, addressing limitations of homogenized outputs and cross-document dependencies in existing methods. The approach decomposes corpora into entity-description pairs, reassembles them via embedding-based semantic structure, and generates QA pairs through proximity, intra-cluster, and inter-cluster sampling with cluster-specialized prompts. Evaluated against EntiGraph, InstructLab, and Knowledge-Instruct on three datasets under 5M and 20M token budgets, EmbGen improves Binary Accuracy by 12.5% and 88.9% respectively on the most heterogeneous dataset while remaining competitive on others.

synthetic data generation · embedding similarity · instruction tuning · domain adaptation · binary accuracy

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

arXiv cs.AI · Qiran Zhang, Yuheng Wang, Runde Yang, Lin Wu · 2026-05-19

The paper introduces PRISM, a benchmark for programmatic spatial-temporal reasoning with 10,372 human-calibrated instruction-code pairs, 20x the scale of prior benchmarks. It spans 437 subject categories across English and Chinese, focusing on knowledge visualization. The authors propose a funnel-style evaluation framework with four metrics: Code-Level Reliability, Spatial Reasoning, Prompt-Aware Dynamic Visual Complexity (PADVC), and Temporal Density (TD). Evaluation of seven LLMs reveals an Execution-Spatial Gap, with a 41% average drop from execution success to spatial correctness, demonstrating that runnable code often fails to produce spatially coherent animations.

programmatic video generation · spatial-temporal reasoning · execution-spatial gap · knowledge visualization · dynamic visual complexity

The Evaluation Game: Beyond Static LLM Benchmarking

arXiv cs.AI · Paul Wang, Jade Garcia-Bourrée, Anne-Marie Kermarrec, Vincent Corruble · 2026-05-19

The paper introduces a game-theoretic framework to model the interaction between evaluators and trainers in defending against jailbreaks in Large Language Models (LLMs). Using group actions to formalize data augmentation, the study analyzes generalization regimes, showing evaluators maintain constant miss ratios below critical thresholds. Empirical evidence from Llama, Qwen, and Mistral models indicates fine-tuning on adversarial prompts yields local generalization, with refusal rates correlated to prompt distance. The framework redefines benchmarks as dynamic orbits under group actions, challenging static evaluation protocols.

jailbreaks · group actions · generalization regimes · adversarial prompts · refusal rates

Generative Recursive Reasoning

arXiv cs.AI · Junyeob Baek, Mingyu Jo, Minsu Kim, Mengye Ren · 2026-05-19

The paper introduces Generative Recursive reAsoning Models (GRAM), a framework that extends recursive reasoning to probabilistic multi-trajectory computation. GRAM models reasoning as stochastic latent trajectories, enabling multiple hypotheses and solution strategies through recursive depth and parallel sampling. It functions as a latent-variable generative model supporting both conditional ($p_θ(y \mid x)$) and unconditional ($p_θ(x)$) generation. Trained with amortized variational inference, GRAM outperforms deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks while demonstrating unconditional generation capabilities.

recursive reasoning · latent-variable models · variational inference · multi-trajectory computation · constraint satisfaction
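
For orientation, amortized variational inference for a conditional latent-variable model typically optimizes an evidence lower bound of the following generic form. The symbols $z$ (the latent reasoning trajectory), $q_\phi$ (the amortized posterior), and the conditional prior $p_θ(z \mid x)$ are assumptions here; the summary does not give GRAM's exact objective.

```latex
\log p_\theta(y \mid x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid x, z)\big]
\;-\; \mathrm{KL}\big(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\big)
```

Sampling several $z$ from the prior at inference time is what would yield multiple candidate solutions for the same input.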

Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings

arXiv cs.AI · Chenyu Lian, Hong-Yu Zhou, Chun-Ka Wong, Jing Qin · 2026-05-19

The paper introduces CoNNS, a concept-guided noisy-negative suppression framework for zero-shot classification and grounding of chest X-ray findings. The method constructs a hierarchical concept ontology using large language models, structuring 41 clinical concepts by presence, attributes, and texts. It implements cross-patient pair relabeling via fine-grained breakdown, noisy negative filtering, and hard negative mining, followed by a Concept-Aware NCE loss for visual-text alignment. Experiments on multi-granularity zero-shot grounding tasks and five classification datasets show CoNNS outperforms state-of-the-art models.

zero-shot classification · noisy negatives · concept ontology · contrastive learning · chest x-ray

Multi-Scale Generative Modeling with Heat Dissipation Flow Matching

arXiv cs.AI · Jun Ma, Hanquan Zhang, Yanjun Qin, Haoyuan Guan · 2026-05-19

The paper introduces Heat Dissipation Flow Matching (HDFM), a novel generative model that integrates continuous blur-based corruption into Flow Matching (FM) to inject multi-scale priors. HDFM addresses ill-posedness in the inverse heat-dissipation process by aligning an interpolated heat-dissipation path and mitigates high-dimensional regression difficulty via $x$-prediction. Experiments demonstrate HDFM's superiority over baseline methods across datasets, with ablations confirming the benefits of blur and $x$-prediction.

heat dissipation flow matching · multi-scale priors · flow matching · inverse heat-dissipation · $x$-prediction

Toward User Comprehension Supports for LLM Agent Skill Specifications

arXiv cs.AI · Zikai Alex Wen · 2026-05-19

The study evaluates LLM agent skill specifications as user comprehension aids, proposing four comprehension anchors: operational basis, output contract, boundary disclosure, and example capability demonstration. Analyzing 878 cybersecurity skills via rule-based coding, it finds only 19.0% include example demonstrations and 2.3% cover all anchors. A DNS/C2 telemetry subset (n=6) reveals missing examples complicate local checks, requiring helper code inspection. The work advocates treating specifications as capability disclosures rather than mere executable containers.

llm agent · skill specifications · comprehension anchors · rule-based coding · cybersecurity skills

Skinned Motion Retargeting with Spatially Adaptive Interaction Guidance

arXiv cs.AI · Soojin Choi, Seokhyeon Hong, Chaelin Kim, Junghyun Nam · 2026-05-19

The paper presents a geometry-aware motion retargeting framework that preserves interaction semantics across characters with varying body proportions. The method dynamically repositions anchors via a Transformer-based refinement strategy, using differentiable soft projection to constrain anchors to target geometry, and employs a graph-based autoencoder for skeletal motion prediction. An alternating training scheme optimizes anchor adaptation and motion retargeting jointly. Evaluations show superior interaction fidelity preservation compared to state-of-the-art approaches.

motion retargeting · interaction semantics · transformer-based refinement · differentiable soft projection · graph-based autoencoder

Brain alignment of reasoning and action representations from vision-language and action models during naturalistic gameplay

arXiv cs.AI · Subba Reddy Oota, Anant Khandelwal, Khushbu Pahwa, Satya Sai Srinath Namburi · 2026-05-19

The study investigates brain alignment between foundation models (vision-language models and large-action models) and fMRI recordings during naturalistic gameplay, revealing three key findings. First, both model families outperform RL baselines in voxel-wise encoding performance regardless of feature dimensionality. Second, prompt-driven improvements scale with cortical hierarchy, showing maximal gains in frontal-parietal regions (2× early visual cortex). Third, representational organization differs qualitatively: VLMs show prompt symmetry (12.5-13.6% unique variance) while LAMs exhibit action-prompt asymmetry (27% vs -5%), particularly in frontal-motor cortex. The results demonstrate action-specialized fine-tuning reorganizes multimodal representations toward action-relevant neural computations.

brain alignment · vision-language models · large-action models · fmri encoding · cortical hierarchy

PAVE: A Cognitive Architecture for Legitimate Violation in Generative Agent Societies

arXiv cs.AI · Ahmad Yehia, Abduallah Mohamed, Kun Qian, Tianyi Wang · 2026-05-19

The paper introduces PAVE, a cognitive architecture enabling generative agents to reason about legitimate rule violations in cooperative settings. PAVE's four modules (Perception, Assessment, Verdict, Emulation) process contextual cues, score legitimacy, decide on violations, and scope actions. Evaluated in the Voville environment with four LLM backbones, PAVE agents demonstrate legitimate violation, authority deference, bounded scope, and recovery properties, outperforming vanilla agents in structured decision-making and plausibility ratings. Ablation studies confirm the legitimacy gate's necessity for these properties.

cognitive architecture · legitimate violation · generative agents · llm backbones · voville environment

IMLJD: A Computational Dataset for Indian Matrimonial Litigation Analysis

arXiv cs.AI · Joy Bose · 2026-05-19

The paper introduces IMLJD, a novel computational dataset of 3,613 Indian matrimonial litigation judgments from the Supreme Court (2000-2024) and Karnataka High Court (2018-2024), annotated with structured outcome labels, metadata indicators, and a knowledge graph. The dataset covers cases under IPC Section 498A, the Protection of Women from Domestic Violence Act, and CrPC Section 482. Analysis reveals a 57.6% quashing petition success rate at the Supreme Court versus 39.7% at the Karnataka High Court, with a 19.6 percentage point differential persisting in matched temporal analysis (2018-2024).

computational jurisprudence · legal knowledge graph · structured outcome labeling · matrimonial litigation · court judgment analysis

HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

arXiv cs.AI · Emmy Liu, Varun Gangal, Michael Yu, Zhuofu Tao · 2026-05-19

The paper introduces HalluWorld, a controlled benchmark for studying hallucinations in language models through reference-world formulations. It constructs synthetic and semi-synthetic environments (gridworlds, chess, terminal tasks) to automatically label hallucinations while varying complexity, observability, and temporal dynamics. Evaluation reveals frontier models largely avoid perceptual hallucinations but struggle with multi-step state tracking, causal simulation, and abstention decisions. Results indicate hallucinations stem from multiple distinct failure modes rather than a unified capability gap.

hallucination · benchmark · reference-world · state tracking · abstention

STAR-PólyaMath: Multi-Agent Reasoning under Persistent Meta-Strategic Supervision

arXiv cs.AI · Jiaao Wu, Xian Zhang, Hanzhang Liu, Sophia Zhang · 2026-05-19

STAR-PólyaMath introduces a multi-agent framework for mathematical reasoning that addresses reliability issues like hallucination accumulation and memory fragmentation through meta-level supervision and structured Reasoner-Verifier interaction. The system employs a Python orchestrator to separate control from inference, with a persistent Meta-Strategist providing high-level guidance. It achieves state-of-the-art results on eight benchmarks, including perfect scores on AIME 2025-2026, Putnam 2025, and HMMT February 2026, outperforming GPT-5.5 by 13.54% on MathArena Apex 2025. Ablation studies confirm the framework's orchestration drives performance gains.

multi-agent · meta-strategist · reasoner-verifier · orchestration · ablation

Agentic Trading: When LLM Agents Meet Financial Markets

arXiv cs.AI · Yihan Xia, Panpan You, Taotao Wang, Fang Liu · 2026-05-19

The paper investigates LLM-based trading agents through a systematic review of 77 studies, focusing on their decision pipelines and market adaptability. It identifies protocol incomparability as a key issue, with only 2 out of 19 primary studies reporting time-consistent split protocols and none achieving R3 reproducibility. The study proposes an Architecture-Capability-Adaptation framework and emphasizes the need for standardized evaluation protocols and reproducible artifacts. Findings highlight rapid architectural experimentation but persistent bottlenecks in execution semantics and reproducibility.

llm-based trading agents · protocol incomparability · architecture-capability-adaptation · reproducibility audit · execution semantics

MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

arXiv cs.AI · Md Mehrab Tanjim, Jayakumar Subramanian, Xiang Chen, Branislav Kveton · 2026-05-19

The paper introduces MOCHA, a Multi-Objective Chebyshev Annealing method for optimizing LLM agent skills under platform constraints. Unlike existing approaches that use weighted sums, MOCHA employs Chebyshev scalarization to cover the full Pareto front, including non-convex regions, combined with exponential annealing for exploration-exploitation balance. Evaluated across six agent skills, MOCHA achieves a 7.5% mean correctness improvement over baselines (up to 14.9% on FEVER and 10.4% on TheoremQA), discovering twice as many Pareto-optimal variants while baseline methods fail on 4/6 tasks.

multi-objective optimization · chebyshev scalarization · pareto front · llm agents · skill optimization
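To illustrate why Chebyshev scalarization can reach non-convex parts of a Pareto front that weighted sums miss, here is a minimal sketch. It is not MOCHA itself: the three candidate points, the zero ideal point, and the weight sweep are assumptions made for illustration, both objectives are minimized, and the annealing schedule is omitted.

```python
def weighted_sum(point, weights):
    # Linear scalarization: weighted average of the objectives.
    return sum(w * f for w, f in zip(weights, point))

def chebyshev(point, weights, ideal=(0.0, 0.0)):
    # Chebyshev scalarization: worst weighted distance to an ideal point.
    return max(w * abs(f - z) for w, f, z in zip(weights, point, ideal))

# Hypothetical candidates (objective_1, objective_2), both to be minimized.
# C is Pareto-optimal but sits in a non-convex dent of the front.
candidates = {"A": (0.0, 1.0), "B": (1.0, 0.0), "C": (0.6, 0.6)}

def winners(scalarize, steps=21):
    # Sweep the weight on objective 1 from 0 to 1 and record each minimizer.
    chosen = set()
    for i in range(steps):
        w = i / (steps - 1)
        weights = (w, 1.0 - w)
        best = min(candidates, key=lambda name: scalarize(candidates[name], weights))
        chosen.add(best)
    return chosen

weighted_sum_winners = winners(weighted_sum)
chebyshev_winners = winners(chebyshev)
print(sorted(weighted_sum_winners), sorted(chebyshev_winners))
```

In this toy setup no weight vector makes the weighted sum select C, while the Chebyshev form selects it for balanced weights.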

RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding

arXiv cs.AI · Hanqing Liu, Mingjie Liu, Luoping Cui, Endian Lin · 2026-05-19

RE-VLM introduces a dual-stream vision-language model combining RGB and event camera data for robust scene understanding under adverse conditions. The model employs parallel encoders and progressive training to align heterogeneous visual features with language, addressing modality gaps. A graph-driven pipeline synthesizes captions and QA pairs from synchronized RGB-event streams, overcoming data scarcity. Evaluated on PEOD-Chat and RGBE-Chat datasets, RE-VLM outperforms RGB-only and event-only baselines in captioning and VQA tasks, particularly in challenging illumination scenarios. Results demonstrate significant improvements in cross-modal alignment and real-world applicability.

vision-language model · event cameras · dual-stream architecture · scene graphs · progressive training

Exploring and Developing a Pre-Model Safeguard with Draft Models

arXiv cs.AI · Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi · 2026-05-19

The paper introduces a pre-model safeguard that leverages jailbreak attack transferability between large and small language models (LLMs/SLMs) to improve prompt safety auditing. By systematically studying transferability factors, the authors observe that SLM draft responses predict LLM safety implications. Their design uses speculative inference with SLMs to generate draft responses, then applies existing guards to both prompt and drafts, reducing false-negative rates while maintaining computational efficiency compared to post-model guards. Experiments demonstrate improved safety prediction with lower token usage and processing time.

jailbreak transferability · speculative inference · pre-model guard · false-negative rate · draft models

Inference-Time Scaling in Diffusion Models through Iterative Partial Refinement

arXiv cs.AI · Taegu Kang, Jaesik Yoon, Sungjin Ahn · 2026-05-19

The paper proposes Iterative Partial Refinement (IPR), an inference-time scaling method for sequential diffusion models that operates without external verifiers. IPR re-noises and regenerates subsets of regions in an already-generated sample, conditioned on the remaining regions, which gives the model richer context for revising earlier decisions and improves global consistency. On MNIST Sudoku, IPR increases the valid solution rate from 55.8% to 75.0%, demonstrating its efficacy in tasks requiring global constraint satisfaction. The method is tailored for sequential, mixed-noise conditioning settings.

inference-time scaling · diffusion models · iterative partial refinement · mixed-noise conditioning · global constraint satisfaction

ContextFlow: Hierarchical Task-State Alignment for Long-Horizon Embodied Agents

arXiv cs.AI · Shuhan Guo, Kun Zhang, Haifei Liu, Xingyu Gao · 2026-05-19

ContextFlow introduces a hierarchical task-state alignment framework for long-horizon embodied agents, addressing task-state misalignment failures in planner-executor coordination. The method represents task stages as explicit contracts, converts runtime observations into evidence packets, and applies scoped updates (continue, refine, transfer, promote, repair) to maintain alignment. By keeping specialist executors responsible for local control while making task-frontier alignment auditable, the framework mitigates unsupported handoffs, stage lock, and executor-context mismatches. Experiments on long-horizon embodied tasks demonstrate improved diagnosis and mitigation of task-state failures through evidence-grounded updates.

task-state alignment · embodied agents · evidence packets · scoped updates · long-horizon planning

DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies

arXiv cs.AI · Yixiang Zhu, Yonghao Chen, Rui Meng, Jingyu Guo · 2026-05-19

DEFLECT introduces an offline post-training refinement for Vision-Language-Action (VLA) policies to address prediction-execution misalignment in asynchronous inference. The method constructs counterfactual action pairs from a frozen reference policy and scores them using an implicit flow-matching likelihood-ratio surrogate, requiring no human labels or online rollouts. Results show +6.4 success-rate gain in high-latency regimes (5-7 control steps), +4.6 when transferred to real-scale VLA, and consistent improvements on two real-robot tasks.

vision-language-action · asynchronous inference · flow-matching · counterfactual tuning · delay-robust

Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection

arXiv cs.AI · Bing Wang, Rui Miao, Ximing Li, Chen Shen · 2026-05-19

The paper introduces LONSREX, a data synthesis pipeline for fine-tuning LLMs to generate necessary and sufficient rationales in explainable misinformation detection (MD). The method addresses limitations of naive filtering by proposing a metric to quantify each verification step's contribution, evaluating rationale quality. Experiments show that traditional approaches using binary labels yield insufficient rationales, while stronger LLMs produce overly verbose ones. LONSREX effectively balances these issues by optimizing rationale necessity and sufficiency.

misinformation detection · large language models · rationale generation · fine-tuning · veracity prediction

EviTrack: Selection over Sampling for Delayed Disambiguation

arXiv cs.AI · Omer Haq · 2026-05-19

The paper introduces EviTrack, a test-time inference framework for sequential prediction in delayed disambiguation regimes, where early observations remain ambiguous. EviTrack maintains competing trajectory hypotheses and applies evidence- and likelihood-ratio-based selection to delay commitment until sufficient evidence accumulates, inspired by multiple hypothesis tracking. Evaluated on a synthetic benchmark with known latent ground truth, EviTrack outperforms sampling-based baselines at matched inference budget, achieving faster post-disambiguation recovery. Results demonstrate trajectory-level selection's superiority over increased sampling coverage in such regimes.

sequential prediction · delayed disambiguation · multiple hypothesis tracking · trajectory hypotheses · evidence-ratio
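The idea of delaying commitment until a likelihood ratio is decisive can be sketched as follows. This is an illustrative toy, not EviTrack's algorithm: the unit-variance Gaussian observation model, the two hypotheses, and the threshold of 3.0 are all assumptions.

```python
import math

def gaussian_loglik(x, mean, sigma=1.0):
    # Log-density of x under a Gaussian with the given mean and std.
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mean) ** 2 / (2 * sigma**2)

def track(observations, hypotheses, threshold=3.0):
    # Keep a running log-likelihood per hypothesis and commit only when the
    # best one leads the runner-up by at least `threshold` (a log-ratio).
    scores = {name: 0.0 for name in hypotheses}
    for t, x in enumerate(observations):
        for name, mean in hypotheses.items():
            scores[name] += gaussian_loglik(x, mean)
        ranked = sorted(scores, key=scores.get, reverse=True)
        margin = scores[ranked[0]] - scores[ranked[1]]
        if margin >= threshold:
            return ranked[0], t
    return None, len(observations) - 1  # never disambiguated

# Three ambiguous observations (equidistant from both means), then clear ones.
observations = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
decision, step = track(observations, {"low": 0.0, "high": 2.0})
print(decision, step)
```

Both hypotheses stay alive through the ambiguous prefix; the commitment happens only once the later observations separate them.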

FormalASR: End-to-End Spoken Chinese to Formal Text

arXiv cs.AI · Wanyi Ning, Yinshang Guo, Haitao Qian, Jiyuan Cheng · 2026-05-19

The paper introduces FormalASR, two compact end-to-end models (0.6B and 1.7B parameters) for direct spoken Chinese to formal text transcription, eliminating the need for post-processing LLMs. The method involves constructing WenetSpeech-Formal and Speechio-Formal datasets via LLM-based rewriting and quality filtering, followed by fine-tuning Qwen3-ASR models. Results show a 37.4% relative CER reduction over verbatim baselines, alongside improved ROUGE-L and BERTScore, demonstrating efficient on-device deployment.

formalasr · end-to-end · spoken-to-formal · qwen3-asr · verbatim transcription

Swimming with Whales: Analysis of Power Imbalances in Stake-Weighted Governance

arXiv cs.AI · Yuzhe Zhang, Manvir Schneider, Qin Wang, Davide Grossi · 2026-05-19

The paper analyzes power imbalances in stake-weighted governance systems, particularly in Proof-of-Stake blockchains, using the Penrose-Banzhaf power index. Methodologically, it combines analytical proofs showing that perfect power-stake alignment is unattainable but approximable under specific conditions, with empirical analysis of real-world data from Project Catalyst. Results reveal significant power distortions favoring large stakeholders, providing quantitative insights into governance centralization risks in current implementations.

proof-of-stake · governance · penrose-banzhaf index · power imbalance · blockchain
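For readers unfamiliar with the index, here is a minimal sketch of the normalized Banzhaf index for a weighted voting game, computed by brute-force enumeration. The stake vector and quota are made-up illustration values, not Project Catalyst data, and the enumeration is only practical for a handful of voters.

```python
from itertools import combinations

def banzhaf(weights, quota):
    # Count, for each voter, the losing coalitions that the voter would
    # turn into winning ones by joining (the voter's "swings").
    n = len(weights)
    swings = [0] * n
    for size in range(n + 1):
        for coalition in combinations(range(n), size):
            total = sum(weights[i] for i in coalition)
            if total >= quota:
                continue  # already winning, nobody can swing it
            for i in range(n):
                if i not in coalition and total + weights[i] >= quota:
                    swings[i] += 1
    total_swings = sum(swings)
    return [s / total_swings for s in swings]

# Hypothetical stakes of 50%, 30%, 20% with a simple-majority quota.
power = banzhaf([50, 30, 20], quota=51)
print(power)
```

In this toy game the largest holder has 50% of the stake but 60% of the normalized voting power, a small instance of the power-stake misalignment the paper studies.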

When Web Apps Heal Themselves: A MAPE-K Based Approach to Fault Tolerance and Adaptive Recovery

arXiv cs.AI · Sales Aribe, Rov Japheth Oracion · 2026-05-19

This study introduces a MAPE-K-based self-healing framework for web applications, integrating an AutoFix-inspired mechanism for adaptive fault recovery. The system was evaluated through fault injection experiments across 20 scenarios, achieving 90.7% F1-score in fault detection and 93.2% recovery success. AutoFix reduced time-to-recovery by 56.2% (avg. 3.92s), maintaining 88-95% throughput with only 3.1% response time increase. Feedback mechanisms improved recovery efficiency by 18.6%, demonstrating practical fault tolerance via feedback-driven adaptation.

mape-k · autofix · fault tolerance · time-to-recovery · throughput
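As background, one pass of a generic MAPE-K loop (Monitor, Analyze, Plan, Execute over shared Knowledge) might look like the sketch below. It is not the paper's framework: the latency metric, the 500 ms limit, the `restart_worker` action, and the simulated recovery effect are all hypothetical.

```python
# Shared knowledge base: thresholds, known recoveries, and a feedback log.
knowledge = {
    "latency_limit_ms": 500,
    "recoveries": {"high_latency": "restart_worker"},
    "history": [],
}

def monitor(service):
    # Collect the metrics the loop reasons about.
    return {"latency_ms": service["latency_ms"]}

def analyze(metrics, knowledge):
    # Map metrics to a fault label, or None when the service looks healthy.
    if metrics["latency_ms"] > knowledge["latency_limit_ms"]:
        return "high_latency"
    return None

def plan(fault, knowledge):
    # Look up a recovery action for the diagnosed fault.
    return knowledge["recoveries"].get(fault)

def execute(action, service):
    if action == "restart_worker":
        service["latency_ms"] = 120  # simulated effect of the recovery

def mape_k_step(service, knowledge):
    fault = analyze(monitor(service), knowledge)
    if fault is None:
        return None
    action = plan(fault, knowledge)
    execute(action, service)
    knowledge["history"].append((fault, action))  # feedback for later passes
    return action

service = {"latency_ms": 900}
first_action = mape_k_step(service, knowledge)
second_action = mape_k_step(service, knowledge)
print(first_action, second_action, service)
```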

AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

arXiv cs.AI · Yuankai Li, Tinghui Zhu, Ha Min Son, Zhe Zhao · 2026-05-19

AQuaUI introduces a training-free token reduction method for GUI-agent models by leveraging non-uniform spatial information density in screenshots. It employs adaptive quadtrees to merge redundant tokens while preserving critical visual elements, maintaining spatial consistency via position encoding. The method enhances temporal consistency across interactions through conditional quadtree refinement using prior states. Evaluated on GUI-Owl-1.5-32B-Instruct, AQuaUI achieves 13.22% speedup and 29.52% token reduction with 99.06% performance retention, demonstrating efficient exploitation of GUI spatial redundancy.

quadtree · token reduction · gui agents · spatial redundancy · multimodal models
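The core quadtree idea can be sketched on a toy patch grid: uniform blocks collapse into one leaf, while blocks containing detail are split further. This is a simplified illustration rather than AQuaUI's method; the integer grid, the max-minus-min uniformity test, and the power-of-two grid size are assumptions.

```python
def quadtree_leaves(grid, x, y, size, tol=0):
    # Return leaves as (x, y, size). A block becomes a single leaf when its
    # values are uniform within `tol`; otherwise it is split into quadrants.
    block = [grid[r][c] for r in range(y, y + size) for c in range(x, x + size)]
    if size == 1 or max(block) - min(block) <= tol:
        return [(x, y, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree_leaves(grid, x + dx, y + dy, half, tol)
    return leaves

# A 4x4 "screenshot" of patch values: blank except one detailed corner.
grid = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
leaves = quadtree_leaves(grid, 0, 0, 4)
print(len(leaves), "leaves instead of 16 patches")
```

Each leaf keeps its origin and size, which is the kind of information a position encoding would need to stay spatially consistent after merging.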

ExECG: An Explainable AI Framework for ECG models

arXiv cs.AI · Jong-Hwan Jang, Yong-yeon Jo · 2026-05-19

ExECG introduces a standardized Python framework for explainable AI in ECG analysis, addressing variability in current pipelines. The three-stage architecture includes: (1) Wrapper for ECG format standardization, (2) Explainer unifying XAI methods (e.g., saliency maps, attention), and (3) Visualizer for cross-method comparison. Case studies demonstrate interoperability and reproducibility across heterogeneous ECG models, facilitating clinical deployment through improved interpretability of arrhythmia classification and abnormality detection outputs.

explainable ai · ecg analysis · arrhythmia classification · saliency maps · interpretability

Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination

arXiv cs.AI · Jinrui Jiang, Zhangtai Wu, Zhen Wu, Xinyu Dai · 2026-05-19

The study investigates modality-conflict hallucination in multimodal LLMs, where textual premises override contradictory visual evidence. Through causal analysis of attention heads in five MLLMs, it identifies opposing groups: broadly distributed hallucination-driving heads and concentrated hallucination-resisting heads. The imbalance between these groups biases generation toward erroneous premises. The proposed MACI intervention selectively suppresses driving heads during conflicts, achieving the largest hallucination reduction among the compared methods on the MMMC benchmark, with favorable accuracy trade-offs and zero-shot transfer to SCI-SemanticConflict.

modality-conflict hallucination · attention head imbalance · path patching · causal intervention · mmmc benchmark

Euclidean Embedding of Data Using Local Distances

arXiv cs.AI · Dimitris Arabadjis · 2026-05-19

The paper presents a method for Euclidean embedding of data using only local distance graphs, without requiring prior vector representations. The approach formulates a variational problem to match local on-graph distances to Euclidean metrics, deriving Euler-Lagrange equations in coordinate-free form. These non-linear equations are solved via iterative sparse linear updates. Key contributions include: (a) continuum-level functional equations for optimal embedding, (b) representation-free formulation relying solely on neighborhood distance graphs, and (c) local graph-based estimation. Experiments on synthetic and real datasets demonstrate preservation of local metric structure and global isometric approximation.

euclidean embedding · local distance graph · variational problem · euler-lagrange equations · isometric approximation

PhyWorld: Physics-Faithful World Model for Video Generation

arXiv cs.AI · Pu Zhao, Juyi Lin, Timothy Rupprecht, Arash Akbari · 2026-05-19

PhyWorld introduces a physics-faithful world model for video generation, employing two-stage post-training to enhance physical plausibility. First, flow matching fine-tuning improves video-to-video continuation stability and motion coherence. Second, Direct Preference Optimization (DPO) aligns generated dynamics with physical principles. Evaluated on VBench and a dedicated physical-faithfulness benchmark, PhyWorld achieves 0.769 in video consistency (vs. 0.756 for baselines) and 3.09 in physical plausibility (vs. 2.99 for baselines), demonstrating its efficacy for Physical AI simulations.

phyworld · video generation · physical plausibility · direct preference optimization · flow matching

AI Technologies in Language Access: Attitudes Towards AI and the Human Value of Language Access Managers

arXiv cs.AI · Miguel A. Jiménez-Crespo, Stephanie Rodriguez, Alejandro Jaume Losa · 2026-05-19

This study examines language access managers' attitudes toward AI technologies in translation services, focusing on sectors with legal and ethical constraints like healthcare and government. Through qualitative thematic analysis of 10 semi-structured interviews with US-based professionals, the research reveals conditional optimism about AI adoption, coupled with strong risk awareness and insistence on human oversight in AI-mediated language access. Findings highlight tensions between efficiency mandates and the perceived irreplaceability of human judgment in high-stakes multilingual contexts.

language access · qualitative analysis · human oversight · translation technology · ethical ai

Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses

arXiv cs.AI · Yan Wang, Ziyi Guo, Christopher McCarty · 2026-05-19

The study evaluates LLMs' potential to address survey research challenges through a five-stage framework tested on hurricane preparedness data (n=946). It introduces an Anchored Marginal Theory-Informed LLM (A-TLM) that integrates Protection Motivation Theory (PMT) via knowledge graphs, outperforming classical imputation methods (RMSE 1.439 vs. 1.496) with minimal bias (-0.121). Structured retrieval around PMT causal relationships reduces MAE by 9.5% compared to standard RAG, while subgroup analysis reveals masked bias patterns. The framework demonstrates controlled hallucination through knowledge-grounded refusal in chatbot implementations.

large language models · missing data imputation · protection motivation theory · knowledge graph · retrieval-augmented generation

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

arXiv cs.AI · Xiaoou Liu, Tiejin Chen, Dengjia Zhang, Yaqing Wang · 2026-05-19

The paper introduces Stepwise Confidence Attribution (SCA), a framework for diagnosing reasoning failures in black-box LLMs by assigning step-level confidence scores to generated traces. SCA employs Information Bottleneck principles through two methods: NIBS (non-parametric consistency measurement) and GIBS (graph-based subgraph learning). Experiments on mathematical reasoning and multi-hop QA tasks demonstrate SCA's effectiveness in identifying erroneous steps, with step-level confidence feedback improving self-correction success rates by up to 13.5% over answer-level baselines.

stepwise confidence attribution · information bottleneck · multi-step reasoning · black-box llms · self-correction

Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models

arXiv cs.AI · Tobias Braun, Jonas Henry Grebe, Hossein Shakibania, Anna Rohrbach · 2026-05-19

The paper introduces Token by Token Backdoor Attack (ToBAC), the first backdoor attack targeting unified autoregressive models (UAMs) that generate both text and image tokens. It explores data-based and model-based poisoning strategies, demonstrating how innocuous triggers (e.g., common words) can propagate malicious effects across modalities, manipulating visual outputs and accompanying text. Experiments show ToBAC achieves a 55% success rate in modality-aligned brand promotion via model access and 63.1% success against JanusPro through data poisoning.

unified autoregressive models · backdoor attack · multimodal generation · data poisoning · model poisoning

Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

arXiv cs.AI · Tiejin Chen, Longchao Da, Xiaoou Liu, Hua Wei · 2026-05-19

The paper contends that current uncertainty quantification (UQ) methods for LLMs constitute a category error, functioning as unsupervised clustering algorithms that measure internal consistency rather than external correctness. Through critical analysis, the authors identify three pathologies: hyperparameter sensitivity, conflation of stability with truth, and reliance on unstable proxy metrics due to absent ground truth. They propose a paradigm shift toward UQ methods anchored in objective verification, advocating for improved evaluation metrics, native uncertainty mechanisms, and reality-grounded confidence measures to address confident hallucinations.

uncertainty quantification · large language models · confident hallucinations · unsupervised clustering · proxy metrics

SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

arXiv cs.AI · Han Li, Vibhor Malik, Zahra Zanjani Foumani, Alberto Castelo · 2026-05-19

SimGym introduces a framework for simulating e-commerce A/B tests using vision-language model (VLM) agents, addressing the limitations of traditional testing (traffic diversion, slow cycles). The method combines traffic-grounded persona generation from clickstream data, live-browser agents with multimodal perception and episodic memory, and an evaluation protocol comparing simulated vs. real outcomes. Validation on UI theme changes shows 77% directional alignment with real add-to-cart shifts, reducing experimental cycles from weeks to under an hour.

a/b testing · vision-language model · clickstream data · multimodal perception · episodic memory

Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

arXiv cs.AI · Beomseok Kang, Dongwon Jo, Jiwon Song, Donghwee Son · 2026-05-19

The paper introduces RotateK, a rotation-based structured Key channel pruning framework for efficient vision-language model inference. The method employs online PCA-based rotation to align token-dependent channel importance into a shared low-dimensional subspace, enabling accurate pruning under lightweight head-wise masks, and uses a fused Triton attention kernel for efficient decoding. Experiments on two VLM backbones demonstrate that RotateK outperforms prior Key channel pruning methods in accuracy and decoding latency, with joint token-channel pruning improving over token-only baselines at matched KV cache budgets.

key channel pruning · vision-language models · kv cache · online pca · triton attention kernel

Not all uncertainty is alike: volatility, stochasticity, and exploration

arXiv cs.AI · Payam Piray · 2026-05-19

The paper demonstrates that different sources of environmental uncertainty (volatility and stochasticity) have opposing effects on optimal exploration strategies in decision-making. By extending the Gittins index framework to Gaussian state-space bandits with latent dynamics, the authors derive Cause-Aware Uncertainty-Sensitive Exploration (CAUSE), a closed-form exploration bonus that accounts for these asymmetries. CAUSE outperforms standard exploration methods in environments with heterogeneous noise and improves upon Gittins-per-arm policies in restless bandit settings, while revealing that pathological noise inference can lead to reversed exploration patterns relevant to psychiatric modeling.

gittins index · state-space bandits · exploration-exploitation · volatility · stochasticity

Quantized Machine Learning Models for Medical Imaging in Low-Resource Healthcare Settings

arXiv cs.AI · Sumanth Meenan Kanneti, Aryan Shah · 2026-05-19

The paper presents a multi-strategy compression framework for deploying deep learning models in low-resource medical imaging settings, focusing on brain tumor classification from MRI. The approach combines quantization-aware training, knowledge distillation (DenseNet-101 to DenseNet-32), and Float16 post-training quantization on MobileNetV2. The quantized MobileNetV2 achieves 82.37% validation accuracy (vs. 82.20% full-precision) with a 6.14x size reduction (35.34 MB to 5.76 MB), maintaining uniform diagnostic performance across glioma, meningioma, pituitary tumors, and healthy controls. Results demonstrate clinical viability for resource-constrained environments.

quantization-aware training · knowledge distillation · float16 quantization · mobilenetv2 · medical imaging

Aerial Inspection Behaviors via RL-based Quadrotor Control for Under-canopy Forest Environments

arXiv cs.AI · Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, Viswa Narayanan Sankaranarayanan · 2026-05-19

The paper presents a deep Reinforcement Learning (RL)-based low-level controller for quadrotor navigation in under-canopy forest environments, enabling inspection view-pose tracking (position and yaw reference tracking) for target inspection behaviors. The method combines an end-to-end RL policy (mapping states to RPMs) with a higher navigation layer comprising a Traveling Salesman Problem (TSP) planner for optimal visitation sequencing and a Rapidly-exploring Random Tree Star (RRT*) planner for collision-free path generation. Results demonstrate effective deployment in five target inspection scenarios, showing RL-based motor-level control can serve as a reliable low-level execution module when supported by navigation guidance.

reinforcement learning · quadrotor control · view-pose tracking · tsp planner · rrt* planner

On-Device Continual Learning with Dual-Stage Buffer and Dynamic Loss for Point-of-Care Pneumonia Diagnosis

arXiv cs.AI · Danu Kim · 2026-05-19

PneumoNet introduces a domain-incremental learning method for point-of-care pneumonia diagnosis, addressing performance decline under domain shifts. The method combines a lightweight CNN, a dual-stage balanced buffer for class-balanced replay, and dynamic class-weighted loss to correct training imbalances. Evaluated on PneumoniaMNIST with five domain-shift scenarios, PneumoNet achieves 86.6% accuracy with 1.4% forgetting, outperforming baselines in size and speed.

domain-incremental learning · lightweight cnn · dual-stage buffer · dynamic loss · pneumoniamnist

Hallucination as Exploit: Evidence-Carrying Multimodal Agents

arXiv cs.AI · Guijia Zhang, Hao Zheng, Harry Yang · 2026-05-18

The paper introduces evidence-carrying multimodal agents (ECA) to mitigate hallucination-to-action conversion, where false visual claims trigger unauthorized privileged actions. ECA decomposes tool calls into action-critical predicates, verifies them via constrained DOM/OCR/AX certificates, and uses a deterministic gate to authorize only supported actions. Evaluations show ECA reduces gate bypass from 15% to 1.3% through targeted hardening, achieves 0% unsafe-action rate on 200-task and 120-task pipelines, and outperforms naive agents (100% unsafe execution) and prompt-only defenses (49.6%). Oracle-certificate replay on 7,488 GPT-5.4 traces validates gate correctness.

multimodal agents · hallucination-to-action · evidence-carrying · action-critical predicates · deterministic gate
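A deterministic gate of the kind described can be sketched in a few lines: an action is authorized only when every action-critical predicate is backed by a certificate. The predicate names and the dictionary-based certificate format below are hypothetical, and the sketch does not reproduce the paper's DOM/OCR/AX verification.

```python
def gate(required_predicates, certificates):
    # A predicate counts as supported only if a certificate explicitly
    # verified it; anything absent or unverified blocks the action.
    missing = [p for p in required_predicates if certificates.get(p) is not True]
    return len(missing) == 0, missing

# Hypothetical predicates for a payment tool call, for illustration only.
required = ["button_label_is_pay", "amount_matches_invoice"]
certificates = {"button_label_is_pay": True}  # e.g. backed by a DOM lookup

authorized, missing = gate(required, certificates)
print(authorized, missing)
```

Because the gate is a plain function of its inputs, a model's unsupported claim about the screen cannot authorize the action on its own.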

Going PLACES: Participatory Localized Red Teaming for Text-to-Image Safety in the Global South

arXiv cs.AI · Charvi Rastogi, Mukul Bhutani, Minsuk Kahng, Shamsuddeen Hassan Muhammad · 2026-05-18

The study introduces PLACES, a dataset of 26,000 text-to-image (T2I) model failures collected through localized red teaming in Ghana, Nigeria, and India (Karnataka, Punjab). The method emphasizes participatory localization, involving community workshops in secondary urban centers to capture culturally specific adversarial patterns. Results reveal unique harms (e.g., religious norm violations, ominous symbolism) and structural gaps in Western-centric safety frameworks, demonstrating the need for culturally contextualized T2I safety evaluation.

text-to-image models · red teaming · safety frameworks · cultural pluralism · adversarial patterns

Discoverable Agent Knowledge -- A Formal Framework for Agentic KG Affordances (Extended Version)

arXiv cs.AI · Terry R. Payne, Valentina Tamma, Enrico Daga · 2026-05-18

The paper proposes Agentic Affordance Profiles (AAPs), a formal framework extending Semantic Web Service concepts to Knowledge Graphs (KGs) by addressing epistemic gaps in current metadata standards (VoID, DCAT). AAPs model four dimensions: agent-provable knowledge, closure assumptions, vocabulary grounding, and entailment regime alignment, enabling principled KG selection and failure diagnosis during agent planning. The work identifies a five-point research agenda for scalable affordance matching, bridging ontological commitments between agents and heterogeneous KGs.

agentic affordance profile · knowledge graph · semantic web services · ontological commitment · entailment regime

Planner-Admissible Graph-PDE Value Extensions for Sparse Goal-Conditioned Planning

arXiv cs.AI · Shiheng Zhang · 2026-05-18

The paper establishes planner-admissibility conditions for graph-PDE value extensions in sparse goal-conditioned planning, proving a local action-gap certificate: greedy rollouts succeed if surrogate value errors remain below half the true action gap. The analysis compares Absolutely Minimal Lipschitz Extension (AMLE) and harmonic extension, showing AMLE's superiority through a comparison-principle fill-distance bound and its compatibility with local geometry. Experimental results on 120 AntMaze configurations demonstrate AMLE's 0.970 success rate versus harmonic's 0.584, with high-p methods (p=8, p=16) achieving 0.973-0.982 success by correcting harmonic's action-ranking inversions.

goal-conditioned planning · graph-pde · absolutely minimal lipschitz extension · action-gap certificate · harmonic extension
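The half-gap condition can be checked on a toy example. This sketch assumes higher values are better and uses made-up action values. It illustrates only the general argument: if every surrogate value is within half the true action gap, the greedy choice cannot change. It is not the paper's graph-PDE construction.

```python
def action_gap(values):
    # Difference between the best and second-best true action values.
    ordered = sorted(values.values(), reverse=True)
    return ordered[0] - ordered[1]

def certificate_holds(true_values, surrogate_values):
    # Sufficient condition: worst surrogate error is below half the gap.
    max_err = max(abs(surrogate_values[a] - true_values[a]) for a in true_values)
    return max_err < action_gap(true_values) / 2

def greedy(values):
    return max(values, key=values.get)

# Hypothetical values at one state; the true gap is 0.4.
true_values = {"left": 1.0, "right": 0.6, "stay": 0.2}
surrogate_ok = {"left": 0.85, "right": 0.75, "stay": 0.2}   # errors of 0.15
surrogate_bad = {"left": 0.75, "right": 0.85, "stay": 0.2}  # errors of 0.25

print(certificate_holds(true_values, surrogate_ok), greedy(surrogate_ok))
print(certificate_holds(true_values, surrogate_bad), greedy(surrogate_bad))
```

The second surrogate violates the bound and flips the ranking of the top two actions, the kind of action-ranking inversion the summary mentions.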

Bridge: Retrieval-Augmented Spatiotemporal Modeling for Urban Delivery Demand

arXiv cs.AI · Yihong Tang, Tong Nie, Junlin He, Qianjun Huang · 2026-05-18

The paper introduces Bridge, a retrieval-augmented spatiotemporal framework for urban delivery demand forecasting in cold-start regions. Bridge combines an inductive graph backbone with a time-aware memory of region-time windows, retrieving future demand patterns based on regional context and recent dynamics, then refining forecasts via gated fusion. The retriever is trained with a future-aware objective to align retrieval with forecasting utility. Evaluations on four delivery datasets demonstrate Bridge's superiority over baselines in within-city cold-start and cross-city transfer scenarios, highlighting retrieval augmentation's value when parametric generalization falls short.

retrieval-augmented · spatiotemporal · cold-start · inductive graph · gated fusion

How Far Are We From True Auto-Research?

arXiv cs.AI · Zhengxin Zhang, Ning Wang, Sainyam Galhotra, Claire Cardie · 2026-05-18

The paper introduces ResearchArena, a framework for evaluating auto-research systems by generating and assessing 117 agent-written papers across 13 CS domains using Claude Code, Codex, and Kimi Code. Manuscript-only review (SAR) shows Claude Code outperforming Analemma's FARS and matching human ICLR 2025 submissions, but artifact-aware peer review reveals severe experimental rigor issues (fabrication, underpowering, execution mismatch) with agent-dependent failure rates (5-77%). No agent-generated paper meets top-tier venue standards, indicating significant gaps in true auto-research capability.

auto-research · researcharena · artifact-aware review · experimental rigor · agent-generated papers

Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use

arXiv cs.AI · Changkun Ou · 2026-05-18

The paper formalizes trust calibration for agentic tool use as a preference-learning problem, where a policy gateway models human risk tolerance via Gaussian-process classification with probit likelihood on binary approve/deny feedback. The method leverages Preferential Bayesian Optimization's inference machinery and uncertainty-targeted querying to classify actions into allow/block/ask regions, differing from standard optimization objectives. Theoretical connections to sample-efficient preference learning are established while addressing the distinct challenge of autonomous action approval.

trust calibration · agentic tool use · gaussian-process classification · preferential bayesian optimization · probit likelihood
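
To make the allow/block/ask split concrete, here is a minimal sketch of how a gateway might turn a Gaussian latent posterior into a decision under a probit likelihood. It assumes the posterior mean and variance for an action are already available from some GP classifier; the thresholds `allow_at` and `block_at` and the function names are hypothetical choices for illustration, not values from the paper.

```python
import math


def approve_probability(latent_mean, latent_var):
    """Predictive P(approve) for a probit likelihood with latent f ~ N(mean, var).

    Uses the standard identity E[Phi(f)] = Phi(mean / sqrt(1 + var)).
    """
    z = latent_mean / math.sqrt(1.0 + latent_var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def gate(latent_mean, latent_var, allow_at=0.9, block_at=0.1):
    """Map the predictive probability to allow / block / ask (thresholds assumed)."""
    p = approve_probability(latent_mean, latent_var)
    if p >= allow_at:
        return "allow"
    if p <= block_at:
        return "block"
    return "ask"  # uncertain region: query the human


print(gate(3.0, 0.1))    # confident approval
print(gate(-3.0, 0.1))   # confident denial
print(gate(3.0, 25.0))   # same mean, high uncertainty
```

Note how a large latent variance pulls the predictive probability toward 0.5, so an action with a favorable mean but little supporting feedback still lands in the ask region.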

Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models

arXiv cs.AI · Aleksandar Terzić, Francesco Carzaniga, Nicolas Menet, Yannick Biehl · 2026-05-18

Flash PD-SSM introduces a memory-optimized structured sparse state-space model (SSM) that balances expressivity and efficiency through trainable structured sparse matrices, with discrete selection at each time-step. This approach matches unstructured matrix expressivity in finite-state automaton modeling while maintaining computational efficiency. Evaluations on synthetic tasks and long sequences (17k+ tokens) show state-of-the-art accuracy among SSMs, with improved throughput and memory usage in language modeling and state-tracking applications compared to existing SSMs.

state-space models · structured sparsity · finite-state automaton · memory optimization · time-series modeling

Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks

arXiv cs.AI · John T. Halloran, Noopur S. Bhatt · 2026-05-18

The paper proposes open-book benign rewriting (OBBR) as a defense against LLM data poisoning attacks, demonstrating theoretically and empirically that projecting poisoned samples to benign prompt space via rewriting with benign references improves safety. The method outperforms closed-book rewriting by 25.7% and state-of-the-art defenses by 51% across five backdoor attacks on four LLMs, while maintaining computational efficiency and preserving natural language task performance. Results show OBBR's effectiveness against both trigger-based and non-trigger-based poisoning.

data poisoning · backdoor attacks · llm rewriting · open-book defense · benign projection

GRASP: Deterministic argument ranking in interaction graphs

arXiv cs.AI · Diganta Misra, Antonio Orvieto, Rediet Abebe, Volkan Cevher · 2026-05-18

The paper introduces GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework for argument ranking in interaction graphs that addresses instability in holistic LLM-as-a-Judge evaluations. GRASP aggregates local interaction judgments via a convergent attack-defense propagation operator and is more consistent than holistic rankings, with reduced inter-model disagreement. GRASP scores are reproducible but uncorrelated with human 'convincingness' labels; they instead measure structural sufficiency, that is, argument robustness within explicit interaction graphs. The method provides a transparent alternative to opaque LLM judging practices.

argument ranking · interaction graphs · llm-as-a-judge · structural sufficiency · propagation operator

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

arXiv cs.AI · Jiayu Li, Enpei Zhang, Dawei Zhou, Elynn Chen · 2026-05-18

The paper introduces IC-$Q$, a provably convergent decentralized $Q$-learning algorithm for workflow learning under interface constraints, formalized as an interface-constrained semi-Markov decision process (IC-SMDP). The method extends the approximate information state (AIS) framework to multi-agent SMDPs and controls Markovian noise under random duration, achieving coordination via scalar handoffs. Theoretical analysis provides a finite-sample bound decomposing into neural approximation error, interface gap, and mixing-time residual, validated empirically on synthetic tasks, multi-LLM reasoning, routing, and CPU programming, matching centralized performance without joint trajectory access.

ic-smdp · decentralized q-learning · approximate information state · workflow learning · multi-agent coordination

COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones

arXiv cs.AI · Ayush Agarwal, Ansh Gandhi, Jeremy A. Collins, Omar Rayyan · 2026-05-18

COBALT introduces a cloud-based teleoperation platform for scalable robot learning via crowdsourced demonstrations. The system leverages vectorized environments and load-balanced infrastructure to support concurrent teleoperation by multiple users on a single GPU, achieving sub-100 ms latency for up to 8 users. It accommodates diverse input devices (smartphones, VR headsets, etc.) and demonstrates scalability (256 simulated clients across 8 GPUs). A user study confirms smartphone-based teleoperation matches or exceeds specialized hardware performance. The platform includes real-time metrics for data quality filtering and a training curriculum, enabling collection of 7500+ demonstrations across nine countries. The resulting dataset validates state-of-the-art imitation learning algorithms.

teleoperation · imitation learning · vectorized environments · load-balanced infrastructure · crowdsourcing

Knowing When Not to Predict: Self Supervised Learning and Abstention for Safer DR Screening

arXiv cs.AI · Muskaan Chopra, Lorenz Sparrenberg, Jan H. Terheyden, Rafet Sifa · 2026-05-18

The study investigates how self-supervised learning (SSL) pretraining duration affects confidence calibration and abstention in diabetic retinopathy screening models. Using multiple SSL checkpoints with fixed fine-tuning, the authors evaluate calibrated confidence, coverage, selective accuracy, and selective macro-F1. Results show SSL improves selective prediction over random initialization, but prolonged pretraining does not consistently enhance reliability despite accuracy saturation. The work highlights pretraining length as a critical design choice for safety-aware models, not just a computational detail.

self-supervised learning · confidence calibration · selective prediction · diabetic retinopathy · abstention
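
Coverage and selective accuracy, two of the metrics named in this item, can be sketched as follows. This is a minimal illustration that assumes abstention is a simple confidence threshold; the function name and the toy data are hypothetical, and the paper's calibration procedure is not reproduced here.

```python
def selective_metrics(confidences, predictions, labels, threshold):
    """Return (coverage, selective accuracy) when abstaining below `threshold`.

    Coverage is the fraction of samples the model chooses to predict on;
    selective accuracy is the accuracy on those retained samples only.
    """
    kept = [(p, y) for c, p, y in zip(confidences, predictions, labels)
            if c >= threshold]
    coverage = len(kept) / len(labels)
    if not kept:
        return coverage, None  # the model abstained on everything
    accuracy = sum(p == y for p, y in kept) / len(kept)
    return coverage, accuracy


# Toy data (assumed): four images, DR grades as integer classes.
confidences = [0.95, 0.60, 0.85, 0.40]
predictions = [1, 0, 2, 1]
labels = [1, 1, 2, 0]

print(selective_metrics(confidences, predictions, labels, threshold=0.8))  # (0.5, 1.0)
print(selective_metrics(confidences, predictions, labels, threshold=0.0))  # (1.0, 0.5)
```

Raising the threshold trades coverage for accuracy on the retained cases, which is why the two numbers are reported together.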

EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

arXiv cs.AI · Dongyan Lin, Phillip Rust, Angel Villar Corrales, Alvin W. M. Tan · 2026-05-18

The paper introduces EgoBabyVLM, a benchmark for evaluating vision-language models (VLMs) on naturalistic egocentric video data resembling infant learning conditions. It proposes Machine-DevBench, an automatically generated evaluation suite that eliminates train/eval mismatches by sampling lexical and grammatical items across logarithmic frequency bins from the model's vocabulary. Experiments reveal current VLMs fail to leverage weakly-aligned visuo-linguistic signals prevalent in egocentric streams, despite human proficiency in such conditions. The work establishes the EgoBabyVLM Challenge to advance models capable of learning from infant-like naturalistic input.

vision-language models · egocentric video · cross-modal learning · developmental benchmarks · semantic alignment

POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

arXiv cs.AI · Qiaoyuan Zheng, Yiqu Yang, Qi Gao, Imanol Schlag · 2026-05-18

POLAR-Bench introduces a diagnostic benchmark for evaluating privacy-utility trade-offs in LLM agents, featuring adversarial probing of protected attributes under user-defined policies. The method employs a trusted model with privacy policies conversing with adversarial third-party models across 10 domains (7,852 samples), scoring privacy/utility via set-membership and varying policy dimensions/attack strategies. Results show frontier models withhold >99% of protected attributes, while 1-30B open-weight models leak >50%, revealing critical gaps in privacy alignment for on-device deployments.

privacy-utility trade-off · llm agents · adversarial probing · policy-aware benchmarking · intent-following

GOAL: Graph-based Objective-Aligned Diffusion Solvers for Dynamic Multi-Objective Optimization

arXiv cs.AI · Xingyu Li · 2026-05-18

The paper introduces GOAL, a graph-based diffusion solver for dynamic multi-objective optimization that enables controllable decision generation by conditioning on human-specified objectives. The method employs a heterogeneous graph encoding with distinct edge types for different constraint classes, guiding message passing in a graph neural network. Evaluated on Flow Shop Problem (FSP), Job Shop Scheduling Problem (JSP), and Flexible Job Shop Scheduling Problem (FJSP), GOAL achieves 100% feasibility and <0.20% MAPE for problems up to 20 jobs/60 operations, outperforming NSGA-II and MOEA/D by 25x in speed and quality.

graph neural network · multi-objective optimization · diffusion solver · heterogeneous graph · scheduling benchmarks

FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models

arXiv cs.AI · Youngsun Lim, Cusuh Ham, Pin-Yu Chen, Deepti Ghadiyaram · 2026-05-18

FAGER introduces a factually grounded framework for evaluating and refining text-to-image models by assessing implicit and explicit factual correctness. The method constructs a structured rubric via LLM-based fact proposal and reference-guided visual verification, then converts it into VLM-evaluated QA pairs. Experiments on five datasets (science, history, products, culture, knowledge-intensive concepts) show FAGER outperforms prior metrics in Factual A/B tests and enables training-free output refinement with significant factuality improvements.

text-to-image evaluation · factual correctness · visual verification · llm-based fact proposal · vlm-based evaluation

Neural Operators for Design-Space Surrogate Modeling of Tendon-Actuated Continuum Robots

arXiv cs.AI · Branden Frieden, James M. Ferguson, Alan Kuntz, Varun Shankar · 2026-05-18

The authors propose neural operator architectures for surrogate modeling of tendon-actuated continuum robots, enabling generalization across robot designs. They formulate the problem as operator learning, mapping design parameters and tendon inputs to configurations, and develop four novel architectures: two DeepONet-based and two FNO-based variants. Trained on simulation data, all models achieve good accuracy while maintaining fast inference, demonstrating effective generalization for control, planning, and design optimization in surgical and industrial applications.

neural operators · continuum robots · surrogate modeling · deeponet · fourier neural operators

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

arXiv cs.AI · Yuxuan Gao, Megan Wang, Yi Ling Yu, Zijian Carl Ma · 2026-05-18

The authors introduce DecisionBench, a benchmark for evaluating emergent delegation in long-horizon agentic workflows, featuring a fixed task suite (GAIA, tau-bench, BFCL multi-turn), peer-model pool (11 models across 7 vendors), and multi-axis metrics (quality, cost, latency, etc.). The substrate supports various evaluation methods, including learned routers and adaptive profile construction. Key findings include: (i) end-task quality is statistically similar across awareness conditions (|β|=0.21), masking orchestration signals; (ii) routing fidelity-at-1 varies 7.5-29.5% across conditions, with delivery channel being the dominant factor; (iii) counterfactual analysis reveals 15-31 percentage points of unrealized performance headroom. The release includes the substrate, annotations, and 220 per-condition run archives.

agentic workflows · delegation interface · routing fidelity · counterfactual ceiling · orchestration signal

ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

arXiv cs.AI · Aaron Defazio · 2026-05-18

The paper introduces ScheduleFree+, a learning-rate-free and schedule-free method for training large language models, addressing scalability challenges in prior work. By incorporating necessary modifications for larger batch and model sizes, the method outperforms Warmup-Stable-Decay schedules by 31% at 1000 tokens per parameter. Results demonstrate its efficacy in long-duration training, establishing a theoretical basis for model averaging and checkpoint merging during pretraining.

schedule-free learning · large language models · learning-rate-free · model averaging · checkpoint merging

Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

arXiv cs.AI · Zhiyuan Jerry Lin, Benjamin Letham, Samuel Dooley, Maximilian Balandat · 2026-05-18

The paper introduces ReElicit, a Bayesian optimization framework for tuning system prompts using aggregate feedback. The method employs embedding by elicitation, where an LLM constructs a compact feature space from task descriptions and prompt-score histories, enabling Gaussian process-based optimization with adaptive feature representations. Evaluated on ten prompt optimization tasks with a 30-evaluation budget, ReElicit outperforms baseline methods in aggregate performance, demonstrating LLMs' capability as adaptive semantic representation builders for natural-language optimization.

bayesian optimization · system prompts · gaussian process · embedding by elicitation · llm

Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels

arXiv cs.AI · Alexander Boesgaard Lorup · 2026-05-18

The paper introduces a counterfactual likelihood test to quantify indirect influence between private reasoning channels in modular AI systems. The method substitutes upstream private blocks with length-matched donor blocks while fixing public tokens and downstream targets, then measures negative-log-likelihood shifts. Validation on a 7B role-channel model shows textual probes (n-gram overlap, canary tests) fail to reliably detect leakage, whereas the counterfactual approach distinguishes unmasked/masked conditions and isolates public-channel pathways. Results demonstrate persistent A-to-B influence through public-speech hidden states (verified across 13,734 directional contrasts) and zero reverse influence, with graph-separation controls confirming the public channel as the sole signal carrier.

counterfactual likelihood · private reasoning channels · negative-log-likelihood shift · role-visibility mask · graph-separation control

MANGO: Meta-Adaptive Network Gradient Optimization for Online Continual Learning

arXiv cs.AI · Ankita Awasthi, Marco Apolinario, Kaushik Roy · 2026-05-18

The paper proposes MANGO, a meta-adaptive gradient optimization framework for online continual learning (OCL) that balances stability-plasticity via gradient-gating and meta-learned regularization. Gradient-gating scales parameter updates based on sensitivity, while meta-learned regularization adapts stability coefficients using replay data as both training signal and forgetting evaluator. Evaluated on CLEAR-10, CIFAR-100, and Tiny-ImageNet, MANGO achieves state-of-the-art accuracy and positive backward transfer, outperforming baselines across varying replay buffer sizes.

online continual learning · gradient-gating · meta-learned regularization · catastrophic forgetting · backward transfer

ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking

arXiv cs.AI · Yanjun Lin, Zimo Xiao, Kartik Natarajan, Mahesh Sankaranarayanan · 2026-05-18

ReacTOD introduces a neuro-symbolic architecture for zero-shot dialogue state tracking, combining bounded ReAct loops with deterministic validation to reduce hallucinations and format errors. The method employs iterative self-correction, symbolic validation, and compact prompt management via incremental state prediction. Results show 9.3% accuracy improvement over single-pass inference on MultiWOZ, 93.1% self-correction rate, and state-of-the-art zero-shot performance: 52.71% joint goal accuracy (gpt-oss-20B) and 47.34% (Qwen3-8B). On SGD, Claude-Opus-4.6 achieves 80.68% JGA, demonstrating cross-benchmark generalization.

reactod · neuro-symbolic · zero-shot · multiwoz · sgd

CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering

arXiv cs.AI · Mahesh Bhosale, Abdul Wasi, Vishvesh Trivedi, Pengyu Yan · 2026-05-18

The paper introduces CRAFT, a query-conditioned pipeline for multimodal video question answering that combines dynamic keyframe selection, multilingual ASR, and a hybrid critic loop for iterative claim verification. The system integrates UNLI temporal entailment, DeBERTa-v3 cross-claim screening, and a Llama-3.2-3B adjudicator, with a citation-merging stage for source attribution. On MAGMaR 2026, CRAFT achieves 0.739 average score, 0.810 reference recall, and 0.635 citation F1, outperforming baselines. Ablations confirm the importance of atomic claims, ASR, and the critic loop. The method also generalizes to WikiVideo (0.823 Avg), demonstrating cross-dataset applicability.

multimodal video qa · keyframe selection · temporal entailment · claim verification · citation merging

Learning Long-Term Temporal Dependencies in Photovoltaic Power Output Prediction Through Multi-Horizon Forecasting

arXiv cs.AI · Sumit Laha, Ankit Sharma, Hassan Foroosh · 2026-05-18

The paper proposes a multi-horizon forecasting framework for photovoltaic power output prediction, demonstrating architecture-independent accuracy improvements by jointly optimizing over sequential future values. The method integrates sequential sky imagery with historical PV data, enabling deep neural networks to better capture long-term temporal dependencies through gradient and filter diversity preservation. Evaluations across diverse architectures show significant accuracy gains across all forecast horizons with minimal computational overhead, offering a scalable solution for grid stability.

multi-horizon forecasting · photovoltaic prediction · temporal dependencies · gradient diversity · filter diversity

Riemannian Networks over Full-Rank Correlation Matrices

arXiv cs.AI · Ziheng Chen, Xiaojun Wu, Bernhard Schölkopf, Nicu Sebe · 2026-05-18

The paper introduces Riemannian networks operating on the manifold of full-rank correlation matrices, an underexplored alternative to SPD matrices. The authors leverage five distinct correlation geometries to systematically extend basic layers (MLR, FC, convolutional) to these manifolds, while providing accurate backpropagation methods for two geometries. Experimental comparisons with SPD and Grassmannian networks demonstrate the approach's effectiveness.

riemannian networks · correlation matrices · spd manifold · backpropagation · grassmannian

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

arXiv cs.AI · Sajjad Abdoli, Ghassan Al-Sumaidaee, Clayton W. Taylor, Ahmad · 2026-05-18

This benchmark evaluates five commercial ASR systems on code-switching speech across Arabic-English, Persian-English, and German-English language pairs. A two-stage pipeline (heuristic filtering + LLM ensemble) selects 300 samples per pair, reducing scoring costs by 91%. Systems are assessed via WER and BERTScore, with ElevenLabs Scribe v2 achieving the lowest WER (13.2% overall) and the highest BERTScore (0.936). Difficulty-stratified analysis reveals performance gaps, while BERT embeddings confirm semantic proximity despite script differences. The dataset is available on Hugging Face.

asr · code-switching · bertscore · wer · transliteration
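
For readers unfamiliar with the WER metric used in this benchmark, the sketch below computes the standard word error rate as word-level edit distance divided by reference length. It assumes whitespace tokenization and no text normalization, which the benchmark may handle differently; the example sentences are made up for illustration.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up German-English code-switched example: one word is dropped.
print(wer("ich habe das meeting verschoben", "ich habe meeting verschoben"))  # 0.2
```

Because WER counts surface mismatches, a transcript that renders a switched word in a different script can be penalized even when the meaning is preserved, which is one reason a semantic metric such as BERTScore is reported alongside it.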

Toward an AI-Powered Computational Testbed for Workforce Policy

arXiv cs.AI · Sumer S. Vaid, Ashley V. Whillans · 2026-05-18

The article proposes dynamic employee agents as an AI-powered computational testbed for workforce policy, combining LLM-powered generative agents with management science to simulate employee responses to organizational changes. The method integrates HR records, psychometric data, and digital activity to model cognitive, emotional, and behavioral trajectories. The authors outline the required computational architecture and emphasize safeguards for privacy, accuracy, and representativeness, positioning this as a critical tool for managing AI-driven workforce transitions.

generative agents · psychometric measures · computational architecture · workforce simulation · organizational behavior

Multi-axis Analysis of Image Manipulation Localization

arXiv cs.LG · Keanu Nichols, Divya Appapogu, Giscard Biamby, Dina Bashkirova · 2026-05-19

The authors introduce AUDITS, a benchmark for image manipulation detection comprising 530K images from user and news photos, supporting analysis across domain shifts, quality, type, and size. The dataset includes diffusion-based inpaintings with diverse manipulation types and sizes. Experiments evaluate robustness of existing methods under domain shifts, aiming to advance research in reliable, generalizable detection. Results highlight the need for improved methods to address varied manipulation scenarios.

image manipulation detection · domain shift · diffusion-based inpainting · benchmark dataset · robustness evaluation

Interpretable Computer Vision for Defect Detection in X-ray Tomography of Aerospace SiC/SiC Composites

arXiv cs.LG · Antonio Peña Corredor, Julien Lesseur, Romain Nunez, Paul Rivalland · 2026-05-19

The paper introduces p-ResNet-50, an interpretable convolutional framework for defect detection in X-ray tomography of aerospace SiC/SiC composites, combining high accuracy with case-based explanations. The method extends ResNet-50 with a prototype layer aligned to expert-defined defect categories, using novel anchor-based and medoid-based regularization to prevent prototype collapse. Evaluated on 12,000 XCT patches, it achieves comparable performance to black-box ResNet-50 (ROC-AUC 0.994 vs. 0.993) while providing traceable decisions via representative evidence patches and uncertainty mapping through UMAP visualization.

interpretable computer vision · prototype networks · x-ray tomography · defect detection · regularization

SAGE: Scalable Automatic Gating Ensemble for Confident Negative Harvesting in Fraud Detection

arXiv cs.LG · Sudheer Tubati, Amit Goyal · 2026-05-19

The paper introduces SAGE, a scalable automatic gating ensemble for confident negative harvesting in fraud detection, specifically targeting music streaming fraud. The method combines SimHash-based stratified sampling with a modular gating ensemble (using Mahalanobis distance and k-NN density) to address representation bias in Positive-Unlabeled learning. It ensures coverage of rare behavioral cohorts through floor-constrained sampling. Evaluation shows strong precision and recall on held-out data, with generalization across customer-level and artist-level fraud detection without methodological modifications.

fraud detection · positive-unlabeled learning · simhash · mahalanobis distance · k-nn density
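
The combination of SimHash bucketing and floor-constrained sampling mentioned in this item can be sketched roughly as below. This is an illustration under stated assumptions, not the paper's pipeline: buckets are formed from the sign pattern of dot products with given hyperplanes (the usual random-hyperplane SimHash idea), and each bucket contributes at least `floor` members. All names and parameters are hypothetical.

```python
import random
from collections import defaultdict


def simhash_bucket(vector, hyperplanes):
    """Sign pattern of the vector against each hyperplane, used as a bucket key."""
    return tuple(int(sum(v * h for v, h in zip(vector, plane)) >= 0)
                 for plane in hyperplanes)


def floor_stratified_sample(vectors, hyperplanes, rate, floor, seed=0):
    """Sample about `rate` of each bucket, but never fewer than `floor` items
    (or the whole bucket if it is smaller), so rare cohorts stay represented."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for index, vector in enumerate(vectors):
        buckets[simhash_bucket(vector, hyperplanes)].append(index)
    chosen = []
    for members in buckets.values():
        k = min(len(members), max(floor, round(rate * len(members))))
        chosen.extend(rng.sample(members, k))
    return sorted(chosen)


# Toy data (assumed): ten common-pattern vectors and one rare one.
vectors = [(1.0 + 0.1 * i, 2.0) for i in range(10)] + [(-1.0, -3.0)]
hyperplanes = [(1.0, 0.0), (0.0, 1.0)]  # fixed here; normally drawn at random
sample = floor_stratified_sample(vectors, hyperplanes, rate=0.2, floor=1)
print(sample)
```

With plain proportional sampling at a 20% rate the single rare vector would usually be dropped; the floor guarantees that it appears in the sample.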

When Does Model Collapse Occur in Structured Interactive Learning?

arXiv cs.LG · Yuchen Wu, Kangjie Zhou, Weijie Su · 2026-05-19

The paper characterizes model collapse in structured interactive learning environments where multiple generative models train on each other's synthetic outputs. It formalizes model interactions via directed graphs, proving that collapse depends critically on interaction topology, and derives necessary/sufficient conditions for collapse occurrence. Theoretical results include finite-sample guarantees for linear regression and asymptotic guarantees for general M-estimators, validated through numerical experiments. The work extends prior single-model collapse analyses to multi-agent settings with arbitrary interaction patterns.

model collapse · interactive learning · m-estimators · synthetic data · interaction graphs

Goal-Oriented Lower-Tail Calibration of Gaussian Processes for Bayesian Optimization

arXiv cs.LG · Aurélien Pion, Emmanuel Vazquez · 2026-05-19

The paper introduces goal-oriented calibration for Gaussian process (GP) predictive distributions in Bayesian optimization (BO), specifically targeting lower-tail miscalibration for minimization tasks. It proposes a framework with two spatial calibration metrics—occurrence calibration and thresholded μ-calibration—and develops tcGP, a post-hoc method to calibrate GP predictions below a threshold t. Theoretical analysis shows the resulting expected improvement (EI) algorithm maintains denseness in the design space. Empirical evaluations on benchmarks demonstrate improved lower-tail calibration and BO performance compared to standard and globally calibrated GP models.

bayesian optimization · gaussian process · lower-tail calibration · expected improvement · spatial calibration

TrajTok: Adaptive Spatial Tokenization for Trajectory Representation Learning

arXiv cs.LG · Zhen Xiong, Shang-Ling Hsu, Cyrus Shahabi · 2026-05-19

TrajTok introduces an adaptive spatial tokenization method for learning transferable trajectory representations from raw GPS traces. The approach combines multi-resolution hexagonal cell partitioning with a factorized transformer encoder featuring modality-specific self-attention, cross-attention fusion, and spatiotemporal rotary position embeddings (ST-RoPE). Pretrained via masked-token modeling to recover geometric and kinematic patterns, TrajTok achieves state-of-the-art performance on Porto dataset benchmarks including similarity search (85.3% accuracy), classification (91.2% F1), and travel-time regression (12.4 min MAE), demonstrating generalizability across geometry- and kinematics-dominated tasks.

trajectory representation · spatial tokenization · factorized transformer · rotary position embeddings · masked-token modeling

FiLark: a streaming-first software framework for end-to-end exploration, annotation, and algorithm integration in distributed acoustic sensing

arXiv cs.LG · Jintao Li, Weichang Li, Kai Tong, Xaingyu Guo · 2026-05-19

FiLark introduces a streaming-first Python framework for distributed acoustic sensing (DAS) that unifies data access, processing, and visualization under a continuous-stream abstraction. The system features an OpenGL-based ring-buffer renderer for interactive browsing of long recordings with constant memory, an integrated annotation interface for event labeling in streams, and a signal processing library with CPU/GPU implementations via PyTorch. By maintaining stateful chunked execution and standardized monitor interfaces, FiLark enables seamless transition from interactive exploration to production pipelines without modification.

distributed acoustic sensing · streaming-first · ring-buffer renderer · stateful chunked execution · gpu-accelerated

Optimizing Computational-Statistical Runtime for Wasserstein Distance Estimation

arXiv cs.LG · Peter Matthew Jacobs, Jeff M. Phillips · 2026-05-19

The paper introduces a Sample-Sketch-Solve paradigm to optimize the computational-statistical runtime for estimating Wasserstein distance between distributions. By employing a regular Cartesian grid sketch of samples, the method compresses data without increasing asymptotic error, particularly effective under α-Hölder smooth distributions. The approach achieves ε-additive error in expectation, with runtime scaling as ε^(-max(2,(d+1+o(1))/(1+α))) for d=2,3, demonstrating near-optimal performance when α→1 in d=3.

wasserstein distance · computational-statistical runtime · sample-sketch-solve · α-hölder smooth · cartesian grid sketch
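
The sketch step of Sample-Sketch-Solve can be illustrated as snapping samples to a regular grid and keeping one weighted point per occupied cell. The code below is a minimal sketch assuming each sample is represented by its cell center; the cell width, function name, and toy points are hypothetical, and the final solve step (running an optimal transport solver on the two compressed measures) is omitted.

```python
from collections import Counter


def grid_sketch(points, cell):
    """Compress samples to {cell center: probability mass} on a regular grid."""
    counts = Counter(tuple(int(coord // cell) for coord in point)
                     for point in points)
    n = len(points)
    return {tuple((index + 0.5) * cell for index in cell_index): count / n
            for cell_index, count in counts.items()}


# Toy 2-D samples (assumed) with cell width 0.5.
points = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.9), (0.6, 0.1)]
sketch = grid_sketch(points, cell=0.5)
print(sketch)
```

The sketch has at most one atom per occupied cell, so its size depends on the grid resolution rather than on the number of samples, which is where the runtime savings would come from.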

Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing

arXiv cs.LG · Valentina Njaradi, Clémentine Dominé, Rachel Swanson, Marco Mondelli · 2026-05-19

The paper provides an analytical framework for understanding how pretrained representation dimensionality affects downstream generalization in high-dimensional linear models. Using principal component analysis for pretraining and linear regression for downstream tasks, the authors derive exact expressions for training and generalization error as functions of representation size, sample sizes, and task alignment. Key findings show maximal compression is optimal with abundant pretraining data but scarce labels, while higher-dimensional representations generalize better with limited pretraining. The work also quantifies a precise trade-off between unlabeled and labeled data requirements, with empirical validation in autoencoders and pretrained LLMs.

representation learning · linear probing · high-dimensional analysis · pretraining · generalization error

Towards Distillation Guarantees under Algorithmic Alignment for Combinatorial Optimization

arXiv cs.LG · Thien Le, Melanie Weber · 2026-05-19

The paper establishes sufficient conditions for successful knowledge distillation in combinatorial optimization tasks when the target architecture is algorithmically aligned with the problem structure. Focusing on graph neural networks (GNNs) aligned with dynamic programming (DP) algorithms, the analysis assumes the source model satisfies the linear representation hypothesis (LRH) and shows distillation efficiency depends on the decision tree complexity of DP transition functions. Theoretical results demonstrate that distillation succeeds when the target GNN's architecture matches the DP algorithm's structure, extending prior work on decision-tree distillation to structured prediction settings.

knowledge distillation · algorithmic alignment · graph neural networks · dynamic programming · combinatorial optimization

Smooth Partial Lotteries for Stable Randomized Selection

arXiv cs.LG · Alexander Goldberg, Giulia Fanti, Nihar B. Shah · 2026-05-19

The paper introduces smoothness as a design principle for partial lotteries in competitive selection processes, addressing instability in existing lottery designs where small score changes cause large probability shifts. It proposes the Clipped Linear Lottery, a mechanism where selection probabilities scale linearly between upper and lower thresholds, proving its worst-case regret matches a lower bound for smooth selection rules up to a factor of $(1 - k/n)$. Experiments on peer review data from ICLR 2025, NeurIPS 2024, and the Swiss National Science Foundation demonstrate the instability of existing designs and the superior smoothness-utility tradeoff of the proposed method.

partial lotteries · smooth selection · clipped linear lottery · lipschitz condition · regret bound
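
The core idea of the Clipped Linear Lottery, as summarized above, is a selection probability that rises linearly between two thresholds. The sketch below assumes the thresholds are given; how the paper chooses them (for example, so that the expected number of selections matches a quota) is not stated in this summary and is not modeled. The function name and example scores are hypothetical.

```python
def clipped_linear_probabilities(scores, lower, upper):
    """Selection probability: 0 below `lower`, 1 above `upper`, linear in between."""
    if upper <= lower:
        raise ValueError("upper threshold must exceed lower threshold")
    width = upper - lower
    return [min(1.0, max(0.0, (score - lower) / width)) for score in scores]


# Made-up review scores with thresholds 4 and 8.
scores = [2.0, 5.0, 8.0, 6.5]
print(clipped_linear_probabilities(scores, lower=4.0, upper=8.0))
# [0.0, 0.25, 1.0, 0.625]
```

Under this rule a score change of δ moves the selection probability by at most δ / (upper − lower), which is the kind of Lipschitz-style smoothness that hard cutoffs lack.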

Tail Annealing for Heavy-Tailed Flow Matching

arXiv cs.LG · Jean Pachebat · 2026-05-19

The paper introduces Log-FM, a method for improving flow matching on heavy-tailed data via coordinate-wise soft-log transforms. The approach applies $\phi(x) = \mathrm{sign}(x) \cdot \log(1 + |x|)$ to the data before training and inverts it by exponentiation after generation, with a Hill diagnostic to selectively transform only heavy-tailed dimensions. Theoretical analysis shows the log-transform converts Pareto tails to exponentials, enabling tail annealing through power transformations. Evaluated on a 144-configuration benchmark (3 copulas, dimensions up to 100, 4 tail indices), Log-FM outperforms specialized baselines in $W_1$, CVaR$_{99}$, and extreme-quantile metrics, achieving zero severe divergences across 2,880 runs.

flow matching · heavy-tailed data · tail annealing · hill diagnostic · soft-log transform

Active Context Selection Improves Simple Regret in Contextual Bandits

arXiv cs.LG · Mohammad Shahverdikondori, Jalal Etesami, Negar Kiyavash · 2026-05-19

The paper introduces an active context selection method for contextual bandits, improving simple regret bounds compared to passive sampling. By allowing the learner to choose which contexts to sample, the authors derive tight regret rates: passive sampling achieves $\sqrt{(n/T)\,\lVert p \rVert_{1/2}}$, while active sampling with allocation $q_j \propto p_j^{2/3}$ achieves $\sqrt{n/T}\,\lVert p \rVert_{2/3}$, yielding improvements up to $\Theta(k^{1/4})$. They extend the analysis to budgeted active sampling and propose the Explore-Explore-Then-Commit (EETC) algorithm for unknown context distributions, matching known-$p$ rates asymptotically. Experiments validate the theoretical results.
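
The two rates can be compared numerically with a short Python sketch. The quasi-norm $\lVert p \rVert_r = (\sum_j p_j^r)^{1/r}$ is standard; the `weighted_regret` model, in which context $j$ receives $q_j T$ samples and contributes $p_j \sqrt{n/(q_j T)}$, is an assumption made here for illustration and is not taken from the paper. All function names are hypothetical.

```python
import math


def quasi_norm(p, r):
    """(sum_j p_j^r)^(1/r); a quasi-norm for 0 < r < 1."""
    return sum(pj ** r for pj in p) ** (1.0 / r)


def passive_rate(p, n, T):
    """sqrt((n / T) * ||p||_{1/2}), the passive-sampling rate quoted above."""
    return math.sqrt(n / T * quasi_norm(p, 0.5))


def active_rate(p, n, T):
    """sqrt(n / T) * ||p||_{2/3}, the active-sampling rate quoted above."""
    return math.sqrt(n / T) * quasi_norm(p, 2.0 / 3.0)


def weighted_regret(p, q, n, T):
    """Assumed model: context j gets q_j * T samples and contributes
    p_j * sqrt(n / (q_j * T)) to the overall simple regret."""
    return sum(pj * math.sqrt(n / (qj * T)) for pj, qj in zip(p, q))


def active_allocation(p):
    """Allocation q_j proportional to p_j^(2/3), normalized to sum to 1."""
    weights = [pj ** (2.0 / 3.0) for pj in p]
    total = sum(weights)
    return [w / total for w in weights]


n, T = 10, 1000
skewed = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25] * 4

for name, p in [("skewed", skewed), ("uniform", uniform)]:
    print(name, passive_rate(p, n, T), active_rate(p, n, T))
```

Under the assumed model, sampling contexts at their natural frequencies (q = p) reproduces the passive rate and the allocation $q_j \propto p_j^{2/3}$ reproduces the active rate. The two coincide for a uniform context distribution and separate as the distribution becomes skewed.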

contextual bandits · simple regret · active sampling · regret bounds · eetc algorithm

D$^3$-Subsidy: Online and Sequential Driver Subsidy Decision-Making for Large-Scale Ride-Hailing Market

arXiv cs.LG · Taijie Chen, Rui Su, Siyuan Feng, Laoming Zhang · 2026-05-19

The paper introduces D$^3$-Subsidy, a hierarchical diffusion-based framework for dynamic driver-side subsidy optimization in ride-hailing platforms. The method employs a prefix-conditioned diffusion model to generate future trajectories from historical data, coupled with a context-conditioned inverse module for low-dimensional control signals. A Lagrangian-dual-derived mapping ensures subsidy-rate cap compliance without iterative optimization. Offline evaluations show improvements in completed rides (Rides) and gross merchandise value (GMV), while real-world A/B tests confirm operational feasibility with budget constraints.

diffusion model · lagrangian dual · online decision-making · parameter-efficient fine-tuning · ride-hailing

CAMERA: Adapting to Semantic Camouflage in Unsupervised Text-Attributed Graph Fraud Detection

arXiv cs.LG · Junjun Pan, Yixin Liu, Yu Zheng, Lianhua Chi · 2026-05-19

The paper introduces CAMERA, a Case-Adaptive Multi-cue Expert fRAmework for unsupervised text-attributed graph fraud detection (TAGFD) that addresses semantic camouflage by fraudsters. The method employs an ego-decoupled mixture-of-experts architecture, where each expert models distinct fraud-indicative cues, and a context-informed gating model adaptively integrates these cues based on ego node representations and local neighborhood context. CAMERA leverages fraudster rarity for unsupervised one-class learning with expert-level objectives that emphasize benign patterns. Evaluations on 4 datasets demonstrate CAMERA's superior performance against semantically camouflaged fraudsters compared to competitors.

text-attributed graph · semantic camouflage · mixture-of-experts · unsupervised learning · fraud detection

Take It or Leave It: Intent-Controlled Partial Optimal Transport

arXiv cs.LG · Salil Parth Tripathi, Bertrand Chapron, Fabrice Collard, Nicolas Courty · 2026-05-19

The paper introduces intent-controlled partial optimal transport (IC-POT), a generalization of partial optimal transport that replaces global rejection mechanisms with pointwise rejection costs based on side-specific reliability or external information. The method formulates the problem as a balanced Kantorovich OT on an augmented support and provides a dual interpretation via local acceptance thresholds. Experiments in positive-unlabeled learning, open-partial domain adaptation, and geophysical satellite data demonstrate improved performance when rejection rules encode statistical or physical priors.

partial optimal transport · pointwise rejection · kantorovich problem · side information · domain adaptation

Training-Free Bayesian Filtering with Generative Emulators

arXiv cs.LG · Thomas Savary, François Rozet, Gilles Louppe · 2026-05-19

The paper introduces a training-free Bayesian filtering method using generative emulators, specifically diffusion-based models, to address scalability issues in high-dimensional settings. By leveraging these emulators, the authors implement an optimal variant of particle filters that avoids the computational limitations of classical numerical solvers. Experimental results on nonlinear chaotic systems, including atmospheric dynamics, demonstrate the method's effectiveness in scaling particle filtering to high dimensions without additional training.

bayesian filtering · particle filters · diffusion-based emulators · nonlinear dynamics · high-dimensional scaling

Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates

arXiv cs.LG · Parjanya Prajakta Prashant, Jiongli Zhu, Aldan Creo, Babak Salimi · 2026-05-19

The paper introduces FINCH, a loss-adaptive learning-rate schedule that mitigates catastrophic forgetting during fine-tuning of large language models without modifying the objective function. The method dynamically adjusts learning rates based on batch loss, reducing rates for high-loss batches to limit forgetting while maintaining task performance. Evaluated on knowledge acquisition, science, and low-resource language benchmarks, FINCH reduces forgetting by 93% on average, preserves TruthfulQA and HaluEval performance on Qwen3-4B, and improves confidence calibration compared to standard fine-tuning.
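
The summary does not give FINCH's actual schedule, so the following Python sketch only illustrates the general idea of lowering the learning rate on high-loss batches. The rule, the function name `loss_adaptive_lr`, and the `reference_loss` parameter are all hypothetical and should not be read as the paper's method.

```python
def loss_adaptive_lr(base_lr, batch_loss, reference_loss):
    """Hypothetical rule: keep base_lr for batches at or below reference_loss,
    and shrink it in proportion to how far the batch loss exceeds it."""
    if reference_loss <= 0:
        raise ValueError("reference_loss must be positive")
    if batch_loss <= reference_loss:
        return base_lr
    return base_lr * reference_loss / batch_loss


base_lr = 1e-4
for loss in [0.5, 1.0, 2.0, 8.0]:
    # Higher-loss batches receive a smaller step size.
    print(loss, loss_adaptive_lr(base_lr, loss, reference_loss=1.0))
```

In a training loop, a rule like this would be evaluated once per batch and the result written into the optimizer's learning rate before the update step, leaving the loss function itself unchanged.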

catastrophic forgetting · fine-tuning · learning-rate schedule · loss-adaptive · confidence calibration

Minimalist Visual Inertial Odometry

arXiv cs.LG · Francesco Pasti, Jeremy Klotz, Nicola Bellotto, Shree K. Nayar · 2026-05-19

The paper presents a minimalist visual-inertial odometry (VIO) system for differential-drive robots using only four photodiodes with optical Gabor masks and an IMU. The method jointly optimizes mask parameters and a Temporal Convolutional Network (TCN) via simulation to decode speed from photodiode measurements, combining them with IMU angular velocity for planar trajectory estimation. Experimental validation on a prototype shows trajectories that closely track ground truth across diverse terrains without real-world fine-tuning, demonstrating resource-efficient odometry with minimal sensing.

visual-inertial odometry · gabor masks · temporal convolutional network · differential-drive robots · planar odometry

Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation

arXiv cs.LG · He-Yang Xu, Pengyuan Zhang, Zongyuan Ge, Xiaoshuai Hao · 2026-05-19

The paper introduces MetaFine, a diagnostic meta-evaluation framework for fine-grained manipulation tasks, addressing limitations of binary success metrics in current embodied AI benchmarks. MetaFine decomposes manipulation competency into three axes (understanding, perception, controlled behavior) via a compositional task graph that integrates heterogeneous benchmarks under a unified protocol. Evaluation of vision-language-action models reveals dimension-specific failures, with visual encoder spatial structure preservation identified as a critical bottleneck; targeted improvements yield 70% gains in precision without policy modifications. The framework supports hybrid real-sim validation for stable physical benchmarking.

fine-grained manipulation · meta-evaluation · vision-language-action · compositional task graph · spatial perception

Your Neighbors Know: Leveraging Local Neighborhoods for Backdoor Detection in Decentralized Learning

arXiv cs.LG · Sayan Biswas, Antoine Boutet, Davide Frey, Romaric Gaudel · 2026-05-19

The paper introduces Argus, a decentralized backdoor detection framework for collaborative learning that requires no central coordinator or trigger knowledge. Argus leverages local trigger analysis and neighborhood consensus via structural similarity metrics to distinguish true backdoors from false alarms caused by data heterogeneity, with theoretical convergence guarantees. Evaluated on three datasets against three baselines, Argus reduces attack success rates by up to 90 percentage points while maintaining model utility within 5 points of an oracle, with improved effectiveness under higher data heterogeneity.

decentralized learning · backdoor detection · structural similarity · data heterogeneity · convergence guarantees

Normative Networks for Source Separation via Local Plasticity and Dendritic Computation

arXiv cs.LG · Bariscan Bozkurt, Efe Ali Gorguner, Francesco Innocenti, Rafal Bogacz · 2026-05-19

The paper introduces Predictive Entropy Maximization, a biologically plausible neural network for blind source separation (BSS) that uses local plasticity and dendritic computation. The method approximates entropy maximization through an interpretable objective function, enabling error-driven feedforward synapses (implementable via dendritic mechanisms), Hebbian lateral inhibition, and output nonlinearities for domain constraints. Theoretical spectral bounds characterize approximation accuracy. Empirically, the approach outperforms biologically plausible baselines under correlated sources and noise, matching performance of exact determinant-based methods. Results demonstrate how local plasticity and adaptive inhibition emerge from regularized second-order entropy maximization.

blind source separation · local plasticity · dendritic computation · entropy maximization · hebbian learning

Learning Orthonormal Bases for Function Spaces

arXiv cs.LG · Hamidreza Kamkari, Mohammad Sina Nabizadeh, Justin Solomon · 2026-05-19

The paper introduces a neural network-based method for learning adaptive orthonormal bases in function spaces, departing from fixed bases like Fourier or wavelets. By parameterizing basis transformations as continuous paths on the Lie manifold of the orthogonal group, governed by ODEs with neural network-defined finite-rank skew-adjoint operators, the approach achieves universality: rank-2 generators suffice to approximate any target basis. Theoretical results prove density in the orthogonal group, while experiments demonstrate applications to functional PCA, eigenfunction computation, and energy-preserving dynamical systems.
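
To see why skew generators keep a basis orthonormal, consider a much simpler case than the paper's: a constant rank-2 skew-symmetric generator $A = u v^\top - v u^\top$ with orthonormal $u, v$. For this case the ODE $\dot{Q} = A Q$, $Q(0) = I$, has the closed-form solution $Q(t) = I + \sin(t)\,A + (1 - \cos t)\,A^2$, because $A^3 = -A$. The sketch below checks orthogonality numerically; the helper names are hypothetical, and the paper's generators are neural-network-defined rather than constant.

```python
import math


def outer(a, b):
    """Outer product a b^T as a list of rows."""
    return [[x * y for y in b] for x in a]


def matmul(A, B):
    """Plain matrix product for small dense matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def rank2_rotation(u, v, t):
    """exp(t * A) for A = u v^T - v u^T, assuming u and v are orthonormal."""
    d = len(u)
    uv, vu = outer(u, v), outer(v, u)
    A = [[uv[i][j] - vu[i][j] for j in range(d)] for i in range(d)]
    A2 = matmul(A, A)
    s, c = math.sin(t), math.cos(t)
    return [[(1.0 if i == j else 0.0) + s * A[i][j] + (1.0 - c) * A2[i][j]
             for j in range(d)] for i in range(d)]


u = [1.0, 0.0, 0.0]
v = [0.0, 1.0, 0.0]
Q = rank2_rotation(u, v, 0.7)
QtQ = matmul(transpose(Q), Q)  # should be the identity up to rounding
for row in QtQ:
    print(row)
```

Each rank-2 generator rotates the plane spanned by $u$ and $v$ and leaves the orthogonal complement fixed, so composing such rotations moves one orthonormal basis to another without ever leaving the orthogonal group.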

orthonormal basis · lie manifold · skew-adjoint operator · function space · neural ode

Exploiting Non-Negativity in DAG Structure Learning

arXiv cs.LG · Samuel Rey, Madeline Navarro, Gonzalo Mateos · 2026-05-19

The paper introduces a novel approach for learning directed acyclic graphs (DAGs) from linear structural equation models by exploiting non-negative edge weights. The method formulates a regularized non-negative DAG learning problem using an augmented-Lagrangian approach, demonstrating that non-negativity simplifies the acyclicity characterization and yields a benign optimization landscape. Theoretical analysis proves the true DAG is the unique global minimizer without spurious stationary points, while experiments show superior performance over state-of-the-art continuous DAG-learning methods on synthetic and real-world data.

directed acyclic graphs · structural equation models · non-negative edge weights · augmented-lagrangian · optimization landscape

Variance-Reduced Manifold Sampling via Polynomial-Maximization Density Estimation

arXiv cs.LG · Serhii Zabolotnii · 2026-05-19

The paper introduces PMM-MASEM, a variance-reduced manifold sampling method that replaces k-nearest-neighbor density estimation in MASEM with a polynomial-maximization moment estimator. The hybrid approach uses a gated PMM2/PMM3 estimator for non-flat spacing distributions while defaulting to the plug-in/MLE rule for homogeneous manifolds. Experiments show a 22-36% reduction in density MSE for asymmetric gamma and boundary-spacing regimes, though performance degrades on platykurtic uniform spacings. The results therefore delineate the regimes in which the method applies rather than showing a universal improvement.

manifold sampling · polynomial-maximization · variance reduction · density estimation · k-nearest-neighbor

JAXenstein: Accelerated Benchmarking for First-Person Environments

arXiv cs.LG · Ruo Yu Tao, George Konidaris · 2026-05-19

JAXenstein is a JAX-based benchmark for visual first-person tasks in reinforcement learning, addressing the lack of such domains in the JAX ecosystem. Using the Wolfenstein 3D rendering engine, it enables fast and scalable experimentation and runs faster than comparable vision-based benchmarks. The benchmark supports testing exploration and partial observability, facilitating rapid algorithm development.

jax · reinforcement learning · wolfenstein 3d · benchmark · partial observability

Hierarchical Contrastive Learning for Multi-Domain Protein-Ligand Binding

arXiv cs.LG · Shuo Zhang, Rongqi Hong, Huifeng Zhang, Jian K. Liu · 2026-05-19

HCLBind introduces a hierarchical contrastive learning framework for multi-domain protein-ligand binding prediction, addressing limitations of monolithic graph approaches. The method combines self-supervised pre-training on Q-BioLiP with a novel hierarchical decoy strategy: local perturbations for single-domain physicochemical constraints and inter-domain rotations for global geometry. It employs a hybrid architecture with domain-gated graph attention, cross-modal attention, and LoRA-adapted foundation models. Evaluated on PDBBind, HCLBind demonstrates improved interface feature discrimination and uncertainty estimation compared to supervised baselines.

protein-ligand binding · contrastive learning · graph attention network · multi-domain proteins · uncertainty estimation

Fast Tensorization of Neural Networks via Slice-wise Feature Distillation

arXiv cs.LG · Safa Hamreras, Sukhbinder Singh, Román Orús · 2026-05-19

The authors propose a scalable tensorization framework for neural network compression via slice-wise feature distillation. Unlike global tensor decomposition methods requiring expensive finetuning, their approach decomposes networks into modular slices (individual layers/blocks or layer groups) and tensorizes each slice independently to match intermediate representations of the original model. This method improves accuracy recovery, reduces data dependence, and enables parallel optimization. Experiments on ResNet-34 demonstrate near-lossless compression at moderate rates with faster optimization than global approaches, while GPT-2 XL results show scalability to large models in distributed settings.

tensorization · feature distillation · neural compression · slice-wise decomposition · parallel optimization

Set-Valued Policy Learning

arXiv cs.LG · Laura Fuentes-Vicente, Mathieu Even, Gaëlle Dormion, Antoine Chambaz · 2026-05-19

The paper introduces set-valued policy learning, a paradigm where policies output multiple plausible treatments instead of single recommendations, enabling intrinsic uncertainty quantification. The method extends learning-to-defer via a greatest-lower-bound approach and introduces conformal policy learning, which connects estimated optimal treatments with unobserved ground-truth rules. A randomness-injection technique guarantees marginal coverage without assumptions on black-box optimal rules. Experiments on synthetic data and an in vitro fertilization (IVF) application demonstrate robust policies that balance performance and reliability while incorporating clinical considerations.

set-valued policy · conformal policy learning · uncertainty quantification · learning-to-defer · randomness-injection

General Lower Bounds for Differentially Private Federated Learning with Arbitrary Public-Transcript Interactions

arXiv cs.LG · Yicheng Li · 2026-05-19

The work establishes a general lower bound for differentially private federated learning protocols with arbitrary public-transcript interactions, applicable to adaptive rounds and client sample reuse. By developing a privacy-information contraction inequality for complete public transcripts, the authors derive a federated van Trees lower bound for estimators under total clientwise sample-level zero-concentrated differential privacy (zCDP). The results demonstrate the bound's applicability to mean estimation, linear regression, and nonparametric regression under squared ℓ2 loss.

federated learning · differential privacy · zcdp · lower bounds · parameter estimation

LionMuon: Alternating Spectral and Sign Descent for Efficient Training

arXiv cs.LG · Arman Bolatov, Artem Riabinin, Nikita Kornilov, Andrey Veprikov · 2026-05-19

The paper introduces LionMuon, a hybrid optimizer alternating between Lion's sign-based updates and Muon's spectral matrix-sign updates with period P, sharing a dual-EMA momentum buffer. This approach reduces per-step cost while maintaining Muon's effectiveness, achieving Pareto dominance over Muon, Lion, Signum, and AdamW across 124M, 355M, and 720M model scales. Theoretical analysis provides complexity bounds under heavy-tailed noise, predicting compute-optimal P and conditions for superiority. LionMuon's state memory matches Lion (half of AdamW), and even a simpler single-EMA variant (SignMuon) outperforms pure Muon.

optimizer · spectral descent · sign descent · dual-ema · heavy-tailed noise

B-cos GNNs: Faithful Explanations through Dynamic Linearity

arXiv cs.LG · Joschka Groß, Mohammad Shaique Solanki, Verena Wolf · 2026-05-19

The paper introduces B-cos GNNs, a class of graph neural networks designed for faithful explainability through dynamic linearity. By employing linear aggregation and B-cos transforms instead of non-linear message passing, the model decomposes predictions into interpretable per-node, per-feature contributions via a single input-dependent linear map. This approach eliminates the need for auxiliary explainers or modified objectives, providing instance-level explanations efficiently. Evaluated as a GIN variant, B-cos GNNs achieve state-of-the-art explainability with minor accuracy trade-offs, outperforming post-hoc methods in speed across synthetic and real-world benchmarks.

b-cos gnns · dynamic linearity · graph neural networks · explainability · gin variant

MSAlign: Aligning Molecule and Mass Spectra Foundation Models for Metabolite Identification

arXiv cs.LG · Paul Krzakala, Gabriel Melo, Camille Lançon, Charlotte Laclau · 2026-05-19

The paper introduces MSAlign, a framework for metabolite identification through alignment of mass spectra and molecular representations. The method combines frozen foundation models (DreaMS for mass spectra and ChemBERTa for molecules) via lightweight MLP projections trained with contrastive learning. MSAlign outperforms existing approaches across benchmarks while being simple and fast. The work also formalizes distribution shift in evaluation strategies, providing quantitative analysis of data splitting tradeoffs. All implementations and datasets are released for reproducibility.

metabolite identification · contrastive learning · foundation models · representation alignment · mass spectrometry

Graph Neural Networks for Community Detection in Graph Signal Analysis

arXiv cs.LG · Roberto Cavoretto, Alessandra De Rossi, Enrico Montini · 2026-05-19

The paper proposes integrating GNN-derived community detection with Partition of Unity Method (PUM) interpolation for graph signal analysis. Using a taxonomy of GNN architectures for community detection, it constructs local subdomains via GNN clustering, computes Graph Basis Function (GBF) interpolants per community, and combines them into global approximations. Experiments on geometric and urban network benchmarks show accurate signal reconstruction, demonstrating that deep learning-based partitions enhance localized interpolation scalability.

graph neural networks · community detection · partition of unity method · graph basis functions · signal interpolation

Awakening the Hydra: Stabilizing Multi-Concept Backdoor Injection in Text-to-Image Diffusion Models

arXiv cs.LG · Kai Wang, Jiale Zhang, Chengcheng Zhu, Chuang Ma · 2026-05-19

The paper introduces Hydra, a framework for stable multi-concept backdoor injection in text-to-image diffusion models under decentralized reuse scenarios. The method employs evolutionary trigger search in text encoder space to align triggers with target concepts while maintaining stability across injections, combined with multi-task fine-tuning and trigger-clean regularization. Experiments on multiple diffusion backbones demonstrate Hydra's effectiveness, achieving ~95% attack success rate across 8 attackers and 500 concept pairs while preserving clean generation quality.

backdoor injection · text-to-image diffusion · multi-task fine-tuning · evolutionary trigger search · trigger-clean regularization

Probabilistic Multivariate Time Series Forecasting with Diffusion Copulas

arXiv cs.LG · David Huk, Dongshan Wang, Miha Bresar · 2026-05-19

The paper introduces a Diffusion-Copula framework for probabilistic multivariate time series forecasting, addressing the 'normality bias' in diffusion models by decoupling marginal distribution learning from dependence structure modeling. The method combines deep Mixture Density Networks for heavy-tailed marginal dynamics with a Classification-Diffusion Copula for joint dependence. Evaluated on cryptocurrency markets, the framework outperforms state-of-the-art baselines in forecasting extreme events, correctly identifying simultaneous market crashes as statistically probable rather than impossible, thus improving risk management during contagion events.

diffusion-copula · mixture density networks · multivariate forecasting · tail risk · dependence structure

Agentic Discovery of Cryomicroneedle Formulations

arXiv cs.LG · Hao Li, Lifu Du, Nurul Hameed, Shemonti Saha Authai · 2026-05-19

The study presents an AI-driven closed-loop workflow for discovering cryoprotectant formulations for cryomicroneedles, combining literature curation, Gaussian-process surrogate modeling, Bayesian optimization, and wet-lab validation. A dataset of 198 mesenchymal stem-cell cryopreservation formulations was used to train an uncertainty-aware prior model, which was iteratively refined through 106 wet-lab observations. Final results showed improved predictive performance (batch RMSE reduced from 41.21 to 6.86 percentage points, R²=0.942) and identified a high-viability (95.15%) formulation with low toxicity. The work highlights the potential of agent-assisted discovery for labs lacking in-house data expertise.

cryomicroneedles · gaussian-process · bayesian optimization · cryoprotectant · multi-objective optimization

Convergence of Consensus-Based Particle Methods for Nonconvex Bi-Level Optimization

arXiv cs.LG · Yutong Chao, Xudong Sun, Konstantin Riedl, Majid Khadiv · 2026-05-19

The paper proposes a derivative-free consensus-based optimization method for nonconvex bi-level optimization, where the upper-level function is minimized over the set of lower-level global minimizers. The method employs smooth quantile selection and a Gibbs-type Laplace approximation to construct consensus points. Theoretical analysis establishes convergence guarantees for both mean-field dynamics and finite-particle approximations, demonstrating exponential convergence to arbitrary Wasserstein neighborhoods of the bi-level solution under smooth quantile localization and stability assumptions. Numerical experiments on constrained 2D problems and neural network training validate the theoretical findings.

bi-level optimization · consensus-based optimization · mean-field dynamics · wasserstein distance · gibbs-type approximation

Cross-View Attention Fusion Net: A Prior-Guided Dual-View Representation Learning for Cardiac Output Estimation from Short-Term PPG Signals

arXiv cs.LG · Yaowen Zhang, Bo Cui, Libera Fresiello, Peter H. Veltink · 2026-05-19

The Cross-View Attention Fusion Network (CVAF-Net) is proposed for cardiac output (CO) estimation from short photoplethysmography (PPG) signals, combining raw temporal PPG data with structured feature sequence maps via cross-view attention. This dual-view approach leverages both end-to-end learning and physiological priors, achieving mean absolute error (MAE) of 0.19 L/min (3.95% MAPE) on simulated data and 1.20 L/min in real-world settings, while reducing FLOPs by 12× compared to Transformer-based models. The method demonstrates physiological plausibility through correlations with age (ρ=−0.274), heart rate (ρ=0.894), and vascular resistance (ρ=−0.740).

photoplethysmography · cardiac output estimation · cross-view attention · feature sequence map · hemodynamic monitoring

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

arXiv cs.LG · Zunhai Su, Rui Yang, Chao Zhang, Yaxiu Liu · 2026-05-19

OScaR introduces a lightweight KV cache compression framework for X-LLMs (text-only, multi-modal, and omni-modal LLMs) by addressing Token Norm Imbalance (TNI) through Canalized Rotation and Omni-Token Scaling. The method mitigates sequence-dimensional variance efficiently, supported by optimized system design and CUDA kernels. Evaluations demonstrate near-lossless INT2 quantization performance, achieving 3.0x decoding speedup, 5.3x memory reduction, and 4.1x throughput increase compared to BF16 FlashDecoding-v2.

kv cache · token norm imbalance · per-channel quantization · canalized rotation · omni-token scaling

BCI-sift: An automated feature selection toolbox for Brain Computer Interface applications

arXiv cs.LG · Elena C Offenberg, Dirk Keller, Mariska J Vansteensel, Zachary V Freudenburg · 2026-05-19

The authors present BCI-sift, a Python toolbox for systematic feature selection in Brain-Computer Interface applications, compatible with scikit-learn. The method integrates optimization algorithms to identify relevant features across electrode, temporal, and frequency dimensions in high-dimensional BCI data. Validation on HD ECoG data (8 participants, 64-128 electrodes) showed improved classification accuracy, with selected features anatomically consistent and temporally clustered around speech production, while high-frequency bands proved most informative. The open-source toolbox enhances decoding performance and interpretability for various BCI modalities.

feature selection · brain-computer interface · electrocorticography · sensorimotor cortex · high-dimensional data

Inferring Sensitive Attributes from Knowledge Graph Embeddings: Attack and Defense Strategies

arXiv cs.LG · Yasmine Hayder · 2026-05-19

The paper investigates privacy risks in knowledge graph embeddings (KGEs), demonstrating that adversaries can infer sensitive user attributes from non-sensitive KGE outputs. It proposes a defense framework using post-processing sanitization techniques to mitigate these attribute inference attacks. Preliminary results reveal the attack effectiveness and explore the privacy-utility trade-off in randomization-based defenses, suggesting future work on advanced techniques is needed.

knowledge graph embeddings · attribute inference attacks · privacy risks · sanitization techniques · privacy-utility trade-off

Increasing Missingness to Reduce Bias: Richardson-SGD with Missing Data

arXiv cs.LG · Ferdinand Genans, Erwan Scornet · 2026-05-19

The paper introduces Richardson-SGD, a debiasing method for stochastic gradient descent (SGD) with missing data that deliberately increases missingness to reduce gradient bias. By generating a further-thinned version of incomplete observations and combining gradients via Richardson extrapolation, the method reduces bias from $O(\|p\|)$ to $O(\|p\|^2)$, where $p$ is the missingness ratio vector. Theoretical analysis shows the approach is model-agnostic, computationally efficient, and generalizes to multi-step cancellation of higher-order bias terms. Empirical results demonstrate improved optimization and estimation across generalized linear models, particularly when combined with imputation methods like MICE.
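
The cancellation behind the $O(\|p\|)$ to $O(\|p\|^2)$ improvement can be sketched with a generic Richardson extrapolation step. The following is a scalar simplification and not the paper's derivation: $\rho$ stands for a single missingness level, $g(\rho)$ for the expected gradient computed from data at that level, and $\rho'$ for the level after further thinning; all three symbols are introduced here for illustration.

```latex
% Assumption: the bias is smooth in the missingness level, with constants c and a.
g(\rho) = g(0) + c\,\rho + a\,\rho^{2} + O(\rho^{3})

% Thin the data further to a level \rho' > \rho and combine the two gradients
% so that the first-order terms cancel:
\frac{\rho'\, g(\rho) - \rho\, g(\rho')}{\rho' - \rho}
  = g(0) - a\,\rho\,\rho' + O(\rho^{3})

% With \rho' = 2\rho this is the classic combination 2\,g(\rho) - g(2\rho),
% whose bias is second order in \rho.
```

The extrapolated gradient uses only quantities that can be computed from incomplete data, since both $g(\rho)$ and $g(\rho')$ come from observed or further-thinned samples and $g(0)$ is never evaluated directly.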

richardson extrapolation · gradient bias · missing data · stochastic gradient descent · imputation

Gaussian Approximation and Multiplier Bootstrap for Federated Linear Stochastic Approximation

arXiv cs.LG · Ilya Levin, Maksim Shuklin, Eric Moulines, Paul Mangold · 2026-05-19

The paper establishes Berry-Esseen-type bounds for federated linear stochastic approximation (LSA), providing the first federated Gaussian approximations that quantify communication-computation trade-offs and heterogeneity-aware error terms. It analyzes both constant and decreasing step size regimes, recovering prior results as special cases. A key contribution is an online multiplier bootstrap procedure for inference on the last iterate, which bypasses asymptotic covariance matrix estimation and offers non-asymptotic validity guarantees.

federated learning · stochastic approximation · berry-esseen bound · multiplier bootstrap · heterogeneity-aware

Optimal Reconstruction from Linear Queries

arXiv cs.LG · Yuval Filmus, Shay Moran, Elizaveta Nesterova · 2026-05-19

The paper characterizes optimal reconstruction error for recovering an unknown point in ℝᵈ from noisy linear queries, establishing fundamental limits analogous to Bayes optimal error in supervised learning. Using geometric methods including a robust generalization of Jung's theorem via Lie group analysis, the authors prove: (1) asymptotic error √(2d/(d+1))δ as T→∞, (2) doubly exponential excess error decay for fixed d, and (3) Θ(exp(d)) query complexity for vanishing error in growing dimensions. The authors also analyze the improper variant of the problem.
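
The constant √(2d/(d+1)) matches what classical Jung's theorem gives for a set of diameter 2δ: any set of diameter D in ℝᵈ fits in a ball of radius D·√(d/(2(d+1))). Reading the noise level δ as confining the unknown point to a set of diameter 2δ is an interpretation made here, not a statement from the paper; the Python sketch below only checks the arithmetic, and `jung_radius` is a hypothetical helper name.

```python
import math


def jung_radius(diameter, d):
    """Jung's theorem: a set of the given diameter in R^d fits in a ball of
    radius diameter * sqrt(d / (2 * (d + 1)))."""
    return diameter * math.sqrt(d / (2.0 * (d + 1)))


delta = 0.1
for d in [1, 2, 3, 10, 1000]:
    quoted = math.sqrt(2.0 * d / (d + 1)) * delta  # constant quoted above
    print(d, jung_radius(2.0 * delta, d), quoted)
```

For d = 1 the radius equals δ, and as d grows it approaches √2·δ, so under this reading the asymptotic error stays bounded in the dimension.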

linear queries · reconstruction error · jung's theorem · lie group · query complexity

Diffusion Graph Posterior Sampling for Nonlinear Inverse Problems with Application to Electrical Impedance Tomography

arXiv cs.LG · Giovanni S. Alberti, Damiana Lazzaro, Serena Morigi, Matteo Santacesaria · 2026-05-19

The paper introduces a graph-based diffusion framework for solving nonlinear inverse problems in PDEs, specifically electrical impedance tomography (EIT). The method extends diffusion posterior sampling (DPS) to unstructured meshes via an unconditional score-based diffusion model on 2D triangular meshes, supplemented by a regularized variant (RDPS) incorporating total variation and Tikhonov terms. Experiments on synthetic and real EIT data show RDPS achieves stable, physically plausible reconstructions, outperforming GPnP-BM3D and DP-SGS in accuracy and noise robustness while generalizing to out-of-distribution geometries.

diffusion posterior sampling · electrical impedance tomography · unstructured meshes · score-based diffusion · inverse problems

A Family of Divergence Measures for Evaluating the Reconstruction Quality of Explainable Ensemble Trees

arXiv cs.LG · Massimo Aria, Agostino Gnasso, Carmela Iorio · 2026-05-19

The authors propose a family of divergence measures for evaluating reconstruction quality in Explainable Ensemble Trees (E2Tree), addressing limitations of correlation-based validation. Their framework introduces the normalized Loss of Interpretability (nLoI), a Cressie-Read power divergence (λ=2) measure with closed-form decomposition into within-node and between-node components, enabling precise diagnostic analysis. Four complementary metrics capture distinct structural facets, supported by a unified permutation testing procedure. Theoretical analysis establishes boundedness and symmetry, while empirical evaluations on three benchmarks demonstrate superior detection of reconstruction fidelity gradients compared to correlation-based methods.

explainable ensemble trees · cressie-read divergence · reconstruction fidelity · permutation testing · interpretability loss

Posterior Contraction of Lévy Adaptive B-spline Regression in Besov Spaces

arXiv cs.LG · Jeunghun Oh, Sewon Park, Jaeyong Lee · 2026-05-19

The study establishes nearly minimax-optimal posterior contraction rates, up to a logarithmic factor, for the Lévy Adaptive B-spline (LABS) regression model in Besov spaces. LABS extends the Lévy Adaptive Regression Kernel (LARK) framework by incorporating B-spline kernels with independently defined knots, enabling adaptation to irregular and locally structured features. Theoretical results are complemented by simulations on standard Besov test functions (Blocks, Bumps, HeaviSine, Doppler), demonstrating practical utility while automatically adapting to unknown smoothness.

b-spline regression · besov spaces · posterior contraction · nonparametric bayesian · minimax-optimal rates

Physics-Informed Graph Neural Network Surrogates for Turbulent Nanoparticle Dispersion in Dental Clinical Environments

arXiv cs.LG · Takshak Shende, Viktor Popov · 2026-05-19

The paper introduces ELGIN, a physics-informed graph neural network surrogate for simulating turbulent nanoparticle dispersion in dental clinics. The model combines a multi-head Graph Transformer with Lagrangian particle tracking and symplectic integration, using a four-stage curriculum for stable autoregressive rollouts. Compared to a Lagrangian-only baseline, ELGIN reduces mean parcel displacement error from 19.56% to 16.20% and cloud radius-of-gyration error from 9.85% to 6.58%, while achieving a 37x speedup over traditional CFD methods. The approach enables real-time infection-risk screening in clinical environments.

graph neural network · turbulent dispersion · physics-informed learning · autoregressive rollout · symplectic integrator

Online Market Making and the Value of Observing the Order Book

arXiv cs.LG · Davide Maran, Marcello Restelli · 2026-05-19

The paper introduces an action-dependent feedback model for online market making, where observing the order book provides partial information about supply and demand when trades don't occur. For stochastic i.i.d. prices, the authors propose an elimination-based algorithm achieving O(√T) high-probability regret without smoothness assumptions on trader valuations. They extend this to mean-reverting processes (both local autoregressive dynamics and global drift conditions) while maintaining O(√T) regret, and present an explore-then-perturb algorithm for adversarial settings with O(T^{2/3}) expected regret. The results demonstrate improved learnability compared to standard bandit feedback models.

online market making · action-dependent feedback · regret bounds · mean-reverting processes · limit order book

HiLiftAeroML: High-Fidelity Computational Fluid Dynamics Dataset for High-Lift Aircraft Aerodynamics

arXiv cs.LG · Neil Ashton, Adam Clark, Liam Heidt, Christopher Ivey · 2026-05-19

The authors introduce HiLiftAeroML, the first open-source high-fidelity CFD dataset for high-lift aircraft aerodynamics, targeting AI surrogate model development. The dataset comprises 1800 samples from 180 geometry variants and 10 angles of attack for the NASA Common Research Model, generated using GPU-accelerated explicit wall-modeled LES with solution-adapted grids (300M-500M cells). Results include time-averaged volume/surface variables and integral forces, released under CC-BY-4.0 to accelerate aerospace AI research.

computational fluid dynamics · high-lift aerodynamics · les simulation · surrogate modeling · nasa crm

Learning-Accelerated Optimization-based Trajectory Planning for Cooperative Aerial-Ground Handover Missions

arXiv cs.LG · Jingshan Chen, Bochen Yu, Henrik Ebel, Peter Eberhard · 2026-05-19

The paper introduces a learning-augmented trajectory planning framework for UAV-UGV handover missions, combining neural surrogate planning with centralized optimization. A decoupled encoder-decoder LSTM network predicts initial trajectories from task specifications, accelerating downstream optimization. Evaluations show a threefold speedup and 100% optimization success rate compared to cold starts, demonstrating efficient, feasible trajectory generation for multi-robot systems.

trajectory planning · uav-ugv cooperation · lstm networks · optimization acceleration · multi-robot systems

Density-Ratio Losses for Post-Hoc Learning to Defer

arXiv cs.LG · Alexander Soen, Ragnar Thobaben, Joakim Jaldén, Richard Nock · 2026-05-19

The paper introduces density-ratio losses for post-hoc Learning to Defer (L2D), framing deferral decisions as density-ratio estimation between model and expert ideal distributions. The method derives DR CPE losses for L2D scorers via reduction from density-ratio to class-probability estimation, enabling adjustable deferral rates without retraining. Theoretical analysis shows connections to Chow's rule and expert-tilted Bayes posteriors. Experiments demonstrate competitive performance against baselines and robustness across datasets.

learning to defer · density-ratio estimation · post-hoc learning · chow's rule · bayes posterior

Provable Fairness Repair for Deep Neural Networks

arXiv cs.LG · Jianan Ma, Jingyi Wang, Qi Xuan, Zhen Wang · 2026-05-19

The paper introduces ProF, a provable fairness repair framework for deep neural networks (DNNs) addressing ethical concerns like individual discrimination. ProF leverages interval bound propagation to soundly capture model outputs over input neighborhoods, integrating fairness constraints into a Mixed-Integer Linear Programming (MILP) formulation for guaranteed repair. Evaluated on four benchmarks, ProF achieves up to 95.93% fairness generalization on datasets and 93.16% on the entire input space, with ~90% fairness improvement while supporting multiple sensitive attributes.

fairness repair · interval bound propagation · mixed-integer linear programming · provable guarantees · sensitive attributes
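The interval bound propagation step mentioned in the ProF summary can be illustrated generically. The sketch below propagates an axis-aligned input box through a small ReLU network and returns sound output bounds. It shows the textbook technique only, not ProF's MILP repair formulation, and the names `ibp_linear`, `ibp_relu`, and `ibp_mlp` are hypothetical.

```python
import numpy as np

def ibp_linear(lower, upper, W, b):
    """Sound bounds on W @ x + b for every x with lower <= x <= upper."""
    center = (lower + upper) / 2.0
    radius = (upper - lower) / 2.0
    out_center = W @ center + b
    out_radius = np.abs(W) @ radius  # worst-case spread of the box
    return out_center - out_radius, out_center + out_radius

def ibp_relu(lower, upper):
    """ReLU is monotone, so it can be applied to each bound directly."""
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)

def ibp_mlp(lower, upper, layers):
    """layers: list of (W, b); ReLU after every layer except the last."""
    for i, (W, b) in enumerate(layers):
        lower, upper = ibp_linear(lower, upper, W, b)
        if i < len(layers) - 1:
            lower, upper = ibp_relu(lower, upper)
    return lower, upper
```

In an individual-fairness setting, the input box would typically let the sensitive attribute range over its values around a given sample; a repair is needed whenever the resulting output bounds still permit a change of decision.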

The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility

arXiv cs.LG · David Pape, Jonathan Evertz, Lea Schönherr · 2026-05-19

This work identifies inference backends as a critical but underreported hyperparameter affecting LLM benchmark reproducibility. The authors survey 200 inference engines and analyze 35,000 ML publications, finding minimal reporting of inference stacks despite their diversity. Through controlled experiments with five backends (including vLLM, SGLang, and llama.cpp) across multiple models and benchmarks, they demonstrate that backend choice alone can alter scores by up to 16.6 percentage points and cause output divergence, traced to optimizations like prefix caching, CUDA graphs, and logit processing defaults.

inference backends · benchmark reproducibility · cuda graphs · prefix caching · logit processing

Boosting Text-to-Image Diffusion Models via Core Token Attention-Based Seed Selection

arXiv cs.LG · Yunzhe Zhang, Hongfu Liu, Pengyu Hong · 2026-05-19

The paper introduces Attention-Based Seed Selection (ABSS), a training-free method to improve text-to-image diffusion models by ranking seeds based on cross-attention to core tokens during early denoising steps. ABSS operates at inference time, selecting top-k seeds without fixed thresholds or model modifications. Experiments on Stable Diffusion variants demonstrate consistent improvements in text-image alignment and visual quality across three benchmarks, validated by human preference metrics.

seed selection · cross-attention · denoising · text-to-image · stable diffusion

A dynamical systems view of training generative models and the memorization phenomenon

arXiv cs.LG · Siva Athreya, Chiranjib Bhattacharya, Vivek S. Borkar · 2026-05-19

The paper provides a dynamical systems interpretation of memorization in generative models during SGD training, building on prior work about collapse phenomena and two-time-scale dynamics. By modeling the loss function with strongly and weakly coupled variables and drawing on the framework of Austin (2016), the authors formalize how constant-step SGD exhibits distinct time scales. This analysis, combined with the collapse model of Borkar (2025a) and the results of Azizian et al. (2024), explains memorization as prolonged output similarity during fine-tuning. The work unifies memorization, double descent, and collapse through a system-theoretic lens.

memorization phenomenon · two-time-scale dynamics · stochastic gradient descent · generative models · collapse phenomenon

Drifting Objectives for Refining Discrete Diffusion Language Models

arXiv cs.LG · Daisuke Oba, Hiroki Furuta, Naoaki Okazaki · 2026-05-19

The paper introduces TokenDrift, a method to refine discrete diffusion language models (DDLMs) by transferring drifting objectives from continuous to discrete domains. The approach lifts categorical predictions to soft-token features, applies anti-symmetric drifting in a frozen semantic space, and backpropagates stop-gradient feature targets to DDLM logits. Experiments with masked (MDLM) and uniform-state (DUO) diffusion backbones show significant improvements: 89% and 86% reductions in generation perplexity at 4 NFEs, respectively, compared to baselines.

discrete diffusion · drifting objective · soft-token features · anti-symmetric · generation perplexity

Implicit Bias of Mirror Flow in Homogeneous Neural Networks: Sparse and Dense Feature Learning

arXiv cs.LG · Tom Jacobs, Guido Montufar · 2026-05-19

The paper characterizes max-margin solutions induced by mirror flow in homogeneous neural networks through convex duality, deriving a balance equation for the horizon function governing margin formation. The analysis extends classical gradient flow results, providing convergence rates, norm growth estimates, and demonstrating how mirror maps influence solution geometry. Experiments on synthetic and vision datasets reveal: (1) non-homogeneous mirror maps can converge to identical max-margin solutions, (2) convergence exhibits extremely slow (including exponential) regimes, and (3) mirror maps induce diverse feature learning behaviors, from sparse to dense neuron activations.

mirror flow · max-margin · homogeneous networks · feature learning · convex duality

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

arXiv cs.LG · Ahmed Heakl, Abdelrahman M. Shaker, Youssef Mohamed, Rania Elbadry · 2026-05-19

The paper introduces Contrastive Evidence Policy Optimization (CEPO), a reinforcement learning method that sharpens credit assignment in reasoning tasks by contrasting token-level preferences under correct versus incorrect answers. CEPO constructs a wrong-answer teacher from rejected rollouts without additional sampling, theoretically preserving safety guarantees while better identifying decisive reasoning steps versus filler tokens. Experiments on five multimodal mathematical reasoning benchmarks show CEPO improves average accuracy to 43.43% (2B) and 60.56% (4B) versus GRPO's 41.17% and 57.43%, while distribution-matching baselines (OPSD, SDPO) underperform due to information leakage.

reinforcement learning · credit assignment · policy optimization · multimodal reasoning · self-distillation

TIDE: Asymmetric Neural Circuits for Stabilized Temporal Inhibitory-Excitatory Dynamics

arXiv cs.LG · Alexander Kyuroson, Denis Kleyko, Marcus Liwicki · 2026-05-19

The paper introduces TIDE, a neuro-inspired architecture for stabilized neural dynamics, addressing stability limitations in Continuous Thought Machine (CTM) architectures. TIDE employs asymmetric Excitatory-Inhibitory (E-I) networks with Wilson-Cowan dynamics and lateral inhibition, ensuring stability via energy-based optimization and game-theoretic loss. It enforces Dale's principle and an 80:20 E-I ratio while incorporating Hierarchical Receptive Fields for biological realism. Theoretical proofs confirm convergence and stability, with empirical results showing TIDE achieves +1.65% top-1 accuracy on ImageNet under perturbations while using <50% of CTM's training time.

neural dynamics · wilson-cowan dynamics · dale's principle · lateral inhibition · energy-based systems

Neuron Incidence Redistribution for Fairness in Medical Image Classification

arXiv cs.LG · Abin Shoby, Lyle John Palmer, Nikhil Cherian Kurian · 2026-05-19

The paper introduces Neuron Incidence Redistribution (NIR), a regularization method to mitigate demographic disparities in medical image classification by redistributing latent disease evidence across penultimate-layer neurons. NIR penalizes variance in predicted-probability-weighted mean activations without requiring demographic labels. Evaluated on HAM10000 and Harvard OCT-RNFL datasets, NIR reduces TPR disparity from 10.81% to 0.93% (age) and 12.04% to 0.74% (gender), and FPR disparity from 15.68% to 10.66% (race) and 12.69% to 1.80% (age), while marginally improving AUC by 0.51 points.

fairness · medical image classification · neuron incidence redistribution · penultimate-layer activations · demographic disparity

Understanding Dynamics of Adam in Zero-Sum Games: An ODE Approach

arXiv cs.LG · Yi Feng, Weiming Ou, Xiao Wang · 2026-05-19

The paper provides a theoretical analysis of Adam-DA in zero-sum games by deriving continuous-time ODE approximations of its discrete-time dynamics. Using this framework, the authors examine local convergence and implicit gradient regularization, revealing that momentum parameters exhibit opposite effects compared to minimization problems. Experimental validation on GANs across multiple architectures and datasets confirms these reversed momentum dynamics.

adam-da · zero-sum games · ode approximation · momentum parameters · implicit regularization

Tweedie's Formulae and Diffusion Generative Models Beyond Gaussian

arXiv cs.LG · Wenpin Tang, Nizar Touzi, Zikun Zhang, Xun Yu Zhou · 2026-05-19

The authors extend Tweedie's formula to non-Gaussian diffusion processes, enabling denoising score-matching objectives for geometric Brownian motion (GBM), squared Bessel (BESQ), and Cox-Ingersoll-Ross (CIR) processes. This theoretical advancement facilitates score-based generative modeling beyond Gaussian noise assumptions. The derived formulae are empirically validated on image generation, financial time series modeling, and empirical Bayes estimation, demonstrating competitive performance with non-Gaussian diffusion models. Results indicate particular promise for GBM- and CIR-based approaches in their respective domains.

tweedie's formula · non-gaussian diffusion · denoising score matching · geometric brownian motion · cox-ingersoll-ross process
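As background for the summary above, the classical Gaussian case that the paper generalizes can be written out explicitly. The block below states only the standard Gaussian formula; the GBM, BESQ, and CIR analogues derived by the authors are not reproduced in this digest, and the symbols $\sigma$, $p_\sigma$, and $s_\theta$ are notation chosen here for illustration.

```latex
% Gaussian corruption: y = x + \sigma \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)
% Tweedie's formula: the posterior mean is the noisy point plus a score correction
\mathbb{E}[x \mid y] = y + \sigma^{2} \, \nabla_{y} \log p_{\sigma}(y)

% Conditional score of the Gaussian kernel, which makes denoising score matching tractable
\nabla_{y} \log p(y \mid x) = -\frac{y - x}{\sigma^{2}}

% Denoising score-matching objective for a score network s_\theta
\mathcal{L}(\theta) = \mathbb{E}_{x,\,y}\left\| s_{\theta}(y) + \frac{y - x}{\sigma^{2}} \right\|^{2}
```

Here $p_\sigma$ denotes the marginal density of the noisy observation $y$. The minimizer of the objective is the marginal score $\nabla_y \log p_\sigma(y)$, which is why a trained score network doubles as a denoiser in the Gaussian setting.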

Accurate, Efficient, and Explainable Deep Learning Approaches for Environmental Science Problems

arXiv cs.LG · Jimeng Shi · 2026-05-19

This dissertation develops three deep learning approaches for environmental science challenges. First, WaLeF and FIDLAr improve flood prediction and management in coastal systems, outperforming baselines in accuracy and efficiency while providing interpretability. Second, CoDiCast, a conditional diffusion model, enables probabilistic weather forecasting with explicit uncertainty quantification. Third, Hypercube-RAG enhances scientific QA by combining retrieval-augmented generation with a structured text cube framework, simultaneously improving accuracy, efficiency, and explainability. Evaluations demonstrate effectiveness in flood-prone regions and global weather prediction tasks.

water level forecasting · conditional diffusion model · retrieval-augmented generation · probabilistic forecasting · interpretable deep learning

Scalable, Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection

arXiv cs.LG · Parnian Ghapandar Kashani, Shiqi Chen, Aydogan Ozcan · 2026-05-19

The authors propose a hybrid digital-analog architecture for scalable deepfake video detection, combining a lightweight digital front-end with a spatially multiplexed optical back-end using programmable spatial light modulators. The system processes 15+ video streams in parallel via optical propagation, achieving 97.79% accuracy on Celeb-DF with 99.86% sensitivity and 95.72% specificity while reducing computational costs versus digital methods. Experimental validation demonstrates robustness to video degradation, noise, compression, and adversarial attacks, highlighting simultaneous improvements in throughput, energy efficiency, and adversarial resilience.

optical computation · spatial multiplexing · deepfake detection · analog inference · adversarial robustness

MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification

arXiv cs.LG · Halil Ibrahim Gulluk, Olivier Gevaert · 2026-05-19

The authors propose MAM-CLIP, a vision-language model for BI-RADS classification in mammography, leveraging contrastive pretraining on 2313 image-text pairs from mammography atlases. The method combines a PubMedBERT language encoder with a vision encoder, pretrained via image-text alignment to capture rich textual descriptors, then fine-tuned for BI-RADS prediction. Results show consistent improvements over image-only baselines, with F1-score gains of +1% (40K samples) to +14% (1K samples), and demonstrate that 2K image-text pairs outperform 2K labeled samples by +1.1% when >10K training samples are available. The work releases preprocessed TEKNOFEST data and model artifacts.

bi-rads · contrastive learning · vision-language pretraining · mammography atlases · pubmedbert

CompoSE: Compositional Synthesis and Editing of 3D Shapes via Part-Aware Control

arXiv cs.LG · Habib Slim, Shariq Farooq Bhat, Mohamed Elhoseiny, Yifan Wang · 2026-05-19

CompoSE is a method for compositional synthesis and editing of 3D shapes via part-aware control, enabling localized, granular editing of individual parts. The approach uses a diffusion transformer architecture that alternates between local part processing and global context aggregation, with a new conditioning technique for strong adherence to user input. It infers part semantics and symmetries from coarse geometric primitives without requiring part-level text prompts. Experiments show superior performance in guided synthesis, with capabilities including part substitution, addition, deletion, and style-preserving resizing, validated by objective metrics and LLM-based evaluations.

compositional synthesis · diffusion transformer · part-aware control · 3d shape editing · geometric primitives

What Makes a Representation Good for Single-Cell Perturbation Prediction?

arXiv cs.LG · Wenkang Jiang, Yuhang Liu, Yichao Cai, Erdun Gao · 2026-05-19

The paper introduces PerturbedVAE, a framework addressing signal imbalance in single-cell perturbation modeling by separating perturbation-specific information from invariant structure. It employs causal representation learning to recover sparse perturbation effects, supported by identifiability analysis for reliable recovery conditions. Empirical results demonstrate state-of-the-art performance on benchmark tasks, with notable improvements in out-of-distribution combinatorial predictions and interpretable perturbation-response programs.

perturbedvae · single-cell · representation learning · identifiability · sparsity

An Exterior Method for Nonnegative Matrix Factorization

arXiv cs.LG · Qiujing Lu, Tonmoy Monsoor, Ehsan Ebrahimzadeh, Kartik Sharma · 2026-05-19

The authors propose an exterior method for nonnegative matrix factorization (eNMF) that decouples low-rank approximation from nonnegativity enforcement, contrasting with traditional interior approaches. The method initializes from optimal unconstrained factorization and employs a rotation procedure to map factors to exterior points near the nonnegative orthant, yielding KKT-satisfying stationary points on the boundary. Experiments across 400 NMF trials show 99% convergence to equivalent factor matrices, with eNMF outperforming 81 competitor configurations by achieving 30% lower reconstruction error and 150% speedup. Downstream applications in audio processing and recommendation systems demonstrate practical benefits.

nonnegative matrix factorization · exterior optimization · low-rank approximation · kkt conditions · orthogonal transformations

BrainDyn: A Sheaf Neural ODE for Generative Brain Dynamics

arXiv cs.LG · Siddharth Viswanath, Panayiotis Ketonis, Chen Liu, Michael Perlmutter · 2026-05-19

BrainDyn introduces a sheaf neural ODE model for generating brain-like dynamics on structured graphs, addressing limitations of LLMs and RNNs in anatomical alignment and graph networks in expressiveness. The method combines LSTM-based activity history encoding with learnable restriction maps and a sheaf Laplacian for message passing, integrated with a neural ODE for continuous-time evolution. Evaluated on resting-state fMRI (PNC), EEG with epilepsy (TUSZ), and NEST spiking simulations, BrainDyn demonstrates strong forecasting and supports in silico perturbation prediction.

sheaf neural ode · brain dynamics · restriction maps · sheaf laplacian · in silico perturbation

A Unified Framework for Structure-Aware Clustering and Heterogeneous Causal Graph Learning

arXiv cs.LG · Honglin Du, Muxuan Liang, Xiang Zhong · 2026-05-19

The paper proposes DAG-DC-ADMM, a unified framework for jointly learning cluster assignments and cluster-specific dependency structures in multivariate systems with heterogeneous causal relationships. The method combines Structural Equation Modeling (SEM) with a groupwise truncated Lasso fusion penalty (gTLP) to enforce structural similarity within clusters, while incorporating sparsity and acyclicity constraints via a smooth formulation. An adapted ADMM algorithm solves the resulting nonconvex optimization problem, with convergence guarantees to KKT points for certain graph structures. Experiments show the method achieves high true positive rates and low false discovery rates in recovering cluster-specific causal dependencies.

structural equation modeling · directed acyclic graphs · alternating direction method of multipliers · heterogeneous causal learning · nonconvex optimization

An Objective Performance Evaluation of the LSTM Networks in Time Series Classification

arXiv cs.LG · Sooraj Sunil, Balakumar Balasingam · 2026-05-19

This paper presents a framework for objectively evaluating LSTM networks against model-based approaches in time-series classification. The study compares an LSTM classifier with an expectation maximization (EM) classifier on binary classification tasks using scalar linear Gaussian state space models, with the Kalman filter likelihood ratio test as a reference. Monte Carlo simulations reveal that the EM classifier outperforms LSTM when data conform to the model structure, while LSTM requires larger noise separation for reliable classification and underperforms the reference in measurement noise scenarios regardless of sequence length or training size.

lstm networks · time-series classification · expectation maximization · kalman filter · monte carlo simulations

A Two-Phase Adaptive Balanced Penalty Method for Controllable Pareto Front Learning under Split Feasibility Conditions

arXiv cs.LG · Nguyen Viet Hoang, Dung D. Le, Tran Ngoc Thang · 2026-05-19

The paper proposes a two-phase Adaptive Balanced Penalty (ABP) method for Controllable Pareto Front Learning under split feasibility conditions, reformulating the problem as a Bi-Level Scalarized Split Problem. ABP combines three gradient components (optimality, set feasibility, image feasibility) via an adaptive indicator and proves convergence using convex surrogates. The method is implemented as ABP-HyperNet for Hyper-MLP and HyperTrans architectures, evaluated through a new Expected Feasible Hypervolume metric. Experiments on five multi-objective benchmarks and three multi-task datasets show ABP-HyperNet achieves 2.3× higher EFHV than baselines, improving feasibility from 36-49% to 87-100%.

pareto front learning · split feasibility · hypernetwork · bi-level optimization · feasible hypervolume

Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes

arXiv cs.LG · Tianshu Kuai, Arman Maesumi, Daniel Ritchie, Noam Aigerman · 2026-05-19

The paper introduces a triangulation-agnostic flow matching (FM) method for generating signals over triangle meshes, employing a Matérn process as a noise distribution to ensure triangulation invariance. The approach adapts FM to meshes by using PoissonNet, a state-of-the-art gradient-domain learning model, as the denoiser. Experiments demonstrate the method's efficacy in generating realistic elastic rest states and humanoid poses on meshes exceeding one million triangles, outperforming existing techniques in quality and diversity.

flow matching · matérn process · triangulation-agnostic · poissonnet · gradient-domain

Cross-Paradigm Knowledge Distillation: A Comprehensive Study of Bidirectional Transfer Between Random Forests and Deep Neural Networks for Big Data Applications

arXiv cs.LG · Mahdi Naser Moghadasi · 2026-05-19

This paper introduces the first comprehensive study of bidirectional knowledge distillation between Random Forests (RF) and Deep Neural Networks (DNN), addressing a gap in cross-paradigm transfer. The authors propose novel methods including progressive multi-stage distillation, multi-teacher ensemble distillation, and uncertainty-aware transfer mechanisms. Evaluated across 144 experiments on 6 datasets, their approach achieves 98.13% classification accuracy (NN-COMPACT) and 92.6% R^2 score (NN-WIDE), demonstrating complementary benefits of interpretability (RF) and expressiveness (DNN) while enabling flexible deployment in big data environments.

knowledge distillation · random forests · deep neural networks · multi-teacher ensemble · interpretable ai

Domain-Adaptive Communication-Rate Optimization for Sim-to-Real Humanoid-Robot Wireless XR Teleoperation

arXiv cs.LG · Caolu Xu, Zhiyong Chen, Meixia Tao, Li Song · 2026-05-19

The paper proposes a domain-adaptive communication-rate optimization framework for sim-to-real wireless XR teleoperation of humanoid robots, minimizing communication energy while maintaining motion trajectory reconstruction accuracy. The method integrates sampling, transmission, interpolation, and reconstruction, employing dimension-wise sampling-rate control and a PPO algorithm with density-ratio weighting and trust-region regularization for sim-to-real adaptation. Experiments on a public humanoid teleoperation dataset demonstrate improved tradeoffs between reconstruction error and energy consumption under distribution shift, with analysis across varying wireless channels and dynamic trajectories.

sim-to-real · wireless xr teleoperation · proximal policy optimization · density-ratio estimation · communication-rate optimization

Factor Augmented High-Dimensional SGD

arXiv cs.LG · Shubo Li, Yuefeng Han, Xiufan Yu · 2026-05-19

The paper introduces Factor-Augmented SGD (FSGD), a novel optimization method for high-dimensional learning that leverages latent factor representations on streaming data, eliminating the need for offline preprocessing. Unlike traditional two-stage approaches, FSGD operates purely online, enhancing scalability. The authors provide the first theoretical framework incorporating latent factor estimation error into SGD analysis, proving moment convergence in ℓ^s norm under decaying step sizes and mini-batch updates. This work establishes a foundation for reliable, scalable SGD in high-dimensional systems.

stochastic gradient descent · latent factor · high-dimensional learning · moment convergence · streaming data

Language models struggle with compartmentalization

arXiv cs.LG · Thomas Vincent Howe, David Wingate · 2026-05-19

The study demonstrates that large language models (LLMs) exhibit compartmentalization, failing to share statistical strength between distinct presentations of unified concepts (e.g., multilingual or multi-representational data). Through empirical analysis, the authors show that LLMs often learn parallel internal representations for each presentation, saturating capacity and reducing sample efficiency. Synthetic parallel data fails to mitigate this issue, and early multilingual learning in small models appears highly compartmentalized. Interventions exhibit phase transitions in effectiveness based on presentation count, suggesting inconsistent representation unification under the language modeling objective.

compartmentalization · statistical strength · parallel representations · sample efficiency · phase transition

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

arXiv cs.LG · Chongyu Fan, Gaowen Liu, Mingyi Hong, Ramana Rao Kompella · 2026-05-19

The paper introduces Pion, a modified version of the Muon optimizer that addresses spectral whitening limitations in vision-language-action (VLA) training and reinforcement learning with verifiable rewards (RLVR). Pion replaces Muon's uniform spectral whitening with a high-pass Newton-Schulz iteration, promoting dominant singular values while suppressing noisy components, and supports per-head updates for preserving attention-head heterogeneity. Experiments on LIBERO and LIBERO-Plus show Pion achieving 100% success rate in VLA tasks versus 97.0% for Muon and 32.2% for AdamW, with similar gains in RLVR on Qwen3-1.7B/4B models.

spectral whitening · newton-schulz iteration · vision-language-action · reinforcement learning · attention heads
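To make the "uniform spectral whitening" in the Pion summary concrete, here is a minimal sketch of the classical cubic Newton-Schulz iteration, which pushes every singular value of a matrix toward 1. This is the textbook iteration only: it is neither Muon's tuned polynomial nor Pion's high-pass variant, whose coefficients are not given in this digest. The function name `newton_schulz_orthogonalize` is hypothetical.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Approximate the orthogonal polar factor U @ V.T of G = U @ S @ V.T.

    Each step maps every singular value s to 1.5*s - 0.5*s**3, which
    converges to 1 for s in (0, 1]; the singular vectors are unchanged.
    Assumes G has full row rank (illustrative restriction).
    """
    X = G / np.linalg.norm(G, 2)  # scale so the largest singular value is 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X
```

Because all singular values end up equal, the update direction treats dominant and noisy spectral components alike, which is the behavior the summary says a high-pass variant is meant to change.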

Do Better Volatility Forecasts Lead to Better Portfolios? Evidence from Graph Neural Networks

arXiv cs.LG · Rylan Wade · 2026-05-19

The paper demonstrates that volatility forecasting accuracy, cross-sectional ranking quality, and portfolio performance are distinct objectives in financial machine learning. Using weekly realized volatility data from 465 S&P 500 equities (2015-2025), the authors compare Heterogeneous Autoregressive, LSTM, and GraphSAGE models across correlation, sector, and Granger-causal graphs with macro regime features. Results show the best MSE, ranking accuracy, and Sharpe ratio metrics come from different models, indicating graph-based approaches only benefit portfolio rules that exploit their encoded cross-sectional structure.

realized volatility · graph neural networks · sharpe ratio · cross-sectional ranking · granger-causality

OpenCompass: A Universal Evaluation Platform for Large Language Models

arXiv cs.LG · Maosong Cao, Kai Chen, Haodong Duan, Yixiao Fang · 2026-05-19

The paper introduces OpenCompass, a modular and scalable evaluation platform for large language models (LLMs) addressing challenges in cross-domain benchmarking. The system features five core components: Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module, supporting rule-based, LLM-as-a-Judge, and cascaded evaluation methods. OpenCompass provides unified evaluation across multiple domains (knowledge, reasoning, computation, science, language, code) with high compatibility, flexibility, and concurrency, enabling efficient identification of LLM capabilities and optimization pathways.

large language models · benchmark evaluation · modular architecture · task partitioning · high-concurrency

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

arXiv cs.LG · Han Guo, Jack Zhang, Arjun Menon, Driss Guessous · 2026-05-19

CODA introduces a GPU kernel abstraction that reformulates Transformer block computations as GEMM-plus-epilogue programs, addressing memory-bound bottlenecks in training systems. The method algebraically reparameterizes normalization, activations, and residual updates to execute during GEMM tile retention on-chip, using composable epilogue primitives for scaling, reductions, and accumulation. Evaluations on Transformer workloads show that both human- and LLM-authored CODA kernels achieve high performance, demonstrating the approach's efficacy in combining framework productivity with hardware efficiency.

transformer · gemm · epilogue · kernel · memory-bound
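The kind of algebraic reparameterization described in the CODA summary can be illustrated with RMSNorm followed by a linear layer. The NumPy sketch below checks the identity that lets the normalization become a per-row scaling applied after the matrix multiply. It is an assumption-laden illustration of the general idea, not CODA's kernel abstraction or API; both function names are hypothetical.

```python
import numpy as np

def rmsnorm_then_gemm(x, gain, W, eps=1e-6):
    """Reference: normalize each row of x, apply the gain, then multiply by W."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return ((x / rms) * gain) @ W

def gemm_with_rmsnorm_epilogue(x, gain, W, eps=1e-6):
    """Same result, with normalization deferred to a per-row epilogue."""
    W_folded = gain[:, None] * W  # fold the gain into the weight rows once
    acc = x @ W_folded            # plain GEMM on the un-normalized input
    inv_rms = 1.0 / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return acc * inv_rms          # epilogue: scale each output row
```

In a fused GPU kernel, the row reduction and final scaling would run while the output tile is still on-chip, avoiding a separate memory-bound normalization pass; here the two functions only demonstrate that the rewrite is numerically equivalent.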

From Simple to Complex: Curriculum-Guided Physics-Informed Neural Networks via Gaussian Mixture Models

arXiv cs.LG · Jianan Yang, Yiran Wang, Shuai Li, Fujun Cao · 2026-05-19

The authors propose Curriculum-Guided Gaussian Mixture Physics-Informed Neural Networks (CGMPINN), a method combining Gaussian mixture modeling with dynamic curriculum learning to address training challenges in PINNs for PDEs. The approach periodically fits a GMM to residual distributions, implements a smooth curriculum schedule for progressive difficulty adaptation, and employs precision-based variance modulation. Theoretical analysis includes convergence guarantees and generalization bounds. Experiments on six PDE benchmarks demonstrate CGMPINN reduces relative $L_2$ error by up to 97.8% compared to standard PINNs.

physics-informed neural networks · gaussian mixture model · curriculum learning · partial differential equations · adaptive optimization

Backdooring Masked Diffusion Language Models

arXiv cs.LG · Daniel Yiming Cao, Chengzhong Wang, Sheng-Yen Chou, Chengyu Huang · 2026-05-19

The paper introduces SHADOWMASK, the first training-time backdoor attack for masked diffusion language models (MDLMs), addressing their unique discrete corruption and iterative denoising mechanics. The method modifies the forward process by replacing the standard all-mask terminal distribution with a trigger-mask mixture prior, creating a dedicated denoising pathway for trigger inputs while preserving clean performance. Evaluations on DiT-based MDLM and LLaDA-8B-Instruct across WikiText-103, OpenWebText, and Alpaca demonstrate near-100% attack success, minimal clean utility degradation, and robustness against fine-tuning and defenses.

masked diffusion language models · backdoor attack · denoising pathway · trigger-mask mixture · parameter-efficient fine-tuning

Beyond Extrapolation: Knowledge Utilization Paradigm with Bidirectional Inspiration for Time Series Forecasting

arXiv cs.LG · Liu Chong, Yingjie Zhou, Hao Li, Pengyang Wang · 2026-05-19

The paper proposes KUP-BI, a novel time-series forecasting paradigm that leverages bidirectional structural knowledge by approximating post-target continuation proxies from training data. The method distills continuation-style knowledge from historical trajectories and integrates it into standard forecasting models via a lightweight feature-level gating module, avoiding reliance on parametric extrapolation. Experiments on six public datasets demonstrate consistent performance improvements across state-of-the-art models with minimal computational overhead.

time-series forecasting · bidirectional inspiration · continuation proxy · structural knowledge · feature-level gating

GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning

arXiv cs.LG · Zhiyuan Fan, Gabriele Farina · 2026-05-19

The paper introduces $Q$-boosting, a variance-reduced advantage estimator for imperfect-information self-play reinforcement learning, addressing the high variance in generalized advantage estimation (GAE) caused by stochastic action sampling. The proposed Variance-Reduced Policy Optimization (VRPO) combines this estimator with a multi-step Expected SARSA$(λ)$ trace to compute policy expectations, reducing action-sampling noise while retaining PPO's clipped objective and on-policy updates. Empirical results demonstrate VRPO's strong performance in large-scale games like Dou Dizhu and Heads-Up No-Limit Texas Hold'em.
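
For reference, the baseline being critiqued, generalized advantage estimation, can be sketched in a few lines. This shows only the standard GAE recursion (TD residuals accumulated with decay γλ); it does not implement $Q$-boosting or VRPO, and the function name and default hyperparameters are illustrative assumptions.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE over one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), i.e. one more entry than rewards
             (the last entry is the bootstrap value, 0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual, built from the sampled reward and next state.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

print(gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0))  # [2.0, 1.0]
```

Every term in the recursion depends on the actions that happened to be sampled, which is the source of the action-sampling noise the summary refers to.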

generalized advantage estimation · self-play reinforcement learning · variance reduction · imperfect-information games · proximal policy optimization

Quantum Machine Learning for Cyber-Physical Anomaly Detection in Unmanned Aerial Vehicles: A Leakage-Free Evaluation with Proxy-Audited Feature Sets

arXiv cs.LG · Carlos A. Durán Paredes, Javier E. León Calderón, Nicolás Sánchez Perea, German Darío Díaz · 2026-05-19

The study presents a leakage-free evaluation of quantum machine learning for anomaly detection in unmanned aerial vehicles (UAVs) using the TLM:UAV benchmark. Key contributions include (i) a group-aware temporal protocol (B2) for dataset partitioning, (ii) a three-mode feature audit to quantify accuracy sources, and (iii) a hybrid XGBoost + Data Reuploading (DRU) classifier benchmarked against classical controls. Results show the trained-DRU hybrid exhibits a directional F1 macro improvement (+0.05) under strict feature auditing and the lowest mean false-alarm rate, though inter-seed variance limits statistical significance. The implementation is provided in Qiskit 2.x for NISQ-era aerospace cybersecurity.

quantum machine learning · anomaly detection · unmanned aerial vehicles · data reuploading · nisq-era

DeRegiME: Deep Regime Mixtures for Probabilistic Forecasting under Distribution Shift

arXiv cs.LG · Kieran Wood, Stefan Zohren, Stephen J. Roberts · 2026-05-19

DeRegiME introduces a deep regime mixture of experts for probabilistic forecasting under distribution shift, separating latent uncertainty regimes from the underlying signal via a sparse variational Gaussian process with a nonstationary regime-mixing kernel and Student-t likelihood. The method employs a shared gate to combine per-regime sub-kernels and noise processes, yielding an interpretable mean-residual-noise decomposition and regime transitions as implicit changepoints. Evaluated across ten benchmarks, DeRegiME improves negative log predictive density by 20.3%, CRPS by 3.0%, and MSE by 4.7% over encoder-matched baselines, demonstrating consistent gains across abrupt, gradual, and seasonal shifts.

probabilistic forecasting · gaussian process · distribution shift · regime mixture · sparse variational inference

Robust Mitigation of Age-Dependent Confounding Effects via Sample-Difficulty Decorrelation

arXiv cs.LG · Nikhil Cherian Kurian, Victor Caquilpan Parra, Abin Shoby, Luke Whitbread · 2026-05-19

The paper proposes a framework to mitigate age-dependent confounding in medical image classification by decorrelating sample difficulty from age trends, preserving diagnostically meaningful age information. The method employs a warm-up phase to model label-conditioned age-difficulty relationships, then applies Huber-weighted affinity weights for robust decorrelation, supplemented by an Age Coverage Score for stable optimization under limited age diversity. Evaluated on two radiology datasets, the approach reduces age-dependent disparities in true/false positive rates by 15-30% with <1% AUC impact, demonstrating robustness to train-test age distribution shifts.

confounding mitigation · sample-difficulty decorrelation · huber weighting · age coverage score · label-conditioned modeling

Worst-Group Equalized Odds Regularization for Multi-Attribute Fair Medical Image Classification

arXiv cs.LG · Nikhil Cherian Kurian, Victor Caquilpan Parra, Abin Shoby, Luke Whitbread · 2026-05-19

The paper introduces a worst-group equalized-odds margin regularizer for multi-attribute fair medical image classification, addressing disparities in true and false positive rates across demographic subgroups at fixed operating points. The method identifies extreme margin deviations in subgroups defined by attributes like age, sex, and race, applying a unified penalty without intersectional constraints. Evaluated on two medical imaging datasets in multi-label settings, it reduces Equalized Odds and Equalized Opportunity disparities while maintaining AUC performance.

equalized odds · multi-attribute fairness · margin regularizer · medical image classification · subgroup disparities

Precision Physical Activity Prescription via Reinforcement Learning for Functional Actions

arXiv cs.LG · Gefei Lin, Rui Miao, Jennifer Sacheck, Xiaoke Zhang · 2026-05-19

The paper introduces a reinforcement learning (RL) algorithm to optimize personalized physical activity (PA) distributions for health biomarkers, using step count data from the All of Us Research Program. The method addresses the lack of PA recommendation systems by modeling daily step distributions as continuous actions in an offline RL framework. Results show superior performance over existing continuous-action RL methods, with optimal policies recommending higher and more consistent step counts, tailored to subgroups based on glucose levels, BMI, blood pressure, age, and sex.

reinforcement learning · physical activity · biomarkers · offline learning · personalized recommendation

Sequential Consensus for Multi-Agent LLM Debates: A Wald-SPRT compute governor with calibration-based failure detection

arXiv cs.LG · Andrea Morandi · 2026-05-18

The paper introduces a Sequential Probability Ratio Test (SPRT)-based compute governor for multi-agent LLM debates that stops a debate as soon as consensus is reached or the maximum number of rounds (R_max) is exhausted. The method employs a Beta-distributed LLM judge score to estimate convergence likelihood, with calibration ensuring domain-specific validity. Evaluations on GSM8K and MMLU show 3.7x fewer LLM calls (1.01 average rounds) at 97.0% accuracy versus fixed-5 debates (99.0%), while MMLU reveals calibration failure (99.5% capping at 2.1x cost). The SPRT layer reduces compute but provides no accuracy guarantees.
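
A minimal sketch of how a Wald SPRT stopping rule over Beta-distributed judge scores might look is given below. The two Beta hypotheses, the error rates, and the names (`sprt_governor`, `beta_logpdf`) are assumptions chosen for illustration; the paper's actual parameterization and calibration procedure are not reproduced here, and the early "no_consensus" exit is simply the textbook two-sided form of the test.

```python
import math

def beta_logpdf(x, a, b):
    """Log density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)

def sprt_governor(scores, h1=(8.0, 2.0), h0=(2.0, 8.0),
                  alpha=0.05, beta=0.05, r_max=5):
    """Stop a debate early based on per-round judge scores in [0, 1].

    h1 / h0: assumed Beta parameters for "converged" / "not converged".
    alpha / beta: target type I / type II error rates (Wald thresholds).
    """
    upper = math.log((1 - beta) / alpha)   # cross this: accept H1
    lower = math.log(beta / (1 - alpha))   # cross this: accept H0
    llr = 0.0
    rounds = 0
    for rounds, s in enumerate(scores[:r_max], start=1):
        s = min(max(s, 1e-6), 1 - 1e-6)    # keep the log terms finite
        llr += beta_logpdf(s, *h1) - beta_logpdf(s, *h0)
        if llr >= upper:
            return "consensus", rounds
        if llr <= lower:
            return "no_consensus", rounds
    return "budget_exhausted", rounds

print(sprt_governor([0.95]))
print(sprt_governor([0.5, 0.5, 0.5, 0.5, 0.5]))
```

A single confident judge score is enough to cross the upper threshold, which is consistent with the near-one average round count reported above; uninformative scores run until the round budget is spent.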

sequential probability ratio test · multi-agent llm debate · beta likelihood · compute governor · calibration-based failure detection

A Cloud-Based Tool for Meteorite Recovery Using Drones and Machine Learning

arXiv cs.LG · Seamus L. Anderson, Hadrien A. R. Devillepoix, Lewis Lakerink, Sawitchaya Tippaya · 2026-05-18

The paper introduces a cloud-based tool integrating drones and machine learning for meteorite recovery, specifically targeting instrumentally observed falls. The system features iterative improvements over prior versions and has been tested in South and Western Australia. Results demonstrate both successes and limitations in field applications. The tool is accessible to the meteoritics community via https://find.gfo.rocks.

meteorite recovery · drones · machine learning · cloud-based tool · instrumentally observed falls

Activation Functions, Statistics and Learning of Higher-Order Interactions in Restricted Boltzmann Machines

arXiv cs.LG · Giovanni di Sarra, Yasser Roudi · 2026-05-18

The work analyzes how activation functions in Restricted Boltzmann Machines (RBMs) affect the representation and learning of higher-order interactions in data. By exploiting the duality between RBMs and interacting binary variable models, the authors characterize the space of representable models analytically for four activation functions: Linear, Step, ReLU, and Exponential. Results show that rapidly increasing nonlinearities (e.g., Exponential) facilitate learning of data with large higher-order interactions, while certain structures remain difficult to represent across all activation functions, with analytical predictions closely matching simulation outcomes.

restricted boltzmann machines · activation functions · higher-order interactions · nonlinearities · binary variables

Reducing Diffusion Model Memorization with Higher Order Langevin Dynamics

arXiv cs.LG · Benjamin Sterling, Mónica F. Bugallo, Tom Tirer · 2026-05-18

The paper theoretically characterizes how Higher-Order Langevin Dynamics (HOLD) reduces memorization in diffusion models by analyzing its regularization effect. HOLD introduces auxiliary variables (interpreted as velocity/acceleration) that impose dynamical constraints, causing the data variable's dynamics to follow a low-pass-filtered version of the learned score function. Theoretical analysis shows increased smoothness with higher-order HOLD, mitigating memorization risks while preventing distribution collapse. Empirical validation on real-world data confirms HOLD's advantage over standard diffusion models in reducing sample replication.

higher-order langevin dynamics · diffusion models · memorization · score function · distribution collapse

A Heuristic Approach for Performance Tuning in RL-based Quadrotor Control via Reward Design and Termination Conditions

arXiv cs.LG · Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, George Nikolakopoulos · 2026-05-18

The article presents a heuristic method for performance tuning in RL-based quadrotor control through reward design and termination conditions. A novel dual-bandwidth exponential reward structure enables critically damped setpoint tracking with low steady-state errors (∼2%), trained via Proximal Policy Optimization (PPO) in 6M time steps. Heuristic adjustments to reward weights and coefficients yield tunable settling times for acrobatic (fast) and inspection (slow) behaviors while maintaining baseline response characteristics. Evaluation across 100 trials demonstrates precise position/yaw tracking from random initial conditions.

reinforcement learning · quadrotor control · reward design · proximal policy optimization · setpoint tracking

Information Processing Capacity of Stationary Physical Systems: Theory, Data-efficient Estimation Methods, and Photonic Demonstration

arXiv cs.LG · Rahul Uma Ramachandran, Serge Massar · 2026-05-18

The authors extend the Information Processing Capacity (IPC) framework to stationary physical computing systems and prove fundamental bounds: each individual capacity lies in [0,1], the sum of capacities is bounded by the number of readouts, and noise lowers this bound. They develop data-efficient IPC estimation methods using Richardson extrapolation and Sobol quasi-random sampling, addressing finite-sample bias. Experimental validation with a nonlinear optical fibre photonic system shows IPC shifts toward higher-order nonlinear capacities under Kerr effect modulation. Total IPC correlates strongly with benchmark ML task performance, establishing it as a dimensionality measure linking physical dynamics to computational capability.

information processing capacity · physical computing · richardson extrapolation · kerr effect · quasi-random sampling

Identifiable Multimodal Causal Representation Learning under Partial Latent Sharing

arXiv cs.LG · Manal Benhamza, Marianne Clausel, Myriam Tami · 2026-05-18

The paper establishes component-wise identifiability guarantees for causal latent representations in multimodal data with partially shared latent structures, addressing a key challenge in causal representation learning (CRL). The authors propose a non-parametric approach under flexible assumptions, using nonlinear mixing functions to model modality-specific latent subsets without requiring parametric latent distributions. A differentiable Wasserstein-based module is introduced to recover the shared structure, compatible with diverse architectures. Experiments on synthetic and real-world datasets demonstrate superior performance over state-of-the-art methods.

causal representation learning · multimodal learning · identifiability · wasserstein distance · latent variable models

CLIC: Contextual Language-Informed Cardiac Pathology Classification

arXiv cs.LG · Giovani D. Lucafo, Rafael da Costa Silva, João Lucas Luz Lima Sarcinelli, Andre Guarnier De Mitri · 2026-05-18

The paper introduces CLIC (Contextual Language-Informed Cardiac pathology classification), a multimodal framework that enhances ECG-based diagnosis by integrating patient metadata and demographic variables through natural language encoding. The method translates contextual data into descriptive text, providing an informative anchor for disambiguating physiological patterns, and compares template-based clinical text with LLM-generated descriptions. Results show that controlled template-based text yields consistent classification improvements, though LLM-synthesized texts remain competitive in downstream performance.

electrocardiogram · multimodal framework · clinical text · large language models · pathology classification

Atomistic Modeling of Chemical Disorder in Materials: Bridging Classical Methods and AI-Assisted Approaches

arXiv cs.LG · Jiayu Peng, Peichen Zhong · 2026-05-18

The review addresses the representation gap between experimental and computational descriptions of chemical disorder in materials, proposing a framework that integrates classical and AI-driven methods. It evaluates techniques including mean-field theories, cluster expansion, Monte Carlo, and emerging AI approaches like universal interatomic potentials and generative models. The analysis demonstrates how AI can enhance disorder-native capabilities, such as configurational exploration, generative modeling of disordered structures, and kinetics-aware prediction, enabling more realistic AI-accelerated materials discovery.

chemical disorder · atomistic modeling · generative models · cluster expansion · monte carlo

Dual-Channel Tensor Neural Networks: Finite-Sample Theory and Conformal Structure Selection

arXiv cs.LG · Elynn Chen, Jiayu Li, Zheshi Zheng, Jian Pei · 2026-05-18

The paper introduces a Dual-Channel Tensor Neural Network (DC-TNN) that processes tensor data through coupled channels for a low-rank core and sparse refinement, accommodating CP, Tucker, and tensor-train decompositions. It establishes non-asymptotic risk bounds for the estimator, showing effective dimension depends on core rank and refinement sparsity. A conformal ROC procedure provides finite-sample, distribution-free coverage for uncertainty quantification, while a conformal structure selector chooses among tensor decompositions. Experiments on synthetic data and a protein dataset demonstrate improved predictive accuracy and structure recovery.

tensor neural networks · non-asymptotic risk bounds · conformal inference · structure selection · low-rank decomposition

Learning Interpretable Point-Based Clinical Risk Scores via Direct Optimization

arXiv cs.LG · Ying Cui, Albert M Li, Vivek Charu, Yeon-Mi Hwang · 2026-05-18

The paper introduces novel machine learning algorithms for constructing interpretable, point-based clinical risk scores via direct optimization of explicit objectives. The method employs a flexible greedy optimization strategy to learn additive scoring rules with nonnegative integer weights, addressing computational challenges of integer programming for nonconcave or discontinuous value functions. Applied to an Epic Cosmos EHR cohort, the approach constructs an integer-weighted comorbidity score for post-discharge mortality risk prediction, with performance validated through simulation studies.

clinical risk scores · integer programming · greedy optimization · electronic health records · comorbidity score

Performance Monitoring of Proton Exchange Membrane Water Electrolyzer by Transformers-Based Machine Learning Model

arXiv cs.LG · Bingqing Chen, Ivan Batalov, Qiu Chen, Weiqi Ji · 2026-05-18

The authors propose a transformer-based machine learning framework for real-time state-of-health monitoring in proton exchange membrane (PEM) water electrolyzers without interrupting operation. Their encoder-decoder architecture employs patch-based sequence tokenization of operational data to reconstruct polarization curves, achieving a 10× MSE reduction versus baseline transformers across four longitudinal tests (≤478 hours). The method enables continuous performance monitoring while capturing latent representations of degradation, suggesting potential for interpretable health indicators in green hydrogen production systems.

proton exchange membrane · transformer model · state-of-health · polarization curve · patch tokenization

Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training

arXiv cs.LG · Yanru Wu, Jianning Wang, Chongxin Gan, Yang Li · 2026-05-18

The paper proposes Grouped Sequential Training (GST), a heterogeneity-aware dataset scheduling method for efficient Audio Large Language Model (ALLM) training. GST organizes datasets into affinity-aware groups using gradient-based metrics and introduces them via progressive scheduling, balancing parallel training stability with sequential optimization efficiency. Evaluations on 14 AudioQA datasets show GST achieves 30–40% faster convergence than parallel training while matching or exceeding mix-all training performance, providing a scalable framework for multi-dataset ALLM optimization.

audio large language models · dataset heterogeneity · gradient-based affinity · grouped sequential training · audioqa

Chessformer: A Unified Architecture for Chess Modeling

arXiv cs.LG · Daniel Monroe, George Eilender, Philip Chalmers, Zhenwei Tang · 2026-05-18

Chessformer introduces a unified transformer architecture for chess modeling that simultaneously advances playing strength, human move prediction, and interpretability. The encoder-only model represents board squares as tokens, employs Geometric Attention Bias (GAB) for dynamic positional encoding, and uses an attention-based source-destination policy head. Evaluations show state-of-the-art human move prediction (57.1% accuracy), +100 Elo improvement in Leela Chess Zero, and granular interpretability via square-token attention patterns. Results demonstrate domain-aligned design enables concurrent gains across performance metrics.

geometric attention bias · source-destination policy · square-token representation · human move prediction · encoder-only transformer

The impact of observation density on Bayesian inversion of latent dynamics in shock-dominated flows

arXiv cs.LG · Bipin Tiwari, Muhammad Abid, Omer San · 2026-05-18

The authors present a non-intrusive reduced-order modeling framework for Bayesian initial-state inversion in shock-dominated compressible flows, addressing the ill-posed inverse problem through uncertainty quantification. The method combines a convolutional autoencoder (32D latent space) with a learned latent-space forward operator, trained on 500 high-fidelity Sod shock tube simulations solved via fifth-order WENO scheme. Results demonstrate accurate reconstruction of shock-tube structures (rarefaction wave, contact discontinuity) and show that increased observation density reduces posterior uncertainty by 78% (density) and 76% (pressure), with 250 training simulations yielding sufficient accuracy.

bayesian inversion · reduced-order modeling · convolutional autoencoder · shock-dominated flows · uncertainty quantification

Mapping Uncharted Symmetries: Machine Discovery in Combinatorics

arXiv cs.LG · Eugenio Cainelli, Lorenzo Luccioli, Alessandro Iraci, Michele D'Adderio · 2026-05-18

The paper demonstrates machine learning's capacity for verifiable mathematical discovery by addressing combinatorial function construction under exact constraints (SLURP problem). Two novel methods are introduced: MapSeek-Functional (alternating pseudo-labeling and supervised training) and MapSeek-Symbolic (direct symbolic formula generation). Applied to algebraic combinatorics, these methods yield a new combinatorial interpretation of $q,t$-Narayana polynomials via noncrossing partitions, resolving a previously open case with a symmetry proof. All discoveries are formally verified in Lean 4, with full code released for reproducibility.

slurp · mapseek · narayana · noncrossing · lean4

Provably Data-driven Lagrangian Relaxation for Mixed Integer Linear Programming

arXiv cs.LG · Tung Quoc Le, Anh Tuan Nguyen, Viet Anh Nguyen · 2026-05-18

The paper establishes theoretical foundations for learning Lagrangian Relaxation (LR) multipliers in Mixed Integer Linear Programming (MILP) via data-driven methods. It derives a generalization bound of O(s^1.5/√N) for learned multipliers, proves a minimax lower-bound of Ω(s/√N), and shows Stochastic Gradient Ascent (SGA) achieves the optimal Θ(s/√N) rate. The framework extends to learning-to-warm-start, attaining Θ(s/N) rates. Contributions include tight bounds on sample complexity and constructive proofs for SGA optimality in LR contexts.
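
As background for what a Lagrangian Relaxation multiplier is, the sketch below relaxes the single constraint of a toy binary covering problem and improves the multiplier by projected subgradient ascent. It is deterministic and works on one instance, so it is only a stand-in for the paper's stochastic, data-driven setting; the toy problem, the step-size rule, and all function names are assumptions made for this example.

```python
from itertools import product

def dual_value(lmbda, c, a, b):
    """Evaluate L(lmbda) = min_x c.x + lmbda * (b - a.x) over x in {0,1}^n.

    Returns the dual value and a subgradient with respect to lmbda.
    """
    # The relaxed problem separates: take x_i = 1 iff its reduced cost is negative.
    x = [1 if ci - lmbda * ai < 0 else 0 for ci, ai in zip(c, a)]
    value = lmbda * b + sum((ci - lmbda * ai) * xi for ci, ai, xi in zip(c, a, x))
    subgrad = b - sum(ai * xi for ai, xi in zip(a, x))
    return value, subgrad

def subgradient_ascent(c, a, b, steps=200, step0=1.0):
    """Projected subgradient ascent on lmbda >= 0 with a 1/t step size."""
    lmbda, best = 0.0, float("-inf")
    for t in range(1, steps + 1):
        value, g = dual_value(lmbda, c, a, b)
        best = max(best, value)
        lmbda = max(0.0, lmbda + step0 / t * g)
    return best

def brute_force_optimum(c, a, b):
    """Exact optimum of min c.x s.t. a.x >= b, x binary (tiny n only)."""
    costs = [
        sum(ci * xi for ci, xi in zip(c, x))
        for x in product((0, 1), repeat=len(c))
        if sum(ai * xi for ai, xi in zip(a, x)) >= b
    ]
    return min(costs)

c, a, b = [4, 3, 5], [2, 1, 3], 4
print("best dual bound:", subgradient_ascent(c, a, b))
print("integer optimum:", brute_force_optimum(c, a, b))
```

By weak duality the dual bound never exceeds the integer optimum, so a better multiplier gives a tighter lower bound; learning good multipliers from data is the quantity the generalization bounds above are about.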

lagrangian relaxation · mixed integer linear programming · generalization bound · stochastic gradient ascent · minimax optimality

Generative Pseudo-Force Fields for Molecular Generation

arXiv cs.LG · Stefaan Simon Pierre Hessmann, Khaled Kahouli, Stefan Gugler, Michael Plainer · 2026-05-18

The paper introduces generative pseudo-force fields (GPFFs), a method combining energy-based relaxation and data-driven generation for molecular conformations. GPFFs train a machine learning force field (MLFF) on a quadratic pseudo-potential energy surface derived from reference equilibrium structures, eliminating the need for costly ab-initio data. The approach is shown to be a time-step-agnostic variant of variance exploding diffusion models, enabling efficient sampling with arbitrary structural priors. On QM9, GPFFs achieve 100% validity at 256 neural function evaluations (NFE) and over 50% at 6 NFE, outperforming diffusion baselines across all samplers.

generative pseudo-force fields · molecular conformations · machine learning force field · diffusion models · neural function evaluations

KVBuffer: IO-aware Serving for Linear Attention

arXiv cs.LG · Longwei Zou, Lin Zhong · 2026-05-18

KVBuffer introduces an IO-aware serving mechanism for linear attention to address inefficiencies in recurrent state updates during long-context inference. The method buffers recent keys and values, enabling chunkwise computation for decoding, parallel verification for speculative decoding, and direct attention output computation for short contexts. Implemented in SGLang for Qwen3-Next, KVBuffer reduces decoding latency by up to 45.17% and increases maximum serving requests by 5x for speculative decoding with four draft tokens.
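
The reason buffering keys and values can replace per-token state updates is a simple linearity identity, sketched below for plain (ungated, unnormalized) linear attention. That simplified attention form is an assumption made for the example, as are the function names; the gated variants used in production models and the actual SGLang implementation are not shown.

```python
def outer_add(S, k, v):
    """In-place rank-one update S += k v^T, with S of shape d_k x d_v."""
    for i, ki in enumerate(k):
        for j, vj in enumerate(v):
            S[i][j] += ki * vj

def read(S, q):
    """Attention output q^T S for a single query."""
    return [sum(q[i] * S[i][j] for i in range(len(q))) for j in range(len(S[0]))]

def decode_recurrent(S, q, keys, values):
    """Baseline: fold every recent token into the state, then read it."""
    S = [row[:] for row in S]  # copy so the caller's state is untouched
    for k, v in zip(keys, values):
        outer_add(S, k, v)
    return read(S, q)

def decode_buffered(S, q, buf_keys, buf_values):
    """Buffered: leave S as is and attend directly over the buffered tokens."""
    out = read(S, q)
    for k, v in zip(buf_keys, buf_values):
        score = sum(qi * ki for qi, ki in zip(q, k))  # q . k_i
        for j, vj in enumerate(v):
            out[j] += score * vj
    return out

S = [[0.5, -1.0], [2.0, 0.0]]
q = [1.0, 2.0]
keys = [[0.1, 0.2], [0.3, -0.4]]
values = [[1.0, 0.0], [0.5, 2.0]]
print(decode_recurrent(S, q, keys, values))
print(decode_buffered(S, q, keys, values))
```

Both functions return the same output up to rounding, because q^T (S + Σ k_i v_i^T) = q^T S + Σ (q · k_i) v_i^T. The buffered form defers the state writes, so they can later be applied to the state in one chunk.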

linear attention · speculative decoding · memory access · chunkwise computation · io-aware serving

Guiding Neuro-Symbolic Scenario Generation with Spatio-Temporal Logic

arXiv cs.LG · Lorenzo Bonin, Francesco Giacomarra, Luca Bortolussi, Jyotirmoy V. Deshmukh · 2026-05-18

The paper introduces STRELGen, a neuro-symbolic framework for generating safety-critical autonomous driving scenarios. The method combines a multi-agent trajectory diffusion model with differentiable Spatio-Temporal Logic (STREL) specifications, enabling gradient-based optimization in latent space to produce plausible edge cases. This approach addresses the inefficiency of brute-force real-world testing by generating targeted scenarios that satisfy complex safety constraints while remaining within the learned data distribution. Results demonstrate efficient synthesis of interpretable, safety-critical multi-agent interactions for stress-testing autonomous systems.

autonomous driving · diffusion models · spatio-temporal logic · scenario generation · safety validation

RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning

arXiv cs.LG · Ehsan Ahmadi, Hunter Schofield, Behzad Khamidehi, Fazel Arasteh · 2026-05-18

RLFTSim introduces a reinforcement learning fine-tuning framework for multi-agent traffic simulation, enhancing realism and controllability by aligning simulator rollouts with real-world data distributions. The method builds on a pre-trained model, employs a dense reward signal balancing fidelity and controllability, and demonstrates state-of-the-art performance on the Waymo Open Motion Dataset. Results show improved realism with fewer samples compared to heuristic search-based methods, while effectively enabling goal-conditioned scenario generation.

multi-agent simulation · reinforcement learning fine-tuning · goal-conditioned controllability · waymo open motion dataset · dense reward signal

Learning When to Adapt

arXiv cs.LG · Ali Zindari, Xiaowen Jiang, Rotem Mulayoff, Sebastian U. Stich · 2026-05-18

The paper introduces DISeL (Dynamic Input-Sensitive LoRA), a parameter-efficient fine-tuning method that addresses catastrophic forgetting in static low-rank adaptation (LoRA) by incorporating input-dependent gating over rank-one components. DISeL preserves pre-trained model behavior by default while activating task-specific components during fine-tuning, adding minimal parameters. Evaluated on RoBERTa (GLUE), Llama, and Mistral for mathematical reasoning and code generation, DISeL reduces forgetting compared to LoRA variants while maintaining competitive accuracy. The gating mechanism also provides interpretable insights into layer-wise and rank-wise adaptation patterns.

low-rank adaptation · catastrophic forgetting · parameter-efficient fine-tuning · input-dependent gating · interpretable adaptation

Conformal Prediction via Transported Beta Laws

arXiv cs.LG · Thiago R. Ramos, Helton Graziadei, Luben M. C. Cabezas · 2026-05-18

The paper introduces a framework for analyzing calibration-conditional coverage in conformal prediction by modeling it as a transported beta law. Using Wasserstein distances on [0,1], the method quantifies departures from the ideal beta reference distribution under non-i.i.d. settings, distinguishing between test-side shifts (via transport maps) and calibration dependence (via order-statistic law changes). Theoretical bounds are derived for marginal coverage gaps and bad-calibration probabilities, with applications to scale-shift, clustered, and stationary mixing scenarios. Simulations demonstrate that first-order approximations accurately track empirical Wasserstein distances even at moderate sample sizes.
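
The "ideal beta reference distribution" is the classical result for split conformal prediction with exchangeable data and continuous scores: given n calibration points and miscoverage level α, the coverage conditional on the calibration set follows Beta(k, n + 1 − k) with k = ⌈(n + 1)(1 − α)⌉. The sketch below computes only that reference law; it does not implement the paper's transport maps or Wasserstein bounds, and the helper names are made up for the example.

```python
import math
from fractions import Fraction

def conformal_rank(n, alpha):
    """Rank k of the calibration score used as threshold: ceil((n+1)(1-alpha))."""
    a = Fraction(str(alpha))  # exact arithmetic avoids float rounding at the ceiling
    return math.ceil((n + 1) * (1 - a))

def coverage_beta_params(n, alpha):
    """Parameters (k, n + 1 - k) of the Beta law of calibration-conditional coverage."""
    k = conformal_rank(n, alpha)
    if k > n:
        raise ValueError("calibration set too small for this alpha")
    return k, n + 1 - k

def beta_mean_var(a, b):
    """Exact mean and variance of Beta(a, b) for integer parameters."""
    mean = Fraction(a, a + b)
    var = Fraction(a * b, (a + b) ** 2 * (a + b + 1))
    return mean, var

a, b = coverage_beta_params(n=99, alpha=0.1)
mean, var = beta_mean_var(a, b)
print(a, b, float(mean), float(var) ** 0.5)
```

With 99 calibration points and α = 0.1, the reference law is Beta(90, 10): coverage averages exactly 0.9 but still fluctuates from one calibration set to the next, and that spread is the baseline against which departures are measured.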

conformal prediction · wasserstein distance · beta law · calibration-conditional coverage · order-statistic law

Deep Neural Sheaf Diffusion

arXiv cs.LG · Remi Bourgerie, Sarunas Girdzijauskas, Viktoria Fodor · 2026-05-18

The paper introduces Deep Neural Sheaf Diffusion (DNSD), a novel approach to address representation collapse in deep Graph Neural Networks (GNNs) by replacing the sheaf Laplacian with a sheaf adjacency operator. DNSD incorporates normalization, odd nonlinearities, and gating to maintain signal integrity across layers. The method is theoretically contrasted with graph attention mechanisms, emphasizing matrix-valued edge functions and node representation normalization. Empirical results show DNSD outperforms GNN and Neural Sheaf Diffusion (NSD) baselines by up to 30 percentage points in accuracy on synthetic datasets and consistently on real-world benchmarks, positioning sheaf-based architectures as viable for graph foundation models.

neural sheaf diffusion · graph neural networks · sheaf laplacian · adjacency operator · representation collapse

LoRA vs. Full Fine-Tuning: A Theoretical Perspective

arXiv cs.LG · Ali Zindari, Rotem Mulayoff, Sebastian U. Stich · 2026-05-18

The paper provides a theoretical analysis comparing Low-Rank Adaptation (LoRA) and full fine-tuning in linear regression, identifying conditions where LoRA achieves lower excess risk. By modeling the pretraining-downstream task relationship as a low-rank difference, the analysis shows LoRA outperforms full fine-tuning when this difference is effectively low-rank. Theoretical results demonstrate that optimal rank selection can improve generalization despite reduced expressivity, with experimental validation suggesting broader applicability beyond linear settings.

low-rank adaptation · excess risk · fine-tuning · generalization performance · linear regression

SAGA: A Sequence-Adaptive Generative Architecture for Multi-Horizon Probabilistic Forecasting with Adaptive Temporal Conformal Prediction

arXiv cs.LG · Gustav Olaf Yunus Laitinen-Fredriksson Lundström-Imanov, Hafize Gonca Cömert · 2026-05-18

The paper introduces SAGA, a decoder-only transformer architecture for multi-horizon probabilistic forecasting of irregular tabular panel sequences, enhanced with adaptive temporal conformal prediction for finite-sample coverage guarantees. The method processes longitudinal data from 2,143,817 individuals in the Swedish LISA register (1990-2022), forecasting annual labor earnings at 1-30 year horizons and aggregating them into lifetime earnings distributions via Monte Carlo. SAGA reduces continuous ranked probability score by 31.9% at 10 years and mean absolute error by 37.7% at 20 years versus parametric baselines, while conformal intervals maintain nominal coverage within 0.4 percentage points marginally.

decoder-only transformer · conformal prediction · probabilistic forecasting · panel data · monte carlo aggregation

Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

arXiv cs.LG · Anis Radianis · 2026-05-18

The paper introduces Learn-by-Wire Guard (LBW-Guard), a bounded autonomous training-control governance layer that operates above AdamW to mitigate instability in large language model training. LBW-Guard monitors training telemetry and applies bounded control to optimizer execution without altering fixed objectives, evaluated on Qwen2.5 models (3B-14B) and TinyLlama-1B under stress conditions. Results show an 18.7% perplexity reduction (13.21→10.74) and 1.10x speedup for Qwen2.5-7B, with maintained trainability under aggressive learning rates (LR=3e-3: 1885.24→11.57 perplexity) where AdamW fails.

training stability · optimizer governance · learning-rate stress · perplexity reduction · bounded control

EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction

arXiv cs.LG · Ahmad Yehia, Abduallah Mohamed, Tianyi Wang, Jiseop Byeon · 2026-05-18

The authors introduce EgoTraj, a multimodal egocentric dataset for human trajectory prediction, addressing the scarcity of real-world egocentric trajectory data. Collected using Meta Quest Pro, EgoTraj comprises 75 sequences of human navigation in urban environments, featuring synchronized RGB video, 6-DOF head poses, 3D eye gaze vectors, and scene annotations. Benchmarking state-of-the-art methods reveals the utility of gaze, scene, and motion cues for trajectory prediction, demonstrating EgoTraj's potential for AR-based perception and assistive systems.

egocentric trajectory · multimodal dataset · 6-dof head poses · 3d eye gaze · ar-based perception

Distance-Aware Muon: Adaptive Step Scaling for Normalized Optimization

arXiv cs.LG · Yury Demidovich, Abhishek Chakraborty, Grigory Malinovsky, Angelia Nedić · 2026-05-18

The paper introduces three adaptive step-scaling algorithms for the Muon optimizer, addressing sensitivity to step scale in normalized optimization. Distance-Adaptive Muon uses trajectory radius for trust-region scaling, proving stationarity for smooth non-convex objectives. Scale-Calibrated Muon employs local descent certificates for star-convex objectives, achieving O(1/T) objective-gap bounds. Distance-Free Muon removes the need to know the distance to the minimizer by using scalar certificates. Experiments on GPT-124M/WikiText-103 and ViT-Tiny/CIFAR-100 demonstrate reduced tuning sensitivity and performance that matches or exceeds fixed-scale baselines.

normalized optimization · adaptive scaling · trust-region methods · star-convex objectives · trajectory radius

📰 Industry Media (8)

Roundtables: Inside the Musk v. Altman Trial

MIT Tech Review — AI · MIT Technology Review · 2026-05-19

The California Superior Court ruled against Elon Musk's lawsuit alleging OpenAI executives Sam Altman and Greg Brockman misrepresented the company's nonprofit status, as analyzed by MIT Technology Review's legal correspondent Michelle Kim. The trial proceedings revealed Musk's claims of deception regarding OpenAI's governance structure and his concurrent development of competing AI systems through xAI. Key testimony included internal communications about model distillation practices and recruitment attempts between the parties. The verdict maintains OpenAI's current organizational framework amid ongoing debates about AI safety and commercial competition in foundation model development.

nonprofit status · model distillation · foundation models · governance structure · legal precedent

How to Build Knowledge Graph Generation Pipelines From Text With kg-gen, NetworkX Analytics, and Interactive Visualizations

MarkTechPost · Sana Hassan · 2026-05-20

The tutorial presents a comprehensive pipeline for generating knowledge graphs from unstructured text using kg-gen, NetworkX, and PyVis. The method employs LLMs (GPT-4o-mini via LiteLLM) for entity-relation extraction, implements chunking and clustering for long documents, and demonstrates multi-source aggregation with entity resolution. Results include NetworkX-based centrality analysis (degree: 0.317 for 'Deep learning'), Louvain community detection (4 communities in AI text), and interactive PyVis visualizations with PageRank-weighted nodes (size=12+80*PR). The system exports to JSON/GraphML and supports neighborhood queries (2-hop around 'machine learning').

knowledge graph · entity resolution · pagerank · louvain communities · graphml
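
The analytics steps named in this summary (degree centrality, PageRank-weighted node sizes, and a 2-hop neighborhood query) can be sketched without any dependencies. This is a minimal stdlib illustration on a hypothetical toy graph, not the tutorial's code: the entity names, edges, and function names are assumptions, and only the sizing rule `12 + 80*PR` is taken from the summary. In NetworkX the corresponding calls would be `nx.degree_centrality`, `nx.pagerank`, and `nx.ego_graph(G, node, radius=2)`.

```python
from collections import deque

# Hypothetical entity pairs, standing in for extracted relations.
edges = [
    ("deep learning", "machine learning"),
    ("deep learning", "neural networks"),
    ("machine learning", "artificial intelligence"),
    ("neural networks", "backpropagation"),
    ("artificial intelligence", "robotics"),
]

# Undirected graph stored as adjacency sets.
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)


def degree_centrality(g):
    """Degree divided by (n - 1), the normalisation NetworkX also uses."""
    n = len(g)
    return {node: len(nbrs) / (n - 1) for node, nbrs in g.items()}


def pagerank(g, damping=0.85, iters=100):
    """Plain power iteration; assumes every node has at least one neighbour."""
    n = len(g)
    rank = {node: 1.0 / n for node in g}
    for _ in range(iters):
        rank = {
            node: (1 - damping) / n
            + damping * sum(rank[nb] / len(g[nb]) for nb in g[node])
            for node in g
        }
    return rank


def within_hops(g, start, hops=2):
    """Breadth-first search returning all nodes at most `hops` edges away."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == hops:
            continue
        for nb in g[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return set(dist)


def node_size(pr):
    """Sizing rule quoted in the summary: size = 12 + 80 * PR."""
    return 12 + 80 * pr


ranks = pagerank(graph)
sizes = {node: node_size(pr) for node, pr in ranks.items()}
print(degree_centrality(graph)["deep learning"])  # 2 neighbours out of 5 -> 0.4
print(sorted(within_hops(graph, "machine learning", hops=2)))
```

On this toy graph, "backpropagation" sits three edges from "machine learning", so the 2-hop query leaves it out.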

NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6× Tokens Per Forward Over Qwen3-8B

MarkTechPost · Asif Razzaq · 2026-05-20

NVIDIA introduces Nemotron-Labs-Diffusion, a tri-mode language model family (3B/8B/14B) unifying autoregressive (AR), diffusion-based parallel, and self-speculation decoding within a single architecture. The model employs a joint AR-diffusion training objective (α=0.3) with two-stage pretraining (1T AR tokens + 300B joint tokens), achieving 63.61% average accuracy on a 10-task evaluation in AR mode (8B). Key innovations include block-wise bidirectional diffusion (2.57× tokens/forward), LoRA-enhanced linear self-speculation (5.99× tokens/forward), and quadratic self-speculation (6.38× tokens/forward), with a 2.4× speedup over Qwen3-8B at batch size 1.

tri-mode language model · diffusion-based parallel decoding · self-speculation decoding · joint ar-diffusion objective · lora-enhanced linear self-speculation

Alibaba Qwen Team Introduces Qwen3.5-LiveTranslate-Flash: Real-Time Multimodal Interpretation Across 60 Languages at 2.8-Second Latency

MarkTechPost · Asif Razzaq · 2026-05-20

Alibaba's Qwen team introduces Qwen3.5-LiveTranslate-Flash, a multimodal real-time translation system achieving 2.8-second latency across 60 input languages. The model employs semantic unit prediction for streaming output, integrates visual cues (lip movements, gestures) to disambiguate noisy audio, and performs real-time voice cloning from a single utterance. Benchmarked on FLEURS and CoVoST2, it outperforms commercial alternatives while offering dynamic keyword injection for domain-specific terminology. The system supports 29 speech output languages via a WebSocket API with vision-audio fusion.

multimodal translation · semantic unit prediction · voice cloning · dynamic keyword configuration · websocket protocol

Google Introduces Gemini 3.5 Flash at I/O 2026: A Faster and Cheaper Model for AI Agents and Coding

MarkTechPost · Michal Sutter · 2026-05-20

Google introduced Gemini 3.5 Flash, a cost-efficient variant of its Gemini series optimized for AI agents and coding tasks. The model features a 1,048,576-token context window, multimodal input support, and dynamic compute allocation for complex problems. Benchmark results show superior performance over Gemini 3.1 Pro, with Terminal-Bench 2.1 (76.2%), GDPval-AA (1656 Elo), MCP Atlas (83.6%), and CharXiv Reasoning (84.2%). Priced at $1.50/M input tokens and $9.00/M output tokens, it integrates with Google's Managed Agents API and Antigravity 2.0 platform for enterprise-scale agentic workflows.

gemini 3.5 flash · managed agents api · antigravity 2.0 · multimodal understanding · dynamic compute allocation
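
To make the per-token pricing in this summary concrete, here is a small cost estimate using only the two rates quoted above ($1.50 per million input tokens, $9.00 per million output tokens). The function name and the example request sizes are hypothetical, and real billing may include factors the summary does not mention.

```python
# Rates as quoted in the summary, in USD per one million tokens.
INPUT_USD_PER_MILLION = 1.50
OUTPUT_USD_PER_MILLION = 9.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Linear cost estimate from the quoted input and output rates."""
    return (
        input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
        + output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    )


# Hypothetical agent call: 200k tokens of context in, 10k tokens out.
cost = estimate_cost_usd(200_000, 10_000)
print(f"${cost:.2f}")  # prints $0.39
```

At these rates an output token costs six times as much as an input token, so long generations drive the bill more than long prompts do.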

Upstash for Redis vs Supabase vs Neon: Which One Fits Vibe Coding Workflows in 2026?

MarkTechPost · Michal Sutter · 2026-05-19

The article provides a technical comparison of Upstash for Redis, Supabase, and Neon, clarifying their distinct roles in serverless architectures. Upstash specializes in HTTP-based Redis for caching and rate-limiting in edge environments, Supabase offers a full-stack BaaS with PostgreSQL, auth, and storage, while Neon provides serverless PostgreSQL with scale-to-zero compute and copy-on-write branching. Key findings highlight Upstash's compatibility with Vercel/Cloudflare Workers, Supabase's integrated AI tooling, and Neon's cost efficiency for idle workloads. The analysis emphasizes their complementary nature rather than direct competition.

serverless postgresql · http-based redis · copy-on-write branching · backend-as-a-service · scale-to-zero

Enterprise AI roadblocks and roadmaps, security and physical AI: Day two at TechEx

AI News · Joe Green · 2026-05-19

TechEx North America's second day analyzed enterprise AI deployment challenges, identifying pilot-to-production scaling as a critical bottleneck. Sessions emphasized agentic AI specialization, data infrastructure readiness, and token-based cost management, while infrastructure discussions contrasted build-vs-buy decisions for physical compute. Cybersecurity tracks highlighted velocity gaps between AI adoption and governance, proposing zero-trust architectures for agent permissions. Physical AI emerged as a focus area beyond LLMs, with hands-on workshops demonstrating agent self-improvement techniques via Google Colab instances.

agentic ai · zero-trust · token-based charging · velocity gap · physical ai


Generated automatically at 2026-05-20 21:36 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.