Daily Digest — 2026-06-13

Friday, June 12, 2026 · 315 items · model: deepseek/deepseek-chat

315 items · 3 research labs, 312 arXiv papers

⚠️ Source issues today:
  • MarkTechPost: all feed URLs failed (last tried: https://www.marktechpost.com/feed/)
  • AI News: all feed URLs failed (last tried: https://artificialintelligence-news.com/feed/)

🏛️ Research Labs (3)

New OpenAI Academy courses for the next era of work

OpenAI News · 2026-06-12

OpenAI introduces three new courses in OpenAI Academy to enhance organizational AI fluency: AI Foundations, Applied AI Foundations, and Agents and Workflows. These courses focus on practical application, from basic prompting and context provision to structured workflows and agent-assisted tasks. Developed in collaboration with BCG, Accenture, and BBVA, the curriculum emphasizes hands-on learning tailored to real-world work scenarios. Completion certificates are provided to recognize skill acquisition and encourage workflow sharing. The courses aim to bridge the gap between AI deployment and value creation, evolving alongside OpenAI's models and products to ensure relevance and safety in enterprise applications.

prompting · workflows · agent-assisted · fluency · deployment

How Preply combines AI and human tutors to personalize learning

OpenAI News · 2026-06-12

Preply integrates OpenAI's API to enhance language learning through AI-generated Lesson Insights, combining human tutoring with automated feedback. The system analyzes lesson transcripts to provide personalized grammar, vocabulary, and pronunciation corrections, reducing administrative burden for tutors and improving learner engagement. Reported results include 95% weekly active ChatGPT usage among employees, 75% adoption by English learners, and a 4.7/5 satisfaction rating. Preply also uses OpenAI's Codex for engineering workflows, with 94% of engineers using it to accelerate development tasks. The approach treats AI adoption as a cultural transformation, focusing on high-impact use cases and partnerships to augment human capabilities.

openai api · lesson insights · chatgpt enterprise · codex · personalized feedback

olmo-eval: An evaluation workbench for the model development loop

Hugging Face Blog · 2026-06-12

The authors introduce olmo-eval, an evaluation workbench designed to streamline the iterative development of large language models (LLMs). olmo-eval extends the Open Language Model Evaluation Standard (OLMES) by offering modular components for defining benchmarks, running evaluations across model checkpoints, and analyzing results at both aggregate and per-question levels. Key features include a task/suite/harness abstraction, a sandbox layer for tool-enabled evaluations, and a normalized experiment schema for reproducibility. The tool supports lightweight and containerized execution modes, enabling efficient comparison of model interventions. olmo-eval aims to address the challenges of continuous evaluation during LLM development.

large language models · benchmarking · reproducibility · tool-enabled evaluation · model checkpoints

📜 arXiv Papers (312)

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

arXiv cs.AI · Zilin Xiao, Qi Ma, Chun-cheng Jason Chen, Xintao Chen · 2026-06-11

Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT) is introduced as a post-training framework to enhance language models' reasoning by analogy. RA-RFT employs gold-relevance distillation to train a retriever that prioritizes contexts based on expected reasoning benefit rather than semantic similarity, followed by reinforcement fine-tuning using retrieved analogous demonstrations. This approach enables models to leverage reasoning traces under verifiable outcome rewards. Empirical results demonstrate RA-RFT's superiority over standard reinforcement fine-tuning methods, improving AIME 2025 average@32 accuracy by 7.1 and 2.8 points for Qwen3-1.7B and Qwen3-4B, respectively, highlighting reasoning-aware retrieval as a complementary improvement axis.

retrieval-augmented generation · gold-relevance distillation · reinforcement fine-tuning · reasoning by analogy · reasoning-aware retrieval

Mana: Dexterous Manipulation of Articulated Tools

arXiv cs.AI · Zhao-Heng Yin, Guanya Shi, Pieter Abbeel, C. Karen Liu · 2026-06-11

Mana (Manipulation Animator) introduces a sim-to-real framework for dexterous manipulation of articulated tools by reformulating it as an animation problem. The method employs a coarse-to-fine pipeline that converts procedurally-generated grasp keyframes into manipulation trajectories using motion planning and reinforcement learning, with minimal human input (<1 minute per tool for affordance specification). Evaluated on four articulated tools with varying scales and joint types, Mana achieves zero-shot sim-to-real transfer for both grasping and in-hand manipulation, demonstrating scalability.

articulated tools · sim-to-real · dexterous manipulation · motion planning · reinforcement learning

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

arXiv cs.AI · Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su · 2026-06-11

The paper introduces SpatialClaw, a training-free framework enhancing spatial reasoning in vision-language models (VLMs) by using code as an action interface. It employs a stateful Python kernel pre-loaded with perception/geometry primitives, enabling stepwise executable cell generation conditioned on prior outputs. Evaluated across 20 benchmarks for static/dynamic 3D/4D reasoning, SpatialClaw achieves 59.9% average accuracy (+11.2 points over prior work), demonstrating consistent improvements across six VLM backbones without task-specific adaptation.

spatial reasoning · vision-language models · python kernel · perception primitives · geometry primitives

Automated reproducibility assessments in the social and behavioral sciences using large language models

arXiv cs.AI · Tobias Holtdirk, Pietro Marcolongo, Anna Steinberg Schulten, Felix Henninger · 2026-06-11

This study demonstrates that large language models (LLMs) can automate reproducibility assessments in social and behavioral sciences, offering a scalable alternative to manual reanalysis. Using 76 published studies, an LLM pipeline recovered original effect sizes within a tolerance of ±0.05 Cohen's d for 41% of cases and matched qualitative conclusions in 96% of studies, outperforming human reanalysts (34% effect size recovery, 74% conclusion agreement). The authors also report 7 studies for which the LLMs failed to produce viable effect size estimates, highlighting both the capabilities and the limitations of automated reproducibility auditing.

reproducibility · effect size · cohen's d · llm pipeline · qualitative conclusion
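
To make the ±0.05 tolerance criterion in the item above concrete, here is a minimal sketch of how a reproduced effect size could be compared with a reported one. The function names and the pooled-standard-deviation form of Cohen's d are assumptions for illustration; the summary does not say which variant of d the paper uses.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (one common variant)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def within_tolerance(reported_d, reproduced_d, tol=0.05):
    """True when the reproduced effect size is within +/- tol of the reported one."""
    return abs(reported_d - reproduced_d) <= tol


if __name__ == "__main__":
    d = cohens_d([2, 4, 6], [1, 3, 5])  # hypothetical toy data
    print(d, within_tolerance(0.50, d))
```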

Agents-K1: Towards Agent-native Knowledge Orchestration

arXiv cs.AI · Zongsheng Cao, Bihao Zhan, Jinxin Shi, Jiong Wang · 2026-06-11

The paper introduces Agents-K1, an end-to-end pipeline for constructing agent-native scientific knowledge graphs from raw documents, addressing limitations in current LLM-based research agents that overlook detailed knowledge orchestration. The system combines a multimodal parser with a five-module schema, a 4B-parameter information-extraction backbone trained with GRPO, and a tri-source agent interface (graphanything CLI) for unified retrieval. Evaluated on 2.46M papers across six subjects, it produces Scholar-KG (1M-paper subset released), demonstrating superior performance in scientific information extraction, KG construction, and multi-hop reasoning.

knowledge orchestration · multimodal parser · information-extraction backbone · scientific knowledge graphs · multi-hop reasoning

EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

arXiv cs.AI · Amy Xin, Jiening Siow, Junjie Wang, Zijun Yao · 2026-06-11

EurekAgent introduces environment engineering as a critical paradigm for autonomous scientific discovery, shifting focus from agent workflows to designing agent environments. The system engineers environments across four dimensions: permissions, artifacts, budgets, and human-in-the-loop supervision, optimizing for productive behaviors while mitigating harmful ones. EurekAgent achieves state-of-the-art results on tasks in mathematics, kernel engineering, and machine learning, including a novel 26-circle packing solution discovered at a cost of under $11 in API expenses. The authors advocate for environment engineering as a core research direction and open-source their implementation and findings.

environment engineering · autonomous scientific discovery · agent workflows · circle packing · human-in-the-loop

Before You Think: System 0, AI-Mediated Cognition and Cognitive Colonization

arXiv cs.AI · Marianna Bergamaschi Ganapini, Massimo Chiriatti, Enrico Panai, Giuseppe Riva · 2026-06-11

The paper analyzes Tri-System Theory, Thinkframes, and System 0 as frameworks for AI's cognitive impact, proposing System 0 as uniquely capturing AI's covert influence through cognitive colonization. It argues that AI systems embed external interests into users' cognitive architectures imperceptibly, necessitating urgent philosophical and practical scrutiny. The theoretical distinction of System 0 is demonstrated through comparative analysis of these frameworks.

tri-system theory · thinkframes · system 0 · cognitive colonization · epistemic practices

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

arXiv cs.AI · Marek Šuppa, Andrej Ridzik, Daniel Hládek, Natália Kňažeková · 2026-06-11

The authors introduce SkMTEB, the first comprehensive Massive Text Embedding Benchmark (MTEB) for Slovak, featuring 31 datasets across 7 task types, significantly expanding multilingual benchmark coverage for this low-resource language. They evaluate 31 embedding models, finding that large instruction-tuned multilingual models outperform Slovak-specific NLU models. To address efficiency needs, they develop e5-sk-small (45M) and e5-sk-large (365M) via vocabulary trimming and fine-tuning of Multilingual E5, achieving competitive performance despite 62% size reduction while remaining locally deployable for semantic search and RAG. All resources are released openly.

text embedding benchmark · low-resource language · vocabulary trimming · retrieval-augmented generation · multilingual models

Valid Inference with Synthetic Data via Task Exchangeability

arXiv cs.AI · Lezhi Tan, Tijana Zrnic · 2026-06-11

The authors propose statistical principles for valid inference using synthetic data in scientific research, addressing concerns about bias and noise. They introduce a technical condition called task exchangeability, requiring that current tasks be exchangeable with historical tasks for which real data exists. Methods are developed for valid inference under task exchangeability, with extensions providing guarantees beyond this condition. The framework is demonstrated on public opinion surveys using LLM-generated silicon samples and AI evaluation with autoraters, showing practical applicability.

task exchangeability · synthetic data · valid inference · silicon samples · autoraters

Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks

arXiv cs.AI · Achraf Hsain, Sultan Almuhammadi · 2026-06-11

The paper reinterprets shield synthesis in reinforcement learning as a design-time analytical tool rather than a runtime safety mechanism. It introduces a constrained two-player safety game for network defense, where defender and attacker specifications are asymmetrically enforced through automata-theoretic operations including attractor computation and winning-region extraction. This yields a defensibility verdict—a formal certificate of a topology-specification pair's defensibility—along with topology-level metrics and shield-constrained adversarial multi-agent reinforcement learning behavior, forming a defensibility fingerprint. Analysis reveals that formal defensibility and operational effectiveness capture distinct security aspects, with architectural changes significantly impacting operational outcomes while minimally altering formal safety margins.

shield synthesis · attractor computation · defensibility verdict · topology-level metrics · adversarial reinforcement learning

One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders

arXiv cs.AI · Minghao Luo, Liang Chen · 2026-06-11

The paper introduces FORGE (Fake Online Recommendations in Generative Environments), a benchmark for evaluating how search-augmented LLMs propagate fake-product recommendations when exposed to polluted web content. FORGE simulates content pollution by rewriting real products into fake ones across 225 products in 15 categories, measuring LLM vulnerability. Results show all 12 tested models (commercial and open-weights) are susceptible, with fooled rates reaching 27% for single-page pollution and 73.8% for top-3 replacement. Vulnerability correlates with lack of prior product knowledge, and reasoning often generates false justifications. Defenses like skepticism prompting and consensus filtering show limited effectiveness or unintended suppression.

generative recommenders · content pollution · search-augmented llms · benchmark evaluation · fake-product promotion

AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

arXiv cs.AI · Xiaoyuan Liu, Jianhong Tu, Yuqi Chen, Siyuan Xie · 2026-06-11

The authors propose Agentified Agent Assessment (AAA), a standardized framework for evaluating agent systems using judge agents and unified protocols (A2A for task management, MCP for tool access), decoupling assessment logic from agent implementation. AgentBeats, a concrete realization of AAA, introduces five operation modes addressing openness, privacy, and reproducibility constraints. Two studies validate AAA: a five-month open competition with 298 judge agents and 467 subject agents across 12 categories, and a coding agent case study confirming fidelity and yielding design insights. Results demonstrate AAA's applicability across heterogeneous benchmarks, practicality, and fidelity at scale, advancing open, standardized agent assessment.

agentified agent assessment · judge agents · a2a protocol · mcp protocol · operation modes

Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning

arXiv cs.AI · Zach Studdiford, Gary Lupyan · 2026-06-11

The study challenges the assumption that human reasoning relies on abstract world models by demonstrating shared pattern-matching mechanisms in both human and LLM everyday reasoning. Researchers evaluated 25 LLMs and human participants on common-sense reasoning tasks, identifying similar error patterns. Attention heads in LLMs were analyzed, revealing pattern-matching behaviors that predict human reasoning errors influenced by irrelevant prompt details. Results suggest that everyday causal reasoning in humans and LLMs aligns more closely with pattern-matching than with abstract world models.

pattern-matching · attention heads · common-sense reasoning · large language models · error patterns

Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch

arXiv cs.AI · Haochen Wu, Yi Hou, Shiguang Xie · 2026-06-11

The authors present a deployed reinforcement learning system at DoorDash for adapting dispatch objective weights in a three-sided food-delivery marketplace using delayed operational feedback. The system employs a store-level policy that selects a discrete multiplier to shift the dispatch optimizer's tradeoff between delivery quality and batching efficiency, trained via centralized offline data and decentralized execution with Double Q-learning targets and conservative regularization. In a production switchback experiment, the offline-trained policy increased batching efficiency and reduced courier-side time costs without degrading customer-facing delivery quality, demonstrating safe online adaptation of decision policies using real-world economic and logistics feedback.

reinforcement learning · three-sided marketplace · delayed feedback · double q-learning · dispatch optimization
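
The item above mentions Double Q-learning targets. As a generic illustration of that idea (not the DoorDash system itself), the sketch below selects the next action with one value estimate and evaluates it with a second, which is what reduces the overestimation of a plain max. The argument names, the discount factor, and the toy values are all hypothetical.

```python
def double_q_target(reward, q_select_next, q_eval_next, gamma=0.99, done=False):
    """Double Q-learning style bootstrap target.

    q_select_next: action values for the next state from the estimator
                   used to choose the action.
    q_eval_next:   action values for the next state from the estimator
                   used to score the chosen action.
    """
    if done:
        # No bootstrapping past a terminal transition.
        return reward
    best_action = max(range(len(q_select_next)), key=lambda a: q_select_next[a])
    return reward + gamma * q_eval_next[best_action]


if __name__ == "__main__":
    # Action 1 is selected by the first estimate and scored by the second.
    print(double_q_target(1.0, [0.2, 0.9, 0.5], [1.0, 2.0, 3.0], gamma=0.5))
```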

Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models

arXiv cs.AI · Daniel Scalena, Sara Candussio, Luca Bortolussi, Elisabetta Fersini · 2026-06-11

This work investigates the causal influence of individual steps in chain-of-thought (CoT) reasoning across large language models, identifying a commitment boundary where reasoning transitions from transient guesses to stable answers. Using early exit estimation and attention probes, the authors demonstrate that answer formation occurs linearly in intermediate steps and generalizes to unseen tasks, with subsequent CoT steps being epiphenomenal. By exploiting this signal, they achieve up to 55% reduction in CoT length through early exit at the commitment boundary, maintaining model performance across diverse tasks.

chain-of-thought · commitment boundary · early exit · attention probes · epiphenomenal

EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis

arXiv cs.AI · Harihara Muralidharan, Reema Baskar, Soo Hee Lee, Tim Proctor · 2026-06-11

EpiBench introduces a verifiable benchmark for evaluating AI agents on short-horizon epigenomics analysis tasks, focusing on deterministic decision-making from realistic workflow states. The benchmark comprises 106 evaluations across CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows, testing 5,088 trajectories from 16 model-harness pairs. Results show limited success, with GPT-5.5 / Pi achieving the highest pass rate at 45.0% (143/318 attempts), followed by GPT-5.5 / OpenAI Codex at 39.9% (127/318 attempts). Performance varied by assay type, with agents frequently identifying correct files and computing intermediate results but struggling with assay-specific scientific judgment.

epigenomics · benchmark · trajectories · assay · deterministic

Reward Modeling for Multi-Agent Orchestration

arXiv cs.AI · King Yeung Tsang, Zihao Zhao, Vishal Venkataramani, Haizhou Shi · 2026-06-11

The paper introduces Orchestration Reward Modeling (OrchRM), a self-supervised framework for training multi-agent system (MAS) orchestrators without human annotations. OrchRM constructs win-lose pairs from intermediate execution artifacts using Bradley-Terry reward modeling, enabling efficient orchestration-level evaluation. Compared to sub-agent rollout methods, OrchRM achieves 10x training efficiency gains in token usage and improves MAS test-time scaling accuracy by 8% across mathematical reasoning, web QA, and multi-hop reasoning tasks.

multi-agent systems · reward modeling · bradley-terry · orchestration · self-supervised
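
For readers unfamiliar with Bradley-Terry reward modeling, which the OrchRM item above names, the following is a minimal sketch of the standard pairwise formulation: the probability that the "win" item beats the "lose" item is a logistic function of their score difference, and the training loss is its negative log. The scalar scores here are hypothetical stand-ins for whatever a reward model would output; nothing below is taken from the paper's implementation.

```python
import math


def bt_win_probability(score_win, score_lose):
    """Bradley-Terry probability that the preferred item wins the comparison."""
    return 1.0 / (1.0 + math.exp(-(score_win - score_lose)))


def bt_pairwise_loss(score_win, score_lose):
    """Negative log-likelihood of one win-lose pair under Bradley-Terry."""
    return -math.log(bt_win_probability(score_win, score_lose))


if __name__ == "__main__":
    # The loss shrinks as the preferred item is scored further above the other.
    print(bt_pairwise_loss(0.0, 0.0), bt_pairwise_loss(2.0, 0.0))
```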

Multiagent Protocols with Aggregated Confidence Signals

arXiv cs.AI · Ali Elahi, Barbara Di Eugenio · 2026-06-11

The paper introduces three protocols for producing a single aggregated confidence signal in multiagent systems, addressing the lack of methods for evaluating confidence in such systems. The protocols transform raw confidence signals to ensure comparability across models and combine them via soft voting or Bayesian fusion. Evaluated across five benchmarks and four task types, the aggregated confidence demonstrates higher discriminative power (AUARC) than single agents or standard debate baselines, while maintaining correctness (F1-score) and recovering losses incurred by multiagent debate on ambiguous tasks. Calibration improves F1 for both sequence probability and self-report estimators, though AUARC is less dependent on calibration.

multiagent systems · confidence aggregation · bayesian fusion · soft voting · calibration
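
The two combination rules named in the item above, soft voting and Bayesian fusion, can be illustrated with a small sketch. It assumes each agent reports a probability strictly between 0 and 1 that the shared answer is correct, and that agents are conditionally independent with a uniform prior; these are simplifying assumptions for illustration and may differ from the paper's actual protocols, which also include a transformation step not shown here.

```python
import math


def soft_vote(confidences):
    """Soft voting: the mean of the agents' confidences."""
    return sum(confidences) / len(confidences)


def bayesian_fusion(confidences):
    """Naive-Bayes style fusion: sum log-odds, then map back to a probability.

    Assumes independent agents, a uniform prior, and 0 < p < 1 for every p.
    """
    log_odds = sum(math.log(p / (1.0 - p)) for p in confidences)
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    agents = [0.8, 0.8]  # hypothetical per-agent confidences
    print(soft_vote(agents), bayesian_fusion(agents))
```

Under these assumptions, agreeing agents reinforce each other in the Bayesian rule (two 0.8 confidences fuse to about 0.94), whereas soft voting stays at the average.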

EvTexture++: Event-Driven Texture Enhancement for Video Super-Resolution

arXiv cs.AI · Dachun Kai, Jiayao Lu, Yueyi Zhang, Xiaoyan Sun · 2026-06-11

EvTexture++ introduces an event-driven framework for texture enhancement in video super-resolution (VSR), shifting focus from motion refinement to texture recovery. The method employs a texture enhancement branch and iterative module to leverage high-frequency spatiotemporal event details, alongside a temporal texture alignment module for inter-frame consistency using event-guided flow. As a plug-and-play tool, it boosts existing VSR models. Evaluated across five datasets, it achieves state-of-the-art performance, with up to 1.55 dB PSNR improvement on the texture-rich Vid4 dataset.

video super-resolution · event-based vision · texture enhancement · temporal consistency · spatiotemporal details

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

arXiv cs.AI · Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu · 2026-06-11

The paper introduces LabVLA, a Vision-Language-Action (VLA) model for grounding AI in scientific laboratory workflows, addressing data and embodiment bottlenecks. The method combines RoboGenesis, a simulation-based data engine generating structured demonstrations, with a two-stage training recipe: FAST action token pretraining on Qwen3-VL-4B-Instruct for action awareness, followed by flow matching posttraining with a DiT action expert. LabVLA achieves state-of-the-art success rates on the LabUtopia benchmark in both in-distribution and out-of-distribution settings.

vision-language-action · robogenesis · flow matching · action token · labutopia

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

arXiv cs.AI · Tanmoy Kanti Halder, Akash Ghosh, Subhadip Baidya, Arijit Roy · 2026-06-11

The paper introduces ArogyaSutra, a multi-agent framework for multimodal medical reasoning in Indic languages, addressing the limitations of English-centric MLLMs in low-resource healthcare settings. The method combines an actor-critic architecture with tool grounding and dual-memory mechanisms for step-wise reasoning, leveraging a newly constructed dataset (ArogyaBodha) spanning 8 Indian languages, 31 body systems, and 6 imaging modalities. Experiments demonstrate improved multilingual medical reasoning accuracy across all tested Indic languages, with ablation studies confirming the framework's component-level contributions.

multimodal large language models · actor-critic architecture · tool grounding · dual-memory mechanisms · indic languages

Existence Precedes Value: Joint Modeling of Observational Existence and Evolving States in Time Series Forecasting

arXiv cs.AI · Yifan Hu, Hongzhou Chen, Peiyuan Liu, Yiding Liu · 2026-06-11

Timeflies introduces a joint modeling framework for time series forecasting that simultaneously predicts future observability and values, addressing the limitation of existing methods that assume known future observation timestamps. The method employs dual observation and value streams, coupled via reliability-aware embedding, observation-guided dependency modeling, and joint prediction modules. Evaluated on the Shadow benchmark with the novel Observation-Value Joint Entropy (OVJE) metric, Timeflies outperforms existing approaches, demonstrating the importance of explicit observability modeling in incomplete time series.

time series forecasting · observability inference · missing values · joint modeling · continuous-time models

A Three-Layer Framework for AI in Scientific Discovery

arXiv cs.AI · Guojun Liao · 2026-06-11

The paper introduces a three-layer framework for AI in scientific discovery, emphasizing Layer 2 (model formation through qualitative reasoning) as the most critical yet underdeveloped component. Layer 1 involves search/retrieval by LLMs, while Layer 3 handles execution/optimization. Layer 2 enables structural insight to identify inadequacies in existing frameworks and discover missing conceptual objects. Case studies include Chern's intrinsic proof of Gauss-Bonnet, Nesterov Accelerated Gradient convergence via Lyapunov functions, and OpenAI's disproof of the Erdős unit distance conjecture, demonstrating how Layer 2 resolves inadequacies through cross-disciplinary insights.

qualitative reasoning · model formation · scientific discovery · structural insight · cross-disciplinary

Contrast-Informed Augmentation and Domain-Adversarial Training for Adult-to-Neonatal MR Reconstruction Generalization

arXiv cs.AI · Stephen Moore, Lara Leijser, Richard Frayne, Roberto Souza · 2026-06-11

The study demonstrates that contrast-informed data augmentation and domain-adversarial training enhance the generalization of E2E-VarNet from adult to neonatal MRI reconstruction. Three training regimes were compared: adult-only, mixed with augmented data, and mixed with domain-adversarial training. At R=4, Mixed-DAT achieved superior performance (SSIM=0.924±0.027, PSNR=33.98±1.15 dB), while at R=8, Mixed-DAT led in SSIM (0.848±0.031) and Mixed in PSNR (29.56±0.83 dB). t-SNE analysis indicated improved latent representation overlap across domains.

e2e-varnet · domain-adversarial training · contrast-informed augmentation · mr reconstruction · neonatal imaging

Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation

arXiv cs.AI · Aruna Dey, Suraj Biswas · 2026-06-11

The paper proposes a Bayesian inference framework using genomic profiles as personalized priors to address the cold-start problem in physiological interpretation models. The method employs GWAS-derived effect sizes to initialize a belief state G-hat, then computes environmental deviations δ from observed measurements, with priors decaying dynamically as empirical data accumulates. Results demonstrate domain-specific application across six physiological traits, distinguishing robust genomic anchors (FTO, FADS1/2) from contested candidates (SLC6A4), while addressing inference boundaries between association and causation. The architecture enforces four deployment constraints: evidence-graded priors, dynamic decay, ancestry-matched effects, and attribution-focused output.

bayesian prior · genomic anchor · gwas effect sizes · physiological set point · mendelian randomization
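
One standard way a prior can lose influence as measurements accumulate, as the item above describes, is the conjugate normal-normal update. The sketch below uses that textbook model purely as an illustration; the paper's actual decay rule is not given in this summary, and the variable names and numbers are hypothetical.

```python
def posterior_mean(prior_mean, prior_var, observations, obs_var):
    """Posterior mean of a normal quantity with known observation variance."""
    n = len(observations)
    if n == 0:
        # Cold start: with no measurements the estimate is the prior itself.
        return prior_mean
    prior_precision = 1.0 / prior_var
    data_precision = n / obs_var
    sample_mean = sum(observations) / n
    return (prior_precision * prior_mean + data_precision * sample_mean) / (
        prior_precision + data_precision
    )


def prior_weight(prior_var, n, obs_var):
    """Share of the posterior mean contributed by the prior after n observations."""
    prior_precision = 1.0 / prior_var
    return prior_precision / (prior_precision + n / obs_var)


if __name__ == "__main__":
    # The prior's share shrinks from 1.0 toward 0 as n grows.
    print([round(prior_weight(1.0, n, 1.0), 3) for n in (0, 1, 9, 99)])
```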

Uncertainty-Aware Hybrid Retrieval for Long-Document RAG

arXiv cs.AI · Hoin Jung, Xiaoqian Wang · 2026-06-11

UMG-RAG introduces a training-free hybrid retrieval framework for retrieval-augmented generation that dynamically estimates query-specific chunk granularity reliability. The method leverages existing dense and sparse retrievers as complementary experts across multiple granularities, converts expert-granularity scores into evidence distributions, and fuses candidates based on semantic, lexical, and granularity confidence. UMGP-RAG extends this with parent promotion, using fine-grained hits to locate evidence while returning broader parent chunks for coherence. Experiments on question answering benchmarks demonstrate improved generation quality while maintaining a lightweight, plug-and-play retrieval pipeline.

retrieval-augmented generation · chunk granularity · dense retrievers · sparse retrievers · parent promotion
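
The parent-promotion idea in the item above (match on fine-grained chunks, return their broader parents) can be sketched as follows. The data layout, the function name, and the choice to score each parent by its best child hit are assumptions made for illustration, not details from the paper.

```python
def promote_to_parents(hits, parent_of, top_k=3):
    """Map fine-grained retrieval hits to de-duplicated parent chunks.

    hits:      list of (chunk_id, score) pairs from any retriever.
    parent_of: dict mapping each fine-grained chunk_id to its parent chunk id.
    Each parent is scored by its best-scoring child (an assumed rule).
    """
    best_score = {}
    for chunk_id, score in hits:
        parent = parent_of[chunk_id]
        if parent not in best_score or score > best_score[parent]:
            best_score[parent] = score
    ranked = sorted(best_score.items(), key=lambda item: item[1], reverse=True)
    return [parent for parent, _ in ranked[:top_k]]


if __name__ == "__main__":
    hits = [("c1", 0.9), ("c2", 0.4), ("c3", 0.7)]  # hypothetical scores
    parent_of = {"c1": "P1", "c2": "P1", "c3": "P2"}
    print(promote_to_parents(hits, parent_of))
```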

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

arXiv cs.AI · Soumyajit Mitra, Prabhat Pandey, Abhinav Jain, Shanmukha Sahith · 2026-06-11

ModeratorLM, a role-playing voice agent for multi-party spoken conversations, improves turn-taking by conditioning behavior on explicitly assigned roles. The system leverages a speech large language model operating in chunk-wise streaming mode, with a reasoning-augmented variant incorporating chain-of-thought reasoning over conversational context and roles. A large-scale synthetic dataset, RolePlayConv, was constructed for training and evaluation. Experiments on real-world meeting data and RolePlayConv demonstrate significant improvements: turn-taking precision increased by over 40%, recall by more than 70%, and false-positive interruptions were substantially reduced compared to non-role-conditioned baselines.

moderatorlm · roleplayconv · turn-taking · chain-of-thought · streaming

AgentRivet: an automated system for producing Rivet routines from journal publications

arXiv cs.AI · Antonio J. Costa, Caterina Doglioni, Christian Gütschow, Andrew D. Pilkington · 2026-06-11

AgentRivet automates the generation of Rivet routines for particle physics collider experiments by extracting analysis information from published papers using Large Language Models (OpenAI, Anthropic, Google). The multi-step workflow includes intermediate code- and physics-reviews for quality control. Evaluated on ATLAS and CMS measurements, the system produces competent routines with few syntax errors, though physics fidelity varies due to ambiguous definitions in publications. Some models struggle with complex observables despite clear definitions.

rivet routines · large language models · particle physics · automated workflow · physics fidelity

CloudCons: A Comprehensive End-to-End Benchmark for Cloud Resource Consolidation

arXiv cs.AI · Xiaobin Zhang, Lefei Shen, Mouxiang Chen, Zhuo Li · 2026-06-11

CloudCons introduces an end-to-end benchmark for evaluating forecasting models in cloud resource consolidation, addressing the gap between prediction accuracy and decision utility. The benchmark leverages diverse datasets from Huawei Cloud, Microsoft Azure, and Google Borg, capturing varied workload characteristics like diurnal rhythms and stochastic bursts. Evaluations of statistical, deep learning, and foundation models reveal that superior zero-shot forecasting accuracy does not guarantee improved decision utility. The study highlights the critical role of predictive quantile selection and provides guidelines for balancing resource efficiency and service reliability.

cloud resource consolidation · zero-shot forecasting · predictive quantiles · foundation models · decision utility

Measurement-Calibrated Multi-Camera Fusion for Vision-Based Indoor Localization

arXiv cs.AI · Mateo Toro Diz, Jonathan Hoss, Noah Klarmann · 2026-06-11

The paper proposes a measurement-calibrated fusion approach for vision-based indoor localization, explicitly characterizing single-camera error sources (homography calibration, human detection, motion tracking) to optimize multi-camera data fusion. Through component-wise error quantification, the method integrates error models into fusion while evaluating their individual contributions. Results demonstrate that while absolute accuracy gains over standard fusion are modest (not quantified), the approach significantly reduces trajectory variance and improves motion smoothness, benefiting applications requiring stable continuous estimates.

multi-camera fusion · vision-based localization · error quantification · homography calibration · trajectory variance

Heterogeneous LiDAR Early Fusion and Learned Re-Ranking Strategy for Robust Long-Term Place Recognition in Unstructured Environments

arXiv cs.AI · Judith Vilella-Cantos, Juan José Cabrera, Mónica Ballesta, David Valiente · 2026-06-11

MinkUNeXt-VINE++ introduces a novel LiDAR-based place recognition method combining early fusion of heterogeneous LiDAR data from Livox Mid-360 and Velodyne VLP-16 sensors with a learned re-ranking strategy. The approach leverages complementary sensor strengths to enhance environmental representation, particularly in repetitive unstructured environments like vineyards. Evaluated on the TEMPO-VINE dataset across varying phenological stages, MinkUNeXt-VINE++ achieves a 20% improvement in Recall@1 over single-sensor baselines and a 30% improvement with re-ranking, outperforming state-of-the-art methods. The code is publicly available for reproducibility.

lidar · early fusion · re-ranking · place recognition · unstructured environments

CRAFTIIF: Cross-Resolution Analytic Four-Type Interpretable Isolation Forest for Multivariate Time Series Anomaly Detection

arXiv cs.AI · William Smits · 2026-06-11

CRAFTIIF introduces an unsupervised framework for multivariate time series anomaly detection targeting four distinct anomaly types (point, distributional, temporal, collective) through specialized feature representations. The method employs 500 random analytic wavelet feature draws across four wavelet families (Morlet, DOG, Haar, Coiflet), feeding five Isolation Forests (one per anomaly type plus a meta-IF for compound anomalies), with adaptive thresholding for automatic calibration. On the mTSBench benchmark (19 datasets), CRAFTIIF achieves mean F1=0.228 (all datasets) and F1=0.322 (13 detectable datasets), outperforming 25 methods with a 40.7% improvement in VUS-PR (0.463 vs. 0.329). Ablations confirm the necessity of adaptive thresholding (+38% F1), four-branch structure (+20%), and meta-IF (+23%).

multivariate time series · anomaly detection · isolation forest · wavelet features · unsupervised learning

SupraBench: A Benchmark for Supramolecular Chemistry

arXiv cs.AI · Tianyi Ma, Yijun Ma, Zehong Wang, Weixiang Sun · 2026-06-11

SupraBench introduces the first benchmark for evaluating LLMs in supramolecular chemistry reasoning, addressing gaps in host-guest system design. The benchmark comprises four fundamental tasks—binding affinity prediction, top-binder selection, solvent identification, and host-guest description—plus a vision-based molecular identification task. A 16M-token corpus, SupraPMC, was curated from Europe PMC to support domain adaptation pretraining. Evaluation of various LLMs reveals substantial headroom across tasks, with domain adaptation improving in-distribution regression but compromising strict output formatting. Distinct failure modes highlight specific reasoning gaps in supramolecular chemistry. Source codes and datasets are publicly available.

supramolecular chemistry · binding affinity · domain adaptation · host-guest systems · benchmark

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

arXiv cs.AI · Jiacheng Chen, Xinyu Zhang, Shunkai Zhang, Yanmohan Wang · 2026-06-11

MaxProof introduces a population-level test-time scaling framework for mathematical proof generation, combining generative-verifier reinforcement learning with tournament selection. The method integrates three capabilities—proof generation, verification, and critique-conditioned repair—into a single MiniMax-M3 model, engineered for low false-positive rates. At test time, MaxProof employs the model as a generator, verifier, refiner, and ranker, searching over candidate proofs to select the best via tournament selection. Results show 35/42 on IMO 2025 and 36/42 on USAMO 2026, surpassing human gold-medal thresholds.

population-level scaling · generative-verifier rl · tournament selection · proof repair · false-positive rate
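
The tournament selection step in the MaxProof summary can be illustrated with a minimal sketch. The `judge` callable stands in for the model acting as ranker, and the single-elimination bracket, the `score` field, and the sample data are assumptions made for illustration, not the paper's procedure.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def tournament_select(candidates: List[T], judge: Callable[[T, T], T]) -> T:
    """Single-elimination tournament: pair up candidates, keep winners, repeat."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        # Compare neighbours pairwise; an unpaired last candidate gets a bye.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(judge(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]

# Hypothetical judge: prefer the candidate with the higher verifier score.
def score_judge(a, b):
    return a if a["score"] >= b["score"] else b

proofs = [{"id": i, "score": s} for i, s in enumerate([0.2, 0.9, 0.5, 0.7, 0.4])]
best = tournament_select(proofs, score_judge)
```

With a pairwise judge, the bracket needs one comparison fewer than the number of candidates, which is one plausible reason to prefer it over scoring every pair.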

Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset

arXiv cs.AI · Mahmoud Abujadallah, Ali Arabat, Mohammed Sayagh · 2026-06-11

The paper analyzes why 46.41% of AI-generated pull requests (from agents like Copilot, Devin, Cursor, and Claude) are rejected, based on the AIDev dataset. Through qualitative study of 306 non-merged PRs and quantitative analysis, it identifies 14 rejection reasons grouped into four categories: incorrect implementation, CI/test failures, unimplementable fixes, and low priority. Findings highlight the need for better model guidance in approach selection, constraint specification, and CI validation, as well as improved task prioritization to reduce wasted review and computational resources.

ai coding agents · pull requests · continuous integration · task prioritization · code fixes

Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

arXiv cs.AI · Xinxin Li, Huiyao Chen, Meishan Zhang, Yunxin Li · 2026-06-11

We propose an ontology memory-augmented ASR correction framework for long text-speech interleaved conversations, addressing limitations of traditional methods that focus on isolated utterances or concatenated dialogue history. The framework organizes interaction history into a dynamically updatable ontology memory, storing entities, terminology, surface variants, ASR confusions, and semantic relations as retrievable nodes for context-grounded correction. Evaluated on RAMC-Corr, a dataset derived from MAGIC-RAMC, our method outperforms direct correction in 9 out of 10 paired backbone-setting combinations, demonstrating improved selectivity and evidence-grounded corrections for context-dependent ASR errors.

ontology memory · asr correction · context-grounded · ramc-corr · dynamic update

Toward Instructions-as-Code: Understanding the Impact of Instruction Files on Agentic Pull Requests

arXiv cs.AI · Ali Arabat, Mohammed Sayagh · 2026-06-11

The paper investigates how instruction files impact AI-agent performance in generating pull requests (Agentic-PRs) by analyzing 15,549 PRs from 148 projects in the AIDev dataset. Using merge rate, code churn, and merge effort as metrics, the study compares projects before and after instruction file creation. Results show mixed effects: 27.7% of projects improved merge rates by ≥20%, while 26.35% declined. Longer, well-structured instruction files correlated with higher merge rates, suggesting the need for research on Instructions-as-Code to optimize AI-agent guidance.

agentic pull requests · instructions-as-code · merge rate · code churn · ai-agents

Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models

arXiv cs.AI · Joseph Keshet · 2026-06-11

The paper critiques claims that large language models (LLMs) exhibit agency or qualify as moral agents. It asserts that moral responsibility requires commitment-bearing agency grounded in intrinsic intentionality and self-attributed action, which LLMs lack. The author analyzes LLM operation as probabilistic input-output mappings derived from data, with apparent intentionality being extrinsic rather than intrinsic. The paper addresses objections from the intentional stance, functionalism, compatibilism, and moral reasoning in model outputs, concluding that none establishes genuine agency. Stochastic sampling in LLMs is shown to differ fundamentally from choice or authorship.

large language models · moral responsibility · intentionality · stochastic sampling · probabilistic mapping

Evaluation Sovereignty in Metadata-Driven Classification: A Multi-Track Framework for Weakly Supervised Information Systems

arXiv cs.AI · Raymond Vasquez · 2026-06-11

The paper introduces evaluation sovereignty, a concept assessing the independence of performance metrics from label authority and supervision regimes in metadata-driven classification systems. A multi-track evaluation framework is proposed, systematically varying training and evaluation label sources to audit model validity under weak supervision. Experiments on hierarchical multi-label classification of scientific metadata reveal significant performance degradation when transitioning from operational ('silver') to independent ('gold') evaluation, with Micro-F1 dropping from 0.54 to 0.03 for fine-grained tasks. Ranking-based metrics remain above baseline, indicating a divergence between latent model signal and classification validity. The findings highlight the need to reconceptualize evaluation validity as a system-level property shaped by label governance.

evaluation sovereignty · metadata-driven classification · weak supervision · multi-track framework · label governance

OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

arXiv cs.AI · Jiwen Liu, Shujuan Li, Zhixue Fang, Xiaohan Li · 2026-06-11

OmniDirector introduces a general camera motion representation using grid motion videos to enable multi-shot video generation without cross-paired data. The framework encodes camera parameters visually, integrates diverse trajectories, and employs a hierarchical prompt expansion agent to harmonize control signals for multimodal diffusion transformers. Trained on a million-scale dataset of camera grid-video pairs, OmniDirector achieves director-level control over characters, actions, and cameras. Extensive experiments demonstrate superior performance and controllability in complex camera motion cloning tasks.

camera motion cloning · grid motion videos · multimodal diffusion transformers · hierarchical prompt expansion · multi-shot generation

Optimizing Appliance Scheduling for Solar Energy Management Using Metaheuristic Algorithms

arXiv cs.AI · Hiba Ahmed, Alexander E. I. Brownlee, Jason Adair, Simon T. Powers · 2026-06-11

A metaheuristic approach for optimizing appliance scheduling in solar energy management is proposed, utilizing Iterated Local Search (ILS) and Simulated Annealing (SA) to maximize renewable energy utilization while minimizing user inconvenience. The method considers appliance operating durations, power consumption, inverter limits, battery state of charge constraints, and solar generation forecasts, extending scheduling beyond single-day horizons to accommodate spillover tasks from previous days. Experimental results demonstrate that the sequential multi-day scheduling framework effectively manages system constraints and ensures user convenience under exclusive solar generation. This approach opens avenues for future research on multi-objective trade-offs between equipment investment, return on investment, and user satisfaction.

metaheuristic algorithms · solar energy management · iterated local search · simulated annealing · multi-day scheduling
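
As a rough illustration of the Simulated Annealing half of the appliance-scheduling summary, the sketch below picks a start hour per appliance so that demand falls under a solar forecast. The forecast, the appliance table, the cost function, and the linear cooling schedule are all made-up assumptions; inverter limits, battery state of charge, and multi-day spillover from the paper are not modelled.

```python
import math
import random

SOLAR = [0, 0, 1, 2, 3, 3, 2, 1, 0, 0]  # hypothetical kW forecast per hour
APPLIANCES = {"washer": (2, 2.0), "dryer": (2, 2.0), "pump": (3, 1.0)}  # (hours, kW)

def shortfall(starts):
    """Total demand not covered by solar, summed over all hours."""
    load = [0.0] * len(SOLAR)
    for name, start in starts.items():
        hours, kw = APPLIANCES[name]
        for h in range(start, start + hours):
            load[h] += kw
    return sum(max(0.0, l - s) for l, s in zip(load, SOLAR))

def anneal(steps=2000, t0=2.0, seed=0):
    rng = random.Random(seed)
    current = {n: 0 for n in APPLIANCES}  # initial schedule: everything at hour 0
    cur_cost = shortfall(current)
    best, best_cost = dict(current), cur_cost
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling (an assumption)
        name = rng.choice(list(APPLIANCES))
        hours, _ = APPLIANCES[name]
        cand = dict(current)
        cand[name] = rng.randrange(0, len(SOLAR) - hours + 1)  # move one appliance
        cost = shortfall(cand)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cost <= cur_cost or rng.random() < math.exp((cur_cost - cost) / temp):
            current, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = dict(cand), cost
    return best, best_cost

schedule, cost = anneal()
```

The returned schedule can never be worse than the all-at-hour-0 starting point, because the best-so-far solution is tracked separately from the current one.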

Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda

arXiv cs.AI · Alexander Rombach, Chantale Lauer, Nijat Mehdiyev · 2026-06-11

The paper proposes compliance-by-construction as a neuro-symbolic paradigm for LLM-based agents in regulated process automation, integrating symbolic structures (regulations, process models) as architectural components rather than external guardrails. It identifies foundational and capability-level research challenges for preventing control-flow violations while maintaining semantic error detection. The work calls for community engagement to address these challenges through joint neuro-symbolic approaches.

compliance-by-construction · neuro-symbolic · regulated process automation · control-flow violations · semantic errors

PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update

arXiv cs.AI · Jianming Ma, Qiyue Yang, Yang Zhang, Liyun Yan · 2026-06-11

PolyFlow introduces a polytope-constrained flow matching framework that embeds safety constraints directly into flow dynamics via a discrete-time formulation and projection-free architecture. The method guarantees strict polyhedral constraint satisfaction without iterative solvers, eliminating discretization error and post-hoc corrections. Experiments demonstrate zero constraint violation while maintaining distributional fidelity, with significantly reduced inference latency compared to constrained generation baselines across planning and control tasks.

flow matching · polytope constraints · projection-free · discrete-time flow · constrained generation

Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities

arXiv cs.AI · Dipto Das, Achhiya Sultana, Ankit Singh Chauhan, Saadia Binte Alam · 2026-06-11

The paper introduces Mod-Guide, an LLM-based content moderation system enhanced with retrieval augmented generation (RAG) to address culturally insensitive speech toward Bangladesh's Hindu and Chakma minority communities. The method involves co-creating a corpus of insensitive speech with community members and integrating their narratives via RAG to improve contextual accuracy. Mixed-method evaluations show RAG-enhanced moderation responses achieve higher contextual accuracy and are perceived differently across ethnic lines, advancing restorative justice and hermeneutical inclusion in AI moderation systems.

retrieval augmented generation · content moderation · culturally insensitive speech · large language models · hermeneutical inclusion

MiniMax Sparse Attention

arXiv cs.AI · Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen · 2026-06-11

MiniMax Sparse Attention (MSA) introduces a blockwise sparse attention mechanism for ultra-long-context LLMs, addressing the quadratic cost of softmax attention. Built upon Grouped Query Attention (GQA), MSA employs a lightweight Index Branch to score and select Top-k key-value blocks per GQA group, enabling group-specific sparse retrieval. The Main Branch performs exact block-sparse attention over selected blocks, optimized for GPU execution via exp-free Top-k selection and KV-outer sparse attention. On a 109B-parameter multimodal model, MSA matches GQA performance while reducing per-token attention compute by 28.4x at 1M context, achieving 14.2x prefill and 7.6x decoding speedups on H800 GPUs.

sparse attention · grouped query attention · kv-outer · blockwise · multimodal
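
The two-step idea in the MSA summary (score key-value blocks, then run exact attention over only the Top-k blocks) can be sketched for a single query in plain Python. Scoring a block by the query's dot product with the block's mean key is an assumption made here for illustration; the summary does not say how the Index Branch scores blocks, and the GQA grouping and GPU optimizations are omitted.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_sparse_attention(q, keys, values, block_size, top_k):
    """Attend only to the top_k key/value blocks for one query vector."""
    n_blocks = math.ceil(len(keys) / block_size)
    blocks = [range(b * block_size, min((b + 1) * block_size, len(keys)))
              for b in range(n_blocks)]

    # Index step (assumed rule): score each block by q . mean(keys in block).
    def block_score(idx):
        mean_key = [sum(keys[i][d] for i in idx) / len(idx) for d in range(len(q))]
        return dot(q, mean_key)

    ranked = sorted(range(n_blocks), key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])

    # Main step: exact softmax attention restricted to the selected blocks.
    idx = [i for b in selected for i in blocks[b]]
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[i]) * scale for i in idx])
    return [sum(w * values[i][d] for w, i in zip(weights, idx))
            for d in range(len(values[0]))]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0], [1.0], [0.0], [0.0]]
sparse_out = block_sparse_attention(q, keys, values, block_size=2, top_k=1)
dense_out = block_sparse_attention(q, keys, values, block_size=2, top_k=2)
```

Setting `top_k` to the number of blocks recovers ordinary dense attention, which makes the sparse and dense outputs easy to compare on small inputs.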

Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents

arXiv cs.AI · Zihao Wang, Yiming Li, Yutong Wu, Zheyu Liu · 2026-06-11

We introduce StakeBench, a stakeholder-centric benchmark for evaluating prompt-injection vulnerabilities in LLM-driven web agents operating in real-world environments. Unlike attack-centric approaches, StakeBench systematically categorizes harms by affected entities (user, seller, platform), decomposes attacks into concrete objectives, and employs complementary outcome- and process-level metrics. Evaluation reveals heterogeneous vulnerabilities: no attack objective is reliably resisted, with failures manifesting as stealthy parasitism, misaligned disruption, and compounded failure. These patterns, missed by conventional benchmarks, demonstrate the need for stakeholder-aware assessment in LLM-based agent deployments.

prompt-injection · web agents · stakeholder-centric · llm-driven · vulnerabilities

SmartFont: Dynamic Condition Allocation for Few-Shot Font Generation

arXiv cs.AI · Zian Yang, Zixin Wang · 2026-06-11

SmartFont introduces a diffusion-based framework for few-shot font generation that dynamically allocates global and local conditions. The method combines global content-style modeling with weakly supervised local corrective experts, which learn semantic-spatial maps for fine-grained corrections without explicit component-conditioned inference. A denoising-state condition allocation module adaptively weights global content, global style, and local corrective features across timesteps and injection blocks. Experiments demonstrate that SmartFont achieves superior global-local balance, enhancing glyph quality and local detail fidelity compared to existing approaches.

few-shot font generation · diffusion-based framework · semantic-spatial allocation · denoising-state condition · global-local balance

An LLM System for Autonomous Variational Quantum Circuit Design

arXiv cs.AI · Kenya Sakka, Wataru Mizukami, Kosuke Mitarai · 2026-06-11

The authors present an autonomous LLM-based framework for variational quantum circuit design, combining seven components (Exploration, Generation, Discussion, Validation, Storage, Evaluation, Review) into a closed-loop workflow integrating web knowledge, critique, code generation, and experimental feedback. Evaluated on quantum feature maps for machine learning and ansatz generation for quantum chemistry, the system outperforms classical radial basis function kernels in image classification and matches chemically inspired ansatzes in molecular ground state estimation across seven molecules while respecting scaling constraints.

variational quantum circuits · quantum feature maps · ansatz generation · agentic framework · quantum machine learning

A Quantitative Experimental Repeated Measures Study of Training Dynamics in a Small Llama Style Language Model Under a Compute-Aware Token Budget

arXiv cs.AI · Joe Dwyer · 2026-06-11

The study contributes a quantitative analysis of training dynamics in small language models under compute constraints, demonstrating the importance of trajectory-based evaluation beyond endpoint metrics. Using a 4.26M-parameter Llama-style model trained on TinyStories with a 20M-token budget, the author collected repeated measures (126 observations: 6 seeds at 21 intervals each) of validation loss, perplexity, and volatility. Results showed rapid early improvement (loss: 8.3552→2.7996 by 4M tokens) followed by non-monotonic degradation (final loss: 3.9010), with ANOVA-confirmed interval effects and no stable phase under predefined criteria.

training dynamics · compute-aware · validation loss · repeated measures · token budget

IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing

arXiv cs.AI · Tao Hu, Jiaxin Ai, Licheng Wen, Xueheng Li · 2026-06-11

The paper introduces IterCAD, a multimodal agent framework for closed-loop CAD generation and editing, addressing limitations of open-loop approaches. The method combines a CAD sandbox interaction paradigm with a data synthesis pipeline for multi-view drawings and code-editing tasks, optimized via supervised fine-tuning and geometry-aware RL with viable-prefix masking. Evaluation on IterCAD-Bench using Chamfer Distance Tolerance-Recall metrics shows superior performance in code executability (AUC-TR) and geometric precision compared to baselines, with strong iterative refinement capabilities.

computer-aided design · multimodal agent · reinforcement learning · chamfer distance · iterative refinement

Can I Buy Your KV Cache?

arXiv cs.AI · Luoyuan Zhang · 2026-06-11

The paper proposes KV cache reuse, where publishers precompute document-specific key-value (KV) caches for large language models, enabling agents to skip redundant prefill computations. The method is token-exact, matching prefilled outputs (24/24 greedy tokens) with no accuracy loss. On Qwen3-4B, reuse reduces compute costs by 9-50x compared to prefill, with savings that grow with document length L (prefill attention cost scales as L^2). Provider-side hosting avoids prohibitive egress costs, with measured savings of 49.7x for serving 80M agents (~$1.5M vs. ~$0.03M). A 10x user discount remains viable within the 50x compute savings envelope, creating a revenue opportunity for providers.

kv cache · prefill · compute efficiency · egress cost · prompt caching
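
The pricing argument in the KV cache summary reduces to simple arithmetic, sketched below with the rounded dollar figures quoted there. Reading the "10x user discount" as charging users one tenth of the prefill cost is an assumption made for this sketch, and the variable names are hypothetical.

```python
prefill_cost_usd = 1.5e6  # ~$1.5M, rounded figure from the summary
reuse_cost_usd = 0.03e6   # ~$0.03M, rounded figure from the summary

# Rounded inputs give roughly 50x; the summary reports a measured 49.7x.
savings = prefill_cost_usd / reuse_cost_usd

# Assumed pricing: users pay one tenth of what prefill would have cost.
user_discount = 10.0
price_to_user = prefill_cost_usd / user_discount

# The provider still clears the gap between that price and its reuse cost.
margin = price_to_user - reuse_cost_usd
```

Under these assumptions the provider keeps a positive margin as long as the discount factor stays below the compute savings factor.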

Real-Time Execution with Autoregressive Policies

arXiv cs.AI · Sangkyu Lee, Seohyeon Park, Tackgeun You, Avi Caciularu · 2026-06-11

The work demonstrates that autoregressive policies can achieve real-time execution by adjusting tokenization horizons and applying constrained decoding, ensuring strict latency bounds for multi-trajectory decoding. This approach outperforms equivalent flow-matching policies in both simulated and real-world environments, improving task completion speeds while maintaining autoregressive advantages like faster convergence and better instruction-following generalizability. Results confirm autoregressive policies as competitive for real-time deployment in Vision-Language-Action models.

autoregressive policies · real-time execution · constrained decoding · tokenization horizon · flow-matching policies

IVIE: A Neuro-symbolic Approach to Incremental and Validated Generation of Interactive Fiction Worlds

arXiv cs.AI · Micaela Vaucher, Santiago Silveira, Santiago Góngora, Luis Chiruzzo · 2026-06-11

IVIE introduces a neuro-symbolic approach for generating coherent interactive fiction worlds by combining LLM creativity with symbolic validation. The system employs a four-stage pipeline where LLMs handle creative tasks (setting, character, puzzle design) while symbolic methods enforce world-state consistency. Evaluation demonstrates immersive, thematically coherent worlds with high engagement, though some LLM inconsistencies and validation gaps persist. The work highlights key design tradeoffs between generative flexibility and structural coherence in neuro-symbolic storytelling systems.

interactive fiction · neuro-symbolic · large language models · world coherence · puzzle design

Dual-Domain Equivariant Generative Adversarial Network for Multimodal CT-PET Synthesis

arXiv cs.AI · Gabriel Steele, Alzahra Altalib, Alessandro Perelli · 2026-06-11

The paper introduces Dual-Domain Equivariant GAN (DDE-GAN) for CT-PET synthesis, combining spatial and frequency domain learning with rotational equivariance to enhance structural fidelity. The method employs hierarchical dual-domain training and multi-stage loss functions for intra- and inter-domain consistency. Evaluated on HECKTOR 2022, DDE-GAN outperforms baselines in synthesis quality, demonstrating improved accuracy and robustness for multimodal medical imaging applications.

gan · equivariance · ct-pet synthesis · dual-domain learning · hierarchical training

ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

arXiv cs.AI · Xucong Wang, Ziyu Ma, Yong Wang, Shidong Yang · 2026-06-11

ReSum introduces a reinforcement learning framework that synergizes large language model (LLM) reasoning and self-summarization to improve long-horizon reasoning. The method employs a summarization-aware adaptive rollout mechanism, where contrastive branches are created by masking or injecting summarization phrases, enabling finer-grained trajectory comparison. Results demonstrate a 4% performance improvement and an 18.6% reduction in rollout length, validating the efficacy of self-summarization in stabilizing generation and mitigating error propagation.

reinforcement learning · large language models · self-summarization · rollout mechanism · contrastive branches

Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection

arXiv cs.AI · Yongmin Kim, ByeongHoon Jeon, Sungil Kim · 2026-06-11

The paper introduces Rarity-Gated Feature-wise Linear Modulation (RGFiLM), a context-conditioning module for anomaly detection that addresses frequency bias in imbalanced context distributions. RGFiLM combines feature-wise modulation with a data-driven rarity gate, which adjusts context influence based on empirical rarity scores to stabilize decisions in rare regimes. Evaluated on maritime trajectory anomaly detection using AIS and ERA5 environmental data, RGFiLM achieves superior F1-FPR trade-offs compared to context-agnostic and context-conditioned baselines, demonstrating effectiveness in reducing false alarms for rare contexts.

anomaly detection · context-conditioning · feature-wise modulation · frequency bias · rarity score
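
One way to picture the rarity gate in the RGFiLM summary is as a blend between plain feature-wise linear modulation and the identity map. The gate form below (`1 - rarity`, so rarer contexts modulate less) is an assumption chosen for illustration; the summary does not specify the gate's formula or direction.

```python
def film(x, gamma, beta):
    """Standard FiLM: per-feature scale and shift."""
    return [g * xi + b for xi, g, b in zip(x, gamma, beta)]

def rarity_gated_film(x, gamma, beta, rarity):
    """Shrink the FiLM modulation toward identity as rarity goes from 0 to 1."""
    gate = 1.0 - rarity  # assumed gate: rare context -> weaker context influence
    return [(1.0 + gate * (g - 1.0)) * xi + gate * b
            for xi, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0]
gamma = [2.0, 0.5]  # hypothetical context-dependent scales
beta = [0.5, -1.0]  # hypothetical context-dependent shifts

common = rarity_gated_film(x, gamma, beta, rarity=0.0)  # full FiLM modulation
rare = rarity_gated_film(x, gamma, beta, rarity=1.0)    # features pass through
```

At `rarity=0.0` the function reduces to ordinary FiLM, and at `rarity=1.0` it returns the input features unchanged.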

Physics-Guided Spatiotemporal Learning for Coastal Wave Peak Period Estimation from Video

arXiv cs.AI · Abubakar Hamisu Kamagata, Dharm Singh Jat, Attlee Munyaradzi Gamundani, Abhishek Srivastava · 2026-06-11

A Physics-Guided Deep Spatiotemporal Learning Framework is proposed for estimating nearshore wave peak periods from passive coastal video streams. The method integrates automated temporal-variance based region-of-interest detection, multi-stage Sim-to-Real transfer learning, and physics-informed regularization to enhance accuracy and physical consistency. Transformer-based architectures achieved superior instantaneous prediction accuracy, while lightweight recurrent-convolutional models provided higher temporal stability and operational oceanographic skill. Physics-guided regularization improved trend-following consistency and reduced physically implausible predictions. Explainability auditing confirmed alignment with hydrodynamic wave propagation behavior, demonstrating the framework's potential for cost-efficient, long-term coastal wave monitoring.

spatiotemporal learning · sim-to-real transfer · physics-informed regularization · transformer-based architectures · recurrent-convolutional models

Mining Architectural Quality Under Agentic AI Adoption: A Causal Study of Java Repositories

arXiv cs.AI · Oliver Aleksander Larsen, Mahyar T. Moghaddam · 2026-06-11

The study estimates the causal effect of agentic AI tool adoption on architectural quality in Java repositories, addressing a gap in architecture-level outcomes. Analyzing 151 open-source repositories (74 with AI adoption, 77 controls) over 13 months via Arcan snapshots, it employs a staggered difference-in-differences design with the Borusyak imputation estimator. Results show no significant change in total architectural smell counts (+1.1%, p = 0.82) despite a 12.8% increase in lines of code (p = 0.003), leading to a 6.7% decline in architectural smell density (p = 0.004) attributed to denominator effects. Robustness checks confirm the findings, emphasizing the need for raw counts in causal studies of AI adoption.

architectural smell density · java repositories · causal inference · difference-in-differences · borusyak imputation
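
The denominator effect mentioned in the Java architectural-quality summary can be shown with a toy calculation: if the smell count stays flat while the code base grows, density falls even though nothing improved. The numbers below are made up for illustration and are not taken from the study.

```python
def density(smells, loc):
    """Architectural smells per 1,000 lines of code."""
    return smells / loc * 1000

before = density(100, 50_000)  # hypothetical repository before adoption
after = density(100, 56_000)   # same smell count, 12% more code

# Relative change in density, driven entirely by the larger denominator.
change = (after - before) / before
```

This is why the summary stresses reporting raw counts alongside density.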

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

arXiv cs.AI · Guozhen Zhang, Xuerui Qiu, Yutao Cui, Tianhui Song · 2026-06-11

HYDRA-X introduces the first unified multimodal model (UMM) with a holistic visual tokenizer that unifies image and video tokenization within a single Vision Transformer (ViT). The model addresses spatiotemporal reconstruction via frame-level causal temporal attention and hierarchical temporal compression, while embedding image-video semantic awareness through a lightweight decompressor under joint teacher supervision. Editing consistency is improved by shifting source-target interaction to the latent level within the tokenizer rather than the semantic level in the LLM. Instantiated as a 7B dense model, HYDRA-X demonstrates strong performance across image and video understanding and generation tasks.

unified multimodal model · holistic visual tokenizer · spatiotemporal reconstruction · hierarchical temporal compression · latent level interaction

Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

arXiv cs.AI · Wei Li, Zhen Huang, Xinmei Tian · 2026-06-11

MACCO (MAsked Compositional Concept MOdeling) enhances visio-linguistic compositionality in vision-language models by masking compositional concepts in one modality and reconstructing them conditioned on the other modality's full context. The framework employs two auxiliary objectives to jointly align and regularize masked features both inter-modally and intra-modally. Evaluated on five compositional benchmarks, MACCO significantly improves compositionality, syntactic structure capture, and linguistic information alignment in VLMs, with additional benefits for text-to-image generation and multimodal large language models.

compositionality · vision-language models · masked reconstruction · cross-modal alignment · syntactic structure

Once-for-All: Scalable Simultaneous Forecasting via Equilibrium State Estimation

arXiv cs.AI · Beinan Xu, Andy Song, Jiti Gao, Feng Liu · 2026-06-11

We present Equilibrium State Estimation (ESE), a novel paradigm for simultaneous forecasting of multiple interacting systems. ESE first estimates the equilibrium state across systems, then generates holistic forecasts based on the difference between the current state and equilibrium. Experiments on currency exchange and COVID-19 datasets demonstrate ESE matches state-of-the-art accuracy while achieving 10-70x speedup and linear-time complexity. ESE integrates with conventional predictors, maintaining accuracy while scaling efficiently with system count and remaining robust to perturbations. The method establishes a fast, generalizable, and scalable approach for multi-system prediction tasks.

equilibrium state estimation · simultaneous forecasting · linear-time complexity · holistic forecasts · system perturbations

ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space

arXiv cs.AI · Pratyush Chaudhari · 2026-06-11

The Ethical Robustness Testing System (ERTS) introduces a closed-pipeline framework for evaluating AI robustness in ethical contexts. ERTS encodes dilemmas into a 22-dimensional Ethical Consequence Space (ECS), applies 17 semantic perturbation functions under 6 validity constraints, measures decision deviation via a 4-component Ethical Instability Index (EII), and produces domain-adaptive robustness assessments. Evaluated on 4 baseline models and 2 production LLMs (Gemini 2.0 Flash, Llama 3.2) across 50 scenarios, ERTS generated 1,500 adversarial test cases, revealing only 33% of models achieved assessment clearance, with Llama 3.2 particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737).

ethical consequence space · semantic perturbation · ethical instability index · domain-adaptive assessment · adversarial testing

Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization

arXiv cs.AI · Kirato Yoshihara · 2026-06-11

This work investigates module-specific weight-space geometry in transformer optimization, demonstrating that different transformer modules prefer distinct manifold constraints. The author analyzes GPT-2 pretraining using Manifold Muon, comparing Stiefel and DGram constraints across attention and MLP blocks. Results reveal asymmetric preferences: Stiefel constraints on attention layers and DGram constraints on MLP layers yield optimal performance, while inverted assignments lead to instability due to singular value growth in DGram-constrained attention weights. These findings highlight the importance of module-specific, geometry-aware optimization strategies in transformer architectures.

manifold constraints · transformer optimization · stiefel geometry · dgram geometry · softmax saturation

From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification

arXiv cs.AI · Rongxin Yang, Shenghong He, Siyuan Zhu, Chao Yu · 2026-06-11

ProFact introduces an agentic reinforcement learning framework for end-to-end optimization of multi-stage fact verification trajectories, addressing limitations of isolated stage optimization and fixed heuristics. The method trains a unified policy to coordinate claim decomposition, evidence seeking, answer generation, and verdict prediction, utilizing process-aware rewards to provide stage-level learning signals throughout the verification process. Empirical evaluation demonstrates that ProFact consistently outperforms strong baselines in both verification performance and inference efficiency, highlighting the effectiveness of process-aware trajectory optimization for multi-stage fact verification.

agentic reinforcement learning · multi-stage fact verification · process-aware rewards · claim decomposition · verdict prediction

MOSAIC: Modality-Specific Adaptation for Incremental Continual Learning in Parkinson's Disease Gait Assessment

arXiv cs.AI · Minlin Zeng, Zhipeng Zhou, Yang Qiu, Martin J. McKeown · 2026-06-11

MOSAIC introduces a modality-specific continual learning framework for Parkinson's disease gait assessment, addressing challenges in modality-incremental settings. The method employs Modality-Specific Warm-Up to stabilize new modality representations, a statistics-decoupled MSBN architecture to isolate sensor statistics while maintaining a shared semantic backbone, and a curriculum-guided repulsive objective for Plasticity Recovery to preserve legacy knowledge. Evaluated on three multimodal Parkinson's gait datasets, MOSAIC improves final performance and mitigates forgetting. Project code is publicly available.

modality-specific warm-up · statistics-decoupled msbn · plasticity recovery · modality-incremental learning · parkinson's gait assessment

Humor Style Drives Laughter, Topic Shapes Acceptability: Evaluating Bilingual Personal and Political Robot-Delivered AI Jokes

arXiv cs.AI · Anna-Maria Velentza, Anne-Gwenn Bosser · 2026-06-11

This exploratory study investigates how humor style, joke content, and language preference influence perceptions of robot-delivered AI-generated jokes in group settings. Using a mixed factorial design, participants evaluated jokes delivered by a robot in a university classroom, focusing on humor type (Affiliative, Self-Enhancing, Aggressive, Self-Defeating) and joke content (person-related vs. political). Results indicate that humor type significantly impacts perceived funniness, with Aggressive and Affiliative humor rated higher, while joke content primarily affects appropriateness, favoring person-related over political jokes. Language preference was influenced by both joke content and participants' self-reported fluency and humor practices.

human-robot interaction · mixed factorial design · humor type · joke content · language preference

Towards Personalized Federated Learning for Dysarthric Speech Recognition

arXiv cs.AI · Tao Zhong, Mengzhe Geng, Jiajun Deng, Shujie Hu · 2026-06-11

The paper proposes personalized federated learning (FL) strategies for dysarthric speech recognition to address heterogeneity in speaker variability. Two aggregation methods are introduced: parameter-based averaging and embedding-based averaging. Evaluations on UASpeech and TORGO datasets demonstrate statistically significant improvements, with word error rate (WER) reductions of 0.99% absolute (3.15% relative) and 0.56% absolute (4.73% relative), respectively, compared to baseline FedAvg.

federated learning · dysarthric speech · asr · personalization · wer
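
The absolute and relative WER reductions quoted above are linked by the baseline WER (relative = absolute / baseline). As a rough sanity check, the sketch below back-computes the baseline WER implied by each pair of figures. The helper name `implied_baseline_wer` is hypothetical, and the resulting baselines are derived here, not reported in the summary.

```python
def implied_baseline_wer(abs_reduction_pts: float, rel_reduction_pct: float) -> float:
    """Baseline WER (in %) implied by an absolute and a relative WER reduction.

    relative = absolute / baseline, so baseline = absolute / relative.
    """
    return abs_reduction_pts / (rel_reduction_pct / 100.0)


# Figures from the summary; the baselines are derived, not reported.
uaspeech_baseline = implied_baseline_wer(0.99, 3.15)
torgo_baseline = implied_baseline_wer(0.56, 4.73)
print(f"UASpeech baseline WER: about {uaspeech_baseline:.2f}%")
print(f"TORGO baseline WER: about {torgo_baseline:.2f}%")
```

Under these figures the FedAvg baselines would sit near 31.4% (UASpeech) and 11.8% (TORGO), which explains why the smaller absolute gain on TORGO is the larger relative one.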

Multi-Field Hybrid Retrieval-Augmented Generation for Maritime Accident Root Cause Analysis

arXiv cs.AI · Seongjin Kim, Sungil Kim · 2026-06-11

We propose a multi-field hybrid retrieval-augmented generation (RAG) framework for maritime accident root cause analysis (RCA), leveraging a dataset of 13,329 Korea Maritime Safety Tribunal reports (1971-2025). The method transforms raw adjudications into structured incident cards indexed across Summary, Causes, and Disposition fields, employing a field-aware hybrid retrieval strategy that fuses sparse and dense rankings via Reciprocal Rank Fusion (RRF). Evaluation using ceiling-normalized recall and nDCG shows significant retrieval improvements (NormRecall@100: 0.18 → 0.55), while grounding RCA generation on retrieved precedents enhances LLM-as-a-judge scores (3.34 → 3.72), demonstrating the framework's potential to streamline maritime safety investigations.

retrieval-augmented generation · reciprocal rank fusion · root cause analysis · incident cards · ceiling-normalized recall
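
Reciprocal Rank Fusion, mentioned above, has a simple standard form: each document scores the sum of 1 / (k + rank) over the rankings it appears in. The sketch below is a minimal illustration of that formula. The constant k = 60 is a common default and an assumption here (the summary does not state it), and the card IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with RRF.

    score(d) = sum over rankings of 1 / (k + rank of d), with ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical sparse (e.g. keyword) and dense (e.g. embedding) rankings.
sparse_ranking = ["card_a", "card_b", "card_c"]
dense_ranking = ["card_b", "card_d", "card_a"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

Because RRF only uses ranks, it needs no score calibration between the sparse and dense retrievers, which is one reason it is a popular choice for hybrid retrieval.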

EPIG: Emotion-Based Prompting for Personalised Image Generation

arXiv cs.AI · Emna Othmen, Mohamed Yassine Landolsi, Lotfi Ben Romdhane · 2026-06-11

EPIG introduces emotion-based prompting to enhance emotional expressiveness in text-to-image generation without modifying the diffusion model backbone. The method leverages valence-arousal representations and structured prompt enrichment to guide emotionally coherent visual outputs, particularly effective in controlling arousal. EPIG is lightweight, training-free, and suitable for resource-constrained scenarios. Experiments on 10 diverse prompts demonstrate statistically significant reductions in mean arousal error: 14% versus naive insertion and 12% versus LLM-based prompt expansion. Valence alignment and semantic consistency, measured by CLIPScore, are preserved. Improvements are most pronounced (17%) for prompts containing explicit subjects like humans or animals.

valence-arousal · prompt enrichment · arousal error · clipscore · diffusion model

Brick: Spatial Capability Routing for the Mixture-of-Models (MoM) Paradigm

arXiv cs.AI · Francesco Massa, Marco Cristofanilli · 2026-06-11

Brick introduces a multimodal routing system for the Mixture-of-Models (MoM) paradigm, addressing query difficulty estimation and model selection via six capability dimensions and cost-penalized geometric dispatch. It enables operators to balance quality and cost through a continuous preference knob. Evaluated on 5,504 queries, Brick achieves 76.98% accuracy at max-quality, outperforming single models and existing routers. At neutral cost-quality, it reduces cost by 4.71x with 74.11% accuracy, and at min-cost, it cuts cost by 22.15x with an 11.85-point accuracy loss. Median latency decreases from 51.2s to 22.8s.

mixture-of-models · multimodal router · capability dimensions · cost-penalized dispatch · query difficulty
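
To make the idea of a cost-quality preference knob concrete, here is a toy router. It is only a sketch under stated assumptions: the two models, their six-dimensional capability vectors, the costs, and the scoring rule are all hypothetical and are not taken from the Brick paper.

```python
import math

# Hypothetical model pool: six capability scores in [0, 1] and a relative cost.
MODELS = {
    "small": {"cap": (0.4, 0.5, 0.3, 0.4, 0.5, 0.3), "cost": 1.0},
    "large": {"cap": (0.9, 0.9, 0.8, 0.9, 0.8, 0.9), "cost": 20.0},
}


def shortfall(need, cap):
    """Euclidean size of the capability gap (zero when the model covers the need)."""
    return math.sqrt(sum(max(0.0, n - c) ** 2 for n, c in zip(need, cap)))


def route(need, knob):
    """Pick a model for a query with capability needs `need`.

    knob = 0.0 favors quality only; knob = 1.0 favors cost only.
    """
    max_cost = max(m["cost"] for m in MODELS.values())

    def penalty(name):
        m = MODELS[name]
        return (1 - knob) * shortfall(need, m["cap"]) + knob * m["cost"] / max_cost

    return min(MODELS, key=penalty)


hard_query = (0.8,) * 6
easy_query = (0.3,) * 6
print(route(hard_query, 0.0), route(hard_query, 1.0), route(easy_query, 0.5))
```

In this toy setup a hard query goes to the large model when quality is prioritized and to the small one when cost dominates, while an easy query stays on the small model at the neutral setting.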

Towards More General Control of Diffusion Models Using Jeffrey Guidance

arXiv cs.AI · Raphaël Razafindralambo, Rémy Sun, Frédéric Precioso, Jes Frellsen · 2026-06-11

The paper introduces Jeffrey guidance, a principled framework for extending control in diffusion models beyond standard guidance methods. The approach uses Jeffrey's rule of conditioning to update marginal distributions toward a target while preserving conditional structure and minimally perturbing joint distributions. Experiments demonstrate its effectiveness: targeting Inception embeddings reduces FID on CIFAR-10 and FFHQ, and enforcing attribute independence improves fairness on CelebA-HQ.

diffusion models · jeffrey guidance · conditional sampling · inception embeddings · fairness
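
Jeffrey's rule of conditioning, which the summary builds on, can be shown on a small discrete joint distribution: the marginal over one variable is moved to a target while every conditional given that variable is kept. The sketch below illustrates only this classical rule, not the paper's diffusion guidance procedure. The variable names and probabilities are made up.

```python
def jeffrey_update(joint, target_marginal):
    """Apply Jeffrey's rule to a discrete joint p(x, y).

    joint: dict mapping (x, y) -> probability.
    target_marginal: dict mapping y -> new probability q(y).
    Returns p'(x, y) = p(x | y) * q(y). Assumes every y has p(y) > 0.
    """
    old_marginal = {}
    for (_, y), p in joint.items():
        old_marginal[y] = old_marginal.get(y, 0.0) + p
    return {
        (x, y): p / old_marginal[y] * target_marginal[y]
        for (x, y), p in joint.items()
    }


joint = {
    ("cat", "dark"): 0.1,
    ("cat", "bright"): 0.3,
    ("dog", "dark"): 0.3,
    ("dog", "bright"): 0.3,
}
target = {"dark": 0.5, "bright": 0.5}
updated = jeffrey_update(joint, target)
print(updated)
```

After the update, the marginal over brightness matches the target, while the probability of "cat" given "dark" is still 0.25, as it was before.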

ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

arXiv cs.AI · Jiaxin Ai, Tao Hu, Xuemeng Yang, Shu Zou · 2026-06-11

The paper introduces ComAct, a COM-as-Action paradigm that reframes professional software manipulation as deterministic program synthesis via Component Object Model (COM) interfaces, addressing limitations of GUI-based and API-based approaches. The method includes ComCADBench (a novel CAD software benchmark), ComActor (a self-correcting agent trained through a three-stage framework), and ComForge (a scalable Windows container training platform). Experiments demonstrate ComActor's state-of-the-art performance on ComCADBench, with 100x improvement over GUI-based methods in long-horizon tasks and generalization to external CAD benchmarks.

component object model · program synthesis · cad software · self-correcting agent · windows containers

Decoding Insect Song: A Multitask Semisupervised Orthoptera Bioacoustic Classifier

arXiv cs.AI · Olga Isupova, Danil Kuzin, Ella Browning, Tom Mills · 2026-06-11

We present PULSE, a semi-supervised multitask framework for Orthoptera bioacoustic classification that addresses limitations in automated ecological monitoring tools. The method combines weakly-supervised species classification, self-supervised learning on unlabelled field audio, and knowledge distillation from a general-purpose bioacoustic model. The domain-adapted specialist model achieves superior performance over a state-of-the-art general model (macro F1: 0.21 vs. 0.07; AUC: 0.74 vs. 0.45; AP: 0.32 vs. 0.19), with active learning further improving metrics (F1: 0.34; AUC: 0.84). Learned embeddings encode ecologically meaningful structure, visualized through an interactive tool for ecological discovery.

orthoptera bioacoustics · semi-supervised learning · knowledge distillation · active learning · ecological monitoring

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

arXiv cs.AI · Sihwa Lee, Janghwan Lee, Donghoon Yoo, Jae Gon Kim · 2026-06-11

The paper introduces ReSET, a method for improving NVFP4-quantized large reasoning models (LRMs) by addressing accuracy degradation and latency issues. ReSET employs step-aware temperature scaling based on token-level and step-level entropy signals to mitigate incorrect sampling during reasoning. Additionally, a CUDA-core small-M NVFP4 kernel is designed to enhance latency-critical autoregressive decoding. Results show ReSET improves NVFP4 reasoning accuracy by up to ~2 points and achieves 2.5× kernel-level speedup over NVFP4 vLLM, with ~2× end-to-end decoding speedup over BF16.

nvfp4 · quantization · autoregressive decoding · temperature scaling · cuda-core
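
The general mechanics of entropy-driven temperature scaling can be sketched in a few lines. The rule below (switch to a lower temperature when token entropy exceeds a threshold) is an assumption made for illustration. The thresholds, temperatures, and function names are hypothetical, and the paper's actual step-aware schedule may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_temperature(logits, base_temp=1.0, low_temp=0.5, entropy_threshold=1.0):
    """Hypothetical rule: sample more conservatively when the model is uncertain."""
    if entropy(softmax(logits)) > entropy_threshold:
        return low_temp
    return base_temp


uncertain_logits = [0.0, 0.0, 0.0, 0.0]   # uniform over four tokens
confident_logits = [10.0, 0.0, 0.0, 0.0]  # one dominant token
print(pick_temperature(uncertain_logits), pick_temperature(confident_logits))
```

Lowering the temperature sharpens any non-uniform distribution, which is why it can reduce the chance of sampling a low-probability token that quantization noise has made slightly more likely.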

Proprioceptive-visual correspondence enables self-other distinction in humanoid robots

arXiv cs.AI · Yurun Chen, Tianyuan Gao, Yizhong Ge, Shikun Ban · 2026-06-11

The study demonstrates that humanoid robots can achieve self-other distinction through proprioceptive-visual correspondence, eliminating the need for identity labels or kinematic models. The method establishes a predictive self-model that maps joint configurations to 3D body occupancy, enabling the robot to adapt its body representation during action. Results show reliable self-identification in multi-agent scenarios, with applications in target reaching, collision-aware motion planning, and human-to-robot motion retargeting, advancing bodily self-representation for robots in shared environments.

proprioceptive-visual correspondence · self-other distinction · predictive self-model · humanoid robots · motion retargeting

LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis

arXiv cs.AI · Fabrizio Marozzo, Pietro Liò · 2026-06-11

The paper introduces LLM-as-an-Investigator, an evidence-first methodology to mitigate user-driven sycophancy in LLM-based problem diagnosis. The proposed Solution Investigator Agent assesses problem ambiguity, generates hypotheses, iteratively collects evidence via targeted questions, and updates probabilities until a robust solution emerges. Evaluated on technical forum threads using a three-agent pipeline (Problem-Solution Extractor, Ground-Truth Evaluator, and tested assistant), the approach outperforms standard assistants and reasoning-only baselines in diagnostic accuracy while reducing conversational bias induced by user hypotheses.

user-driven sycophancy · evidence-first reasoning · solution investigator agent · hypothesis probability · conversational bias
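
One simple way to picture "updating hypothesis probabilities as evidence arrives" is a Bayesian update over a small set of candidate causes. The sketch below assumes Bayes' rule as the update mechanism, which the summary does not confirm. The hypotheses and likelihood values are invented for illustration.

```python
def update_beliefs(prior, likelihood):
    """Bayes update: posterior(h) is proportional to prior(h) * P(evidence | h)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}


# The user suspects a driver bug, so it starts with the highest prior.
prior = {"driver_bug": 0.5, "bad_cable": 0.3, "config_error": 0.2}

# Hypothetical answer to a targeted question: "the issue disappears with another cable".
likelihood = {"driver_bug": 0.1, "bad_cable": 0.8, "config_error": 0.4}

posterior = update_beliefs(prior, likelihood)
best_hypothesis = max(posterior, key=posterior.get)
print(best_hypothesis, round(posterior[best_hypothesis], 3))
```

Here the evidence overrides the user's initial guess, which is the kind of behavior an evidence-first assistant is meant to show instead of agreeing with the user's hypothesis.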

Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints

arXiv cs.AI · Omar Alshahrani, Muzammil Behzad · 2026-06-11

The study proposes a cross-modality framework for analyzing hallucination in medical imaging AI, synthesizing peer-reviewed studies, benchmark datasets, and FDA guidance across five imaging modalities. It addresses three key questions: unifying hallucination taxonomies, comparing medical-specialized versus general-purpose foundation models, and evaluating mitigation strategies under regulatory constraints. Results show that general-purpose models outperform medical-specialized ones on hallucination benchmarks, with narrow domain fine-tuning potentially inducing overfitting. Effective mitigation combines physics-informed constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards, mapped to FDA lifecycle oversight frameworks.

hallucination taxonomy · foundation models · physics-informed constraints · chain-of-thought prompting · fda lifecycle oversight

A Minimal Model of Bounded Trade-Off Screening in Multi-Attribute Choice

arXiv cs.AI · Manisha Dubey, Anirban Sarkar, Subramanian Ramamoorthy · 2026-06-11

The authors propose a bounded trade-off reasoning framework for multi-attribute decision-making, addressing limitations of classical utility-based models that assume fully compensatory aggregation. The model introduces a trade-off tolerance parameter to govern a screening process evaluating gains and losses across attributes, allowing for context-dependent variation in acceptable imbalance. Simulations demonstrate that this mechanism produces distinct preference patterns compared to standard utility models, capturing context-dependent trade-off behavior. The results establish bounded trade-off screening as a plausible computational mechanism and generate testable predictions for future behavioral studies.

multi-attribute choice · trade-off tolerance · screening process · utility aggregation · context-dependent variation

ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning

arXiv cs.AI · Fuqiang Niu, Bowen Zhang · 2026-06-11

ARMOR-MAD introduces an adaptive routing framework for heterogeneous multi-agent debate (MAD) in large language model reasoning, addressing computational inefficiency and error amplification in fixed pipelines. The method integrates Pre-debate Agreement Routing (PAR) to assess debate necessity, Early Agreement Stopping Evaluator (EASE) for convergence detection, and Semantic Outlier Detection (SOD) for answer aggregation. Evaluated on MATH Level 5, GSM8K, MMLU, and MMLU-Pro, ARMOR-MAD reaches accuracies of 65.5%, 96.5%, 90.0%, and 81.5% respectively, improving over fixed-round heterogeneous debate. Results highlight the importance of model heterogeneity and agreement-based control for enhancing MAD accuracy and efficiency.

multi-agent debate · adaptive routing · conditional computation · semantic outlier detection · early agreement stopping
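
The idea of routing on pre-debate agreement can be sketched as a simple gate: if the agents' initial answers already agree strongly enough, skip the debate and return the majority answer. This is a minimal illustration under assumptions. The function name, the exact-match agreement measure, and the threshold are hypothetical and not taken from the paper.

```python
from collections import Counter


def needs_debate(initial_answers, agreement_threshold=1.0):
    """Return (debate_needed, majority_answer) for a list of initial answers.

    Agreement is the share of agents giving the most common answer;
    debate is skipped when it reaches the threshold.
    """
    top_answer, count = Counter(initial_answers).most_common(1)[0]
    agreement = count / len(initial_answers)
    return agreement < agreement_threshold, top_answer


unanimous = needs_debate(["42", "42", "42"])
split = needs_debate(["42", "41", "42"])
lenient = needs_debate(["42", "41", "42"], agreement_threshold=0.6)
print(unanimous, split, lenient)
```

A gate like this saves debate rounds on easy questions. The same agreement check could in principle be reused after each round as an early-stopping test.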

Under What Conditions Can a Machine Become Genuinely Creative?

arXiv cs.AI · Yong Zeng · 2026-06-11

The paper develops a requirement framework for genuine machine creativity based on Designics, proposing ten conditions organized by three laws (perception, conflict, capability). It argues creativity requires structural transformation through recursive intervention dynamics, not just output novelty, with computational tractability demonstrated via cyber-physical and cyber-biological case studies. The analysis positions open-ended systems, foundation models, and agentic workflows as incomplete solutions, emphasizing that proactive AI ethics must be internal to creative systems through value-based scoping and human-AI co-living.

designics · recursive intervention · value-based scoping · agentic workflows · foundation models

Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach

arXiv cs.AI · Ruichao Mao, Zhou Fang, Teng Guo, Hao Yang · 2026-06-11

The paper introduces UXBench, a multimodal benchmark with 2,000 VQA samples for evaluating MLLMs' UI-based reasoning across 8 UX tasks (including layout, hierarchy, and consistency). It proposes UI-UX, a Qwen3-VL-4B-Thinking-based MLLM enhanced via reinforcement learning with reward routing and asymmetric transition rewards. UI-UX reaches a state-of-the-art accuracy of 0.7963 on UXBench (vs. 0.6550 for Claude-4.5-Sonnet) while maintaining low latency and task generalization.

multimodal llms · user experience · visual question answering · reinforcement learning · interface reasoning

Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework

arXiv cs.AI · Abhishek H S, Akash Ganamukhi, Abhimanyu Suresh, Aditya G Hiremath · 2026-06-11

The study introduces an end-to-end framework for direct cardiac mesh reconstruction from 3D medical images, bypassing traditional segmentation and mesh generation pipelines. A 3D Swin Transformer encoder-decoder extracts volumetric features, while a Graph Attention Network (GAT) deforms a template mesh to fit cardiac boundaries. Evaluated on MM-WHS 2017, the method achieves competitive segmentation (Dice 0.84 CT, 0.83 MRI) and high mesh quality (1.8 mm mean Chamfer distance, 95th-percentile surface distance <5 mm). The approach eliminates manual post-processing, enabling rapid, simulation-ready mesh generation for clinical digital twins.

transformer · graph attention network · mesh reconstruction · chamfer distance · digital twin
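
For readers unfamiliar with the two metrics quoted above, the sketch below computes a Dice score on voxel sets and a symmetric Chamfer distance on point sets. Chamfer conventions vary (sum or mean, squared or unsquared distances). This version averages unsquared nearest-neighbor distances in both directions, which is an assumption and may not match the paper's definition.

```python
import math


def dice(pred, truth):
    """Dice overlap between two sets of voxel indices."""
    if not pred and not truth:
        return 1.0
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def chamfer(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance, both directions."""

    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(points_a, points_b) + one_way(points_b, points_a))


overlap = dice({1, 2, 3}, {2, 3, 4})
shifted = chamfer([(0, 0, 0)], [(0, 0, 1)])
uneven = chamfer([(0, 0, 0), (1, 0, 0)], [(0, 0, 0)])
print(overlap, shifted, uneven)
```

This brute-force Chamfer loop is quadratic in the number of points. Real mesh pipelines typically use a spatial index such as a k-d tree instead.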

Modern analog computing for solving differential and matrix equations

arXiv cs.AI · Zhong Sun, Piergiulio Mannocci, Manuel Le Gallo, Abu Sebastian · 2026-06-11

The paper presents a unified framework for modern analog computing, focusing on three computational primitives: solving differential equations, matrix equations, and matrix-vector multiplications. It analyzes hardware implementations using analog CMOS circuits and resistive memory arrays, with resistive memory emerging as particularly efficient. The survey highlights applications, precision/scalability challenges, and connections to in-memory computing, positioning analog computing as key for next-generation computational demands.

analog computing · resistive memory · matrix equations · cmos circuits · in-memory computing

MemRefine: LLM-Guided Compression for Long-Term Agent Memory

arXiv cs.AI · Minjae Kim, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang · 2026-06-11

MemRefine introduces an LLM-guided framework for storage-budgeted memory management in long-term LLM agent interactions, addressing unbounded memory growth and redundancy. It employs surface similarity to propose candidate pairs and leverages an LLM judge to make delete, merge, and preserve decisions based on factual content, iterating until the budget is met. Evaluated across multiple memory frameworks and long-term conversation benchmarks, MemRefine consistently meets target budgets, preserves downstream performance, and outperforms rule-based baselines under tight memory constraints.

memory management · llm-guided framework · storage-budgeted · surface similarity · factual content
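
The propose-judge-iterate loop described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: `stub_judge` is a rule-based placeholder standing in for the LLM judge, surface similarity uses Python's `difflib`, and the budget is counted in entries.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Surface similarity between two memory entries, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def stub_judge(a, b):
    """Placeholder for an LLM judge: returns (action, resulting_entry)."""
    if a.lower() == b.lower():
        return "delete", a                   # duplicate: keep one copy
    if similarity(a, b) > 0.8:
        return "merge", max(a, b, key=len)   # crude merge: keep the longer entry
    return "preserve", None


def compress(memory, budget, judge=stub_judge):
    """Shrink `memory` to at most `budget` entries where the judge allows it."""
    memory = list(memory)
    preserved = set()
    while len(memory) > budget:
        pairs = [
            (similarity(a, b), a, b)
            for a, b in combinations(memory, 2)
            if (a, b) not in preserved
        ]
        if not pairs:
            break  # every remaining pair was judged worth preserving
        _, a, b = max(pairs)  # most similar candidate pair first
        action, result = judge(a, b)
        if action == "preserve":
            preserved.add((a, b))
            continue
        memory.remove(a)
        memory.remove(b)
        memory.append(result)
    return memory


entries = ["User likes tea", "user likes tea", "User lives in Oslo", "User has a dog named Rex"]
compressed = compress(entries, budget=3)
print(compressed)
```

Splitting the work this way keeps the expensive judge calls focused on the few pairs that cheap similarity flags, while the judge has the final say so that similar-looking but factually distinct entries survive.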

Mental-R1: Aligning LLM Reasoning for Mental Health Assessment

arXiv cs.AI · Xin Wang, Boyan Gao, Yibo Yang, David A. Clifton · 2026-06-11

The authors propose Cognitive Relative Policy Optimization (CRPO), a reinforcement learning framework for mental health assessment that aligns large language model reasoning with human cognitive processes. CRPO extends group relative policy optimization by incorporating stage-dependent uncertainty modeling through a stage-wise entropy regularization mechanism, which transitions from broad exploration to confident decision-making. The framework formalizes cognitive reasoning stages based on cognitive appraisal theory, enabling theory-grounded interpretable inference. Evaluated on 8 mental health datasets, CRPO achieves a 10.4 percentage point improvement in weighted F1-score over the best reinforcement learning baseline. The CRPO-trained model Mental-R1 demonstrates superior reasoning capabilities compared to existing large language models.

cognitive relative policy optimization · stage-wise entropy regularization · cognitive appraisal theory · mental health assessment · reinforcement learning

NTS-CoT: Mitigating Hallucinations in LLM-based News Timeline Summarization with Chain-of-Thought Reasoning

arXiv cs.AI · Feng Lyu, Huiqin Yan, Sijing Duan, Hao Wu · 2026-06-11

We propose NTS-CoT, a novel framework leveraging Chain-of-Thought reasoning to mitigate hallucinations in LLM-based news timeline summarization. NTS-CoT addresses two hallucination types—unfaithful content and information omission—through three modules: Element-CoT captures essential news elements, Date Selection combines temporal saliency and event prominence for timestamp selection, and Causal-CoT infers causal relationships to reduce omissions. Extensive experiments on three TLS benchmarks demonstrate NTS-CoT's superiority over state-of-the-art baselines in mitigating hallucinations and improving summarization performance, validated through quantitative analysis and human evaluation.

chain-of-thought · hallucination mitigation · timeline summarization · temporal saliency · causal inference

Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

arXiv cs.AI · Animesh Tripathy, Aswanth Krishnan · 2026-06-11

We propose Iterative Visual Thinking (IVT), a closed-loop framework enabling vision-language models (VLMs) to refine spatial predictions through visual feedback. IVT employs a two-phase training approach: first, leveraging the base model's predictions to generate corrective reasoning traces via a teacher VLM; second, applying Group Relative Policy Optimization (GRPO) with an IoU reward to stabilize multi-step refinement. Evaluated on RefCOCOg, Ref-Adv, and Ref-L4 (505 test samples), IVT surpasses single-shot baselines, improving Acc@0.5 to 82.0% (+2.4pp), Acc@0.7 to 74.1% (+3.2pp), and Acc@0.9 to 48.3% (+2.8pp). GRPO reduces per-step IoU degradation by 5x, demonstrating efficient spatial self-correction with only 2,400 samples on a single GPU.

iterative visual thinking · vision-language models · group relative policy optimization · spatial grounding · iou reward
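
The IoU reward and the Acc@0.5 / Acc@0.7 / Acc@0.9 metrics above all rest on the same box-overlap computation. A minimal sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) format. The box format and function names are assumptions, not details from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(preds, targets, threshold):
    """Share of predictions whose IoU with the target reaches the threshold."""
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
targets = [(0, 0, 2, 2), (1, 1, 3, 3)]
partial_overlap = iou(preds[1], targets[1])
acc_half = accuracy_at(preds, targets, 0.5)
print(partial_overlap, acc_half)
```

Because IoU is a continuous value in [0, 1], it can serve directly as a dense reward for refinement steps, while the thresholded accuracies report how often the final box is good enough.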

TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?

arXiv cs.AI · Dat Tien Nguyen, Thao Nguyen, Fadillah Adamsyah Maani, Huy M. Le · 2026-06-11

The paper introduces TerraBench, a benchmark for Earth-science reasoning that integrates heterogeneous data types (gridded data, satellite imagery, geospatial context) through TerraAgent, a ReAct-style framework coupling LLM planning with scientific tools. The benchmark comprises 403 agentic tasks across three tracks and eight domains, totaling 24,500 verified execution steps. Results highlight the need for agents to coordinate workflows, parameterize tools precisely, and maintain artifact provenance, advancing beyond isolated task performance. TerraBench is the first to combine process-level tool-use metrics with tolerance-aware numeric scoring in this domain.

earth-science reasoning · react-style framework · heterogeneous data · tool-use metrics · artifact provenance

Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

arXiv cs.AI · Yuho Lee, Jisu Shin, Nicole Hee-Yeon Kim, Jihwan Bang · 2026-06-11

We introduce V-RAGBench, a benchmark for evaluating retrieval-augmented generation in long videos, and CARVE, a method that addresses limitations in VideoRAG by running parallel retrievers across modality-granularity configurations and employing chunk-adaptive reranking. CARVE selects the optimal configuration per chunk, enabling interleaved evidence forms where chunk-level decisions propagate through retrieval and generation stages. This approach outperforms eight VideoRAG baselines, demonstrating the effectiveness of interleaving multiple configurations rather than using a single query-level configuration.

retrieval-augmented generation · videorag · chunk-adaptive reranking · interleaved evidence · modality-granularity

Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation

arXiv cs.AI · Elena S. Kozachok, Sergey S. Seregin, Aleksandr V. Kozachok, Ilya P. Latyshev · 2026-06-11

This study evaluates deep learning architectures and classification schemes for dermoscopic images of skin neoplasms, focusing on generalization from international datasets to Russian clinical practice. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared using binary, single-stage four-class, and two-stage cascade classification schemes. Results show a generalization gap, with ROC-AUC dropping from 0.952-0.966 internally to 0.797-0.893 on external clinical data, and sensitivity decreasing to 0.53-0.67. The cascade scheme improved macro F1 scores, particularly for ViT-B/16, by recovering malignant lesions misclassified as benign. External clinical validation and recalibration are recommended before deployment.

dermoscopic images · generalization gap · cascade classification · roc-auc · macro f1
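
One common way to make sensitivity controllable in a binary malignant-vs-benign stage is to choose the decision threshold from validation data so that a target sensitivity is met. The sketch below shows that generic technique. It is an assumption that the paper does something comparable, and the scores and labels are invented.

```python
import math


def threshold_for_sensitivity(scores, labels, target):
    """Largest threshold t such that predicting malignant when score >= t
    reaches at least `target` sensitivity on the given validation data.

    scores: predicted malignancy probabilities; labels: 1 = malignant, 0 = benign.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    needed = max(1, math.ceil(target * len(positives)))
    return positives[needed - 1]


def sensitivity(scores, labels, threshold):
    """Share of malignant cases flagged at the given threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)


val_scores = [0.9, 0.8, 0.4, 0.2, 0.7, 0.1]
val_labels = [1, 1, 1, 1, 0, 0]
chosen = threshold_for_sensitivity(val_scores, val_labels, target=0.75)
print(chosen, sensitivity(val_scores, val_labels, chosen))
```

A threshold tuned on one population may not hold on another, which is consistent with the summary's recommendation to recalibrate on external clinical data before deployment.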

MiniPIC: Flexible Position-Independent Caching in <100LOC

arXiv cs.AI · Nathan Ordonez, Thomas Parnell · 2026-06-11

MiniPIC introduces a minimalistic Position-Independent Caching (PIC) design for vLLM, enabling flexible KV cache reuse without requiring identical prefixes. The method combines positional-encoding-free KV storage with three user-facing primitives (block-aligned padding, span separator, and prompt depend) that modify cache hashing and attention structure. Implemented in <100 LOC, MiniPIC supports multiple PIC methods (Block-Attention, EPIC, Prompt Cache) within vLLM, achieving 49% prefill throughput improvement on 2WikiMultihopQA, 100x faster cached-span processing, and only 5.7% worst-case overhead while maintaining linear uncached-span scaling.

position-independent caching · kv cache · vllm · prefill throughput · rope attention

Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

arXiv cs.AI · Akshay Krishnamurthy, Audrey Huang, Nived Rajaraman · 2026-06-11

The study provides mechanistic insights into reinforcement learning (RL) post-training for reasoning tasks, identifying two core mechanisms: strategy selection and strategy improvement. Through controlled math reasoning experiments with Qwen-2.5-1.5B, the authors demonstrate that supervised fine-tuning (SFT) data enables strategy selection by exposing the model to diverse reasoning strategies, while RL data with increasing difficulty facilitates strategy improvement. Results highlight the complementary roles of SFT and RL data in enhancing reasoning capabilities, offering practical interventions for scaling such models.

reinforcement learning · strategy selection · strategy improvement · supervised fine-tuning · math reasoning

NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation

arXiv cs.AI · Dongwook Lee, Youngho Cho, Sangkwon Park, Heeseung Kim · 2026-06-11

The paper introduces NaturalFlow, a fluency-aware optimization framework for simultaneous speech-to-speech translation that balances low latency with natural speech flow. The method minimizes disruptive inter-chunk silences by leveraging model-internal signals like linguistic diversity and temporal variability in speech durations. Experiments on short- and long-form benchmarks demonstrate that NaturalFlow maintains competitive latency and translation quality while producing more natural acoustic flow compared to conventional chunk-wise approaches.

simultaneous translation · speech fluency · latency optimization · linguistic diversity · temporal variability

MP3: Multi-Period Pattern Pre-training for Spatio-Temporal Forecasting

arXiv cs.AI · Lilan Peng, Yandi Liu, Qingren Yao, Chongshou Li · 2026-06-11

We propose Multi-Period Pattern Pre-training (MP3), a plug-and-play pre-training plugin for spatio-temporal forecasting that addresses temporal mirages in urban data. MP3 introduces multi-period pattern learning through edge convolution for temporal modeling, a bottleneck projection with a global memory bank for spatial modeling, and a causality-enhanced Transformer for cross-period pattern interaction. The plugin integrates seamlessly with existing spatio-temporal graph neural networks (STGNNs), enhancing their forecasting capabilities. Evaluations on five STGNN baselines across five real-world datasets demonstrate MP3's effectiveness, reducing MAE by 4.7% and RMSE by 5.0% on average. Code is available at https://github.com/YAN-outlook/MP3.

spatio-temporal forecasting · temporal mirage · edge convolution · global memory bank · causality-enhanced transformer

G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

arXiv cs.AI · Minjun Choi, Yoonjin Jang, Sangwon Youn, Youngjoong Ko · 2026-06-11

G-Long introduces a graph-enhanced framework for efficient long-term dialogue agents, addressing LLMs' limitations in long-context reasoning and computational inefficiency. The method employs a fine-tuned small language model (sLM) for structured triplet extraction and associative retrieval, coupled with an attention-aware importance scoring mechanism that uses the cross-attention signals of a T5 summarizer. Experiments show state-of-the-art performance, with a 9.8% response quality improvement on MSC and a 40.8% retrieval recall gain on LME, while reducing computational overhead.

long-term dialogue · graph-enhanced framework · triplet extraction · attention-aware scoring · associative retrieval

Functional Cache Grafting: Robust and Rapid Code-Policy Synthesis for Embodied Agents

arXiv cs.AI · Saehun Chun, Wonje Choi, Sera Choi, Sanghyun Ahn · 2026-06-11

FCGraft introduces Functional Cache Grafting to enhance code-policy synthesis for embodied agents by addressing delayed decoding and robustness issues in CodeLLMs. The framework maintains a library of validated code skeletons and their KV-caches, synthesizing policies through cache grafting via stitching and patching. This approach reduces generation latency by eliminating redundant prefill computation and improves robustness by reusing validated control structures. FCGraft outperforms RAGCache with an 18.31% higher task success rate and 2.3x faster policy synthesis.

functional cache grafting · kv-caches · code-policy synthesis · embodied agents · prefill computation

Emotional regulation improves deep learning-based image classification

arXiv cs.AI · Riccardo Emanuele Landi, João M. F. Rodrigues, Marta Chinnici · 2026-06-11

The study introduces Emotional Regulation, a novel framework for emotion-augmented deep learning that incorporates artificial subjective experience to improve image classification. The method employs pre-training on affective stimuli, balancing non-emotional and emotionally-influenced responses during downstream task optimization. In the experiments, ResNet and Vision Transformer architectures were pre-trained on four emotional datasets and evaluated on the CIFAR-10 and CIFAR-100 benchmarks. Results demonstrate state-of-the-art performance among emotion-augmented deep learning methods on both benchmarks. The findings highlight the impact of affective states in optimizing machine learning tasks and encourage further exploration of emotion-inspired architectures.

emotional regulation · affective stimuli · artificial subjective experience · vision transformer · image classification

The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems

arXiv cs.AI · Jiaqi Luo, Jiarun Dai, Zhile Chen, Jia Xu · 2026-06-11

We introduce a novel evaluation framework for assessing autonomous penetration capabilities in LLM-powered AI systems, addressing limitations of prior methodologies. The framework comprises target servers (300 instances across Tier 1 and Tier 2 environments) and agent scaffolding with general-purpose cybersecurity tools, avoiding target-specific prior knowledge. Testing 19 open-weight and proprietary LLMs reveals penetration success rates ranging from 10.7% to 69.3%, demonstrating that autonomous penetration capability scales with overall model advancement.

autonomous penetration · llm-powered ai · cybersecurity tools · target servers · agent scaffolding

"Is This Not Enough?": Asymmetries in Institutional Accountability and Collective Sensemaking in the Case of Canada's Algorithmic Visa Triage System

arXiv cs.AI · Dipto Das, Matthew Tamura, Syed Ishtiaque Ahmed, Shion Guha · 2026-06-11

The study reveals structural asymmetries in Canada's algorithmic visa triage system by contrasting institutional accountability mechanisms with applicant experiences. Using the ADMAPS framework to analyze Immigration, Refugees and Citizenship Canada's Algorithmic Impact Assessment and mixed-methods analysis of Reddit discussions, the research identifies three key asymmetries: epistemic (access to decision logic), jurisdictional (geopolitical exposure), and temporal-relational (waiting uncertainty). Findings demonstrate how public-sector algorithmic governance produces uneven experiences not captured by disclosure frameworks, necessitating methodological extensions to ADMAPS for transnational contexts.

algorithmic accountability · transnational migration · collective sensemaking · impact assessment · public-sector algorithms

TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

arXiv cs.AI · Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Xing Hu · 2026-06-11

TWLA introduces a post-training quantization framework for large language models, achieving 1.58-bit weight compression and 4-bit activation quantization while preserving accuracy. The method comprises three components: E2M-ATQ minimizes layer-output error via Euclidean-to-manifold optimization, KOTMS reshapes weights into ternary-friendly distributions using Kronecker-structured orthogonal rotation, and ILA-AMP optimizes bit allocation by considering inter-layer second-order interaction costs. Experiments demonstrate that TWLA maintains high accuracy under W1.58A4 quantization, enabling significant inference acceleration.

ternarization · post-training quantization · kronecker-structured · activation quantization · inter-layer optimization
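
The "1.58-bit" figure comes from the information content of a three-valued weight: log2(3) is about 1.585 bits. The sketch below shows that arithmetic together with a simple absmean ternarizer. The ternarizer is a swapped-in baseline technique chosen for illustration. It is not TWLA's E2M-ATQ, KOTMS, or ILA-AMP.

```python
import math

# A weight restricted to {-1, 0, +1} carries log2(3) bits of information.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def ternarize(weights):
    """Absmean ternarization (illustrative baseline, not the TWLA method).

    Scales by the mean absolute weight, rounds, and clips to {-1, 0, +1}.
    Returns (codes, scale) so that weight is roughly code * scale.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0:
        return [0] * len(weights), 0.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


codes, scale = ternarize([0.9, -0.05, -1.2, 0.3])
print(f"{BITS_PER_TERNARY_WEIGHT:.3f} bits per weight")
print(codes, scale)
```

Naive rounding like this discards much of the weight distribution's shape, which hints at why post-training ternarization needs additional machinery to preserve accuracy.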

EA-WM: Event-Aware World Models with Task-Specification Grounding for Long-Horizon Manipulation

arXiv cs.AI · Kailin Wang, Haoxiang Jie, Yaoyuan Yan, Jiacheng Zhou · 2026-06-11

EA-WM introduces an event-aware world-model framework for long-horizon manipulation by augmenting pretrained visual-feature dynamics with task-specification-grounded event prediction and verification. The method rolls out candidate futures in visual-feature space, decodes them into structured event states, and scores them using task-progress, semantic-consistency, physical-feasibility, and uncertainty terms. Results show improved interpretability and task alignment in navigation, deformable-object, and language-described manipulation tasks, particularly in the LIBERO wine-rack setting.

world models · event prediction · task-specification grounding · visual-feature dynamics · long-horizon manipulation

AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction

arXiv cs.AI · Fabien Maury, Solène Grosdidier, Maud de Dieuleveult, Adrien Coulet · 2026-06-11

We introduce AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus), a manually annotated corpus of 115 PubMed abstracts for autoimmunity information extraction, focusing on autoimmune diseases, autoantibodies, molecular targets, body locations, and clinical signs. The corpus was used to evaluate and fine-tune named entity recognition (NER) models, demonstrating improved performance post-finetuning. This highlights the utility of domain-specific annotation efforts in enhancing computational methods for specialized biomedical fields. AAbAAC is publicly available at https://github.com/f-maury/AAbAAC.

autoimmunity · named entity recognition · autoantibodies · annotation · biomedical

Augmentation techniques for video surveillance in the visible and thermal spectral range

arXiv cs.AI · Vanessa Buhrmester, Ann-Kristin Grosselfinger, David Munch, Michael Arens · 2026-06-11

This study investigates augmentation techniques for multispectral CNN-based object detection in video surveillance systems combining visible and thermal infrared imagery. The authors analyze how variations in thermal radiation, shape, and color information impact classification accuracy, addressing challenges in obtaining sufficient thermal infrared datasets for training deep neural networks. Through empirical evaluation of different augmentation methods, the research aims to improve robustness and decision-making capabilities of CNNs when processing multimodal sensor data from both spectral ranges.

multispectral object detection · convolutional neural networks · thermal infrared imagery · data augmentation · video surveillance

Fault Lines: Navigating Ethics and Responsible AI Where National Policy Meets Local Practice in Public Sector Transformation

arXiv cs.AI · Sitong Lyu, Shabnam Taghiyeva, Mohit Kukadia, Denis Newman-Griffis · 2026-06-11

This paper investigates the implementation of responsible AI in UK public sector transformation, focusing on the national-local policy interface in Special Educational Needs and Disabilities (SEND). Through thematic analysis of 17 semi-structured interviews with policymakers, practitioners, and third-sector professionals, the study identifies five key challenges: shadow AI usage, data privacy risks, market-government asymmetry, workforce readiness gaps, and accountability deficits. The analysis reveals how high-stakes decisions in SEND amplify tensions around fairness and human oversight, exposing limitations of principle-based regulation. The findings suggest that responsible AI adoption requires both national policy adjustments and local institutional reforms in governance mechanisms and workforce capacity.

responsible ai · public sector · thematic analysis · accountability · governance mechanisms

Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior

arXiv cs.AI · Haowei Qian · 2026-06-11

Nous proposes a method to extract and inject human cognitive diversity into LLM agents to mitigate cognitive monoculture in prediction markets. The approach extracts an eight-dimension behavioral profile from Polymarket trading activity and injects it via prompts. Results show partial success in extraction: 8 of 14 parameters are temporally stable (split-half ICC ≥ 0.5), wallets are identifiable (top-1 retrieval 17-22%), and two dimensions correlate with future profit. However, prompt-level injection fails to transmit diversity, showing no advantage on semantic embedding metrics or ensemble error reduction. The study highlights the limits of prompt-level remedies and suggests deeper methods like fine-tuning.

cognitive monoculture · behavioral profile · prediction markets · prompt-level injection · ensemble error correlation

TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment

arXiv cs.AI · Yu Meng, Xiangyang Luo, Letian Li, Wenyuan Jiang · 2026-06-11

TetherCache introduces a training-free cache management strategy for stabilizing autoregressive long-form video generation. It employs GRAB (Gated Recall with Attention-Diversity Balancing) to select diverse long-range memory frames and TAME (Trusted Alignment via Memory Editing) to align drifted historical features with trusted context distributions. Evaluated on VBench-Long, TetherCache reduces quality drift from 7.84 to 1.33 in 240s generations while improving semantic and overall scores across 30s, 60s, and 240s settings.

autoregressive · kv-cache · context distribution shift · gated recall · trusted alignment

Democracy in the Era of Artificial Intelligence

arXiv cs.AI · Evangelos Pournaras, Srijoni Majumdar, Carina Hausladen, Dirk Helbing · 2026-06-11

The handbook examines AI's dual role in democracy, addressing opportunities for enhanced participation and risks like bias and misinformation. Through 34 interdisciplinary chapters, it explores AI's potential to empower collective intelligence (Part 1), the future of deliberative democracy using LLMs (Part 2), resilient self-governance systems (Part 3), and transformation challenges (Part 4). The work proposes new values and design principles for democratic resilience, concluding with reimagined AI-democracy interplay (Part 5).

democracy · collective intelligence · deliberative democracy · large language models · self-governance

CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts

arXiv cs.AI · Bo Liu, Di Dai, Jingwei Liu, Jiarui Jin · 2026-06-11

CausalMoE introduces a billion-scale multimodal foundation model for Granger Causal Discovery (GCD), addressing limitations of existing methods in handling distribution shifts and dynamic regime changes. The model employs a Pattern-Routed Mixture of Heterogeneous Experts to dynamically identify latent temporal patterns and route patches to specialized domain experts, decoupling regime-specific mechanisms from shared dynamics. It incorporates a Causality-Aware Self-Attention mechanism for interpretable graph recovery and integrates LLMs and VLMs to align numerical signals with textual and visual priors. Experiments show CausalMoE achieves state-of-the-art performance on supervised benchmarks and generalizes effectively in few-shot settings.

granger causal discovery · pattern-routed mixture · heterogeneous experts · causality-aware self-attention · multimodal foundation model

SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

arXiv cs.AI · Pierre Beckmann, Marco Valentino, Andre Freitas · 2026-06-11

SciR introduces a controllable benchmark for evaluating scientific reasoning in LLMs, addressing limitations of existing benchmarks through multi-paradigm inference (deduction, induction, causal abduction) and parametric control over extraction and inference difficulty. Tasks are generated from formal objects (deduction trees, inductive rule hypotheses, causal graphs) and rendered into domain-specific scientific discourse, ensuring verifiable answers. Experiments with six models reveal that both extraction and inference difficulty axes degrade performance, with compounding effects, even for neurosymbolic pipelines. The benchmark enables per-model profiling of extraction-vs-inference capabilities, showing reasoning models like deepseek-r1 outperform instruct models on inference tasks. SciR is the first benchmark combining multi-paradigm scientific reasoning with parametric control over both difficulty axes.

multi-paradigm inference · deduction tree · causal abduction · neurosymbolic pipelines · parametric control

Otters++: A Time-to-first-spike Based Energy Efficient Optical Spiking Transformer

arXiv cs.AI · Zhanglu Yan, Jiayi Mao, Kaiwen Tang, Fanfan Li · 2026-06-11

Otters++ introduces an energy-efficient optical spiking Transformer leveraging time-to-first-spike (TTFS) coding by utilizing natural signal decay in In₂O₃ optoelectronic synapses, eliminating explicit digital decay computation. The method establishes layer-wise equivalence between Otters++ and quantized neural networks (QNNs), employing hybrid training with device-faithful SNN forward passes and QNN straight-through gradients, augmented by model distillation and noise-aware training. System-level energy modeling incorporates device sharing and multi-hop communication. On the GLUE dataset, Otters++ achieves an average score of 84.17% while maintaining energy efficiency over prior spiking Transformer baselines, demonstrating TTFS computing's efficiency, trainability, and robustness under hardware effects.

time-to-first-spike · optoelectronic synapse · quantized neural network · model distillation · spiking transformer

scLLM-DSC: LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering for Single-Cell RNA Sequencing

arXiv cs.AI · Ping Xu, Pengjiang Li, Tian Du, Zaitian Wang · 2026-06-11

The authors propose scLLM-DSC, a Large Language Model (LLM)-enhanced framework for single-cell RNA sequencing clustering that integrates biological semantics with transcriptomic features. The method combines a Knowledge-Driven Semantic View, leveraging NCBI gene priors and Cell2Sentence embeddings, with a Structure-Aware Topological View extracted via a graph-guided encoder. A cross-modal contrastive alignment mechanism ensures consistency between semantic and structural representations in a unified latent space. Evaluations show scLLM-DSC outperforms eleven state-of-the-art baselines in clustering accuracy, addressing the semantic agnosticism of existing numerical statistical approaches.

single-cell rna sequencing · large language model · cross-modal alignment · graph-guided encoder · contrastive learning

The Illusion of Multi-Agent Advantage

arXiv cs.AI · Prathyusha Jwalapuram, Hehai Lin, Chuyuan Li, Fangkai Jiao · 2026-06-11

The study challenges the presumed superiority of Multi-Agent Systems (MAS) over Single-Agent Systems (SAS) by demonstrating that automatically generated MAS underperform Chain-of-Thought with Self-Consistency (CoT-SC) on reasoning tasks and interactive workflows, despite higher computational costs (up to 10x). Using a diagnostic synthetic dataset designed for MAS evaluation, the authors show that expert-architected MAS outperform automated designs in both performance and cost-efficiency. Analysis reveals that current automated MAS designs suffer from architectural bloat, prioritizing superficial complexity over functional utility.

multi-agent systems · chain-of-thought · self-consistency · architectural bloat · task decomposition
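The single-agent baseline named in this item, Chain-of-Thought with Self-Consistency, samples several reasoning chains and returns the most common final answer. A minimal sketch of that voting step follows. The sampler is a hypothetical stand-in for an LLM call (it replays canned answers), and none of the names come from the paper.

```python
from collections import Counter

def self_consistency(sample_answer, question, n_samples=5):
    """Sample n final answers and return the majority answer plus its vote share."""
    answers = [sample_answer(question, i) for i in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

def fake_sampler(question, seed):
    """Hypothetical stand-in for one sampled chain-of-thought run.

    A real sampler would call a model with temperature > 0 and
    parse the final answer out of the generated reasoning.
    """
    canned = ["42", "42", "41", "42", "40"]
    return canned[seed % len(canned)]

answer, agreement = self_consistency(fake_sampler, "6 * 7 = ?")
```

The vote share doubles as a cheap confidence signal, which is part of why this baseline is hard to beat per token spent.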

APCyc: Property-Informed Design of Cyclic Peptides via Automated Cyclization

arXiv cs.AI · Yifan Zhao, Lang Qin, Jintai Chen · 2026-06-11

APCyc introduces a target-aware de novo cyclic peptide generation framework that explicitly models cyclization and jointly optimizes multiple physicochemical properties. The method employs an expanded residue vocabulary, encodes cyclization-site and linkage-type information, and leverages Bayesian posterior guidance to steer sampling toward cyclic peptides satisfying property objectives. Experimental results demonstrate that APCyc learns target-dependent cyclization preferences and enables effective multi-property optimization for cyclic peptide design. The framework addresses limitations of generative models trained on linear peptide data by capturing cyclization-specific constraints.

cyclic peptides · de novo design · bayesian posterior guidance · physicochemical properties · cyclization-site

A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis

arXiv cs.AI · Manex Atxa, Bruno Simoes, Julen Balzategui · 2026-06-11

A novel framework for real-time personalized ergonomic pose analysis is introduced, leveraging 3D volumetric video data to overcome viewpoint limitations in traditional 2D camera systems. The methodology employs a personalized deep learning classifier trained on manually labeled poses from RGB-D camera data, enabling real-time skeletal labeling and inference on streaming data. A case study involving load-lifting tasks demonstrated the system's capability to perform continuous pose analysis from multiple angles, addressing occlusion issues. This scalable approach integrates state-of-the-art 3D data technologies with 2D pose estimation algorithms, advancing workplace safety and health monitoring.

ergonomic pose analysis · volumetric video · rgb-d cameras · deep learning classifier · skeletal labeling

Diffusion Transformer World-Action Model for AV Scene Prediction

arXiv cs.AI · Ruslan Sharifullin, Benjamin Jiang, Kai Xi Chew · 2026-06-11

The paper introduces a latent Diffusion Transformer (DiT) world-action model for autonomous vehicle scene prediction, addressing the distortion metric bias favoring blurry regression means over realistic predictions. The method employs a V-JEPA2 encoder for temporal context, a DiT with spatial tokens and x_0 objective, and a Stable-Diffusion-VAE pipeline, evaluated on 150 nuScenes scenes. Results show 4.8× better KID (0.078 vs 0.375) than regression, with action-controllability (Spearman ρ=0.81) and a 1.7M-parameter 'jump' model recovering full motion magnitude (1.02× GT).

diffusion transformer · world-action model · autonomous vehicles · distortion metrics · scene prediction

Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation

arXiv cs.AI · En-Ming Huang, Yu-Hung Kao, Ren-Hao Deng, Wei-Po Hsin · 2026-06-11

The paper introduces STG, a Structured Testbench Generation framework for LLM-driven HDL design verification, addressing limitations of prompt-based methods by leveraging hardware design structure for deterministic testbenches. STG demonstrates 720x faster execution than iterative LLM flows, improves coverage, reduces false-pass verdicts, and identifies RTL benchmark errors. As a data curation tool, it achieves 11x speedup and 127x energy reduction versus LLM filtering, while distilled models deliver state-of-the-art performance. Test-time scaling reduces node count by 14-47%.

structured testbench generation · register transfer level · hardware description language · llm-driven verification · data curation

Efficient, Robust, and Anti-Collusion Fingerprinting of Image Diffusion Models

arXiv cs.AI · Jianwei Fei, Yunshu Dai, Zhihua Xia, Xiaochun Cao · 2026-06-11

We propose a robust fingerprinting method for text-to-image (T2I) diffusion models with anti-collusion capabilities, addressing a systematic vulnerability in existing approaches. The method encodes bit-string fingerprints into a personalized normalization module (PNM) and employs lossless function-invariant parameter transformations to degrade image quality in colluded models, rendering them unusable. Developers can efficiently create multiple fingerprinted model copies by reparameterizing the PNM without retraining. Experiments show fingerprint extraction accuracy exceeding 99.5% and significant FID degradation in colluded models, demonstrating proactive robustness against collusion attacks.

fingerprinting · collusion attacks · personalized normalization module · lossless function-invariant · fid degradation

A Mathematical Forum Platform for Collaborative Problem Solving and Dataset Generation for AI Reasoning

arXiv cs.AI · Akbar Erkinov, Nurmukhammad Abdurasulov · 2026-06-11

The authors introduce a unified forum platform integrating image-to-LaTeX conversion for collaborative mathematical problem solving. The system employs Mathpix OCR API for optical character recognition, normalizes delimiters, and renders live previews in LaTeX or Markdown before database storage. The architecture comprises three decoupled layers: image processing, rendering, and storage, supporting both desktop and mobile clients. A provisional US patent covers the core methods. Beyond usability improvements, the platform generates a continuously growing dataset of community-validated mathematical problems and solutions, potentially serving as training data for AI mathematical reasoning systems.

latex · optical character recognition · mathpix · delimiter normalization · mathematical reasoning

Multi-Modal Agents for Power Distribution Defect Detection: An Evaluation of Foundation Models

arXiv cs.AI · Quan Quan · 2026-06-11

The paper proposes a Multi-Modal Agent framework for power distribution defect detection, evaluating multimodal foundation models as unified cognitive engines. The systematic assessment focuses on three capabilities: perception (equipment identification and defect description), reasoning (cause diagnosis and maintenance planning), and tool usage (autonomous action execution). A domain-specific dataset and benchmark are developed, with experimental results revealing strengths and limitations of current models in industrial deployment contexts.

multimodal foundation models · defect detection · closed-loop automation · domain-specific benchmark · autonomous agents

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

arXiv cs.AI · Ibrahim Gulluk, Max Van Puyvelde, Olivier Gevaert · 2026-06-11

OpenMedQ introduces a medical vision-language model pretrained on the broadest fully-open medical dataset to date, comprising 14 datasets with ~3.35M samples across pathology, radiology, microscopy, and clinical QA. The model achieves state-of-the-art BLEU-1 scores on PathVQA (75.9), outperforming Med-PaLM M variants up to 562B parameters, and matches the best reported VQA-MED BLEU-1 (64.5). Its vision encoder, evaluated on 8 unseen medical classification benchmarks, attains the highest average macro-F1 (0.757) compared to BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). Code and an interactive demo are released for reproducibility.

vision-language model · pathvqa · macro-f1 · clinical qa · pretraining

Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory

arXiv cs.AI · Zhibao Chen, Qian Cheng · 2026-06-11

A multi-factor memory value function is proposed for long-running LLM agents, addressing the challenge of memory consolidation under fixed budgets. The function integrates seven cognitively grounded factors (emotional intensity, goal relevance, value alignment, self/user relevance, task utility, reliability, and usage history) with learned weights via gradient-free optimization. Evaluated on LongMemEval, the model retains 0.770 ± 0.011 of gold evidence in blind regimes, outperforming uniform weights (0.657), single factors (0.518), and recency-based approaches (0.368). The learned weights are interpretable, emphasizing reliability, emotional intensity, and self/user relevance, while down-weighting query-time goal similarity. Synthetic tasks confirm the model's ability to recover optimal weightings.

memory consolidation · gradient-free optimization · longmemeval · value function · cognitive factors
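One plausible reading of a multi-factor value function is a weighted sum over the seven named factors, with the top-scoring memories kept under a fixed budget. The sketch below assumes that linear form. The weights are made-up numbers that merely echo the reported emphasis (reliability, emotional intensity, self/user relevance), and the memories are invented examples. None of this is the paper's learned model.

```python
FACTORS = [
    "emotional_intensity", "goal_relevance", "value_alignment",
    "self_user_relevance", "task_utility", "reliability", "usage_history",
]

def memory_value(features, weights):
    """Weighted sum over the seven factors; missing factors count as 0."""
    return sum(weights[f] * features.get(f, 0.0) for f in FACTORS)

def consolidate(memories, weights, budget):
    """Keep the `budget` highest-value memories."""
    ranked = sorted(
        memories,
        key=lambda m: memory_value(m["features"], weights),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical weights, not values reported by the authors.
weights = {
    "emotional_intensity": 0.25, "goal_relevance": 0.05,
    "value_alignment": 0.10, "self_user_relevance": 0.20,
    "task_utility": 0.10, "reliability": 0.25, "usage_history": 0.05,
}

memories = [
    {"id": "allergy", "features": {"reliability": 0.9, "self_user_relevance": 0.9, "emotional_intensity": 0.6}},
    {"id": "weather", "features": {"goal_relevance": 0.9, "usage_history": 0.2}},
    {"id": "deadline", "features": {"task_utility": 0.8, "reliability": 0.7, "goal_relevance": 0.5}},
]

kept = consolidate(memories, weights, budget=2)
kept_ids = [m["id"] for m in kept]
```

With these illustrative weights, the memory that only matches the current query ("weather") is the one dropped, mirroring the summary's point about down-weighting query-time goal similarity.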

PRISMR: Overcoming Parse Collapse in Multimodal Listwise Ranking via Parameterized Representation Internalization

arXiv cs.AI · Hao Jiang, Xin Li, Annan Wang, Zhi Yang · 2026-06-11

PRISMR introduces a framework to mitigate parse collapse in multimodal listwise ranking with Large Multimodal Models (LMMs), where autoregressive decoders produce incomplete rankings due to limited context utilization. The method employs a lightweight hypernetwork to encode multimodal candidates in parallel, generating item-specific LoRA weights synthesized into an instance-specific adapter for LMMs, enabling robust internalization of list structure. Evaluated on a large-scale multimodal review-ranking benchmark, PRISMR significantly reduces parse collapse, enhances ranking performance, and demonstrates effective cross-domain transferability across instruction-tuned backbones.

parse collapse · multimodal ranking · lora weights · hypernetwork · listwise ranking

An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics

arXiv cs.AI · Zhe Liu, Huanbo Jin, Zhaohui Du, Zhe Wang · 2026-06-11

The paper introduces Pipette, an embodied simulation platform and benchmark for wet-lab robotics, featuring 43 open-source editable assets and a data-efficient augmentation framework. The system replays human demonstrations in simulation with perturbations (lighting, camera, speed, action) and filters episodes via automatic success checks, expanding training data from limited demonstrations. Evaluated on an 11-task benchmark (sample handling, culture-ware manipulation, etc.), simulation augmentation improves SmolVLA from 44.1% to 74.7% and π0 from 40.4% to 46.5% with only 30 demonstrations per task, while ACT achieves 65.5% success. Pipette also supports natural-language-driven task definition.

wet-lab robotics · simulation augmentation · embodied benchmark · data-efficient learning · natural-language interface

MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling

arXiv cs.AI · Wenbo Chen, Puheng Li, Mengyang Liu, Weijie Su · 2026-06-11

MARS introduces a margin-adversarial stopping rule for parallel test-time scaling of LLMs, reducing computational overhead while maintaining accuracy. The method probes partial reasoning traces at intermediate checkpoints, estimating trace-level switch probabilities and applying an adversarial bound calibrated from warmup traces to predict future vote movement. This approach separates uncertainty sources and guarantees early-stopped answers match full-budget votes with high probability. Empirical results show MARS saves 25-47% of self-consistency tokens and 14-29% over DeepConf Online across three reasoning models and competition-math benchmarks, while matching full-budget baseline accuracy.

margin-adversarial · parallel test-time scaling · self-consistency tokens · trace-level switch probabilities · adversarial bound
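To make the idea of stopping on vote margin concrete, here is a deliberately simplified, deterministic variant: stop once the leading answer's margin exceeds the number of votes still outstanding, so the final majority cannot change. MARS itself is probabilistic and works on partial traces with a calibrated adversarial bound. This sketch does not implement that, and the function name is hypothetical.

```python
from collections import Counter

def can_stop_early(votes_so_far, total_budget):
    """Return True if the current leader must win the full-budget vote.

    Worst case: every remaining vote goes to the runner-up. If the
    leader still stays strictly ahead, further sampling is wasted.
    """
    if not votes_so_far:
        return False
    top_two = Counter(votes_so_far).most_common(2)
    leader = top_two[0][1]
    runner_up = top_two[1][1] if len(top_two) > 1 else 0
    remaining = total_budget - len(votes_so_far)
    return leader - runner_up > remaining

undecided = can_stop_early(["A", "A", "A", "B", "A"], total_budget=8)
decided = can_stop_early(["A", "A", "A", "B", "A", "A"], total_budget=8)
```

This hard guarantee only fires late in the budget. Predicting how unfinished traces are likely to vote, as the summary describes, is what would allow stopping earlier at the cost of a probabilistic rather than absolute guarantee.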

Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce

arXiv cs.AI · Jetlir Duraj, Jayanth Yetukuri, Shuang Zhou, Dhruv Varma · 2026-06-11

A modular two-agent simulation framework evaluates conversational shopping assistant architectures by pairing a persona-driven buyer agent with interchangeable responders integrated with e-commerce search APIs. The framework enables controlled comparisons across 2,011 conversations in 14 persona buckets, revealing four key findings: rolling-window memory outperforms intent-extraction memory in quality and speed (35% faster per query); targeted fixes reduce failure rates by 62%; Llama 3.3 70B incurs a 0.16-0.45 point cost over Gemini 2.5 despite identical architecture; and systematic philosophical disagreement exists between LLM judges (Gemini prioritizes process correctness, Claude demands outcomes).

rolling-window memory · intent-extraction memory · two-agent simulation · llm judges · e-commerce search api
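Rolling-window memory, in its simplest form, keeps only the last few conversation turns verbatim and passes them to the responder as context. The sketch below assumes that basic form. The class name, window size, and sample dialogue are illustrative and not taken from the paper.

```python
from collections import deque

class RollingWindowMemory:
    """Keep the most recent `max_turns` turns; older turns fall off automatically."""

    def __init__(self, max_turns=3):
        # deque with maxlen discards the oldest entry on overflow.
        self.turns = deque(maxlen=max_turns)

    def add(self, role, text):
        self.turns.append((role, text))

    def context(self):
        """Render the window as plain text for the next model call."""
        return "\n".join(f"{role}: {text}" for role, text in self.turns)

memory = RollingWindowMemory(max_turns=3)
memory.add("buyer", "I need running shoes")
memory.add("assistant", "What size?")
memory.add("buyer", "Size 42, under $100")
memory.add("assistant", "Here are three options")
context = memory.context()
```

No extra model call is needed to maintain the window, which is one plausible source of a per-query speed advantage over approaches that extract intent with an additional step.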

Order Is Not Control

arXiv cs.AI · Gareth Seneque, Lap-Hang Ho, Nafise Erfanian Saeedi, Jeffrey Molendijk · 2026-06-11

The paper distinguishes order from control in AI systems, proposing that control requires a receiver-gated response law mapping material state, action, bath, and receiver state to response displacement. This framework is validated across biological systems (mouse ALM, C. elegans, zebrafish) and LLMs, demonstrating predictable response vectors with 72.8-73.7% component-sign accuracy, improving to 84.3-84.8% on nonzero components. Held-out observers predict system-effect and target/oracle families at 93.6% and 91.7% accuracy. The study identifies local admitted control and measurable stochastic response operators, while excluding deployable pre-generation control and biological-to-LLM coordinate identity.

receiver-gated response law · response displacement · component-sign accuracy · stochastic response operators · deployable pre-generation control

LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

arXiv cs.AI · Franz Louis Cesista, Katherine Crowson, Cédric Simal, Stella Biderman · 2026-06-11

The paper introduces LoRA-Muon, a spectral steepest-descent optimizer for Low-Rank Adaptation (LoRA) that addresses sensitivity to initialization and learning rate transfer issues in factor-wise optimizers like AdamW. By applying Muon's spectral update rule to low-rank matrices and introducing a split weight-decay mechanism, the method achieves rank-invariant optimal learning rates and outperforms dense baselines in compute-matched experiments. On TinyShakespeare, rank-32 LoRA-Muon achieves lower validation loss than dense training, while rank-2 recovers the dense optimal learning rate. The analysis also shows Spectron's dependence on arbitrary factor scaling and equivalence between LoRA-RITE's QR-coordinate update and LoRA-Muon's QR-free spectral computation.

low-rank adaptation · spectral steepest-descent · factor-wise optimizers · split weight-decay · qr-decomposition

MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems

arXiv cs.AI · Chejian Xu, Zhaorun Chen, Jingyang Zhang, Freddy Lecue · 2026-06-11

MAStrike introduces a closed-loop framework for collusive red-teaming in hierarchical multi-agent systems (MAS), addressing limitations in existing approaches that rely on heuristic agent selection and isolated perturbations. The method employs agent-level Shapley value analysis to quantify each agent's marginal contribution to system robustness, guiding the identification of vulnerable coalitions and role-aware adversarial manipulations. These attacks are iteratively refined through structured causal diagnosis. Extensive experiments across diverse MAS environments demonstrate MAStrike's superiority over heuristic baselines, uncovering non-trivial Shapley value distributions and higher-order agent interactions.

shapley value · multi-agent systems · red-teaming · collusive attacks · structured causal diagnosis
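Agent-level Shapley analysis assigns each agent its average marginal contribution over all coalitions. The sketch below computes exact Shapley values for a toy three-agent system. The agent names and the attack-success numbers in the value function are invented for illustration, and the paper's estimator and value function may well differ.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a number."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Probability that exactly this coalition precedes p in a random order.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result

def attack_success(compromised):
    """Hypothetical attack success rate when `compromised` agents are manipulated."""
    if {"planner", "executor"} <= compromised:
        return 0.8  # strong interaction when both are compromised
    if "planner" in compromised:
        return 0.2
    if "executor" in compromised:
        return 0.1
    return 0.0  # the reviewer alone contributes nothing in this toy setup

phi = shapley_values(["planner", "executor", "reviewer"], attack_success)
```

Exact enumeration grows exponentially with the number of agents, so larger systems would need sampling-based approximations.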

MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback

arXiv cs.AI · Zehong Wang, Yijun Ma, Connor R. Schmidt, Tianyi Ma · 2026-06-11

MDForge introduces an LLM agent for automated molecular dynamics (MD) pipeline design, addressing the challenge of sparse simulator feedback through online verbal reward reshaping. The method employs multi-agent debate among physics experts to densify rewards during in-context learning, enabling open-ended code generation without predefined tools. Evaluated on three SAMPL host-guest binding free-energy benchmarks, MDForge designs pipelines competitive with human experts and discovers a novel picomolar-affinity CB[7] binder, experimentally validated via NMR.

molecular dynamics · llm agent · in-context learning · binding free-energy · multi-agent debate

Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

arXiv cs.AI · Allison Andreyev, Landon Eum, Nestor Tiglao, Romel Gomez · 2026-06-11

We present GRASP (Grounded Reasoning and Symbolic Planning), a neuro-symbolic framework for open-vocabulary tabletop manipulation that integrates Vision-Language Models (VLMs) with bounding-box detection. GRASP translates natural-language queries into goal states without task-specific fine-tuning, enabling robots to interpret abstract spatial concepts like 'top shelf' and execute tasks dynamically. The method leverages pretrained VLMs to ground symbolic planning in physical reality, avoiding reliance on fixed color lists or hard-coded coordinates. In 90 real-robot trials across three difficulty levels, GRASP achieved 73.3% overall success, demonstrating robust performance without extensive training or demonstrations.

vision-language models · neuro-symbolic planning · bounding-box detection · open-vocabulary manipulation · task and motion planning

Zero-source LLM Hallucination Detection with Human-like Criteria Probing

arXiv cs.AI · Jiahao Yang, Shuhai Zhang, Hailong Kang, Feng Liu · 2026-06-11

The paper introduces Human-like Criteria Probing for Hallucination Detection (HCPD), a method for detecting hallucinations in large language models (LLMs) under zero-source constraints. HCPD employs a Human-like Criteria Probing (HCP) mechanism, where an LLM agent decomposes its judgment into interpretable criteria and aggregates criterion-specific scores into a final truthfulness measure. The method uses a reward-based alignment scheme with weak supervision from semantic consistency and a multi-sampling aggregation strategy for robust decisions. Theoretical analysis supports its reliability, and experiments demonstrate that HCPD outperforms state-of-the-art baselines in zero-source hallucination detection.

hallucination detection · zero-source constraint · human-like criteria probing · semantic consistency · multi-sampling aggregation

PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent

arXiv cs.AI · Junfeng Guo, Heng Huang · 2026-06-11

PolicyGuard introduces a test-time step-level defense against backdoor attacks in reinforcement learning (RL) agents, addressing vulnerabilities where agents execute malicious actions upon trigger activation. The method leverages Gaussian Process (GP) posterior variance and adapts pseudo trajectories to compute uncertainty at individual time steps, supported by theoretical foundations. Evaluated across seven RL games, PolicyGuard achieves state-of-the-art detection performance, with average AUROC scores of 0.856 for perturbation-based attacks and 0.859 for adversary-agent attacks.

reinforcement learning · backdoor attacks · gaussian process · test-time defense · uncertainty computation

Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement

arXiv cs.AI · Tingyu Li, Le Zhou, Siyuan Li, Yujun Wu · 2026-06-11

The paper introduces MoTiF, a two-stage training framework addressing Modal Isolation in interleaved multimodal reasoning, where textual and visual modalities fail to inform each other effectively. MoTiF decomposes reasoning cycles into atomic operations, quantifying modality transition loss via cross-modal hallucination and visual utilization deficit. The framework employs Reflective SFT for error detection and recovery, and Flow-GRPO for reinforcement learning to enhance image generation fidelity. Evaluated on four visual puzzle benchmarks, MoTiF significantly improves cross-modal coherence and task accuracy, demonstrating the necessity of explicit structural supervision at modality boundaries.

modal isolation · interleaved reasoning · modality transition · cross-modal hallucination · visual utilization deficit

The Hidden Power of Scaling Factor in LoRA Optimization

arXiv cs.AI · Zicheng Zhang, Haoran Li, Jiaxing Wang, Guoqiang Gong · 2026-06-11

This paper demonstrates that the scaling factor α in Low-Rank Adaptation (LoRA) optimization plays a more critical role than previously understood, outperforming learning rate scaling in driving effective optimization. Through empirical analysis and the Signal-Drift theoretical framework, the authors identify three key insights: LoRA's spectral suppression smooths the optimization landscape, α amplifies task signals without increasing drift ratio, and optimal α follows a sublinear square-root law with rank. They propose LoRA-α, a framework that restores α to its principled regime, improving performance across diverse tasks while simplifying hyperparameter search.

low-rank adaptation · scaling factor · signal-drift framework · spectral suppression · hyperparameter optimization
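In the standard LoRA formulation the low-rank update is multiplied by α/r, where r is the rank. The sketch below shows what a square-root law would imply under that convention: if α grows like the square root of r, the effective multiplier shrinks like 1 over the square root of r. The reference point (α = 16 at rank 16) is an arbitrary assumption, and whether the paper states its law in this exact parameterization is not confirmed by the summary.

```python
import math

def sqrt_law_alpha(rank, base_alpha=16.0, base_rank=16):
    """Scale alpha with sqrt(rank) from a reference (base_alpha, base_rank).

    The reference values are hypothetical defaults, not from the paper.
    """
    return base_alpha * math.sqrt(rank / base_rank)

def lora_multiplier(alpha, rank):
    """Effective scale applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank

# Compare ranks 4, 16, and 64 under the assumed square-root law.
table = {
    rank: (sqrt_law_alpha(rank), lora_multiplier(sqrt_law_alpha(rank), rank))
    for rank in (4, 16, 64)
}
```

Under these assumptions, quadrupling the rank doubles α and halves the effective multiplier, so a single tuned reference point could in principle transfer across ranks.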

HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness

arXiv cs.AI · Xiaoxuan Wang, Haixin Wang, Alexander Taylor, Jason Cong · 2026-06-11

HarnessBridge introduces a learnable bidirectional controller for LLM agent harnesses, addressing scalability challenges in long-horizon tasks. The method parameterizes the agent-environment interface via bidirectional projections: observation projection distills raw trajectories into compact states, while action projection converts proposed actions into executable transitions or rejections. Trained via unified instruction tuning on a harness supervision dataset, HarnessBridge matches or surpasses specialized harnesses on Terminal-Bench 2.0 and SWE-bench Verified, reducing token usage and trajectory length while generalizing from smaller to larger commercial models.

harness controller · bidirectional projection · observation projection · action projection · unified instruction tuning

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

arXiv cs.AI · Jingxuan Han, Wei Liu, Mingyang Zhu, Youpeng Wang · 2026-06-11

The paper introduces DailyReport, an open-ended benchmark for evaluating Search Agents (SAs) on daily search tasks, addressing limitations of prior benchmarks focused on specialized tasks. It comprises 150 open-ended tasks with 3,546 rubrics, decomposing tasks into subtasks and evaluating them via cascade rubrics across disentangled dimensions. The method enables interpretable performance attribution and user preference scoring. Evaluation of 17 agentic systems reveals gaps in meeting user expectations. The dataset and code are publicly available.

search agents · cascade rubrics · disentangled dimensions · open-ended tasks · user preference score

Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming

arXiv cs.AI · Tingqiang Xu, Hangrui Zhou, Tianle Cai, Alex Gu · 2026-06-11

The paper introduces UOJ-Bench, a benchmark evaluating LLMs' capabilities in code generation, hacking, and repair within competitive programming contexts. Utilizing real-world submissions from Universal Online Judge (UOJ), it assesses models' error identification in human-written code. Results indicate that one-shot evaluation fails in >50% error detection, while test-time scaling achieves >90% success at high computational cost. Frontier LLMs demonstrate potential by uncovering errors in 5% of full-score submissions across ~30 problems, offering complementary signals to traditional judging systems.

uoj-bench · competitive programming · code generation · error identification · test-time scaling

JSCGC: Joint Source-Channel-Generation Coding for Wireless Generative Communications

arXiv cs.AI · Tong Wu, Zhiyong Chen, Guo Lu, Li Song · 2026-06-11

The paper proposes Joint Source-Channel-Generation Coding (JSCGC), a generative communication paradigm that replaces conventional decoders with a generative model. JSCGC reformulates communication as controlled generation for mutual information maximization under perceptual constraints, using a unified joint training and stochastic sampling framework. Experiments on latent-space image transmission show JSCGC improves feature-based, semantic-level, and distributional quality across diverse channel conditions, exhibiting semantic inconsistency rather than distortion as its primary error behavior.

generative communication · joint source-channel coding · perceptual constraints · mutual information maximization · stochastic sampling

WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning

arXiv cs.AI · Renmin Cheng, Changhao Chen · 2026-06-11

WISE (Which-Why Informed Semantic Explorer) introduces a long-horizon agent framework for Minecraft, addressing performance bottlenecks in low-level controllers through causal reasoning and enhanced episodic memory. The framework integrates a Causal Event Graph to link observations to task relevance, enabling robust recall under viewpoint changes and opportunistic task reordering. An Opportunistic Task Scheduler dynamically reprioritizes subtasks based on detected causal opportunities, while a multi-scale progressive exploration strategy ensures spatially comprehensive observations. Experiments demonstrate significant improvements in task success and efficiency, particularly in adaptive decision-making scenarios for long-horizon sparse tasks.

causal event graph · opportunistic task scheduler · multi-scale exploration · episodic memory · long-horizon agent

(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable

arXiv cs.AI · Chen Zhu, Xiaolu Wang, Weilong Zhang · 2026-06-11

The paper introduces Human-in-the-Loop Economic Research (HLER), a decision architecture that enhances the reliability of AI-assisted social science by structuring cognitive labor between humans and machines. HLER imposes three commitments: LLMs reason but do not execute data work, data and estimation are handled deterministically, and three human decision gates bind the workflow. In a 2×4 factorial experiment with 280 research runs across four datasets, HLER reduced critical failure rates from 72% to 16% compared to an unconstrained multi-agent baseline. Fisher's exact test confirmed the significance of this improvement (p<0.001). An 80-run ablation study suggested independent contributions from deterministic computation and human gates, with exploratory evidence of complementarity.

human-in-the-loop · decision architecture · cognitive labor · deterministic computation · failure rate

TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models

arXiv cs.AI · Zhengtao Yao, Liuyang Song, Hongbo Zhang, Chenhao Wei · 2026-06-11

TimeROME-DLM introduces the first training-free, gradient-free inference-time knowledge-editing framework for masked diffusion language models (MDLMs). It combines Temporal Indirect Effect (TIE) causal tracing to identify key intervention coordinates and a low-rank residual edit memory for closed-form updates, applied sparsely to limit utility spillover. Evaluated on TOFU forget01 with LLaDA-8B-Base, it reduces forget-set log-probability by ~83 nats while maintaining retain-set log-probability within ~1 nat across 50 sequentially inserted facts. The method achieves a 4-14x wall-clock speedup, scales sub-linearly to 400 facts, and transfers across multiple MDLMs without additional VRAM.

masked diffusion language models · temporal causal tracing · low-rank residual edit · inference-time editing · utility spillover

OCOO-T: A Simple and Scalable Virtual Cell Model for Transcriptional Perturbation Response Prediction

arXiv cs.AI · Danning Jiang, Zheming An, Yalong Zhao, Lipeng Lai · 2026-06-11

OCOO-T introduces a minimalist Transformer-based virtual cell model for predicting single-cell transcriptional responses to perturbations. The method employs continuous-time denoising via flow matching, integrating perturbation embeddings and dosage information through adaptive layer normalization and in-context tokens. Evaluations on Tahoe100M, Replogle, and PBMC benchmarks show state-of-the-art performance across diverse perturbations and cell types, with scalability to long transcriptional profiles via patching and depatching.

transformer · denoising · perturbation · transcriptional · scalability
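
The OCOO-T summary above mentions continuous-time denoising via flow matching. As a minimal sketch, assuming the common linear interpolation path (the digest does not say which path the paper uses), a training pair consists of an interpolated point and the constant velocity a model would regress onto. The function names are hypothetical.

```python
def flow_matching_pair(x0, x1, t):
    """Linear-path flow matching (an assumed choice of path).
    x0: noise sample, x1: data sample, t: time in [0, 1].
    Returns the interpolated point x_t and the target velocity x1 - x0."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

# Toy two-dimensional example with made-up values.
x_t, v = flow_matching_pair([0.0, 2.0], [1.0, 4.0], 0.5)
print(x_t, v, mse([1.0, 1.0], v))
```

In a real model the predicted velocity would come from the network, conditioned on x_t, t, and the perturbation embedding; here it is a fixed list purely for illustration.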

The Internet of Agentic AI: Communication, Coordination, and Collective Intelligence at Scale

arXiv cs.AI · Quanyan Zhu · 2026-06-11

The paper proposes the Internet of Agentic AI (IoAI), a framework for scalable ecosystems of heterogeneous autonomous agents that communicate, coordinate, and execute workflows across diverse environments. Drawing on foundations from single-agent AI, multi-agent systems, distributed computing, game theory, and security engineering, the authors analyze architectures, protocols, and mechanisms for agent deployment, workflow lifecycles, interoperability, resource management, and trust. Case studies in adaptive manufacturing and distributed operational coordination illustrate key challenges, including controlled emergence, semantic interoperability, secure identity, incentive-compatible coordination, resource-aware orchestration, and governance in large-scale agent networks.

autonomous agents · multi-agent systems · semantic interoperability · resource-aware orchestration · distributed computing

Fantastic Scientific Agents and How to Build Them: AgentBuild for Rietveld Refinement

arXiv cs.AI · Woong Shin, Craig A. Bridges, Marshall T. McDonnell, Rafael Ferreira da Silva · 2026-06-11

The paper introduces AgentBuild, a framework for constructing scientific agents from human-authored contracts that preserve researcher judgment. The method combines version-controlled rubrics, difficulty-graded curricula, and curated knowledge bases to guide a meta-optimizer coding agent that edits the target agent within specified boundaries. Applied to Rietveld refinement of X-ray diffraction data using GSAS-II, the system successfully progressed through a lithium lanthanum zirconium oxide (LLZO) signal-to-noise ladder, identifying workflow limits while maintaining rubric-driven quality control. The approach enables model-agnostic retuning by preserving contracts across base model updates.

agent construction · rietveld refinement · meta-optimizer · graded curriculum · x-ray diffraction

Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning

arXiv cs.AI · Changye Li, Meng Lu, Yi Wu, Ligeng Zhu · 2026-06-11

The paper introduces PERIA, a tool-augmented visual agent for spatial reasoning tasks, addressing limitations of vision-language models in active evidence acquisition and multi-step visual interaction. PERIA employs two lightweight tool families: vision perception tools for exposing textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context, tracing paths, and verifying spatial relations. Training involves supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization. PERIA-8B improves over Qwen3-8B by 10.0% on in-distribution and 4.4% on out-of-distribution benchmarks, outperforming state-of-the-art baselines by 7.0%-14.8% and achieving performance comparable to larger models like Qwen3-VL-235B-A22B-Thinking and GPT-5.

spatial reasoning · vision-language models · tool-augmented agents · policy optimization · visual interaction

Topical Phase Transitions in Artificial Intelligence Research: Large-Scale Evidence and an Early-Warning Signature for Emerging Topics

arXiv cs.AI · Rasul Khanbayov, Hasan Kurban · 2026-06-11

This work characterizes topical phase transitions in AI research through large-scale analysis of 80,814 papers from ACL, CVPR, ICLR, ICML, and NeurIPS (2017-2025). The study identifies abrupt surges in topics like large language models and diffusion models, contrasting with smooth growth patterns in reinforcement learning. An early-warning signature is proposed, achieving 27% precision and 63% recall in predicting emerging topics out-of-sample. The method flags reasoning, agentic AI, multimodal LLMs, retrieval-augmented generation, and world models as key areas to monitor in 2026-2028.

topical phase transitions · early-warning signature · large language models · diffusion models · retrieval-augmented generation
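
For readers weighing the 27% precision against the 63% recall reported in the summary above, the two can be combined into an F1 score. The value printed below is derived arithmetic from those two figures, not a number reported in the digest.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as stated in the summary above.
print(round(f1_score(0.27, 0.63), 3))
```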

DIMOS: Disentangling Instance-level Moving Object Segmentation

arXiv cs.AI · Hongxiang Huang, Hongwei Ren, Xiaopeng Lin, Yulong Huang · 2026-06-11

The paper introduces DIMOS, a novel framework for moving instance segmentation (MIS) that disentangles appearance and motion features across image and event modalities. The method employs a dual-disentangling feature extraction module to separate motion and appearance cues, followed by multi-granularity cross-modal alignment for effective fusion. Experiments show state-of-the-art performance, particularly for small instances in challenging scenarios like fast motion and low-light conditions.

moving instance segmentation · event cameras · feature disentanglement · cross-modal alignment · multimodal fusion

Acquisition state behaves as a structured, measurable variable governing lung-nodule AI: kernel-driven measurement instability and noise-driven detection fragility, invisible to DICOM metadata

arXiv cs.AI · Daniel Soliman · 2026-06-11

The study identifies acquisition state as a structured, measurable variable governing AI performance in lung-nodule detection, revealing distinct failure modes invisible to DICOM metadata. Using a MONAI RetinaNet model trained on LUNA16, the authors analyze paired CT scans differing in reconstruction kernel and controlled perturbations from LIDC-IDRI. Results show that kernel shifts alter nodule diameter measurements (5.2% Fleischner category flips) without affecting detection confidence, while noise perturbations degrade detection sensitivity (p=5.9e-32) but not measurements. A 4-feature pixel fingerprint achieves high reconstruction identity classification (AUC 0.95-0.995), outperforming DICOM tags. Findings underscore the need for acquisition-aware validation in imaging-AI governance.

acquisition state · reconstruction kernel · pixel fingerprint · lung-nodule detection · dicom metadata

GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models

arXiv cs.AI · Gabriel Diaz-Ireland, Diego Prieto-Herráez, Mario García Peces, Javier Velázquez · 2026-06-11

The GeoNatureAgent Benchmark introduces the first evaluation framework for LLM agents performing environmental geospatial analysis via structured tool calls to a production-style API. It comprises 93 tasks across 18 categories, testing capabilities like spatial reasoning and error handling against a self-hostable API serving environmental indicators for Spain and Portugal. Evaluation of seven LLMs reveals Claude Sonnet 4 leads (60.8% accuracy), while open-weight models like DeepSeek V3.2 offer competitive performance at lower cost ($0.011/case). Key limitations include 0% accuracy on close-value comparisons, demonstrating systematic reasoning gaps.

geospatial analysis · llm agents · structured tool calling · environmental indicators · benchmark evaluation

Localizing Anchoring Pathways in Language Models

arXiv cs.AI · Hillary N. Owusu, Sarah Wiegreffe, Naomi H. Feldman · 2026-06-11

The study mechanistically localizes anchoring pathways in language models, demonstrating how irrelevant numerical prompts influence model judgments. Using a controlled multiple-choice setup with shared answer options, the authors define a logit-difference metric to track behavioral anchoring and apply attribution-based circuit localization on 7B-8B Qwen and Llama models. Edge-level methods outperform node-level methods in recovering anchor-sensitive signals, with strong transfer observed within models across low- and high-anchor circuits. However, sparse transfer between base and instruction-tuned variants suggests post-training alters pathway importance. These findings elucidate how anchoring-related decision signals propagate in language models.

anchoring effects · circuit localization · logit-difference metric · attribution-based methods · instruction-tuned models

Teach-and-Repeat: Accurately Extracting Operational Knowledge from Mobile Screen Demonstrations to Empower GUI Agents

arXiv cs.AI · Yudong Zhang, Lei Hu, Daoyang Liu, Jiawei Liu · 2026-06-11

The paper introduces Teach VLM, a vision-language model that translates mobile screen trajectories into step-wise operational knowledge, addressing the challenge of diverse UI designs across applications. The model extracts operation-related keyframes from demonstration videos and leverages a systematic data flywheel for scalable training. Evaluated on the Chinese Mobile Screen Teach Benchmark, Teach VLM achieves state-of-the-art performance in operation semantics prediction. The Teach-and-Repeat paradigm further utilizes this operational knowledge to guide downstream screen-based execution agents, yielding consistent Task Success Rate improvements in Android World experiments.

vision-language model · operational knowledge · keyframes · task success rate · android world

Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids

arXiv cs.AI · Xiao Ren, Yuhui Yang, Zongbiao Weng, Zhijie Liu · 2026-06-11

The paper introduces Stubborn, a unified reinforcement learning framework for humanoid motion tracking and fall recovery. It employs an asymmetric Actor-Critic architecture with three key components: yaw-aligned tracking representation for drift reduction, a Bernoulli-based probabilistic termination mechanism for fall-recovery exploration, and a dynamic sampling strategy for training efficiency. Evaluations show competitive performance against SOTA methods, with robustness attributed to the proposed mechanisms. Real-world demonstrations are available online.

reinforcement learning · humanoid motion tracking · fall recovery · actor-critic architecture · probabilistic termination
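
One plausible reading of the Bernoulli-based probabilistic termination in the Stubborn summary above: when the robot falls, the episode ends only with some probability, so the policy sometimes continues from the fallen state and can practice recovery. The function name, parameter, and probability values are assumptions, not the paper's implementation.

```python
import random

def should_terminate(fallen: bool, p_terminate: float, rng: random.Random) -> bool:
    """Hypothetical termination rule: a fall ends the episode with
    probability p_terminate (one Bernoulli draw); otherwise the episode
    continues so the policy can explore recovery."""
    if not fallen:
        return False
    return rng.random() < p_terminate

rng = random.Random(0)
# Count how many of 1000 simulated falls end the episode at p_terminate = 0.3.
ended = sum(should_terminate(True, 0.3, rng) for _ in range(1000))
print(ended)
```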

MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs

arXiv cs.AI · He Li, Haoang Chi, Qizhou Wang, Yunxin Mao · 2026-06-11

The paper introduces MLUBench, a benchmark for evaluating lifelong unlearning in multimodal large language models (MLLMs), featuring 127 entities across 9 classes. It highlights the challenge of cumulative degradation in existing unlearning methods and identifies the unique constraint of preserving multimodal alignment. The authors propose LUMoE, a method that effectively mitigates degradation, demonstrated through extensive experiments. The benchmark and source code are publicly available.

mlubench · lifelong unlearning · multimodal alignment · lumoe · degradation mitigation

SymQNet: Amortized Acquisition for Low-Latency Adaptive Hamiltonian Learning

arXiv cs.AI · Yash Vardhan Tomar, Dheeraj Peddireddy, Vaneet Aggarwal · 2026-06-11

SymQNet introduces an amortized reinforcement-learning approach for low-latency adaptive Hamiltonian learning, addressing the computational bottleneck of Bayesian design rules in quantum device calibration. The method learns a posterior-conditioned acquisition policy offline, enabling fast policy forward passes online while maintaining Bayesian posterior feedback. On transverse-field Ising benchmarks, SymQNet reduces acquisition-only decision latency by 47.1× and 72.6× at five qubits compared to bounded Fisher-information search and bounded two-step Bayesian active learning by disagreement (BALD). At twelve qubits, SymQNet achieves full simulated steps in 1.02 s versus 13.27 s for bounded two-step BALD, demonstrating practical feasibility for repeated low-latency workloads.

adaptive hamiltonian learning · amortized reinforcement-learning · bayesian design rules · transverse-field ising · posterior-conditioned acquisition

Exploring How Agent Voice Accents Shape Human-AI Collaboration in K-12 Group Learning

arXiv cs.AI · Prerna Ravi, Carúmey Stevens, Ben Hurt, Brandon Hanks · 2026-06-11

The study investigates how GenAI voice agent accents influence human-AI collaboration in K-12 group learning, addressing a gap in prior work focused on one-to-one settings. Using a between-subjects mixed-methods design with 33 teachers, it examines three accents (British, Indian, African American) through surveys, interaction analysis, and artifact evaluation. Results show accent significantly shaped mental models and agent roles: British-accented agents were treated as utilitarian tools, while Indian- and African American-accented agents were anthropomorphized as peers, affecting trust and engagement dynamics in computer-supported collaborative learning (CSCL).

genai voice agents · sociolinguistic design · group learning dynamics · anthropomorphization · cscl

The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements

arXiv cs.AI · Md Jafrin Hossain, Mohammad Arif Hossain, Weiqi Liu, Nirwan Ansari · 2026-06-11

The study audits three agentic AI frameworks (LangChain, AutoGPT, OpenAI Agents SDK) for structural safety compliance, finding zero native adherence to six containment principles. Empirical validation shows memory-poisoning attacks induce 88.9% wrongful denial rates in a simulated government benefits agent, with complex policies masking targeted corruption. Two lightweight containment mechanisms (memory integrity validator, policy gate) mitigate attacks with <0.2 ms overhead. Findings suggest current frameworks lack secure-by-default architectures for public-facing deployments.

agentic · containment · memory-poisoning · integrity · framework

A Tutorial on World Models and Physical AI

arXiv cs.AI · Il-Seok Oh · 2026-06-11

The tutorial establishes world modeling as a foundational framework for intelligent systems, distinguishing between explicit world models with structured dynamics for planning and implicit models with scalable learned representations. It unifies these approaches through shared predictive structures while highlighting their differential use in perception, prediction, and action. The work positions world models as critical for physical AI in robotics and autonomous driving, though challenges persist in hierarchical reasoning, long-horizon planning, and autonomous goal formation for artificial general intelligence.

world models · predictive structure · physical ai · hierarchical reasoning · long-horizon planning

Agentic MPC for Semantic Control System Resynthesis

arXiv cs.AI · Yuya Miyaoka, Masaki Inoue · 2026-06-11

The paper introduces an agentic model predictive control (MPC) framework that integrates large language models to enable semantic adaptation of control specifications. The method interprets heterogeneous inputs (natural language, environmental observations, external knowledge) via LLM-based agents to dynamically resynthesize MPC constraints and objectives. Validation in autonomous driving demonstrates context-aware control, including preference alignment and emergency vehicle yielding, bridging high-level semantics with low-level MPC optimization.

model predictive control · semantic adaptation · large language models · autonomous driving · control synthesis

Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage

arXiv cs.AI · Sarah Elshabrawy, Rahul K. Dass, Ashok K. Goel · 2026-06-11

The paper introduces a framework for constructing evaluation datasets that balance naturalness, grounding, and multi-hop coverage in procedural reasoning tasks for AI-supported learning systems. It compares three TMK-based question generation strategies (strict TMK generation, transcript-first generation with TMK filtering, and TMK-aware generation) using a grounding validation framework that assesses answer support, question self-containment, and multi-hop targeting. Results from 23 topics and 690 QA pairs show strict TMK generation achieves 96.5% grounding and 92.6% usability, while transcript-first yields more natural but less grounded questions, and TMK-aware has high multi-hop coverage but weaker grounding.

procedural reasoning · tmk models · question generation · multi-hop reasoning · grounding validation

LLMs Can Better Capture Human Judgments--With the Right Prompts

arXiv cs.AI · Danica Dillion, Chen Cecilia Liu, Baihui Wang, Daniele Barolo · 2026-06-10

This study demonstrates that large language models (LLMs) can better capture human judgments through improved prompting strategies. Using two datasets—144 moral scenarios and 38 beliefs from 32 countries—the authors show that eliciting standard deviations and response proportions enhances alignment with human responses. Clarity in scenarios, measured by human confusion ratings, further improves model performance. While LLMs poorly calibrate their own error estimates, they effectively predict human variability. The findings highlight that refined prompts yield more accurate LLM outputs.

llms · prompting · human judgments · alignment · variability

Prefill Awareness in Large Language Models

arXiv cs.AI · Andy Wang, Parv Mahajan, David Demitri Africa, Alexandra Souly · 2026-06-10

The study introduces prefill awareness, a capability of frontier language models to detect tampered assistant-side context, and investigates its implications for safety-relevant evaluations. A binary preference benchmark was constructed across three prefill mechanisms, focusing on cases where models exhibit consistent stances. Results show that Claude Opus 4.5 detects opposing prefills in 9-35% of cases with a 0% false positive rate, often reverting to baseline behavior without explicit acknowledgment. Detection and resistance rely on distinct cues: stylistic mismatch affects flagging, while preference mismatch influences reversion. Prefill awareness significantly confounds prefill-based methods, particularly in agentic settings like misalignment-continuation evaluations and SWE-bench trajectories.

prefill awareness · binary preference benchmark · stylistic mismatch · preference mismatch · agentic settings

Reducing the Complexity of Deep Learning Models for EEG Analysis on Wearable Devices

arXiv cs.AI · Farough Shayeste Roodi, Parham Zilouchian Moghaddam, Mahdi Mohammadi-nasab, Mehdi Modarressi · 2026-06-10

The paper demonstrates the feasibility of deploying deep neural networks for EEG analysis on resource-constrained wearable devices through complexity reduction techniques. It evaluates state-of-the-art DNN models for epileptic seizure detection, applying parameter quantization and electrode reduction to optimize computational efficiency. Results show these methods significantly reduce model complexity (computational demands, memory bandwidth) while maintaining accuracy, revealing explicit accuracy-complexity tradeoffs for wearable deployment.

eeg analysis · wearable devices · parameter quantization · computational complexity · deep neural networks
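
The EEG summary above names parameter quantization without giving details. A minimal sketch of one standard scheme, symmetric per-tensor int8 quantization, is shown below. Whether the paper uses this scheme, this bit width, or these helper names is an assumption.

```python
def quantize_int8(weights):
    """Symmetric uniform quantization: map floats to integers in [-127, 127]
    using a single scale derived from the largest absolute weight."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]

weights = [-1.0, 0.0, 0.3, 1.0]  # made-up weights for illustration
q, scale = quantize_int8(weights)
print(q, dequantize(q, scale))
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most about half the scale per weight.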

PI-Hunter: Automated Red-Teaming for Exposing and Localizing Prompt Injections

arXiv cs.AI · Pengfei He, Lesly Miculicich, Vishesh Sharma, Ash Fox · 2026-06-10

PI-Hunter introduces an automated agentic auditing framework for proactive vulnerability exposure in LLM agents, addressing the security risks of indirect prompt injection attacks. The method constructs realistic source-aware test cases and iteratively evolves them through feedback-driven exploration to reveal latent malicious instructions in external environments. Extensive experiments across multiple benchmarks, agent architectures, attacks, and defenses demonstrate PI-Hunter's superior vulnerability exposure and attack-surface coverage over existing red-teaming baselines, while maintaining effectiveness under current prompt injection defenses.

prompt injection · llm agents · red-teaming · agentic auditing · vulnerability exposure

Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

arXiv cs.AI · Tianyu Liu, Allen Xin Wang, Antonia Panescu, Lisa Xinyi Chen · 2026-06-10

The authors introduce SciAgentArena, a benchmark for evaluating AI agents in real-world scientific research scenarios, addressing limitations of existing benchmarks that fail to capture scientific complexity. The benchmark comprises ~200 tasks with stepwise verification and an interactive, agent-agnostic environment across multiple domains. Results show current agents excel in well-specified data-analysis workflows but struggle with novel insights, self-directed exploration, and open-ended questions, with performance varying across scientific contexts. The benchmark identifies failure modes and opportunities for improving agent reliability, autonomy, and reasoning.

ai agents · scientific benchmarking · interactive evaluation · stepwise verification · autonomous research

Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

arXiv cs.AI · Rafal Kocielnik, Pengrui Han, Peiyang Song, Myrl G. Marmarelis · 2026-06-10

The study demonstrates that self-report (SR) coherence with behavior in LLMs depends on measurement specificity and context. Contrasting the Theory of Planned Behavior (TPB) with Big Five traits across 11 frontier LLMs and four behavioral tasks, it finds TPB achieves human-level SR-behavior coherence within shared conversations, while Big Five fails. Coherence persists across sessions only for training-anchored behaviors (e.g., implicit bias), collapsing under context-primed behaviors (e.g., sycophancy). Persona prompting improves SR consistency but not behavioral alignment. Results advocate for task-specific psychometric tools over broad traits like Big Five.

self-report coherence · theory of planned behavior · big five traits · persona prompting · implicit bias

The Theory of Mind Utility: Formal Specification of a Mentalizing Mechanism

arXiv cs.AI · Nikolos Gurney, Stacy Marsella · 2026-06-10

The Theory of Mind Utility (ToM-U) formalizes epistemic state inference by constructing Local Epistemic World Models (LEWMs), directed typed graphs representing agents, state nodes, and their epistemic relationships. ToM-U evaluates candidate LEWMs against observed behavior until achieving sufficient confidence, using five formal definitions specifying LEWM structure, agent node properties, a bounded proliferation mechanism for recursive mentalizing, inference procedures, and a residue function capturing failed mentalizing traces. Unlike Bayesian Theory of Mind and simulation theory, ToM-U derives belief states rather than presupposing them, generating falsifiable predictions about mentalizing failure from structural properties. This positions ToM-U as a domain-agnostic mechanism upstream of goal inference and social cognitive processes.

theory of mind utility · local epistemic world models · epistemic state inference · bounded proliferation mechanism · residue function

Definitional alignment before capability alignment: a Design-Science framework for adjudicating claims about AGI

arXiv cs.AI · J. E. Aguilera Briones · 2026-06-10

The paper introduces DAF-AGI, a design-science framework for evaluating AGI definitions through two components: five ordinal criteria for adjudicative fitness and a governance audit process. Methodologically, it applies Design Science Research Methodology to assess five measurement families and a deflationary position, stress-testing against claims of current generative systems as AGI. Results show certification only under performance-based operationalizations, with other approaches rejecting the claim or remaining indeterminate, highlighting definitional sovereignty as key for algorithmic governance.

design-science research · agi definitions · adjudicative fitness · definitional sovereignty · governance audit

AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages

arXiv cs.AI · Happy Buzaaba, Cheikh Mouhamadou Bamba Dione, David Ifeoluwa Adelani, Sylvain Kahane · 2026-06-10

AfriSUD introduces the first large-scale collection of dependency treebanks for nine African languages, addressing underrepresentation in NLP resources. Using the Surface-Syntactic Universal Dependencies framework, the dataset captures typological features like agglutination and tone through native-speaker verified annotations. Evaluation of non-transformer baselines, multilingual pretrained encoders, and LLMs reveals a persistent syntax gap, indicating current architectures struggle with African-language structural diversity.

dependency parsing · universal dependencies · treebank · african languages · syntax gap

SMSR: Certified Defence Against Runtime Memory Poisoning in Persistent LLM Agent Systems

arXiv cs.AI · Tarun Sharma · 2026-06-10

The paper introduces Signed Memory with Smoothed Retrieval (SMSR), the first certified defense against Multi-Session Memory Poisoning (MSMP) in persistent LLM agent systems. SMSR combines HMAC-SHA256 provenance checks at write time with randomized memory ablation and verdict-based majority voting at query time, providing theoretical robustness guarantees. Experiments across 15 enterprise scenarios (3,150 trials) show SMSR reduces attack success from 93-100% to 0% for unsigned attacks and to 8.0% for authenticated adversaries, while maintaining 85-90% clean-query utility.

retrieval-augmented generation · memory poisoning · certified robustness · hmac-sha256 · majority voting
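
The SMSR summary above combines HMAC-SHA256 provenance checks at write time with randomized memory ablation and majority voting at query time. The sketch below shows how those two pieces could fit together using Python's standard library. All function names, the `judge` callback, and the parameter defaults are hypothetical, and the sketch carries none of the paper's certified guarantees.

```python
import hashlib
import hmac
import random
from collections import Counter

def sign_entry(key: bytes, text: str) -> str:
    """Write time: attach an HMAC-SHA256 tag to a memory entry."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()

def verify_entry(key: bytes, text: str, tag: str) -> bool:
    """Reject entries whose tag does not match (constant-time comparison)."""
    return hmac.compare_digest(sign_entry(key, text), tag)

def smoothed_verdict(entries, judge, keep_fraction=0.5, rounds=11, seed=0):
    """Query time: run `judge` on random subsets of a non-empty memory
    (randomized ablation) and return the majority verdict."""
    rng = random.Random(seed)
    k = max(1, int(len(entries) * keep_fraction))
    votes = Counter(judge(rng.sample(entries, k)) for _ in range(rounds))
    return votes.most_common(1)[0][0]

key = b"demo-key"  # placeholder; a real deployment would use a managed secret
tag = sign_entry(key, "user prefers email contact")
print(verify_entry(key, "user prefers email contact", tag))
print(verify_entry(key, "user prefers wire transfer", tag))
```

The intuition behind the voting step is that a small number of poisoned entries appear in only some of the random subsets, so they are less likely to flip the majority verdict.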

Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System

arXiv cs.AI · Alyssa Unell, Miguel Fuentes, Brenna Li, Bridget Lin · 2026-06-10

The study introduces a deployment-centered evaluation framework for clinical LLM systems, focusing on predicting query-level rejection risk using pre-response classifiers. By incorporating deployment-specific context (provider type, department, model used) alongside query content, the method achieves 0.719 AUROC in prospective analysis over 4.5 months. Results demonstrate utility in guardrail triggering and abstention use cases, highlighting the value of context-aware rejection prediction for improving clinical LLM adoption.

llm · clinical · auroc · guardrail · abstention
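
The 0.719 AUROC in the summary above measures how well the classifier ranks rejected queries above accepted ones. A minimal sketch of the pairwise (Mann-Whitney) formulation follows. The labels and scores are made-up illustration data, not values from the study.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive example is scored
    above a random negative one, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical rejection labels (1 = rejected) and predicted risk scores.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation. This quadratic-time version is meant for clarity rather than large datasets.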

LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data

arXiv cs.AI · Yifan Gao, Yanmin Gong, Yun Shi, Yuanxiong Guo · 2026-06-10

GlyLLM, a novel LLM-powered framework, integrates continuous glucose monitor (CGM) data and structured metadata for personalized glycemic assessment in Type 2 Diabetes (T2D). The method leverages pre-trained large language models (LLMs) to achieve sensor-text semantic abstraction, combining wearable sensor data with individual-level context. Evaluated on the AI-READI dataset, GlyLLM outperforms traditional machine learning methods by 13.66% in RMSE for glucose forecasting and 13.08% in AUROC for diabetes categorization. Ablation studies highlight the critical role of diabetes surveys and biometric tests in glycemic assessment. This work demonstrates the potential of LLMs for advancing personalized T2D care.

glycemic assessment · continuous glucose monitor · large language models · semantic abstraction · type 2 diabetes

Two-Layer Linear Auto-Regressive Models Estimate Latent States

arXiv cs.AI · Yahya Sattar, Sunmook Choi, Leo Maynard-Zhang, Yassir Jedra · 2026-06-10

The paper demonstrates that two-layer linear auto-regressive models trained via empirical risk minimization on partially observed linear dynamical systems learn to approximate Kalman filtering. The authors prove that the hidden representation aligns with Kalman filter state estimates up to a similarity transformation, despite no explicit dynamics knowledge. Key insights include establishing Kalman filter approximation bounds, benign optimization landscape properties, and finite-sample guarantees for prediction error, parameter estimation, and latent state recovery. Numerical experiments validate the theoretical findings.

auto-regressive models · kalman filtering · linear dynamical systems · latent state recovery · empirical risk minimization
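
As background for why a linear auto-regressive model can match Kalman filtering at all: at steady state, the Kalman one-step predictor is itself a fixed linear combination of past observations. The sketch below checks this for a scalar system. The system parameters and observation values are arbitrary assumptions for illustration, and nothing here reproduces the paper's two-layer construction or its training procedure.

```python
def riccati_fixed_point(a, c, q, r, iters=500):
    """Iterate the scalar Riccati recursion to its steady-state variance and gain."""
    p = 1.0
    for _ in range(iters):
        k = p * c / (c * c * p + r)
        p = a * a * (1.0 - k * c) * p + q
    return p, p * c / (c * c * p + r)


def kalman_predictions(ys, a, c, q, r, p0):
    """One-step-ahead observation predictions for x' = a*x + w, y = c*x + v."""
    x_pred, p_pred, preds = 0.0, p0, []
    for y in ys:
        k = p_pred * c / (c * c * p_pred + r)        # Kalman gain
        x_filt = x_pred + k * (y - c * x_pred)       # measurement update
        p_filt = (1.0 - k * c) * p_pred
        x_pred, p_pred = a * x_filt, a * a * p_filt + q  # time update
        preds.append(c * x_pred)
    return preds


def ar_prediction(ys, a, c, k):
    """Same prediction written as a linear auto-regression over past observations."""
    decay = a * (1.0 - k * c)
    return sum(c * a * k * decay**j * y for j, y in enumerate(reversed(ys)))


a, c, q, r = 0.9, 1.0, 0.1, 0.5          # assumed toy system
p_ss, k_ss = riccati_fixed_point(a, c, q, r)
ys = [0.3, -0.1, 0.8, 0.5, 0.2, -0.4]    # assumed toy observations
kalman_last = kalman_predictions(ys, a, c, q, r, p0=p_ss)[-1]
ar_last = ar_prediction(ys, a, c, k_ss)
```

The two predictions agree up to floating-point error, with auto-regressive weights that decay geometrically in the lag.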

EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence

arXiv cs.AI · Xin Zhou, Cong Miao · 2026-06-10

The paper introduces EWAM (Enhanced World Action Model), a closed-loop online adaptation architecture for embodied intelligence that reduces deployment data requirements without task-specific demonstrations or backbone fine-tuning. Built on a frozen Cosmos3 backbone, EWAM integrates four lightweight neural layers: Neural Experience Memory Layer (context provision in DiT), Neural Anomaly Detection Layer (real-time state divergence monitoring), Neural Policy Routing Layer (dynamic execution strategy selection), and Neural Action Correction Layer (action refinement). The system achieves performance gains through differentiable integration of these modules during inference, evaluated under zero-shot protocols.

closed-loop adaptation · diffusion transformer · neural anomaly detection · zero-shot learning · embodied intelligence

M*: A Modular, Extensible, Serving System for Multimodal Models

arXiv cs.AI · Atindra Jha, Naomi Sagan, Keisuke Kamahori, Irmak Sivgin · 2026-06-10

The authors present M*, a modular serving system for composite multimodal models, addressing limitations of existing frameworks designed for monolithic architectures. M* models composite AI systems as dataflow graphs (Walk Graphs), enabling arbitrary component composition, flexible cluster placement, and model-agnostic optimizations. Evaluations show 20% lower latency than vLLM-Omni on BAGEL text-to-image tasks, 2.9x real-time factor improvement for Qwen3-Omni text-to-speech, and 12.5x speedup over V-JEPA 2-AC in robotic planning, demonstrating efficient serving of heterogeneous model components.

multimodal models · serving system · dataflow graphs · model composition · distributed runtime

From AGI to ASI

arXiv cs.AI · Tim Genewein, Matija Franklin, Alexander Lerchner, Laurent Orseau · 2026-06-10

The report examines the transition from artificial general intelligence (AGI) to artificial superintelligence (ASI), characterizing ASI as systems surpassing human cognitive capabilities. It identifies four pathways: scaling AGI, AI paradigm shifts, recursive improvement, and multi-agent collectives, while analyzing potential frictions and bottlenecks. The study highlights uncertainties in ASI progress, suggesting continuous acceleration rather than a single transformative step. Interdisciplinary global collaboration is emphasized to address societal impacts of AI-driven scientific breakthroughs.

agi · asi · recursive improvement · multi-agent systems · cognitive capability

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

arXiv cs.AI · Kushal Raj Bhandari, Ling Yue, Ching-Yun Ko, Dhaval Patel · 2026-06-10

Evoflux introduces an inference-time evolutionary search method for improving executable tool workflows in compact language models (LMs). It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning, addressing failures in tool resolution, parameter validation, dependency tracking, and execution. On MCP-Bench tasks with 250 tools and live MCP servers, Evoflux increases execution feasibility from ~3% to 17-24% across small planners, outperforming SFT, SFT+DPO, and ReAct in reliability and token efficiency under scarce teacher-trace budgets.

evolutionary search · tool workflows · compact lms · execution feedback · dependency tracking

A Zero-shot Generalized Graph Anomaly Detection Framework via Node Reconstruction

arXiv cs.AI · Phan Nguyen, Dat Cao, Hien Chu, Khue Hoang · 2026-06-10

AlignGAD proposes a zero-shot generalized graph anomaly detection framework for cross-domain applications. The method combines a Global Unification Module for feature alignment and spectral normalization, a Clustering Module for group-level pattern extraction via cluster-aware views, and a Node Discrepancy Scoring Module for multi-view anomaly aggregation. Experiments demonstrate effectiveness in zero-shot settings across real-world datasets.

graph anomaly detection · zero-shot learning · spectral normalization · cluster-aware views · cross-domain generalization

Free-Placement Optimization of Ground Station Locations for Low-Earth Orbit Satellites

arXiv cs.AI · Grace Ra Kim, Duncan Eddy, Vedant Srinivas, Mykel J. Kochenderfer · 2026-06-10

The paper introduces SCORE, a two-stage free-placement optimization method for ground station networks that outperforms fixed-site approaches by operating over continuous spatial domains. SCORE combines sequential coordinate selection with cyclic refinement to address high-dimensionality and non-convexity challenges in global optimization. Benchmarking against differential evolution and integer programming methods, SCORE achieves up to 13% higher downlink throughput with 5x fewer function evaluations, while infrastructure-constrained variants retain 92% of performance gains near existing infrastructure.

free-placement optimization · ground station networks · sequential cyclic optimization · differential evolution · downlink throughput

CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents

arXiv cs.AI · Siyu Shen, Fenghao Xu, Wenrui Diao, Kehuan Zhang · 2026-06-10

CAPED introduces a context-aware privacy defense for mobile GUI agents that process screenshots, addressing incidental visual privacy exposure by selectively masking sensitive content unrelated to the task. The method employs phone-side preprocessing to extract task requirements, parse UI elements, and apply selective masking before uploading screenshots to remote agents. Evaluated on AndroidWorld, CAPED reduces weighted seeded leakage from 0.766 to 0.268 while maintaining task utility, demonstrating the viability of task-driven selective exposure over full screenshot sharing.

mobile gui agents · incidental privacy exposure · selective masking · androidworld · context-aware defense

BASENet: Band-Adapted Speech Enhancement Network with Cross-Band Attention

arXiv cs.AI · Damien Martins Gomes, François Capman · 2026-06-10

BASENet introduces a band-adapted speech enhancement network that partitions the spectrum into Bark-scale bands, assigning scaled-capacity encoders based on critical-band density to optimize perceptual resolution. The architecture employs cross-band attention for harmonic dependency capture through frequency-pooled representations at linear complexity, built on inverted residual blocks with dense connectivity and a convolutional recurrent network. BASENet achieves 3.55 PESQ and 96% STOI on VoiceBank+DEMAND with 0.83M parameters and 7.3 GMACs, the lowest parameter count among methods with PESQ > 3.50. A causal variant (3.44 PESQ) outperforms several non-causal baselines, demonstrating real-time streaming capability on resource-constrained devices.

bark-scale bands · cross-band attention · inverted residual blocks · convolutional recurrent network · critical-band density

TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation

arXiv cs.AI · Siyu Li, Toan Tran, Lingyi Zhao, Khurram Shafique · 2026-06-10

TrajGenAgent introduces a hierarchical LLM-agent framework for generating human mobility trajectories without model fine-tuning, addressing limitations of prompt engineering and trajectory-level fine-tuning. The framework employs a two-stage orchestrator-worker design: an LLM synthesizes activity chains via in-context learning, while a deterministic workflow grounds activities using personalized POI retrieval, distance-aware location selection, kinematics-aware travel-time propagation, and LLM-based duration estimation. Evaluation uses anomaly-detection-based metrics for behavioral and semantic plausibility. Experiments demonstrate TrajGenAgent's superior spatiotemporal fidelity, semantic coherence, and individual-specific behavioral realism over neural and LLM-based baselines, while avoiding parameter updates.

llm-agent · in-context learning · poi retrieval · anomaly-detection · spatiotemporal fidelity

Token Complexity Theory for AI-Augmented Computing

arXiv cs.AI · Jie Wang · 2026-06-10

The paper introduces token complexity, a novel resource measure for AI-augmented computing systems, defined as the minimum expected token cost to achieve specified output quality. The authors develop this concept within the AI-Oracle Turing machine framework, where a probabilistic Turing machine interacts with a stochastic oracle via query/response tapes. Key results include proofs of token complexity's monotonicity, convexity, price sensitivity, and price-relativity of task ordering, along with establishing that the complexity frontier is non-empty, upward-closed, and convex.

token complexity · ai-oracle turing machine · resource measure · complexity frontier · probabilistic turing machine

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

arXiv cs.AI · Tianyu Ding, Jianhong Xin, Juan Pablo De la Cruz Weinstein · 2026-06-10

The paper introduces Sibling-Guided Credit Distillation (SGCD), a method for improving credit assignment in long-horizon tool-use reinforcement learning. SGCD employs dynamic sampling to generate mixed successful/failed rollouts, uses an external LLM to create stepwise credit references, and applies bounded detached credit weights to reshape GRPO token advantages. The approach avoids silent failure modes of direct self-distillation while maintaining deployment simplicity. Evaluations on AppWorld and $\tau^3$-airline show improvements: AppWorld TGC scores increased from 42.9 to 45.6 (test_normal) and 24.7 to 27.0 (test_challenge), while $\tau^3$-airline pass@1 rose from 0.583 to 0.602.

credit assignment · tool-use agents · self-distillation · long-horizon rl · stepwise credit

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

arXiv cs.AI · Varun Reddy Nalagatla · 2026-06-10

The paper introduces Bag of Dims, a training-free framework for mechanistic interpretability in transformers, demonstrating that standard basis hidden states encode semantic content via sign patterns and confidence via magnitudes. The method validates across Qwen, Gemma, and Mistral models through four experiments, showing sign patterns alone achieve 72-93% top-5 next-token accuracy and 80-90% top-4096 via Hamming scoring. Unsupervised discovery yields 175 semantic categories with 0.80 mean AUC, 20% feature-neuron linkage, and 1500 features at 99% sparsity, confirming low inter-dimension coupling (0.0014 bits MI).

sign patterns · hamming scoring · mechanistic interpretability · transformer hidden states · unsupervised discovery
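
To make "Hamming scoring over sign patterns" concrete, here is a minimal sketch: a hidden state is reduced to one bit per dimension (its sign), and candidate tokens are ranked by Hamming distance to that bit pattern. How the paper derives per-token signatures is not described in the summary, so the `token_signatures` table below is a hypothetical placeholder, as are the toy vector and token names.

```python
def sign_bits(vec):
    """Keep only the sign of each dimension (1 for positive, 0 otherwise)."""
    return [1 if v > 0 else 0 for v in vec]


def hamming(a, b):
    """Number of positions where two bit patterns disagree."""
    return sum(x != y for x, y in zip(a, b))


def rank_tokens(hidden, token_signatures, top_k=5):
    """Rank candidate tokens by Hamming distance to the hidden state's sign pattern."""
    pattern = sign_bits(hidden)
    ranked = sorted(token_signatures.items(), key=lambda kv: hamming(pattern, kv[1]))
    return [token for token, _ in ranked[:top_k]]


# Hypothetical 4-dimensional signatures; real hidden states have thousands of dims.
token_signatures = {
    "cat": [1, 0, 1, 0],
    "dog": [0, 0, 1, 1],
    "car": [0, 1, 0, 1],
}
hidden_state = [0.2, -1.5, 3.0, -0.1]
ranking = rank_tokens(hidden_state, token_signatures, top_k=3)
```

Magnitudes are discarded entirely here, which mirrors the summary's claim that signs carry the semantic content while magnitudes carry confidence.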

HybridCodeAuthorship: A Benchmark Dataset for Line-Level Code Authorship Detection

arXiv cs.AI · Luke Patterson, Li Wang, Adam Faulkner · 2026-06-10

The paper introduces HybridCodeAuthorship, a benchmark dataset for line-level code authorship detection that simulates real-world hybrid human-AI code collaboration. The dataset construction pipeline utilizes CodeSearchNet to source Python files from GitHub, interleaving human- and AI-authored lines. Benchmarking two state-of-the-art detection algorithms reveals the task's difficulty, with AIGCode Detector achieving F1 scores of 0.48 (chunk-level) and 0.56 (line-level).

code authorship detection · ai-generated code · benchmark dataset · line-level analysis · codesearchnet

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

arXiv cs.AI · Alan Cooney, David Africa, Geoffrey Irving · 2026-06-10

This work introduces Varied Deception, a prompted-lying testbed, and 13 reasoning model organisms with verified hidden beliefs, addressing limitations in evaluating lie detectors for language models. Four detectors were evaluated: a chain-of-thought judge, a logprob classifier, and two activation probes including Did-You-Lie (DYL), across 31 models (2B to 1T parameters). Results show positive scaling with model capability on prompted lying, but sharp performance drops on trained organisms, with DYL retaining the most signal. The chain-of-thought judge achieved 0.82 balanced accuracy, partly due to verification favoring CoT-readable beliefs. Datasets, model organisms, and trained detectors are released.

lie detectors · chain-of-thought · activation probes · model organisms · prompted-lying

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

arXiv cs.AI · Mahmoud Srewa, Praneetsai Iddamsetty, Mohammad Abdullah Al Faruque, Salma Elmalaki · 2026-06-10

PersonaDrive introduces a retrieval-augmented vision-language-action (VLA) pipeline for closed-loop driving simulation, enabling style-diverse non-ego agents without per-style retraining. The method involves offline triplet mining over per-style human driving data, training a lightweight retrieval head fusing visual features with a control encoder, and fine-tuning a VLA backbone to treat retrieved context points as in-context behavioral demonstrations. Evaluated on Bench2Drive, PersonaDrive improves driving scores by 4.6% over SimLingo and 2.5% over HiP-AD, achieving the highest scores across aggressive, neutral, and conservative styles within a 2% band, with speed and acceleration increasing by 18% and 25% from conservative to aggressive styles.

vision-language-action · closed-loop simulation · triplet mining · retrieval head · in-context learning

From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation

arXiv cs.AI · Honglin He, Zhizheng Liu, Yukai Ma, Bolei Zhou · 2026-06-10

The paper introduces FlowPilot, a mapless navigation policy for long-horizon sidewalk navigation using only a monocular RGB camera. The method combines anchored flow matching for action representation during policy pre-training on large-scale robot fleet data with human-in-the-loop preference learning to enhance counterfactual reasoning and social compliance. Evaluations in simulation and real-world environments show FlowPilot achieves a 42% success rate and 66% route completion, with FlowPilot-HP reducing the intervention rate (IR) by 40.0% and the near-intervention rate (NIR) by 52.1%.

anchored flow matching · human-in-the-loop · counterfactual reasoning · social compliance · monocular rgb

Emerging Flexible Designs for Geospatial Multimodal Foundation Models

arXiv cs.AI · Philipe Dias, Waqwoya Abebe, Abhishek Potnis, Aristeidis Tsaris · 2026-06-10

This study conducts a systematic comparison of diverse foundation model architectures for geospatial multimodal reasoning, focusing on flexibility across spectral band configurations. The authors standardize pretraining using identical self-supervised learning objectives and datasets, then evaluate models with consistent parameterization on the GEOBench benchmark for classification and segmentation tasks. Results reveal architectural trade-offs between model flexibility, modality alignment, and downstream task performance, providing practical insights for designing next-generation geospatial foundation models capable of robust multimodal reasoning.

foundation models · geospatial multimodal · self-supervised learning · spectral band · geobench

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

arXiv cs.AI · Joshua Ong Jun Leang, Zheng Zhao, Mihaela Cătălina Stoian, Qiyuan Xu · 2026-06-10

Pythagoras-Prover introduces a compute-efficient family of Lean theorem provers, combining autoregressive (4B/32B) and diffusion-based (4B) models, trained via curriculum supervised fine-tuning on stratified Lean-verified corpora. The method employs Augmented Lean Formalisation (ALF) to expand training data through self-distillation and formal statement variants, alongside dynamic proof-reasoning filtering to maintain 8k-token context budgets. Results show Pythagoras-Prover-4B outperforms DeepSeek-Prover-V2-671B (86.1% vs 82.4% pass@32 on MiniF2F-Test) with 167x fewer parameters, while the 32B variant achieves 93.0% on MiniF2F-Test and solves 93 PutnamBench problems.

lean theorem prover · augmented lean formalisation · curriculum supervised fine-tuning · diffusion-based prover · proof-reasoning filtering
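
For readers unfamiliar with the pass@32 metric quoted above, a minimal sketch of the widely used unbiased pass@k estimator follows. The summary does not say whether the paper uses this estimator or simply draws exactly 32 proof attempts per problem, so the function and its argument names are assumptions for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n attempts of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)


# With exactly k attempts per problem the estimator reduces to solved / unsolved.
solved = pass_at_k(32, 1, 32)
unsolved = pass_at_k(32, 0, 32)

# With more attempts than k it interpolates between the two.
partial = pass_at_k(64, 2, 32)
```

A benchmark-level score such as the 86.1% reported for MiniF2F-Test would then be the mean of this per-problem value over all problems.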

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

arXiv cs.AI · Shayan Mohammadizadehsamakosh, Pritam Sarkar, Leonid Sigal, Ali Etemad · 2026-06-10

The paper introduces a fine-grained alignment framework for medical LVLMs, addressing limitations in existing preference optimization methods. It combines a bidirectional token-wise KL regularizer with a visual-contrastive grounding objective to improve clinical correctness while preserving linguistic style. The approach corrects only erroneous spans in model outputs using minimally edited preference pairs. Experiments across medical imaging tasks and text generation benchmarks demonstrate its effectiveness in reducing factual inconsistencies and improving visual grounding.

lvlm · dpo · kl regularizer · visual-contrastive · clinical correctness

Strategic Decision Support for AI Agents

arXiv cs.AI · Shayan Kiyani, Sima Noorani, George Pappas, Hamed Hassani · 2026-06-10

The paper introduces a strategic decision-support framework for AI agents that minimizes support usage while controlling counterfactual missed-support error, defined as the probability of acting autonomously when support would have improved the outcome. The method formulates an optimization problem yielding population-level threshold rules, implements an online adaptive thresholding algorithm with randomized exploration, and proposes calibration-on-the-fly to reduce unnecessary support calls. Experiments across information gathering, human–AI collaboration, and tool-use scenarios demonstrate reliable error control with significant reductions in support usage.

decision support · counterfactual error · online thresholding · agentic systems · uncertainty quantification

Graph Reduction in Multirelational Networks: A Spreading-Oriented Reduction Benchmark

arXiv cs.AI · Mateusz Stolarski, Michał Czuba, Piotr Bielak, Piotr Bródka · 2026-06-10

The paper introduces SORB (Spreading-Oriented Reduction Benchmark), an open-source framework for evaluating influence maximization (IM) models with integrated graph reduction analysis. SORB operates on diverse networks (single-layer and multilayer) and measures reduction impacts via standardized metrics such as $\mathrm{Gain}@k$ and $\mathrm{AUC}_{\mathrm{cutoff}}$. Experiments reveal that reduction effects are task- and network-dependent: sparsification preserves seed quality in single-layer networks, while multilayer networks suffer ranking degradation regardless of reduction strategy. The work emphasizes reduction-aware evaluation for spreading process studies.

influence maximization · graph reduction · multilayer networks · sparsification · benchmarking

EDEN: A Large-Scale Corpus of Clinical Notes for Italian

arXiv cs.AI · Tiziano Labruna, Guido Bertolini, Pietro Ferrazzi, Bernardo Magnini · 2026-06-10

The authors introduce EDEN (Emergency Department Electronic Notes), a large-scale Italian corpus of 4 million anonymized clinical notes from emergency departments, with 6,000 notes manually annotated via a 132-item Case Report Form (CRF) for dyspnea and loss of consciousness cases. The dataset features diverse value types (numerical, categorical, binary) and addresses data imbalance through iterative clinician review. They propose CRF-filling as a structured information extraction benchmark, providing zero-shot baselines using Gemma-27B and MedGemma-27B. EDEN is the largest freely available Italian clinical corpus.

clinical notes · case report form · information extraction · large language models · anonymization

Arbor: Tree Search as a Cognition Layer for Autonomous Agents

arXiv cs.AI · Neha Prakriya, Chaojun Hou, Zheng Gong, Huasha Zhao · 2026-06-10

Arbor introduces a multi-agent framework with structured tree search as a cognition layer for autonomous agents in large stateful action spaces. The system maintains a shared search tree of scored hypotheses as working memory, using failures as diagnostic signals and successes to shift exploration. It employs an Orchestrator agent for optimization and a Critic agent for stability checks, decomposing capabilities into hard (domain expertise) and soft (coordination) skills. Evaluated on LLM inference optimization, Arbor achieves up to 193% throughput-latency Pareto improvement over baselines, with 2% run-to-run variance, demonstrating hardware-agnostic reproducibility.

tree search · multi-agent framework · llm inference · pareto improvement · autonomous optimization

Foresight: Iterative Reasoning About Clues that Matter for Navigation

arXiv cs.AI · Arthur Zhang, Carl Qi, Donne Su, Xiangyun Meng · 2026-06-10

Foresight introduces a test-time framework for open-world mapless navigation using iterative reasoning with Vision-Language Models (VLMs). The method alternates between proposing image-space motion plans and critiquing them based on language goals and visual context, conditioned on prior critiques for refinement. A reward model trained from human feedback aligns plan critiques with behavior preferences via reinforcement learning. Evaluations show 37% higher task success and 52% fewer interventions versus baselines, running in real-time on a Jetson AGX Orin.

vision-language models · test-time reasoning · motion planning · reinforcement learning · open-world navigation

Understanding Truncated Positional Encodings for Graph Neural Networks

arXiv cs.LG · James Flora, Mitchell Black, Weng-Keen Wong, Amir Nayyeri · 2026-06-11

The paper analyzes the theoretical properties of truncated positional encodings (PEs) in graph neural networks (GNNs), revealing fundamental differences in expressive power between spectral and walk-based variants when truncated. While complete PEs exhibit equivalent expressivity (between 1-WL and 3-WL tests), truncated spectral PEs lose their advantage over 1-WL. The study introduces $k$-harmonic distances to demonstrate nuanced expressivity differences among closely related truncated PEs. Empirical results on real-world datasets show that combining multiple truncated PE families outperforms using any single variant.

positional encodings · graph neural networks · spectral methods · walk-based methods · expressive power

Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

arXiv cs.LG · Guo Yu, Wenlin Liu, Yulan Hu, Hao-Xuan Ma · 2026-06-11

The paper analyzes the parameter update characteristics of on-policy distillation (OPD), revealing two key findings. First, OPD updates exhibit coordinate sparsity, being distributed across layers with feed-forward network (FFN) dominance, where training just the identified subnetwork achieves comparable performance to full OPD. Second, updates are numerically full-rank but spectrally concentrated, primarily affecting near-zero weight coordinates and avoiding principal singular subspaces. The study employs optimizer ablation (SGD vs AdamW) across language and vision-language models, showing AdamW's superior performance due to preserved gradient scale heterogeneity.

on-policy distillation · coordinate sparsity · spectral concentration · feed-forward networks · gradient scale heterogeneity

Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

arXiv cs.LG · Nathaniel Bottman, Yinhong Liu, Kyle Richardson · 2026-06-11

The paper introduces operadic consistency (OC), a label-free diagnostic for detecting compositional reasoning failures in LLMs by comparing direct answers with decomposed query responses. Leveraging operad theory, OC evaluates agreement between these response modes across twelve instruction-tuned LLMs (4B to 671B parameters) on four multi-hop QA datasets. Results show OC strongly correlates with accuracy (Pearson r ∈ [0.86, 0.94]), outperforms self-consistency baselines (e.g., CoT-SC drops to r ≈ 0.45 on MuSiQue/StrategyQA), and improves selective prediction (AUARC gains of +0.086 to +0.096).

operadic consistency · compositional reasoning · multi-hop qa · self-consistency · selective prediction
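
The core comparison, a direct answer versus an answer composed from sub-questions, with no gold label involved, can be sketched as below. The operadic formalism itself is not reproduced; the two-hop item layout, the `normalize` rule, and the canned `answers` table standing in for a model are all assumptions for illustration.

```python
def normalize(answer):
    """Crude answer normalization (an assumption; real matching may be fuzzier)."""
    return answer.strip().lower()


def consistency_rate(items, ask):
    """Fraction of items where the direct answer matches the composed answer."""
    agree = 0
    for item in items:
        direct = ask(item["question"])
        bridge = ask(item["hop1"])                                # first sub-question
        composed = ask(item["hop2_template"].format(bridge=bridge))  # second, using hop 1
        agree += normalize(direct) == normalize(composed)
    return agree / len(items)


# Canned responses standing in for an LLM; the second direct answer is a simulated error.
answers = {
    "What is the capital of the country containing Lyon?": "Paris",
    "Which country contains Lyon?": "France",
    "What is the capital of France?": " paris",
    "What is the capital of the country containing Perth?": "Sydney",
    "Which country contains Perth?": "Australia",
    "What is the capital of Australia?": "Canberra",
}
items = [
    {"question": "What is the capital of the country containing Lyon?",
     "hop1": "Which country contains Lyon?",
     "hop2_template": "What is the capital of {bridge}?"},
    {"question": "What is the capital of the country containing Perth?",
     "hop1": "Which country contains Perth?",
     "hop2_template": "What is the capital of {bridge}?"},
]
rate = consistency_rate(items, lambda q: answers[q])
```

The disagreement on the second item flags a likely failure without consulting any reference answer, which is the sense in which the signal is label-free.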

The Stable Recovery Manifold: Geometric Principles Governing Recoverability in Continual Learning

arXiv cs.LG · Ayushman Trivedi, Bhavika Melwani · 2026-06-11

The paper introduces the Stable Recovery Manifold hypothesis, proposing that catastrophic forgetting in continual learning stems from accessibility issues rather than information destruction. Using Split CIFAR-100 and a ResNet-18, the authors analyze recoverability via Recovery Subspace Dimensionality (k_t) and representational drift across ten tasks. Results show stable recovery dimensionality (mean k_t = 8.0) despite drift, with principal-angle drift strongly predicting recoverability (r = -0.862). A geometric model explains 82.2% of recoverability variance, supporting the hypothesis that forgotten knowledge remains compactly decodable.

catastrophic forgetting · recovery subspace dimensionality · representational drift · continual learning · stable recovery manifold

Aerial Wildfire Suppression Planning with a Hybrid CNN-Cellular Automata Fire Model

arXiv cs.LG · Ion Matei, Maksym Zhenirovskyy, Takuya Kurihana, Rohit Vupala · 2026-06-11

The paper introduces a hybrid CNN-cellular automata model for aerial wildfire suppression planning, integrating fire spread prediction with intervention strategy optimization. The framework combines neural network-based terrain analysis with cellular automaton fire dynamics, enabling gradient-based optimization of aerial drop parameters (location, orientation) for water and retardant deployment. Uncertainty quantification includes Monte Carlo sampling for aleatoric effects and spatially correlated perturbations for epistemic errors. A case study on the 2020 Bear Fire demonstrates the model's capability to generate suppression schedules that reduce fire-affected area while accounting for operational uncertainties.

wildfire suppression · cellular automata · cnn · gradient-based optimization · uncertainty quantification

Generative Modeling of Bach-Style Symbolic Music: A Comparative Study of Autoregressive, Latent-Variable, and Adversarial Approaches

arXiv cs.LG · Kyuil Lee, Dezhi Yu, Yongkang Huang · 2026-06-11

The study compares three generative modeling approaches for Bach-style symbolic music: autoregressive LSTMs with attention, latent-variable models (recurrent and vector-quantized VAEs), and adversarial networks. Using a shared MIDI corpus, the evaluation focuses on polyphonic sequence modeling, latent representation quality, and stylistic coherence. Results indicate autoregressive LSTMs with attention yield the most musically coherent samples, while vector quantization improves latent structure over conventional VAEs. Adversarial methods capture local pitch patterns but exhibit training instability and weaker generalization to Bach's style, revealing comparative strengths and limitations of each paradigm.

autoregressive · lstm · vae · gan · polyphonic

Majority-of-Three is Optimal

arXiv cs.LG · Divit Rawal, Nikita Zhivotovskiy · 2026-06-11

The paper establishes that a majority vote of three independent consistent classifiers constitutes an optimal learner in the realizable PAC learning setting. This result simplifies both the algorithmic structure and probabilistic analysis compared to previous voting learners, including Hanneke's algorithm and Larsen's analysis of bagging. The proof demonstrates optimality for the simplest voting scheme, providing a concise theoretical foundation for majority voting in PAC learning.

majority vote · pac learning · consistent classifiers · optimal learner · bagging
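
A minimal sketch of the voting scheme on a toy realizable problem (one-dimensional thresholds) is given below. Obtaining the three "independent" classifiers by training on three disjoint thirds of the sample is an assumption made here for illustration, as are the toy hypothesis class, the true threshold, and the sample size.

```python
import random
from statistics import median


def consistent_threshold(sample):
    """A consistent learner for thresholds: the smallest positively labelled point."""
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else float("inf")


def majority_of_three(sample):
    """Train one consistent classifier per disjoint third and return their majority vote."""
    third = len(sample) // 3
    parts = [sample[:third], sample[third:2 * third], sample[2 * third:3 * third]]
    thresholds = [consistent_threshold(part) for part in parts]

    def predict(x):
        votes = sum(x >= t for t in thresholds)
        return 1 if votes >= 2 else 0

    return predict, thresholds


rng = random.Random(0)
TRUE_THRESHOLD = 0.37  # assumed ground truth for the toy problem
sample = [(x, int(x >= TRUE_THRESHOLD)) for x in (rng.random() for _ in range(90))]
predict, thresholds = majority_of_three(sample)
```

For this hypothesis class the majority vote coincides with the classifier given by the median of the three learned thresholds, which makes the vote easy to inspect.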

Distribution-Agnostic Robust Trajectory Optimization via Chance-Constrained Reinforcement Learning

arXiv cs.LG · Yashdeep Chaudhary, Roberto Armellin, Harry Holt, Marco Sagliano · 2026-06-11

A distribution-agnostic robust trajectory-optimization framework is introduced, leveraging chance-constrained reinforcement learning to handle uncertainty in initial conditions and process noise. The method computes a deterministic nominal trajectory offline, then robustifies it via an affine closed-loop correction law with feedforward adjustments and time-varying feedback gains. Probabilistic feasibility is enforced using rollout-based upper-tail quantiles, while terminal dispersion is controlled via covariance-feasibility penalties. Evaluated on a 3D Earth-Mars transfer and a stochastic atmospheric rocket landing, the framework maintains competitive upper-tail fuel costs and probabilistic feasibility across heterogeneous spacecraft trajectory problems without structural redesign.

chance-constrained reinforcement learning · affine closed-loop correction · upper-tail quantiles · covariance-feasibility penalties · trajectory optimization

Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

arXiv cs.LG · Meher Sai Preetam, Meher Bhaskar · 2026-06-11

The paper introduces Simplex-Constrained Sparse Bagging (SCSB), a framework for compressing and calibrating bootstrap-based ensembles. SCSB addresses the limitations of uniform voting in standard bagging methods (e.g., Random Forests, Bagged SVMs) by formulating ensemble pruning and calibration as a joint optimization problem on the probability simplex, using Out-Of-Bag (OOB) loss minimization. It resolves the 'L1-simplex paradox' via a concave quadratic penalty to induce sparsity. Results show 96% ensemble compression, linear inference speedups, improved calibration (lower Expected Calibration Error), and maintained or enhanced generalization accuracy.

ensemble learning · probability simplex · out-of-bag loss · model compression · calibration error
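
The "L1-simplex paradox" mentioned above refers to the fact that every weight vector on the probability simplex has L1 norm exactly 1, so an L1 penalty cannot favor sparse solutions there. The sketch below shows this, along with one plausible reading of the fix: projected gradient descent on a squared out-of-bag loss minus a quadratic term. The exact objective is not given in the summary, so the loss, the penalty form `-lam * ||w||^2`, the step size, and the toy predictions are all assumptions.

```python
def project_to_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1} (sort-based method)."""
    ordered = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, value in enumerate(ordered, start=1):
        cumulative += value
        shift = (cumulative - 1.0) / i
        if value - shift > 0:
            theta = shift
    return [max(x - theta, 0.0) for x in v]


def fit_weights(oob_preds, labels, lam=0.1, lr=0.1, steps=200):
    """Projected gradient descent on mean squared OOB error minus lam * ||w||^2."""
    m, n = len(oob_preds), len(labels)
    w = [1.0 / m] * m  # start from uniform bagging weights
    for _ in range(steps):
        ensemble = [sum(w[j] * oob_preds[j][i] for j in range(m)) for i in range(n)]
        grad = [
            (2.0 / n) * sum((ensemble[i] - labels[i]) * oob_preds[j][i] for i in range(n))
            - 2.0 * lam * w[j]  # gradient of the concave penalty
            for j in range(m)
        ]
        w = project_to_simplex([w[j] - lr * grad[j] for j in range(m)])
    return w


uniform, vertex = [0.25] * 4, [1.0, 0.0, 0.0, 0.0]
l1_uniform, l1_vertex = sum(map(abs, uniform)), sum(map(abs, vertex))
sq_uniform, sq_vertex = sum(x * x for x in uniform), sum(x * x for x in vertex)

labels = [1, 0, 1, 1, 0, 0]  # hypothetical OOB targets
oob_preds = [                # hypothetical per-member OOB probabilities
    [0.9, 0.1, 0.8, 0.7, 0.2, 0.1],
    [0.6, 0.4, 0.5, 0.6, 0.5, 0.4],
    [0.2, 0.8, 0.3, 0.4, 0.7, 0.9],
    [0.8, 0.2, 0.9, 0.6, 0.1, 0.3],
]
weights = fit_weights(oob_preds, labels)
```

The L1 norms of the uniform and single-member vectors are identical, while the squared norm is larger at the vertex, so subtracting it rewards concentrating weight on few members.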

Learning with Simulators: No Regret in a Computationally Bounded World

arXiv cs.LG · Sasha Voitovych, Abhishek Shetty, Noah Golowich, Alexander Rakhlin · 2026-06-11

The paper introduces simulatable processes, where learners access a simulator approximating the data-generating distribution, even for dependent processes. It demonstrates that this framework achieves learning guarantees comparable to classical independent-data settings, including VC-dimension-dependent error bounds. The work also analyzes conditional sampling's power, revealing statistical and computational advantages. A key result is a universal algorithm that learns any VC class under polynomial-time-samplable processes, with regret bounded by the process's time-bounded Kolmogorov complexity, thus extending PAC learning to dependent data scenarios.

simulatable processes · vc dimension · conditional sampling · kolmogorov complexity · pac learning

Adjusted Cup-Product Neural Layer

arXiv cs.LG · Snigdha Chandan Khilar · 2026-06-11

The paper introduces an adjusted cup product neural layer, a novel neural primitive incorporating cup products of cochains with an adjustment term from higher gauge theory. This design ensures gauge invariance by construction. The key theoretical result demonstrates that the layer's output on closed cycles depends solely on the adjustment coefficient, with zero coefficient yielding null output. The authors prove this observable constitutes a nonzero quadratic form and exhibits exact invariance under one- and two-form gauge transformations.

cup product · cochains · gauge invariance · neural primitive · quadratic form

A2D2: Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding

arXiv cs.LG · Sophia Tang, Yuchen Zhu, Molei Tao, Pranam Chatterjee · 2026-06-11

A2D2 introduces a unified framework for reward-guided fine-tuning of any-length discrete diffusion models, optimizing insertion and unmasking policies alongside a quality-based inference schedule. The method derives the Radon-Nikodym derivative for joint path measures, ensuring convergence to reward-tilted distributions without target samples, and proposes the Adaptive Joint Decoding (AJD) loss for optimal path measures. Empirical results show A2D2 improves reward optimization, generation flexibility, and accuracy over fixed-length fine-tuning and inference-time guidance methods.

discrete diffusion · reward-guided fine-tuning · any-length generation · adaptive decoding · radon-nikodym derivative

NetCause: Counterfactual Learning for Root Cause Analysis in Large-Scale Networks

arXiv cs.LG · Fabien Chraim, Jian Zhang, Dominik Janzing, Xiang Song · 2026-06-11

NetCause introduces a self-supervised learning framework for root cause analysis in large-scale networks by modeling incidents as graph-temporal processes and employing counterfactual simulation to rank candidate root causes. The method trains on 1,500 incidents from a cloud provider's production network and evaluates on 31 expert-labeled cases, achieving a 16.1% accuracy improvement over rule-based baselines. Inference requires only seconds of GPU runtime, making it practical for operational deployment despite computationally intensive training.

root cause analysis · counterfactual learning · graph-temporal modeling · self-supervised learning · network incidents

Graphical Causal Reasoning for Root Cause Analysis in Cloud Networks

arXiv cs.LG · Fabien Chraim, Dominik Janzing, John Evans · 2026-06-11

The paper introduces a graph-based causal discovery method for root cause analysis (RCA) in cloud networks, addressing limitations of rule-based automation. The approach constructs causal graphs from binary time series using bivariate Granger causality and conditional independence tests, then performs time-aware probabilistic inference via edge-specific conditional probabilities. Evaluation on 35 labeled incidents from a major cloud provider showed 85.7% recall and 74.3% exact match rates, with successful deployment in 800+ production incidents.

root cause analysis · granger causality · conditional independence · causal graph · cloud networks

Ride, Track, and Recover: Pilot Randomized Trial of a Wearable Digital Self-Management Intervention During a Veteran Endurance-Cycling Program

arXiv cs.LG · Alan Ta, Nilsu Salgin, Caleb Armstrong, Kala Phillips Reindel · 2026-06-11

This pilot randomized trial evaluates a wearable digital self-management intervention for veterans with PTSD during an endurance-cycling program. Participants were assigned to digital intervention plus cycling (n=7), cycling only (n=3), or at-home monitoring (n=7 controls). Smartwatch-derived heart rate and accelerometer features detected hyperarousal events, validated by participants. Generalized additive mixed models revealed differential symptom trajectories: the intervention group showed stabilized hyperarousal and maintained gains post-event, while the cycling-only group exhibited late-study escalation. Higher-severity participants confirmed more ML-detected events, suggesting personalized wearable systems may enhance PTSD symptom management.

wearable sensing · hyperarousal detection · generalized additive models · digital intervention · real-time validation

MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

arXiv cs.LG · Hanyang Yu, Haitao Lin, Jingbo Zhang, Wenyao Zhang · 2026-06-11

MaskWAM introduces an object-centric world-action model (WAM) that unifies mask prompting and prediction via a Mixture of Transformers (MoT) to address spatial bottlenecks in robotic control. By leveraging masks as both inputs and predictions, it enhances semantic grounding and reduces referential ambiguity in cluttered scenes. Evaluations on LIBERO, RoboTwin, and real-world tasks show MaskWAM outperforms baselines in language-clear and ambiguous scenarios, demonstrating robust policy generalization through object-centric supervision and visual prompts.

mask prompting · world-action models · mixture of transformers · object-centric · referential ambiguity

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

arXiv cs.LG · Xinwei Qiang, Yifan Hu, Shixuan Sun, Jing Yang · 2026-06-11

GF-DiT introduces elastic parallelism scheduling for Diffusion Transformer (DiT) serving, addressing workload heterogeneity through dynamic GPU resource allocation. The system employs asynchronous execution abstraction to decompose requests into schedulable trajectory tasks and group-free collectives for low-overhead communication. Evaluated on image/video diffusion workloads in vLLM-Omni, GF-DiT achieves 6.01× throughput improvement, 95% latency reduction, 90% fewer SLO violations, and reduces communication-group setup overhead from 778 ms to 60 μs compared to static parallelism approaches.

diffusion transformers · elastic parallelism · gpu scheduling · group-free collectives · vllm-omni

Reinforcement Learning for Neural Model Editing

arXiv cs.LG · Shaivi Malik · 2026-06-11

The paper introduces a reinforcement learning framework for neural model editing, formulating it as an RL problem where agents modify models via reward feedback. Two environments are proposed: MaskWorld (multiplicative weight scaling) and ShiftWorld (additive weight updates), with rewards combining utility preservation and task-specific objectives. Evaluations on bias mitigation (text classification) and machine unlearning (image classification) show learned policies reduce forget set accuracy to ~0% while preserving >90% retain set accuracy, and improve bias-related performance by >5% without compromising general utility. This demonstrates RL can automate editing policy design.

reinforcement learning · model editing · machine unlearning · bias mitigation · weight modification

Optical Implementation of Equilibrium Propagation Using Spatial Photonic Ising Machines

arXiv cs.LG · Dimitri Vanden Abeele, Daniele Veraldi, Davide Pierangeli, Claudio Conti · 2026-06-11

The authors present a hybrid optical-digital implementation of Equilibrium Propagation (EP) using a Spatial Photonic Ising Machine (SPIM), offering an energy-efficient alternative to traditional machine learning. The SPIM employs gauge transformation to optically encode continuous neuron states and rank-1 binary trainable patterns via phase modulations on a spatial light modulator, with inference performed using a finite difference scheme. Experimental validation on the Wine dataset and numerical evaluation on MNIST demonstrate the potential of continuous couplings and structured coupling matrices. This work advances physical implementations of EP for energy-based networks.

equilibrium propagation · spatial photonic ising machine · gauge transformation · energy-based networks · phase modulation

Uncertainty Estimation for Molecular Diffusion Models

arXiv cs.LG · Paul Seij, Christian A. Naesseth, Stephan Mandt, Metod Jazbec · 2026-06-11

The authors propose a post-hoc method for estimating per-sample uncertainty in pretrained molecular diffusion models, addressing the lack of quality signals in generated molecules. Their approach builds on a Laplace approximation of the denoising network, measuring noise prediction variability across the generation trajectory. Experiments demonstrate that the uncertainty score negatively correlates with established quality metrics (e.g., validity, uniqueness) and enables test-time filtering to improve model performance.

molecular diffusion models · uncertainty estimation · laplace approximation · denoising network · test-time scaling

Clustering Node Attributed Networks with Graph Neural Networks and Self Learning

arXiv cs.LG · Rodrigo de Sapienza Luna, Daniel Ratton Figueiredo · 2026-06-11

The authors propose a self-learning framework for clustering attributed graphs using graph neural networks (GNNs) in an unsupervised setting. The method iteratively refines clusters by alternating between GNN-based node representation learning and graph reconstruction, using both original edges and a dynamically constructed context graph. Empirical evaluation demonstrates superior performance over network-only or attribute-only baselines on synthetic data when neither modality is highly informative, with iterative learning outperforming single-round GNN clustering. On real-world datasets, the method achieves competitive results with state-of-the-art approaches for balanced cluster distributions.

graph neural networks · attributed graph clustering · self-learning · unsupervised learning · node representation learning

How Much Memory Do We Need? Adaptive Memory Gate for Neural Operators

arXiv cs.LG · Jihyeon Hur, Yongseok Kwon, Min-Gi Jo, Jeongwhan Choi · 2026-06-11

The paper introduces AMGFNO, an adaptive memory-gated Fourier neural operator that dynamically modulates memory weights based on observation conditions for solving time-dependent PDEs. The method employs a learnable gate to adjust memory retention, addressing limitations of fixed-weight approaches. Experiments on Kuramoto-Sivashinsky and Burgers' equations demonstrate 55-79% nRMSE reduction at low resolutions, with gate values automatically decaying from ~0.7 to near-zero as resolution increases.

neural operators · memory gates · pde solving · adaptive weights · fourier neural networks

S-GBT: Smooth Growth Bound Tensor for Certified Robustness Against Word Substitution Attacks in NLP

arXiv cs.LG · Mohammed Bouri, Mohammed Erradi, Adnane Saoud · 2026-06-11

The Smooth Growth Bound Tensor (S-GBT) introduces a second-order method to certify robustness against word substitution attacks in NLP by bounding the Hessian element-wise, addressing limitations of first-order sensitivity approaches. S-GBT incorporates a regularization term during training to minimize these bounds, combining linear and quadratic terms to constrain output changes under perturbations. Derived for Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN), S-GBT integrates directly into the training objective. Evaluations on benchmark datasets demonstrate up to 23.4% improvement in certified robust accuracy over prior methods, while maintaining competitive clean accuracy, highlighting the efficacy of jointly controlling gradient and curvature.

smooth growth bound tensor · hessian · word substitution attacks · certified robustness · regularization
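
The role of the "linear and quadratic terms" in the S-GBT summary above can be illustrated with a generic second-order Taylor bound. This is a textbook inequality under an assumed element-wise Hessian bound, not the paper's exact certificate; the symbols $f$, $\delta$, $H$, and $M$ are introduced here purely for illustration.

```latex
% Assume f is scalar-valued and twice continuously differentiable, and that
% |H_{ij}(z)| \le M_{ij} for every z on the segment between x and x + \delta,
% where H is the Hessian of f.
\[
  |f(x+\delta) - f(x)|
  \;\le\;
  \sum_{i} \left|\frac{\partial f}{\partial x_i}(x)\right| |\delta_i|
  \;+\;
  \frac{1}{2} \sum_{i,j} M_{ij}\, |\delta_i|\, |\delta_j| .
\]
```

The first sum depends only on the gradient at $x$, while the second depends on the element-wise Hessian bound, which is the quantity a curvature regularizer would aim to shrink.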

Accelerating Speculative Diffusions via Block Verification

arXiv cs.LG · Alexander Soen, Hisham Husain, Valentin De Bortoli, Arnaud Doucet · 2026-06-11

The paper introduces a novel speculative sampling scheme for diffusion models that enables block verification, improving draft acceptance rates. By adapting LLM-style speculative decoding to continuous spaces, the method efficiently samples from residual distributions and incorporates a zero-training Free Drafter heuristic. Experiments demonstrate up to 6.3% speedup over existing speculative diffusion methods with minimal overhead during parallel verification.

speculative decoding · diffusion models · block verification · residual distribution · free drafter
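
For context on the "LLM-style speculative decoding" and "residual distribution" mentioned above, the standard discrete accept/reject rule can be sketched as follows. This is the well-known token-level scheme on a toy categorical distribution, not the paper's continuous-space block-verification method; `speculative_step`, `residual_distribution`, and the example distributions are illustrative assumptions.

```python
import random


def residual_distribution(target, draft):
    """Normalized max(0, p - q): the distribution sampled from after a rejection."""
    excess = [max(p - q, 0.0) for p, q in zip(target, draft)]
    total = sum(excess)
    if total == 0.0:  # target equals draft, so a rejection can never happen
        return list(target)
    return [e / total for e in excess]


def speculative_step(target, draft, rng):
    """Draw one sample distributed according to `target` using a proposal from `draft`."""
    indices = range(len(draft))
    proposal = rng.choices(indices, weights=draft)[0]
    accept_prob = min(1.0, target[proposal] / draft[proposal])
    if rng.random() < accept_prob:
        return proposal, True
    fallback = rng.choices(indices, weights=residual_distribution(target, draft))[0]
    return fallback, False


target = [0.5, 0.5]  # p: distribution we want to sample from
draft = [0.9, 0.1]   # q: cheaper proposal distribution
rng = random.Random(0)
samples = [speculative_step(target, draft, rng)[0] for _ in range(20000)]
frequency_of_zero = samples.count(0) / len(samples)  # should be close to 0.5
```

Accepting with probability min(1, p/q) and falling back to the residual on rejection is what makes the output follow `target` exactly, even though every proposal comes from `draft`.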

Foundations of Practical Quantum Advantage in Quantum-Informed Machine Learning for Predicting Chaos

arXiv cs.LG · Maida Wang, Xiao Xue, Minh Chung, Peter V. Coveney · 2026-06-11

The authors establish theoretical foundations for practical quantum advantage in quantum-informed machine learning applied to chaotic dynamical systems. They introduce a family of k-indexed higher-order quantum statistical priors (Q-Priors) that encode k-point marginals of the invariant measure on n_q = kq qubits, leveraging superposition and entanglement for compact representation. A two-stage advantage is proven: quantum protocols estimate post hoc Pauli functionals with copy-pair counts independent of n_q, contrasting classical protocols requiring Ω(2^(n_q)) copies. Simulations and IQM superconducting processors validate the mechanism. Case studies demonstrate improved anomaly-correlation skill by 10-39% in medium-range weather forecasting and enhanced velocity-direction coherence in turbulent channel-flow analysis.

quantum-informed machine learning · chaotic dynamical systems · quantum statistical priors · pauli functionals · superconducting processors

Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs

arXiv cs.LG · Huyen Vo, María Martínez-García, Isabel Valera · 2026-06-11

Hölder++ improves the quality-coherence trade-off in multimodal VAEs by introducing three innovations: exact Hölder pooling without approximation, separate modeling of shared and private representations (Hölder+), and hierarchical inference for better disentanglement (Hölder++). The method outperforms MMVAE+ in coherence while maintaining sample diversity, produces more structured latent spaces, and yields shared representations useful for downstream tasks. Experiments confirm these advantages across multiple metrics.

multimodal vae · hölder pooling · latent disentanglement · hierarchical inference · generative coherence

Positional Encoding in the Context of Memristor-Based Analog Computation for Automatic Speech Recognition

arXiv cs.LG · Benedikt Hilmes, Nick Rossenbach, Ralf Schlüter · 2026-06-11

The study addresses degradation in memristor-based analog computation for automatic speech recognition caused by large output values from transformed positional encodings. By adjusting the weight and precision bit proportions in analog-to-digital conversion (ADC) layers, degradation is reduced by ~50% in relative terms while energy consumption remains stable. For scenarios where ADC modification is not feasible, removing encoding-related linear transformations reduces degradation by ~30% in relative terms. The findings highlight the impact of positional encoding transformations on memristor-based analog computation and propose practical mitigation strategies.

memristor · positional encoding · analog computation · degradation · automatic speech recognition

VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

arXiv cs.LG · Amir Mann, Gal Michael Harari, Merav Keidar, Or Litany · 2026-06-11

VideoMDM introduces a diffusion-based framework for generating 3D human motion from 2D supervision, eliminating the need for 3D ground truth. The method employs a pretrained 2D-to-3D lifter to produce approximate 3D pose sequences, which are diffused, denoised in 3D, and supervised in 2D via depth-weighted reprojection loss. The framework adapts standard 3D motion regularizers (velocity consistency and over-parameterized representation alignment) to the 2D setting, learning a coherent 3D motion manifold during training. On HumanML3D, VideoMDM achieves an FID of 0.88, compared with 0.54 for fully 3D-supervised MDM. It also demonstrates strong performance on Fit3D and NBA datasets, generating motions consistently preferred by humans.

diffusion-based framework · 2d-to-3d lifter · depth-weighted reprojection loss · 3d motion manifold · velocity consistency

Enhanced Low-Density Region Exploration in Classifier-Guided Diffusion Models Through Modified Reverse Diffusion Sampling

arXiv cs.LG · Jagriti Singh, Shekhar Verma, Muneendra Ojha · 2026-06-11

The paper introduces a sampling-time modification for classifier-guided diffusion models that enhances exploration of low-density regions without additional training. The method combines two guidance mechanisms: steering trajectories toward low-confidence regions via modified classifier gradients and directing sampling toward the predicted real image manifold. Evaluated on ImageNet with ADM models at 64×64 and 256×256 resolutions, the approach improves recall while maintaining FID scores and generates perceptually high-quality samples.

diffusion models · classifier guidance · low-density sampling · image synthesis · reverse diffusion

Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios

arXiv cs.LG · Kaijie Xu, Anqi Wang, Xilin Dai · 2026-06-11

The authors introduce PowerPhase, a probabilistic forecasting benchmark for power systems with 2,000-36,964 channels, exceeding existing multivariate benchmarks by an order of magnitude. PowerPhase includes constraint-aware metrics (Safety_mBrier, NECV, CVaR-alpha) and evaluates on AC power-flow outputs. They propose PowerForge, a scenario-based quantile forecaster with type-specific decoding heads and causal bridging, which achieves top average rank across all grids. Results show a safety-fidelity trade-off, where distributional accuracy and constraint satisfaction rank models differently.

probabilistic forecasting · multivariate time series · power systems · scenario-based quantile · constraint-aware metrics

Quantizing Time-Series Models As Dynamical Systems: Trajectory-Based Quantization Sensitivity Score

arXiv cs.LG · Mariya Pavlova, Harrison Bo Hua Zhu, Elizaveta Semenova, Yingzhen Li · 2026-06-11

The authors propose Trajectory-based Quantization Sensitivity Score (TQS), a novel metric for post-training quantization (PTQ) that analyzes error propagation in time-series models through dynamical-systems theory. TQS decouples sensitivity estimation from quantizer selection, enabling a priori quantization budget planning even for black-box networks. Their TQS-PTQ framework eliminates calibration data requirements and second-order approximations while supporting mixed-precision deployment. Experiments demonstrate that this dynamical-systems approach provides robust low-precision performance in resource-constrained scenarios.

post-training quantization · dynamical systems · sensitivity analysis · mixed-precision · error propagation

Simultaneous Latent Budget Trees for Stratified Classification

arXiv cs.LG · Cristian Buoncompagni, Stefano Pellegrino, Giulia Vannucci, Roberta Siciliano · 2026-06-11

The paper introduces Simultaneous Latent Budget Trees (SLBT), a probabilistic machine learning framework for classification trees incorporating stratification factors like temporal, spatial, or demographic variables. SLBT employs a model-based split rule where child nodes are interpreted as latent components of a simultaneous mixture model, optimizing conditional splits. Parameters are estimated via least squares with a neural network perspective, enabling interpretable tree structures with interactive visualization tools. The framework includes measures to handle unbalanced response class distributions. Applied to gender-related differences in Amyotrophic Lateral Sclerosis progression, SLBT demonstrates its utility in stratified classification. The associated SLBT library is available on GitHub.

simultaneous latent budget trees · stratification factor · model-based split rule · neural network perspective · interpretable tree structure

Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers

arXiv cs.LG · Samuel Erickson, Mikael Johansson · 2026-06-11

The paper theoretically justifies gradient clipping's stabilizing effect in asynchronous stochastic gradient descent (ASGD) by eliminating maximum delay dependence in oracle complexity. Employing a sub-Weibull noise model that generalizes sub-Gaussian and sub-exponential distributions, the analysis captures heavy-tailed gradients observed in deep learning. Results demonstrate convergence in expectation and—for the first time in asynchronous optimization—high-probability convergence, addressing straggler-induced delays in distributed and federated settings.

asynchronous sgd · gradient clipping · sub-weibull noise · oracle complexity · straggler robustness
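
The clipping operator discussed in the summary above is simple to state. The sketch below pairs norm clipping with a toy delayed-gradient loop; the quadratic objective, the fixed delay, the noiseless gradients, and all function names are assumptions made for illustration, not the paper's setup.

```python
import math


def clip_by_norm(grad, threshold):
    """Scale grad so its Euclidean norm is at most `threshold`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    return [g * threshold / norm for g in grad]


def delayed_clipped_sgd(x0, grad_fn, lr, threshold, delay, steps):
    """SGD where each update uses a gradient computed at an iterate `delay` steps old."""
    history = [list(x0)]
    x = list(x0)
    for _ in range(steps):
        stale = history[max(0, len(history) - 1 - delay)]  # iterate seen by a slow worker
        g = clip_by_norm(grad_fn(stale), threshold)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
        history.append(list(x))
    return x


# Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
final = delayed_clipped_sgd([10.0, -10.0], lambda x: list(x), lr=0.1,
                            threshold=1.0, delay=3, steps=400)
final_norm = math.sqrt(sum(v * v for v in final))
```

Clipping caps the size of every update at `lr * threshold`, so a stale gradient computed far from the current iterate cannot move it arbitrarily far.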

ProtoX-AD: Self-Explainable Time Series Anomaly Detection and Characterization

arXiv cs.LG · Aitor Sánchez-Ferrera, Elisabeth Wetzer, Kristoffer Wickstrøm, Michael Kampffmeyer · 2026-06-11

ProtoX-AD introduces a prototype-based self-explainable framework for time series anomaly detection (TSAD), addressing the explainability limitations of self-supervised classification approaches. The method learns transformation-aware latent representations alongside interpretable prototypes, enabling both anomaly detection and characterization through prototype-based explanations. Experiments on synthetic and real-world datasets show ProtoX-AD matches black-box methods in detection performance (F1 scores) while providing more consistent and semantically meaningful explanations than existing explainable baselines.

time series anomaly detection · self-supervised learning · prototype-based explanations · interpretable machine learning · transformation-aware representations

Extracting Governing Equations from Latent Dynamics via Multi-View Contrastive Learning

arXiv cs.LG · Paolo Muratore, Mackenzie Weygandt Mathis · 2026-06-11

The authors present DYSCO, a multi-view temporal contrastive learning algorithm that jointly recovers latent trajectories and governing dynamics from noisy, high-dimensional observations. The method leverages multiple independent noisy views of the same underlying process to disentangle signal from noise, parameterizing dynamics in a structured functional basis to enable symbolic recovery of governing equations within an affine gauge. Theoretical guarantees establish strong identification up to an affine indeterminacy, extending prior results to noisy nonlinear observations. Empirical evaluations demonstrate accurate recovery of both latent trajectories and flow fields across diverse dynamical regimes (chaotic, oscillatory, metastable) under Gaussian and Poisson observation noise.

multi-view learning · contrastive learning · latent dynamics · system identification · affine gauge

To GAN or Not To GAN: Segmentation Analysis on Mars DEM

arXiv cs.LG · Douglas Dziedzorm Agbeve, Aditya V. Handrale, Salim Fares, Seif E. Idani · 2026-06-11

The paper presents an automated approach for detecting Martian mounds using neural network-based semantic segmentation, comparing supervised and generative adversarial methods. Leveraging Digital Elevation Models (DEMs), the study evaluates morphological parameters to identify potential signs of water or life-conducive environments. Results indicate that incorporating artificially generated data via GANs did not enhance segmentation performance, suggesting supervised methods suffice for this task. The work contributes to rover navigation and astrobiological research by reducing reliance on manual mapping.

semantic segmentation · digital elevation models · generative adversarial networks · martian morphology · supervised learning

Distributional Loss for Robust Classification

arXiv cs.LG · Kathleen Anderson, Thomas Martinetz · 2026-06-11

The paper introduces a novel loss function for supervised classification that optimizes classifier outputs as a bimodal Gaussian distribution, rather than enforcing direct input-to-label mappings. This distributional approach implicitly models class ambiguity, reduces overfitting, and promotes robust decision boundaries without requiring additional label information. Experiments show consistent robustness improvements, particularly in low-data regimes, with minimal modifications to standard training pipelines.

supervised classification · bimodal gaussian distribution · class ambiguity · robust decision boundaries · low-data regimes

From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation

arXiv cs.LG · Bora Kargi, David Salinas · 2026-06-11

The authors propose a conformal Elo estimation framework to address systematic errors in LLM-as-a-judge evaluations, enabling calibrated rankings without large-scale human annotations. At the local level, they propagate calibrated win probabilities into the Bradley-Terry procedure, reducing Elo estimation error to 17.9 MAE compared to human-derived ratings on 55 LMArena models. Globally, they apply split conformal prediction to residual gaps between LLM and human Elo ratings, providing prediction intervals with distribution-free marginal coverage guarantees. This two-layer approach yields low-cost, uncertainty-aware LLM evaluation tools. Code is released for reproducibility.

conformal prediction · bradley-terry · elo estimation · llm evaluation · uncertainty calibration
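
The global layer described above relies on split conformal prediction, whose core step is a single quantile of calibration residuals. The sketch below uses absolute residuals and made-up Elo numbers; the function names, the absolute-residual score, and the calibration values are assumptions for illustration, not details from the paper.

```python
import math


def conformal_quantile(calibration_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest score, or inf if n is too small."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_scores)[k - 1]


def prediction_interval(point_estimate, calibration_scores, alpha):
    """Symmetric interval with marginal coverage >= 1 - alpha under exchangeability."""
    q = conformal_quantile(calibration_scores, alpha)
    return point_estimate - q, point_estimate + q


# Hypothetical calibration set: |LLM-judge Elo - human Elo| for nine models.
abs_gaps = [12.0, 5.0, 30.0, 8.0, 21.0, 3.0, 17.0, 9.0, 14.0]
low, high = prediction_interval(1200.0, abs_gaps, alpha=0.15)
```

The `(n + 1)` correction in the quantile index is what yields the finite-sample, distribution-free guarantee; with too few calibration points for the requested `alpha`, the interval becomes infinite rather than silently under-covering.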

Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization

arXiv cs.LG · Mariia Onyshchuk, Maksym-Vasyl Tarnavskyi, Marta Sumyk · 2026-06-11

The study extends optimal transport (OT) analysis to detect hallucinations in neural machine translation (NMT) across all six decoder layers of the Fairseq DE-EN model (N=3,414), revealing complementary detection capabilities between Wass-to-Unif and Wass-to-Data metrics, with layers L1-L4 most predictive. OT achieves 57.2%/57.6% balanced accuracy on abstractive summarization faithfulness detection (AggreFact, N=1,116), below supervised methods (69.9%/74.3%) due to limitations in capturing downstream faithfulness failures. Structural analysis of T5-base confirms consistent decoder organization, with Layer 3 peak concentration and Layer 12 critical for generation quality.

optimal transport · hallucination detection · neural machine translation · abstractive summarization · cross-attention

Understanding helpfulness and harmless tension in reward models

arXiv cs.LG · Eshaan Tanwar, Pepa Atanasova · 2026-06-11

This work investigates the tension between helpfulness and harmlessness objectives in reward models for RLHF, revealing mechanistic insights into their conflicting representations. The authors analyze single-objective (helpfulness-only, harmlessness-only) and mixed-objective reward models, employing activation-based methods and targeted neuron ablations to study functional roles. Results show mixed-objective models underperform single-objective variants, indicating interference between objectives. Shared neurons between helpfulness and harmlessness exert disproportionate influence on model behavior, causally supporting their respective objectives while negatively impacting the opposing one. These findings elucidate challenges in multi-objective alignment and motivate future work on disentangled alignment methods.

reward models · rlhf · neuron ablation · alignment tension · activation-based methods

WHAR Arena: Benchmarking the State of the Art in Efficient Wearable Human Activity Recognition

arXiv cs.LG · Maximilian Burzer, Tobias King, Till Riedel, Michael Beigl · 2026-06-11

The WHAR Arena benchmark addresses comparability issues in Wearable Human Activity Recognition (WHAR) by standardizing 30 datasets, evaluation protocols, and model interfaces. Through 4760 training runs across 17 architectures, the study finds no single dominant model, with CNN-HAR achieving the highest mean macro-F1 but performance clustering near a ceiling. Compact models like TinierHAR and Random Forests excel in deployment efficiency, while larger recurrent models offer diminishing returns. The framework is released to promote transparent benchmarking and efficiency optimization.

wearable human activity recognition · macro-f1 · cross-subject evaluation · pareto frontier · tinierhar

The Geometry of Phase Transitions in Generative Dynamics via Projection Caustics

arXiv cs.LG · Ryosuke Sakamoto, Kotaro Sakamoto · 2026-06-11

The paper develops a geometric framework to explain phase-transition-like behavior in continuous-state generative samplers (e.g., diffusion and flow-matching models). By interpreting denoising as gradient descent on a free energy landscape, the authors identify projection caustics—where nearest-point projections onto data support become non-unique—as the origin of abrupt qualitative changes in trajectories. They introduce the Critical Boundary Detector (CBD) to diagnose score-direction instability, demonstrating its efficacy in localizing mode commitment, predicting intervention-sensitive windows, and enabling targeted control across toy models, diffusion models, and latent text-to-image diffusion models. The results establish a connection between data geometry and diffusion dynamics.

projection caustics · free energy landscape · critical boundary detector · mode commitment · score-direction instability

Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

arXiv cs.LG · Yujun Zhou, Kehan Guo, Haomin Zhuang, Xiangqi Wang · 2026-06-11

The paper introduces TRACE (Test-time Rule Acquisition and Compiled Enforcement), a skill-layer pipeline that compiles user corrections into runtime checks for coding agents to reduce repeated preference violations. TRACE mines chat corrections, rewrites them as atomic rules, and enforces them during execution. Evaluated on ClawArena and MemoryArena-derived tasks, TRACE reduces held-out preference violations from 100.0% to 37.6% (in-distribution) and 2.0% (out-of-distribution) on ClawArena, and from 100.0% to 60.5% on MemoryArena, outperforming memory baselines.

runtime enforcement · coding agents · preference compliance · atomic rules · in-distribution tasks

Detecting Explanatory Insufficiency in Learned Representations: A Framework for Representational Vigilance

arXiv cs.LG · Jacques Raynal, Pierre Slangen, Elsa Raynal, Jacques Margerit · 2026-06-11

The paper introduces VER (Vigilant Evaluator of Representations), a diagnostic framework for detecting explanatory insufficiency in learned representations. VER formalizes a five-step monitoring sequence: representation identification, explanatory-domain delimitation, residual-structure detection, explanatory-resistance evaluation, and vigilance signaling. The method distinguishes representational inadequacy from prediction error, uncertainty, noise, and distribution shift, complementing existing evaluation metrics without modifying learning algorithms. The authors outline a path toward empirical validation through representational-vigilance benchmarks.

learned representations · explanatory insufficiency · residual-structure detection · representation diagnostics · vigilance signaling

When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals

arXiv cs.LG · Aydin Javadov · 2026-06-11

The paper investigates whether exposing routing paths in Block Attention Residuals (Block AttnRes) architectures enables mechanistic interpretability. By comparing a vanilla Qwen3 model with deterministic routing to a trained Block AttnRes Qwen3 (both 0.6B parameters), the study reveals that trained models develop localized routing motifs (embedding-source, current-state, and older-history pathways) absent in the baseline. However, routing mass does not correlate with causal importance, as some high-mass paths show no causal role under intervention. The findings demonstrate that while architectural exposure of routing is necessary, causal probing remains essential for mechanistic interpretation.

block attention residuals · mechanistic interpretability · routing motifs · causal probing · qwen3

Robust State-Conditional Feature-Weighted Jump Models for Temporal Clustering

arXiv cs.LG · Federico P. Cortese, Alessio Farcomeni · 2026-06-11

A robust feature-weighted jump model is proposed for temporal clustering, incorporating a penalty for smooth transition regularization and Tukey's biweight loss for outlier robustness. The model introduces a parameter to control feature weight variability across states, enabling state-specific feature relevance. Simulation results demonstrate accurate recovery of true cluster sequences and reliable feature identification, outperforming competing methods, especially in outlier scenarios. Empirical validation includes applications to conflict-related homicide data in Kosovo (1998-2000) and macroeconomic performance across twelve European countries (1949-2024).

temporal clustering · feature weighting · tukey's biweight · state-specific relevance · outlier robustness
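
Tukey's biweight loss, named above as the source of outlier robustness, behaves like a quadratic for small residuals and is constant beyond a cutoff. The sketch shows the standard loss next to a generic jump-penalized objective; the tuning constant `c = 4.685`, the fixed per-jump cost, the toy series, and the function names are illustrative assumptions rather than the authors' exact formulation (feature weights are omitted entirely).

```python
def tukey_biweight(residual, c=4.685):
    """Standard Tukey biweight loss: bounded above by c**2 / 6."""
    if abs(residual) >= c:
        return c * c / 6.0
    u = 1.0 - (residual / c) ** 2
    return (c * c / 6.0) * (1.0 - u ** 3)


def jump_objective(series, states, centers, jump_penalty, c=4.685):
    """Robust fit term plus a fixed cost for every change of state between time steps."""
    fit = sum(tukey_biweight(y - centers[s], c) for y, s in zip(series, states))
    jumps = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return fit + jump_penalty * jumps


series = [0.1, -0.2, 0.0, 50.0, 5.1, 4.9]  # 50.0 is an outlier
states = [0, 0, 0, 0, 1, 1]                # one jump, between index 3 and 4
cost = jump_objective(series, states, centers=[0.0, 5.0], jump_penalty=2.0)
```

Because the loss saturates, the outlier at 50.0 contributes the same amount as any residual beyond `c`, so it cannot dominate the objective the way it would under squared error.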

An Extensible and Lightweight Unified Architecture for Demosaicing Pixel-bin Image Sensors

arXiv cs.LG · Saurabh Kumar, Nutan Sairam Yenneti · 2026-06-11

The paper proposes a modular unified architecture for demosaicing pixel-bin sensors, addressing challenges posed by inter-color separation and CFA-specific deep learning methods. The solution combines extensibility and lightweight design, featuring a learning-free CFA-identification module for plug-and-play operation. Results demonstrate improved image quality while reducing resource overhead and development complexity compared to existing approaches.

pixel-bin sensorsdemosaicingcolor filter arraylightweight architecturecfai module

Learning-Augmented Approximation for Unrelated-Machines Makespan Scheduling

arXiv cs.LG · Kaito Baba, Evripidis Bampis, Giorgos Mitropoulos · 2026-06-11

The paper presents a learning-augmented algorithm for makespan minimization on unrelated machines ($R\|C_{\max}$), addressing an open question from Antoniadis et al. (ICLR 2025). By leveraging predictions of heavy job assignments, the method achieves a polynomial-time $(1+\varepsilon)$-approximation for accurate predictions, with graceful degradation to a 2-approximation under increasing prediction error. Theoretical guarantees match known lower bounds, and empirical validation demonstrates practical efficacy. The work extends the learning-augmented framework beyond selection problems to scheduling.

learning-augmented algorithmsmakespan minimizationunrelated machinesapproximation algorithmsscheduling

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

arXiv cs.LG · Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu · 2026-06-11

The paper introduces SWITCH, a switchable latent reasoning framework that replaces continuous hidden-state recurrence with discrete boundary tokens for improved optimization and interpretability. The method uses explicit entry/exit tokens to enable standard on-policy reinforcement learning (GRPO) and facilitate mechanistic analysis through direct probing. Experiments show SWITCH outperforms prior hidden-state-recurrence approaches, with mechanistic analysis revealing three key findings: learned switching policies, problem-specific latent computation, and concentrated computation at entry transitions.

latent reasoningon-policy rlhidden-state recurrencemechanistic analysisswitch-grpo

Disparate Impact in Synthetic Data Generation

arXiv cs.LG · Paul Andrey, Michaël Perrot, Batiste Le Bars, Marc Tommasi · 2026-06-11

The paper introduces a fairness notion of disparate impact for synthetic data generation (SDG), focusing on utility parity across sensitive groups without altering the real data distribution. It diverges from prior approaches that correct biases by redefining SDG to match the real distribution. The authors analyze failure modes of SDG, including limitations in expressive power, sampling errors due to group proportions, and estimation errors from differential privacy mechanisms. Experiments on artificial and real-world data, using probabilistic graphical models, demonstrate disparate impact. A group-wise SDG modeling strategy is proposed, showing improved utility and parity across various settings.

disparate impactsynthetic data generationprobabilistic graphical modelsdifferential privacyutility parity

Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

arXiv cs.LG · Aryan Khurana, Aravind Ramana RN, Dhruv Kumar · 2026-06-11

The study introduces AuthorityBench, a 220,564-prompt multi-domain benchmark designed to isolate the influence of citation-based authority signals on epistemic behavior in large language models (LLMs). Using a 2x2 factorial design crossing claim veracity with citation veracity, the benchmark spans four domains (general knowledge, science, law, medicine) with controlled variations in prompt templates, venue prestige tiers, and author demographics. Evaluation of seven models reveals that citation presence, regardless of veracity, consistently increases hallucination rates, particularly for true claims with fabricated citations (3-22 percentage points increase, peaking at 35-77% in general knowledge). Legal claims show relative robustness, while venue prestige and author demographics have negligible impact.

authoritybenchepistemic behaviorhallucination ratesfactorial designcitation veracity

Scale Buys Interpolation, Structure Buys a Horizon: Certified Predictability for Equivariant World Models

arXiv cs.LG · Hongbo Wang · 2026-06-11

The paper introduces a computable, multi-step certificate for predictable horizons in equivariant latent world models, proving that $T$-step rollout error remains constant over symmetry orbits and is stratified by the predictor's Lyapunov spectrum ($T_j(ε)\sim\log(1/ε)/λ_j$). The method leverages equivariance to provide horizon guarantees, with empirical validation on 40-D Lorenz-96 showing a $\mathbb{Z}_N$-equivariant network achieves high fidelity ($R^2{=}0.98$) in recovering the full Lyapunov spectrum, outperforming dense and recurrent baselines. The certificate is shown to be structure-dependent, with applications in auditing pretrained models like TD-MPC2 and V-JEPA 2-AC, where calibration does not scale with parameters.

equivariant world modelslyapunov spectrumpredictable horizoncertified predictabilitymulti-step rollout
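
The horizon scaling $T_j(ε)\sim\log(1/ε)/λ_j$ quoted above can be made concrete with a small calculation. This is only an order-of-magnitude sketch: the function name is hypothetical, the proportionality constant is taken to be 1, and non-positive exponents are treated as giving an unbounded horizon, none of which is specified in the summary.

```python
import math


def predictable_horizon(eps: float, lyapunov_exponent: float) -> float:
    """Rough horizon T ~ log(1/eps) / lambda for one Lyapunov exponent.

    eps is the initial error relative to the tolerated error (0 < eps < 1).
    The constant factor is assumed to be 1 for illustration.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    if lyapunov_exponent <= 0.0:
        return math.inf  # non-expanding direction: no finite horizon from this bound
    return math.log(1.0 / eps) / lyapunov_exponent


# Faster-expanding directions (larger exponent) give shorter horizons.
horizons = [predictable_horizon(1e-3, lam) for lam in (0.5, 1.0, 2.0)]
```

Halving the exponent doubles the horizon, while shrinking the initial error only buys a logarithmic gain.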

$α$-fair heterogeneous agent reinforcement learning

arXiv cs.LG · Yao-hua Franck Xu, Tayeb Lemlouma, Jean-Marie Bonnin, Arnaud Braud · 2026-06-11

The authors propose a novel framework integrating $α$-fairness with Heterogeneous-Agent Trust Region Learning (HATRL) to address fairness and efficiency in multi-agent reinforcement learning. The method employs a fair advantage function that dynamically weights agent utilities based on expected returns, enabling a transition from utilitarian efficiency to $α$-fairness welfare. Two algorithms, $α$-fair HATRPO and $α$-fair HAPPO, are introduced and empirically validated in sequential social dilemmas (CleanUp, CommonHarvest), demonstrating superior utilitarian performance and higher social outcomes compared to baseline HATRL algorithms.

α-fairnessheterogeneous-agent trust region learningfair advantage functionsequential social dilemmasnash equilibria
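
For orientation, the standard $α$-fair welfare function that the summary refers to can be sketched as below. This is the textbook definition applied to positive per-agent returns; it is not the paper's fair advantage function, whose exact form the summary does not give, and the function name is hypothetical.

```python
import math


def alpha_fair_welfare(returns, alpha: float) -> float:
    """Standard alpha-fair welfare over positive per-agent returns.

    alpha = 0 gives the utilitarian sum, alpha = 1 gives proportional
    fairness (sum of logs), and larger alpha favours the worst-off agent.
    """
    if any(r <= 0 for r in returns):
        raise ValueError("returns must be strictly positive")
    if alpha == 1:
        return sum(math.log(r) for r in returns)
    return sum(r ** (1 - alpha) / (1 - alpha) for r in returns)


# Same total return, different spread across two agents.
equal_split = alpha_fair_welfare([2.0, 2.0], alpha=2.0)
unequal_split = alpha_fair_welfare([1.0, 3.0], alpha=2.0)
```

With `alpha=0` the two allocations score the same, whereas with `alpha=2.0` the equal split scores higher, which is the transition from utilitarian efficiency to fairness described above.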

Limits of spectral learning under noise

arXiv cs.LG · Sabin Roman, Ljupco Todorovski, Saso Dzeroski, Marta Sales-Pardo · 2026-06-11

The study investigates spectral learning's stability under noise in supervised regression, focusing on coefficient drift induced by additive label noise. Using sparse spectral representations across multiple bases and dimensions, the authors whiten empirical feature geometry to derive a closed-form expression for coefficient vector overlap. Results reveal a universal degradation curve governed by an intrinsic noise scale, with numerical experiments confirming theoretical predictions in Fourier, Legendre, Bessel, and Haar bases. The work identifies a fundamental noise threshold beyond which coefficient estimates become unstable, limiting functional structure recovery from noisy data.

spectral learningnoise thresholdcoefficient driftfunctional regressionbasis expansion

A green solvent screening tool for emerging materials via uncertainty aware, transformer enhanced transfer learning

arXiv cs.LG · Ioannis Kouroudis, Simon Ternes, Zhaosu Gu, Gohar Ali Siddiqui · 2026-06-11

The study presents a transformer-enhanced transfer learning pipeline for green solvent screening, addressing data scarcity in solubility prediction for emerging materials. Leveraging a pre-trained foundational model on QM9 targets, the method incorporates uncertainty quantification to assess prediction confidence. The system achieves high performance on limited-data targets (Gutmann Donor/Acceptor numbers) and baseline solubility parameters, expanding solubility descriptor data by orders of magnitude. A deployable tool enables high-throughput lab integration, successfully rediscovering known green solvents and proposing novel candidates.

transfer learninguncertainty quantificationsolubility predictionfoundational modelgreen solvents

A solvable model for unsupervised federated learning

arXiv cs.LG · Giovanni Catania, Aurélien Decelle, Gianluca Manzan, Beatriz Seoane · 2026-06-11

The paper presents a theoretical framework for unsupervised federated learning using a teacher-multiple students model, where each student receives distinct data realizations via noise corruption or subset sampling. Employing equilibrium disordered system analysis, the authors demonstrate that student interactions systematically improve learning: high-noise students require fewer samples for pattern recovery, while low-noise students achieve higher ground-truth signal overlap. They derive optimal Bayesian recovery conditions as functions of sample complexity, noise level, and interaction strength, validated numerically, and map the dynamics to equilibrium sampling in a Restricted Boltzmann Machine with structured hidden layers.

federated learninggenerative modelingrestricted boltzmann machinebayesian recoveryequilibrium sampling

Quality-Preserving Imperceptible Adversarial Attack on Skeleton-based Human Action Recognition

arXiv cs.LG · Ziyi Chang, Kanglei Zhou, Xiaohui Liang, Hubert P. H. Shum · 2026-06-11

The authors propose a quality-preserving adversarial attack method for skeleton-based human action recognition (S-HAR) that maintains motion naturalness while achieving high attack success rates. Their distribution-based approach minimizes the gap between empirical and true risks during optimization, avoiding noise-like perturbations that degrade motion quality. Experiments on state-of-the-art S-HAR models across two datasets show superior attack success and preserved motion quality, measured by a novel human-aligned perception metric. The results expose vulnerabilities in current S-HAR systems.

adversarial attackskeleton-based recognitionmotion qualitydistribution-based optimizationhuman action recognition

Deep Sleep Classification via EEG Signal Criticality: A Passive BCI Approach for Sleep-Improvement Neurofeedback

arXiv cs.LG · Stanisław Narębski, Tomasz Komendziński, Tomasz M. Rutkowski · 2026-06-11

This study introduces a passive Brain-Computer Interface (pBCI) approach for deep sleep (N3) classification using criticality features derived from Detrended Fluctuation Analysis (DFA) of EEG signals. The method analyzed 347,232 EEG epochs from 290 older women, employing UMAP manifold learning for state transition visualization and benchmarking six classifiers via 10-fold cross-validation. Naive Bayes achieved the highest mean balanced accuracy (87.17% ± 0.24%), significantly outperforming fully connected deep neural networks (81.58%) and Random Forest (80.97%). The results demonstrate that DFA-derived criticality features reside on a non-linear manifold, enabling robust probabilistic decoding for state-dependent neurofeedback applications.

detrended fluctuation analysispassive brain-computer interfacedeep sleep classificationumap manifold learningneurofeedback

Reliability of Probabilistic Emulation of Physical Systems

arXiv cs.LG · Sam F. Greenbury, Radka Jersakova, Paolo Conti, Marjan Famili · 2026-06-11

This study evaluates the reliability of probabilistic forecasts for physical systems by comparing generative models (diffusion, flow matching) and CRPS-trained ensembles under matched computational budgets. A framework assesses empirical coverage of predictive intervals, accuracy, and inference latency across diverse 2D spatiotemporal systems. CRPS-trained ensembles achieve more reliable uncertainties and faster inference, particularly in autoregressive rollouts, compared to latent-space-trained generative models. Generative models trained in ambient space match ensemble coverage but incur higher latency. The authors release AutoCast for modular implementation and AutoSim for dataset generation to support future research.

probabilistic forecastscrps-trained ensemblesgenerative modelsempirical coverageautoregressive rollouts
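
The CRPS used to train the ensembles above has a simple empirical estimator for a finite ensemble, sketched here for a scalar target. This is the standard energy-form estimator; whether the paper uses this exact variant (for example a bias-corrected one) is not stated in the summary.

```python
def ensemble_crps(members, observation: float) -> float:
    """Empirical CRPS of an ensemble for one scalar observation.

    CRPS = mean|x_i - y| - 0.5 * mean|x_i - x_j|, so it rewards members
    that are close to the observation and penalises collapsed spread less
    than a pure absolute-error score would.
    """
    m = len(members)
    if m == 0:
        raise ValueError("ensemble must contain at least one member")
    accuracy = sum(abs(x - observation) for x in members) / m
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return accuracy - spread


score = ensemble_crps([1.0, 3.0], observation=2.0)
```

For a single-member ensemble the spread term vanishes and the score reduces to the absolute error, which is why CRPS is often described as a probabilistic generalisation of MAE.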

DeepJEB++: Foundation Model-Driven Large-Scale 3D Engineering Dataset via 2D Latent Space Augmentation

arXiv cs.LG · Soyoung Yoo, Leekyo Jeong, Jinsu Ra, Dongeon Lee · 2026-06-11

DeepJEB++ introduces a foundation-model-driven framework for generating large-scale 3D engineering datasets from limited seed data. The method employs a three-stage pipeline: (1) 2D latent space augmentation via fine-tuned diffusion models and VLM filtering, (2) 3D mesh generation using domain-adapted foundation models, and (3) automated simulation labeling for mass, stress, and displacement. Starting with 400 seed designs, the approach produces 15,360 simulation-labeled 3D jet engine brackets (40x expansion) with validated manufacturability and label fidelity. The dataset supports reproducible engineering-AI research.

latent diffusionvision-language modelfinite-elementdata augmentationfoundation model

Exposure Bias as Epistemic Underidentification in Recursive Forecasting

arXiv cs.LG · Riku Green, Zahraa S. Abdallah, Telmo M Silva Filho · 2026-06-11

The paper recharacterizes exposure bias in recursive multi-step forecasting as an epistemic underidentification problem under partial observability, proving that even deterministic latent dynamics can lead to unidentified recursive predictors due to self-generated induced states. The authors formalize this with induced states (Z) and provenance variables (P), decomposing error into teacher-forcing/rollout mismatch, representation-class approximation, and provenance gaps. Empirical results demonstrate distinct induced-state regimes during rollout, with performance improvements from both local adaptation and altered induced-state visitation, while provenance-aware correction yields conditional gains.

exposure biasepistemic underidentificationrecursive forecastinginduced statesprovenance variables

EPM-JEPA: Operator-Side Experience Modulation in JEPA-Family World Models

arXiv cs.LG · Vedant Pandya · 2026-06-11

The paper introduces EPM-JEPA, an operator-side experience modulation method for JEPA-family world models that generates low-rank weight deltas via LoRA to adapt to distribution shifts, contrasting with operand-side injection (EI-JEPA). On Moving MNIST with gravity shift, EPM-JEPA (D_shift = 0.7848 ± 0.0078) shows no significant difference from EI-JEPA (0.8238) but improves 1.90% over a no-memory baseline, highlighting the specificity of weight-level modulation. Analysis reveals three dynamical processes influencing performance: buffer cycling, EMA target drift, and a LoRA settling transient (+0.021), motivating the proposed PEM-JEPA to address dynamical-peak limitations.

jepa-familyexperience modulationlow-rank adaptationdistribution shiftdynamical processes

Predicting Cognitive Load from Speech and Interaction Dynamics in Dyadic Conversations

arXiv cs.LG · Tahiya Chowdhury · 2026-06-11

The study demonstrates that conversational interaction dynamics enhance cognitive load prediction in natural dyadic conversations. Using audio from 53 dyads performing nine collaborative tasks, the authors extract static acoustic, dynamic, and interaction features to train a two-head Gated Recurrent Unit encoder for cognitive load score prediction. Results indicate temporal demand correlates with turn-taking dynamics (e.g., overlap, speaker switches), while mental demand associates with imbalanced participation, highlighting task structure's role in modeling cognitive load.

cognitive loaddyadic conversationsgated recurrent unitturn-taking dynamicsinteraction features

Circuit Synchronization Precedes Generalization: Causal Evidence from Fourier Structure in Grokking Transformers

arXiv cs.LG · Achyuthan Sivasankar · 2026-06-11

The study introduces the Frequency Synchronization Degree (FSD), a novel metric for detecting Fourier circuit synchronization in transformers prior to grokking, without requiring prior circuit knowledge. Using FSD across nine modular addition configurations (primes $p \in \{53, 71, 97, 113, 131\}$), the authors demonstrate that synchronization occurs 500-3,000 steps before grokking (mean lead +1,722 steps), outperforming existing predictors. Causal evidence shows the inter-phase gap acts as a regularization phenomenon, with grokking time $\Delta t$ following a $1/\lambda$ relationship ($R^2=1.00$ and $R^2=0.99$ in clean cases). Architecture ablations reveal FSD as a multi-block circuit property, with attention-only models grokking and MLP-only models failing to grok.

fourier circuitgrokkingfrequency synchronization degreemodular arithmeticweight decay

Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment

arXiv cs.LG · Xiang Li, Yixuan Zhou, Jingran Xie, Zhiyong Wu · 2026-06-11

The paper introduces self-guidance, a method to enhance neural speech codecs by aligning decoder feature manifolds when processing both quantized tokens and continuous embeddings, using a lightweight feature-mapping loss. This approach improves reconstruction fidelity without modifying the quantizer or increasing model capacity, requiring minimal training overhead and no inference changes. Applied to XCodec2, it achieves state-of-the-art low-bitrate performance, enables 4x codebook reduction without fidelity loss, and improves LLM-based TTS synthesis by simplifying token modeling. Statistical and visual evidence confirms enhanced manifold alignment, with experiments demonstrating generality across inductive biases.

neural codecsmanifold alignmentvq-vaesquantization errorfeature-mapping loss

Is Spurious Correlation Removal Always Learnable?

arXiv cs.LG · Yibo Zhou, Bo Li, Hai-Miao Hu, Hanzi Wang · 2026-06-11

The paper establishes a computational barrier for invariant learning despite statistical identifiability, showing that polynomial-time algorithms cannot recover one-dimensional invariant subspaces ($k=1$) under a black-box samplable sparse-recovery primitive. Using a separation parameter $γ$ to quantify environment diversity, it derives minimax risk bounds $Θ(k(d-k)/(n|\mathcal{E}|))$ under sufficient diversity and local Gaussian regularity, with a phase transition at $n^*∝k(d-k)/(|\mathcal{E}|γ^2)$. Experiments on synthetic and real data validate the theoretical gaps and transitions.

invariant learningsparse recoveryminimax riskenvironment diversityphase transition

Multi-Label Test-Time Adaptation with Bayesian Conditional Priors

arXiv cs.LG · Qiru Li, Ao Zhou, Zhiwei Jiang, Zifeng Cheng · 2026-06-11

The authors propose Bayesian Conditional Priors (BCP) Estimation, a gradient-free test-time adaptation method for multi-label recognition under distribution shift. BCP addresses the brittleness of frozen Vision-Language Models (VLMs) by injecting label dependency without tuning the backbone. It selects a high-confidence anchor label per test image, applies anchor-conditioned Bayesian refinement in logit space, and estimates priors online from unlabeled test streams via lightweight second-order co-occurrence statistics. Evaluated on standard multi-label benchmarks with multiple CLIP backbones, BCP improves RN50 average mAP from 57.31 to 69.22 and ViT-B/16 from 62.61 to 71.79, outperforming strong TTA baselines.

bayesian conditional priorstest-time adaptationvision-language modelsmulti-label recognitiondistribution shift

Where Computation Lives Inside TabPFN: Causal Localisation of Attention Head Function

arXiv cs.LG · Atharva Gupta, Dhruv Kumar, Murari Mandal, Saurabh Deshpande · 2026-06-11

The study provides the first causal mechanistic analysis of TabPFN 2.5, a tabular foundation model, focusing on feature-wise attention heads' computational distribution across layers. Using activation patching, ablation, and attention entropy on synthetic regression datasets, the authors identify temporal specialization: one head's causal necessity dominates others by 2-5x at peak layer, with its dominant layer shifting across task complexity, while remaining heads show symmetric late-layer profiles. Attention entropy and patching converge on computationally active layers of the dominant head. Contrastive activation steering fails to transfer across samples, attributed to TabPFN's in-context learning mechanism encoding task structure via context-dependent attention rather than stable parametric directions.

tabular foundation modelattention headsactivation patchingin-context learningcontrastive activation steering

Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration

arXiv cs.LG · Dongyue Wu, Zilin Guo, Xiaoyu Li, Jiajia Liu · 2026-06-11

The authors propose a unified graph-based dataset pruning framework that integrates intrinsic sample importance and extrinsic pairwise diversity into a single optimization objective, formulated as a Maximum Weight Clique Problem (MWCP). The method employs a greedy algorithm with provable approximation guarantees under mild conditions, offering practical design guidelines for importance metrics. Experiments demonstrate superior performance over existing pruning methods, achieving over 40% training time reduction on ImageNet-1k with ResNet-50 while maintaining accuracy.

dataset pruningmaximum weight clique problemgreedy algorithmapproximation guaranteetraining acceleration
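
To illustrate the general idea of greedy selection on a weighted graph, here is a generic heuristic for the maximum weight clique problem. It is not the paper's algorithm and carries none of its guarantees. The mapping to pruning (nodes as samples, weights as importance scores, edges joining sufficiently diverse pairs) is an assumption made for this sketch.

```python
def greedy_weighted_clique(weights, edges):
    """Greedy heuristic: scan nodes by descending weight and keep a node
    only if it is adjacent to every node already selected.

    weights: dict mapping node -> importance score (hypothetical input).
    edges:   iterable of (u, v) pairs; an edge means the two samples are
             diverse enough to be kept together (assumed semantics).
    """
    adjacency = {node: set() for node in weights}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    clique = []
    for node in sorted(weights, key=weights.get, reverse=True):
        if all(node in adjacency[member] for member in clique):
            clique.append(node)
    return clique


# Node 1 is important but redundant with node 0 (no edge), so it is skipped.
kept = greedy_weighted_clique(
    weights={0: 5.0, 1: 4.0, 2: 3.0, 3: 2.0},
    edges=[(0, 2), (0, 3), (2, 3), (1, 2)],
)
```

The result is always a valid clique, so every retained pair satisfies the diversity constraint, though the total weight may fall short of the optimum.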

LongSpike: Fractional Order Spiking State Space Models for Efficient Long Sequence Learning

arXiv cs.LG · Xinrui He, Qiyu Kang, Xuhao Li, Zheng-Jun Zha · 2026-06-11

LongSpike introduces fractional-order State-Space Modeling (f-SSM) into Spiking Neural Networks (SNNs) to address the memoryless bottleneck of first-order ODEs in long-sequence tasks. The framework leverages fractional calculus for hierarchical neuronal dynamics with long-memory kernels, while maintaining computational efficiency through a parallelizable state-space formulation. Evaluations on Long Range Arena (LRA), WikiText-103, and Speech Commands show superior accuracy over state-of-the-art SNNs with preserved sparse computation.

spiking neural networksfractional-order ssmlong-sequence learningstate-space modelingparallel training

Prediction-Powered Causal Inference by Automatic Debiased Machine Learning and Semi-Supervised Riesz Regression

arXiv cs.LG · Masahiro Kato · 2026-06-11

The study introduces Prediction-Powered Causal Inference (PPCI), a framework for semiparametric efficient estimation of causal and structural parameters in semi-supervised settings with unlabeled auxiliary regressors. By deriving the efficient influence function and efficiency bound, the authors demonstrate that incorporating auxiliary regressors reduces asymptotic variance compared to using labeled data alone. They propose DML-PPCI methods, including EE-DML-PPCI and TMLE-DML-PPCI, which achieve the derived efficiency bound. Key components involve estimating the efficient influence function, leveraging Neyman orthogonal scores, and developing semi-supervised generalized Riesz regression with convergence rate guarantees for Riesz representer estimation.

efficient influence functionneyman orthogonal scoreriesz representersemiparametric estimationasymptotic variance

Direct Preference Optimization for Chatbot Fine-Tuning: An Empirical Study

arXiv cs.LG · Yvonne Qiu, Dezhi Yu, ShuoJia Fu · 2026-06-11

The study empirically evaluates Direct Preference Optimization (DPO) for fine-tuning large language models as chatbots, demonstrating its advantages in simplifying the training pipeline and improving computational efficiency. The method employs reinforcement learning to optimize model performance, evaluated using BLEU, ROUGE, and cosine similarity metrics. Results show competitive performance and effective convergence, though training instability remains an area for further investigation.

direct preference optimizationreinforcement learninglarge language modelsfine-tuningcomputational efficiency
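
As background, the DPO objective in its usual formulation is a logistic loss on log-probability ratios between the trained policy and a frozen reference model. The sketch below computes it for a single preference pair. The argument names and `beta=0.1` are illustrative assumptions, and the study's actual training setup is not described in the summary.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio)).

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the policy and the reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss starts at log(2).
initial = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability mass toward the chosen response.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference, with no separately trained reward model involved.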

Multi-Bitwidth Quantization for LLMs Using Additive Codebooks

arXiv cs.LG · Liza Babaoglu, Shuangyi Chen, Ashish Khisti · 2026-06-11

The authors propose Drop-by-Drop, a multi-bitwidth post-training quantization framework enabling inference-time precision control over LLM weights from a single trained model. The method leverages information-theoretic principles and successive refinement, using additive codebooks with Matryoshka-style supervision to optimally reconstruct Gaussian-distributed weights under weighted MSE distortion. Evaluated on Qwen, LLaMA, Gemma, and Mistral architectures, the approach maintains competitive perplexity and accuracy while reducing storage overhead through shared checkpoints for multiple bitwidths.

post-training quantizationadditive codebookssuccessive refinementmulti-bitwidthllm weights

SMGFM: Spectral Multimodal Graph Pretraining for Multimodal-Attributed Graphs

arXiv cs.LG · Zhengyu Wu, Xu Wang, Hongchao Qin, Xunkai Li · 2026-06-11

SMGFM introduces a spectral multimodal graph pretraining framework for multimodal-attributed graphs (MAGs), addressing the challenge of disentangling structure-induced and modality-intrinsic semantics. The method leverages graph-frequency variation as a prior, decomposing modality-specific node signals into graph-frequency bands using scalable Chebyshev filters. It constructs frequency-resolved modality tokens, estimates coupling reliability through topology-conditioned routing, and performs band-modality interaction before fusion. This approach aligns smooth consensus routes while preserving modality-specific routes, mitigating spatial-domain entanglement. Extensive experiments on MAG datasets demonstrate SMGFM's state-of-the-art performance across graph-level and modality-level tasks.

multimodal-attributed graphsgraph-frequency variationchebyshev filterstopology-conditioned routingband-modality interaction

Multimodal Graph Negative Learning

arXiv cs.LG · Zhengyu Wu, Xu Wang, Hongchao Qin, Xunkai Li · 2026-06-11

The paper proposes GraphMNL, a graph-aware multimodal negative learning framework addressing node-level branch semantic imbalance in multimodal attributed graphs (MAGs). Unlike existing methods relying on cross-branch agreement, GraphMNL employs negative learning for cross-branch guidance, teaching inferior branches which classes a node is unlikely to belong to rather than forcing imitation. The framework includes a branch library, graph-aware reliability arbitration, unstable transfer gating, and target-preserving negative learning. Evaluations show GraphMNL achieves 72.47% accuracy on Grocery datasets and 76.60 F1 score on Reddit M datasets.

multimodal attributed graphsnegative learningbranch semantic imbalancegraph-aware reliability arbitrationtarget-preserving learning

A Privacy-Preserving Framework Using Remote Data Science for Inter-Institutional Student Retention Prediction

arXiv cs.LG · John Fields, K M Sajjadul Islam, Ruchitha Thota, Victor Chen · 2026-06-11

The study introduces a privacy-preserving machine learning framework using PySyft for inter-institutional student retention prediction, featuring a semi-air-gapped architecture with high-side and low-side servers. It employs remote data science (RDS) to enable collaborative model building without direct data access, validated across three universities. Three synthetic data generation methods were evaluated, including a novel Data-Type-Aware Templates approach prioritizing privacy over distributional fidelity. Results show consistent classification performance (Macro F1: 0.690-0.695) while ensuring FERPA compliance, demonstrating RDS-based PPML as a viable alternative to federated learning for small-scale collaborations.

privacy-preserving machine learningremote data sciencesemi-air-gapped architecturedata-type-aware templatesferpa compliance

Interpretable Factor Decomposition for Decision Intelligence in Large-Scale Financial Markets: Evidence from China's A-Share Market

arXiv cs.LG · Xiao Han, Yao Xiao, Zhen Zhang, Moxuan Zheng · 2026-06-11

The study introduces an interpretable machine learning pipeline for factor decomposition in equity return prediction, focusing on China's A-share market. Using XGBoost with TreeSHAP attribution on 3632 stocks (2009-2019), the method achieves a mean AUC of 0.547 and +2.38%/month long-short spread (Sharpe 2.23), persisting after Carhart four-factor adjustment (+2.31%/month). SHAP analysis reveals behavioral signals (58.2% attribution) dominate valuation ratios (10.7%), with ablation studies exposing feature substitutability patterns not visible through single-method analysis.

interpretable machine learningtreeshap attributioncross-sectional equityfactor decompositionablation analysis

CLARITree: Cholesky and Lookahead Accelerations for Regression with Interpretable Piecewise Linear Trees

arXiv cs.LG · Yixiao Wang, Hayden McTavish, Varun Babbar, Margo Seltzer · 2026-06-11

The paper introduces CLARITree, a novel algorithm for constructing near-optimal, sparse, piecewise linear regression trees. It combines lookahead search strategies with efficient rank-one Cholesky updates of the Gram matrix, addressing computational bottlenecks in dynamic programming approaches. Theoretical and empirical results demonstrate superior trade-offs between computational efficiency, predictive accuracy, and sparsity compared to state-of-the-art methods, while maintaining interpretability.

regression treescholesky updateslookahead searchpiecewise lineargram matrix
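
The rank-one Cholesky update named above is a textbook routine: given a lower-triangular factor `L` of a Gram matrix `A`, it produces the factor of `A + x xᵀ` in quadratic time instead of refactorising from scratch. The sketch below shows that generic update only. How CLARITree schedules such updates during its search is not described in the summary.

```python
import math


def cholesky_rank_one_update(L, x):
    """Return the lower-triangular factor of L @ L.T + outer(x, x).

    L is a list of rows (lower triangular, positive diagonal); the inputs
    are copied, so the caller's data is left untouched.
    """
    n = len(x)
    L = [row[:] for row in L]
    x = list(x)
    for k in range(n):
        r = math.hypot(L[k][k], x[k])
        c = r / L[k][k]
        s = x[k] / L[k][k]
        L[k][k] = r
        for i in range(k + 1, n):
            L[i][k] = (L[i][k] + s * x[i]) / c
            x[i] = c * x[i] - s * L[i][k]
    return L


# Factor of A = [[4, 2], [2, 3]], then fold in one new feature vector.
L0 = [[2.0, 0.0], [1.0, math.sqrt(2.0)]]
L1 = cholesky_rank_one_update(L0, [1.0, 1.0])
```

Here `L1` factors `[[5, 3], [3, 4]]`, the original Gram matrix plus the outer product of the new vector with itself.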

Graph Reinforcement Learning for Calibration-Aware Quantum Circuit Routing

arXiv cs.LG · Yash Vardhan Tomar, Dheeraj Peddireddy, Vaneet Aggarwal · 2026-06-11

A calibration-aware graph reinforcement learning approach improves quantum circuit routing fidelity by incorporating same-day IBM Heron r2 calibration data. The method employs proximal policy optimization to train a policy that selects hardware-edge SWAPs, evaluated using exact simulated fidelity across nine Munich Quantum Toolkit (MQT) Bench circuits and three calibration snapshots. Results show pooled mean exact fidelity of 0.727, outperforming SABRE-best20 (0.440) and target-aware SABRE (0.481). Fidelity gains are concentrated in 5q and 8q circuits, with higher routed two-qubit counts, while 10q circuits favor SABRE-best20 under the fixed tree action graph.

quantum circuit routingcalibration-awareproximal policy optimizationhardware-edge swapsexact simulated fidelity

Quantum Reservoir Computing for Short-Term Power Load Forecasting in Resource-Constrained Energy Systems

arXiv cs.LG · Mansi Od, Param Pathak, Nouhaila Innan, Muhammad Shafique · 2026-06-11

This work introduces a Quantum Reservoir Computing (QRC) framework for short-term power load forecasting, optimized for resource-constrained edge deployment. The method employs a fixed quantum reservoir for feature extraction and trains only a classical Elastic Net readout, later compressed via post-training fixed-point quantization (2-8 bits). Evaluated on Tetouan and Spain energy datasets under simulated noise (IBM FakeTorino/Marrakesh), results show 6-bit quantization preserves forecasting accuracy while reducing memory by 81.2%. Performance degradation below 6 bits is dataset-dependent, with Tetouan showing higher sensitivity. The framework demonstrates noise resilience without retraining.

quantum reservoir computingelastic netfixed-point quantizationhardware-noise modelsstatevector simulation
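
The post-training fixed-point quantization step can be illustrated with a simple symmetric uniform quantizer for the readout weights. This is a generic sketch rather than the paper's scheme: the per-tensor scale and the 32-bit floating-point baseline are assumptions. Under that baseline, storing 6 bits per weight saves 1 - 6/32 = 81.25%, which is consistent with the 81.2% figure reported above.

```python
def quantize_weights(weights, bits: int):
    """Symmetric uniform quantization to signed integer codes.

    Returns (codes, scale) such that code * scale approximates each weight.
    """
    levels = 2 ** (bits - 1) - 1          # e.g. 31 for 6 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale: float):
    """Map integer codes back to approximate real-valued weights."""
    return [code * scale for code in codes]


def memory_reduction(bits: int, baseline_bits: int = 32) -> float:
    """Fraction of weight storage saved relative to an assumed baseline width."""
    return 1.0 - bits / baseline_bits


readout = [0.5, -1.0, 0.25, 0.0]          # hypothetical readout weights
codes, scale = quantize_weights(readout, bits=6)
restored = dequantize(codes, scale)
saving = memory_reduction(6)
```

Each restored weight differs from the original by at most half a quantization step, so the error bound doubles with every bit removed.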

ProPlay: Procedural World Models for Self-Evolving LLM Agents

arXiv cs.LG · Yijun Ma, Zehong Wang, Yiyang Li, Ziming Li · 2026-06-11

ProPlay introduces procedural world models for self-evolving LLM agents, enabling procedure-level preplay to refine environment understanding through interaction. The method abstracts successful trajectories into procedures organized in a causal transition graph, with reliability records estimating task-specific contributions. Agents simulate future procedural paths as structured guidance and update the graph post-execution using environment feedback. Experiments on public benchmarks demonstrate consistent improvements in environment understanding and self-evolution over baselines.

procedural world models · self-evolving agents · procedure graph · reliability record · structured soft guidance

Detecting Functional Memorization in Code Language Models

arXiv cs.LG · Matthieu Meeus, Anil Ramakrishna, Matthew Grange, Zheng Xu · 2026-06-11

The paper introduces functional memorization in code-generating LLMs, demonstrating that models can reproduce functional logic without verbatim textual overlap. Using a counterfactual setup with OLMo-3-32B, the authors compare a midtrained model (exposed to target code) against a pretrained reference, evaluating both textual and functional similarity via LLM-as-a-judge and execution-based metrics. Results reveal clear functional memorization, underscoring the need for auditing metrics beyond textual overlap.

functional memorization · code language models · counterfactual setup · execution-based metrics · llm-as-a-judge
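
To make the distinction between textual and functional similarity concrete, here is a minimal sketch. It compares two hypothetical implementations of the same logic, once by character overlap (using the standard library's `difflib`) and once by agreement on test inputs. The snippets, helper names, and test inputs are invented for illustration and are not the paper's metrics or data.

```python
import difflib

# Two hypothetical snippets: different text, same behavior.
REFERENCE_SRC = """
def solve(values):
    total = 0
    for v in values:
        total += v * v
    return total
"""

CANDIDATE_SRC = """
def solve(xs):
    return sum(x ** 2 for x in xs)
"""


def load(src):
    """Execute a snippet and return the `solve` function it defines."""
    namespace = {}
    exec(src, namespace)
    return namespace["solve"]


def textual_similarity(a, b):
    """Character-level similarity ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def functional_agreement(f, g, test_inputs):
    """Fraction of test inputs on which both functions return the same value."""
    matches = sum(1 for x in test_inputs if f(x) == g(x))
    return matches / len(test_inputs)


inputs = [[], [1], [1, 2, 3], [-4, 5], [0, 0, 7]]
text_sim = textual_similarity(REFERENCE_SRC, CANDIDATE_SRC)
func_sim = functional_agreement(load(REFERENCE_SRC), load(CANDIDATE_SRC), inputs)

print(text_sim, func_sim)
```

Here the functional agreement is complete while the textual ratio is well below 1, which is the kind of gap an overlap-only audit would miss.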

Adaptive Weighted Averaging

arXiv cs.LG · Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, Manish Purohit · 2026-06-11

The paper introduces adaptive weighted averaging strategies for selecting the largest value among n unknown quantities, given unbiased estimates. The proposed methods are both admissible (not uniformly dominated) and guarantee performance no worse than baseline approaches like uniform random selection. Applied to stochastic optimization, these strategies yield online-to-batch conversion bounds with a 'no-compromise' property: they match or exceed random iterate selection while performing better in favorable scenarios.

adaptive weighted averaging · stochastic optimization · online-to-batch conversion · admissible strategies · unbiased estimates

Deep Unfolded Latent Optimally Partitioned-l2/l1 Networks for Data-driven Block-Sparse Recovery

arXiv cs.LG · Takanobu Furuhashi, Hidekata Hontani, Qibin Zhao, Tatsuya Yokota · 2026-06-10

The authors propose Deep Unfolded Latent Optimally Partitioned-l2/l1 (DU-LOP-l2/l1) networks for block-sparse recovery, addressing limitations of manual hyperparameter tuning and numerical instability in proximal operator differentiation. Two architectures are introduced: a stable framework using implicit differentiation and a flexible variant employing Deep Weight Factorization (DWF), which supports nonconvex smooth data fidelity terms. Experiments show DU-LOP-l2/l1 achieves competitive performance and high resilience against impulsive noise.

block-sparse recovery · deep unfolding · implicit differentiation · deep weight factorization · proximal operator

Physics-Informed Neural Networks and Radial Basis Functions for PDEs with Dirac Delta Sources

arXiv cs.LG · Manuel Reyna, Alexandre Tartakovsky · 2026-06-10

The work demonstrates that Radial Basis Function (RBF) Residual Least Squares (RLS) methods outperform Physics-Informed Neural Networks (PINNs) for solving PDEs with Dirac delta sources. By interpreting PINNs as RLS methods, the authors show that RBF-RLS directly handles Dirac delta terms through weak-form integration, avoiding approximation errors inherent in PINNs. Neural Tangent Kernel (NTK) theory explains RBF-RLS's superior convergence. Experiments on linear PDEs for groundwater flow and transport validate the approach on synthetic and real-world data, including inverse problems with noisy measurements.

physics-informed neural networks · radial basis functions · dirac delta sources · residual least squares · neural tangent kernel

Let's Ask Gauss: Improved One-Run Privacy Auditing

arXiv cs.LG · Adya Agrawal, Yu Wei, Jaspal Singh, Malik Magdon-Ismail · 2026-06-10

The paper introduces an improved one-run privacy auditing method for differentially private machine learning, specifically targeting DP-SGD. By analyzing canary-aligned signals as a sequence of random variables converging to a Gaussian distribution, the authors develop a framework that provides tighter privacy lower bounds from a single training run. This approach outperforms prior binary thresholding methods by leveraging distributional information, enhancing the practical assessment of privacy leakage in DP mechanisms.

privacy auditing · differential privacy · dp-sgd · gaussian convergence · one-run methods

Normative Robustness as a Frontier for Non-Verifiable Reasoning in LLMs

arXiv cs.LG · Elizaveta Tennant, Benjamin Henke, Anita Keshmirian, Murray Shanahan · 2026-06-10

The paper introduces normative robustness as a framework for evaluating non-verifiable reasoning in LLMs, focusing on moral reasoning as a paradigmatic case. The authors propose moral robustness, defined as consistent moral reasoning across contexts, and develop an adversarial multi-turn evaluation framework simulating 48,000 user-agent moral deliberations across four frontier LLMs. Results show models ignore irrelevant distractors but exhibit moral deliberative sycophancy, shifting reasoning by up to 6.5% toward user-stated views and varying judgments based on premise order (13-22% variance) and conversation duration (10-24% variance).

normative robustness · moral reasoning · non-verifiable reasoning · deliberative sycophancy · multi-turn evaluation

EquiDexFlow: Contact-Grounded SE(3)-Equivariant Dexterous Grasp Generative Flows

arXiv cs.LG · Clinton Enwerem, John S. Baras, Calin Belta · 2026-06-10

EquiDexFlow introduces an SE(3)-equivariant flow-matching model for dexterous grasp generation, jointly predicting wrist pose, joint angles, fingertip contacts, surface normals, and contact forces from object point clouds. The architecture ensures contact projection onto object surfaces and force alignment within the Coulomb friction cone, maintaining physical feasibility without loss penalties. Trained on 8,100 force-closure grasps across 81 objects for the 16-DoF Allegro Hand, the model achieves zero friction violations, the best composite score, and the lowest wrench residual among ablation variants. Hardware experiments demonstrate successful open-loop pick-and-hold trials on six test objects, including asymmetric objects at canonical and rotated poses.

se(3)-equivariant · flow-matching · coulomb friction cone · force-closure grasps · inverse kinematics
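
The Coulomb friction cone constraint mentioned in the summary can be sketched as follows. A contact force is feasible when its tangential magnitude does not exceed the friction coefficient times its normal component. The sign convention (normal pointing in the pushing direction), the function names, and the simple tangential clamp are assumptions for illustration; the digest does not describe how the model enforces the constraint.

```python
import math


def _decompose(force, normal):
    """Split a 3D force into its normal magnitude and tangential vector."""
    norm_n = math.sqrt(sum(n * n for n in normal))
    unit = [n / norm_n for n in normal]
    f_n = sum(f * u for f, u in zip(force, unit))
    tangential = [f - f_n * u for f, u in zip(force, unit)]
    f_t = math.sqrt(sum(t * t for t in tangential))
    return unit, f_n, tangential, f_t


def in_friction_cone(force, normal, mu, tol=1e-12):
    """True if the force pushes along the normal and satisfies |f_t| <= mu * f_n."""
    _, f_n, _, f_t = _decompose(force, normal)
    return f_n >= 0.0 and f_t <= mu * f_n + tol


def clamp_to_cone(force, normal, mu):
    """Shrink the tangential part until the force is feasible (not a Euclidean projection)."""
    unit, f_n, tangential, f_t = _decompose(force, normal)
    if f_n <= 0.0:
        return [0.0, 0.0, 0.0]  # pulling forces are infeasible; fall back to zero force
    if f_t <= mu * f_n:
        return list(force)
    shrink = mu * f_n / f_t
    return [f_n * u + shrink * t for u, t in zip(unit, tangential)]


normal = [0.0, 0.0, 1.0]
print(in_friction_cone([0.0, 0.0, 1.0], normal, mu=0.5))
print(in_friction_cone([1.0, 0.0, 1.0], normal, mu=0.5))
print(clamp_to_cone([1.0, 0.0, 1.0], normal, mu=0.5))
```

Clamping keeps the normal component and only reduces sliding force, so the result sits on the cone boundary when the input was outside it.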

Out-of-Distribution (OOD) Detectors for Open-Set RF Fingerprinting

arXiv cs.LG · Sudeepta Mondal, Ganesh Sundaramoorthi · 2026-06-10

The paper introduces out-of-distribution (OOD) detection methods for open-set radio-frequency (RF) fingerprinting, addressing the challenge of distribution shift from unknown transmitters and temporal drift. The authors present a unified information-theoretic framework for analyzing and developing OOD detectors, eliminating the need for impractical auxiliary OOD data collection. Evaluated on the POWDER RF fingerprinting dataset, their OOD detectors achieve comparable performance to baselines with true OOD tuning data and significantly outperform methods without such access, demonstrating practical viability for RF environments.

out-of-distribution detection · rf fingerprinting · information theory · distribution shift · open-set recognition

A Stabilized Path-Space Approach to Diffusion-Based Posterior Sampling

arXiv cs.LG · Evan Scope Crafts, Umberto Villa, Saviz Mowlavi, Yanting Ma · 2026-06-10

The authors propose a stabilized path-space framework for diffusion-based posterior sampling, addressing limitations of heuristic guidance approximations in nonlinear and multimodal settings. Their method formulates posterior sampling as a stochastic optimal control problem, matching a likelihood-weighted target measure on trajectories via time reparameterization and trust-region optimization with log-variance objectives. The framework provides theoretical connections to existing guidance-based samplers, quantifies sampling errors, and enables importance sampling corrections. Evaluations on benchmark inverse problems demonstrate improved accuracy and robustness over state-of-the-art methods, with principled assessment of sampling accuracy and uncertainty quantification.

diffusion models · posterior sampling · stochastic optimal control · uncertainty quantification · path-space optimization

Smarter Saboteurs, Better Fixers: Scaling & Security in Linear Multi-Agent Workflows

arXiv cs.LG · Timothy McAllister, Sina Abdidizaji, Ivan Garibay, Ozlem Ozmen Garibay · 2026-06-10

The paper demonstrates that model scale affects adversarial resilience in linear multi-agent systems (MAS). It reports a compliance-correction symmetry: larger models (up to 27B parameters) show a 53.7pp performance drop when executing malicious instructions, but recover statistical parity with control performance when augmented with a lightweight terminal Fixer stage. Experiments across open-weight model families on the HumanEval benchmark show that linear workflows can be resilient if corrected, challenging prior assumptions about topology brittleness. The results indicate that scaling exacerbates malicious compliance, while correction mechanisms effectively mitigate the risk.

multi-agent systems · model scaling · adversarial resilience · compliance-correction symmetry · linear workflows

A unified complexity bound for logconcave sampling

arXiv cs.LG · Yunbum Kook, Santosh S. Vempala · 2026-06-10

The paper presents a unified, nearly tight complexity bound for sampling logconcave distributions using the In-and-Out algorithm with exponential lifting. The key innovation is an improved bound on the Poincaré constant of the lifted distribution, enabling tighter convergence rates. The results apply to both constrained settings (e.g., Gaussians restricted to convex bodies) and well-conditioned settings (e.g., strongly logconcave and smooth densities), achieving near-optimal performance in both cases.

logconcave sampling · in-and-out algorithm · poincaré constant · exponential lifting · convergence rate

Forecasting Is Not Attribution: Localizing Decoder Bypass in Graph-Based Neural Marketing Mix Models

arXiv cs.LG · Yunbo Wang, Bolbi Liu · 2026-06-10

The paper introduces DICE-MMM, a diagnostic framework addressing attribution bypass in graph-based neural marketing mix models (MMMs), where decoders achieve low forecasting error without proper attribution to marketing channels. The method uses a two-stage training process: it first trains a graph encoder with a restricted decoder, then freezes the encoder and trains a graph-safe latent decoder. Experiments demonstrate that forecasting accuracy (MSE@7 ~0.004) does not guarantee attribution (AR-CIG nAUPRC ~0), with DICE improving graph recovery over CausalMMM and identifying graph-support selection as the key bottleneck.

attribution bypass · graph-based mmm · decoder training · counterfactual sensitivity · graph recovery

How Useful is Causal Invariance for Domain Adaptation in Finite-Sample Settings?

arXiv cs.LG · Julia Kostin, Kasra Jalaldoust, Elias Bareinboim, Samory Kpotufe · 2026-06-10

The paper investigates the utility of causal invariance for supervised domain adaptation (sDA) in finite-sample settings, focusing on linear regression. It derives matching upper and lower bounds demonstrating that finite-sample gains depend on target-risk margins between candidate predictors and finite-source estimation error. When margins are sufficiently large relative to target sample size, adaptive aggregation achieves optimal performance without negative transfer; otherwise, no algorithm reliably exploits causal knowledge. Theoretical insights are validated on real-world causal benchmarks, connecting margins to structural shift magnitude in linear SCMs.

causal invariance · supervised domain adaptation · finite-sample settings · linear regression · structural shift

Fed-FBD: Federated Functional Block Diversification for Isolation, Privacy, and Surgical Unlearning

arXiv cs.LG · Weijie Chen, Alan B. McMillan · 2026-06-10

Fed-FBD introduces a modular federated learning architecture that decomposes ResNet backbones into six functional blocks with color variants, enabling block-level isolation, privacy-by-design, and surgical unlearning. The method maintains a warehouse of N variants assembled from independently tracked blocks, providing architectural guarantees against adversarial contamination and membership inference while supporting sub-second unlearning. Evaluations on MedMNIST-2D, PathMNIST, and CIFAR-10 show a 0.3%-3.1% IID accuracy trade-off versus FedAvg, with adversarial attacks confined to the attacker's blocks (±0.01 AUC drift on clean variants).

federated learning · functional block diversification · surgical unlearning · resnet backbone · membership inference

Physics-Informed Neural Networks for Chemotherapy Pharmacokinetics: Benchmarking the Clinical Estimator and Exposing Parameter Identifiability

arXiv cs.LG · Riya Bisht, Dhruv Agarwal · 2026-06-10

The study benchmarks Physics-Informed Neural Networks (PINNs) against nonlinear least-squares (NLS) and data-only MLPs in chemotherapy pharmacokinetics, where tissue concentration is unobserved. PINNs match NLS accuracy on linear two-compartment models while jointly estimating tissue curves, outperforming MLPs by 10x. For Michaelis-Menten kinetics, PINNs expose non-identifiability from plasma data alone, a flaw masked by NLS's misspecified biexponential ansatz. Sparse tissue measurements improve identifiability, with PINNs recovering parameters within 1% accuracy (k21) and one standard deviation (Vmax, Km), demonstrating a unified framework for heterogeneous data integration and structural insight.

pharmacokinetics · identifiability · biexponential · michaelis-menten · compartmental

Computationally tractable robust differentially private mean estimation

arXiv cs.LG · Kelly Ramsay · 2026-06-10

The authors propose the balloon mean, a differentially private mean estimator offering computational tractability and robustness to outliers. The method employs an iterative clipping procedure over expanding Mahalanobis balls (balloons), satisfying zero-concentrated differential privacy with interpretable tuning parameters. Theoretical guarantees under heavy-tailed and contaminated elliptical models demonstrate robustness, while simulations show superior performance over existing private estimators in contaminated settings.

differentially private · mean estimation · mahalanobis balls · zero-concentrated differential privacy · iterative clipping

Physics-Aware Auxiliary Losses Improve Out-of-Distribution Generalization of a GNN Synthesizability Filter

arXiv cs.LG · Riya Bisht, Dhruv Agarwal · 2026-06-10

The study demonstrates that physics-aware auxiliary losses enhance out-of-distribution (OOD) generalization in a GNN-based synthesizability filter. Using a GINE backbone, the authors incorporate two auxiliary losses: topological complexity regression (supervised by Bertz index) and strain-energy soft penalty (supervised by MMFF94 force-field energy). Evaluated on a 65,177-molecule corpus with OOD testing on COCONUT natural products, all three physics-aware variants show statistically significant OOD AUC improvements (best Δ=+0.0066) over the baseline (AUC 0.9774), while remaining indistinguishable in-distribution. Multi-seed validation revealed methodological pitfalls in single-seed evaluations.

graph neural network · out-of-distribution generalization · auxiliary losses · synthesizability filter · force-field energy

Epistemic Uncertainty Is Not the Reducible Kind

arXiv cs.LG · Robin Young · 2026-06-10

The article demonstrates an extensional inconsistency between the standard taxonomy and measure of epistemic uncertainty, proving that epistemic uncertainty is not inherently reducible by additional data. Through explicit construction and theoretical analysis, it introduces a refined trichotomy of uncertainty: aleatoric, sample-reducible epistemic, and mechanism-reducible epistemic. An exact identity reveals that in-distribution data generically increases mechanism-irreducible uncertainty. Ensemble disagreement, commonly used to estimate epistemic uncertainty, is shown to track training procedures rather than true epistemic terms, collapsing under consistent training or equating to initialization noise. Finite-sample falsification tests and seed-swept experiments validate these findings.

epistemic uncertainty · ensemble disagreement · aleatoric uncertainty · mutual-information · interpolation
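
For readers unfamiliar with the ensemble-disagreement estimate discussed above, the sketch below shows the commonly used mutual-information decomposition: total predictive entropy minus the average per-member entropy. This is the standard measure that the summary says the paper critiques, not the paper's refined trichotomy. The ensemble outputs are invented for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decompose(member_probs):
    """Return (total, aleatoric, epistemic) for one input, given each member's class probabilities."""
    m = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(p[j] for p in member_probs) / m for j in range(k)]
    total = entropy(mean)                                    # entropy of the averaged prediction
    aleatoric = sum(entropy(p) for p in member_probs) / m    # average entropy of each member
    return total, aleatoric, total - aleatoric               # the difference is the disagreement term


# Hypothetical two-member ensembles on a binary task.
agreeing = [[0.5, 0.5], [0.5, 0.5]]
disagreeing = [[1.0, 0.0], [0.0, 1.0]]

print(decompose(agreeing))
print(decompose(disagreeing))
```

When every member produces the same distribution, the disagreement term is zero regardless of how uncertain that shared prediction is, which illustrates why the term can collapse under consistent training.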

TEDD: Robust Detection of Unstable Temporal Features

arXiv cs.LG · Ricardo Ribeiro Pereira, Bruno Casal Laraña, Nádia Soares, Miguel Araújo · 2026-06-10

TEDD introduces a robust technique for detecting unstable temporal features in datasets, addressing performance degradation in ML models due to distribution shifts. The method employs a regression model to identify features predictive of instance timestamps, enabling detection of univariate and multivariate drifts across numerical and categorical features. Evaluations on synthetic and real data demonstrate TEDD's capability to detect all basic change patterns without parameter tuning, while providing comparable change measurements and scaling efficiently with feature and instance counts.

temporal feature drift · distribution shift · multivariate drift detection · regression-based detection · model robustness
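
The core idea in the summary, that a feature is suspect when it helps predict an instance's timestamp, can be sketched with a univariate linear fit. The snippet computes the R² of regressing timestamps on a single feature; a high value suggests the feature drifts over time. The linear model, the function name, and the toy data are stand-ins chosen for illustration, and the digest does not specify which regression model TEDD uses.

```python
def r_squared(feature, timestamps):
    """R^2 of a least-squares line predicting timestamps from one feature."""
    n = len(feature)
    mean_x = sum(feature) / n
    mean_t = sum(timestamps) / n
    sxx = sum((x - mean_x) ** 2 for x in feature)
    stt = sum((t - mean_t) ** 2 for t in timestamps)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(feature, timestamps))
    if sxx == 0 or stt == 0:
        return 0.0  # a constant feature carries no information about time
    return (sxt * sxt) / (sxx * stt)


# Hypothetical data: ten instances observed at consecutive time steps.
timestamps = list(range(10))
stable_feature = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
drifting_feature = [1.0, 1.2, 1.1, 1.5, 1.6, 1.9, 2.1, 2.0, 2.4, 2.6]

print(r_squared(stable_feature, timestamps))
print(r_squared(drifting_feature, timestamps))
```

A univariate linear fit only catches monotone trends in one feature; detecting the multivariate and categorical drifts mentioned in the summary would need a more flexible regressor.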

Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning

arXiv cs.LG · Qingyun Guo, Junyi Shi, Jianuo Huang, Tianyu Shi · 2026-06-10

The paper introduces a safe offline multi-agent reinforcement learning (MARL) algorithm combining diffusion models with individual control barrier functions (CBFs). The method embeds neural CBFs into the diffusion process to ensure safety during trajectory generation, with policies recovered via inverse dynamics. Evaluations across multiple benchmarks show significant safety improvements while maintaining competitive reward performance compared to existing approaches.

offline reinforcement learning · diffusion models · control barrier functions · multi-agent systems · inverse dynamics

The Metric Picks the Winner: Evaluation Choice Flips Model Rankings for Drug-Response Prediction in Unseen Chemistry

arXiv cs.LG · Dhruv Agarwal, Riya Bisht · 2026-06-10

The study investigates drug-response prediction in unseen chemistry, demonstrating that model rankings invert based on evaluation metrics. Using THP-1 cell data from the VCPI contest (14,026 training compounds), a staged approach combines baselines, non-parametric retrieval, and a fusion model with chemistry embeddings. Under a per-gene proxy metric, linear regression on Morgan fingerprints outperforms deep models and ChemBERTa. However, under the contest's active-set metric, deep models and the fusion decoder significantly surpass the baseline (wMSE -0.012, p < 10^-4). The findings highlight metric-dependent performance reversals, validated via a reproducible pipeline.

drug-response prediction · scaffold split · weighted mse · morgan fingerprints · chemberta

ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation

arXiv cs.LG · Jiangtao Kong, Peijun Zhao, Chun-Fu Chen, Youngwook Do · 2026-06-10

The paper introduces Efficient Continual Alignment (ECA), a novel exemplar-free incremental learning approach for Open-ended Image-to-Text Generation (OpenITG). ECA addresses continual alignment in evolving environments by adapting pre-trained VLMs through three mechanisms: Mixture of Query (MoQ) for task-specific query tokens, Fisher Dynamic Expansion (FeDEx) for structure expansion based on Fisher Information Matrix, and Dictionary Replay (DR) for knowledge retention. Evaluated on four new IL OpenITG benchmarks, ECA significantly mitigates catastrophic forgetting and outperforms baseline methods in preserving cross-modal representations.

incremental learning · image-to-text generation · fisher information matrix · exemplar-free · catastrophic forgetting

Estimating Individualized Treatment Effects in Acute Ischemic Stroke with Causal Transformation Models (TRAM-DAG): A Multi-Centre Observational Study with External RCT Validation

arXiv cs.LG · Oliver Dürr, Lisa Herzog, Pascal Bühler, Susanne Wegener · 2026-06-10

The study introduces causal transformation models on directed acyclic graphs (TRAM-DAG) for estimating individualized treatment effects (ITE) in acute ischemic stroke, aiming to identify patients benefiting most from mechanical thrombectomy over lysis. TRAM-DAG was trained on observational MAGIC multi-center stroke patient data, specifically a sub-population with NIHSS at admission ≥6, and validated using the MR CLEAN RCT population. Results show that TRAM-DAG's ITE estimates align with the trial's average treatment effect and correctly rank patients by observed good outcomes (mRS ≤2 at three months). This supports TRAM-DAG's utility in personalized stroke care decision-making.

individualized treatment effect · causal transformation models · directed acyclic graphs · mechanical thrombectomy · modified rankin scale

Towards Provably Fair Machine Learning: Bayesian Approaches For Consistent and Transparent Predictions

arXiv cs.LG · Owen O'Neill, Fintan Costello · 2026-06-10

The paper introduces the Fair Bayesian classifier, which enforces determinism and statistical consistency across all subgroups to address inconsistent predictions in ML classifiers. By requiring predictions to align with Bayesian optimal target distributions and abstaining when consistency is unachievable, the method eliminates consistency errors. Evaluated on Adult, COMPAS, and Bank Marketing datasets, it outperforms baselines in accuracy and multicalibration while maintaining zero consistency error. The approach highlights the importance of Bayesian consistency for algorithmic fairness, particularly in small subgroups where frequentist inference fails.

bayesian consistency · subgroup fairness · deterministic predictions · statistical consistency · multicalibration

Evaluation of AutoML Frameworks for IDS under Imbalanced Data Conditions of the NSL-KDD Dataset

arXiv cs.LG · Wiliane Carolina Silva, Evandro César Vilas Boas, Felipe A. P. de Figueiredo · 2026-06-10

The study establishes a standardized benchmark for evaluating AutoML frameworks in multiclass intrusion detection under severe class imbalance using the NSL-KDD dataset, preserving all five original classes including rare attacks (R2L, U2R). Nine open-source AutoML frameworks were systematically compared, analyzing architectural design, ensemble strategies, hyperparameter optimization, and imbalance-handling mechanisms. Results show PyCaret achieved the highest macro-F1 (66%), followed by AutoGluon (55%), with ensemble learning and imbalance-aware optimization proving critical for minority-class discrimination, while accuracy-oriented frameworks exhibited significant performance degradation on rare attack categories.

automl · class imbalance · intrusion detection · nsl-kdd · macro-f1
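
A short worked example may help explain why the study reports macro-F1 rather than accuracy. Macro-F1 averages the per-class F1 scores with equal weight, so a classifier that ignores a rare class is penalized heavily even when its accuracy looks high. The toy labels below are invented for illustration and are not drawn from NSL-KDD.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)


# Hypothetical imbalanced sample: 98 normal connections, 2 rare attacks.
y_true = ["normal"] * 98 + ["U2R"] * 2
y_pred = ["normal"] * 100  # a classifier that never predicts the rare class

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

The always-normal classifier scores 0.98 accuracy but a macro-F1 just under 0.5, because the rare class contributes an F1 of zero.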

The Mathematics of AI Winters: The mathematical Taxonomy of Paradigm Fragility in AI Winter

arXiv cs.LG · Miquel Noguer i Alonso, David Pacheco Aznar · 2026-06-10

The article proposes a mathematical taxonomy of paradigm fragility in AI winters, arguing that formal barriers—not just engineering failures—contributed to reduced funding and confidence. It synthesizes key mathematical bottlenecks from early AI, including perceptron impossibility results, complexity-theoretic hardness of neural-network training, minimax rates for nonparametric estimation, vanishing-gradient analyses, and classical statistical learning theory. These barriers are shown to align with central disappointments of the first and second AI winters. The analysis further connects these limitations to subsequent breakthroughs that mitigated, but did not eliminate, these challenges.

perceptron impossibility · complexity-theoretic hardness · minimax rates · vanishing-gradient · statistical learning theory

Viral Proteins Reveal Geometry of Protein Language Models

arXiv cs.LG · Arthur Bigot, Harmon Bhasin, Core Francisco Park, Eugene Shakhnovich · 2026-06-10

The study characterizes the geometric structure of protein language model (pLM) embeddings using viral proteins as a case study. Analyzing ESM model families, the authors identify a dominant nativeness axis in embedding space that orders sequences from cellular proteins to viral proteins to shuffled sequences, aligned with masked reconstruction perplexity. Scaling contracts this axis unevenly across viral families. Despite this, pLM embeddings retain viral-specific signal, enabling linear separability beyond zero-shot perplexity and shallow sequence features. Results indicate pLM representations encode both a general nativeness metric and group-specific biological information.

protein language models · nativeness axis · masked reconstruction perplexity · viral proteins · embedding space

Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

arXiv cs.LG · Shuxian Fan, Seonwoo Min, Youna Hu, Botao Xia · 2026-06-10

The Shopping Reasoning Bench introduces a novel expert-authored benchmark for evaluating conversational shopping assistants, addressing the unique challenges of multi-turn reasoning, domain expertise, and criterion-level quality. The benchmark comprises 525 missions (232 single-turn, 293 multi-turn) with 10,863 importance-weighted binary rubrics, organized into five reasoning categories and fifteen subcategories. Evaluation of nine models (GPT, Claude, Gemini) reveals pass rates of 57-77%, with multi-turn performance declining by 4-18 points as conversations progress and a 13-29 point gap on optional criteria, highlighting limitations in expert-level advice.

conversational shopping assistants · multi-turn reasoning · domain expertise · binary rubrics · preference refinement

Feature-preserving Latent-EnKF for Data Assimilation of Flows with Shocks

arXiv cs.LG · Hemanth Chandravamsi, Hangchuan Hu, Ponkrshnan Thiagarajan, Tamer A. Zaki · 2026-06-10

A feature-preserving latent-EnKF is introduced for sequential data assimilation of flows with shocks, addressing the EnKF's failure due to multimodal ensemble statistics violating Gaussian assumptions. The method performs ensemble updates in a learned low-dimensional latent space where shock and flow features form a smooth manifold, preserving sharp features during analysis. A shared decoder maps the updated latent state back to the physical state, eliminating member-specific ordered training and positivity flooring. Numerical experiments on Sod shock tube and Mach 2 shock interaction with a 2D cylinder demonstrate accurate recovery of shocks and contact discontinuities without spurious oscillations.

ensemble kalman filter · latent space · data assimilation · shock recovery · multimodal statistics

Crossing the Validation Crisis: Cross-Validation Reduces Benchmarking Variance Surprisingly Well

arXiv cs.LG · Célestin Eve, Gaël Varoquaux, Thomas Moreau · 2026-06-10

The study demonstrates that cross-validation significantly enhances benchmarking reliability by reducing performance estimation variance, addressing the validation crisis in machine learning evaluation. It introduces sample gain to quantify virtual data augmentation from multiple cross-validation splits, showing empirically on synthetic, histopathologic, and NLP datasets that multiple splits improve estimate stability with delayed diminishing returns. A dynamic early-stopping procedure is proposed to optimize computational cost. Results indicate cross-validation's underutilized potential for robust benchmarking.

cross-validation · benchmarking variance · sample gain · performance estimation · early-stopping

Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers

arXiv cs.LG · MohammadHossein Rezaei, Anas Mahmoud, Zihao Wang, Utkarsh Tyagi · 2026-06-10

The paper proposes Rubric-Guided Self-Distillation (RGSD), a verifier-free training method for open-ended domains where rubrics replace single ground-truth answers. RGSD conditions the base policy on rubrics to serve as a teacher, distilling its token-by-token distribution into an unconditioned student, eliminating LLM verifiers and sparse trajectory-level rewards. Evaluated on Qwen-2.5 (3B, 7B) and Qwen3-Thinking (4B, 8B) models in medical and science domains, RGSD achieves rubric satisfaction comparable to judge-based GRPO with one on-policy rollout per prompt and no training-time verifier calls. Ablations indicate raw rubrics outperform self-generated references as teacher enrichment, while stronger GRPO judges can surpass RGSD in some settings.

rubric-guided self-distillation · token-by-token distillation · open-ended domains · rubric satisfaction · on-policy rollout

Boosting Direct Preference Optimization with Penalization

arXiv cs.LG · Pengwei Sun · 2026-06-10

The paper proposes Direct Preference Optimization with Penalization (DPOP), an extension of DPO that incorporates reference-model responses into offline preference optimization. DPOP augments the base preference loss with a gated penalty on reference-greedy responses, activated only when the policy ranks rejected responses above preferred ones. Evaluated on AlpacaEval 2.0 with Llama-3-8b-it and Gemma-2-9b-it, DPOP achieves relative win rate improvements of 5.3% and 4.4% over DPO, SimPO, and AlphaDPO baselines. Ablations demonstrate the superiority of SimNPO-style length-normalized penalties over NPO and token-level unlikelihood.

direct preference optimization · offline learning · reference model · length normalization · human feedback
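
As background for the base loss that DPOP extends, the sketch below computes the standard DPO loss for one preference pair from sequence log-probabilities. The gate function is only one possible reading of "activated only when the policy ranks rejected responses above preferred ones" and is labeled hypothetical; the digest does not give the form of the paper's penalty, so none is implemented here.

```python
import math


def dpo_margin(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Implicit-reward margin: beta * (chosen log-ratio minus rejected log-ratio)."""
    return beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))


def dpo_loss(margin):
    """-log(sigmoid(margin)), written as a numerically stable softplus(-margin)."""
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def penalty_gate_open(margin):
    # Hypothetical gate (an assumption, not the paper's definition):
    # open only when the rejected response is ranked above the chosen one.
    return margin < 0.0


# Hypothetical summed token log-probabilities for one (chosen, rejected) pair.
margin = dpo_margin(policy_chosen=-1.0, policy_rejected=-2.0,
                    ref_chosen=-1.5, ref_rejected=-1.5, beta=1.0)

print(margin, dpo_loss(margin), penalty_gate_open(margin))
```

With a positive margin the loss falls below log 2 and the hypothetical gate stays closed, so a gated penalty would add nothing for pairs the policy already ranks correctly.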

Dolph2Vec: Self-Supervised Representations of Dolphin Vocalizations

arXiv cs.LG · Chiara Semenzin, Faadil Mustun, Roberto Dessi, Pierre Orhan · 2026-06-10

The authors introduce Dolph2Vec, a self-supervised learning (SSL) model for dolphin vocalizations, trained on a novel dataset of five years of longitudinal recordings from five dolphins. Adapting Wav2Vec2.0, the model is optimized for fine-grained analysis of dolphin communication, unlike general-purpose SSL models. Dolph2Vec outperforms baselines in signature whistle classification and whistle detection, with learned embeddings revealing interpretable acoustic units aligned with whistle categories and sub-whistle structures.

self-supervised learning · bioacoustics · wav2vec2.0 · signature whistle classification · acoustic units

Policy-driven Conformal Prediction for Trustworthy QoT Estimation

arXiv cs.LG · Kiarash Rezaei, Omran Ayoub, Paolo Monti, Carlos Natalino · 2026-06-10

Conformal QoT introduces a policy-driven framework integrating statistically guaranteed Quality of Transmission (QoT) estimation with operational decision policies, enhancing lightpath-feasibility predictions under domain shift. The method leverages conformal prediction to ensure reliability, achieving a significant accuracy improvement from 92% to 99.6% on open datasets.

conformal prediction · quality of transmission · lightpath-feasibility · domain shift · operational decision policies
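
The statistical guarantee behind conformal prediction can be sketched with generic split conformal regression: a quantile of held-out absolute residuals gives an interval half-width with finite-sample coverage under exchangeability. The calibration residuals, the dB units, the feasibility threshold, and the conservative decision rule below are all assumptions for illustration, not the paper's policy or data.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample (1 - alpha) quantile of calibration scores for split conformal prediction."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def prediction_interval(point_prediction, half_width):
    """Symmetric interval around a point prediction."""
    return point_prediction - half_width, point_prediction + half_width


def is_feasible(lower_bound, threshold):
    # Hypothetical conservative policy: accept a lightpath only if the
    # lower end of the interval clears the required QoT threshold.
    return lower_bound >= threshold


# Hypothetical absolute residuals (in dB) from a held-out calibration set.
calibration_scores = [0.2, 0.5, 0.1, 0.9, 0.4, 0.3, 0.7, 0.6, 0.8, 1.0,
                      1.1, 1.3, 1.2, 1.5, 1.4, 1.6, 1.8, 1.7, 1.9]

q = conformal_quantile(calibration_scores, alpha=0.1)
lower, upper = prediction_interval(14.0, q)

print(q, (lower, upper), is_feasible(lower, threshold=12.0))
```

With 19 calibration scores and alpha = 0.1, the rule selects the 18th smallest residual, so the interval is wider than a naive 90th-percentile estimate would give.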

📰 Industry Media

No new items today.


Generated automatically at 2026-06-12 21:36 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.