Daily Digest — 2026-06-03

Tuesday, June 02, 2026 · 343 items · model: deepseek/deepseek-chat

343 items · 6 research labs, 335 arXiv papers, 2 industry media

⚠️ Source issues today:
  • MarkTechPost: all feed URLs failed (last tried: https://www.marktechpost.com/feed/)
  • AI News: all feed URLs failed (last tried: https://artificialintelligence-news.com/feed/)

🏛️ Research Labs (6)

Travelers deploys AI-powered claims countrywide with OpenAI

OpenAI News · 2026-06-02

Travelers deployed an AI-powered claims assistant using OpenAI's Realtime API and frontier models to automate first notice of loss for auto property damage claims. The system employs natural conversation to guide customers through policy queries, data collection, and claim submission, achieving 85-90% completion rates. Integrated with enterprise claims infrastructure, the solution handles 1.5M annual claims with zero wait times during catastrophe events, freeing human agents for complex cases.

realtime api · frontier models · first notice of loss · enterprise orchestration · autonomous voice solution

Codex for every role, tool, and workflow

OpenAI News · 2026-06-02

OpenAI introduces role-specific plugins, interactive sites, and annotations to expand Codex's utility beyond software development, targeting non-technical roles such as analysts, marketers, and researchers. The plugins integrate Codex with 62 apps and 110 skills, enabling tasks like data analytics, creative production, and investment analysis without coding. Sites allow teams to create and share interactive dashboards and tools via URL, while annotations refine content directly within documents, spreadsheets, and slides. Early adoption includes NVIDIA, Zapier, and OpenAI internal teams. Plugins and sites are rolling out for Business and Enterprise customers, with plans for an open plugin ecosystem.

codex · plugins · annotations · sites · workflows

Advancing youth safety and opportunity through global leadership

OpenAI News · 2026-06-02

OpenAI advocates for global youth AI safety through the establishment of an international AI Safety Institute, emphasizing age-appropriate safeguards, parental controls, and AI literacy. The initiative builds on collaborations with educators, researchers, and governments, including Estonia’s national ChatGPT rollout in schools. OpenAI proposes principles such as privacy-preserving age estimation, annual youth safety risk assessments, and independent audits to ensure accountability. These measures are integrated into ChatGPT’s design, featuring enhanced protections for minors and proactive parental notifications. The G7 Leaders’ Summit serves as a platform for advancing these standards, with OpenAI engaging in practical discussions to implement concrete safeguards and promote safe AI usage among youth.

ai safety institute · age estimation · parental controls · ai literacy · independent audits

Codex is becoming a productivity tool for everyone

OpenAI News · 2026-06-02

OpenAI reports Codex's evolution from a coding tool to a general productivity assistant, now serving over 5 million weekly active users with 6x growth since February. The system demonstrates increasing adoption by knowledge workers (20% of users), particularly for data analysis, research, and document generation tasks. Analysis reveals users leverage parallel task execution for workflow automation and artifact creation, suggesting potential impacts on role scope and career velocity. The tool primarily reduces friction in information retrieval, cross-team coordination, and deliverable production across industries.

codex · knowledge workers · workflow automation · parallel task execution · productivity assistant

Our views on AI policy and political advocacy

OpenAI News · 2026-06-01

OpenAI articulates its stance on AI governance, advocating for multi-stakeholder involvement in shaping AI policy while distancing itself from corporate political funding mechanisms. The company explicitly states it does not operate or fund Political Action Committees (PACs), nor does it endorse external political groups, emphasizing transparency in its policy positions. OpenAI supports regulatory frameworks emphasizing safety standards, public accountability, and equitable access to AI benefits, distinct from employee-led political activities.

ai governance · political action committees · safety standards · public accountability · regulatory frameworks

Holo3.1: Fast & Local Computer Use Agents

Hugging Face Blog · 2026-06-02

Holo3.1 introduces a family of computer-use agents optimized for cross-environment robustness and local inference, based on the Qwen architecture. The system addresses distribution shifts across web/desktop/mobile environments and agent frameworks through quantized checkpoints (FP8, Q4 GGUF, NVFP4) and native function-calling support. The 35B-A3B model achieves 79.3% accuracy on AndroidWorld (12.3pp gain over Holo3), while NVFP4 quantization delivers 1.74× throughput over BF16 on DGX Spark. Smaller 0.8B-9B models enable cost-effective deployment, with Q4 GGUF supporting consumer hardware.

quantized checkpoints · computer-use agents · distribution shift · function-calling protocols · w4a16 configuration

📜 arXiv Papers (335)

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

arXiv cs.AI · Seojeong Park, Jiho Choi, Junyong Kang, Seonho Lee · 2026-06-01

The paper introduces a method to mitigate Perceptual Judgment Bias in multimodal large language models (MLLMs) used as automated evaluators, where models prioritize plausible narratives over perceptually correct answers. The authors propose the Perceptually Perturbed Judgment Dataset, featuring minimally edited counterfactual responses to isolate perceptual errors, and develop a training framework combining GRPO-based reward modeling with a batch-ranking objective. Experiments demonstrate significant improvements in perceptual fidelity, ranking coherence, and human evaluation alignment across multiple MLLM-as-a-Judge benchmarks.

perceptual judgment bias · multimodal llm · reward modeling · visual perturbation · batch-ranking objective

AdaCodec: A Predictive Visual Code for Video MLLMs

arXiv cs.AI · Haowen Hou, Zhen Huang, Zheming Liang, Qingyi Si · 2026-06-01

AdaCodec introduces a predictive visual code for video MLLMs that reduces temporal redundancy by encoding inter-frame changes as compact P-tokens instead of full RGB frames. The method dynamically allocates full visual tokens only when scene changes are unpredictable, otherwise transmitting motion and residual data. Evaluated across eleven benchmarks, AdaCodec outperforms the Qwen3-VL-8B baseline at matched token budgets, achieving superior performance on long-video tasks with 1/7 the tokens and reducing time-to-first-token from 9.26s to 1.62s on general-video benchmarks.

predictive visual code · video mllms · inter-frame changes · p-tokens · temporal redundancy

ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents

arXiv cs.AI · Yuxing Lu, Yushuhong Lin, Wenqi Shi, J. Ben Tamo · 2026-06-01

The authors introduce ClinEnv, an interactive benchmark for evaluating LLMs as attending physicians through Longitudinal Inpatient Simulation, addressing limitations of static medical benchmarks. ClinEnv structures real inpatient admissions into multi-stage decision processes, requiring models to query specialized agents before committing to irreversible actions, with scoring based on both decision quality and information-gathering processes. Results show top-performing models achieve only 0.31 decision F1, with significant gaps between diagnosis (0.51 F1) and management (0.17 F1) performance, revealing process-quality deficiencies invisible to outcome-only metrics.

longitudinal inpatient simulation · interactive benchmark · clinical decision-making · information-acquisition gap · ontology-grounded matching

Permissive Safety Through Trusted Inference: Verifiable Belief-Space Neural Safety Filters for Assured Interactive Robotics

arXiv cs.AI · Haimin Hu · 2026-06-01

The paper proposes a method to certify high-probability safety for belief-space safety filters (BeliefSF) in interactive robotics using conformal prediction, explicitly accounting for runtime inference reliability. By focusing verification on regions where inference is reliable, the method maintains the simplicity of standard conformal prediction while certifying less conservative safety filters. Experiments on a human-vehicle interaction benchmark demonstrate that the approach verifies significantly more permissive filters compared to standard conformal prediction baselines.

belief-space safety filter · conformal prediction · interactive robotics · runtime inference · safety certification
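
For readers unfamiliar with the standard conformal prediction baseline mentioned in this summary, the following is a minimal sketch of the usual split-conformal threshold. It is not the paper's BeliefSF certification procedure; the function name `conformal_threshold` and the calibration scores are hypothetical.

```python
import math

def conformal_threshold(scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest calibration score.

    With exchangeable data, a fresh score falls at or below this threshold
    with probability at least 1 - alpha.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

# Toy calibration scores (hypothetical values, not from the paper).
calibration_scores = [0.2, 0.9, 0.4, 0.1, 0.7, 0.3, 0.8, 0.5, 0.6, 1.0]
threshold = conformal_threshold(calibration_scores, alpha=0.2)
```

With ten calibration scores and alpha = 0.2, the threshold is the ninth-smallest score. When the calibration set is too small for the requested alpha, the function returns infinity, which corresponds to a filter that can certify nothing.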

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

arXiv cs.AI · Elia Cunegatti, Marcus Vukojevic, Erik Nielsen, Giovanni Iacca · 2026-06-01

The paper introduces SubFit, a post-training compression method for Large Language Models (LLMs) that operates at submodule granularity rather than full-layer replacement. It non-contiguously selects Attention and FeedForward submodules, fitting each with lightweight residual bypasses, based on observed non-uniform redundancy distribution. Evaluated across ten LLMs (five base, five instruction-tuned) at sparsity levels from 12.5% to 37.5%, SubFit achieves superior perplexity-accuracy trade-offs, retaining 84.6% downstream accuracy at 25% sparsity (vs. 81.6% for baselines) with 2.42x perplexity degradation and measurable inference speedups.

llm compression · submodule granularity · fitted residual replacement · non-contiguous selection · kv-cache savings

Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation

arXiv cs.AI · Siyuan Bian, Congrong Xu, Jun Gao · 2026-06-01

The paper introduces MDA (Mixture-Density representation for Ambiguity), a novel depth estimation framework addressing flying-point artifacts near object boundaries. By modeling depth as a mixture distribution per pixel, MDA captures multiple depth hypotheses (e.g., foreground/background surfaces) instead of forcing a single prediction. This approach eliminates spurious 3D points in empty space while handling transparent objects (multi-layer predictions) and sky regions (unbounded depth separation). Evaluated across multiple backbones, MDA improves boundary reconstruction accuracy and robustness to input blur with negligible runtime overhead (+0.3ms).

depth estimation · mixture-density · flying-point artifacts · boundary reconstruction · transparent objects
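
A small worked example may help show why a single averaged prediction produces flying points while a per-pixel mixture does not. This is an illustrative sketch only: the `(weight, depth)` representation, the function names, and the "pick the heaviest component" decoding rule are assumptions, not the paper's actual formulation.

```python
def mixture_mean(components):
    """Expected depth of a per-pixel mixture given (weight, depth) pairs."""
    return sum(w * d for w, d in components)

def dominant_depth(components):
    """Depth of the highest-weight component (one possible decoding rule)."""
    return max(components, key=lambda c: c[0])[1]

# Hypothetical boundary pixel: 55% foreground at 2 m, 45% background at 10 m.
pixel = [(0.55, 2.0), (0.45, 10.0)]
averaged = mixture_mean(pixel)    # lands between the two surfaces
selected = dominant_depth(pixel)  # stays on the foreground surface
```

The averaged value sits at 5.6 m, in empty space between the two surfaces, which is exactly the kind of spurious 3D point the summary describes. Keeping the hypotheses separate lets the prediction stay on a real surface.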

SimSD: Simple Speculative Decoding in Diffusion Language Models

arXiv cs.AI · Junxia Cui, Haotian Ye, Runchu Tian, Hongcan Guo · 2026-06-01

The paper introduces SimSD, a training-free speculative decoding algorithm for diffusion language models (dLLMs) that addresses their incompatibility with token-level speculative verification. The method employs a plug-and-play masking strategy to introduce reference tokens from draft-model predictions and designs an attention mask regulating their interaction, enabling dLLMs to verify multiple drafted tokens in a single forward pass. Experiments on SDAR-family dLLMs across four benchmarks demonstrate up to 7.46x higher decoding throughput while maintaining or improving generation quality.

speculative decoding · diffusion language models · masking strategy · parallel decoding · kv cache

Tracking the Behavioral Trajectories of Adapting Agents

arXiv cs.AI · Jonah Leshin, Manish Shah, Ian Timmis · 2026-06-01

The paper introduces a methodology for quantifying behavioral traits in adaptive agents through text file edits, defining traits as directions in a text embedding space. A linear model is trained on labeled skill file diffs to learn trait vectors, enabling trait scoring via projection of embedding diffs. Evaluated on 68 labeled pairs for sensitive data-seeking propensity, the method achieves 91.2% sign classification accuracy and Spearman ρ=0.82 under leave-one-out cross-validation. The approach is integrated into an agent-to-agent protocol for trusted evaluation of skill updates.

behavioral traits · text embedding · skill file diffs · linear model · agent-to-agent protocol
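
The projection step described in this summary can be sketched in a few lines. The sketch assumes a trait direction has already been learned; the function name `trait_score`, the three-dimensional toy embeddings, and the trait vector are all hypothetical, and the paper's linear model training is not shown.

```python
import math

def trait_score(before, after, trait_vector):
    """Project the embedding diff (after - before) onto a unit-normalised trait direction."""
    diff = [a - b for a, b in zip(after, before)]
    norm = math.sqrt(sum(v * v for v in trait_vector))
    return sum(d * v for d, v in zip(diff, trait_vector)) / norm

# Toy 3-d embeddings; real text embeddings have far more dimensions.
trait = [0.0, 3.0, 4.0]        # hypothetical learned trait direction
old_skill = [0.1, 0.2, 0.3]    # embedding of the skill file before the edit
new_skill = [0.1, 0.5, 0.7]    # embedding after the edit
score = trait_score(old_skill, new_skill, trait)
```

A positive score means the edit moved the file along the trait direction, and a negative score means it moved away. That sign is the quantity the summary's sign classification accuracy refers to.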

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

arXiv cs.AI · Hao Li, Jingkun An, Zijun Song, Pengyu Zhu · 2026-06-01

SafeSteer introduces localized on-policy distillation for efficient LLM safety alignment, addressing the alignment tax by confining modifications to safety tokens. The method constructs a safety teacher via activation steering, selects safety tokens algorithmically, and applies a reverse KL penalty only to these tokens during training. Experiments demonstrate superior safety-capability trade-offs: SafeSteer achieves strong performance on seven safety benchmarks with minimal degradation on five general benchmarks, using only 100 harmful samples (1% of baseline requirements).

safety alignment · on-policy distillation · activation steering · reverse kl penalty · alignment tax
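
To make the idea of a penalty restricted to safety tokens concrete, here is a minimal sketch of a masked reverse KL term over toy two-word vocabularies. It assumes the safety-token mask is already given; how SafeSteer selects tokens and builds its teacher is not reproduced, and all names and numbers are hypothetical.

```python
import math

def reverse_kl(student, teacher):
    """KL(student || teacher) for two discrete distributions over the same vocabulary."""
    return sum(p * math.log(p / q) for p, q in zip(student, teacher) if p > 0)

def masked_reverse_kl(student_dists, teacher_dists, safety_mask):
    """Average the reverse KL over positions flagged as safety tokens only."""
    terms = [
        reverse_kl(s, t)
        for s, t, m in zip(student_dists, teacher_dists, safety_mask)
        if m
    ]
    return sum(terms) / len(terms) if terms else 0.0

# Per-position next-token distributions (toy values).
student = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
teacher = [[0.5, 0.5], [0.1, 0.9], [0.25, 0.75]]
mask = [0, 0, 1]  # hypothetical selection: only the last position is a safety token
penalty = masked_reverse_kl(student, teacher, mask)
```

The second position disagrees strongly with the teacher but contributes nothing because it is not masked in. This is the localization the summary credits with limiting the alignment tax.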

Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition

arXiv cs.AI · Shuo Zhang, Chenqi Li, Tingting Zhu · 2026-06-01

The paper proposes Self-Adaptive Monotonic Normalization (SAMN), a hyperparameter-free method for long-tailed recognition that enforces monotonicity on per-class weight norms using the Pool Adjacent Violators Algorithm. Building on the two-stage decoupling paradigm, SAMN avoids parameter regularization while providing theoretical justification through a class-conditional distribution perspective. Experiments on benchmark datasets show SAMN achieves state-of-the-art performance and integrates effectively with existing methods.

long-tailed recognition · adaptive norm rescaling · monotonic normalization · pool adjacent violators algorithm · class-conditional distribution
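
The Pool Adjacent Violators Algorithm named in this summary is a standard routine for monotone least-squares fitting, sketched below in its non-decreasing form. How SAMN orders classes, which direction it enforces, and how it rescales the classifier afterwards are not stated in the summary, so the input ordering and values here are assumptions.

```python
def pava(values):
    """Pool Adjacent Violators: least-squares non-decreasing fit to `values`."""
    blocks = []  # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge while the previous block's mean exceeds the current block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

# Hypothetical per-class weight norms, sorted by class frequency (assumed ordering).
norms = [1.0, 3.0, 2.0, 4.0, 3.5]
monotone_norms = pava(norms)
```

Each out-of-order pair is replaced by its pooled mean, so the output is the closest monotone sequence in the least-squares sense. A non-increasing fit can be obtained by reversing the input and the output. The routine has no tunable parameters, which is consistent with the hyperparameter-free framing.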

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

arXiv cs.AI · Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang · 2026-06-01

The paper introduces Moment-Video, a benchmark for evaluating video multimodal large language models (MLLMs) on momentary visual event understanding, focusing on brief, answer-critical events that current models often miss. The benchmark comprises 1,000 video-QA pairs across 7 domains and 25 subcategories, testing four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. Evaluation of 33 MLLMs reveals significant performance gaps, with the top model (Seed-2.0-Pro) achieving only 39.6% accuracy, while most open-source models score below 25%, indicating deficiencies in temporal fidelity and evidence preservation.

momentary visual events · video mllms · temporal fidelity · frame sampling · benchmark evaluation

Bridging the Last Mile of Time Series Forecasting with LLM Agents

arXiv cs.AI · Yuhua Liao, Zetian Wang, Qiangqiang Nie, Zhenhua Zhang · 2026-06-01

The paper introduces a large language model (LLM)-agent framework for last-mile time series forecasting, addressing the gap between statistical predictions and business-ready forecasts. The system integrates a forecasting backbone with tools for contextual evidence retrieval, converting reasoning trajectories into explicit revisions under structural constraints. It supports long-horizon forecasting via map-reduce decomposition and post-hoc reflection through a memory bank, ensuring controllability and auditability. Real-world case studies demonstrate the framework's effectiveness in bridging statistical prediction and practical forecasting needs.

time series forecasting · llm agents · last-mile forecasting · contextual evidence · map-reduce decomposition

Monitoring Agentic Systems Before They're Reliable

arXiv cs.AI · Marisa Ferrara Boston, Glen Hanson, Effi Georgala, JD Hudgens · 2026-06-01

The authors propose a monitoring methodology for agentic systems in early production stages, focusing on structural defects rather than task-level errors. The approach decomposes evaluation into three dimensions (quality, suitability, efficiency) across three scopes (within-run, cross-run, structural), using variance as a characterization signal and FMEA-inspired severity classification. Evaluation on a synthetic testbed (220 runs, 120 document bundles) shows that monitor scope determines failure type: within-run monitors detect deterministic stage defects (CV = 0.02), cross-run monitors identify stochastic integration consequences (CV = 1.25), and structural monitors reveal integration gaps (CV = 0.00). Deterministic triage routes 97% of findings to automated tracking, leaving 2% for human investigation. A maturity-staging model is proposed, transitioning from structural characterization to error detection as integration defects resolve.

agentic systems · variance characterization · fmea severity · structural defects · deterministic triage
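
The CV figures quoted in this summary are coefficients of variation. A minimal sketch of that statistic follows. Whether the authors use the population or sample standard deviation is not stated, so the population form is an assumption, and the run metrics are made up for illustration.

```python
import statistics

def coefficient_of_variation(samples):
    """Population standard deviation divided by the mean (undefined for a zero mean)."""
    mean = statistics.fmean(samples)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(samples) / mean

# Hypothetical per-run defect counts for two monitors.
stable_metric = [4.0, 4.0, 4.0, 4.0]    # identical across runs
unstable_metric = [0.0, 0.0, 0.0, 8.0]  # rare, large deviations
```

A CV near zero indicates a finding that recurs identically on every run, which fits the summary's deterministic defects. A CV above one indicates spread larger than the mean, which fits its stochastic cross-run findings.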

RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering

arXiv cs.AI · Yuyang Li, Zihe Yan, Tobias Käfer · 2026-06-01

RASER introduces recoverability-aware routers for multi-hop QA that reduce retrieval cost without extra LLM calls. RASER-2 selects between one-shot RAG and PRUNE, while RASER-3 adds iterative retrieval (IRCoT) as a third option; both route on six features from one-shot RAG and explicit cost-accuracy trade-offs. Evaluated across six LLMs and three benchmarks, RASER-2/3 achieve competitive F1 while reducing token usage to 41-49% of always-prune baselines and outperforming iterative/decomposition methods in efficiency.

multi-hop qa · recoverability-aware routing · one-shot rag · iterative retrieval · token efficiency

Iteris: Agentic Research Loops for Computational Mathematics

arXiv cs.AI · Leheng Chen, Zihao Liu, Wanyi He, Bin Dong · 2026-06-01

The paper introduces Iteris, an agentic AI system for tackling open problems in computational mathematics through numerical experimentation, adversarial constructions, and proof drafting. Iteris combines large language models with specialized mathematical workflows to address two problems from the Simons Workshop collection: analyzing conjugate gradient vs. randomized coordinate descent on power-law spectra, and demonstrating QR factorization's failure in selecting well-conditioned submatrices under low coherence. Human-verified results include a phase diagram and a counterexample, showing Iteris' potential in research workflows while underscoring the need for expert validation.

agentic ai · computational mathematics · qr factorization · conjugate gradient · adversarial constructions

Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools

arXiv cs.AI · Bardia Mohammadi, Lars Klein, Akhil Arora, Laurent Bindschaedler · 2026-06-01

The paper introduces Speculative Tool Privacy Contracts, a runtime abstraction addressing privacy leaks in tool-augmented language agents caused by speculative tool calls. These 'ghost tool calls' expose user intent to external services before branch commitment, persisting even if the branch is abandoned. The proposed method treats pre-commitment observation as a distinct effect, implementing issue-time policies to modify or suppress speculative calls. Evaluation across three corpora shows that only argument/destination projection changes at dispatch time effectively reduce intent inference, outperforming post-hoc filters and access-control approaches.

speculative tool calls · privacy contracts · runtime abstraction · intent inference · branch commitment

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

arXiv cs.AI · Wenhao Wang, Peizhi Niu, Gongyi Zou, Xiyuan Yang · 2026-06-01

The paper introduces MCP-Persona, the first benchmark for evaluating LLM agents on personalized applications using the Model Context Protocol (MCP). It simulates real-world environments including social media (Reddit, Xiaohongshu) and enterprise tools (Lark, Slack), addressing the gap in existing benchmarks that focus on generic information-seeking tasks. Experiments with state-of-the-art agents reveal significant performance struggles in personalized tool use, demonstrating the benchmark's utility for identifying limitations in current systems.

model context protocol · llm agents · personalized applications · environment simulation · benchmark

Learning When to Translate for Multilingual Reasoning

arXiv cs.AI · Deokhyung Kang, Hyounghun Kim, Gary Geunbae Lee · 2026-06-01

The paper introduces Luar, a reinforcement learning framework that enables reasoning language models (RLMs) to selectively invoke translation for multilingual inputs when direct understanding is unreliable. Luar trains RLMs to choose between solving the original input or its English translation, optimizing for performance by avoiding unnecessary translations. Evaluations on multilingual benchmarks show Luar outperforms GRPO and other baselines, particularly benefiting low-resource languages, while maintaining efficiency by minimizing translation calls when direct reasoning suffices.

multilingual reasoning · reinforcement learning · language understanding · translation invocation · low-resource languages

MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence

arXiv cs.AI · Hilton Raj, Vishnuram AV · 2026-06-01

MASER introduces a modality-adaptive specialist routing framework for embodied 3D spatial intelligence, addressing the limitation of existing Vision-Language Models (VLMs) fine-tuned on single modalities. The method trains five modality adapters on a shared VLM backbone and employs a neural routing policy that selects the optimal adapter based on question semantics, encoded via a frozen sentence transformer and a small MLP. Evaluated on the Open3D-VQA benchmark, MASER achieves 51.3% oracle agreement in adapter selection, outperforming a Random-Forest ablation (43.5%), with point-cloud responses being optimal in 51.5% of cases.

modality-adaptive · specialist routing · vision-language models · neural routing · oracle agreement

AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

arXiv cs.AI · Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao · 2026-06-01

The paper introduces AGENTCL, a framework for rigorous evaluation of continual learning in language agents through controlled task streams and transfer gain metrics. It contrasts compositional streams (designed for reusable sub-solutions) with naive streams, using MemProbe to diagnose memory design efficacy by storing reliable interactions while filtering noise. Empirical results across coding, research, and reasoning tasks show controlled streams better distinguish memory plasticity, whereas naive streams yield limited gains or degradation, underscoring the need for balanced memory designs.

continual learning · language agents · task streams · memory plasticity · transfer gains

Beyond One-shot: AI Agents for Learning in Field Experiments

arXiv cs.AI · Junjie Luo, Ritu Agarwal, Gordon Gao · 2026-06-01

The paper introduces a tool-augmented agentic AI system for learning from experimental data to generate improved interventions in sequential field experiments. The method employs analytical tools, DIKW reasoning agents, and evidence chains to autonomously extract principles from prior A/B test data (693,139 patient visits) and design new message variants. In healthcare prescription messaging, AI-generated interventions achieved a 69.8% click-through rate (+6.5pp over baseline), outperforming human+chatbot co-designed variants and demonstrating that domain-specific experimental data—not general LLM reasoning—drives performance. The work also reveals limitations of general behavioral theories in specific contexts.

agentic ai · field experiments · dikw reasoning · intervention design · a/b testing

Initialization is Half the Battle: Generating Diverse Images from a Guidance Potential Posterior

arXiv cs.AI · Xiang Li, Dianbo Liu, Kenji Kawaguchi · 2026-06-01

The paper proposes Diversity-inducing Initialization (DivIn), a method to mitigate mode collapse in generative models by sampling initial noise from a guidance potential posterior. DivIn employs Langevin dynamics to navigate the initialization landscape, avoiding collapse regions while maintaining manifold validity. Compatible with diffusion and flow matching models, it enhances diversity without modifying generation trajectories. Experiments demonstrate DivIn's superiority in class-to-image and text-to-image tasks, and its orthogonal benefits when combined with trajectory-based methods, expanding the diversity-quality Pareto frontier.

mode collapse · langevin dynamics · guidance potential · diffusion models · flow matching

HLL: Can Agents Cross Humanity's Last Line of Verification?

arXiv cs.AI · Xinhao Song, Su Su, Sirui Song, Hongliang Wu · 2026-06-01

The paper introduces Humanity's Last Line of Verification (HLL), a benchmark for evaluating multimodal agents' ability to bypass CAPTCHA verification through human-like interaction. HLL tests agents under controlled realism stressors, including cluttered webpages, harder task variants, and trace-conditioned validation. Evaluation of eight frontier multimodal agents in a closed-loop GUI environment reveals brittleness: performance varies by verification type, degrades under realistic conditions, and drops further when requiring valid action traces. HLL exposes gaps in localization, action calibration, state tracking, and process consistency, providing a testbed for measuring human-substitution capability in protected workflows.

multimodal agents · captcha verification · human-substitution boundary · closed-loop gui · trace-conditioned validation

Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback

arXiv cs.AI · Giulia Pucci, Emily Hemendinger, Ruizhe Li, Gavin Abercrombie · 2026-06-01

This study systematically evaluates how Large Language Models (LLMs) fail to adapt to eating disorder (ED) queries, identifying patterns of interaction that may facilitate unsafe or self-harming user requests. Through consultation with clinical ED experts, the authors analyze linguistic cues in prompts that increase the likelihood of unsafe responses. By systematically varying the degree of potential risk in user prompts, they quantify the extent to which LLMs uncritically adapt to problematic inputs. The findings highlight the risks of LLMs providing guidance, advice, or emotional support to users with EDs, despite not being designed for clinical advice.

large language models · eating disorders · linguistic cues · unsafe responses · clinical feedback

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

arXiv cs.AI · Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu · 2026-06-01

The paper introduces PaSBench-Video, a 740-video benchmark for evaluating proactive safety warning in streaming video scenarios, addressing gaps in existing benchmarks by including temporal precision and false-positive measurement. The dataset spans four domains (driving, healthcare, daily life, industrial production) with frame-level annotations for risk onset and accident boundaries. Testing 13 multimodal large language models (MLLMs) reveals poor performance (≤20.0% on strict metrics), with recall strongly correlated (Pearson r=0.64) to false-positive rates, indicating models rely on scene-level cues rather than reasoning about emerging harm.

proactive safety warning · multimodal large language models · temporal precision · false-positive rate · frame-level annotation

LLM-Evolved Pattern Generators for Optimal Classical Planning

arXiv cs.AI · Windy Phung, Dominik Drexler, Arnaud Lequen, Jendrik Seipp · 2026-06-01

The paper introduces the first method for learning domain-dependent heuristics that preserve admissibility in optimal classical planning, ensuring A* search's optimality guarantees. Instead of directly mapping states to heuristic values, the approach learns to construct abstractions that induce admissible heuristics. It employs an LLM-driven evolutionary program-synthesis framework to generate domain-specific pattern collections, combined via saturated cost partitioning. Empirical results show the learned programs encode interpretable domain insights, operate with minimal overhead, and match state-of-the-art domain-independent baselines in coverage while evaluating states significantly faster.

admissible heuristics · optimal classical planning · evolutionary program-synthesis · saturated cost partitioning · pattern collection

ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning

arXiv cs.AI · Zhensheng Wang, Xiaole Liu, Wenmian Yang, Kun Zhou · 2026-06-01

The paper introduces ODTQA-FoRe, the first open-domain tabular QA dataset for future data forecasting and reasoning, addressing limitations in current systems' numerical prediction capabilities. It proposes TimeFore, an LLM agent framework with three specialized roles: Retriever (SQL-based data fetching), Forecaster (external time-series model invocation), and Analyzer (result synthesis). Experiments demonstrate TimeFore's effectiveness in handling time-series forecasting and forecast-based reasoning tasks using real estate data.

tabular question answering · time-series forecasting · llm agent · open-domain qa · numerical prediction

Bridging the Sim-to-Real Gap in Semiconductor Visual Program Synthesis via Input Binarization

arXiv cs.AI · Yusuke Ohtsubo, Kota Dohi, Koichiro Yawata, Koki Takeshita · 2026-06-01

The study introduces a visual program synthesis framework using a Vision-Language Model (VLM) to convert semiconductor inspection images into editable Domain-Specific Language (DSL) code, enabling precise parametric control over circuit geometries for training data generation. To address the sim-to-real gap between synthetic DSL-rendered data and real Scanning Electron Microscope (SEM) images, the authors propose an input binarization strategy that removes SEM-specific texture and noise, focusing the model on geometric structure. On the MIIC dataset, binarization improves the mean Dice coefficient from 0.4393 to 0.5256, demonstrating significant mitigation of the domain gap.

visual program synthesis · domain-specific language · sim-to-real gap · input binarization · semiconductor inspection
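
The two ingredients named in this summary, input binarization and the Dice coefficient, can be illustrated on a toy patch. The fixed threshold below is an assumption (the summary does not say how the authors binarize SEM images), and the pixel values and reference mask are hypothetical.

```python
def binarize(image, threshold):
    """Map a grayscale image (rows of intensities) to a 0/1 mask."""
    return [[1 if px >= threshold else 0 for px in row] for row in image]

def dice(mask_a, mask_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two equally sized 0/1 masks."""
    flat_a = [px for row in mask_a for px in row]
    flat_b = [px for row in mask_b for px in row]
    overlap = sum(a * b for a, b in zip(flat_a, flat_b))
    total = sum(flat_a) + sum(flat_b)
    return 1.0 if total == 0 else 2 * overlap / total

# Toy 2x3 patch with noisy intensities in [0, 255] and a reference mask.
patch = [[200, 180, 40], [30, 210, 20]]
reference = [[1, 1, 0], [0, 0, 0]]
score = dice(binarize(patch, threshold=128), reference)
```

Thresholding discards intensity texture and keeps only geometry, which is the effect the summary attributes to binarization. Here the binarized patch has three foreground pixels, two of which overlap the reference, giving a Dice score of 0.8.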

Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

arXiv cs.AI · Yafan Huang, Sheng Di, Guanpeng Li · 2026-06-01

The study presents LLMFI, a configurable fault-injection framework for systematically analyzing error propagation in large language model (LLM) inference. Researchers injected faults across three open-weight LLMs and thirteen tasks spanning reasoning, multilingual, mathematical, and coding domains, identifying critical vulnerability patterns. The work yields 17 key takeaways on error propagation and proposes four low-overhead software modifications to enhance LLM reliability, providing practical guidance for error detection and mitigation.

fault-injection · error propagation · llm inference · reliability · vulnerability patterns

GC-MoE: Genomics-Guided Cell-Type-Specific Mixture of Experts for Histology-Based Single-Cell Spatial Transcriptomics

arXiv cs.AI · Kaito Shiku, Ahtisham Fazeel Abbasi, Ryoma Bise, Yuichiro Iwashita · 2026-06-01

GC-MoE introduces a genomics-guided mixture-of-experts framework for histology-based single-cell spatial transcriptomics (ST) prediction, addressing cell-to-cell expression variability structured by cell type. The method employs a routing network for cell-type probability estimation, combining cell-type-specific experts with a Cell-Type-Specific Co-Expression-Aware Predictor (CAP) and a Cell-to-Cell Interaction Attention (C2CA) module for neighboring-cell context. Evaluations on single-cell ST datasets demonstrate consistent improvements over existing single-cell and adapted spot-level baselines.

spatial transcriptomics · mixture-of-experts · cell-type-specific · co-expression · attention mechanism

Evolutionary Discovery of Bivariate Bicycle Codes with LLM-Guided Search

arXiv cs.AI · Juan Cruz-Benito, Andrew W. Cross, David Kremer, Ismael Faro · 2026-06-01

The paper introduces an LLM-guided evolutionary workflow for discovering bivariate-bicycle quantum LDPC codes, combining language model-guided program mutation with a staged validation pipeline. The method involves mutating Python programs to generate code candidates, followed by evaluation using GF(2) rank computation, distance estimation, mixed-integer linear programming, Tanner-graph deduplication, and equivalence checks. Across five campaigns, the system screened ~200,000 candidates, identifying 465 distinct codes, including 97 CSS bivariate-bicycle codes and 368 non-CSS perturbed variants. Notable findings include an indecomposable [[288,16,12]] code and high-distance candidates up to k=50 at d=8. Results demonstrate LLM-guided evolution as a practical tool for structured quantum-code discovery.

quantum ldpc codes · llm-guided evolution · bivariate-bicycle codes · mixed-integer linear programming · tanner-graph deduplication
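
The GF(2) rank computation named among the validation stages above can be sketched with rows stored as integer bitmasks. This is a generic Gaussian-elimination sketch, not the authors' pipeline; the function name and the example matrices are illustrative.

```python
def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are given as integer bitmasks."""
    rows = list(rows)  # work on a copy
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue  # a zero row adds nothing to the rank
        rank += 1
        low_bit = pivot & -pivot  # lowest set bit of the pivot row
        # Clear that bit in every remaining row (XOR is addition in GF(2)).
        rows = [r ^ pivot if r & low_bit else r for r in rows]
    return rank


# Rows 110, 011, 101: the third is the XOR of the first two, so the rank is 2.
dependent = gf2_rank([0b110, 0b011, 0b101])
identity = gf2_rank([0b001, 0b010, 0b100])
```

For a CSS code on n qubits with check matrices H_X and H_Z, the number of logical qubits is k = n - rank(H_X) - rank(H_Z), which is why a rank routine of this kind is a natural early filter for candidates.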

AutoForest: Automatically Generating Forest Plots from Biomedical Studies with End-to-End Evidence Extraction and Synthesis

arXiv cs.AI · Massimiliano Pronesti, Angelo Miculescu, Mohsin Kapdi, Paul Flanagan · 2026-06-01

AutoForest introduces the first end-to-end system for automatically generating publication-ready forest plots from biomedical papers, addressing the labor-intensive process of evidence synthesis in systematic reviews. The system automates ICO (Intervention, Comparator, Outcome) element suggestion, outcome data extraction, statistical synthesis, and plot rendering. A user study with clinicians demonstrates its effectiveness in accelerating meta-analyses and lowering barriers to evidence synthesis.

forest plots · evidence synthesis · meta-analyses · ico elements · biomedical papers

Policy and World Modeling Co-Training for Language Agents

arXiv cs.AI · Ning Lu, Baijiong Lin, Shengcai Liu, Jiahao Wu · 2026-06-01

The paper proposes Policy and World modeling co-training (PaW), a framework that enhances language agents by jointly optimizing reinforcement learning (RL) policies and world models (WM) using on-policy rollouts. PaW introduces action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing to stabilize auxiliary WM supervision without altering inference. Experiments on three agentic task benchmarks demonstrate consistent improvements over RL baselines across models and algorithms, showing that RL rollouts effectively provide WM supervision.

reinforcement learning · world modeling · language agents · on-policy rollouts · auxiliary supervision

AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design

arXiv cs.AI · Sahil Rahman, Maxx Richard Rahman · 2026-06-01

AgentPLM introduces Reasoning-Augmented Decoding (RAD) and Contrastive Agent Policy Optimisation (CAPO) to transform passive protein language models into active agents. RAD interleaves autoregressive generation with biophysical tool calls (ESMFold, FoldX, AutoDock Vina), while CAPO optimizes policy trajectories end-to-end for selective oracle consultation. Evaluated on de novo enzyme design, antibody optimization, and thermostability tasks, AgentPLM achieves state-of-the-art results, including a top-10% hit rate improvement in antibody design, demonstrating online error correction without backtracking.

protein language models · reasoning-augmented decoding · contrastive agent policy optimisation · autoregressive generation · biophysical feedback

A Mathematical Conflict Framework for Contextual Data Modulation

arXiv cs.AI · Hakan Emre Kartal · 2026-06-01

The study introduces a generalized operator-based mathematical conflict framework to explicitly model structural discrepancies between raw and contextual data. The framework treats conflict as a local, directional, and context-sensitive quantity, integrating components such as weighting, scale behavior, and output mapping under a unified abstract operator. Unlike existing approaches that embed conflict implicitly within optimization processes, this framework defines conflict as an independent, operator-based, and component-level mathematical object. It is adaptable to various problem classes without being tied to specific learning algorithms or optimization methods.

conflict framework · contextual data · abstract operator · scale behavior · output mapping

SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence

arXiv cs.AI · Yuyan Bu, Haowei Li, Qirui Zheng, Bowen Dong · 2026-06-01

The paper introduces SPADE-Bench, a benchmark for evaluating spontaneous plan-action divergence (agent deception) in LLM-based agents during tool-use scenarios. The benchmark combines actual tool execution with controlled pressure scenarios, so that strategic deception is measured under ecologically valid conditions and can be distinguished from hallucination. Experiments across mainstream models indicate that plan-action divergence is prevalent and a critical issue for autonomous agent deployment. SPADE-Bench addresses a key gap in agent safety by providing a rigorous framework for assessing trustworthiness and controllability in black-box execution environments.

agent deception · plan-action divergence · tool-use evaluation · autonomous agent safety · ecological validity

When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-Class Architectures

arXiv cs.AI · Yongzhong Xu · 2026-06-01

The study analyzes developmental trajectories of attention circuits in three 1B-parameter language models (Pythia 1B, OLMo 1B-0724-hf, OLMoE 1B-7B-0924) across dense transformer and mixture-of-experts architectures. Using participation-ratio (PR) spectral analysis and capability-specific selectivity screening at 10 log-spaced revisions per model, it tracks the formation of induction, previous-token, and BOS-attractor heads. Key findings: (1) BOS-attractor heads never emerge in layers 0-1 (an architectural property); (2) BOS-attractor emergence patterns differ per model; (3) induction circuits form 10-20x earlier than BOS-attractors in DCLM-trained models; (4) circuits can be identified early, at 0.3-2% of training; and (5) induction heads show elevated PR at capability onset.

attention circuits · participation-ratio · bos-attractor · induction heads · capability-selectivity
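
The participation ratio mentioned above is commonly defined over a spectrum of non-negative eigenvalues as (Σλ)² / Σλ², which behaves like an effective count of active dimensions. The sketch below uses that common definition; whether the paper uses exactly this variant, and which matrix the spectrum is taken from, are assumptions here.

```python
def participation_ratio(eigenvalues):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    total = sum(eigenvalues)
    squares = sum(v * v for v in eigenvalues)
    if squares == 0:
        raise ValueError("spectrum must contain a non-zero eigenvalue")
    return total * total / squares


uniform = participation_ratio([1.0, 1.0, 1.0, 1.0])  # variance spread over 4 directions
peaked = participation_ratio([1.0, 0.0, 0.0, 0.0])   # variance in a single direction
```

A flat spectrum over n directions gives a value of n and a single dominant direction gives 1, so a rise in PR indicates that a head's activity is spreading over more directions.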

Spatial Representation Learning Beyond Pixels: Unifying Raster Data and Vector Semantics for Human-Centric Geospatial Foundation Models

arXiv cs.AI · Steffen Knoblauch, Hao Li, Gengchen Mai, Konstantin Klemmer · 2026-06-01

The paper advocates for unified Spatial Representation Learning (SRL) to integrate raster and vector geospatial data in foundation models, addressing current limitations where Earth Observation Foundation Models (EOFMs) primarily use raster data. It proposes joint embedding spaces to combine continuous spectral patterns from raster data with discrete, structured vector data (e.g., OpenStreetMap), enhancing semantic and relational understanding. The work outlines technical challenges and directions for multimodal alignment, emphasizing improved accuracy and interpretability in geospatial AI systems.

spatial representation learning · earth observation foundation models · multimodal geospatial learning · raster-vector integration · semantic alignment

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

arXiv cs.AI · Pengcheng Jiang, Zhiyi Shi, Kelly Hong, Xueqiang Xu · 2026-06-01

The paper introduces Harness-1, a 20B-parameter search agent trained via reinforcement learning within a stateful search harness that externalizes routine state management. The harness maintains environment-side working memory (candidate pool, curated set, evidence links, verification records), while the policy focuses on semantic decisions (search targets, document retention, verification, stopping). Evaluated on eight retrieval benchmarks (web, finance, patents, multi-hop QA), Harness-1 achieves 0.730 average curated recall, outperforming open alternatives by +11.4 points and showing strong transfer performance, suggesting improved generalization from explicit search state.

reinforcement learning · search agents · state externalization · retrieval benchmarks · working memory
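
The split described above, where the harness holds the bookkeeping and the policy makes the decisions, can be pictured as a small state container. The sketch below is purely illustrative: the class, its field and method names, and the definition of curated recall as the fraction of relevant documents in the curated set are all assumptions, not the paper's interface.

```python
from dataclasses import dataclass, field


@dataclass
class SearchState:
    """Environment-side working memory; every name here is hypothetical."""
    candidate_pool: set = field(default_factory=set)
    curated: set = field(default_factory=set)
    evidence_links: dict = field(default_factory=dict)  # doc_id -> claims it supports
    verified: dict = field(default_factory=dict)        # doc_id -> verification outcome

    def add_candidates(self, doc_ids):
        self.candidate_pool.update(doc_ids)

    def keep(self, doc_id, claim):
        """Record a policy decision to retain a document and what it supports."""
        if doc_id not in self.candidate_pool:
            raise KeyError(f"{doc_id} was never retrieved")
        self.curated.add(doc_id)
        self.evidence_links.setdefault(doc_id, []).append(claim)

    def record_verification(self, doc_id, ok):
        self.verified[doc_id] = ok


def curated_recall(curated, relevant):
    """Fraction of relevant documents present in the curated set (assumed definition)."""
    return len(curated & relevant) / len(relevant) if relevant else 0.0


state = SearchState()
state.add_candidates(["d1", "d2", "d3"])
state.keep("d1", "claim A")
state.keep("d3", "claim B")
state.record_verification("d1", True)
recall = curated_recall(state.curated, {"d1", "d2"})  # 1 of 2 relevant documents kept
```

Keeping this state outside the model means the policy only emits decisions such as keep or verify, while the container guarantees that the record of what was retrieved and retained stays consistent.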

COMAP: Co-Evolving World Models and Agent Policies for LLM Agents

arXiv cs.AI · Youwei Liu, Jian Wang, Hanlin Wang, Wenjie Li · 2026-06-01

COMAP introduces a framework for co-evolving textual world models and agent policies through closed-loop interaction, enabling LLM agents to adapt to on-policy state-action distributions. At each decision step, the world model predicts future state feedback for candidate actions, and the agent performs future-aware reflection to refine its action. The resulting trajectories update the world model via self-distillation, improving its alignment with the agent's evolving interaction distribution. Evaluated on embodied task planning, Web navigation, and tool-use benchmarks, COMAP achieves a +16.75% relative improvement with Qwen3-4B, enhancing prediction accuracy and long-horizon decision-making.

world model · self-distillation · closed-loop interaction · on-policy · future-aware reflection

FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo

arXiv cs.AI · Kyunghun Nam, Sumyeong Ahn · 2026-06-01

FOAM introduces an adaptive damping method to address the staleness-error trade-off in Shampoo optimization, where computational efficiency is improved via stale preconditioner updates at the cost of performance degradation and numerical instability. The method dynamically adjusts the damping factor and eigendecomposition frequency based on a staleness-oriented error approximation, stabilizing training while maintaining convergence. Experiments show FOAM reduces wall-clock time compared to standard Shampoo without compromising optimization fidelity.

shampoo optimizer · adaptive damping · staleness-oriented error · matrix inversion · numerical stability

MOC: Multi-Order Communication in LLM-based Multi-Agent Systems

arXiv cs.AI · Yao Guan, Lin Wang, Zhihu Lu, Ziyi Wang · 2026-06-01

The paper introduces Multi-Order Communication (MOC), a scheme for improving message transmission in LLM-based multi-agent systems. MOC addresses the limitations of current first-order concatenation methods by reconstructing inter-agent communication to capture multi-hop dependencies and by employing a Semantic-Topological Merging algorithm for efficient message consolidation. Experiments across six datasets and multiple LLM scales show consistent gains in task performance and reductions in communication cost, although exact figures are not given here.

multi-agent systems · large language models · message consolidation · semantic-topological merging · multi-hop dependencies

Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

arXiv cs.AI · Garvin Guo, Donglei Yu, Yu Chen, Xiang Wang · 2026-06-01

The study challenges the assumption that tool-augmented multimodal agents inherently gain capability from tool use. Analyzing the Thyme and DeepEyesV2 agents across real-world understanding, OCR, chart interpretation, and math tasks, it compares tool-equipped versions against tool-free counterparts and pure-text baselines. Results show minimal aggregate improvement (93-96% of the problems solved with tools are also solved without them), no consistent token-cost reduction, and no clear advantage from full tool loops over partial implementations. The findings suggest that current evaluations conflate tool availability with actual capability expansion, and they emphasize the need to distinguish learned tool-calling patterns from genuine performance gains.

multimodal agents · tool augmentation · capability evaluation · benchmark analysis · token efficiency

SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

arXiv cs.AI · Zhongyu He, Yuanfan Li, Fei Huang, Tianyu Chen · 2026-06-01

The paper introduces SIRI, a three-phase framework for training LLM agents to internalize reusable skills without external generators or inference-time retrieval. SIRI first warms up the policy with GiGPO, then performs self-skill mining by summarizing and validating skills from successful rollouts, and finally distills beneficial skill-guided actions into the plain policy. Evaluated with Qwen2.5-7B-Instruct, SIRI improves on GiGPO from 0.908 to 0.930 on ALFWorld and from 0.728 to 0.813 on WebShop, outperforming prompt-based, RL-based, and memory-augmented baselines.

llm agents · skill distillation · reinforcement learning · self-skill mining · internalization

Coordination Graphs for Constrained Multi-Agent Reinforcement Learning

arXiv cs.AI · Santiago Amaya-Corredor, Miguel Calvo-Fullana, Anders Jonsson · 2026-06-01

The paper introduces Coordination Graphs for Constrained Multi-Agent Reinforcement Learning (CG-CMARL), a framework addressing exponential joint action space growth and constraint coupling in CMARL. The method decomposes the problem into pairwise regions with shared Q-functions (one per objective/constraint), using Max-Sum message passing for coordination and Lagrangian multipliers for Pareto-optimal tradeoffs. Theoretical analysis provides convergence guarantees and compositional error bounds. Experiments on cooperative navigation with up to 10 agents show Pareto fronts dominating reward-shaped baselines while scaling beyond centralized approaches.

multi-agent reinforcement learning · coordination graphs · lagrangian duality · max-sum message passing · pareto front

Forget Attention: Importance-Aware Attention Is All You Need

arXiv cs.AI · Soohyeong Shin, Yeongwook Yang · 2026-06-01

The paper proposes SISA (SSM-Informed Softmax Attention), a hybrid architecture that integrates state space models (SSMs) directly into attention computation via an SSM-derived importance term within the attention score, implemented as a single SDPA call on augmented query/key vectors. This score-level fusion contrasts with existing block-level (Jamba) and head-level (Hymba) hybrids, enabling SSMs to inform attention dynamically without recurrent state or custom kernels. At 152M parameters trained on 5B tokens, SISA achieves 17.3% accuracy on LAMBADA-greedy, outperforming Transformer (13.9%) and Mamba-3 (15.5%), and attains 100% NIAH accuracy from step 1K, 7x faster than the Transformer's retrieval convergence. At 369M parameters, SISA maintains perfect NIAH while preserving stock-SDPA execution.

state space models · attention computation · softmax attention · hybrid architecture · score-level fusion
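
One way an additive importance term can ride along inside an ordinary dot-product score is to append a constant 1 to the query and the per-key importance to each key, so that the augmented dot product equals q·k + b. The sketch below shows only that identity in plain Python. Whether SISA uses this exact construction is an assumption, the importance values here are hand-picked rather than SSM-derived, and the usual 1/sqrt(d) scaling is omitted.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def biased_attention_weights(q, keys, importance):
    """Softmax over q.k_j + b_j, computed through augmented vectors."""
    q_aug = q + [1.0]                                   # query gets a constant 1
    keys_aug = [k + [b] for k, b in zip(keys, importance)]  # key carries its bias
    return softmax([dot(q_aug, k) for k in keys_aug])


q = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0]]
# Identical keys, but the second gets an importance of ln 3.
weights = biased_attention_weights(q, keys, importance=[0.0, math.log(3.0)])
# Scores are 1 and 1 + ln 3, so the weights are 1/4 and 3/4.
```

Because the bias lives inside the vectors, the score can be produced by an unmodified attention kernel. In a real scaled dot-product call the appended term would be scaled along with everything else, so it would need to be pre-multiplied accordingly.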

Repair Before Veto: Repair-Augmented Constraint Learning for Contextual Decisions

arXiv cs.AI · Yifan Wang · 2026-06-01

The paper introduces Repair-Augmented Constraint Learning (RACL), a contextual decision framework that integrates known repair operators into classifier semantics. RACL accepts candidates when affordable repairs make them feasible and sufficiently preferred, otherwise providing structured rejection credits and repair plans. This approach generalizes no-repair HASSLE-style semantics, addresses false-veto gaps, and separates non-identifiability from learnability. Experiments on controlled and DB1B-derived benchmarks show RACL reduces false vetoes to 10/4039 (FVR 0.0025) versus 1064/4039 for black-box baselines, while clarifying the FVR/EDR trade-off.

constraint learning · repair operators · contextual decisions · false-veto gap · hassle-style semantics

Repurposing Adversarial Perturbations for Continual Learning: From Defense to Active Alignment

arXiv cs.AI · Ran Liu, Min Yu, Mingqi Liu, Jianguo Jiang · 2026-06-01

The paper introduces AdvCL, a method for continual learning in dynamic environments that repurposes adversarial perturbations as geometric control signals. It combines three modules: Intra-Smooth for local smoothness via small perturbations, Proto-Clip to prevent excessive alignment to current task prototypes, and Inter-Align for directional alignment to previous task prototypes. Experiments demonstrate improved performance, robustness, reduced forgetting, and enhanced transfer. The analysis highlights sensitivity to perturbation settings and the impact of Inter-Align on task similarity and geometric distance. The modules are compatible with various continual learning paradigms.

advcl · intra-smooth · proto-clip · inter-align · geometric control

SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

arXiv cs.AI · Hao Cheng, Changtao Miao, Tianle Song, Yin Wu · 2026-06-01

SeClaw introduces a framework for spec-driven security task synthesis and execution-based evaluation of autonomous LLM agents, addressing limitations in current benchmarks. The method combines structured risk specifications with a Docker-based testbed to systematically generate security tasks and assess agent behavior across resources, user tasks, environments, and intrinsic behaviors. Results demonstrate trajectory-aware evaluation of unsafe actions, enabling reproducible measurement and diagnosis of security failures in stateful agent workflows.

autonomous agents · security evaluation · task synthesis · llm safety · execution-based assessment

Quantitative Movement Testing: Measuring Patient Movements from a Single Smartphone Video

arXiv cs.AI · Pranav Mahajan, Amanda Wall, Eleonora Maria Camerone, Julie Stebbins · 2026-06-01

The authors introduce Quantitative Movement Testing (QMT), a computer vision pipeline for extracting 3D kinematic biomarkers from monocular smartphone videos to objectively measure functional movement impairments in chronic pain patients. QMT combines deep learning-based 3D pose estimation with leave-one-subject-out calibration to correct systematic bias, validated against optical motion capture in healthy controls (N=13). Results show strong correlation (r > 0.85) with gold-standard measurements, high test-retest reliability (r > 0.86) in fibromyalgia patients, and successful tracking of daily movement fluctuations in chronic sciatica. While home environments introduced higher variance, QMT detected group-level differences between healthy controls and sciatica patients, demonstrating potential for scalable remote monitoring in clinical trials.

3d pose estimation · kinematic biomarkers · monocular video · optical motion capture · test-retest reliability

CityTrajBench: A Unified Benchmark for City-Scale Vehicle Trajectory Generation

arXiv cs.AI · Shibo Zhu, Xiaodan Shi, Dayin Chen, Yuntian Chen · 2026-06-01

CityTrajBench introduces a unified benchmark framework for city-scale vehicle trajectory generation, addressing fragmentation in evaluation protocols across studies. The framework standardizes data ingestion, trajectory normalization, feature construction, and multi-level evaluation, supporting heterogeneous generators (VAE, GAN, diffusion, flow-matching) on three real-world datasets. Experiments reveal trade-offs: DiffTraj excels in trajectory-level geometric fidelity, DiffRNTraj in global realism, and TrajFlow balances realism, quality, and efficiency, while a Markov baseline remains competitive on coarse-grained statistics.

trajectory generation · benchmark framework · diffusion models · urban mobility · evaluation metrics

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

arXiv cs.AI · Iñaki Dellibarda Varela, R. Sendra-Arranz, Pablo Romero-Sorozabal, J. M. Valverde-García · 2026-06-01

POIROT introduces a protocol for failure detection in multi-agent systems by repurposing agents as a diagnostic layer, eliminating the need for centralized judgment. The method leverages existing epistemic diversity within the architecture to identify emergent failures and hallucinations. Results show POIROT outperforms single-LLM evaluator baselines (OR = 1.60, p = 0.008), with performance scaling with problem complexity, agent count, and fault dimensionality. The approach is released as an open-source library alongside BLAME, a benchmark for fault attribution in safety-critical systems.

multi-agent systems · failure detection · epistemic diversity · safety-critical · fault attribution

Cross-modal linkage risk in clinical vision-language models

arXiv cs.AI · Soroosh Tayebi Arasteh, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn · 2026-06-01

The study demonstrates that clinically specialized vision-language models (VLMs) create a privacy risk by enabling high-accuracy cross-modal re-linkage of de-identified chest radiographs to their original reports through shared embedding spaces. Using MIMIC-CXR (43,793 pairs) and CheXpert Plus (29,296 pairs) as benchmarks, the authors show that retrieval accuracy scales with model specialization, reaching 15× chance at N=100 and persisting under pathology-matched hard negatives. Differential privacy optimization (ε=0.34) on projection heads reduced Recall@1 by 61.8% at N=10,000 while preserving image classification utility (AUROC shift: 79.63%→79.43%).

vision-language models · cross-modal retrieval · differential privacy · clinical embeddings · chest radiographs
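
To make the retrieval figures above concrete, the sketch below computes Recall@1 from a small image-to-report similarity matrix and the chance level for a pool of N candidates. It assumes the true report for image i sits at index i and that the chance comparison is against a uniformly random ranker; the similarity values are made up for illustration.

```python
def recall_at_1(similarity):
    """similarity[i][j] scores image i against report j; the true pair is j == i."""
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=lambda j: row[j])
        hits += best == i
    return hits / len(similarity)


def chance_recall_at_1(pool_size):
    """Expected Recall@1 of a random ranker over pool_size candidates."""
    return 1.0 / pool_size


sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],  # image 1 is wrongly matched to report 2
    [0.1, 0.4, 0.7],
]
observed = recall_at_1(sim)         # 2 of 3 images matched correctly
baseline = chance_recall_at_1(100)  # 0.01 for a pool of 100
```

Under these assumptions, a result of 15× chance at N=100 would correspond to a top-1 match rate of about 15%, compared with 1% for random guessing.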

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

arXiv cs.AI · Maria Kunilovskaya, Gagan Bhatia, Lisa Sophie Albertelli, Yanran Chen · 2026-06-01

This study conducts the first large-scale audit of human annotation reporting in NLP research, introducing a unified taxonomy and validating an LLM-assisted extraction pipeline (Krippendorff's alpha=0.606). The pipeline analyzes 2,667 annotation tasks from 1,603 ACL-venue papers (2018-2025), revealing that while operational details like annotator expertise are commonly reported, critical validity measures (training, compensation, adjudication) are often omitted, particularly in model-evaluation studies. Results indicate temporal improvement but persistent gaps in annotation transparency, prompting proposed minimum reporting standards.

human annotation · reporting practices · llm-assisted extraction · krippendorff's alpha · annotation validity

CEON: Circular Economy Ontology Network

arXiv cs.AI · Huanyu Li, Els de Vleeschauwer, Robin Keskisärkkä, Mikael Lindecrantz · 2026-06-01

The Circular Economy Ontology Network (CEON) addresses semantic interoperability challenges in circular economy (CE) knowledge representation by defining cross-sectorial concepts and enabling semantics-aware data documentation. Developed within the Onto-DESIDE project, CEON facilitates information sharing across industry sectors related to product life cycles, including construction, electronics, and textiles. The ontology network supports circular strategies such as reuse, refurbishment, remanufacturing, and recycling. CEON's efficacy is demonstrated through cross-industry data documentation scenarios, showcasing its ability to enhance resource circularity and sustainability.

circular economy · semantic interoperability · ontology network · cross-sectorial concepts · semantics-aware documentation

FW-NKF: Frequency-Weighted Neural Kalman Filters

arXiv cs.AI · Adnan Harun Dogan, Berken Utku Demirel, Christian Holz · 2026-06-01

The paper introduces FW-NKF, a Frequency-Weighted Neural Kalman Filter that combines spectral shaping with deep learning for robust state estimation. The method integrates a causal spectral-shaping operator into Kalman residuals and jointly learns observation and transition networks to attenuate band-limited noise. Evaluated on chaotic systems (e.g., Lorenz) and inertial pose estimation, FW-NKF reduces localization error by 10% and improves orientation accuracy. Ablations confirm the necessity of frequency weighting and deep latent-state modeling.

kalman filter · frequency-weighted · state estimation · spectral shaping · lorenz systems
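
As background for the residual that FW-NKF reshapes, a classical scalar Kalman filter is sketched below. This is the textbook filter only: it has no learned networks and no spectral-shaping operator, and the random-walk state model and noise variances are illustrative assumptions.

```python
def kalman_1d(measurements, process_var=1e-3, meas_var=0.1, x0=0.0, p0=1.0):
    """Scalar Kalman filter with a random-walk state model."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + process_var      # predict: uncertainty grows between steps
        residual = z - x         # innovation (measurement minus prediction)
        gain = p / (p + meas_var)
        x = x + gain * residual  # update the state estimate
        p = (1.0 - gain) * p     # update the estimate variance
        estimates.append(x)
    return estimates


estimates = kalman_1d([1.0] * 50)  # constant signal, filter starts at 0
```

The `residual` line is the quantity a frequency-weighted variant would filter before applying the gain, which is the step the summary refers to as integrating a spectral-shaping operator into Kalman residuals.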

Towards Resolving Optimization Conflicts Between Image- and Text-Based Person Re-Identification

arXiv cs.AI · Karina Kvanchiani, Timur Mamedov · 2026-06-01

The paper proposes a decoupled two-stage training pipeline to resolve optimization conflicts between image-based (I2I) and text-based (T2I) person re-identification (ReID). The method employs a single vision encoder trained sequentially to avoid cross-task interference, with I2I pre-training enhancing T2I generalization. Experiments demonstrate that textual supervision during vision encoder training improves performance in both modalities, achieving a unified representation. Results highlight the benefits of domain mixing and task-specific learning strategies for cross-modal retrieval.

person re-identification · cross-modal retrieval · vision encoder · optimization conflicts · shared representation

AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations

arXiv cs.AI · Hiskias Dingeto, Will Leeney · 2026-06-01

AgentRedBench introduces a dynamic LLM-driven redteaming benchmark addressing indirect prompt injection in tool-use agents across SaaS integrations. The benchmark comprises 215 authorization scenarios across 24 enterprise integrations, categorized into nine functional families and five attack types. Evaluated on an eight-model panel, attack success rates (ASR) ranged from 32% (Claude Sonnet 4.6) to 81% (Gemini 3 Flash). AgentRedGuard, a guard model trained on adversarial tool-response content, reduced ASR from 69.9% to 2.4% at a 0.37% false-positive rate, outperforming open-source baselines like Llama Guard and PromptGuard 2. Cross-integration and cross-attack-type holdouts confirmed the guard's generalization capability.

indirect prompt injection · tool-use agents · attack success rate · adversarial tool-response · cross-integration holdout

Faster Synchronous On-Policy RL via Straggler-Aware Group Sizing

arXiv cs.AI · Azal Ahmad Khan, Ammar Ahmed, Zeshan Fayyaz, Sheng Di · 2026-06-01

The paper introduces Straggler-Aware Group Control (SAGC), a dynamic group-size controller for synchronous on-policy reinforcement learning that mitigates straggler-induced delays. SAGC formulates group-size selection as an online constrained optimization problem, adapting group sizes based on observed rollout behavior to balance training stability and wall-clock efficiency. Experiments with Group Relative Policy Optimization (GRPO) and Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) show SAGC reduces straggler incidence by 15-30% while maintaining or improving training reward and final model quality on reasoning benchmarks, often yielding shorter outputs without explicit length penalties.

synchronous reinforcement learning · straggler mitigation · group relative policy optimization · online constrained optimization · wall-clock efficiency
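
The tension behind group sizing can be seen in a few lines: group-relative methods normalize rewards within a group of rollouts, and a synchronous step cannot finish until the slowest rollout in the group does. The sketch below illustrates both pieces under simple assumptions (population standard deviation, made-up rewards and latencies). It does not implement SAGC's controller.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: center and scale rewards within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # implementations differ on the std variant
    return [(r - mean) / (std + eps) for r in rewards]


def sync_step_time(rollout_seconds):
    """A synchronous step waits for the slowest rollout in the group."""
    return max(rollout_seconds)


advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
small_group = sync_step_time([4.0, 5.0])
large_group = sync_step_time([4.0, 5.0, 4.5, 31.0])  # one straggler dominates
```

Larger groups give a steadier reward baseline but raise the odds that one long rollout sets the step time, which is the trade-off a group-size controller has to manage.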

Consistency Training while Mitigating Obfuscation via Rate Matching

arXiv cs.AI · Sohaib Imran, Prakhar Gupta, Jannes Elstner, David Demitri Africa · 2026-06-01

The paper introduces Rate Matching Consistency Training (RMCT), a method for reducing extraneous feature influence in large language models while preserving verbalization of those features. RMCT enforces consistency by matching behavioral rates (e.g., bias-following) across input perturbations, avoiding direct constraints on response content. Evaluated on sycophancy reduction in open-weight models, RMCT achieves comparable bias mitigation to standard consistency training while better preserving cue verbalization, with improved data efficiency at higher computational cost. The approach demonstrates that behavioral robustness need not trade off against monitorability.

consistency training · rate matching · behavioral robustness · sycophancy reduction · monitorability

On the Generalization in Topology Optimization via Sensitivity-Conditioned Bernoulli Flow Matching

arXiv cs.AI · Mohammad Rashed, Duarte F. Valoroso Madeira, Babak Gholami, Caglar Guerbuez · 2026-06-01

The paper identifies adjoint sensitivity as the information-theoretically optimal conditioning signal for topology optimization (TO) generalization under distribution shifts, proposing pseudo-sensitivities to quantify how well physical fields approximate true sensitivities. A sensitivity-conditioned Bernoulli flow-matching generator demonstrates that sensitivity conditioning achieves state-of-the-art out-of-distribution (OOD) performance, while less informative fields degrade toward raw parameter conditioning. Experiments on structural TO benchmarks and a new CFD-TO dataset confirm these findings under load and boundary-condition shifts.

topology optimization · adjoint sensitivity · bernoulli flow matching · out-of-distribution generalization · pseudo-sensitivities

Order within Chaos: Capturing Intrinsic Energy Anomalies for AI-Manipulated Image Forgery Localization

arXiv cs.AI · Yiming Wang, Baiqi Wu, Qingming Li, Jiahao Chen · 2026-06-01

The paper proposes FLAME, a framework for localizing AI-manipulated image forgeries by detecting intrinsic energy anomalies in diffusion-generated content. The method leverages a LAD map to capture statistical energy gaps from suppressed high-frequency variance during diffusion, combined with a parameter-efficient adapter for Segment Anything Model (SAM) for pixel-level localization. FLAME achieves state-of-the-art performance on AI-generated forgery datasets, with experiments showing superior generalization to unseen generative architectures. The authors also introduce EditStream, an automated pipeline for continuous training data synthesis to address benchmark lag.

diffusion process · forgery localization · energy anomalies · lad map · parameter-efficient adapter

From Capability Models to Automated Planning: An AAS-Native Approach for Automatic PDDL Generation

arXiv cs.AI · Hamied Nabizada, Thomas Wirt, Luis Miguel Vieira da Silva, Felix Gehlhoff · 2026-06-01

The paper presents an automated method to generate Planning Domain Definition Language (PDDL) problems from Asset Administration Shell (AAS) capability models, eliminating the need for production engineers to learn PDDL syntax. The approach leverages four Industry 4.0 standards (VDI 3682, IEC 61360-1, IDTA 02011, IDTA 02016) to extract planning elements from domain-level resource function descriptions, transforming Multi-AAS architectures into complete PDDL problems. Validation on a laboratory production system demonstrates the method's effectiveness in comparing layout variants through optimal planning, enabling systematic design exploration via AAS model modifications.

automated planning · asset administration shell · pddl generation · industry 4.0 · capability models

An Abstract Worlds Semantic Framework for Belief Change Operators

arXiv cs.AI · Daniel Grimaldi, M. Vanina Martinez, Ricardo O. Rodriguez · 2026-06-01

The paper introduces Abstract Worlds Semantics (AWS), a set-theoretic framework for belief change that treats worlds as primitive elements without assuming logical syntax. It defines world contraction and world revision operators, enabling unified analysis of classical and non-prioritized belief change models. When applied to classical propositional logic, AWS provides a homogeneous account of AGM, KM, and Multiple Change models, systematizing and generalizing belief change theory over belief sets.

belief change · abstract worlds semantics · world contraction · world revision · non-prioritized belief change

Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis

arXiv cs.AI · Catyana Heyne, Jürgen Frikel, Filippo Riccio · 2026-06-01

This study provides a systematic comparison of multimodal architectures for document type classification, focusing on text, image, and layout modalities. Four models (LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, Qwen3-32B) are evaluated on RVL-CDIP under a unified framework to assess OCR-dependent and OCR-free approaches. Results indicate specialized multimodal Transformers outperform LLM-based methods, with image features being most discriminative and OCR text providing secondary support, particularly for layout-intensive documents.

multimodal learning · document classification · transformer architectures · ocr-free processing · rvl-cdip

Predicting the risk of colorectal anastomotic leak based on preoperative mapping of the blood supply of the bowel

arXiv cs.AI · Zahra Tabatabaei, Jon Sporring, Mark Bremholm Ellebæk, Alaa El-Hussuna · 2026-06-01

The authors propose an AI-driven system for preoperative prediction of colorectal anastomotic leak risk using CT imaging, addressing the lack of validated quantitative methods. The framework integrates two modules: (1) a deep learning-based risk assessment tool analyzing vascular/tissue features, and (2) a Content-Based Medical Image Retrieval (CBMIR) system for evidence-based surgical planning. The protocol details GDPR-compliant data handling, image preprocessing, and clinically interpretable model development, demonstrating technical feasibility within existing healthcare infrastructures. This interdisciplinary approach aims to reduce leak incidence through explainable, data-driven surgical decision support.

anastomotic leak · content-based image retrieval · ct imaging · deep learning · surgical decision support

S3TS: Stochastic Scenario-Structured Tree Search for Advanced Planning Under Uncertainty

arXiv cs.AI · Fabio Pavirani, Bert Claessens, Pierre Pinson, Chris Develder · 2026-06-01

The authors propose Stochastic Scenario-Structured Tree Search (S3TS), a planning algorithm that simultaneously handles non-linear system models and uncertainty through scenario trees. S3TS combines Monte Carlo Tree Search with stochastic optimization for energy scheduling tasks, particularly demand response signal publication. Evaluated on a Belgium-inspired imbalance settlement simulation, S3TS achieves near-optimal performance (within 14% of optimal) in linear settings and outperforms baselines in non-linear cases, reducing costs by 51% versus myopic approaches and 5.4% versus deterministic MCTS.

stochastic optimization · monte carlo tree search · scenario trees · energy scheduling · non-linear models

Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages

arXiv cs.AI · Saeed Almheiri, Bilal Elbouardi, Salsabila Zahirah Pranida, Irina Nikishina · 2026-06-01

The paper introduces MIDI, a multilingual idiom dataset spanning 18 languages across high-, medium-, and low-resource tiers, curated by native speakers to include both sentence-level and conversational contexts for literal and figurative idiom interpretations. Benchmarking state-of-the-art models reveals performance degradation in low-resource languages, with literal interpretations consistently harder than figurative ones; conversational context aids but does not eliminate disparities. Controlled tests on hidden representations further distinguish memorization from reasoning, exposing model limitations.

multilingual nlp · idiomatic expressions · low-resource languages · contextual interpretation · hidden representations

VLBM: Variational Latent Basis Modeling for OOD Robust Multivariate Time Series Forecasting

arXiv cs.AI · Xudong Zhang, Jierui Lei, Jiacheng Li, Lingdong Shen · 2026-06-01

The paper proposes VLBM (Variational Latent Basis Model), a variational framework for robust multivariate time series forecasting under out-of-distribution (OOD) events. VLBM decomposes inputs into stable latent basis components and orthogonal residuals, using a future-aware posterior aligned with a future-blind prior for test-time inference. Evaluated on 12 benchmarks spanning transportation, weather, and power systems, VLBM achieves state-of-the-art OOD robustness and in-distribution accuracy, with average MAE and MSE improvements of 15.08% and 7.74% over baselines. The method also demonstrates superior OOD pulse recovery in synthetic experiments.

variational inference · latent basis modeling · ood robustness · multivariate time series · forecasting

Rethinking Evaluation Paradigms in IBP-based Certified Training

arXiv cs.AI · Konstantin Kaulen, Hadar Shavit, Holger H. Hoos · 2026-06-01

The paper proposes a Pareto front evaluation paradigm for certified training methods, addressing limitations in current single-configuration reporting practices. Using automated multi-objective hyperparameter optimization, the authors identify Pareto-optimal configurations across methods, revealing undertuning in prior work and establishing new state-of-the-art performance. Their comprehensive comparison shows prior advancements are less significant than claimed and uncovers new performance complementarities between methods.

certified training · pareto front · multi-objective optimization · adversarial robustness · neural network verification
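
Sketch (not from the paper): the Pareto-front idea can be illustrated in a few lines of Python. The two objectives used here (natural and certified accuracy, both maximized), the configuration names, and the scores are assumptions made up for illustration.

```python
from typing import List, Tuple

# (name, natural_accuracy, certified_accuracy); all values are invented.
Config = Tuple[str, float, float]


def dominates(a: Config, b: Config) -> bool:
    """a dominates b if it is no worse on both objectives and better on one."""
    no_worse = a[1] >= b[1] and a[2] >= b[2]
    strictly_better = a[1] > b[1] or a[2] > b[2]
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep every configuration that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


runs = [
    ("cfg_a", 0.80, 0.50),
    ("cfg_b", 0.75, 0.55),
    ("cfg_c", 0.74, 0.52),  # worse than cfg_b on both objectives
    ("cfg_d", 0.82, 0.45),
]
front = pareto_front(runs)
print([name for name, _, _ in front])
```

Comparing whole fronts rather than one hand-picked configuration per method is what allows methods to be judged at matched trade-offs.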

Variational Learning for Insertion-based Generation

arXiv cs.AI · Yangtian Zhang, Zhe Wang, Arthur Gretton, Rex Ying · 2026-06-01

The paper introduces the Insertion Process (IP), a probabilistic framework for learning insertion order in variable-length sequence generation. IP establishes a bijective mapping between insertion trajectories and permutations, enabling exact likelihood reparameterization via permutation-based variational inference. Unlike fixed-canvas approaches, IP jointly models insertion locations, content, and termination while supporting variable-length outputs. Experiments on goal-conditioned planning and molecular string generation show that learning insertion orders improves modeling quality and generalization in non-canonical domains.

insertion-based generation · variational inference · non-monotonic generation · permutation learning · variable-length modeling

Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

arXiv cs.AI · Liuji Chen, Dianxing Tang, Xing Shi, Dingshuo Chen · 2026-06-01

The paper introduces EAPO (Efficient Agentic Policy Optimization), a reinforcement learning framework that mitigates tool abuse in agentic systems by learning selective tool use. The method incorporates tool-free trajectories, difficulty-aware reward shaping to penalize redundant tool calls on easier queries, and confidence-aware token reweighting for policy improvement. Evaluated across nine mathematical and knowledge-intensive reasoning benchmarks on Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B, EAPO improves average performance by 7.27-10.45% while reducing tool calls by 18.33-24.59% compared to GRPO, demonstrating effective trade-offs between accuracy and efficiency.

agentic reinforcement learning · tool abuse · reward shaping · policy optimization · reasoning benchmarks
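
Sketch (hypothetical, not EAPO's actual reward): one way to read "difficulty-aware reward shaping" is to scale a per-call tool penalty by how easy the query looks. The difficulty proxy (failure rate of tool-free rollouts), the linear penalty, and the function names below are all assumptions; the summary does not give the real formula.

```python
from typing import Sequence


def estimate_difficulty(tool_free_correct: Sequence[bool]) -> float:
    """Hypothetical proxy in [0, 1]: share of tool-free rollouts that failed."""
    if not tool_free_correct:
        return 1.0  # no evidence: treat the query as hard
    return 1.0 - sum(tool_free_correct) / len(tool_free_correct)


def shaped_reward(correct: bool, n_tool_calls: int, difficulty: float,
                  penalty: float = 0.1) -> float:
    """Task reward minus a tool-call penalty that shrinks as difficulty grows."""
    base = 1.0 if correct else 0.0
    return base - penalty * (1.0 - difficulty) * n_tool_calls


easy = estimate_difficulty([True, True, True, False])  # mostly solvable without tools
hard = estimate_difficulty([False, False, False, False])
print(shaped_reward(True, 2, easy), shaped_reward(True, 2, hard))
```

Under this sketch, the same two tool calls cost reward on the easy query and nothing on the hard one.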

Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection

arXiv cs.AI · Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang · 2026-06-01

The paper proposes the Understanding-Enhanced Model Collaboration Method (UE-MCM) for detecting incorrect actions in long-tailed egocentric videos. The method combines a small-branch model (CLIP4CLIP encoder with Diffusion Contrastive Reconstruction) for coarse-grained workflow consistency and a large-branch model (Qwen3-VL Embedding) for fine-grained action correctness, fused via a lightweight collaboration gate. It addresses class imbalance through multi-objective optimization (reweighted cross-entropy, AUC-oriented learning, label-aware adjustment). The system achieves balanced speed-accuracy tradeoffs for detecting rare, subtle mistakes in instructional videos.

egocentric mistake detection · model collaboration · diffusion contrastive reconstruction · long-tailed learning · video understanding

How Hard Can It Be? Hardness-Aware Multi-Objective Unlearning

arXiv cs.AI · Jiangwei Chen, Xinyuan Niu, Rachael Hwee Ling Sim, Zhengyuan Liu · 2026-06-01

The paper introduces HAMU, a hardness-aware multi-objective unlearning algorithm that guarantees specified improvements in forget quality while minimizing retain utility degradation. The method quantifies task hardness via data similarity and formulates unlearning as constrained optimization, providing stopping criteria when objectives become irreconcilable. Evaluations on image and text datasets with large models show HAMU outperforms existing baselines in balancing forget/retain performance.

machine unlearning · constrained optimization · hardness measure · forget quality · retain utility

A Primer in Post-Training Reasoning Data: What We Know About How It Works

arXiv cs.AI · Yaoming Li, Guangxiang Zhao, Qilong Shi, Lin Sun · 2026-06-01

The paper presents a systematic synthesis of over 150 studies on post-training reasoning data, establishing the first primer for this emerging field. It organizes research around four core questions: data object taxonomy, utility determinants, construction methodologies, and scaling properties. The analysis provides an attribution framework for future reasoning-data releases and post-training optimization techniques, addressing the currently fragmented literature spanning dataset papers, reinforcement-learning recipes, and benchmark studies.

post-training · reasoning data · reinforcement-learning recipes · attribution framework · utility determinants

Jailbreaking Multimodal Large Language Models using Multi-Clip Video

arXiv cs.AI · Choongwon Kang, Seungjong Sun, Hyunmin Jun, Jang Hyun Kim · 2026-06-01

The study introduces Multi-Clip Video (MCV) SafetyBench, a dataset of 2,920 videos, to investigate how video input diversity affects multimodal large language model (MLLM) vulnerability to jailbreaking. Experiments on eight video MLLMs demonstrate that attack success rates increase with clip count, revealing three key vulnerabilities: video modality surpasses image modality in susceptibility, dynamic videos are more vulnerable than static ones, and diverse contexts heighten risk. The authors propose an image-modality-based defense strategy leveraging its relative robustness.

multimodal large language models · jailbreaking · video inputs · safety alignment · defense strategy

BADGER: Bridging Agentic and Deterministic Evaluation for Generative Enterprise Reasoning

arXiv cs.AI · Shannon Serrao, Soumitra Chatterjee, Dorina Strori, Abhishek Sharma · 2026-06-01

BADGER introduces a unified evaluation framework for enterprise AI systems combining text-to-SQL assessment with agentic behavior evaluation. The method extends Spider's SQL component extraction to handle CTE-heavy queries, proposes Hybrid-EX (a hybrid execution accuracy metric resolving column-aliasing issues via LLM structural alignment), and integrates RAGAS/G-Eval metrics into an agentic evaluation pipeline. On 150 industry queries, Hybrid-EX achieves Cohen's kappa=0.717 (substantial agreement) and 87.3% balanced accuracy, outperforming six baselines (Delta-kappa: 0.322-0.502, p≤0.001). The framework operates within governed data environments with configurable LLM judges.

text-to-sql · agentic evaluation · execution accuracy · llm-assisted · governed data
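
Sketch (toy data, not the paper's): the two agreement statistics quoted above, Cohen's kappa and balanced accuracy, can be computed as follows. The label vectors are invented; 1 means "query judged correct", 0 means "judged incorrect".

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1.0 - expected)  # undefined if expected == 1


def balanced_accuracy(truth: Sequence[int], pred: Sequence[int]) -> float:
    """Mean of per-class recall, so rare classes count as much as common ones."""
    recalls = []
    for cls in sorted(set(truth)):
        idx = [i for i, t in enumerate(truth) if t == cls]
        recalls.append(sum(pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


human = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]   # invented expert judgments
metric = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]  # invented automatic-metric verdicts
print(round(cohens_kappa(human, metric), 3), balanced_accuracy(human, metric))
```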

Network Distributed Multi-Agent Reinforcement Learning for Consensus Control of Quadcopters

arXiv cs.AI · Youssef Mahran, Zeyad Gamal, Aamir Ahmad, Ayman El-Badawy · 2026-06-01

The paper introduces Network Distributed Multi-Agent Reinforcement Learning (ND-MARL), a framework for quadcopter swarm consensus control that integrates swarm communication topology into decision-making. Using a 2-Neighbor communication topology and Multi-Agent Soft Actor-Critic (MASAC), agents observe only two neighbors' states to compute distributed actions. A hierarchical planner-tracker architecture achieves smooth consensus trajectories, with zero-shot scalability demonstrated on swarms up to 250 agents without retraining. Results show stable convergence but increased steady-state spread in large teams due to sparse information propagation.

nd-marl · multi-agent soft actor-critic · consensus control · 2-neighbor topology · zero-shot scalability
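
Sketch (a stand-in, not ND-MARL): the effect of a two-neighbor topology on consensus can be seen with plain neighbor averaging in place of the learned MASAC policy. The ring layout, the scalar state, the averaging rule, and the gain are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def ring_neighbors(n_agents: int) -> Dict[int, Tuple[int, int]]:
    """Each agent sees only its left and right neighbor on a ring (assumed layout)."""
    return {i: ((i - 1) % n_agents, (i + 1) % n_agents) for i in range(n_agents)}


def consensus_step(states: List[float], neighbors: Dict[int, Tuple[int, int]],
                   gain: float = 0.5) -> List[float]:
    """Move every agent part of the way toward the mean of its two neighbors."""
    updated = []
    for i, x in enumerate(states):
        left, right = neighbors[i]
        target = (states[left] + states[right]) / 2.0
        updated.append(x + gain * (target - x))
    return updated


def spread(states: List[float]) -> float:
    return max(states) - min(states)


def run(n_agents: int, n_steps: int = 50) -> List[float]:
    states = [float(i) for i in range(n_agents)]  # agents start spread out
    neighbors = ring_neighbors(n_agents)
    for _ in range(n_steps):
        states = consensus_step(states, neighbors)
    return states


small, large = run(6), run(60)
# Spread relative to the initial spread, after the same number of steps.
print(spread(small) / 5.0, spread(large) / 59.0)
```

With only local links, information needs more steps to cross a larger team, which gives one intuition for the larger residual spread the summary reports for big swarms.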

The Role of Ambiguity in Error Prediction via Uncertainty Quantification

arXiv cs.AI · Ieva Raminta Staliūnaitė, James Bishop, Andreas Vlachos · 2026-06-01

This paper improves error prediction in LLMs by disentangling input ambiguity from uncertainty quantification (UQ) signals. The authors analyze six UQ metrics on QA tasks, demonstrating that they correlate more strongly with errors on unambiguous questions than on ambiguous ones. They integrate gold and predicted ambiguity labels via Gated Experts and Selective Prediction, achieving gains of more than 10 points in Prediction Rejection Ratio (PRR) across model families, datasets, and uncertainty sources.

uncertainty quantification · error prediction · aleatoric uncertainty · gated experts · selective prediction

LALE: Lightweight-Transformer Architecture for Land-Cover Estimation

arXiv cs.AI · Ümit Mert Çağlar, Alptekin Temizel · 2026-06-01

LALE introduces a lightweight-transformer architecture for remote sensing image segmentation, combining ConvMixer stages for high-resolution local features with transformer stages for low-resolution global context to balance computational efficiency and performance. The model employs an all-MLP multi-scale decoder, RMSNorm, and StarReLU to minimize compute and parameters. On the ARAS400k benchmark, LALE's smallest variant (1.6M parameters) achieves within 2.6 F1 points of UPerNet while using 4.5x fewer parameters, 7x less storage, 17x fewer GMACs, and 1.8x higher throughput.

semantic segmentation · remote sensing · lightweight-transformer · convmixer · mlp decoder
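
Sketch (generic definitions, not LALE's code): RMSNorm and StarReLU are both cheap to compute, which fits the summary's emphasis on low compute. The plain-list implementation below is for illustration; the StarReLU scale and bias are the commonly cited initial values, and whether LALE uses them is an assumption.

```python
import math
from typing import List, Optional


def rms_norm(x: List[float], gain: Optional[List[float]] = None,
             eps: float = 1e-6) -> List[float]:
    """Divide by the root-mean-square of x (no mean subtraction), then rescale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain if gain is not None else [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]


def star_relu(x: List[float], scale: float = 0.8944,
              bias: float = -0.4472) -> List[float]:
    """StarReLU: scale * relu(x)^2 + bias, with scalar scale and bias."""
    return [scale * max(v, 0.0) ** 2 + bias for v in x]


features = rms_norm([3.0, 4.0])
print(features, star_relu([-1.0, 0.0, 1.0]))
```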

Agentic-J: An AI Agent for Biological Microscopy Image Analysis

arXiv cs.AI · Lukas Johanns, Marilin Moor, Davide Panzeri, Yu Zhou · 2026-06-01

Agentic-J introduces a containerised multi-agent AI system for automating biological microscopy image analysis in ImageJ/Fiji. The framework employs specialised sub-agents to handle plugin management, code generation, debugging, and statistical reporting, converting natural language task specifications into executable, documented workflows. Demonstrated applications include nuclei segmentation, cell tracking, and multi-condition quantification, with all analysis decisions preserved for reproducibility. The technical implementation emphasizes traceability through script generation within structured projects.

multi-agent system · imagej/fiji · biological microscopy · workflow automation · reproducible analysis

Fast and Lightweight Novel View Synthesis with Differentiable Multiplane Image

arXiv cs.AI · Kaidi Zhang, Guanxu Zhu · 2026-06-01

The paper proposes a fast and lightweight novel view synthesis method built on a differentiable Multiplane Image (MPI) representation, addressing limitations of Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) in rendering speed, model size, and sparse-view performance. The approach leverages visual foundation models for geometric initialization, employs differentiable optimization, and introduces one-step diffusion to mitigate holes and artifacts in sparsely initialized MPIs. Compared to a representative GS-based method, it renders 30.7% faster, uses only 14.8% of the model size, and maintains competitive synthesis quality in front-view scenarios.

novel view synthesis · multiplane image · differentiable optimization · sparse-view · one-step diffusion

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

arXiv cs.AI · Jiaming Wang, Ziteng Feng, Jiangtao Wu, Ruihao Li · 2026-06-01

The paper introduces TELBench, a 1,000-instance benchmark for span-level error localization in deep-research agent trajectories, and DRIFT, a claim-centric auditing framework. The authors analyze 2,790 trajectories from two agent frameworks, three backbone models, and three benchmarks, annotating harmful error spans via LLM-assisted expert review. DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points across model families and auditing frameworks, offering a process-level reliability assessment.

span-level error localization · deep-research agents · claim-centric auditing · llm-assisted review · trajectory analysis

eMoT: evolving Memory-of-Thought via Symbolic Anchoring and Memory Corrosion

arXiv cs.AI · Xiang Li, Jiwei Wei, Ke Liu, Yitong Qin · 2026-06-01

The paper introduces eMoT (evolving Memory-of-Thought), a framework enhancing multi-step reasoning in LLMs by treating reasoning trajectories as dynamic memories rather than static processes. Key components include: (i) memory corrosion for reinforcing high-utility reasoning structures, (ii) symbolic anchoring via Python for deterministic computation, and (iii) consistency-driven refinement to align neural and symbolic outputs. Evaluations on Game of 24 show 100% accuracy (17.6% over baseline), with consistent gains on GSM8K, ASDiv, SVAMP, and MGSM, demonstrating performance improvements independent of model scale.

memory-of-thought · symbolic anchoring · memory corrosion · multi-step reasoning · consistency-driven refinement

Explainable Data-driven Deep Reinforcement Learning Methods for Optimal Energy Management in Buildings

arXiv cs.AI · Hallah Shahid Butt, Qiong Huang, Gökhan Demirel, Kevin Förderer · 2026-06-01

The paper proposes an explainable deep reinforcement learning (XRL) framework for energy management in residential buildings with renewable integration. The method employs both on-policy (A2C, PPO) and off-policy DRL agents trained on an expanded state space including real-time measurements, external signals, and forecasts. Experimental results on synthetic and real-world data (LLEC at KIT) show on-policy methods outperform off-policy approaches in cumulative rewards and policy stability, while post-hoc interpretation provides actionable insights into the learned control policies.

explainable reinforcement learning · energy management · proximal policy optimization · post-hoc interpretation · renewable integration

Topological texture analysis of microscopy images of dynamic casein gelation and its relation to rheological properties

arXiv cs.AI · Zahra Tabatabaei, Diana Soto Aguilar, Jose C. Bonilla, Mathias P. Clausen · 2026-06-01

The study introduces a computational toolbox combining Topological Data Analysis (TDA), Differential Box Counting (DBC), Multifractal Partition (MFP), and Local Binary Patterns (LBP) to analyze sodium caseinate gelation dynamics via STED microscopy. TDA-derived max-Betti-1 curves tracked topological loops, revealing distinct gelation phases (lag, percolation, rearrangement) aligned with rheological transitions, while DBC and MFP quantified structural complexity. Validation on simulated fractal images confirmed method robustness. The approach resolves microstructural transitions imperceptible to bulk rheology, offering a quantitative framework for food and material science. Code is publicly available.

topological data analysis · differential box counting · multifractal partition · local binary patterns · gelation dynamics

Attention mechanisms and transfer learning for robust peach leaf damage classification under domain shift

arXiv cs.AI · Adrián Cánovas-Rodriguez, Miguel A. González-Illán, Maria Fernanda García-Cruz, Pedro Nortes Tortosa · 2026-06-01

The study proposes an attention-enhanced deep learning approach for peach leaf damage classification under domain shift, addressing challenges from climate-induced symptom overlap. Using a manually annotated dataset of 1,366 leaves across six categories, EfficientNet architectures (B0-B5) achieved 91.5-93.3% accuracy, with CBAM integration improving performance in EfficientNetB5 and InceptionV3. Transfer learning on a local 180-image dataset demonstrated CBAM-EfficientNetB3's robustness (93% macro F1-score), particularly for minority classes. The method combines computer vision with agricultural informatics to enable cross-domain generalization.

efficientnet · cbam · domain shift · transfer learning · macro f1-score
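
Sketch (toy labels, not the paper's data): macro F1 averages per-class F1 with equal weight, which is why it is more sensitive to minority classes than plain accuracy. The class names and predictions below are invented.

```python
from typing import List


def macro_f1(truth: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in sorted(set(truth) | set(pred)):
        tp = sum(t == cls and p == cls for t, p in zip(truth, pred))
        fp = sum(t != cls and p == cls for t, p in zip(truth, pred))
        fn = sum(t == cls and p != cls for t, p in zip(truth, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Eight majority-class leaves and two minority-class leaves (invented).
truth = ["healthy"] * 8 + ["hail"] * 2
pred = ["healthy"] * 8 + ["healthy", "hail"]

accuracy = sum(t == p for t, p in zip(truth, pred)) / len(truth)
print(accuracy, round(macro_f1(truth, pred), 4))
```

One missed minority-class leaf barely moves accuracy but pulls macro F1 down noticeably.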

RL-ACRGNet: Reinforcement Learning-Based Chest Radiology Report Generation Network

arXiv cs.AI · Yogesh Kumar Meena, Saurabh Agarwal, K. V. Arya · 2026-06-01

The paper proposes RL-ACRGNet, a reinforcement learning-based model for automated chest radiology report generation. The method combines a DenseNet encoder with a multilevel LSTM decoder in an off-policy RL framework, using dual networks to refine visual-semantic embeddings via metric-based rewards. On IU-Xray, it achieves improvements of 0.47% BLEU-4, 0.17% METEOR, and 0.518 ROUGE-L over baselines, with additional validation on MIMIC-CXR demonstrating robust generalization for clinically relevant reports.

reinforcement learning · densenet · lstm decoder · visual-semantic embeddings · metric-based reward

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

arXiv cs.AI · Rui Yang, Qianhui Wu, Yuxi Chen, Hao Bai · 2026-06-01

OpenWebRL introduces an open framework for training visual web agents using online multi-turn reinforcement learning (RL) on live websites, addressing scalability bottlenecks in supervised post-training approaches. The framework includes scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. OpenWebRL-4B, trained with only 0.4K initialization trajectories and 2.2K RL tasks, achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents and remaining competitive with proprietary systems. The study systematically analyzes design choices for effective online RL and demonstrates improved agentic reasoning, offering a reproducible and cost-efficient path for web agent development.

visual web agents · online reinforcement learning · multimodal context management · trajectory-level success judging · policy optimization

Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association

arXiv cs.AI · Matvei Shelukhan, Timur Mamedov, Aleksandr Chukhrov, Karina Kvanchiani · 2026-06-01

The paper identifies a metric mismatch between pairwise ranking metrics (AP, FPR-95) and assignment objectives in multi-view object association. It theoretically demonstrates that optimal ranking metrics can yield incorrect assignments, while correct assignments may not maximize ranking metrics. The authors propose Sinkhorn-based normalization as a post-processing stress test, showing that tuning a few parameters improves AP and FPR-95 without enhancing assignment accuracy (ACC, IPAA).

multi-view object association · pairwise ranking metrics · sinkhorn normalization · assignment objective · metric mismatch
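
Sketch (one possible reading, not the authors' argument): Sinkhorn normalization only rescales rows and columns of exp(score / tau). In log space that adds one constant per row and one per column, so the best one-to-one assignment stays the same while the ordering of individual pairs can change. The score matrix, the temperature, and the brute-force matcher below are assumptions for illustration; the paper's ACC and IPAA metrics may be computed differently.

```python
import math
from itertools import permutations
from typing import List, Tuple

Matrix = List[List[float]]


def sinkhorn(scores: Matrix, tau: float = 0.1, n_iters: int = 200) -> Matrix:
    """Alternately normalize rows and columns of exp(scores / tau) (square input)."""
    p = [[math.exp(s / tau) for s in row] for row in scores]
    n = len(p)
    for _ in range(n_iters):
        for i in range(n):
            total = sum(p[i])
            p[i] = [v / total for v in p[i]]
        col_sums = [sum(p[i][j] for i in range(n)) for j in range(n)]
        p = [[p[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return p


def best_assignment(matrix: Matrix) -> Tuple[int, ...]:
    """Brute-force the one-to-one assignment with the highest total score."""
    n = len(matrix)
    return max(permutations(range(n)),
               key=lambda perm: sum(matrix[i][perm[i]] for i in range(n)))


scores = [[0.90, 0.80, 0.10],
          [0.85, 0.30, 0.20],
          [0.20, 0.25, 0.60]]
normalized = sinkhorn(scores)
log_normalized = [[math.log(v) for v in row] for row in normalized]

# Same optimal assignment before and after normalization.
print(best_assignment(scores), best_assignment(log_normalized))
# Pair (0, 0) outranks pair (0, 1) in the raw scores, but not after normalization.
print(scores[0][0] > scores[0][1], normalized[0][0] > normalized[0][1])
```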

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

arXiv cs.AI · Ekaterina Alimaskina, Darya Rudas, Denis Shveykin, Gleb Molodtsov · 2026-06-01

The paper demonstrates that 2-bit quantization in Large Reasoning Models (LRMs) like Qwen3 leads to generation instability, manifesting as repetitive loops, budget exhaustion, and unclosed reasoning segments, which inflate token counts and degrade accuracy. To mitigate these failures, the authors propose FP16 planning, providing high-precision outlines, and loop rescue, which detects repetitive traces and selectively falls back to FP16. On MATH-500, these methods improve Qwen3-8B accuracy from 17.2% to 74.2% and Qwen3-32B from 65.0% to 87.2%, showing that lightweight controls enable practical 2-bit inference while preserving speed.

quantization · reasoning models · low-bit inference · generation instability · lightweight controls
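
Sketch (hypothetical, not the paper's detector): a loop-rescue control could look for a short token pattern repeated back to back at the end of a trace and, if found, rerun the prompt in higher precision. The detection rule, the thresholds, and the two generator callables are assumptions; the summary does not specify how repetition is detected or how the fallback is scoped.

```python
from typing import Callable, List, Tuple


def has_repetitive_loop(tokens: List[str], max_period: int = 8,
                        min_repeats: int = 3) -> bool:
    """True if the trace ends with some pattern of length <= max_period
    repeated at least min_repeats times in a row."""
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if len(tokens) < span:
            continue
        tail = tokens[-span:]
        if all(tail[i] == tail[i % period] for i in range(span)):
            return True
    return False


def generate_with_rescue(prompt: str,
                         low_bit_generate: Callable[[str], List[str]],
                         fp16_generate: Callable[[str], List[str]]
                         ) -> Tuple[List[str], bool]:
    """Use the low-bit trace unless it loops; then fall back to FP16."""
    tokens = low_bit_generate(prompt)
    if has_repetitive_loop(tokens):
        return fp16_generate(prompt), True
    return tokens, False


# Stub generators standing in for real model calls.
looping = lambda _p: "so x = 2 wait x = 2 wait x = 2 wait".split()
clean = lambda _p: "so x = 2 hence the answer is 2".split()

print(generate_with_rescue("q", looping, clean))
```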

PlanarBench: Evaluating LLM Spatial Reasoning via Planar Graph Drawing

arXiv cs.AI · Oleksandr Nikitin · 2026-06-01

PlanarBench introduces a novel benchmark for evaluating large language models (LLMs) on spatial reasoning tasks via planar graph drawing. The task requires LLMs to generate ASCII art representations of planar graphs given only an edge list, resisting memorization due to permutable edge order, orientation, and node labels. The benchmark evaluates 91 models on 199 simplest non-isomorphic connected planar graphs with 2-7 vertices, identifying edge count as the dominant difficulty predictor (r = -0.85), a novel finding compared to prior graph benchmarks focusing solely on node count.

planar graphs · spatial reasoning · ascii art · edge list · non-isomorphic
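
Sketch (an illustration, not the benchmark's generator): the memorization resistance described above can be pictured as producing many surface variants of one graph by relabeling nodes, flipping how each edge is written, and shuffling edge order. The function names and the example graph are assumptions; the benchmark's notion of "orientation" may differ from the edge flipping used here.

```python
import random
from collections import Counter
from typing import List, Tuple

Edge = Tuple[int, int]


def permute_instance(edges: List[Edge], seed: int = 0) -> List[Edge]:
    """Return the same graph with new labels, edge directions, and edge order."""
    rng = random.Random(seed)
    nodes = sorted({v for edge in edges for v in edge})
    shuffled = nodes[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(nodes, shuffled))
    variant = []
    for u, v in edges:
        a, b = relabel[u], relabel[v]
        if rng.random() < 0.5:  # write the undirected edge the other way round
            a, b = b, a
        variant.append((a, b))
    rng.shuffle(variant)
    return variant


def degree_sequence(edges: List[Edge]) -> List[int]:
    """Sorted node degrees; unchanged by any relabeling of the same graph."""
    return sorted(Counter(v for edge in edges for v in edge).values())


edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # a 4-cycle plus one chord
variant = permute_instance(edges, seed=7)
print(variant, degree_sequence(variant))
```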

Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization

arXiv cs.AI · Jingyun Liang, Min Wei, Shikai Li, Yizeng Han · 2026-06-01

The paper introduces a render-free framework for 3D-aware video diffusion models that conditions generation directly on compressed 3D human mesh tokens, avoiding reliance on 2D motion guidance. The method employs a DiT-based architecture to jointly process video tokens and motion tokens, enforcing reasoning about appearance, 3D structure, and camera viewpoint. Experiments show improved performance on human motion control benchmarks, with reduced artifacts from view-dependent 2D guidance and trajectory-pose mismatches during editing.

video diffusion models · 3d-aware generation · mesh tokenization · human motion control · dit-based architecture

Why Do Time Series Models Need Long Context Windows?

arXiv cs.AI · Luca Butera, Giovanni De Felice, Andrea Cini, Cesare Alippi · 2026-06-01

The paper demonstrates that forecasting groups of time series involves two objectives: generative process identification (GPI) and conditional forecasting (CF). It proves that input window sizes exceeding the memory length P are necessary to minimize prediction error, as longer contexts reduce uncertainty about the generating process. The authors propose decoupling GPI and CF to enhance computational scalability without accuracy loss. Experiments on synthetic and real-world datasets validate these insights, informing the design of forecasting architectures.

generative process identification · conditional forecasting · memory length · time series · forecasting architectures

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

arXiv cs.AI · Xinyu Che, Junqi Xiong, Yunfei Ge, Xinping Lei · 2026-06-01

The paper introduces MMG2Skill, a framework for converting in-the-wild multimodal guides into executable agent skills through a closed-loop process of skill compilation, VLM-based execution, and trajectory-driven revision. The method addresses challenges of noisy, heterogeneous web knowledge by structuring skills as editable programs and using root-cause feedback for iterative improvement without benchmark scores. Evaluated on MMG2Skill-Bench across GUI control, gameplay, and card tasks with six VLMs, the framework achieves +12.8 to +25.3 pp gains over baselines, demonstrating the necessity of both structured skill construction and revision mechanisms.

guide-to-skill learning · vision-language model · trajectory-driven revision · multimodal procedural knowledge · closed-loop skill improvement

A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision

arXiv cs.AI · Stefano Samele, Eugenio Lomurno, Teodora Jovanovic, Sanjay Shivakumar Manohar · 2026-06-01

The paper introduces Text-Guided Anomaly Detection (TGAD), a structured benchmark designed to evaluate the functional role of language in multimodal anomaly detection systems. TGAD comprises three scenarios: prompt-sensitivity testing on MVTec AD, component-tagged assessment, and the new Assembled Panel Dataset (APD) requiring defect-type and component-location knowledge. Evaluations of three model paradigms (generative large vision-language, training-free discriminative, embedding-adaptive discriminative) reveal superficial language conditioning: performance drops significantly when textual guidance is altered or removed (e.g., I-AUROC from 97.4 to 82.6). Results indicate current benchmarks overstate text-guided capabilities, necessitating stricter protocols for industrial deployment.

anomaly detection · multimodal models · text-guided · benchmarking · industrial inspection

SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning

arXiv cs.AI · Lichao Wang, Zhaoxing Ren, Tianzhuo Yang, Jiaming Ji · 2026-06-01

SafeMCP introduces a server-side defense plugin for LLM agents operating via the Model Context Protocol (MCP), addressing risks of power-seeking behaviors in expanded action spaces. The method employs a two-tier defense: proactive tool filtering using an internal world model for look-ahead reasoning and immediate intervention as a fail-safe. Training involves a three-stage pipeline of environmental dynamic grounding, safe policy initialization, and reinforcement learning with dual verifiable rewards. Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm demonstrate SafeMCP's effectiveness in mitigating risks while maintaining agent utility.

model context protocol · look-ahead reasoning · proactive tool filtering · reinforcement learning · power-seeking

An NLP-Driven Framework for Curriculum-Labor Market Alignment: Schema-Constrained LLM Extraction, ESCO-Anchored Semantic Matching, and Multi-Dimensional Gap Quantification

arXiv cs.AI · Sherzod Turaev, Mary John, Mamoun Awad, Nazar Zaki · 2026-06-01

The paper introduces a four-stage NLP framework for curriculum-labor market alignment, addressing schema-constrained extraction and semantic matching challenges. The method combines schema-constrained LLM prompting (using a two-model ensemble), Sentence-BERT alignment with ESCO v1.2.1 vocabulary, adjudication protocols, and verification via Cohen's kappa and completeness audits. Applied to a BSc Computer Science program, the framework extracts 400 competency records from 85 courses, aligning them with 30 job postings (483 clauses) at SBERT cosine 0.50. Results show 0.79 kappa for skill extraction, 100% schema conformance, and quantified competency gaps (e.g., 25.0% in transversal skills, 1.8% in AI/data science).

schema-constrained extraction · sentence-bert · esco taxonomy · competency gap quantification · llm ensemble
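
Sketch (toy vectors, not real embeddings): the matching step can be pictured as cosine similarity with a cut-off, where a competency counts as covered only if its best-matching clause reaches the threshold. The 0.50 threshold comes from the summary; the three-dimensional vectors, the names, and the best-match rule are assumptions standing in for Sentence-BERT embeddings and the paper's actual protocol.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match(competencies: Dict[str, Vector], clauses: Dict[str, Vector],
          threshold: float = 0.50) -> Dict[str, Optional[Tuple[str, float]]]:
    """Map each competency to its best clause, or None if below the threshold."""
    result: Dict[str, Optional[Tuple[str, float]]] = {}
    for name, vec in competencies.items():
        best_clause = max(clauses, key=lambda c: cosine(vec, clauses[c]))
        score = cosine(vec, clauses[best_clause])
        result[name] = (best_clause, score) if score >= threshold else None
    return result


competencies = {"sql": [1.0, 0.0, 0.0], "ethics": [0.0, 0.0, 1.0]}
clauses = {"database experience": [0.9, 0.1, 0.0],
           "python scripting": [0.0, 1.0, 0.2]}
matches = match(competencies, clauses)
print(matches)
```

Entries that come back as `None` are the candidates for a gap in this toy setup.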

Algorithmic algorithm development with LLMs: A Case Study on LLM-Usage for Contraction Order Optimization in Tensor Networks

arXiv cs.AI · Fabian Hoppe, Melven Röhrig-Zöllner, Philipp Knechtges · 2026-06-01

The study demonstrates LLM-assisted algorithm development for tensor network contraction order optimization using OpenEvolve, emphasizing LLM selection and evaluation design. The method employs verifier-guided evolutionary coding agents to iteratively improve algorithms, with human validation remaining crucial. Results indicate promise for LLM-aided algorithmic improvement while underscoring persistent challenges in evaluation and interpretation.

tensor networks · contraction order optimization · llm-assisted programming · evolutionary coding agents · algorithmic improvement

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

arXiv cs.AI · Junqi Liu, Salena Song, Yuhan Wang, Jiawei Mao · 2026-06-01

The paper introduces AutoMedBench, a workflow-aware benchmark for evaluating autonomous medical-AI research agents across five stages (Plan, Setup, Validate, Inference, Submit) and five research tracks (segmentation, image enhancement, VQA, report generation, lesion detection). The benchmark features long-horizon tasks (avg. 33 agent turns per run) with two difficulty tiers (Lite/Standard) and provides both final task performance and stage-level scores. Results from thousands of runs show Validate as the weakest stage (37.7% error rate) and Setup as the strongest, with verification/submission failures dominating errors (76.2% combined) and single-error runs scoring 48% lower than error-free ones.

autonomous agents · medical-ai · workflow-aware benchmark · multimodal inference · error analysis

Rank-Constrained Deep Matrix Completion for Group Recommendation

arXiv cs.AI · Mubaraka Sani Ibrahim, Lehel Csató, Isah Charles Saidu · 2026-06-01

The authors propose Group Rank-Constrained Deep Matrix Completion (Group RC-DMC), a group recommendation framework combining low-rank matrix completion with attention-based aggregation. The method enforces explicit low-rank regularization via nuclear-norm proximal steps, uses a Set-Transformer for group-level representation, and employs a low-rank factorization decoder. Evaluated on MovieLens and Goodbooks datasets, Group RC-DMC achieves superior group RMSE and competitive precision/recall/F1 scores compared to weighted-before-factorization and after-factorization baselines, demonstrating robust performance across group sizes.

group recommendation · matrix completion · low-rank regularization · set-transformer · nuclear-norm proximal
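
For reference, the proximal operator of the nuclear norm has a standard closed form, singular value soft-thresholding. The symbols below (X, τ, U, σ, V) are generic notation rather than the paper's, and how Group RC-DMC schedules these steps during training is not stated in the summary.

```latex
X = U \,\mathrm{diag}(\sigma_1, \dots, \sigma_r)\, V^\top
\quad\Longrightarrow\quad
\operatorname{prox}_{\tau \|\cdot\|_*}(X)
  = \arg\min_{Z} \; \tfrac{1}{2}\|Z - X\|_F^2 + \tau \|Z\|_*
  = U \,\mathrm{diag}\big(\max(\sigma_i - \tau,\, 0)\big)\, V^\top
```

Singular values below τ are set to zero, which is how each step pushes the completed matrix toward low rank.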

Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks

arXiv cs.AI · Nermeen Abou Baker, David Rohrschneider, Uwe Handmann · 2026-06-01

This work evaluates parameter-efficient fine-tuning (PEFT) methods for transformer-based instance segmentation models, focusing on adapters and Low-Rank Adaptation (LoRA). The study introduces sequential adapter modules and applies LoRA to deformable attention—a novel approach—achieving competitive performance while fine-tuning only 1-6% of parameters versus 40-55% in full fine-tuning. Results show 2-3 adapters per transformer block optimize performance-efficiency trade-offs, with LoRA on deformable attention sometimes outperforming adapters. Findings highlight PEFT's dataset- and architecture-dependent efficacy, enabling scalable transfer learning for instance segmentation.

parameter-efficient fine-tuning · adapters · low-rank adaptation · instance segmentation · deformable attention

VET: A Framework for Analyzing AI Discourse

arXiv cs.AI · Meredith Ringel Morris · 2026-06-01

The VET Framework is introduced as a method for categorizing AI discourse along three dimensions: valence, effectiveness, and trajectory. This framework enables the identification, comparison, and critique of prevalent narratives—AI Hype, AI Doom, AI Denial, and AI Normalcy—by analyzing how each stance exaggerates aspects of AI's current state or likely evolution. The VET Framework serves as an AI Literacy tool, facilitating the vetting of polarized AI discourse in traditional and social media.

vet framework · ai discourse · ai literacy · valence · trajectory

SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes

arXiv cs.AI · Kuan Li, Shuo Zhang, Huacan Wang, Fangzhou Yu · 2026-06-01

The authors introduce SMH-Bench, a benchmark for evaluating LLM agents in smart-home environments, addressing limitations of existing static benchmarks. Built on the HomeEnv simulator, it comprises 1,100 tasks across 7 categories and 22 subcategories, stratified by home complexity (135 devices max). Experiments reveal frontier LLMs perform well on explicit control but struggle with task scheduling (38% accuracy drop in complex homes), ambiguity resolution, and personalized reasoning. The benchmark enables evaluation of context-aware reasoning in realistic multi-device scenarios.

smart-home benchmark · llm agents · environment-grounded reasoning · task scheduling · personalized reasoning

Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space

arXiv cs.AI · Louis Mouchon · 2026-06-01

Echo introduces a joint-embedding predictive architecture (JEPA) for speaker diarization, speech recognition, and dynamic source separation using a shared 25M-parameter ViT encoder. The system employs a 512-dimensional latent space pretrained with JEPA and specialized via staged training, avoiding per-task fine-tuning. Lightweight heads handle diarization (ArcFace + VBx) and separation (null-target K-set prediction). Evaluated on synthetic VoxCeleb2 mixtures, it achieves 15.00% DER, 97.80% separation accuracy (+9.52dB SI-SDR), and +53.50 speaker/content factorization gap. The work highlights multi-task coexistence limitations, particularly the VQ bottleneck for end-to-end ASR.

joint-embedding predictive architecture · speaker diarization · dynamic source separation · vit encoder · latent space

Bayesian Spectral Emotion Transition Discovery from Multi-Annotator Disagreement

arXiv cs.AI · Keito Inoshita, Takato Ueno · 2026-06-01

The authors propose Bayesian Spectral Emotion Transition Discovery (BSETD), a two-stage framework for discovering emotion-transition structure from multi-rater soft labels. First, a hierarchical Dirichlet-Multinomial posterior constructs a K x K transition matrix with credible intervals and Benjamini-Hochberg FDR-controlled significance. Second, spectral decomposition of the symmetrized graph Laplacian separates low-frequency (inertia) and high-frequency (contagion) components. On EmotionLines, BSETD identifies distinct affective spaces, recovering Plutchik-adjacent transitions (e.g., disgust to anger, log2 lift +0.94) and Russell-valence-reversed transitions (e.g., joy to anger, -0.90). Cross-corpus validation shows Pearson correlations of 0.91-0.98 within English and 0.79-0.85 against Chinese M3ED, suggesting that preserving annotator uncertainty helps bridge emotion dynamics with psychological theory.

bayesian spectral emotion transition discovery · hierarchical dirichlet-multinomial · benjamini-hochberg fdr · symmetrized graph laplacian · emotionlines
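
The Benjamini-Hochberg step named in the summary can be sketched as follows. This is the standard step-up procedure only; the p-values are made up for illustration, and how BSETD derives its per-transition significance values is not described in the summary.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values, one per candidate emotion transition.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
significant = benjamini_hochberg(p_values, alpha=0.05)
```

With these eight values only the two smallest p-values pass, because the third already exceeds its rank-scaled threshold of 3/8 × 0.05.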

KliniskVestBERT: BERT Model Specialised to Norwegian Clinical Texts

arXiv cs.AI · Christian Autenried, Cosimo Persia · 2026-06-01

The paper introduces KliniskVestBERT, a suite of three Norwegian clinical BERT models (based on Nb-BERT-large, NorBERT3-large, and ModernBERT) through continued pretraining on a de-identified corpus of diverse clinical texts from Helse Vest. The dataset includes discharge summaries, surgical reports, and nursing notes in both bokmål and nynorsk, representing a broad clinical spectrum. Evaluations on three synthetic benchmarks and two real-world tasks show consistent performance gains over baseline models, demonstrating the value of clinical domain adaptation for Norwegian NLP applications.

bert-based models · domain adaptation · clinical nlp · norwegian language · continued pretraining

The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue

arXiv cs.AI · Sherzod Hakimov, Mattia D'Agostini, Ivan Samodelkin, David Schlangen · 2026-06-01

The paper introduces the Image Reconstruction Game, an automated benchmark where a vision-language model provides iterative corrective instructions to an image generator, enabling direct observation of accumulated common ground through rendered images. Evaluating two Describer and two Generator models across seven image categories reveals that describer quality dominates reconstruction performance, while generator choice determines the efficacy of iterative refinement. Key findings include: mathematical/geometric images are most challenging, token budget affects convergence dynamics, and stronger describers employ richer correction vocabularies. Human validation shows automated judges achieve only slight-to-fair agreement with human preferences, requiring recalibration for reliable use.

vision-language model · iterative refinement · common ground · image reconstruction · automated benchmarking

RA-LWLM: Retrieval-Augmented In-Context Localization with Wireless Foundation Models

arXiv cs.AI · Guangjin Pan, Hui Chen, Hei Victor Cheng, Henk Wymeersch · 2026-06-01

RA-LWLM introduces a retrieval-augmented in-context localization framework for 6G networks, enabling training-free cross-scene adaptation by externalizing scene-specific information into a per-scene fingerprint database. The framework comprises a frozen wireless foundation model encoder, a retrieval module for similarity search, and a transformer-based in-context learning module with a mixture-of-experts design. Extensive ray-tracing experiments demonstrate that RA-LWLM achieves consistent accuracy across seen and unseen scenes without retraining, outperforming end-to-end and foundation model-based baselines. This validates the retrieval-augmented in-context paradigm as a scalable solution for 6G localization.

retrieval-augmented · in-context learning · wireless foundation model · mixture-of-experts · cross-scene adaptation

Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation

arXiv cs.AI · Tianjiao Li, Kai Zhao, Xiang Li, Yang Liu · 2026-06-01

The authors propose CASTER, a novel task for assessing community resonance in User-Generated Content (UGC) by evaluating multimodal engagement rather than visual fidelity alone. To address it, they introduce MEDEA, a Multimodal Engagement-Driven Evaluation Architecture featuring a Social Chain-of-Thought (Social-CoT) mechanism. Social-CoT simulates diverse viewer personas to model collective cognitive and emotional reactions before deriving quality judgments. MEDEA is trained via supervised fine-tuning and process-supervised reinforcement learning with a Social Alignment Reward. Evaluated on CASTER-Bench, MEDEA outperforms state-of-the-art baselines while providing interpretable reasoning paths aligned with real community feedback.

user-generated content · multimodal engagement · social chain-of-thought · process-supervised reinforcement learning · community resonance

Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection

arXiv cs.AI · Atmika Bhardwaj, Silvia Vock, Nico Steckhan · 2026-06-01

The study demonstrates that generative inpainting of hand regions with accessories (gloves, tattoos, jewelry) can mitigate distribution shifts in hand detection for occupational safety applications. Using YOLOv8n, six training regimes (Experiments A-F) were evaluated on real and synthetic-augmented datasets, with three random seeds each. A two-stage approach (pretraining on real+synthetic, fine-tuning on real-only) improved mAP@0.5 on the standard test set and reduced the real-gloves out-of-distribution gap. A three-stage method achieved the highest mAP@0.5:0.95, preserving box-tightness. Training procedure critically determines synthetic-data utility.

generative inpainting · distribution shift · yolov8n · mean average precision · occupational safety

Collaborative Space Object Detection with Multi-Satellite Viewpoints in LEO Constellations

arXiv cs.AI · Xingyu Qu, Wenxuan Zhang, Peng Hu · 2026-06-01

The paper proposes a multi-viewpoint fusion approach for space object detection (SOD) in low Earth orbit (LEO) constellations, leveraging deep learning to enhance detection accuracy under onboard constraints. A multi-view pipeline and input representations are designed for YOLO-based detectors, integrating RGB and grayscale data from multiple satellite viewpoints. Experiments demonstrate significant improvements in detection performance: YOLOv9-m achieves mAP50 increases from 0.638 to 0.732 and mAP50-95 from 0.227 to 0.276 in three-view RGB settings, with grayscale configurations yielding 36.3% and 46.5% improvements, respectively. These results validate multi-view fusion as an effective strategy for SOD in LEO environments.

space object detection · multi-viewpoint fusion · low earth orbit · yolo-based detectors · map50

Physically-Constrained Mamba-SDE for Remaining Useful Life Prediction under Irregular Observations

arXiv cs.AI · Deyu Zhuang, Peiliang Gong, Yang Shao, Liyuan Shu · 2026-06-01

PC-MambaSDE, a physically-constrained continuous-time framework, addresses Remaining Useful Life prediction under irregular sensor observations. It integrates a Mask-Aware Continuous Mamba Encoder for context-rich control signals and a Physics-Guided Latent SDE with parametrically rectified hybrid drift to enforce monotonic degradation. The framework formulates RUL prediction as a boundary value problem using a Terminal Degradation Penalty. Theoretical guarantees include variational objective equivalence to KL divergence minimization via Girsanov's theorem and global asymptotic stability via Lyapunov analysis. Evaluated using a Hybrid Irregularity Generation Scheme, PC-MambaSDE outperforms state-of-the-art methods on public benchmarks, especially under extreme observation scarcity.

remaining useful life · latent sde · girsanov's theorem · lyapunov analysis · hybrid drift

Absorbing Complexity: An Interaction-Native Knowledge Harness for Financial LLM Agents

arXiv cs.AI · Ailiya Borjigin, Igor Stadnyk, Ben Bilski, Maksym Chikita · 2026-06-01

The interaction-native knowledge harness (InKH) is proposed as an architecture for financial LLM agents to absorb complexity by converting user, market, portfolio, and tool events into structured operational knowledge. InKH employs passive knowledge injection, a bounded working context buffer, temporal graph memory, wiki audit surface, and background extraction with maturity, decay, and write-time invalidation. Evaluated on a synthetic benchmark with 46,080 baseline-conditioned evaluations, InKH achieves a mean task quality of 0.815 at 900 ms latency, reducing latency by 82.95%, token cost by 82.29%, and stale-knowledge usage by 96.58%, while improving quality by 0.108 and traceability by 0.461 compared to agent-driven wiki-walk memory.

temporal graph memory · passive knowledge injection · wiki audit surface · background extraction · stale-knowledge usage

EVA-Net: Subject-Independent EEG Motor Decoding with Video-Derived Motor Priors

arXiv cs.AI · Ziyuan Li, Yueyu Sun, Yimeng Zhang · 2026-06-01

EVA-Net introduces a two-stage framework for subject-independent EEG motor decoding by leveraging action videos as semantic priors. The method first aligns EEG and video features in a shared space using cross-modal and supervised contrastive objectives to reduce subject-specific variation. It then transfers video-derived priors to an EEG-only classifier via video category prototypes and knowledge distillation, maintaining inference efficiency. Experiments on EEGMMI and another public dataset demonstrate an 8.66% improvement in leave-one-subject-out (LOSO) accuracy compared to baselines, highlighting video's superiority over text as a semantic anchor for dynamic motor processes.

eeg motor decoding · subject-independent · semantic priors · cross-modal alignment · knowledge distillation

WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis

arXiv cs.AI · Shuo Lu, Yinuo Xu, Kecheng Yu, Siru Jiang · 2026-06-01

The paper introduces WorldCoder-Bench, a benchmark for evaluating LLMs on physically grounded 3D world synthesis via Three.js, comprising 2,026 tasks across Simulation, Rendering, and Application scenarios. It proposes StateProbe, an execution-based verification protocol that checks runtime states and transitions in sandboxed browsers, reporting metrics like verification coverage, Return on Automation, and Time Efficiency Multiplier. Evaluations on nine frontier models show the best system achieves only 27.8% coverage on WorldCoder-Core and 19.9% on WorldCoder-Robust, with failures primarily due to state-schema drift and broken interaction chains.

3d world synthesis · three.js · execution-based verification · state-schema drift · interaction chains

RadioMaster: Multi-Agent System for Autonomous Radio Signal Generation

arXiv cs.AI · Jiazhen Lei, Tianze Cao, Yuxin Sha, Sihan Wang · 2026-06-01

RadioMaster introduces a multi-agent system for autonomous radio signal generation, addressing LLMs' limitations in physical layer tasks. The framework combines RadioWiki (domain knowledge retrieval), RadioAgent (collaborative I/Q sample generation), and RadioEmulator (closed-loop verification), alongside the new RadioBench benchmark. Evaluations show superior performance over SOTA baselines in configuration viability and signal fidelity.

multi-agent system · radio signal generation · i/q samples · physical layer verification · wireless prototyping

Boosting Multimodal Federated Learning via Chained Modality Optimization

arXiv cs.AI · Zixin Zhang, Fan Qi, Shuai Li, Xiaoshan Yang · 2026-06-01

The paper proposes FedMChain, a novel Multimodal Federated Learning (MMFL) framework addressing modality competition through chained modality optimization. The method structures training as sequential modality-specific phases with error-compensated regularization, coupled with a sparse sign-guided aggregation strategy for robust model fusion. Evaluations on multimodal benchmarks show FedMChain improves predictive performance while reducing communication frequency compared to joint optimization baselines.

multimodal federated learning · modality competition · chained optimization · sign-guided aggregation · error-compensated regularization

Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction

arXiv cs.AI · Yujia Tong, Yuxi Wang, Yunyang Wan, Tian Zhang · 2026-06-01

The paper introduces a unified benchmark for assessing whether model compression preserves uncertainty quantification in large language models (LLMs), addressing a critical gap in safety-critical applications. Using conformal prediction, the authors evaluate 12 LLMs across five NLP tasks under various quantization and pruning configurations. Results show that (1) compression decouples accuracy from uncertainty, (2) larger models better absorb compression-induced uncertainty, and (3) uncertainty inflation exhibits threshold-like behavior. These findings indicate that accuracy-only evaluations are insufficient for deployment readiness and that model compression pipelines need uncertainty-aware benchmarking.

conformal prediction · quantization · pruning · uncertainty quantification · model compression
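
For readers unfamiliar with conformal prediction, a minimal split-conformal sketch for classification follows. The nonconformity score (one minus the probability of the true label), the function names, and the toy probabilities are assumptions chosen for illustration; the summary does not say which score or setup the benchmark uses.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold targeting roughly (1 - alpha) coverage."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))   # finite-sample corrected rank
    # If the rank exceeds n, fall back to the maximum possible score (keep all labels).
    return 1.0 if k > n else scores[k - 1]

def prediction_set(probs, threshold):
    """All labels whose nonconformity score is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]

# Hypothetical calibration data: softmax outputs over 3 classes plus true labels.
cal_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
cal_labels = [0, 1, 2, 1]

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
labels = prediction_set([0.5, 0.35, 0.15], threshold)
```

The size of the returned set acts as the uncertainty signal: a compressed model that keeps its top-1 accuracy but needs larger sets to reach the same coverage has become less certain, which is the kind of decoupling the summary describes.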

Unveiling the Limits of Large Language Models in Inferring Pragmatic Meaning from Non-Verbal Responses

arXiv cs.AI · Sugyeong Eo, Heuiseok Lim · 2026-06-01

This work systematically evaluates large language models (LLMs) on inferring pragmatic meaning from non-verbal dialogue responses, an underexplored aspect of pragmatic language understanding. The study examines three research questions regarding LLMs' recognition of indirect non-verbal intent, failure modes, and potential improvements. Results show LLMs struggle significantly, with accuracy dropping up to 60 percentage points compared to verbal responses, though in-context learning improves performance. Behavioral patterns in LLMs' non-verbal interpretation are identified through extensive analysis.

pragmatic inference · non-verbal communication · large language models · in-context learning · dialogue understanding

Suppressing Forgery-Specific Shortcuts for Generalizable Deepfake Detection

arXiv cs.AI · Yihui Wang, Yonghui Yang, Jilong Liu, Fengbin Zhu · 2026-06-01

The paper proposes Shortcut Subspace Suppression (S^3), a framework for improving deepfake detection generalization by suppressing method-specific shortcuts. The method identifies forgery-specific artifacts via a linear probe for method classification, then uses SVD to extract a dominant shortcut subspace. Training involves soft subspace suppression, while inference employs a training-free neuron attenuation technique. Experiments show significant cross-method generalization improvements while maintaining in-domain performance on multiple benchmarks.

deepfake detection · generalization · shortcut learning · subspace suppression · singular value decomposition
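
One plausible reading of "use SVD to extract a dominant shortcut subspace, then suppress it" is sketched below. The shapes, the choice of `k`, the `strength` parameter, and the use of the probe weight matrix as the SVD input are assumptions made for illustration, not the authors' stated procedure.

```python
import numpy as np

def shortcut_basis(probe_weights, k):
    """Top-k right singular vectors of a method-classification probe (rows: forgery methods)."""
    _, _, vt = np.linalg.svd(probe_weights, full_matrices=False)
    return vt[:k]                     # shape (k, feature_dim), orthonormal rows

def suppress(features, basis, strength=1.0):
    """Remove (strength=1) or attenuate (0 < strength < 1) the in-subspace component."""
    projection = features @ basis.T @ basis
    return features - strength * projection

# Hypothetical sizes: 4 forgery methods, 16-dimensional features, 8 samples.
rng = np.random.default_rng(0)
probe_weights = rng.normal(size=(4, 16))
features = rng.normal(size=(8, 16))

basis = shortcut_basis(probe_weights, k=2)
cleaned = suppress(features, basis)
```

After full suppression the features have no component along the extracted directions, so a classifier trained on them cannot rely on those method-specific cues.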

Evaluation of Baseline Methods for IDD-based SSD External Memory Search

arXiv cs.AI · Yuki Suzuki, Alex Fukunaga · 2026-06-01

The paper evaluates simple baseline methods for immediate duplicate detection (IDD) in A* search using external memory (SSDs/HDDs), addressing gaps in prior work focused on complex IDD methods and delayed duplicate detection. It systematically analyzes performance impacts of OS-level mechanisms like page caches, which were previously unstudied. Results demonstrate the efficacy of straightforward IDD approaches in memory-constrained search scenarios.

immediate duplicate detection · external memory search · a* algorithm · page cache · ssd performance
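
To make the term concrete, here is a small in-memory A* in which each successor is looked up in a hash table at the moment it is generated, which is the defining trait of immediate (as opposed to delayed) duplicate detection. The Python dict merely stands in for the SSD-backed table studied in the paper; the grid domain and all names are illustrative assumptions.

```python
import heapq

def astar_idd(start, goal, neighbors, heuristic):
    """A* that checks every generated state against a hash table immediately."""
    best_g = {start: 0}                          # stand-in for the external-memory table
    frontier = [(heuristic(start), 0, start)]
    while frontier:
        _, g, state = heapq.heappop(frontier)
        if state == goal:
            return g
        if g > best_g[state]:
            continue                             # stale queue entry, a cheaper path exists
        for nxt, cost in neighbors(state):
            new_g = g + cost
            # Immediate duplicate detection: look the state up as soon as it is generated.
            if nxt in best_g and best_g[nxt] <= new_g:
                continue
            best_g[nxt] = new_g
            heapq.heappush(frontier, (new_g + heuristic(nxt), new_g, nxt))
    return None

def grid_neighbors(state, size=4):
    """4-connected moves with unit cost on a size x size grid."""
    x, y = state
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < size and 0 <= ny < size:
            yield (nx, ny), 1

goal = (3, 3)
cost = astar_idd((0, 0), goal, grid_neighbors,
                 lambda s: abs(goal[0] - s[0]) + abs(goal[1] - s[1]))
```

In an external-memory setting each `nxt in best_g` lookup may touch storage, which is why OS-level effects such as the page cache can matter for this style of search.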

LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models

arXiv cs.AI · Prateek Kumar Sikdar · 2026-06-01

LayerRoute introduces input-adaptive layer skipping for agentic language models via LoRA fine-tuning, addressing compute heterogeneity between tool calls (low perplexity) and planning steps (high perplexity). The method augments Qwen2.5-0.5B-Instruct's 24 transformer blocks with per-layer routers (897 params) and rank-8 LoRA adapters (1.08M params), trained end-to-end with gate regularization. After 3,000 steps (6.4 A100-minutes), it achieves a 12.91% skip differential (15.25% FLOPs reduction for tool calls vs 2.34% for planning) while lowering perplexity by 1.3, using only 0.22% of backbone parameters.

adaptive computation · layer skipping · lora fine-tuning · agentic language models · perplexity reduction
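
The reported parameter counts can be sanity-checked with a little arithmetic. The sketch assumes Qwen2.5-0.5B-Instruct uses a hidden size of 896 with 128-dimensional key/value projections, and that rank-8 LoRA is applied to the four attention projections in every block. Neither the model dimensions nor the target modules are stated in the summary, so treat this as one configuration that happens to be consistent with the 1.08M figure, not as the paper's confirmed setup.

```python
# Assumed model dimensions (not stated in the summary).
HIDDEN = 896     # hidden size
KV_DIM = 128     # key/value projection width under grouped-query attention
LAYERS = 24      # transformer blocks, as reported
RANK = 8         # LoRA rank, as reported

def lora_params(d_in, d_out, rank):
    """A rank-r adapter adds A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj (assumed target)
    + lora_params(HIDDEN, KV_DIM, RANK)  # k_proj (assumed target)
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj (assumed target)
    + lora_params(HIDDEN, HIDDEN, RANK)  # o_proj (assumed target)
)
total_lora = per_layer * LAYERS

# A scalar gate over the hidden state with a bias term (assumed router form).
router_params = HIDDEN + 1

# Skip differential: tool-call FLOPs reduction minus planning FLOPs reduction.
skip_differential = round(15.25 - 2.34, 2)
```

Under these assumptions the adapter total is 1,081,344 parameters and the gate has 897, both in line with the summary, and the skip differential is simply the gap between the two reported FLOPs reductions.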

Physics-Guided Attention in a Lightweight TCN for Efficient WiFi CSI-Based Human Activity Recognition

arXiv cs.AI · Chinthaka Ranasingha, Tharindu Fernando, Sridha Sridharan, Clinton Fookes · 2026-06-01

The authors propose a lightweight temporal convolutional network (TCN) for WiFi CSI-based human activity recognition that incorporates physics-guided inductive biases to improve efficiency. The framework integrates a Doppler-energy-guided temporal attention mechanism to emphasize motion-salient time segments and a variance-driven channel attention module to weight informative subcarriers based on temporal motion statistics. This approach captures motion dynamics without increasing architectural depth. Experiments on multiple benchmarks demonstrate superior performance compared to deeper baselines, with significant reductions in parameter count and computational cost.

temporal convolutional network · channel state information · inductive biases · attention mechanism · motion dynamics

Learning Implicit Bias in Generative Spaces for Accelerating Protein Dynamics Emulation

arXiv cs.AI · Kaihui Cheng, Zhiqiang Cai, Wenkai Xiang, Zhihang Hu · 2026-06-01

The paper introduces an implicit, history-dependent bias method to enhance generative emulators of protein dynamics, addressing their tendency to revisit known states during long-horizon extrapolation. The approach combines a history-aware score estimator, which steers sampling away from previous structures, with a score-based refinement step to maintain structural validity. Experiments on DynamicPDB-80 and 12 Fast-Folding proteins show a 35% diversity increase, up to 37× faster coverage of low-energy states, and 3× more low-energy states compared to unbiased emulators.

generative emulators · protein dynamics · history-aware bias · score-based refinement · low-energy states

CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback

arXiv cs.AI · Bin Chen, Xinye Liao, Yiming Liu, Xin Liao · 2026-06-01

The paper introduces Credit-Attenuated Privileged Feedback (CAPF), a training-time mechanism that enhances LLM search agents by leveraging verifier-side information to guide revisions during rollouts. CAPF enables policies to convert zero-reward attempts into positive-reward repair trajectories while attenuating credit for feedback calls to maintain deployment feasibility. Experiments show CAPF improves Qwen3-4B's exact-match accuracy from 44.7% to 48.5% across seven open-domain QA benchmarks compared to outcome-only RLVR.

reinforcement learning · search-augmented reasoning · privileged feedback · credit attenuation · open-domain qa

Dynamic Trust-Aware Sparse Communication Topology for LLM-Based Multi-Agent Consensus

arXiv cs.AI · Wanshuang Gou, Zihan Liu · 2026-06-01

DySCo (Dynamic Sparse Consensus) introduces a dynamic trust-aware sparse communication topology for LLM-based multi-agent systems, addressing the quadratic growth of message complexity in fully connected frameworks. It estimates edge value based on agent reliability, answer divergence, and task relevance, selecting high-value edges for message exchange under budget constraints. DySCo aggregates answers using dynamic trust weights and terminates early upon consensus stabilization, reducing overhead while preserving critical error-correction information. Evaluations on mathematical reasoning, logical reasoning, and factual question-answering tasks demonstrate its effectiveness in maintaining consensus stability and reducing communication complexity.

multi-agent systems · dynamic trust · sparse communication · consensus stability · communication complexity

"I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise

arXiv cs.AI · Matthew Khoriaty, David Williams-King, Shi Feng · 2026-06-01

The paper introduces Decan ($D_{Ca_n}$), a novel metric for measuring diversity in creative outputs using in-context learning from a base language model. The method computes per-byte scores via single forward passes, requiring no embeddings, reference corpora, or human labels, and leverages information theory to detect similarities. Evaluated on Tevet and Berant's McDiv benchmark, $D_{Ca_n}$ achieves OCA 0.846, trailing SentBERT (0.897), and monotonically decreases across OLMo-2-7B's post-training stages (base→SFT→DPO→RLVR), capturing diversity loss relevant to creative writing.

diversity metric · in-context learning · information theory · language model · post-training

Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

arXiv cs.AI · Zheng Lu, Mingqi Gao, Qinlei Xie, Wanqi Zhong · 2026-06-01

The paper introduces Causal-Plan-Bench, a diagnostic benchmark for physically grounded causal reasoning in embodied vision-language planning, and Causal-Plan-1M, a million-scale corpus of annotated reasoning traces. The authors argue that current models prioritize linguistic token prediction over causal next-state reasoning, leading to shallow planning. Their proposed Causal Planner, built on Qwen3-VL-8B, achieves 45.28 on the benchmark (a 36.3% gain from scaling causal data) and demonstrates cross-benchmark generalization, revealing a Causal Scaling Law. Gemini 3 Pro scores only 38.18, highlighting the gap in physical reasoning.

causal reasoning · embodied planning · next-state estimation · vision-language models · diagnostic benchmark

ProbeScale: Probing Analysis to Optimize Neural Scaling Laws for Efficient Small Language Model Inference

arXiv cs.AI · Sourav Das · 2026-06-01

The paper introduces ProbeScale, a framework combining neural scaling laws and probing analysis to identify parameter-efficient subnetworks in Small Language Models (SLMs). By quantifying layer relevance via task-specific probes, ProbeScale formulates subnetwork selection as an optimization problem maximizing task-weighted performance under parameter constraints. Experiments on RoBERTa-Large and T5-Base show 5-10x parameter reduction while retaining 95-98% of original performance on targeted tasks, outperforming heuristic baselines.

small language models · neural scaling laws · probing analysis · parameter-efficient subnetworks · task-weighted optimization

OctoT2I: A Self-Evolving Agentic Text-to-Image Router

arXiv cs.AI · Xu Jiang, Bin Chen, Gehui Li, Yule Duan · 2026-06-01

OctoT2I introduces a self-evolving agentic framework for text-to-image generation, addressing limitations of single-model scaling and existing agentic methods. The framework jointly optimizes generation quality and inference efficiency through a stateful, multi-round routing strategy, enabled by a knowledge base autonomously constructed via a novel Self-Evolving Mechanism. This mechanism employs a Propose-Solve-Evaluate-Learn loop to iteratively discover tool capability frontiers without human supervision. Experiments show OctoT2I achieves 0.96 on GenEval while delivering 90.3% inference speedup and 56.6% energy-efficiency gain over Flow-GRPO, balancing performance and efficiency.

text-to-image · self-evolving mechanism · agentic framework · inference efficiency · knowledge base

MOSS-Audio Technical Report

arXiv cs.AI · Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu · 2026-06-01

MOSS-Audio introduces a unified audio-language model for multimodal understanding, combining speech, environmental sound, and music processing via a dedicated audio encoder, modality adapter, and autoregressive decoder. Key innovations include DeepStack cross-layer feature injection for multi-depth acoustic representation and time markers for explicit temporal grounding. The system employs an event-preserving annotation pipeline for pretraining data and supports task-oriented fine-tuning. Pretrained on large-scale audio-language data with time-aware objectives, the 4B/8B variants achieve strong performance in audio captioning, ASR, and timestamped ASR, establishing a foundation for voice agents.

audio-language model · deepstack cross-layer · time markers · autoregressive decoder · timestamped asr

Multilinguality of Large Language Models From a Structural Perspective

arXiv cs.AI · Haruki Sakajo, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe · 2026-06-01

The study investigates multilinguality in large language models (LLMs) through structural representational analysis, contrasting with prior token-level approaches. Using structural analysis of model representations, it examines how LLMs process diverse languages relative to English dominance in training data. Results show low-resource languages exhibit greater structural divergence from English compared to high/mid-resource languages, and that language-specific post-training modifies structural properties while maintaining cross-linguistic relationships.

large language models · multilinguality · structural analysis · low-resource languages · post-training

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

arXiv cs.AI · Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin · 2026-06-01

STaR-KV introduces a training-free KV cache compression framework for GUI vision-language models, addressing memory bottlenecks by dynamically calibrating token importance. The method employs (i) subspace-aware scoring via spatial mutual information, (ii) temporal stability discounts for redundant entries, and (iii) entropy-derived temperature for adaptive score reshaping. Evaluated on four GUI benchmarks, STaR-KV achieves state-of-the-art accuracy at matched budgets, reducing peak GPU memory by 40% at 20% KV-cache budget with negligible FLOPs overhead (-0.07%).

kv cache compression · vision-language models · spatial mutual information · temporal stability · gui automation

Consistency evaluation of benchmarks used for causal discovery

arXiv cs.AI · Yuzhe Zhang, Chihui Chen, Lina Yao, Chen Wang · 2026-06-01

This work presents the first systematic evaluation of benchmark quality in causal discovery, focusing on inconsistencies between benchmark causal graphs and current domain knowledge. The authors develop an automated pipeline using LLMs to assess 11 real-world benchmarks against 38,081 domain research papers retrieved from scientific databases. Results reveal significant variation in benchmark consistency with domain research, highlighting critical implications for evaluating LLM-based causal discovery methods sensitive to evolving literature.

causal discovery · benchmark evaluation · large language models · graphical causal model · domain knowledge consistency

Stochastic convergence of parallel asynchronous adaptive first-order methods

arXiv cs.AI · Serge Gratton, Philippe L. Toint · 2026-06-01

The work introduces a novel class of asynchronous adaptive first-order optimization methods, extending popular algorithms to asynchronous settings with momentum and inexact normalization variants. Analyzing convergence in a stochastic non-convex setting, the authors prove an O(1/√t) rate (up to logarithmic factors) under standard assumptions. Empirical results demonstrate the methods' practical relevance for heterogeneous large-scale ML systems.

asynchronous optimization · adaptive methods · first-order methods · stochastic convergence · non-convex optimization

Breaking the Information Silo: Semantic Personas for Cross-Domain Recommendation

arXiv cs.AI · Jonathan Mayo, Moshe Unger, Konstantin Bauman · 2026-06-01

The study introduces SPHERE, a cross-domain recommender system that transfers knowledge between strictly disjoint domains without shared users or items. SPHERE leverages large language models to create semantic personas and behaviorally similar source-domain communities, integrating these with collaborative signals via a dual-tower architecture and dynamic fusion gate. Evaluations on Amazon Books, Goodreads, and Steam show consistent improvements over NCF, SVD++, and LightGCN baselines, demonstrating that transfer effectiveness depends on target-domain structural density and predictive strength rather than semantic proximity alone.

cross-domain recommendation · semantic personas · dual-tower architecture · dynamic fusion gate · behavioral alignment

Structure-Guided Adaptive Propagation for Protein-Protein Interaction Site Prediction

arXiv cs.AI · Enqiang Zhu, Yizi Liu, Yilong Luo, Yao Chen · 2026-06-01

SGAP-PPIS introduces structure-guided adaptive propagation for protein-protein interaction site (PPIS) prediction, addressing limitations of fixed propagation schemes in graph-based models. The method employs an equivariant graph neural network to generate residue-wise propagation coefficients, enabling adaptive balance between local feature preservation and neighborhood diffusion based on multi-scale geometric states. On the Test_60 benchmark, SGAP-PPIS achieves competitive performance, with ablation studies confirming the importance of geometry-conditioned propagation, scale-aligned guidance, and multi-step state representation.

protein-protein interaction · equivariant graph neural network · adaptive propagation · geometric microenvironment · multi-scale representation

FLARE: Diffusion for Hybrid Language Model

arXiv cs.AI · Yuchen Zhu, Jing Shi, Chongjian Ge, Hao Tan · 2026-06-01

FLARE introduces a systematic framework for converting hybrid-attention large language models (LLMs) to diffusion language models (dLLMs), addressing challenges in capability preservation and inference efficiency. The method combines a token-equal autoregressive (AR) and diffusion objective, hardware-aware kernels, and unified inference, enabling a single checkpoint to support both AR-style verified decoding and diffusion-style parallel denoising. Results demonstrate FLARE's competitiveness with leading open-source dLLMs across model scales, achieving consistent throughput gains in single-GPU concurrent serving. The analysis highlights transfer data quality as a critical factor, surpassing loss formulation and attention-mask design in importance.

diffusion language models · hybrid-attention · autoregressive · parallel denoising · capability preservation

Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams

arXiv cs.AI · Zewen Liu, Zhan Shi, Yisi Sang, Bing He · 2026-06-01

The paper introduces Adaptive Auto-Harness, a framework for sustained self-improvement of LLM agents on open-ended task streams, addressing limitations of existing auto-harness systems evaluated on fixed benchmarks. The method decomposes performance gaps into evolution and adaptation losses, employing a stateful multi-agent evolver, harness tree with solve-time routing, and human-steering hooks. Evaluations on prediction-market, security-competition, and event-forecasting streams show superior performance over five baselines, with gains attributed to improved construction, routing, and targeted human steering.

auto-harness · llm agents · task streams · multi-agent evolver · solve-time routing

EvoBrain: Continual Learning of EEG Foundation Models Across Heterogeneous BCI Tasks

arXiv cs.AI · Yangxuan Zhou, Sha Zhao, Jiquan Wang, Shijian Li · 2026-06-01

EvoBrain introduces a continual learning framework for EEG foundation models to address cross-task scalability in brain-computer interfaces. The method combines Neuro-Spectral Task Normalization (NSN) for spectral alignment and Response-Affinity Distillation (RAD) with time-dependent replay to mitigate forgetting. Evaluations across six BCI tasks show superior performance over state-of-the-art methods, balancing plasticity and stability while enabling knowledge transfer between spectrally compatible tasks.

eeg · foundation models · continual learning · brain-computer interfaces · neuro-spectral task normalization

TriAlign: Towards Universal Truth Consistency in Personalized LLM Alignment

arXiv cs.AI · Thi-Nhung Nguyen, Linhao Luo, Rollin Omari, Junae Kim · 2026-06-01

TriAlign introduces a Truth-Invariant Alignment (TIA) framework for personalized LLMs, ensuring universal truth consistency across social groups while preserving personalization. The method employs offline multi-agent reinforcement learning (MARL), modeling each social group as an agent and jointly optimizing truth accuracy, cross-group consistency, and personalization via a fairness-aware objective and inconsistency penalty. Experiments on diverse benchmarks show TriAlign reduces universal truth disparities by 18-32% compared to baselines while maintaining personalization quality and improving objective task performance.

truth-invariant alignment · multi-agent reinforcement learning · personalized llms · universal truth consistency · fairness-aware objective

Construction of Historical Knowledge Graphs Based on BERT and Graph Neural Networks

arXiv cs.AI · Ping Li, Bartlomiej Brzozka · 2026-06-01

The paper introduces a BERT-GNN architecture for constructing historical knowledge graphs from unstructured texts, addressing challenges like linguistic ambiguity and implicit references. The method combines bidirectional transformer representations with graph neural networks to extract entities and relationships systematically. Experiments on municipal records and parliamentary documents demonstrate superior Precision, Recall, and F1-scores compared to rule-based and deep-learning baselines, validating the approach for handling nested structures and implicit references in historical data.

bert · graph neural networks · knowledge graphs · historical texts · entity extraction

SECUREVENT: Hybrid AI/ML Security Monitoring for Distributed Event-Based Systems

arXiv cs.AI · Eric Liang · 2026-06-01

SECUREVENT introduces a hybrid AI/ML security-monitoring architecture for distributed event-based systems, combining traditional protections (authenticated transport, topic-level authorization) with online anomaly detection, graph-aware behavioral features, and federated learning. The method integrates complex-event policy rules and adversarial-ML governance to address dynamic attack surfaces in publish/subscribe services. A prototype demonstrates improved recall over static rules while maintaining low false-positive rates in synthetic attack scenarios, emphasizing model-based monitoring for dynamic event flows and identities.

distributed event-based systems · online anomaly detection · federated learning · complex-event processing · adversarial-ml governance

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

arXiv cs.AI · Zhiqing Ma, Zhonghao Xu, Dong Yu, Chen Kang · 2026-06-01

THRD introduces a training-free multi-turn defense framework against jailbreak attacks on large language models (LLMs), addressing trajectory-dependent safety behavior through temporal risk accumulation modeling. The framework comprises four modules: Turn-level Risk Assessor (TRA), Historical Context Analyzer (HCA), Response Evaluator (RE), and a Decision Module integrating these via a time-evolving scoring mechanism. Evaluations against advanced multi-turn attacks demonstrate THRD reduces attack success rates (ASR) to 0.2-4.0% while maintaining model utility within 1.5% degradation on MMLU and GSM8K. Ablation studies confirm module contributions and cross-architecture generalization, with over 70% of attacks detected at Turn 2 or later, validating temporal aggregation necessity.

jailbreak attacks · temporal risk accumulation · training-free defense · multi-turn interaction · attack success rate

TrafficRAG: A Multimodal RAG Framework for Traffic Accident Liability Determination

arXiv cs.AI · Xu Li, Zedong Fu, Xinyi Li, Xun Han · 2026-06-01

The paper introduces TrafficRAG, a multimodal retrieval-augmented generation framework for automated traffic accident liability determination. The method employs a vision-language model to generate structured textual descriptions of accidents, followed by hybrid BM25 and dense retrieval to fetch relevant legal clauses and historical cases. A large language model then integrates this evidence for reasoning, producing legally grounded reports. Experiments demonstrate improvements over baselines, achieving 77.32% Legal Norm Adaptation Accuracy, 81.71% Factual Faithfulness, and 5.48% Liability Ratio MAE.

retrieval-augmented generation · vision-language model · hybrid retrieval · legal norm adaptation · liability determination
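The hybrid retrieval step in the TrafficRAG summary can be illustrated with a small fusion sketch. The summary does not say how the BM25 and dense results are combined, so this example swaps in reciprocal rank fusion (RRF), a common general-purpose choice; the document ids, the two input rankings, and the constant `k=60` are all illustrative assumptions, not details from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in;
    the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from a lexical (BM25) and a dense retriever.
bm25_ranking = ["clause_12", "case_7", "clause_3"]
dense_ranking = ["case_7", "clause_3", "clause_12"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Rank-based fusion avoids having to put BM25 scores and embedding similarities on a common scale, which is one reason it is often used as a first baseline for hybrid retrieval.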

Argument Collapse: LLMs Flatten Long-Form Public Debate

arXiv cs.AI · Yekyung Kim, Yapei Chang, Chau Minh Pham, Mohit Iyyer · 2026-06-01

The paper introduces 'argument collapse', a phenomenon where LLM-generated essays converge to a limited set of arguments compared to human discourse. Analyzing 1,039 NYT and 448 Boston Review human responses versus 23,384 LLM-generated essays, the study finds only 3.4% of LLM main arguments are unique (vs 65.3% human), with similar patterns in sub-arguments (9.1% vs 41.0%). LLMs exhibit structural rigidity, favoring generalized claims and predictable arcs, even when prompted for diversity. Results hold across short-form (NYT) and long-form (Boston Review) contexts.

argument collapse · llm-generated text · public debate · discourse diversity · text analysis

Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization

arXiv cs.AI · Jiangyu Chen, Banyi · 2026-06-01

The paper introduces an evidence-gated mechanism for integrating LLM-generated expert priors in multi-objective Bayesian optimization, addressing calibration issues across diverse objectives. The method employs an objective-wise reputation-market system, updating expert weights dynamically based on observed feedback and market-level trust, alongside a decoupled counterfactual gate that selectively uses LLM priors. Experiments on synthetic stress tests and molecule optimization benchmarks (ESOL, FreeSolv, Lipophilicity) demonstrate improved robustness over fixed priors, though LLM confidence shows inconsistent utility. The three-arm counterfactual gate outperforms simpler variants, while margin portfolio experiments highlight the need for acquisition-aware selection.

bayesian optimization · multi-objective · llm priors · counterfactual gate · reputation-market

Characterization of Multi-Model Agentic AI Systems on General Tasks via Trace-Driven Simulation

arXiv cs.AI · Donghwan Kim, Prakhar Singh, Younghoon Min, Jongryool Kim · 2026-06-01

This paper introduces GAIATrace, the first token-level trace dataset capturing full reasoning tokens and task-level structures of two state-of-the-art agentic AI systems (MiroThinker and OWL) on the GAIA benchmark. It also presents Vidur-Agent, a trace-driven simulator enabling reproducible, low-cost system evaluation across diverse environments. Using these tools, the study characterizes how modern agentic systems handle general tasks and examines the impact of system design choices, yielding unique insights into their behavior.

token-level trace · agentic ai · trace-driven simulator · gaia benchmark · system design choices

Shortcut to Nowhere: Demystifying Deep Spurious Regression

arXiv cs.AI · Guanrong Xu, Jessica Li, Hao Wang, Yuzhe Yang · 2026-06-01

The paper introduces Deep Spurious Regression (DSR), addressing continuous spurious correlations in regression tasks where traditional classification-focused methods fail. The authors propose leveraging spurious attribute similarity in both label and feature spaces to calibrate distributions across attributes, accounting for nearby targets and related groups. Experiments on computer vision, environmental sensing, and LLM regression datasets demonstrate the method's effectiveness, filling a gap in benchmarks for continuous prediction under spurious correlations.

deep spurious regression · continuous prediction · spurious correlations · attribute-label confounding · distribution calibration

Post-Deterministic Distributed Systems: A New Foundation for Trustworthy Autonomous Infrastructure

arXiv cs.AI · Jun He, Deying Yu · 2026-06-01

The paper introduces Post-Deterministic Distributed Systems (PDDS), a framework for coordinating heterogeneous environments where deterministic code, stochastic models, and autonomous agents coexist. It generalizes classical distributed computing models by relaxing the assumption of universal deterministic execution, addressing scenarios where semantically equivalent outcomes arise from divergent reasoning paths. The authors propose five architectural pillars (Protocol-Driven Development, Verifiable Agentic Infrastructure, etc.) and introduce Epistemic State Replication for knowledge-level consistency. A taxonomy of failure classes specific to non-deterministic participants is also defined.

post-deterministic distributed systems · epistemic state replication · verifiable agentic infrastructure · semantic quorum assurance · autonomous state control planes

Fair Finetuning Mitigates Distribution Inference Attacks

arXiv cs.AI · Rakshit Naidu · 2026-06-01

Fair Fine-tuning (FFt) mitigates distribution inference attacks (DIAs) by fine-tuning models on complementary distributions under Equalized Odds (EO) constraints, establishing a theoretical link between fairness and privacy. The authors prove a tight bound $\text{Adv}(\mathcal{A},M_f) \le \Delta_{\text{EO}} \cdot W$, where $\Delta_{\text{EO}}$ is the EO disparity and $W$ quantifies how distinguishable the distributions are by sensitive attributes. Evaluations across six datasets (including ACS Income, COMPAS, German Credit, UTKFaces, and Bias in Bios) show that FFt reduces adversarial accuracy gaps below $\tau=0.1$, e.g., from $\sim15\%$ to under $4\%$ on ACS Income. This work formalizes the connection between EO disparity and adversarial advantage, enabling unified fairness-and-privacy defenses.

distribution inference attack · equalized odds · fair fine-tuning · adversarial advantage · sensitive attributes
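To make the bound in the summary above concrete, here is a minimal sketch that computes an Equalized Odds gap on toy data and plugs it into $\Delta_{\text{EO}} \cdot W$. It assumes the common definition of the EO gap (the largest between-group difference in positive-prediction rate, taken over both true labels) and a binary sensitive attribute; the paper's exact definition may differ, and the toy data and the value of `w` are made up for illustration.

```python
def positive_rate(preds, labels, groups, group, label):
    """Fraction of positive predictions among examples with this group and true label."""
    selected = [p for p, y, a in zip(preds, labels, groups) if a == group and y == label]
    return sum(selected) / len(selected)


def equalized_odds_gap(preds, labels, groups):
    """Largest between-group gap in positive-prediction rate over true labels 0 and 1."""
    return max(
        abs(positive_rate(preds, labels, groups, 0, y)
            - positive_rate(preds, labels, groups, 1, y))
        for y in (0, 1)
    )


# Toy binary predictions, true labels, and sensitive-group membership (all hypothetical).
preds = [1, 0, 1, 1, 0, 0, 1, 0]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

delta_eo = equalized_odds_gap(preds, labels, groups)
w = 0.2  # hypothetical distinguishability term W
advantage_bound = delta_eo * w
print(delta_eo, advantage_bound)
```

Under these assumptions, shrinking the EO gap directly shrinks the upper bound on adversarial advantage, which is the fairness-privacy link the summary describes.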

Two-Fidelity Best-Action Identification for Stochastic Minimax Tree

arXiv cs.AI · Peter Chen, Xi Chen · 2026-06-01

The paper introduces 2FFS, a two-fidelity tree-search algorithm for fixed-confidence best-action identification in stochastic minimax trees, addressing the cost-reliability tradeoff in AI planning. The method combines minimax-style fast expansion with Monte Carlo Tree Search (MCTS) stochastic sampling, adaptively selecting between cheap biased evaluations and expensive accurate rollouts for local certification. Theoretical analysis proves fixed-confidence correctness, finite stopping for exact identification, and polynomial-depth cost bounds. Empirical results demonstrate that 2FFS achieves significant reductions in sample complexity and computational operations compared to existing BAI-MCTS baselines.

stochastic minimax trees · fixed-confidence · monte carlo tree search · two-fidelity · best-action identification

JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions

arXiv cs.AI · Jiashuo Yu, Yao Yao, Boyu Chen, Alex Wang · 2026-06-01

JenBridge introduces a Transformer-based framework for generating coherent long-form video soundtracks across scene transitions, addressing limitations of existing short-clip systems. The method employs a two-stage approach: pretraining on text-audio corpora for musical priors, followed by video-domain adaptation with dual text-visual conditioning. A novel adaptive transition mechanism, including an LLM Agent for transition selection, ensures narrative continuity. Evaluated on the LVS Benchmark, JenBridge outperforms baselines in transition naturalness (12% improvement) and coherence metrics, demonstrating professional-quality automated soundtracking capabilities.

transformer · flow-matching · text-visual conditioning · adaptive transition · lvs benchmark

Understanding Identity Continuity in Thermal Video through Scene-Level Consistency

arXiv cs.AI · Wei-Chieh Sun, Gyungmin Ko, Heejae Kwon, Hsiang-Wei Huang · 2026-06-01

The study demonstrates that lightweight post-processing can enhance identity continuity in thermal pedestrian multi-object tracking (MOT) without complex re-identification models. Using a YOLOv8 and SORT baseline, the authors introduce a modular identity-repair backend with online short-gap remapping and offline tracklet relinking based on temporal, spatial, motion, and border cues. Evaluations on the PBVS Thermal Pedestrian MOT benchmark show that conservative relinking improves IDF1 from 82.25 to 84.93 while maintaining MOTA, highlighting the efficacy of scene-level spatial-temporal consistency over local frame-to-frame association.

thermal mot · identity continuity · tracklet relinking · yolov8 · sort

RPCASSM: Robust PCA State Space Model For Infrared Small Target Detection

arXiv cs.AI · Pingping Liu, Aohua Li, Yubing Lu, Jin Kuang · 2026-06-01

The RPCASSM network introduces a robust principal component analysis (RPCA)-based state space model for infrared small target detection, addressing inefficiencies in mainstream visual state space models. It comprises a background state space module (BSSM) and a target state space module (TSSM), leveraging spatial heterogeneous signals and target sparsity/local highlight properties, respectively. The BSSM employs a spatial probe scanning mechanism (SPCM) for background modeling, while the TSSM utilizes a deformable prompt scanning mechanism (DPCM) for target edge structure modeling. Experiments on benchmark datasets validate the model's effectiveness in accurately detecting infrared small targets.

robust principal component analysis · state space model · infrared small target detection · spatial probe scanning mechanism · deformable prompt scanning mechanism

HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark

arXiv cs.AI · Seonghyeon Go, Yumin Kim · 2026-06-01

The paper introduces HAIM, a novel dataset for AI Music Tracking, addressing the limitations of binary AI-or-human classification in contemporary music production workflows. HAIM provides granular labels for stages of AI intervention, including hybrid production and agent-level tracking, reflecting real-world practices such as vocal synthesis, arrangement, and mastering. The authors evaluate state-of-the-art detectors, revealing systemic flaws in current approaches. By releasing HAIM, they propose a benchmark that shifts the field toward structured, multifaceted evaluation of AI integration in music production.

ai music tracking · hybrid production · agent-level tracking · vocal synthesis · mastering

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

arXiv cs.AI · Atoosa Chegini, Soheil Feizi · 2026-06-01

The paper proposes Chunk-Level Guided Generation, a training-free alternative to PRM-guided search for mathematical reasoning that uses off-the-shelf LLMs as process scorers. At each step, a small model samples k fixed-length candidate chunks, while a larger model scores candidates using likelihoods without generating text, steering generation before errors propagate. Two selection rules are introduced: Likelihood-Guided Selection (LGS) and Contrastive-Guided Selection (CGS). On GSM8K, MATH, Minerva Math, AMC23, and AIME24, CGS outperforms majority voting by up to 28 pp and matches or exceeds Qwen2.5-Math-PRM-72B performance without reward-model training, achieving 81.8% on MATH and 63.6% on Minerva Math with Qwen2.5-7B guided by Qwen2.5-72B.

chunk-level guided generation · process scorer · contrastive-guided selection · likelihood-guided selection · mathematical reasoning

Time-Aware Diffusion based on Preference Disentanglement for Generative Recommendation

arXiv cs.AI · Bangguo Zhu, Peng Huo, Yuanbo Zhao, Zhicheng Du · 2026-06-01

The study introduces TDPM, a novel generative recommendation framework that incorporates time-aware diffusion on semantic index tokens to address the non-stationary distribution of user preferences. TDPM disentangles user preferences into period preference (long-term consistency) and point preference (short-term triggers), integrating these into the diffusion process. Experiments on three public datasets show TDPM outperforms state-of-the-art baselines by 29.21% in HR@20 and 25.45% in NDCG@20, validating the efficacy of time-aware token diffusion.

generative recommendation · diffusion models · time-aware diffusion · preference disentanglement · semantic index tokens
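For readers unfamiliar with the HR@20 and NDCG@20 metrics cited in the TDPM summary, the sketch below shows how they are typically computed for one user. It assumes the common leave-one-out setup with a single held-out target item per user; the summary does not state the paper's evaluation protocol, and the item names are placeholders.

```python
import math


def hit_rate_at_k(ranked_items, target, k):
    """1.0 if the held-out target appears in the top-k recommendations, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k):
    """NDCG@k with a single relevant item.

    The ideal DCG is 1 (target at rank 1), so NDCG reduces to
    1 / log2(rank + 1) when the target is in the top k, and 0 otherwise.
    """
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0


# Hypothetical ranked recommendations for one user.
ranked = ["item_a", "item_b", "item_c"]
print(hit_rate_at_k(ranked, "item_b", k=2))
print(ndcg_at_k(ranked, "item_b", k=2))
```

Reported scores are usually these per-user values averaged over all test users. HR only asks whether the target made the cut, while NDCG also rewards placing it higher in the list.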

DOT-MoE: Differentiable Optimal Transport for MoEfication

arXiv cs.AI · Udbhav Bamba, Arnav Chavan, Aryamaan Thakur, Steve Teig · 2026-06-01

DOT-MoE introduces a differentiable optimal transport framework for converting pre-trained dense LLMs into sparse Mixture of Experts (MoEs), addressing inference inefficiency. The method formulates neuron assignment as a balanced transport problem using differentiable Sinkhorn-Knopp iterations, enforcing expert capacity constraints, and jointly learns discrete neuron-to-expert assignments with token routing via Straight-Through Estimators. Experiments show DOT-MoE retains 90% of original dense model performance while reducing active parameters by 50%, outperforming heuristic clustering and random-split baselines.

mixture of experts · optimal transport · sinkhorn-knopp · straight-through estimator · neuron assignment
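The balanced-transport idea in the DOT-MoE summary can be sketched with plain Sinkhorn-Knopp iterations. This is only an illustration of the underlying technique under stated assumptions: the cost matrix, the temperature `epsilon`, and the uniform marginals are invented for the example, and the sketch omits the differentiable training loop, Straight-Through Estimators, and token routing that the paper describes.

```python
import math


def sinkhorn(cost, epsilon=0.5, n_iters=200):
    """Entropy-regularized balanced transport between n neurons and m experts.

    Returns a plan whose rows each sum to 1/n (every neuron is fully assigned)
    and whose columns each sum to 1/m (every expert gets equal capacity).
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    row_mass = [1.0 / n] * n
    col_mass = [1.0 / m] * m
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical neuron-to-expert costs: 4 neurons, 2 experts (lower is better).
cost = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.2]]
plan = sinkhorn(cost)

# Hard assignment: each neuron goes to the expert holding most of its mass.
assignment = [max(range(len(row)), key=row.__getitem__) for row in plan]
print(assignment)
```

The column constraint is what enforces expert capacity: no expert can absorb more than its share of neurons, even if many neurons would individually prefer it.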

MINTS: Minimalist Thompson Sampling

arXiv cs.AI · Kaizheng Wang · 2026-06-01

The paper introduces MINimalist Thompson Sampling (MINTS), a Bayesian framework for sequential decision-making that places a prior solely on the location of the optimum while eliminating nuisance parameters via profile likelihood. This approach accommodates complex structural constraints naturally through a generalized posterior. MINTS demonstrates near-optimal non-asymptotic regret guarantees and sharp almost-sure asymptotic regret characterizations for multi-armed bandits with mean constraints. Specifically, it achieves the classical Lai-Robbins constant in unstructured settings and adapts to unimodal structure, attaining the sharp constant determined by the optimal arm's immediate neighbors.

thompson sampling · bayesian framework · profile likelihood · multi-armed bandits · regret guarantees

MobEvolve: An Agentic Self-Evolving Heuristic System for Interpretable Human Mobility Generation

arXiv cs.AI · Junlin He, Yihong Tang, Tong Nie, Ao Qu · 2026-06-01

The paper introduces MobEvolve, an agentic self-evolving heuristic system for interpretable human mobility generation that addresses limitations of existing deep generative and LLM-based approaches. The framework initializes behavior-inspired heuristics, then employs an LLM agent to iteratively diagnose misalignments and evolve internal logic through validation-set feedback, accumulating evolution memory for continuous improvement. Evaluations on Singapore and Montreal benchmarks show MobEvolve outperforms state-of-the-art methods in trajectory fidelity (individual-level), distribution alignment (population-level), and behavioral plausibility while maintaining interpretability and inference efficiency.

human mobility generation · agentic self-evolution · behavioral plausibility · distributional alignment · evolution memory

Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity

arXiv cs.AI · Jiaming Qu, Lucheng Fu, Yibo Hu · 2026-06-01

The study investigates harmful and beneficial revision behaviors in LLMs when exposed to peer responses, focusing on how consensus structure and authority labels influence conformity. Using four open-weight LLMs across seven QA datasets, the authors manipulate social cues and measure revision outcomes. Results show peer agreement more frequently misleads correct models than corrects wrong ones, authority labels bias decisions regardless of accuracy, and reasoning interventions fail to mitigate harmful revisions effectively. The findings advocate for peer answer verification in multi-agent LLM systems.

large language models · multi-agent systems · conformity bias · social cues · peer verification

AlphaToken: Decoupling Adaptation and Stability for Path-Aware Response Token Valuation in LLM Post-Training

arXiv cs.AI · Liu Qing, Ou Wu, Yi Du · 2026-06-01

AlphaToken introduces a principled framework for response token valuation in LLM post-training, decoupling adaptation (target-task learning) and stability (preserving pre-trained capabilities) through path-aware objectives. The method combines direct-path token gradients with causal-path signals in autoregressive generation, approximating stability via a Fisher-drift proxy when retention data is unavailable. It employs Ghost Dot-Product for efficient token-level valuation and masks low-value tokens during fine-tuning. Experiments demonstrate improved post-training performance and reduced catastrophic forgetting compared to heuristic approaches.

token valuation · fisher-drift proxy · ghost dot-product · path-aware learning · autoregressive generation

E4GEN: Event-level Explainable Extreme-Enhanced Time-series Generation

arXiv cs.AI · Lin Jiang, Dahai Yu, Ximiao Li, Guang Wang · 2026-06-01

The paper introduces E4GEN, an explainable diffusion framework for extreme event-aware time-series generation. The method comprises three components: E-Activator for dataset-adaptive extreme-control signal activation, E-Predictor for self-driven semantic prediction and data-conditioned training, and E-Control for layer-wise signal injection via a trainable Extreme Control Network. Evaluated on six datasets with 17 metrics, E4GEN outperforms state-of-the-art models in overall fidelity, extreme-event fidelity, and downstream utility.

diffusion framework · extreme-event generation · self-driven semantic prediction · data-conditioned training · denoising process

A Framework for Graph-Conditioned Hierarchical Shapley Attribution in Patent Valuation

arXiv cs.AI · Joy Bose · 2026-06-01

The paper introduces PatentXAI, a framework for patent valuation using hierarchical Shapley attribution in knowledge graphs. The method computes patent contributions by restricting coalitions to Markov Blankets grounded in the C-SVE conditional independence theorem, achieving tractability with median blanket sizes covering 32.9% of patents at n=100. Experiments demonstrate scalability (10ms/patent) and accuracy (0.062 deviation from Monte Carlo reference), improving to 0.039 deviation for homogeneous patent clusters. Profit allocation combines exact Shapley for macro-components and centrality-weighted Shapley for individual patents. The work identifies empirical estimation of the characteristic function v(S) as the key remaining challenge.

shapley value · markov blanket · knowledge graph · patent valuation · c-sve theorem
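As background for the Shapley attribution in the PatentXAI summary, here is a minimal sketch of exact Shapley values on a three-patent toy game. It enumerates every coalition, which is exactly the exponential cost that the paper's Markov-blanket restriction is meant to avoid. The patent names and the characteristic function `toy_value` are hypothetical, chosen so the result is easy to check by hand.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, v):
    """Exact Shapley values by enumerating all coalitions (exponential in len(players))."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Probability that exactly this many players precede p in a random ordering
            # and form this particular coalition.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (v(s | {p}) - v(s))
        values[p] = total
    return values


def toy_value(coalition):
    """Hypothetical v(S): patent A is worth 10 alone; B and C are worth 6 only together."""
    worth = 0.0
    if "A" in coalition:
        worth += 10.0
    if "B" in coalition and "C" in coalition:
        worth += 6.0
    return worth


values = shapley_values(["A", "B", "C"], toy_value)
print(values)
```

In this toy game the complementary patents B and C split their joint value evenly, and the attributions sum to the value of the full portfolio, which is the efficiency property that makes Shapley values attractive for profit allocation.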

Demystifying Multimodal Biomolecular Co-design With Intrinsic Geodesic Coupling

arXiv cs.AI · Keyue Qiu, Xintong Wang, Zhilong Zhang, Hao Zhou · 2026-06-01

The paper introduces GeoCoupling, a framework optimizing temporal couplings between heterogeneous modalities in biomolecular co-design. It addresses limitations of existing approaches that enforce fixed synchronous coupling, which can lead to high-variance supervision and inconsistent intermediate states. GeoCoupling systematically learns intrinsic geodesic couplings during training and generation, improving modality consistency. Empirical evaluations in structure-based drug design and unconditional protein design demonstrate that GeoCoupling outperforms synchronous and randomly coupled baselines, yielding biomolecules with enhanced physical validity and diversity.

biomolecular co-design · temporal coupling · geodesic coupling · modality consistency · structure-based drug design

ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL

arXiv cs.AI · Zelin He, Haotian Lin, Boran Han, Wei Zhu · 2026-06-01

ReSkill introduces a reinforcement learning framework that reconciles skill creation with policy optimization, addressing the decoupling issue in existing skill-augmented RL methods. It integrates three mechanisms: an assertion-driven skill creator for conditional skill revisions, within-group rollout sampling for skill version comparison, and Thompson Sampling with adaptive discounting for skill version selection. Evaluated across multiple domains, ReSkill outperforms memory and skill-based RL methods, particularly on unseen tasks, demonstrating effective skill-policy co-evolution through automatic skill creation, refinement, and pruning.

reinforcement learning · skill creation · policy optimization · thompson sampling · rollout sampling
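The skill-version selection step in the ReSkill summary can be illustrated with a discounted Beta-Bernoulli Thompson sampler. This is a generic sketch, not the paper's mechanism: it assumes binary success/failure rewards and a fixed discount factor, whereas the summary mentions adaptive discounting without giving details. The class name and parameters are hypothetical.

```python
import random


class DiscountedThompson:
    """Beta-Bernoulli Thompson sampling with exponential forgetting of old evidence."""

    def __init__(self, n_arms, discount=0.95, seed=0):
        self.alpha = [1.0] * n_arms  # 1 + discounted success count per skill version
        self.beta = [1.0] * n_arms   # 1 + discounted failure count per skill version
        self.discount = discount
        self.rng = random.Random(seed)

    def select(self):
        """Sample a success rate for each version and pick the largest sample."""
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        """Decay all evidence toward the uniform prior, then record the new outcome."""
        for i in range(len(self.alpha)):
            self.alpha[i] = 1.0 + self.discount * (self.alpha[i] - 1.0)
            self.beta[i] = 1.0 + self.discount * (self.beta[i] - 1.0)
        self.alpha[arm] += reward
        self.beta[arm] += 1.0 - reward


sampler = DiscountedThompson(n_arms=2)
sampler.update(arm=1, reward=1.0)
sampler.update(arm=1, reward=1.0)
chosen = sampler.select()
print(sampler.alpha, chosen)
```

Discounting matters when the policy itself keeps changing: a skill version that helped an earlier policy may no longer help the current one, so old successes should count for less.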

EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision

arXiv cs.AI · Tianyi Xu, Yaolun Zhang, Xuan Ouyang, Huazheng Wang · 2026-06-01

EvoPool introduces an evolutionary multi-agent framework for label-efficient specialized supervision, addressing the underperformance of large language models in high-stakes domains. The method employs three specialized agents that iteratively propose executable annotator code, validated via a fitness signal and deterministic gating for viability, diversity, and marginal contribution. EvoAgg aggregates pool votes into soft training labels using semantic and annotator-vote features. Results show EvoPool is 4500-31000x faster than LLM annotation and outperforms LLM baselines by +0.141 macro-F1 on average across 7 of 8 specialized tasks, peaking at +0.301 on ChemProt.

evolutionary multi-agent · programmatic annotation · label-efficient · specialized supervision · soft training labels

TechGraphRAG: An Agentic Graph-Augmented RAG Framework for Technical Literature Reasoning

arXiv cs.AI · Kanwar Bharat Singh · 2026-06-01

The paper introduces TechGraphRAG, an agentic retrieval-augmented generation framework for technical literature reasoning, applied to a corpus of 2,100 papers in intelligent tires and vehicle dynamics. It features a 13-step pipeline with query intent classification, evidence sufficiency scoring, agentic retry loops, external database searches, Neo4j knowledge graph traversal, and citation verification. Key innovations include a 100-point evidence scoring system, route-dependent search architecture, and self-correcting generation loops. The framework demonstrates practical implementation for domain-specific RAG with enhanced reliability and context-aware reasoning.

retrieval-augmented generation · knowledge graph · evidence sufficiency · agentic loops · citation verification

Revisiting Ripple Effects in Knowledge Editing through Pressure-Aware Joint Neighborhood Optimization

arXiv cs.AI · Haoben Huang, Shuxin Liu, Ou Wu, Di Gao · 2026-06-01

The paper introduces Joint Neighborhood Optimization (JNO), a knowledge-editing framework that jointly addresses coupled ripple effects in large language models: desirable propagation to related facts and unintended perturbation of preserved knowledge. JNO employs Pressure-Aware Coordination (PAC) to optimize neighborhood target representations under coupled constraints and a semantic pre-execution gate to filter high-risk edits. Evaluated on RippleEdits, JNO improves propagation and preservation metrics by ≥7.0% while maintaining cross-backbone editing stability.

knowledge editing · ripple effects · joint neighborhood optimization · pressure-aware coordination · semantic pre-execution gate

FedMTFI: Feature Importance Based Optimized Multi Teacher Knowledge Distillation in Heterogeneous Federated Learning Environment

arXiv cs.AI · Nazmus Shakib Shadin, Aaron Cummings, Xinyue Zhang, Bobin Deng · 2026-06-01

The paper proposes FedMTFI, a federated learning architecture combining multi-teacher knowledge distillation (MTKD) with feature importance for heterogeneous environments. Clients are clustered by hardware and model types, training local models on non-IID data before server aggregation via FedAvg. Cluster prototypes serve as teachers for a global student model, with Shapley values (SHAP) weighting important features during distillation. Experiments demonstrate superior accuracy over traditional FL methods, particularly under non-IID conditions.

federated learning · knowledge distillation · non-iid data · shapley values · model aggregation
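The FedAvg aggregation step mentioned in the FedMTFI summary is simple enough to show directly. The sketch below assumes model parameters are flattened into plain lists of floats and that each client is weighted by its local dataset size, which is the standard FedAvg rule; the client values are made up, and the clustering and distillation stages of the paper are not shown.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighting each client by its dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[d] * n for w, n in zip(client_weights, client_sizes)) / total
        for d in range(dim)
    ]


# Two hypothetical clients with 100 and 300 local examples.
client_weights = [[1.0, 2.0], [3.0, 4.0]]
client_sizes = [100, 300]

global_weights = fedavg(client_weights, client_sizes)
print(global_weights)  # [2.5, 3.5]
```

Because the client with more data pulls the average toward its own parameters, plain FedAvg can drift under non-IID data, which is the setting the paper targets.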

Estimating Mutual Information between Time Series and Temporal Event Sequences Across Diverse Analysis Tasks

arXiv cs.AI · Haoji Hu, Huaqing Mao, Yijun Lin, Xiaowei Jia · 2026-06-01

The authors propose a nonparametric mutual information estimator for quantifying dependence between continuous time series and discrete temporal event sequences, addressing limitations of existing methods that suffer from quantization sensitivity and event redundancy. Their approach models continuous-discrete duality without data transformation and employs latent event clustering to reduce bias from co-occurrence. Evaluated on four tasks (time-delayed mutual information, temporal repetition discovery, covariate selection, and feature selection), the method demonstrates improved accuracy and robustness across synthetic and real-world datasets compared to baseline techniques.

mutual information · temporal event sequences · nonparametric estimation · continuous-discrete duality · latent event clustering

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

arXiv cs.AI · Tianze Yang, Yucheng Shi, Ruitong Sun, Jingyuan Huang · 2026-06-01

The paper introduces TRON (Targeted, Rule-verifiable Online eNvironments), a scalable online environment substrate for visual reasoning reinforcement learning. TRON generates training rollouts on demand via a controllable generator-verifier program that samples latent visual states, renders images, poses questions, and verifies answers exactly, enabling unbounded instance generation at curriculum-controlled difficulty levels. The current suite comprises 520 environments across five ability buckets (spatial, mathematical, diagram, pattern/logic, and counting), supporting both generalist and specialist model training. Experiments demonstrate improved performance on ten multimodal reasoning benchmarks across Qwen3-VL-4B, Qwen2.5-VL-7B, and MiMo-VL-7B-SFT models.

visual reasoning · reinforcement learning · online environment · generator-verifier · multimodal benchmarks
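
Illustrative sketch (not from the paper): the generator-verifier loop described for TRON can be pictured as a program that samples a latent state, poses a question about it, and checks answers by rule. The counting task, the `Instance` class, and the `generate` and `verify` functions below are hypothetical stand-ins; a real environment would also render the latent state to an image.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    scene: List[str]   # latent visual state (a real environment would render it)
    question: str
    answer: int        # ground truth computed directly from the latent state


def generate(rng: random.Random, difficulty: int) -> Instance:
    """Sample a latent scene; difficulty controls the number of objects."""
    shapes = ["circle", "square", "triangle"]
    scene = [rng.choice(shapes) for _ in range(3 + 2 * difficulty)]
    target = rng.choice(shapes)
    question = f"How many {target}s are in the scene?"
    return Instance(scene, question, scene.count(target))


def verify(instance: Instance, model_answer: str) -> float:
    """Exact rule-based reward: 1.0 only if the parsed answer matches."""
    try:
        return 1.0 if int(model_answer.strip()) == instance.answer else 0.0
    except ValueError:
        return 0.0


rng = random.Random(0)
inst = generate(rng, difficulty=2)
print(inst.question, verify(inst, str(inst.answer)))
```

Because the answer is derived from the sampled state, new instances can be generated without any labeling, and the reward needs no learned judge.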

Identifying High-Confidence Social Biases in LLMs for Trustworthy Conversational Tutoring Agents

arXiv cs.AI · Aitor Arronte Alvarez, Naiyi Xie Fincham · 2026-06-01

The study introduces a dataset generation method to evaluate high-confidence social biases in LLM-based conversational tutoring agents, regenerating student-AI interactions with controlled bias from benchmark data. It assesses multiple LLMs' bias detection capabilities under naturalistic instructional conditions, combining computational and human evaluations. Results show bias detection is more challenging in tutoring contexts than benchmarks, with state-of-the-art LLMs exhibiting overconfidence in incorrect assessments, which influences reasoning and feedback. The findings highlight risks of biased behavior in educational LLM applications.

large language models · social biases · conversational tutoring · dataset generation · overconfidence

Defenses & Enablers For Skill Injection Attacks on Terminal Based Agents

arXiv cs.AI · Yoshinari Fujinuma, Varun Gangal, Traian Rebedea, Makesh Narasimhan Sreedhar · 2026-06-01

The paper investigates defenses against skill injection attacks on LLM agents that load reusable procedural documents (skills). It proposes two guardian-based defenses: a dynamic guardian that mediates skill-file access at run time and a static guardian that rewrites files before deployment. Evaluations across three LLM agent families show that these defenses reduce attack success rate (ASR) by over 50% while maintaining task utility. Under attack reframing, setups without a guardian reach 81.4% ASR, while the dynamic guardian holds it to 18.6%, suggesting that real-time mediation remains robust to reframed attacks.

skill injection · llm agents · dynamic guardian · attack success rate · procedural documents

S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

arXiv cs.AI · Xiwen Chen, Wenhui Zhu, Jingjing Wang, Peijie Qiu · 2026-06-01

The paper introduces S-SPPO, a semantic-calibrated self-play preference optimization framework that addresses policy degeneration in Self-Play Preference Optimization (SPPO) when aligning large language models (LLMs) with human preferences. S-SPPO employs dual-space semantic calibration: (i) Supervision Calibration, which uses semantic gating to anneal win-rate targets toward maximum-entropy baselines as semantic overlap increases, and (ii) Representation Calibration, which uses latent repulsion to enforce geometric diversity and prevent manifold collapse. The authors show theoretically that S-SPPO preserves the constant-sum game structure and therefore converges to a Nash equilibrium. Empirically, S-SPPO achieves a 52.19% win rate and a 47.46% length-controlled win rate on AlpacaEval 2.0 with Llama-3-8B, outperforming prior methods without additional human-annotated preferences.

self-play preference optimization · semantic calibration · latent repulsion · policy degeneration · nash equilibrium

GJDNet: Robust Graph Neural Networks via Joint Disentangled Learning Against Adversarial Attacks

arXiv cs.AI · Canyixing Cui, Tao Wu, Xingping Xian, Xiao-Ke Xu · 2026-06-01

GJDNet is a robust graph neural network framework that defends against adversarial attacks via joint disentangled learning, addressing structure-feature mismatches caused by perturbation-induced structural inversion. The method combines feature-driven soft structural disentanglement with skewness-aware neighbor filtering to suppress mismatches, and employs a Spherical Decision Boundary (SDB) to enhance intra-class compactness and inter-class separation. Experiments demonstrate consistent robustness across diverse graph assortativity regimes, supported by theoretical analysis of the disentangled mechanisms.

graph neural networks · adversarial attacks · disentangled learning · spherical decision boundary · structural disentanglement

RoleCDE: Benchmarking and Mitigating Role-Alignment Trade-offs in Role-Playing Agents

arXiv cs.AI · Huayi Lai, Shichao Song, Simin Niu, Hanyu Wang · 2026-06-01

The paper introduces RoleCDE, a novel benchmark for evaluating role-playing agents (RPAs) in value conflict scenarios, addressing gaps in existing role-alignment assessments. It features 8k role profiles and 24k dilemma instances across three difficulty levels and eight categories, testing role-scenario grounding and conflict resolution. Experiments reveal a 'Role Value Decoupling' phenomenon where LLMs default to alignment over role-specific values, invariant to difficulty but varying by role category. RoleCDE-based fine-tuning mitigates this decoupling while preserving role fidelity and reasoning performance.

role-playing agents · value conflict · benchmark · fine-tuning · role value decoupling

Self-Conditioned Positional HNSW for Overlap-Aware Retrieval in Chunked-Document RAG Systems: Method and Industrial Evidence-Quality Audit

arXiv cs.AI · Nataraj Agaram Sundar, Tejas Morabia · 2026-06-01

The paper introduces Self-Conditioned Positional HNSW (SCP-HNSW), a modified hierarchical navigable small world graph (HNSW) method for retrieval-augmented generation (RAG) systems that reduces redundant chunk retrieval by appending positional codes and using a two-pass query procedure. The method maintains standard HNSW construction while adding a minimum-index-gap selector for context construction. Industrial audits on 770 text reviews and 70 OCR cases show evidence quality varies (3/5 average for text, 45-95% pass rates for OCR), highlighting the need for overlap-aware retrieval and further controlled ablations.

hierarchical navigable small world graphs · retrieval-augmented generation · positional encoding · approximate nearest-neighbor search · evidence quality audit
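
Illustrative sketch (an interpretation, not the paper's algorithm): one simple way to realize a minimum-index-gap selector is to keep the best-scoring chunks greedily while rejecting any chunk whose position is too close to one already chosen. The function `select_with_min_gap`, the single-document assumption, and the toy hits are all hypothetical.

```python
from typing import List, Tuple


def select_with_min_gap(candidates: List[Tuple[int, float]],
                        k: int, min_gap: int) -> List[int]:
    """Greedily keep top-scoring chunks whose indices are >= min_gap apart."""
    chosen: List[int] = []
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for chunk_idx, _score in ranked:
        # Reject chunks that sit too close to an already selected chunk,
        # since overlapping neighbors mostly repeat the same text.
        if all(abs(chunk_idx - prev) >= min_gap for prev in chosen):
            chosen.append(chunk_idx)
        if len(chosen) == k:
            break
    return chosen


# Hypothetical retrieval result: (chunk index within one document, similarity).
hits = [(10, 0.95), (11, 0.94), (12, 0.90), (40, 0.80), (41, 0.79)]
print(select_with_min_gap(hits, k=3, min_gap=2))
```

In this toy case chunk 11 is skipped because it neighbors chunk 10, so the context window spends its budget on chunk 40 instead of near-duplicate text.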

TN-SHAP-G: Graph-Structured Tensor Network Surrogates for Shapley Values and Interactions

arXiv cs.AI · Farzaneh Heidari, Guillaume Rabusseau · 2026-06-01

The paper introduces TN-SHAP-G, a framework for efficiently computing Shapley values and higher-order interaction indices in graph-structured inputs by learning a compact tensor network surrogate. The method aligns the surrogate's topology with the input graph, enabling deterministic recovery of Shapley indices via multilinear extension without additional model queries. Experiments on molecular benchmarks demonstrate accurate approximation of exact Shapley values on small graphs and scalable performance on larger graphs where sampling methods fail.

shapley values · tensor networks · graph-structured inputs · multilinear extension · interaction indices

Joint Agent Memory and Exploration Learning via Novelty Signals

arXiv cs.AI · Shizuo Tian, Xiaohong Weng, Rui Kong, Yuxuan Chen · 2026-06-01

The paper introduces JAMEL, a framework for joint training of agent memory and exploration policies via novelty signals in open-ended environments. It addresses the computational inefficiency of raw history retention and lack of supervision in latent memory training by using deterministic novelty signals (e.g., code coverage in GUI domains) for annotation-free memory supervision. Empirical results show JAMEL generalizes to unseen environments, outperforms open-weight baselines in exploration depth, and rivals closed-source models while reducing token consumption.

exploration · memory · novelty signals · open-ended environments · latent memory

TERRA: Task-Embedded Reasoning and Representation Architecture for Cross-Domain Applications

arXiv cs.AI · Shayan Shokri · 2026-06-01

TERRA proposes a theoretical framework for cross-domain transfer learning in structured-state environments, formalizing when and how representations learned in one domain generalize to structurally analogous domains. The method models domains as controlled Markov processes on graded latent grids, using domain adapters and a shared core, with cross-domain correspondence measured via approximate MDP homomorphisms (lax bisimulation discrepancy, Gromov-Wasserstein distance). Theoretical results include a transfer bound that separates source-model error from structural mismatch and grows geometrically with the prediction horizon, and a link from latent error to decision regret via bisimulation metrics. The work lays out a preregistered experimental program to test the Structured-State Transfer Hypothesis but reports no empirical results.

cross-domain transfer · markov decision process · bisimulation metric · gromov-wasserstein distance · structured-state

Compliance-Scored Best-of-N Guardrail Orchestration for Multimodal Document Generation in Payments Dispute Defense

arXiv cs.AI · Nataraj Agaram Sundar, Tejas Morabia · 2026-06-01

The paper introduces a guardrail orchestration layer for multimodal document generation in high-stakes financial applications, combining multi-candidate generation with compliance scoring for early exit. The framework employs parallel generation heads, scores outputs against weighted guardrails (PII detection, content moderation, schema constraints), and selects the best candidate. Operational results show 91% compliance, 5 attempts within 20 seconds, and significant win-rate improvements (+11.0pp overall, +7.5pp for item-not-received cases) compared to controls. The system also includes Responsible-AI evidence-quality signals and detailed reproducibility documentation.

guardrail orchestration · compliance scoring · multimodal generation · pii detection · early exit
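
Illustrative sketch (not from the paper): compliance-scored best-of-N selection with early exit can be pictured as scoring each candidate against weighted guardrails and stopping as soon as one clears a threshold. The guardrail functions, weights, threshold, and candidate strings below are hypothetical, and candidates are scored sequentially here, whereas the summary describes parallel generation heads.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Guardrail = Callable[[str], float]  # returns a score in [0, 1]


def compliance_score(text: str,
                     guardrails: Dict[str, Tuple[Guardrail, float]]) -> float:
    """Weighted mean of the individual guardrail scores."""
    total_weight = sum(weight for _, weight in guardrails.values())
    weighted = sum(check(text) * weight for check, weight in guardrails.values())
    return weighted / total_weight


def best_of_n(candidates: Iterable[str],
              guardrails: Dict[str, Tuple[Guardrail, float]],
              threshold: float) -> Tuple[Optional[str], float]:
    """Return the best-scoring candidate, exiting early at the threshold."""
    best, best_score = None, -1.0
    for text in candidates:
        score = compliance_score(text, guardrails)
        if score > best_score:
            best, best_score = text, score
        if score >= threshold:
            break  # early exit: a compliant candidate was found
    return best, best_score


# Hypothetical guardrails: a crude PII check (weight 2) and a length limit (weight 1).
rails = {
    "no_email": (lambda t: 0.0 if "@" in t else 1.0, 2.0),
    "short": (lambda t: 1.0 if len(t) <= 40 else 0.0, 1.0),
}
drafts = ["Contact me at a@b.com", "Order delivered on 2 May.", "never scored"]
best, score = best_of_n(drafts, rails, threshold=0.9)
print(best, score)
```

The first draft fails the PII check and scores 1/3, the second clears the threshold, and the third is never scored.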

ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts

arXiv cs.AI · Heng Zhao, Zilei Shao, Guy Van den Broeck, Zhe Zeng · 2026-06-01

The paper introduces ProbMoE, a differentiable probabilistic routing framework for Mixture-of-Experts (MoE) models that addresses the non-differentiability of expert selection. The method formulates routing as probabilistic inference over cardinality-constrained expert subsets, using exact marginal probabilities for gradient estimation in Exact-$k$ routing and extending to Dynamic-$k$ routing for adaptive expert allocation. Experiments show ProbMoE Exact-$k$ achieves competitive performance with improved expert utilization and routing diversity, while Dynamic-$k$ matches performance with fewer activated experts.

mixture-of-experts · probabilistic routing · gradient estimation · dynamic-k routing · expert utilization

Agent Operating Systems (AOS): Integrating Agentic Control Planes into, and Beyond, Traditional Operating Systems

arXiv cs.AI · Ankur Sharma, Deep Shah · 2026-06-01

The paper introduces Agent Operating Systems (AOS), a systems architecture integrating agentic control planes into traditional operating systems to address challenges posed by AI agents. AOS handles scheduling, memory management, tool registries, policy enforcement, and observability, while maintaining compatibility with existing OS abstractions. The authors analyze limitations of classical OS designs, propose integration models ranging from user-space runtimes to distributed control planes, and define evaluation criteria emphasizing security and auditability. The work establishes a foundation for scalable, controllable agentic computation without replacing conventional OSes.

agentic control plane · operating systems · scheduling · memory management · auditability

On the Limits of Token Reduction for Efficient Unified Vision Language Training

arXiv cs.AI · Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv · 2026-05-31

This work investigates the limits of token reduction for accelerating unified vision-language model (VLM) training, revealing an asymmetry in attention allocation: visual understanding shows late-layer redundancy while generation maintains persistent image-token dependence. The authors design task-specific accelerators that selectively reduce image-token computation, achieving isolated efficiency gains but observing synergy loss in unified training due to divergent parameter pathways. Results demonstrate that preserving shared cross-task structures is essential for efficient joint optimization, necessitating synergy-aware acceleration strategies.

vision-language models · token reduction · attention allocation · joint optimization · autoregressive backbone

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

arXiv cs.AI · Bole Ma, Jan Eitzinger, Harald Köstler, Gerhard Wellein · 2026-05-31

The paper characterizes cross-instance attention for models that use Multi-head Latent Attention (MLA), which compresses each token's keys and values into a narrow latent vector (~1 KB), and evaluates routing queries to the instance that holds the cache instead of moving KV-cache blocks. The authors measure this on a multi-node H100 cluster and develop a topology-aware cost model with a closed-form route/fetch/local predicate that predicts batched round-trip times within ~7% error. Results show that query routing completes in tens of microseconds, versus ~3 ms for cache movement, on sparse-attention workloads. The cost model generalizes to architectures such as DeepSeek-V3.2 and GLM-5.1.

multi-head latent attention · kv-cache · sparse-attention · rdma · topology-aware cost model

TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

arXiv cs.AI · Yaxuan Kong, Qingren Yao, Yuqi Nie, Yichen Li · 2026-05-31

TimeSage-MT is a multi-turn benchmark for evaluating agentic time series reasoning, addressing the limitations of existing single-step benchmarks. The benchmark comprises 240 tasks and 2,680 dialogue turns across 8 domains, generated through a reproducible pipeline that converts real-world time series data into verifiable multi-turn conversations. It is used to evaluate LLM agents alongside TimeSage, a structured agent with a time series skill library. Results reveal significant performance drops on decision-oriented tasks, attributed to failures in memory, uncertainty handling, and domain-based decision making. The authors present these gaps as targets for future work on agentic reasoning.

time series · multi-turn benchmark · agentic reasoning · decision-oriented tasks · structured agent

ClawHub Security Signals: When VirusTotal, Static Analysis, and SkillSpector Disagree

arXiv cs.AI · Vincent Koc, Patrick Erichsen, Jacob Tomlinson, Agustin Rivera · 2026-05-31

The paper introduces ClawHub Security Signals, a sanitized dataset of 67,453 OpenClaw skill versions for studying scanner disagreement in agent-skill security. The dataset pairs skill content with verdicts from VirusTotal, static heuristic analysis, and NVIDIA SkillSpector, revealing structured disagreement: only 0.69% of flagged skills are identified by all three scanners, while 81.9% are flagged by just one. SkillSpector dominates semantic agentic-risk advisories (75.3% of suspicious rows), while VirusTotal detects 72.8% of malicious rows. The results advocate for layered governance over single-scanner decisions, and the dataset is released as a silver-standard corpus for community research.

agent skills · scanner disagreement · static heuristic analysis · semantic agentic-risk · silver-standard dataset

LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies

arXiv cs.AI · Nagarjuna Kanamarlapudi, Praveen K · 2026-05-31

The study evaluates 12 multi-agent LLM collaboration topologies for software architecture design through a $2\times2\times2$ factorial experiment (Authority $\times$ Roles $\times$ Dynamics), conducting 520 runs across 8 tasks. Designs were assessed by three automated evaluators (GPT-OSS 120B, Claude Opus 4.6, Claude Sonnet 4.6) using a 12-dimensional rubric. Key findings: (1) structural adversarial (v4b) ranks highest (4.637/5.0); (2) cross-model review consistently ranks second (4.606); (3) evaluator diversity reveals model-family biases; (4) parallel merge performs poorly (3.65-3.79) due to token starvation and fragmentation.

multi-agent collaboration · software architecture design · automated evaluators · token starvation · factorial experiment

MURMUR: An Efficient Inference System for Long-Form ASR

arXiv cs.AI · Wei-Tzu Lee, Keisuke Kamahori, Baris Kasikci · 2026-05-31

MURMUR is an efficient inference system for long-form automatic speech recognition (ASR) that balances accuracy and latency at two levels. At the inter-chunk level, it treats chunk size as a tunable hyperparameter, finding that intermediate sizes improve performance. At the intra-chunk level, it employs a sliding-window KV cache eviction policy that exploits attention sparsity for both output and speech tokens. Evaluated on AMI-IHM, MURMUR achieves single-pass accuracy while reducing latency by 4.2×, with token eviction yielding further gains at less than 1% relative tcpWER degradation. The system's code is publicly available.

automatic speech recognition · kv cache · attention sparsity · latency · tcpwer
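
Illustrative sketch (not MURMUR's implementation): a sliding-window eviction policy, in its most generic form, keeps only the most recent cache entries and drops the oldest as new tokens arrive. The class `SlidingWindowKVCache` and the string placeholders for keys and values are hypothetical; the paper's policy for speech versus output tokens may differ.

```python
from collections import deque
from typing import Deque, List, Tuple


class SlidingWindowKVCache:
    """Keep only the most recent `window` (position, key, value) entries."""

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        # A bounded deque discards its oldest element when a new one is appended.
        self.entries: Deque[Tuple[int, str, str]] = deque(maxlen=window)

    def append(self, position: int, key: str, value: str) -> None:
        self.entries.append((position, key, value))

    def positions(self) -> List[int]:
        return [position for position, _key, _value in self.entries]


cache = SlidingWindowKVCache(window=3)
for t in range(5):
    cache.append(t, f"k{t}", f"v{t}")
print(cache.positions())
```

Memory stays constant in the sequence length, at the cost of losing attention to evicted positions, which is why the approach relies on attention being sparse and mostly local.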

Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX

arXiv cs.AI · Martin Schuck, Marcel P. Rath, Yufei Hua, Abhishek Goudar · 2026-05-31

Crazyflow is a GPU-accelerated, differentiable drone simulator written in JAX that unifies fidelity, differentiability, and swarm simulation for aerial robotics. It runs over 10× faster than state-of-the-art simulators for single drones and scales to thousands of swarms with 4000 drones each. Real-world experiments demonstrate sub-centimeter trajectory tracking accuracy via analytical-gradient-based policy learning and obstacle avoidance at over half a billion steps per second. The simulator enables in-flight reinforcement learning, stabilizing a physical drone in 0.38 seconds. Crazyflow supports multiple abstraction levels, Crazyflie compatibility, and rapid reconfiguration across platforms, advancing synthetic data generation for online learning and optimization.

differentiable simulator · swarm simulation · analytical-gradient-based · in-flight reinforcement learning · system identification pipeline

A Minimalist Brain-Computer Musical Interface for Real-Time Emotion-Driven Sonification: System Design and Preliminary Evaluation

arXiv cs.AI · Pablo A. Monroy-D'Croz, Rafael Ramirez-Melendez, Julian Cespedes-Guevara · 2026-05-31

The study introduces a minimalist brain-computer musical interface (BCMI) for real-time emotion-driven sonification, mapping prefrontal EEG-derived emotional valence (via frontal alpha asymmetry at AF7/AF8) to adaptive musical features (mode, tempo, rhythm, pitch) using stochastic generation. The system combines wireless EEG, Python signal processing, and Ableton Live via Lab Streaming Layer. Evaluation with 22 participants revealed no significant modulation of neurofeedback by intentional emotional self-induction (0.40% explained variance), with individual differences (musical/acting experience) dominating signal variance. Results question frontal alpha asymmetry's reliability for voluntary emotion regulation in closed-loop BCMI systems.

brain-computer musical interface · frontal alpha asymmetry · affective sonification · stochastic generation · neurofeedback
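
Illustrative sketch (common convention, not necessarily the paper's exact pipeline): frontal alpha asymmetry is usually computed as the difference of log alpha-band power between a right and a left frontal electrode, here AF8 (right) and AF7 (left). The function name and the power values are hypothetical, and the band-power estimation step is assumed to have happened upstream.

```python
import math


def frontal_alpha_asymmetry(alpha_power_left: float,
                            alpha_power_right: float) -> float:
    """ln(right alpha power) - ln(left alpha power)."""
    if alpha_power_left <= 0 or alpha_power_right <= 0:
        raise ValueError("band power must be positive")
    return math.log(alpha_power_right) - math.log(alpha_power_left)


# Hypothetical alpha-band power at AF7 (left) and AF8 (right).
faa = frontal_alpha_asymmetry(alpha_power_left=2.0, alpha_power_right=3.0)
print(round(faa, 3))
```

Under this convention a positive score means more alpha power on the right, which is often read as relatively greater left-frontal activity; the study's results caution against treating that reading as a reliable handle on voluntary emotion regulation.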

Hierarchical Online Prompt Mutation with Dual-Loop Feedback for Guardrailed Evidence Document Generation: A Production-Evaluation Case Study

arXiv cs.AI · Nataraj Agaram Sundar, Tejas Morabia · 2026-05-31

The paper introduces Hierarchical Online Prompt Mutation (HOPM), a framework for adaptive, evidence-grounded document generation in high-stakes workflows. HOPM treats prompts as online policies, using a family/version router, deterministic guardrails, and dual-loop feedback (human review + automated judge) to update routing and mutation priorities. Evaluated on 600 marketplace dispute cases, full HOPM improved count win rate by +11.0 pp (34.7% to 45.7%), amount-weighted win rate by +19.1 pp (22.3% to 41.4%), and mean Likert quality from 3.18 to 4.40 while reducing issue-flag rate from 15.3% to 5.2%. The study includes detailed evaluation artifacts and reproducibility materials.

hierarchical prompt mutation · dual-loop feedback · deterministic guardrails · online policies · evidence-grounded generation

Emergent Transfer of a Physics Foundation Model from Simulation to Laboratory Turbulence

arXiv cs.AI · Payel Mukhopadhyay, Stefan S. Nixon, Romain Watteaux, Michael McCabe · 2026-05-31

The study demonstrates that physics foundation models can effectively transfer from simulation to laboratory conditions, addressing the longstanding discrepancy in Rayleigh-Taylor instability (RTI) mixing growth rates between simulations ($\alpha \sim 0.02$) and experiments ($\alpha \sim 0.06$-$0.07$). Finetuning Walrus, a continuum dynamics foundation model, on ≤3 DNS realizations enables accurate RTI physics prediction over long rollouts. Zero-shot application to laboratory data yields growth rates within the experimental band, despite no exposure to experimental samples, implicating initial conditions in the sim-experiment gap. The model also generalizes to untrained stable stratification regimes, correctly modulating mixing-layer growth.

physics foundation model · rayleigh-taylor instability · direct numerical simulation · zero-shot transfer · continuum dynamics
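
For context, the growth-rate coefficient quoted in the summary above is the prefactor in the standard self-similar growth law for the Rayleigh-Taylor mixing layer. The symbols below follow the usual textbook convention and are not taken from the paper: $h$ is the mixing-layer (bubble) penetration height, $A$ the Atwood number, $g$ the acceleration, and $\rho_h$, $\rho_l$ the heavy and light fluid densities.

```latex
h(t) \approx \alpha \, A \, g \, t^{2},
\qquad
A = \frac{\rho_{h} - \rho_{l}}{\rho_{h} + \rho_{l}}
```

Since $h$ is linear in the coefficient, the gap between the simulation and experimental values quoted above corresponds to roughly a threefold difference in mixing-layer height at a given time.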

Computation-Aware Kalman Filtering with Model Selection for Neural Dynamics

arXiv cs.AI · JR Huml, Jonathan Wenger, John P. Cunningham · 2026-05-31

The paper introduces Computation-Aware State-Space Model (CASSM), a Bayesian framework for neural dynamics that addresses computational uncertainty and model selection in scale-imbalanced regimes (few trials, many neurons). CASSM combines a novel training loss and optimization scheme to enable tractable inference in large state-spaces while maintaining uncertainty calibration. Experiments on synthetic and real data demonstrate competitive performance with deep networks, particularly in data-scarce scenarios, with improved uncertainty quantification over prior Bayesian approaches. The work provides practical guidance for selecting dynamical latent variable models based on dataset characteristics.

bayesian methods · state-space model · neural dynamics · uncertainty calibration · model selection

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

arXiv cs.AI · Mingzhong Sun, Teresa Yeo, Armando Solar-Lezama, Tan Zhi-Xuan · 2026-05-31

The study identifies a production-evaluation gap in large reasoning models (LRMs), contrasting with human reasoning where evaluation skills are typically stronger. Using the Valid-Answer-Invalid-Reasoning (VAIR) dataset, which isolates reasoning evaluation from production, the authors find LRMs score as low as 48% in evaluation despite near-perfect production performance. Chain-of-thought analysis and linear probes reveal an answer confirmation bias, where LRMs prioritize answer validity over stepwise reasoning verification, with causal patching demonstrating the bias's dependence on answer representations.

large reasoning models · production-evaluation gap · valid-answer-invalid-reasoning · answer confirmation bias · causal patching

Transferring Information Across Interventions in Causal Bayesian Optimization

arXiv cs.AI · Mohammad Ali Javidian · 2026-05-31

The paper proposes graph-coupled causal Bayesian optimization, a method that transfers information across interventions by leveraging shared causal parameters in a causal graph. The approach introduces a causal kernel that lets evidence from one intervention improve estimates of related interventions, and it is particularly effective in identifiable linear Gaussian causal models, where the kernel exhibits low rank. Theoretical analysis demonstrates logarithmic growth in information-gain bounds and a regret bound that separates optimization, causal estimation, and intervention selection errors. Empirical evaluations on Gaussian systems, stress tests, and benchmarks show improved performance, especially when direct interventions on target parents are unavailable and sparse interventional data must be reused.

causal bayesian optimization · causal kernel · linear gaussian models · intervention transfer · regret bound

Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence

arXiv cs.AI · Fiona Y. Wang, Markus J. Buehler · 2026-05-31

The article presents a category-theoretic framework for agentic AI systems in scientific discovery, distinguishing between fixed-regime operations and regime transitions. Using copresheaves and left Kan extensions, it formalizes how artifacts are preserved and compared across representational regime changes. The framework is instantiated in two systems: Builder/Breaker, which revises protein-mechanics models under Minimum Description Length constraints, and CategoryScienceClaw, a proof-carrying knowledge-computation graph for materials science. Results demonstrate category theory's dual role as both a mathematical language and engineering specification for self-revising discovery systems.

category theory · copresheaf · left kan extension · minimum description length · proof-carrying knowledge

ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning

arXiv cs.LG · Yu-Cheng Shi, Zhen-Hao Xie, Jun-Tao Tang, Da-Wei Zhou · 2026-06-01

ProtoAda is a prototype-guided adaptive tuning framework for Multimodal Continual Instruction Tuning (MCIT) that addresses inter-task interference and ineffective expert collaboration in sparse architectures such as Mixture of LoRA Experts. The method employs format-aware task prototypes to align task assignment and routing with both task semantics and output structure, consolidating format-compatible updates in a geometry-aware manner. Extensive experiments on multiple benchmarks demonstrate ProtoAda's superior performance, particularly on tasks with answer structures vulnerable to sequential tuning.

multimodal continual instruction tuning · prototype-guided tuning · mixture of lora experts · task semantics · geometry-aware updates

IntraShuffler: A Privacy Preserving Framework for Heterogeneous DP Federated Learning

arXiv cs.LG · Farhin Farhad Riya, Olivera Kotevska, Jinyuan Stella Sun · 2026-06-01

IntraShuffler is a privacy-preserving framework for Heterogeneous Differential Privacy Federated Learning (HDP-FL) that mitigates Privacy Inference Attacks while maintaining ε-aware aggregation. The method groups clients into privacy-compatible buckets and performs parameter-level shuffling within each bucket to disrupt the persistent gradient structure induced by non-IID data. Experiments on four datasets demonstrate that IntraShuffler reduces gradient recoverability by over 60%, decreases surrogate inference accuracy from 0.78 to 0.33, and preserves model utility across multiple FL aggregation rules.

heterogeneous differential privacy · federated learning · privacy inference attack · non-iid data · ε-aware aggregation
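
Illustrative sketch (an interpretation, not the paper's algorithm): one way to read "parameter-level shuffling within each bucket" is that, for every parameter coordinate, the values contributed by the clients in a bucket are permuted among those clients. The function `shuffle_within_bucket` and the toy updates are hypothetical, and the differential-privacy noise and bucketing steps are omitted.

```python
import random
from typing import List


def shuffle_within_bucket(updates: List[List[float]],
                          rng: random.Random) -> List[List[float]]:
    """Independently permute each parameter coordinate across a bucket's clients."""
    n_clients, dim = len(updates), len(updates[0])
    shuffled = [[0.0] * dim for _ in range(n_clients)]
    for j in range(dim):
        column = [update[j] for update in updates]
        rng.shuffle(column)  # breaks the link between a client and its value
        for i in range(n_clients):
            shuffled[i][j] = column[i]
    return shuffled


# Hypothetical bucket of three clients with two parameters each.
bucket = [[0.1, 1.0], [0.2, 2.0], [0.3, 3.0]]
mixed = shuffle_within_bucket(bucket, random.Random(1))
print(mixed)
```

Each coordinate keeps the same multiset of values, so an unweighted bucket average is unchanged, while no single shuffled row is guaranteed to correspond to one client's full update.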

Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation

arXiv cs.LG · Wenbin Wu · 2026-06-01

The study introduces a three-level audit protocol to detect and manipulate asset-specific biases in financial large language models (LLMs), focusing on Bitcoin. Behavioral audits of eight frontier LLMs reveal frame-dependent Bitcoin rankings, while sparse-autoencoder feature analysis in Gemma 3 identifies a Bitcoin-selective feature with causal influence. Amplifying this feature increases Bitcoin's portfolio share by 5.2 percentage points (pp), while suppressing it reduces exposure by 4.6 pp, demonstrating bounded behavioral leverage. The framework connects internal representations to financial decisions, offering a foundation for know-your-agent (KYA) standards in autonomous financial agents.

large language models · behavioral audit · sparse-autoencoder · portfolio allocation · know-your-agent

Drifting Preference Optimization for One-Step Generative Models

arXiv cs.LG · Zhou Jiang, Yandong Wen, Zhen Liu · 2026-06-01

The paper introduces Drifting Preference Optimization (DrPO), an online preference-finetuning method for deterministic one-step generative models like SD-Turbo and SDXL-Turbo. DrPO samples candidate images, ranks them using a target reward, and synthesizes feature-space updates via a non-parametric dipole preference field and reference drift from the frozen base generator. This approach eliminates reward-model backpropagation, reducing HPSv3 training computation by 3.51× while improving alignment over reward-gradient-free baselines. Evaluations on benchmarks including HPSv3 and GenEval demonstrate effectiveness with black-box or non-differentiable rewards.

preference optimization · one-step generation · feature-space drift · non-parametric dipole · reward alignment

A Biconvex Formulation for Stable Transport of Mixture Models with a Unique Solution

arXiv cs.LG · Yeganeh Marghi, Kelly Jin, Uygar Sümbül · 2026-06-01

The authors propose Optimal Mixture Transport (OMT), a scalable optimal transport framework that maps between mixture models rather than individual samples, formulated as a strictly biconvex optimization problem with a unique global minimizer. OMT leverages exponential-family distributions for subpopulations, decoupling computational complexity from sample size and scaling only with the number of mixture components. Theoretical analysis demonstrates stability under bounded distribution perturbations. Experiments validate OMT on synthetic benchmarks and real-world datasets, including image data and single-cell RNA sequencing.

optimal transport · mixture models · biconvex optimization · exponential-family distributions · single-cell rna sequencing

Towards Automated Discovery: A Review of Generative Models, Multimodal Learning and Closed-Loop Workflows in Inverse Materials Design

arXiv cs.LG · Anand Babu, Rogério Almeida Gouvêa, Gian-Marco Rignanese · 2026-06-01

The review systematizes advances in inverse materials design, focusing on generative models for crystalline solids and closed-loop discovery workflows. It analyzes how variational autoencoders, normalizing flows, autoregressive models, and diffusion models learn chemical-structural priors from materials databases while enforcing physical constraints through representation choices and sampling-time guidance. The work also examines multimodal learning approaches that integrate crystal structures with electronic properties, spectroscopy data, and scientific text, alongside inverse-design strategies combining conditional generation with latent optimization and active learning. Key challenges identified include surrogate exploitation, diversity collapse, and the stability-synthesizability gap, with proposed evaluation metrics addressing validity, novelty, and cost.

inverse materials design · generative crystal modeling · multimodal learning · closed-loop workflows · physical priors

Expressivity of congruence-based architectures for DNNs on positive-definite matrices

arXiv cs.LG · Antonin Oswald, Estelle Massart · 2026-06-01

The paper analyzes the expressivity limitations of congruence-based neural architectures for symmetric positive-definite (SPD) matrix classification, particularly SPDNet. It demonstrates that imposing semi-orthogonality constraints on weight matrices $W$ reduces spectral diversity in congruence-like layers (where input matrices are transformed as $WXW^T$), causing multi-layer architectures to collapse to single-layer equivalents under certain activation functions. This limitation stems from Poincaré's separation theorem. The work also evaluates Riemannian classifiers' compatibility with congruence-layer feature maps.

spd matrices · congruence layers · expressivity · poincaré separation · riemannian classifiers
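
For reference, the separation theorem invoked in the summary above can be stated as follows. This is the standard textbook form; the symbols $B$, $n$, $k$ and $\lambda_i$ are notation chosen here and are not taken from the paper.

```latex
% Poincaré separation theorem (standard statement, notation assumed here).
% X: n x n symmetric matrix, W: k x n with orthonormal rows,
% eigenvalues sorted in descending order.
\[
  W W^{T} = I_k, \quad B = W X W^{T}
  \;\Longrightarrow\;
  \lambda_i(X) \;\ge\; \lambda_i(B) \;\ge\; \lambda_{n-k+i}(X),
  \qquad i = 1, \dots, k .
\]
```

In words, the spectrum of a semi-orthogonal congruence $WXW^T$ is interlaced with the spectrum of its input, which is consistent with the reduced spectral diversity the summary describes.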

Physics-Informed Residuals for Adaptive Mesh Refinement in Finite-Difference PDE Solvers

arXiv cs.LG · Henry Kasumba, Ronald Katende · 2026-06-01

The paper proposes using physics-informed neural networks (PINNs) as off-grid residual probes to guide adaptive mesh refinement (AMR) in finite-difference PDE solvers. The method samples PINN residuals to generate cellwise refinement indicators before final computation with a classical solver. Evaluated on the 1D Burgers equation, PINN-threshold refinement achieves 0.021067 relative L² error with 60 DOF, outperforming uniform refinement (0.022617 error, 192 DOF) and reducing error by 67.5% at matched mesh size. Results extend to 2D/3D benchmarks, showing PINN-guided AMR can transfer physics-informed diagnostics while preserving classical solver accuracy.

adaptive mesh refinement · physics-informed neural networks · finite-difference methods · partial differential equations · residual indicators
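
The threshold-refinement idea in the item above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `residual` is a hypothetical stand-in for a trained PINN's residual, and the sampling density, threshold, and bisection rule are assumptions.

```python
import math

def residual(x):
    # Hypothetical stand-in for a trained PINN's PDE residual:
    # large near a steep front at x = 0.5, negligible elsewhere.
    return math.exp(-((x - 0.5) / 0.02) ** 2)

def cell_indicators(edges, probe, samples_per_cell=8):
    """Max |residual| over off-grid sample points inside each cell."""
    indicators = []
    for left, right in zip(edges[:-1], edges[1:]):
        points = [left + (j + 0.5) * (right - left) / samples_per_cell
                  for j in range(samples_per_cell)]
        indicators.append(max(abs(probe(p)) for p in points))
    return indicators

def refine(edges, probe, threshold):
    """Bisect every cell whose indicator exceeds the threshold."""
    indicators = cell_indicators(edges, probe)
    new_edges = [edges[0]]
    for left, right, eta in zip(edges[:-1], edges[1:], indicators):
        if eta > threshold:
            new_edges.append(0.5 * (left + right))  # insert the cell midpoint
        new_edges.append(right)
    return new_edges

coarse = [i / 10 for i in range(11)]   # 10 uniform cells on [0, 1]
fine = refine(coarse, residual, threshold=0.1)
```

Only the two cells touching the front are split here; the refined mesh would then be handed to the classical finite-difference solver.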

Speculative Sampling For Faster Molecular Dynamics

arXiv cs.LG · Arthur Kosmala, Stephan Günnemann, Meng Gao, Brandon Wood · 2026-06-01

The paper introduces Langevin Speculative Dynamics (LSD), a model-agnostic speculative sampler for accelerating molecular dynamics (MD) simulations without introducing relative error. Inspired by speculative methods in language and diffusion modeling, LSD employs a draft model to propose fast simulation steps, verifies them in parallel with a slower target model, and applies a transport map between distributions. The method extends speculative sampling to second-order Langevin dynamics, achieves 3-9x speedups across diverse systems, and theoretically preserves the target distribution. Empirical results confirm LSD samples correctly from the target model.

speculative sampling · langevin dynamics · molecular dynamics · transport map · parallel verification
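
As background for the item above, the sketch below shows the classic discrete accept/reject rule from speculative decoding in language models, which the summary cites as inspiration. It is not LSD itself (LSD works on continuous Langevin dynamics with a transport map); the distributions `target` and `draft` are made-up examples.

```python
import random

def speculative_step(p, q, rng):
    """One draft-then-verify step. p: target probabilities, q: draft probabilities."""
    x = rng.choices(range(len(q)), weights=q)[0]   # cheap draft proposal
    if rng.random() < min(1.0, p[x] / q[x]):        # verify against the target
        return x
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]

def output_distribution(p, q):
    """Exact law of speculative_step, computed in closed form."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]  # q(x) * min(1, p(x)/q(x))
    reject_mass = 1.0 - sum(accept)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:                                # p == q: nothing is rejected
        return accept
    return [a + reject_mass * r / total for a, r in zip(accept, residual)]

target = [0.5, 0.3, 0.2]
draft = [0.2, 0.5, 0.3]
out = output_distribution(target, draft)
```

The closed-form check confirms that the accepted and resampled mass add up to the target distribution exactly, which is the property a speculative sampler has to preserve.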

On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters

arXiv cs.LG · Mind Lab, Song Cao, Vic Cao · 2026-06-01

The paper repositions parameter-efficient fine-tuning (PEFT) as a substrate for persistent personal models rather than merely a cost-effective alternative to full fine-tuning. It proposes a framework with three scaling axes: Scale Up (leveraging stronger shared priors), Scale Down (minimizing adapter size while maintaining reliability), and Scale Out (managing many adapted instances). The authors introduce MinT, an infrastructure for adapter management, demonstrating PEFT's viability for maintaining instance-specific behaviors atop foundation models. Results suggest PEFT can support million-scale personalized models with trillion-parameter bases through compact, persistent adapters.

parameter-efficient fine-tuning · foundation models · adapter scaling · personalized models · instance-specific behavior

Spectral Audit of In-Context Operator Networks

arXiv cs.LG · Zhiwei Gao, Liu Yang, George Em Karniadakis · 2026-06-01

The paper introduces a Jacobian-based spectral audit framework for evaluating in-context operator learning, addressing limitations of prediction-error metrics. By analyzing the tangent operator derived from network Jacobians via Fourier projections, the method characterizes local spectral properties such as frequency-dependent gains, phase structure, and cross-mode coupling. Results reveal operator-level phenomena (phase transport, nonlinear mode coupling) and detect failures (high-frequency degradation, incorrect phase recovery) that prediction metrics obscure, showing that prediction accuracy and operator fidelity are distinct. The audit serves as a diagnostic for stability, sensitivity, and operator consistency in neural operators.

neural operators · spectral audit · in-context learning · tangent operator · fourier analysis

Investigating and Alleviating Harm Amplification in LLM Interactions

arXiv cs.LG · Ruohao Guo, Wei Xu, Alan Ritter · 2026-06-01

The paper introduces HarmAmp, a benchmark for evaluating multi-turn harm amplification in LLMs across twelve risk categories, addressing gaps in existing work by focusing on compounded risks through extended interactions. It proposes TrajSafe, a proactive monitoring system that anticipates harmful trajectories and intervenes via intent probing and safer completion steering. Experiments show TrajSafe significantly reduces harmfulness in multi-turn interactions while maintaining low over-refusal rates and preserving model capabilities.

harm amplification · multi-turn interactions · proactive monitoring · llm safety · risk mitigation

A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL

arXiv cs.LG · Lei Yang, Siyu Ding, Deyi Xiong · 2026-06-01

The paper develops a local perturbation theory to explain cross-domain interference in multi-domain reinforcement learning (RL) for large language models, challenging existing explanations based on catastrophic forgetting or global gradient conflict. The authors demonstrate that single-domain RL induces sparse, small-magnitude parameter edits with weak neuron overlap, while domains share active computation routes where update directions determine synergy or conflict. Theoretical analysis under a local perturbation model reveals second-order damage concentrated in a low-dimensional shared conflict subspace, with empirical validation showing that brief domain refreshes (e.g., Re-Math) recover performance (Math from 57.66 to 66.04) while preserving other domains, achieving a best average score of 66.39.

reinforcement learning · cross-domain interference · local perturbation theory · sparse updates · conflict subspace

How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations

arXiv cs.LG · William Dorrell · 2026-06-01

The paper develops a theoretical framework for understanding what properties make dictionary features optimal in Sparse Autoencoders (SAEs), bypassing traditional data-generating models. By extending local optimality analyses to nonnegative joint-optimization problems, the author derives constraints linking optimal SAE features to their distributions. These constraints explain empirical SAE behaviors like hierarchical splitting, residual structures, and dense antipodal features, revealing how L1 regularization and nonnegativity interact with data. The work also introduces a novel large-dictionary convex problem, exploring the wide atom-per-datapoint limit to inform future SAE design.

sparse autoencoders · dictionary learning · nonnegative optimization · l1 regularization · interpretability

TabPrep: Closing the Feature Engineering Gap in Tabular Benchmarks

arXiv cs.LG · Andrej Tschalzev, Nick Erickson, Yuyang Wang, Huzefa Rangwala · 2026-06-01

TabPrep introduces a lightweight preprocessing pipeline for tabular data that addresses the feature engineering gap in modern benchmarks. The method employs specialized feature generators targeting three structural data patterns, revealing blind spots in common model classes. Evaluated on TabArena, TabPrep consistently improves performance across tree-based, neural, linear, and foundation models, outperforming automated feature engineering approaches in both accuracy and scalability.

tabular machine learning · feature engineering · preprocessing pipeline · model benchmarking · structural patterns

Local Preferential Bayesian Optimization

arXiv cs.LG · Johanna Menn, Miriam Kober, Paul Brunzema, David Stenger · 2026-06-01

The authors propose local preferential Bayesian optimization (PBO) methods for high-dimensional preference-based optimization, addressing limitations of global PBO approaches. They introduce trust-region and derivative-informed local search techniques adapted to pairwise preference feedback, leveraging Laplace-approximated GP posterior derivatives. Evaluations on GP sample paths, benchmark functions, and policy-search tasks demonstrate superior performance in high-dimensional landscapes with steep optima, significantly reducing cumulative regret compared to global baselines.

preferential bayesian optimization · trust-region methods · laplace approximation · derivative-informed optimization · policy search

Doing well with less! On Sampling Techniques for Empirical Pairwise Loss Estimation/Minimization

arXiv cs.LG · Louise Davy, Stephan Clémençon, Charlotte Laclau · 2026-06-01

The paper proposes sampling-based methods for efficient pairwise loss estimation in large-scale machine learning tasks like similarity learning and ranking. By applying survey sampling techniques to select informative pairs directly (rather than individual observations), the approach achieves comparable performance to full pairwise evaluation while reducing computational costs. Theoretical analysis and experiments show that prioritizing high-information pairs using auxiliary data yields accuracy close to exhaustive methods, particularly for high-dimensional embeddings in vision and graph learning.

pairwise loss · sampling techniques · similarity learning · computational efficiency · high-dimensional embeddings

Parameter-efficient Dual-encoder Architecture with Differentiable Choquet Integral Fusion for Underwater Acoustic Classification

arXiv cs.LG · Amirmohammad Mohammadi, Joshua Peeples, Alexandra Van Dine · 2026-06-01

The paper proposes a parameter-efficient dual-encoder architecture for underwater acoustic classification, combining waveform and spectrogram representations via differentiable Choquet integral fusion. The method employs pre-trained backbones with parameter-efficient fine-tuning modules and introduces a fuzzy aggregation mechanism to dynamically balance temporal and spectral features while providing interpretability through learned fuzzy measures. Evaluations on DeepShip and ShipsEar datasets show improved classification accuracy over single-encoder baselines, with constrained parameter counts to prevent overfitting on limited acoustic data.

dual-encoder · choquet integral · parameter-efficient · underwater acoustic · fuzzy aggregation
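
To make the fusion step in the item above concrete, here is a minimal sketch of a discrete Choquet integral over two encoder scores. The fuzzy measures are fixed by hand for illustration; in the paper they are learned, and the score values and dictionary layout here are assumptions.

```python
def choquet(values, measure):
    """Discrete Choquet integral of non-negative scores.

    values:  dict mapping source name -> score
    measure: dict mapping frozenset of sources -> weight in [0, 1]
    """
    order = sorted(values, key=values.get, reverse=True)   # highest score first
    total = 0.0
    for i, source in enumerate(order):
        next_value = values[order[i + 1]] if i + 1 < len(order) else 0.0
        top = frozenset(order[: i + 1])                    # the i+1 best sources
        total += (values[source] - next_value) * measure[top]
    return total

scores = {"waveform": 0.9, "spectrogram": 0.4}
full = frozenset(scores)
w, s = frozenset({"waveform"}), frozenset({"spectrogram"})

additive = {w: 0.3, s: 0.7, full: 1.0}   # reduces to a weighted average
as_min = {w: 0.0, s: 0.0, full: 1.0}     # reduces to the minimum score
as_max = {w: 1.0, s: 1.0, full: 1.0}     # reduces to the maximum score

fused = {name: choquet(scores, m)
         for name, m in [("additive", additive), ("min", as_min), ("max", as_max)]}
```

The three hand-set measures show why the learned measure is interpretable: its values place the fusion somewhere between a minimum, a weighted average, and a maximum of the two encoders.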

Entropy Minimization without Model Collapse: Mitigating Prediction Bias in Medical Imaging

arXiv cs.LG · Tim Nielen, Sameer Ambekar, Johannes Kiechle, Daniel M. Lang · 2026-06-01

The paper identifies prediction bias as a failure mode of entropy minimization (EM) in test-time adaptation, where distribution shifts cause feature clusters to merge while decision boundaries remain fixed, skewing predicted class distributions. The authors propose Distribution Shift Bias Reduction (DSBR), a bias-correcting objective that equalizes each class's contribution to the EM loss. Evaluated on four medical-imaging datasets and ImageNet-C, DSBR stabilizes adaptation, prevents model collapse, and matches or outperforms state-of-the-art methods while operating solely at test-time.

entropy minimization · test-time adaptation · prediction bias · model collapse · distribution shift

Hallucination-Aware Diffusion Sampling for Inverse Problems via Robust Prior Updates

arXiv cs.LG · Pengfei Jin, Yiqi Tian, Kailong Fan, Bingjie Qi · 2026-06-01

The paper proposes Robust Prior Update (RPU), a module for diffusion-based inverse problem solvers that mitigates measurement-conditioned hallucination by stabilizing the prior update step while preserving measurement conditioning. RPU analyzes local stability of the diffusion prior update, re-anchors displacements to current iterates, and leaves measurement updates intact. Evaluated on FFHQ and ImageNet inverse problems (box inpainting, Gaussian/motion deblurring), RPU improves PSNR and LPIPS over DPS, with human studies showing 91.9% non-tie preference for FFHQ inpainting. Results demonstrate that robust prior updates enhance instance faithfulness, particularly for weakly constrained content.

diffusion models · inverse problems · hallucination mitigation · prior update · instance faithfulness

Riemannian Gradient Descent for Low-Rank Architectures

arXiv cs.LG · Nicholas Knight · 2026-06-01

The paper investigates Riemannian optimization techniques for low-rank matrix parameterizations in deep learning, examining ten design variants involving different geometries for rank-r matrices and partial isometries, including block-matrix extensions. The methods are applied to multihead attention parameters in small language models, with rigorous learning rate tuning. Results show no conclusive performance improvement over AdamW baselines despite the geometric framework. Implementations are publicly released.

riemannian optimization · low-rank matrices · partial isometries · multihead attention · adamw baseline

Deep Learning for Remote Sensing to Improve Flood Inundation Mapping

arXiv cs.LG · Yogesh Bhattarai, Vijay Chaudhary, Wai Lim Kim, Sanjib Sharma · 2026-06-01

This study introduces a cloud-removal framework for flood imagery using Denoising Diffusion Probabilistic Models (DDPM) with a Masked Diffusion Transformer architecture. The method leverages self-attention mechanisms and masked token modeling to reconstruct cloud-obscured regions in Sentinel-2B multispectral imagery, preserving hydrological consistency. Evaluations demonstrate improved reconstruction fidelity and spectral signature preservation for water detection indices, offering a robust solution for continuous flood monitoring under cloud cover constraints.

denoising diffusion probabilistic models · masked diffusion transformer · self-attention mechanisms · multispectral sentinel-2b · hydrological consistency

Measurement Geometry and Design for Trustworthy Generative Inverse Problems

arXiv cs.LG · Pengfei Jin, Na Li, Quanzheng Li · 2026-06-01

The paper introduces a measurement-geometry framework to assess trustworthiness in generative inverse problems, where plausible reconstructions may arise from either measurements or prior-driven hallucinations. The authors propose a local measurement-manifold compatibility measure to quantify how well acquisition operators observe prior-relevant tangent directions, with theoretical guarantees linking this measure to reconstruction error stability. Practical fixed and sequential acquisition rules are derived, including a posterior-cloud design for adaptive test-time measurements. Experiments on row-sampling, tomographic, and MR acquisitions (e.g., fastMRI Cartesian sampling) demonstrate improved sampling strategies over baselines, with the proposed scores explaining failure modes and reducing hallucinations.

generative inverse problems · measurement geometry · local compatibility measure · posterior-cloud design · fastmri

Regularized Large Neighborhood Search

arXiv cs.LG · Germain Vivier-Ardisson, Laurent Demonet, Axel Parmentier, Mathieu Blondel · 2026-06-01

The paper introduces Regularized Large Neighborhood Search (RLNS), a method that bridges the gap between heuristic combinatorial optimization and neural network integration by transforming Large Neighborhood Search (LNS) into an efficient MCMC sampler. RLNS achieves this by regularizing or perturbing local subproblems, enabling exact block Gibbs sampling under entropic regularization. The approach allows interpolation between pseudolikelihood and exact maximum likelihood estimation, facilitating end-to-end learning without requiring global solvers. Empirical validation is provided on $k$-subset selection, generalized assignment, and stochastic vehicle scheduling problems.

regularized large neighborhood search · mcmc sampler · entropic regularization · block gibbs sampling · pseudolikelihood

Massive Spikes in LLMs are Bias Vectors: Mechanistic Uncovering and Spike-Free Quantization

arXiv cs.LG · Yung-Chin Chen, Chung Peng Lee, Ze-Wei Liou, Naveen Verma · 2026-06-01

The paper identifies massive activation spikes in Large Language Models (LLMs) as structural vector biases, not scalar biases, demonstrating their role in attention sink and value-state drain mechanisms. Through geometric analysis of projection weights ($W_K$, $W_Q$, $W_V$) and Rotary Positional Embedding (RoPE) interactions, it shows these biases are preserved in 'zones of rotational stability'. The proposed INSERTQUANT framework clamps spikes and uses pre-computed template vectors to enable spike-free, low-bit quantization, achieving parity with state-of-the-art per-tensor methods and generalizing to modalities like ViTs.

activation spikes · vector biases · rotational stability · low-bit quantization · attention sink

Physics-Guided Recurrent State-Space Neural Networks for Multi-Step Prediction

arXiv cs.LG · Ruiyuan Li, Ajay Seth, Manon Kok · 2026-06-01

The paper introduces PG-RSSNN, a physics-guided recurrent state-space neural network combining physical knowledge with deep learning for improved multi-step prediction. The method employs recurrent structures to enable non-saturating activation functions, addressing vanishing gradients and numerical divergence in training. Evaluated on systems including linear state-space models with Gaussian noise, a robotic arm, and a cascaded water tank, PG-RSSNN demonstrates stable training and superior prediction accuracy compared to black-box neural networks and physics-only models, even with limited data or partial physical knowledge.

physics-guided · recurrent state-space · multi-step prediction · vanishing gradients · non-saturating activation

A combination of noise and bilateral filters achieve supralinear and scalable adversarial robustness in CNNs

arXiv cs.LG · Nicolas Stalder, Benjamin F. Grewe, Matteo Saponati, Pau Vilimelis Aceituno · 2026-06-01

The paper proposes a computationally efficient preprocessor combining Gaussian noise and bilateral filtering to enhance adversarial robustness in CNNs. Theoretically, these techniques exhibit complementary mechanisms, yielding supralinear robustness when combined. Experiments on RobustBench demonstrate that the method, integrated with adversarial training, ranks second on AutoAttack and third overall, using only ~35% training FLOPs, ~50% fewer parameters, ~33% fewer epochs, and ~15% less data compared to state-of-the-art defenses. The approach scales efficiently, matching accuracy with 2-8x less compute across three orders of magnitude, offering negligible overhead and a theoretically grounded design.

adversarial robustness · gaussian noise · bilateral filtering · supralinear improvement · robustbench
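
The preprocessor in the item above pairs noise injection with an edge-preserving filter. The sketch below is a one-dimensional toy version under stated assumptions: the paper targets images, and the radius, `sigma_space`, `sigma_range`, and noise level used here are illustrative values, not the authors' settings.

```python
import math
import random

def bilateral_1d(signal, radius=2, sigma_space=1.0, sigma_range=0.2):
    """Edge-preserving smoothing: weights decay with distance AND value difference."""
    out = []
    for i, center in enumerate(signal):
        num = den = 0.0
        for j in range(max(0, i - radius), min(len(signal), i + radius + 1)):
            w = math.exp(-((i - j) ** 2) / (2 * sigma_space ** 2)
                         - ((signal[j] - center) ** 2) / (2 * sigma_range ** 2))
            num += w * signal[j]
            den += w
        out.append(num / den)
    return out

def preprocess(signal, noise_std, rng):
    """Add Gaussian noise, then apply the bilateral filter."""
    noisy = [s + rng.gauss(0.0, noise_std) for s in signal]
    return bilateral_1d(noisy)

step = [0.0] * 8 + [1.0] * 8           # a sharp edge between index 7 and 8
smoothed = bilateral_1d(step)          # the edge survives the filter
defended = preprocess(step, noise_std=0.05, rng=random.Random(0))
```

Because the range term suppresses neighbours with very different values, the filter averages away small perturbations while leaving the step edge essentially intact.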

ArrythML: An Autoencoder-Based TinyML Approach for On-Device Arrhythmia Detection on Resource-Constrained Embedded Systems

arXiv cs.LG · Nagarajan S, Kurian Polachan · 2026-06-01

The paper introduces ArrythML, an autoencoder-based TinyML method for on-device arrhythmia detection on resource-constrained embedded systems. The approach employs INT8-quantized autoencoders with minimal layers and parameters, validated on a custom dataset derived from the MIT-BIH Arrhythmia Database. Evaluations on an ESP32-S3 microcontroller show 84% recall, 79% F1-score, 180 KB model size, and 9 ms inference latency, demonstrating feasibility for low-power wearable systems.

tinyml · autoencoder · int8 quantization · ecg segmentation · esp32-s3
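
The summary above does not say which INT8 scheme the authors use. As a rough illustration only, the sketch below shows symmetric per-tensor quantization, a common choice assumed here, applied to a made-up weight vector.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [scale * v for v in q]

weights = [0.82, -0.41, 0.05, -1.27, 0.33]   # hypothetical layer weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale.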

ShaplEIG: Bayesian Experimental Design for Shapley Value Estimation

arXiv cs.LG · David Rundel, Fabian Fumagalli, Maximilian Muschalik, Bernd Bischl · 2026-06-01

ShaplEIG introduces a Bayesian experimental design method for efficient Shapley value estimation in costly evaluation settings. The approach employs Gaussian process surrogates to model the value function and adaptively selects coalitions via expected information gain, leveraging the Shapley values' linearity for closed-form solutions. A polynomial-time computation scheme using elementary symmetric polynomials reduces complexity from exponential to polynomial in player count. Experiments demonstrate superior sample efficiency over baselines in low-budget regimes across diverse applications.

shapley values · bayesian experimental design · gaussian process · sample efficiency · coalition sampling
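
For context on what the item above estimates, this sketch computes exact Shapley values by enumerating every coalition. It is the brute-force definition, not ShaplEIG: its cost grows exponentially with the number of players, which is the expense that adaptive coalition selection tries to avoid. The toy game `toy_value` is hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: weighted average of marginal contributions."""
    n = len(players)
    phi = {}
    for p in players:
        others = [o for o in players if o != p]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

def toy_value(coalition):
    # Hypothetical game: additive payoffs plus a bonus when "a" and "b" cooperate.
    base = {"a": 1.0, "b": 2.0, "c": 3.0}
    bonus = 4.0 if {"a", "b"} <= coalition else 0.0
    return sum(base[p] for p in coalition) + bonus

phi = shapley_values(["a", "b", "c"], toy_value)
```

The cooperation bonus is split evenly between "a" and "b", and the values sum to the payoff of the full coalition, which is the efficiency property. Each Shapley value is a fixed linear combination of value-function evaluations, the linearity the summary refers to.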

BlockGen: Flexible Blockwise Sequence Modeling with Hybrid Samplers

arXiv cs.LG · Justin Deschenaux, Caglar Gulcehre · 2026-06-01

BlockGen introduces a blockwise sequence modeling framework that combines masked and uniform diffusion approaches with hybrid sampling. The method employs adaptive block sizes during training and proposes AR-informed predictor-corrector sampling (ARPC) to selectively regenerate low-probability tokens without external verification. Experiments show uniform diffusion outperforms masked diffusion under ancestral sampling (especially in few-step regimes), while ARPC narrows this gap and reverses it at high NFE. On GSM8K with block size 16, masked diffusion achieves marginally higher accuracy (0.5-1.2%) than uniform diffusion, with similar trends in Generative Perplexity on OpenWebText.

blockwise modeling · uniform-state diffusion · predictor-corrector sampling · generative perplexity · ancestral sampling

Why Are DMD Students Lazy? Understanding the Copying Behavior in Few-Step Distillation

arXiv cs.LG · Shucheng Li, Iolo Jones, Alexander Tong, Michael M. Bronstein · 2026-06-01

The paper investigates the copying behavior observed in few-step distillation of diffusion models, where students reproduce teacher noise-data pairings despite distribution-level supervision. Through Distribution Matching Distillation (DMD), the study demonstrates that copying emerges from limited geometric freedom in high-dimensional settings, not adversarial objectives or memorization. Results show this phenomenon is intrinsic to high-dimensional distillation.

distribution matching distillation · diffusion models · few-step distillation · noise-data pairings · geometric freedom

A Doeblin-Anchored Contrastive Chart for Learning Markov Transition Kernels

arXiv cs.LG · Ao Xu · 2026-06-01

The paper introduces a Doeblin-anchored contrastive chart, a framework for learning valid Markov transition kernels from contrastive objectives. The method mixes the target transition with a restart law, producing an anchored kernel that serves as a Doeblin-minorized Markov kernel and an invertible coordinate for the original transition law. Theoretical results include identification of the anchored transition density, calibration of excess risk to density error, and nonparametric rates for independent transition pairs. For geometrically β-mixing trajectories, a thinning-and-coupling extension achieves the same reconstruction interface with an effective sample size. Perturbation bounds transfer one-step kernel error to finite-horizon marginal, path-law, and occupation-measure errors.

markov transition kernel · doeblin-anchored · contrastive objective · restart law · β-mixing
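
The anchoring construction in the item above can be illustrated on a finite state space. This sketch assumes the mixture takes the form K = (1 - eps) * P + eps * nu, with `eps`, the uniform restart law `nu`, and the 3-state matrix `P` all chosen here for illustration; the paper's exact parameterization may differ.

```python
def anchor(P, nu, eps):
    """Mix each row of transition matrix P with the restart law nu."""
    return [[(1 - eps) * pij + eps * nj for pij, nj in zip(row, nu)] for row in P]

def unanchor(K, nu, eps):
    """Invert the mixture to recover the original transition matrix."""
    return [[(kij - eps * nj) / (1 - eps) for kij, nj in zip(row, nu)] for row in K]

P = [[0.9, 0.1, 0.0],
     [0.0, 0.8, 0.2],
     [0.5, 0.0, 0.5]]
nu = [1 / 3, 1 / 3, 1 / 3]    # restart law (assumed uniform here)
eps = 0.1                     # restart probability (assumed)

K = anchor(P, nu, eps)
recovered = unanchor(K, nu, eps)
```

Every entry of `K` is at least `eps * nu[j]`, which is the Doeblin minorization, and the map from `P` to `K` is invertible, so an estimate of the anchored kernel can be translated back into an estimate of the original one.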

Identifiable Markov Switching Models with Instantaneous Effects and Exponential Families

arXiv cs.LG · Roel Hulsman, Carles Balsells-Rodas, Sara Magliacane · 2026-06-01

The paper establishes identifiability of latent regimes and regime-dependent causal structures in Markov Switching Models (MSMs) with temporal dependencies, nonlinear lagged/instantaneous effects, and exponential-family noise. It introduces FlowMSM, a framework combining stationary causal discovery methods with regime detection to handle non-stationary time series. Experiments on synthetic benchmarks and financial data demonstrate accurate regime detection and causal structure recovery under frequent switches and complex dynamics.

markov switching models · identifiability · non-stationary time series · instantaneous effects · exponential family

Bayesian meta-learning for modeling Alzheimer's disease progression

arXiv cs.LG · Clara Hoffmann, Nadja Klein · 2026-06-01

A Bayesian meta-learner is proposed for predicting Alzheimer's disease progression, addressing limitations of classical regression and single-task neural networks. The model dynamically predicts discrete disease score distributions tailored to individual patients based on MRI volumes and historical trajectories, without requiring retraining for unseen individuals. It scales linearly with historical observations and reduces overconfidence in long-term predictions compared to deterministic approaches. Evaluated on the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, the method achieves competitive performance with single-task models and deterministic meta-learners, while significantly improving long-term progression prediction accuracy.

bayesian meta-learning · disease progression · mri volume · nonlinear relationships · overconfidence

Network Learning with Semi-relaxed Gromov-Wasserstein

arXiv cs.LG · Charles Dufour, Ulysse Naepels, Leonardo V. Santoro · 2026-06-01

The authors propose a semi-relaxed Gromov-Wasserstein framework for estimating the generative mechanisms of large-scale networks, addressing the NP-hard combinatorial challenge of latent connectivity structure identification. Their method employs probabilistic couplings to relax the assignment problem, solved via a block-coordinate conditional gradient algorithm, yielding deterministic solutions with an optimality gap vanishing at rate O(1/n). Theoretical guarantees include consistency and minimax-optimal convergence rates for stochastic block models and Hölder-smooth graphons, with demonstrated scalability on synthetic and real-world datasets.

gromov-wasserstein · stochastic block models · graphons · probabilistic couplings · minimax-optimal

CORE-MTL: Rethinking Gradient Balancing via Causal Orthogonal Representations

arXiv cs.LG · Chengfeng Wu, Tao Zou, Yanru Wu, Jingge Wang · 2026-06-01

CORE-MTL introduces a causal representation-centric framework for multi-task learning (MTL) that decomposes shared representations into semantic (task-relevant) and residual (nuisance) streams, addressing negative transfer in optimization-centric methods. The method leverages physical priors for structured scenes and statistical constraints for attributes, theoretically yielding tighter out-of-distribution generalization bounds and reduced gradient interference without explicit gradient manipulation. Experiments demonstrate consistent improvements over existing MTL methods on visual benchmarks in both in-distribution and out-of-distribution settings.

multi-task learning · causal representation · negative transfer · gradient interference · out-of-distribution generalization

Model Multiplicity and Predictive Arbitrariness in Recidivism Risk Assessment

arXiv cs.LG · Ashwin Singh, Carlos Castillo · 2026-06-01

The study investigates predictive multiplicity in recidivism risk assessment, demonstrating that multiple similarly accurate models can coexist without severe predictive arbitrariness. By constructing a dataset from legal rules and training interpretable models, the authors improve predictive performance and reduce error-rate disparities. Theoretical analysis provides a tight lower bound on expected predictive agreement, while empirical results show higher agreement than worst-case bounds. A policy assigning the lowest risk among models effectively mitigates arbitrariness.

predictive multiplicity · recidivism risk assessment · error-rate disparities · interpretable models · predictive arbitrariness
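
The lowest-risk policy mentioned in the item above is simple to express. The sketch below is illustrative only: the risk scores are invented, and the 0.5 decision threshold and the pairwise-agreement definition are assumptions rather than the paper's exact metrics.

```python
from itertools import combinations

def lowest_risk_policy(scores_by_model):
    """For each individual, keep the lowest risk score any model assigns."""
    return [min(scores) for scores in zip(*scores_by_model)]

def pairwise_agreement(scores_by_model, threshold=0.5):
    """Fraction of (model pair, individual) combinations with the same label."""
    labels = [[s >= threshold for s in scores] for scores in scores_by_model]
    pairs = list(combinations(labels, 2))
    agree = sum(sum(a == b for a, b in zip(x, y)) for x, y in pairs)
    return agree / (len(pairs) * len(labels[0]))

# Rows: three similarly accurate models; columns: four individuals (made-up scores).
scores = [[0.2, 0.6, 0.7, 0.4],
          [0.3, 0.4, 0.8, 0.45],
          [0.1, 0.55, 0.9, 0.6]]

policy_scores = lowest_risk_policy(scores)
agreement = pairwise_agreement(scores)
```

Under this policy an individual is flagged as high risk only when every model in the set agrees, so disagreement between equally good models never counts against them.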

Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

arXiv cs.LG · Christian Scherer, Joe Watson, Theo Gruner, Daniel Palenicek · 2026-06-01

The paper proposes coherent imitation learning, an inverse reinforcement learning (IRL) method for finetuning large behavior models with learned dense rewards, avoiding performance drops common in RL-based approaches. The method distills expert demonstrations into a pretrained policy via behavioral cloning, then learns a dense reward function to guide RL finetuning of a residual policy. Experiments on six sparse-reward manipulation tasks show the approach maintains or improves π-0.5 performance, achieving ≥90% success on five tasks, outperforming sparse-reward RL baselines. Theoretical guarantees ensure policy improvement by initializing the finetuning policy as optimal for the learned reward.

inverse reinforcement learning · behavioral cloning · residual policy · dense rewards · manipulation tasks

The Ghost Couple: Correlated LLM Name Priors and Their Haunting of the Web and Academic Publishing

arXiv cs.LG · Michał Brzozowski, Neo Christopher Chung · 2026-06-01

The study reveals that large language models generate correlated fictional name ensembles (e.g., Elena Vasquez + Marcus Chen) with co-occurrence rates exceeding chance, exhibiting model-family-specific and version-specific priors. Through analysis of AI-generated documents, the authors demonstrate these priors leave detectable fingerprints in web content and academic publishing. They identify 1,655 ghost-authored records on Zenodo with fabricated journals and backdated publication timestamps, plus synthetic research groups on ResearchGate, providing temporal proxies for model deployment windows.

name priors · correlated ensembles · doi metadata · model fingerprints · synthetic authorship

Low-Pass Flow Matching

arXiv cs.LG · Francesco M. Ruscio, T. Konstantin Rusch · 2026-06-01

Low-Pass Flow Matching is introduced as a variant of Flow Matching that addresses the misalignment between white noise sources and the frequency-decaying power spectra of natural data. The method employs an operator-modulated interpolant to induce a time-varying spectral bias, transitioning from the source spectrum to a frequency-decaying bias as the path approaches the data. Empirical validation on unconditional image generation tasks, including the Galaxy10 dataset, demonstrates improved or preserved sample quality with adaptive ODE solvers while substantially reducing sampling cost compared to standard baselines.

flow matching · spectral bias · operator-modulated interpolant · adaptive ode solvers · galaxy10 dataset

Closing the Alignment-Maturity Gap in Federated Prototype Learning

arXiv cs.LG · Mario Casado-Diez, Alejandro Dopico-Castro, Verónica Bolón-Canedo, Bertha Guijarro-Berdiñas · 2026-06-01

The paper introduces FedSAP, a federated learning framework that addresses the alignment-maturity gap in prototype-based methods by stabilizing representation learning. The method employs a deterministic alignment curriculum to delay global alignment until local representations stabilize and a geometry-driven proxy separation loss to enforce inter-class structure on the unit hypersphere. Evaluated across three benchmarks under varying heterogeneity, FedSAP improves on baselines by up to 4 percentage points, with notable gains in high-heterogeneity scenarios. The framework also extends to semi-supervised settings with minimal modifications.

federated learning · prototype learning · non-iid data · representation learning · alignment curriculum

Disentanglement-Based Equivariant Learning for Compositional VQA

arXiv cs.LG · Zhou Du, Zhaoquan Yuan, Xiao Wu, Changsheng Xu · 2026-06-01

The paper proposes Disentanglement-based EquivAriant Learning (DEAL), a framework for compositional visual question answering (VQA) that enhances compositional reasoning without relying on additional training clues. DEAL employs causality-inspired interventions to disentangle visual and textual concepts within a re-encoding framework, followed by compositional transformations and equivariant constraints to augment inference. Evaluations on CLEVR-CoGenT and GQA-SGL datasets show DEAL outperforming state-of-the-art methods in visual and linguistic generalization settings for compositional VQA tasks.

compositional vqa · disentanglement · equivariance · causality-inspired interventions · re-encoding framework

EEG-FuseFormer: A Transformer-Driven Feature Fusion Framework for Seizure Onset Prediction

arXiv cs.LG · Vigneshwar Hariharan, Chithra Reghuvaran, Arlene John, Nhat Pham · 2026-06-01

EEG-FuseFormer introduces a transformer-driven feature fusion framework for seizure onset prediction, combining intermediate features from CNN-LSTM and ResNet-18 networks. CNN-LSTM captures spatial-temporal features from raw EEG signals, while ResNet-18 extracts features from STFT representations. A transformer encoder fuses these features, followed by dense layers for final prediction. Evaluated on the CHB-MIT dataset, the model achieves a mean recall of 98.85%, outperforming state-of-the-art methods. Cross-patient testing with target adaptation improves recall, precision, and F1-score metrics. Runtime complexity is assessed across hardware platforms, highlighting performance-complexity trade-offs.

transformer · feature fusion · cnn-lstm · resnet-18 · stft

Hybrid Neural Ordinary Differential Equations for Data-Efficient Polymerization Modeling with Incomplete Kinetics

arXiv cs.LG · Marah Almanasreh, Alexander Mitsos, Eike Cramer · 2026-06-01

The paper introduces a hybrid Neural ODE framework for data-efficient modeling of free-radical polymerization, combining mechanistic mass balances with learned neural network surrogates for partially characterized kinetics. Using MMA polymerization as a case study, it retains established physical reactions while learning only the effective radical concentration from sparse data (as few as ten measurements). Evaluated against purely data-driven baselines, the hybrid approach achieves superior prediction accuracy (RMSE 0.013 vs. 0.31-0.68) and physical consistency under noisy, unseen conditions.

neural ode · polymerization kinetics · data-efficient modeling · hybrid framework · sparse data

TimeBlocks: Foundational and Continual Time-Series Blockbase -- Extended Version

arXiv cs.LG · David Campos, Bin Yang, Tung Kieu, Lei Chen · 2026-06-01

TimeBlocks introduces a modular framework for constructing lightweight foundational time-series models capable of handling multiple tasks under real-time constraints. The method maintains a pool of interchangeable model blocks, employing a routing strategy to iteratively select and assemble the most suitable blocks for specific time-series data. StreamCore, a component of TimeBlocks, builds a representative subset of the data stream, enabling continual model calibration while preserving stream approximation guarantees. Experimental results across multiple datasets and tasks demonstrate that TimeBlocks outperforms existing baselines in accuracy and efficiency, addressing limitations of large foundational models in real-time, hardware-limited settings.

time-series processing · foundational models · modular blocks · continual calibration · stream approximation

Edge-aware Decoding for Neural Asymmetric Routing

arXiv cs.LG · Li Liang, Jinbiao Chen, Zizhen Zhang · 2026-06-01

The paper proposes an edge-aware decoder design principle for neural asymmetric routing models, addressing the representation-decision mismatch by explicitly incorporating transition-level edge information in final scoring. The method augments candidate logits with directed edge terms, return-to-start closure, and static lookahead while maintaining the original representation backbone. Evaluated on ATSP and ACVRP benchmarks, the decoder reduces the ATSP-1000 optimality gap from 4.13% to 2.73% versus RADAR baseline, with ablation studies highlighting edge sensitivity as the primary mechanism.

neural asymmetric routing · edge-aware decoder · transition-level scoring · representation-decision mismatch · directed edge sensitivity

ProbRes: Volatility Learning for Probabilistic Time-Series Forecasting

arXiv cs.LG · Tingting Wang, Yunyi Zhang, Benyou Wang · 2026-06-01

ProbRes introduces a post-hoc probabilistic calibration method for time-series forecasting that explicitly models volatility dynamics to handle heteroskedastic data. The method employs two architecture-agnostic modules to separately learn conditional mean and conditional volatility during training, then generates predictive distributions by resampling normalized residuals at inference. Theoretical analysis confirms its validity, and experiments on synthetic and real-world datasets demonstrate accurate predictive distributions and well-calibrated prediction intervals under non-Gaussian innovations with conditional heteroskedasticity.

probabilistic forecasting · heteroskedasticity · volatility learning · predictive distributions · conditional mean
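
The resampling step in the ProbRes summary can be illustrated with a minimal sketch. It assumes point estimates of the conditional mean and volatility are already available from some fitted model; the function names and toy numbers are hypothetical, not taken from the paper.

```python
import random


def normalized_residuals(y, mu, sigma):
    """Standardize past errors: (observation - mean) / volatility."""
    return [(yi - mi) / si for yi, mi, si in zip(y, mu, sigma)]


def sample_predictive(mu_next, sigma_next, residuals, n_samples, rng):
    """Draw from the predictive distribution by rescaling resampled residuals."""
    return [mu_next + sigma_next * rng.choice(residuals) for _ in range(n_samples)]


# Toy history (hypothetical values): observations, predicted means, predicted volatilities.
y = [1.0, 2.0, 3.0]
mu = [0.5, 2.0, 3.5]
sigma = [0.5, 1.0, 0.5]

residuals = normalized_residuals(y, mu, sigma)
rng = random.Random(0)
samples = sample_predictive(mu_next=10.0, sigma_next=2.0,
                            residuals=residuals, n_samples=20, rng=rng)
```

Because the residuals are resampled empirically rather than drawn from a Gaussian, the predictive samples inherit whatever shape the standardized errors have, which is the property the summary attributes to the method.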

Error Bounds for a Diffusion Model-Based Drift Estimator

arXiv cs.LG · Ioar Casado-Telletxea, Omar Rivasplata · 2026-06-01

This work establishes theoretical guarantees for a diffusion model-based drift estimator in stochastic differential equations. The authors derive an explicit risk bound for the time-averaged mean-squared error, decomposing it into four components: Euler-Maruyama discretization, score/denoiser approximation, noise initialization, and sampling variance. Their analysis reveals trade-offs between hyperparameters and error sources, addressing a gap left by prior empirical studies of Tapia Costa et al. (2026). The bound leverages techniques from diffusion model theory to quantify estimator performance across different drift classes.

drift estimation · diffusion models · risk bound · euler-maruyama · score-matching
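
For readers unfamiliar with the first error source, here is a minimal sketch of the Euler-Maruyama scheme for a scalar SDE dX = b(X) dt + sigma dW. It illustrates the discretization only, not the paper's drift estimator; the drift function and parameter values are illustrative assumptions.

```python
import math
import random


def euler_maruyama(drift, x0, sigma, dt, n_steps, rng):
    """Simulate one path of dX = drift(X) dt + sigma dW on a uniform time grid."""
    path = [x0]
    x = x0
    for _ in range(n_steps):
        # Brownian increment over a step of length dt has standard deviation sqrt(dt).
        dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x = x + drift(x) * dt + sigma * dw
        path.append(x)
    return path


rng = random.Random(0)
# Mean-reverting drift b(x) = -x, chosen only as an example.
noisy_path = euler_maruyama(lambda x: -x, x0=1.0, sigma=0.5, dt=0.01, n_steps=100, rng=rng)
# With sigma = 0 the scheme reduces to explicit Euler for dx/dt = -x.
noiseless_path = euler_maruyama(lambda x: -x, x0=1.0, sigma=0.0, dt=0.01, n_steps=100, rng=rng)
```

The step size dt governs the discretization error, which is one of the four terms in the risk bound described above.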

When Tabular Foundation Models Transfer Across Modalities: A Systematic Evaluation Across 95 Datasets, 7 Modalities, and Two Regimes

arXiv cs.LG · Julien Lafrance · 2026-06-01

The paper introduces a unified classification pipeline combining Equiangular Tight Frame (ETF) preprocessing with a tabular foundation model for in-context inference, applicable across seven modalities (vision, audio, speech, text, molecular, time-series, tabular) on 95 datasets. The pipeline is evaluated against strong, lightweight, tuned baselines on frozen features, with separate reporting for oracle selection, deployed selection, and specialized fine-tuning. Results show performance competitive with these baselines and a 4-200× speedup over full backbone fine-tuning at comparable quality. The paper also details practical deployment considerations: ETF preprocessing, training termination without a validation set, in-context classifier setup, and a probability calibration step that restores the calibration disrupted by ETF preprocessing.

equiangular tight frame · in-context inference · tabular foundation model · probability calibration · lightweight tuning

It does what it says on the tin: safe synthetic data from coarsened margins

arXiv cs.LG · Gillian M Raab · 2026-06-01

The paper introduces a method for generating synthetic data (SD) with enhanced transparency and disclosure safety. The approach ensures that relationships between variables in the original data are approximately maintained in the SD, and guarantees that the SD is derived from information deemed free of disclosure risk. This is achieved by defining and calculating margins where variable relationships are preserved, applying statistical disclosure control (SDC) techniques such as top-coding and bottom-coding, and coarsening counts to multiples of the disclosure limit. The Iterative Proportional Fitting (IPF) algorithm is then used to generate SD. The method is demonstrated using data from the 1901 Census of Scotland.

synthetic data · statistical disclosure control · iterative proportional fitting · coarsening · disclosure risk
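
The coarsen-then-fit idea can be sketched for a two-way table. This is a toy illustration under stated assumptions: rounding to the nearest multiple of the disclosure limit is one possible reading of "coarsening", and the seed table, the limit of 5, and the counts are hypothetical. The paper's actual margins and fitting setup are not given in the summary.

```python
def coarsen(count, limit):
    """Round a non-negative integer count to the nearest multiple of limit."""
    return limit * ((count + limit // 2) // limit)


def ipf_2d(seed, row_targets, col_targets, iters=100):
    """Iterative Proportional Fitting: rescale a seed table to match both margins."""
    table = [row[:] for row in seed]
    for _ in range(iters):
        # Scale each row so it sums to its target margin.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [v * target / total for v in table[i]]
        # Scale each column so it sums to its target margin.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
    return table


# Hypothetical true margins, coarsened to multiples of a disclosure limit of 5.
rows = [coarsen(c, 5) for c in (12, 18)]   # becomes [10, 20]
cols = [coarsen(c, 5) for c in (14, 16)]   # becomes [15, 15]
fitted = ipf_2d(seed=[[1.0, 2.0], [3.0, 1.0]], row_targets=rows, col_targets=cols)
```

In this toy case the coarsened row and column totals happen to agree (both sum to 30); in general, coarsened margins would need to be reconciled before IPF can converge.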

Planar Symmetric Pattern Generation

arXiv cs.LG · Ning Lin, Luxi Chen, Huaguan Chen, Jiacheng Cen · 2026-06-01

The paper proposes a symmetrization framework for generating planar symmetric patterns from arbitrary 2D continuous representations while preserving continuity. It formulates symmetric representations mathematically, demonstrates their approximation capability for symmetric functions, and details how they are constructed. Experiments validate the approach on four design tasks: pattern design, paper-cutting design, stylized topology design, and material design. Results confirm effective symmetry control and the broader applicability of the proposed representation.

planar symmetry · continuous representation · symmetrization framework · pattern generation · approximation capability

Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design

arXiv cs.LG · Michał Brzozowski, Neo Christopher Chung · 2026-06-01

The study challenges the claimed stability of archetypal sparse autoencoders (SAEs) by demonstrating it stems from identical initialization rather than the archetypal constraint. Through systematic ablation of k-means decoder initialization and analysis of cosine geometry metrics, the authors distinguish between stability (inter-run agreement) and stabilization (convergence toward common solutions). Results show archetypal SAEs provide no stabilization advantage when initialization varies, with endpoint stability metrics further confounded by preprocessing choices. The work emphasizes the need for trajectory diagnostics in mechanistic interpretability of NLP features.

sparse autoencoders · dictionary learning · mechanistic interpretability · feature stability · initialization ablation

Query-Limited Community Recovery in Stochastic Block Models

arXiv cs.LG · Sabyasachi Basu, Manuj Mukherjee, Lutz Oettershagen, Suhas Thejaswi · 2026-06-01

The paper demonstrates that adaptive querying strategies strictly improve exact community recovery in the two-community stochastic block model (SBM) under limited and noisy access to network data. The authors analyze both oracle-only access and a combined model incorporating a subsampled graph. For oracle-only access, balanced uniform querying serves as a non-adaptive benchmark, reducing observations to an SBM with attenuated edge probabilities. However, a two-stage adaptive strategy achieves exact recovery with n+o(n) queries, outperforming the benchmark requiring mn queries (m>1). With a subsampled graph, adaptive querying targets uncertain vertices, enabling exact recovery where uniform querying fails, highlighting the adaptivity gap in sublinear-query regimes.

stochastic block model · adaptive querying · exact recovery · subsampled graph · oracle access

Convex Distance Operator Transport: A Convex and Geometry-Preserving Formulation

arXiv cs.LG · Junhyoung Chung, Euijong Song, Won Hwa Kim, Gunwoong Park · 2026-06-01

The authors introduce Convex Distance Operator Transport (CDOT), a convex optimal transport framework that preserves feature correspondence and intrinsic geometric structure across heterogeneous domains. CDOT employs operator-based regularization, aligning aggregated distance structures via distance and conditional expectation operators, enhancing robustness to local geometric variations. The CDOT discrepancy is proven to be a valid pseudometric on attributed compact metric-measure spaces, and its relationship to Gromov-Wasserstein (GW) is characterized through a dispersion gap, explaining GW's non-convexity. A non-asymptotic risk bound is derived, showing risk consistency under a Frank-Wolfe algorithm. Experiments on synthetic point clouds, brain connectomes, and graph classification benchmarks demonstrate superior performance and stability.

optimal transport · operator-based regularization · gromov-wasserstein · pseudometric · frank-wolfe algorithm

Realistic noise synthesis reduces bias and improves tissue microstructure estimation with supervised machine learning

arXiv cs.LG · Bradley G. Karat, Maëliss Jallais, Ali R. Khan, Santiago Aja-Fernández · 2026-06-01

The study introduces a realistic noise synthesis (RNS) framework to mitigate covariate shift in supervised machine learning for diffusion MRI microstructure estimation. RNS incorporates Rician expectation and effective post-processing noise variance into simulated training signals, using MPPCA-estimated noise standard deviation and spherical harmonic residuals. Evaluated on cylinder-zeppelin and SANDI models, RNS reduced SNR-dependent bias to the level of noise-aware nonlinear least-squares fitting, with further precision gains from effective standard deviation modelling. Performance was architecture-independent but sensitive to noise estimation accuracy.

diffusion mri · covariate shift · rician noise · microstructure estimation · supervised learning

Uncertainty-Aware Graph Neural Reconstruction of Urban Temperature Fields from Sparse Sensors under Deployment Constraints

arXiv cs.LG · Reda Snaiki, Abdelatif Merabtine · 2026-06-01

The study proposes an uncertainty-aware graph neural network (GNN) for reconstructing urban daily maximum temperature fields from sparse sensor data, incorporating deployment constraints. The model employs a graph-attention-based mean-residual architecture with Gaussian negative log-likelihood training to predict temperature and uncertainty fields, alongside a Proper Orthogonal Decomposition with QR factorization (POD-QR) strategy for constrained sensor placement. Evaluated on Montreal-area Daymet v4.1 data (1 km resolution, 2020-2024), the GNN outperforms inverse distance weighting and ordinary kriging in RMSE and MAE across 10-40 sensors, with diminishing placement effects beyond 30 sensors and improved uncertainty calibration at higher densities.

graph neural network · temperature field reconstruction · sensor placement · uncertainty calibration · proper orthogonal decomposition
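
The Gaussian negative log-likelihood objective mentioned above has a standard closed form, sketched here for a single scalar prediction. The function name and toy values are illustrative; the model's architecture and how it parameterizes the variance are not shown.

```python
import math


def gaussian_nll(y, mean, var):
    """Negative log-likelihood of observation y under N(mean, var), with var > 0."""
    return 0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


# Same prediction error (2 degrees), different predicted variances.
overconfident = gaussian_nll(y=30.0, mean=28.0, var=0.25)
calibrated = gaussian_nll(y=30.0, mean=28.0, var=4.0)
```

The loss penalizes a large error more heavily when the predicted variance is small, which is what pushes a model trained this way to report uncertainty alongside the temperature estimate.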

World-Task Factorization for Robot Learning

arXiv cs.LG · Eduardo Sebastián, Adrian Pfisterer, Vito Mengers, Oliver Brock · 2026-06-01

The paper proposes world-task factorization as a fundamental principle for robot learning, formalizing it through Bayesian model evidence to separate world properties from task logic. The method instantiates this via AICON, a differentiable graph of recursive estimators that propagates cost gradients, paired with a learned policy modulating gradient paths. Evaluated across heterogeneous robots, environments, and tasks, the framework outperforms end-to-end baselines and analytical heuristics, demonstrating zero-shot generalization to out-of-distribution configurations and successful real-world transfer without retraining.

robot learning · bayesian model evidence · differentiable graph · zero-shot generalization · gradient propagation

Unveiling the Entropy Dynamics of Chain-of-Thought Reasoning

arXiv cs.LG · Ting Xu, Xu He, Yupu Lu, Jiankai Sun · 2026-06-01

The paper introduces a novel analysis of Chain-of-Thought (CoT) reasoning entropy dynamics, revealing a consistent two-phase structure: an initial Uncertainty Region for exploration followed by a Confidence Region with high answer reliability and token redundancy. The authors propose two inference strategies—Early Exit and Test-Time Scaling—leveraging these properties, and formulate Confidence Region detection as a sequential change-point problem solved via the Cumulative Sum (CUSUM) algorithm. Experiments demonstrate CUSUM achieves 63.06% accuracy with 11.1% token reduction, outperforming DEER and Dynasor by 3.28% and 4.36% respectively in accuracy.

chain-of-thought · entropy dynamics · early exit · change-point detection · cusum algorithm
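
A one-sided CUSUM detector of the kind the summary refers to can be sketched as follows. This is the generic textbook form applied to a toy entropy trace; the reference level, slack, and threshold are hypothetical tuning parameters, and the paper's exact statistic may differ.

```python
from typing import Optional, Sequence


def cusum_drop_index(entropies: Sequence[float], reference: float,
                     slack: float, threshold: float) -> Optional[int]:
    """Return the first index at which a downward shift in entropy is flagged."""
    stat = 0.0
    for i, h in enumerate(entropies):
        # Accumulate evidence that entropy sits below (reference - slack);
        # the statistic resets to zero whenever the evidence turns negative.
        stat = max(0.0, stat + (reference - h - slack))
        if stat > threshold:
            return i
    return None


# Toy per-token entropies: high during exploration, low once the model is confident.
trace = [2.0, 2.1, 1.9, 2.0, 0.4, 0.3, 0.2, 0.3]
change_at = cusum_drop_index(trace, reference=2.0, slack=0.5, threshold=2.0)
```

An early-exit strategy would stop generating shortly after the flagged index, while a test-time scaling strategy could spend extra budget only on traces where no change point has been detected yet.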

Evaluating Real-World Generalizability of Algorithm Selection Models

arXiv cs.LG · Gjorgjina Cenikj, Jakub Kudela, Eva Tuba, Tome Eftimov · 2026-06-01

This study evaluates the real-world generalizability of Algorithm Selection (AS) models across synthetic benchmarks (BBOB, CEC) and real-world optimization tasks (robotics trajectory, UAV path-planning). Through cross-benchmark analysis, it identifies transferability patterns, failure modes, and domain-specific challenges in AS systems. Results reveal limitations in current AS approaches when transitioning from controlled benchmarks to practical applications, providing insights for developing more robust and generalizable AS frameworks.

algorithm selection · generalizability · bbob · cec · optimization

Provable Data Scaling Law for Meta Learning via Complexity Minimization

arXiv cs.LG · Kazuto Fukuchi, Ryuichiro Hataya, Kota Matsui · 2026-06-01

The paper introduces complexity minimization, a meta-representation learning framework that theoretically explains improved downstream sample efficiency with increased pre-training data. The method learns representations by optimizing worst-case downstream model complexity across source domains. Theoretical analysis proves error rate improvements in few-shot adaptation with more meta-training data, while empirical results show complexity regularization enhances existing meta-learning methods' sample efficiency.

meta-representation learning · complexity minimization · sample efficiency · few-shot adaptation · pre-training

Machine Learning for Coding Retail Product Names to Consumer-Price Categories: A Rule-plus-Bag-of-Words Pipeline with Reliability-Weighted Human-in-the-Loop Labeling

arXiv cs.LG · Vladimir Beskorovainyi · 2026-06-01

The paper presents a pipeline for mapping noisy retail product names to consumer-price categories, combining rule-based pre-classification with a binary confirmation model. Key innovations include a trie-based pre-classifier, reliability-weighted human-in-the-loop labeling, and empirical validation showing bag-of-words models achieve near-perfect performance (F1 ≈ 0.99) with minimal labeled data (67 examples). The study compares labeling protocols, finding Dawid-Skene outperforms reliability-weighted voting, and discusses implications for price measurement using transaction data.

trie · bag-of-words · human-in-the-loop · dawid-skene · coicop
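
A trie-based pre-classifier of the general kind described above might look like the sketch below. It assumes a token-level trie with longest-match lookup; the paper's trie could equally be character-level, and the rule phrases and category labels here are hypothetical.

```python
# Sentinel key marking "a rule ends here"; it can never collide with a string token.
CATEGORY = object()


def build_trie(rules):
    """Build a token-level trie from {phrase: category} rules."""
    root = {}
    for phrase, category in rules.items():
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node[CATEGORY] = category
    return root


def pre_classify(name, trie):
    """Return the category of the longest rule phrase found in the product name."""
    tokens = name.lower().split()
    best_len, best_cat = 0, None
    for start in range(len(tokens)):
        node = trie
        for depth, token in enumerate(tokens[start:], start=1):
            node = node.get(token)
            if node is None:
                break
            if CATEGORY in node and depth > best_len:
                best_len, best_cat = depth, node[CATEGORY]
    return best_cat


trie = build_trie({
    "whole milk": "dairy",
    "milk chocolate": "confectionery",
    "bread": "bakery",
})
```

Names that match no rule return `None`, and in a pipeline like the one summarized they would presumably fall through to the learned confirmation model or to human labeling.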

Graph Edit Distance Formulation for the Vehicle Routing Problem: Theory and Analysis

arXiv cs.LG · Adel Dabah · 2026-06-01

The paper reformulates the Vehicle Routing Problem (VRP) as a Graph Edit Distance (GED) maximization problem, demonstrating that minimizing route cost is equivalent to maximizing deleted edge weights under an edge-deletion cost model. This edge-level formulation enables structural analyses of solution quality attribution, optimality gap decomposition, and sparsity characterization. Theoretical contributions include a merge-decomposition theorem linking Clarke-Wright savings to GED increments and an approximation-transfer theorem for cost bounds. Empirical analysis of 90 CVRP benchmarks reveals optimal routing graphs use only 5.5% of edges, with 3.0% consistently missed by Clarke-Wright heuristics, while cost gaps decompose comparably between missed optimal and substituted non-optimal edges.

graph edit distance · vehicle routing problem · clarke-wright savings · optimality gap · edge-deletion cost
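
The Clarke-Wright savings quantity referenced in the summary is the classical s(i, j) = d(0, i) + d(0, j) - d(i, j): the cost saved by serving customers i and j on one route instead of two separate round trips from the depot. A minimal sketch on a hypothetical symmetric distance matrix (node 0 is the depot):

```python
def clarke_wright_savings(dist, depot=0):
    """Return (saving, i, j) for every customer pair, largest saving first."""
    n = len(dist)
    savings = []
    for i in range(n):
        for j in range(i + 1, n):
            if i == depot or j == depot:
                continue
            saving = dist[depot][i] + dist[depot][j] - dist[i][j]
            savings.append((saving, i, j))
    savings.sort(reverse=True)
    return savings


# Toy symmetric distances (hypothetical values).
dist = [
    [0, 4, 5, 6],
    [4, 0, 3, 8],
    [5, 3, 0, 2],
    [6, 8, 2, 0],
]
ranked = clarke_wright_savings(dist)

# Merging routes 0-2-0 and 0-3-0 into 0-2-3-0 saves exactly the top-ranked amount.
separate_cost = 2 * dist[0][2] + 2 * dist[0][3]
merged_cost = dist[0][2] + dist[2][3] + dist[3][0]
```

This covers only the savings computation; the paper's merge-decomposition theorem relating these savings to GED increments is not reproduced here.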

A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation

arXiv cs.LG · Zefeng Li, Evan Shelhamer · 2026-06-01

The study benchmarks open-set test-time adaptation (TTA) methods (SAR, OSTTA, UniEnt, SoTTA) on corrupted datasets CIFAR-10-C and ImageNet-C, evaluating in-distribution (InD) accuracy and out-of-distribution (OOD) detection. Using OOD data from SVHN-C, CIFAR-100-C, ImageNet-O-C, and Textures-C, the analysis reveals current methods struggle to balance InD recognition and OOD rejection, imperfectly filtering OOD data during adaptation. A proposed sigmoid/multi-label baseline explores this trade-off, showing limitations in existing approaches.

test-time adaptation · out-of-distribution detection · in-distribution accuracy · open-set recognition · corruption benchmarks

Flow-Transformed Implicit Processes for Function-Space Variational Inference

arXiv cs.LG · Luis A. Ortega, Andrés R. Masegosa, Thomas D. Nielsen · 2026-06-01

The paper introduces Flow-Transformed Implicit Processes (FTIP), a method for expressive function-space variational inference using implicit-process priors. FTIP replaces conventional Gaussian variational distributions over combination weights with normalizing flows, enabling richer posterior approximations that capture asymmetry, heavy tails, and multimodality. The model is trained via a Black-Box α objective, allowing control over mass-covering versus mode-seeking behavior. Experiments demonstrate FTIP's superior ability to represent complex posterior structure compared to Gaussian approximations.

implicit-process priors · function-space inference · normalizing flows · variational inference · black-box optimization

Learning Action-Conditional and Object-Centric Gaussian Splatting World Models for Rigid Objects

arXiv cs.LG · Jens U. Kreber, Lukas Mack, Joerg Stueckler · 2026-06-01

The paper introduces Multi Rigid Object Gaussian World Model (MRO-GWM), a novel action-conditional world model for 3D rigid object dynamics. It represents scenes via object-centric Gaussians, enabling arbitrary shapes and multi-object interactions, and employs a spatio-temporal transformer to predict future motion from Gaussian histories and actions. Objects are modeled in canonical frames, with rigid body transformations describing motion. Trained on multi-view reconstructions, MRO-GWM handles occlusions and partial observations. Evaluations on synthetic datasets demonstrate its effectiveness in predicting multi-object dynamics and enabling model-predictive control for non-prehensile robotic manipulation.

gaussian splatting · object-centric · rigid body transformation · spatio-temporal transformer · model-predictive control

HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression

arXiv cs.LG · Minghui Zheng, Hongxu Chen, Huimin Ren, Hongsheng Xin · 2026-06-01

HMPO introduces a single-stage reinforcement learning framework for efficient chain-of-thought (CoT) compression, addressing limitations of manual length budgets and multi-stage training. The method combines an adaptive median-based budget derived from successful rollouts, cosine-decay token reward for smooth length penalization, and multiplicative reward prioritization to prevent reward hacking. Evaluated across 9B to 122B parameter models (including dense and MoE architectures), HMPO achieves 19% to 46% token compression with minimal accuracy loss while significantly reducing training costs, demonstrating strong generalization from math to code, science, and instruction-following tasks.

chain-of-thought · reinforcement learning · token compression · mixture-of-experts · reward hacking
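
The three ingredients named in the summary can be combined in a rough sketch. The exact reward shapes are not given in the digest, so everything below is a hypothetical reading: the budget is the median length of correct rollouts, the length reward decays along a half-cosine between the budget and an assumed `max_length`, and correctness multiplies the length reward so that a short wrong answer earns nothing.

```python
import math
from statistics import median


def median_budget(lengths, correct):
    """Median token length over successful rollouts (None if none succeeded)."""
    successful = [n for n, ok in zip(lengths, correct) if ok]
    return median(successful) if successful else None


def cosine_length_reward(length, budget, max_length):
    """1.0 up to the budget, then a smooth half-cosine decay to 0.0 at max_length."""
    if length <= budget:
        return 1.0
    if length >= max_length:
        return 0.0
    frac = (length - budget) / (max_length - budget)
    return 0.5 * (1.0 + math.cos(math.pi * frac))


def combined_reward(is_correct, length, budget, max_length):
    """Multiplicative combination: brevity is only rewarded when the answer is correct."""
    return float(is_correct) * cosine_length_reward(length, budget, max_length)


# Toy rollout group (hypothetical lengths and outcomes).
budget = median_budget([100, 200, 300, 400], [True, True, False, True])
```

The multiplicative form is one simple way to keep the length term from being exploited by short incorrect outputs, which is the reward-hacking concern the summary mentions.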

Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time

arXiv cs.LG · Mingkuan Zhao, Yide Gao, Wentao Hu, Suquan Chen · 2026-06-01

The paper introduces Resonant Context Anchoring (RCA), a lightweight inference-time intervention method to mitigate contextual disregard in Large Language Models (LLMs). RCA decouples attention routing and signal gain by using raw pre-softmax attention scores to construct a dynamic gain field, selectively amplifying value vectors of context tokens without altering attention probabilities. Experiments on Llama-3 show RCA improves contextual faithfulness in factual consistency tasks, suppressing parametric hallucinations while maintaining fluency and general language understanding.

resonant context anchoring · signal attenuation · self-attention module · parametric hallucinations · non-linear rectification

Private and Stable Test-Time Adaptation with Differential Privacy

arXiv cs.LG · Zefeng Li, Qiaoyue Tang, Mathias Lecuyer, Evan Shelhamer · 2026-06-01

This work introduces differentially private test-time adaptation (DP-TTA) methods to address privacy risks in model updates during inference. The authors adapt popular TTA techniques (Tent, EATA, SAR, DeYO, COME) by incorporating per-sample gradient clipping and Gaussian noise to ensure differential privacy. Evaluated on ImageNet-C, the DP-TTA methods maintain accuracy while providing privacy guarantees, with per-sample clipping improving adaptation stability and accuracy in low-privacy regimes. The approach incurs only modest computational overhead, demonstrating the feasibility of private TTA and highlighting per-sample clipping as an effective technique for both privacy and adaptation performance.

test-time adaptation · differential privacy · per-sample clipping · gaussian noise · imagenet-c
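
Per-sample clipping followed by Gaussian noise is the standard DP-SGD recipe, sketched here on plain Python lists. How the paper applies it to each TTA loss is not shown; the function name and toy gradients are illustrative, and privacy accounting (turning the noise multiplier into an epsilon) is omitted.

```python
import math
import random


def dp_average_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each sample's gradient to clip_norm, sum, add Gaussian noise, average."""
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        # Scale down only gradients whose L2 norm exceeds the clipping bound.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for k in range(dim):
            total[k] += grad[k] * scale
    # Noise standard deviation is proportional to the clipping bound (the sensitivity).
    sigma = noise_multiplier * clip_norm
    n = len(per_sample_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
private_grad = dp_average_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
clipped_only = dp_average_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
```

Clipping bounds the influence of any single test sample on the update, which is consistent with the summary's observation that it also stabilizes adaptation.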

MidSurfNet: Learnable Face Pairing and Interference Implicit Fields for Generalized Mid-surface Abstraction

arXiv cs.LG · Li Ye, Xinhang Zhou, Xingyu Yang, Ruofeng Tong · 2026-06-01

MidSurfNet introduces a learning-based framework for mid-surface abstraction in thin-walled CAD models, addressing limitations of rule-based methods. The approach combines a neural face pairing module for complex geometric configurations and an interference implicit field for flexible offset control. Evaluated on a dataset of 1,500 annotated CAD models, it achieves 87.32% face pairing accuracy, with 61.90% and 52.94% completion rates for multi-wall-thickness and self-matching scenarios respectively, outperforming existing methods.

mid-surface abstraction · neural face pairing · interference implicit field · cad models · finite element analysis

Segment-driven Structural Induction and Semantic Alignment for Heterogeneous Tabular Representation

arXiv cs.LG · Woojun Jung, Susik Yoon · 2026-06-01

NAVI introduces a segment-centric pretraining framework for heterogeneous tabular representation, addressing schema variability and semantic alignment through header-value pair modeling. The method combines Masked Segment Modeling and Entropy-driven Segment Alignment to aggregate structural and distributional evidence at column level. Experiments demonstrate improved reconstruction, semantic consistency, and downstream task performance on heterogeneous in-domain tables.

heterogeneous tables · masked segment modeling · semantic alignment · tabular representation · header-value pairs

Beyond the Simplex: Balanced Prototype Geometry for Scorer-Agnostic Open-Set Recognition

arXiv cs.LG · Mayank Sharma, Rohit Kumar Mourya · 2026-06-01

The paper introduces a theoretical framework for simplex-based open-set recognition (OSR) that generalizes to all embedding dimensions, including d=2, where prior work required d≥C-1. It proves that balanced prototype geometries yield exact Euclidean ball sublevel sets for squared ratio scores, with a sharp dichotomy: prototypes achieve one-distance symmetry iff d≥C-1, with quantifiable degradation below this threshold. The analysis shows exponential decay in false-acceptance rates under isotropy and Lipschitz continuity of operational scores. Empirical validation on CIFAR and MedMNIST demonstrates the geometry's utility as a representation-learning prior, though performance depends heavily on the scoring rule.

open-set recognition · simplex geometry · prototype learning · embedding dimension · lipschitz continuity

G2LoRA: Gradient Orthogonal Low-Rank Adaptation Framework for Graph Continual Learning on Text-Attributed Graphs

arXiv cs.LG · Yuhan Wang, Yibo Ding, Yutong Ye, Mufan Zhao · 2026-06-01

G2LoRA introduces a gradient orthogonal low-rank adaptation framework for graph continual learning on text-attributed graphs (TAGs), addressing catastrophic forgetting and task interference in LLM-as-Aligner models. The method unifies node-, link-, and graph-level tasks under a single graph-text alignment objective, employs category-aware gradient projection in structured subspaces to resolve conflicting updates, and modulates gradient magnitudes to coordinate updates between graph and text encoders. Experiments on benchmark datasets show G2LoRA outperforms baselines across architectures, achieving superior continual performance and transferability.

graph continual learning · text-attributed graphs · gradient projection · cross-modal drift · low-rank adaptation

Task-Induced Representational Invariances Depend on Learning Objective in Deep RL

arXiv cs.LG · Manu Srinath Halvagal, Sebastian Lee, SueYeon Chung · 2026-06-01

The study establishes a principled framework for comparing learned representations in deep RL algorithms through MDP reduction theory. Analyzing DQN (value-based) and PPO (policy-gradient) in navigation tasks, it reveals algorithm-specific invariances: DQN representations exhibit MDP homomorphism symmetry, while PPO shows action symmetry invariance. These differences persist across domains, affect transfer learning performance, and manifest similarly in LLMs with prompt dependence. The findings bridge theoretical RL representation analysis with potential neuroscientific implications for neural coding.

mdp reduction · representation learning · deep reinforcement learning · invariance · transfer learning

Continual Learning as a Multiphase Moving-Boundary Problem

arXiv cs.LG · Snigdha Chandan Khilar · 2026-06-01

Stefan-CL introduces a physics-inspired approach to continual learning by modeling it as a multiphase moving-boundary problem. It conceptualizes consolidated knowledge as a 'solid' phase and unused capacity as a 'liquid' phase, with the boundary between them dynamically expanding during learning. The method employs a 'latent heat' parameter to regulate this expansion and mathematically freezes the learned interior to minimize forgetting. Stefan-CL achieves near-zero forgetting without storing raw data, matching the performance of memory-intensive baselines.

continual learning · stability-plasticity dilemma · latent heat · multiphase boundary · forgetting minimization

A Theoretical Framework for Self-Play Theorem Proving Algorithms

arXiv cs.LG · Thomas Chen, Zhiyuan Li · 2026-06-01

The paper establishes a theoretical framework for self-play algorithms in theorem proving, formalizing theorems as nodes in a semantic graph. It proves that a prover-conjecturer system with reversible random walk-based conjecture generation achieves exponential growth in proved theorems under graph connectivity assumptions. To address empirical issues of overly complex theorem generation, the authors propose a diversity measure based on diffusion similarity and an improved algorithm maximizing it via contrastive learning embeddings.

self-play · theorem proving · diffusion similarity · contrastive learning · graph connectivity

ContinuousBench: Can Differentially Private Synthetic Text Improve Capabilities?

arXiv cs.LG · Peihan Liu, Lucas Rosenblatt, Weiwei Kong, Natalia Ponomareva · 2026-06-01

The paper introduces ContinuousBench, a novel benchmark for evaluating differentially private (DP) synthetic text generation by measuring capability gain from synthetic data. The benchmark features quarterly updates with new training corpora and QA sets designed to be unsolvable without corpus access yet learnable under DP constraints. Two tracks are implemented: Geminon (procedurally-generated fictional data) and News (crawled articles). Results show non-private synthesis effectively transfers knowledge, while state-of-the-art DP methods fail even at high privacy budgets (ε=100), highlighting current limitations in DP text synthesis.

differentially private · synthetic text · capability gain · benchmark · qa sets

The Lie We Tell: Correcting the Euclidean Fallacy in Vision Language Action Policies via Score Matching on Tangent Space

arXiv cs.LG · Bing-Cheng Chuang, I-Hsuan Chu, Bor-Jiun Lin, YuanFu Yang · 2026-06-01

The paper introduces Lie Diffuser Actor (LDA), a diffusion framework correcting the Euclidean Fallacy in Vision-Language-Action policies by operating intrinsically on SE(3). LDA injects noise via left-invariant SDEs, predicts scores in the tangent space, and retracts samples through the exponential map, ensuring manifold adherence, coordinate-frame equivariance, and geodesic optimality. On CALVIN ABC→D, LDA increases average task length from 3.27 to 3.51 (+7.3%), with real-robot validation confirming superior performance over baselines.

se(3) · diffusion policies · tangent space · score matching · manifold drift

Mos-Gen: A Generative Molecular Framework for Mosquito Insecticide Design

arXiv cs.LG · Lina Wang, Yaning Cui · 2026-06-01

Mos-Gen introduces a motif-aware generative framework for de novo design of disulfide-containing allicin derivatives as mosquito insecticides, addressing resistance issues from conventional chemical insecticides. The framework integrates Uni-Mol, a pretrained molecular representation model, with a variational autoencoder (VAE) to generate novel molecular scaffolds. Experimental validation of fourteen synthesized compounds showed a 78% hit rate among predicted positives, with no mosquitocidal activity in predicted negatives, confirming Mos-Gen's high-precision screening capability.

mos-gen · disulfide-containing · variational autoencoder · molecular scaffolds · mosquitocidal activity

Observation, Not Prediction: Conversation-Level Disaggregated Scheduling for Agentic Serving

arXiv cs.LG · Jianru Ding, Ryien Hosseini, Pouya Mahdi Gholami, Mingyuan Xiang · 2026-06-01

ConServe introduces conversation-level disaggregated scheduling for LLM-based agents, eliminating the need for predictive models by leveraging observable first-turn input length and KV cache occupancy. By raising the scheduling unit from individual turns to entire conversations, ConServe identifies a stable two-phase structure: compute-bound turn-1 prefill and memory-bound tail. This approach routes prefill to high-throughput prefillers, transfers the KV cache once, and pins conversations to a single decoder. ConServe reduces p95 time-to-first-effective-token by 51.08%, improves energy efficiency by 7.51%, and achieves an additional 22.75% efficiency gain with heterogeneous GPU mapping.

llm-based agents · kv cache · disaggregated scheduling · prefill · energy efficiency

Adaptive Sharpness-Aware Minimization with a Polyak-type Step size: A Theory-Grounded Scheduler

arXiv cs.LG · Dimitris Oikonomou, Nicolas Loizou · 2026-06-01

The authors propose adaptive Polyak-type step size schedulers for Sharpness-Aware Minimization (SAM), eliminating the need for manual learning rate tuning. They derive novel algorithms for both deterministic and stochastic settings, proving linear convergence for strongly convex objectives and O(1/T) rates for convex cases. Empirical results show performance matching or exceeding tuned SAM baselines while reducing hyperparameter sensitivity.

sharpness-aware minimization · polyak step size · stochastic optimization · convergence analysis · adaptive learning rate
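
For intuition, the classical Polyak step size sets the learning rate to (f(x) - f*) / ||grad f(x)||^2, where f* is the optimal value. The sketch below applies it to plain gradient descent on a toy quadratic. It is not the paper's SAM variant, and the objective, `f_star`, and function names are illustrative assumptions.

```python
def polyak_gd(f, grad, x, f_star, steps=50):
    """Gradient descent with the classical Polyak step size.

    step = (f(x) - f_star) / ||grad f(x)||^2, where f_star is the known optimum value.
    """
    for _ in range(steps):
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq == 0.0:  # already at a stationary point
            break
        step = (f(x) - f_star) / g_norm_sq
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


# Toy strongly convex objective (an assumption): f(x) = x0^2 + 4 * x1^2, minimum value 0.
def f(x):
    return x[0] ** 2 + 4.0 * x[1] ** 2


def grad(x):
    return [2.0 * x[0], 8.0 * x[1]]


x_final = polyak_gd(f, grad, [3.0, -2.0], f_star=0.0)
```

No learning rate is passed in: the step adapts to the current suboptimality and gradient norm, which is the appeal of Polyak-type schedules.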

Site4Drug: Predicting Drug-Binding Target Sites with an AI Agent

arXiv cs.LG · Taehan Kim, Sarrah Rose Mikhail Leung, Bharat Mekala, Jeongbin Park · 2026-06-01

Site4Drug introduces an AI agent for predicting drug-binding target sites on proteins, particularly addressing challenges in membrane proteins where accessibility and modifications constrain targetable regions. The system outputs ranked targetable regions with constraints, evidence summaries, and risk flags, while also recommending binding modalities (e.g., antibody-like vs small-molecule) based on consistent evidence across topology, hydropathy, PTMs, and domain context. This approach avoids biologically occluded sites by unifying modality-aware discovery.

drug-binding sites · membrane proteins · modality-aware · post-translational modifications · target discovery

Tree-Guided Identify-Then-Exploit: A Unified Framework of Best Arm Identification and Regret Minimization for Dueling Bandits

arXiv cs.LG · Pu Wang, Yao-Xiang Ding · 2026-06-01

The paper introduces Tree-Guided Identify-Then-Exploit (TG-ITE), a unified framework for stochastic dueling bandits under the Condorcet-winner assumption, addressing best-arm identification (BAI), weak regret, and strong regret. The method employs a tree-guided identification phase to find a high-confidence incumbent in O(N) comparisons, followed by objective-specific exploitation strategies. Key results include O(N) sample complexity for BAI without stronger assumptions, O(N) weak regret (first winner-stays-style algorithm), O(N log T) strong regret matching specialized approaches, and joint optimization of BAI and weak regret with O(N) guarantees for both, eliminating prior O(log N) gaps.

dueling bandits · condorcet-winner · best-arm identification · regret minimization · tree-guided identification

An Algebraic View of the Expressivity of Recurrent Language Models

arXiv cs.LG · Franz Nowak, Ryan Cotterell, Reda Boumasmoud · 2026-06-01

This paper provides a unified algebraic framework for analyzing the expressivity of recurrent neural language models, resolving conflicting claims in the literature regarding their computational power. By formalizing different arithmetic models, the authors reduce expressivity to algebraic properties, such as whether a network's syntactic monoid divides a specific wreath product. As a case study, diagonal state-space models are examined, revealing that their expressivity depends on arithmetic constraints: they fail to implement even-modulus counters under floating-point recurrences but succeed under unsigned-integer quantization.

expressivity · recurrent neural networks · algebraic framework · syntactic monoid · arithmetic models

Sensitivity as a Double-Edged Sword: A Trade-off Between Discriminability and Adversarial Robustness

arXiv cs.LG · Kai Wang · 2026-06-01

The paper identifies a trade-off between discriminability and adversarial robustness in neural network classifiers, showing that fully connected (FC) classifiers are discriminative but sensitive to perturbations, while ℓ2-distance classifiers are robust but less discriminative. To address this, the authors propose a Hybrid Prototype Mixing (HPM) framework, combining stable dataset-level prototypes and dynamic batch-level prototypes generated via a Straight-Through Estimator (STE). They introduce the Mixed Surrogate Attack (MSA) for rigorous evaluation, demonstrating that their lightweight module enhances robustness in adversarially trained models with minimal fine-tuning.

adversarial robustness · hybrid prototype mixing · straight-through estimator · mixed surrogate attack · fully connected classifiers

FlatVPR: Plug-and-play Geo-linear Residual Adapter for Geometric Rectification of Foundation Model Feature Manifolds

arXiv cs.LG · Rai Hisada, Kanji Tanaka · 2026-06-01

FlatVPR introduces a geo-linear residual adapter for geometric rectification of foundation model feature manifolds in visual place recognition (VPR), addressing the trade-off between map lightweightness and localization accuracy. The method enforces a linear interpolation property between anchor descriptors via a learnable residual transformation, minimizing manifold curvature with a Pullback Flatness Loss. Experiments on NCLT show significant performance gains under sparse anchor conditions (100m intervals) and extreme seasonal changes.

visual place recognition · feature manifold · residual adapter · pullback flatness loss · geometric rectification

Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging

arXiv cs.LG · Minsik Choi, Geewook Kim · 2026-06-01

The paper introduces MERIT, a decentralized instruction-tuning pipeline that addresses gradient interference and synchronization bottlenecks in heterogeneous mixtures. The method employs conflict-aware dataset splitting via PCA-aligned partitioning, independent fine-tuning of partitions, and token-weighted merging. Theoretical analysis shows weight merging performs curvature-weighted variance reduction and spectral filtering. Evaluated on Qwen2.5-VL-3B with 136 Vision-FLAN tasks, MERIT improves benchmark performance from 54.3 to 57.0, scaling to 7B models on 1.6M-example mixtures with minimal overhead.

instruction tuning · gradient interference · pca-aligned splitting · weight merging · spectral filtering
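
One plausible reading of "token-weighted merging" is a weighted average of the independently fine-tuned checkpoints, with each weight proportional to the number of training tokens in its partition. The sketch below shows that averaging on toy parameter dictionaries. The function name, data layout, and token counts are hypothetical, and MERIT's actual merge rule may differ.

```python
def token_weighted_merge(checkpoints, token_counts):
    """Average parameter tensors, weighting each checkpoint by its token count."""
    total = sum(token_counts)
    weights = [n / total for n in token_counts]
    merged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(size)
        ]
    return merged


# Two toy "checkpoints", each a dict of flat parameter lists (illustrative values).
ckpt_a = {"layer.weight": [1.0, 2.0]}
ckpt_b = {"layer.weight": [3.0, 6.0]}

# Partition A saw 3000 tokens and partition B saw 1000, so A gets weight 0.75.
merged = token_weighted_merge([ckpt_a, ckpt_b], token_counts=[3000, 1000])
print(merged)  # {'layer.weight': [1.5, 3.0]}
```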

Density-Aware Translation of Spurious Correlations in Zero-Shot VLMs

arXiv cs.LG · Afsaneh Hasanebrahimi, Hanxun Huang, Christopher Leckie, Sarah Erfani · 2026-06-01

The paper introduces Density-Aware Translation (DAT), a method to mitigate spurious correlations in zero-shot Vision-Language Models (VLMs) like CLIP by refining image-text similarity scores using local geometric density. DAT addresses the anisotropic embedding geometry in CLIP, where common patterns cluster near the mean and rare patterns are marginalized, by employing a relative density measure to rescale similarities. Experiments on benchmark datasets show consistent improvements in worst-group and average accuracy, demonstrating DAT's effectiveness as a calibration mechanism for reliable zero-shot classification.

vision-language models · spurious correlations · zero-shot classification · embedding geometry · density-aware translation

KDH-CAD: Knowledge-data hybrid CAD learning under data scarcity

arXiv cs.LG · Ziqin Gao, Zhijie Yang, Qiang Zou · 2026-06-01

KDH-CAD proposes a knowledge-data hybrid framework for computer-aided design (CAD) learning under data scarcity, integrating pretrained foundation models, structured domain knowledge, and minimal labeled CAD data. The method uses domain knowledge to complete CAD-relevant concepts in foundation models and calibrates them with task-specific geometric variability via labeled data, avoiding full fine-tuning. Experiments on mechanical part classification achieve 92.6% accuracy with 250 samples and 95.8% with 1,000 samples, matching or exceeding state-of-the-art performance with 10× less data.

knowledge-data hybrid · computer-aided design · foundation models · data scarcity · geometric variability

CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models

arXiv cs.LG · Swapnil Parekh · 2026-06-01

CANARY introduces a zero-label checkpoint auditor for detecting fine-tuning contamination in language models via hidden-state geometry analysis. The method projects hidden-state differences through a Sparse Autoencoder (SAE) to isolate semantic drift from style noise, requiring only two forward passes over unlabeled prompts. Results show perfect detection (AUROC = 1.000) at 1% contamination across four architectures, with a 7.5x lower detection threshold than output-level methods, zero false positives, and robustness to adaptive attacks. The SAE basis also enables harm amplification (5x), red-teaming lift (4.2x), and inference-time remediation (harm reduced from 70% to 10%).

fine-tuning contamination · hidden-state geometry · sparse autoencoder · zero-label detection · supply-chain security

IstGPT: LLM-based Anomaly Detection for Spatial-Temporal Graph in Industrial Systems

arXiv cs.LG · Yuchen Zhang, Ning Xi, Pengbin Feng, Shigang Liu · 2026-06-01

IstGPT introduces a novel anomaly detection framework for industrial systems by integrating large language models (LLMs) with graph learning. The method leverages multi-modal industrial knowledge to construct sensor-actuator dependency graphs through multi-stage prompt engineering, refines these graphs using LLM-Optimization based on node accuracy, edge consistency, and logical coherence, and employs improved graph neural networks with an encoder-decoder architecture for anomaly detection via reconstruction errors. Evaluated against 12 baselines across 9 datasets, including public, simulated, and real-world robotic arm datasets, IstGPT achieves superior F1-scores and eTaF1 metrics. The study also explores its feasibility in real-world industrial deployments.

anomaly detection · large language models · graph neural networks · sensor-actuator dependency · encoder-decoder architecture

Don't Let a Few Network Failures Slow the Entire AllReduce

arXiv cs.LG · Peiqing Chen, Jiedong Jiang, Nengneng Yu, Yuefeng Wang · 2026-06-01

The paper presents OptCC, a fault-tolerant AllReduce algorithm that minimizes performance degradation during network failures in GPU clusters. By establishing an information-theoretic lower bound on AllReduce completion time under asymmetric bandwidth, the authors prove that only O(1/p) overhead is unavoidable when a straggler retains ≥50% bandwidth. OptCC employs a four-stage pipelined design to approach this bound. Experiments on SimAI show OptCC stays within 2-6% of NCCL's fault-free performance under 50% bandwidth loss and cuts overhead by up to 57% relative to state-of-the-art methods.

allreduce · network failures · fault tolerance · gpu clusters · bandwidth asymmetry

RDA: Reward Design Agent for Reinforcement Learning

arXiv cs.LG · Hojoon Lee, Ajay Subramanian, Ben Abbatematteo, Vijay Veerabadran · 2026-06-01

The paper introduces Reward Design Agent (RDA), a VLM-based framework for automating reward function design in reinforcement learning. RDA addresses alignment issues in prior methods (e.g., Eureka) by decomposing tasks, visually evaluating trajectories, and iteratively revising reward code using semantic understanding. Evaluated on 12 tabletop (ManiSkill) and 4 whole-body (HumanoidBench) manipulation tasks, RDA produces policies with significantly better instruction alignment while maintaining comparable success rates to baselines.

reinforcement learning · reward design · vlm-based · instruction alignment · trajectory evaluation

ATLAS: Agentic Test-time Learning-to-Allocate Scaling

arXiv cs.LG · Peijia Qin, Qi Cao, Pengtao Xie · 2026-06-01

ATLAS introduces an agentic test-time scaling framework where an LLM orchestrator manages the control loop end-to-end, deciding when to gather evidence, stop, and synthesize answers. The framework features an extensible action space, including solver choice and prompting strategy. Evaluated on four benchmarks with a Claude Sonnet 4.6 backbone, ATLAS achieves 56.00% on HLE-Verified, 82.29% on LiveCodeBench, 85.75% on GPQA-Diamond, and 23.71% on BabyVision, using fewer API calls than fixed-workflow baselines. A multi-model extension (ATLAS-MM) further improves performance, with ablations confirming the importance of stateful evidence management.

test-time scaling · llm orchestrator · action space · claude sonnet · stateful evidence

Quantifying the Energy Floor: Direct Measurement and Replay Buffer Bias in SAC-Based HVAC Control on sbsim

arXiv cs.LG · Bo Li, Chen Zhang · 2026-06-01

The study quantifies the energy floor, the minimum achievable cost under action space constraints, for Soft Actor-Critic (SAC) HVAC control on the sbsim calibrated building simulator, measuring it at USD 35.51/day. Through minimum-action experiments and systematic ablation, the authors identify replay buffer initialization as the dominant source of sub-optimality: training from an empty buffer reduces costs to USD 35.57/day, eliminating 96% of the gap. Expanding the supply water temperature range by 10 K yields negligible savings (USD 0.03/day). The study also uncovers a discount factor coupling (γ_eff = 0.891) that shrinks the effective planning horizon from 8.3 h to 46 min, highlighting a benchmark-wide issue.

energy floor · soft actor-critic · replay buffer · discount factor · planning horizon
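
The horizon figures above are consistent with the common rule of thumb that a discount factor gamma gives an effective planning horizon of about 1 / (1 - gamma) steps. The check below assumes a nominal discount factor of 0.99 and a 5-minute control step. Neither value is stated in this summary, so treat both as assumptions.

```python
def effective_horizon_minutes(gamma, step_minutes):
    """Rule-of-thumb planning horizon 1 / (1 - gamma), converted to minutes."""
    return step_minutes / (1.0 - gamma)


STEP_MINUTES = 5.0     # assumed control interval
NOMINAL_GAMMA = 0.99   # assumed configured discount factor
GAMMA_EFF = 0.891      # effective discount factor reported in the summary

nominal_hours = effective_horizon_minutes(NOMINAL_GAMMA, STEP_MINUTES) / 60.0
effective_minutes = effective_horizon_minutes(GAMMA_EFF, STEP_MINUTES)
print(f"nominal: {nominal_hours:.1f} h, effective: {effective_minutes:.0f} min")
# nominal: 8.3 h, effective: 46 min
```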

Gate the Filter, Not the Message: Node-Channel Mixtures for Pre-Propagation GNNs

arXiv cs.LG · Zichao Yue, Zhiru Zhang · 2026-06-01

FilterMoE introduces joint node- and channel-adaptive filtering for pre-propagation graph neural networks (PPGNNs), addressing a gap in existing designs that either share filters across nodes or channels. The method employs a mixture-of-experts architecture with a bank of learnable Chebyshev filters routed by a 3D gating tensor. Evaluated on eleven homophilic and heterophilic benchmarks, FilterMoE outperforms strong PPGNN baselines on nine datasets, achieving a 1.53-point average test score improvement and ranking first on all three large-scale benchmarks. This demonstrates the robustness of joint node-channel filter routing over dataset-specific hop-aggregator selection.

pre-propagation · graph neural networks · chebyshev filters · mixture-of-experts · node-channel filtering
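
A Chebyshev filter is a polynomial in the (rescaled) graph Laplacian whose terms follow the recurrence T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x). The sketch below evaluates the spectral response of one such filter at a single eigenvalue in [-1, 1]. It illustrates a single expert's filter only, not FilterMoE's gating, and the function name and coefficients are hypothetical.

```python
def chebyshev_filter_response(coeffs, lam):
    """Evaluate sum_k coeffs[k] * T_k(lam) using the Chebyshev recurrence.

    coeffs must be non-empty; lam is a rescaled Laplacian eigenvalue in [-1, 1].
    """
    t_prev, t_curr = 1.0, lam          # T_0 and T_1
    total = coeffs[0] * t_prev
    if len(coeffs) > 1:
        total += coeffs[1] * t_curr
    for c in coeffs[2:]:
        t_prev, t_curr = t_curr, 2.0 * lam * t_curr - t_prev
        total += c * t_curr
    return total


# T_2(0.5) = 2 * 0.5^2 - 1 = -0.5
print(chebyshev_filter_response([0.0, 0.0, 1.0], 0.5))  # -0.5
```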

Self-Regulating Annealing in Heavy-Tailed Diffusion Models

arXiv cs.LG · Keito Wakatsuki, Hideaki Shimazaki · 2026-06-01

The paper proposes a stochastic differential equation (SDE)-based sampler for heavy-tailed diffusion models (HTDMs) that introduces a state-dependent diffusion coefficient. This adaptive coefficient induces self-regulating annealing by modulating the effective noise scale during sampling. Theoretical analysis demonstrates the mechanism's role in maintaining heavy-tailed fidelity, while experiments confirm its necessity for accurate sample generation from heavy-tailed distributions compared to standard Gaussian formulations.

heavy-tailed diffusion models · stochastic differential equation · self-regulating annealing · state-dependent diffusion · student's t-distribution

IMWM: Intuition Models Complement World Models for Latent Planning

arXiv cs.LG · Baoqi Gao, Ruize Han, Miao Wang, Song Wang · 2026-06-01

The paper introduces IMWM (Intuition Model + World Model), a hybrid planning approach combining a learned world model with an intuition model trained from demonstrations to address search bottlenecks in latent planning. The method integrates three components: Retrieval Initialization for action proposal, Hybrid Cost combining intuition scores with world-model rollout costs, and a Reliability Gate for dynamic trust adjustment. Evaluated on four pixel-based goal-reaching tasks (Two-Room, Reacher, Push-T, OGBench-Cube), IMWM improves mean success rates, notably on Two-Room (+11.5pp to 99.2%) and OGBench-Cube (+28.5pp to 94.7%).

latent planning · world model · intuition model · sample-based planner · hybrid cost

Self-Improving Small Object Grounding in LVLMs

arXiv cs.LG · Tianze Yang, Yucheng Shi, Ruitong Sun, Ninghao Liu · 2026-06-01

The work demonstrates that internal attention patterns in Large Vision Language Models (LVLMs) can reliably identify small-object boxes without fine-tuning. A lightweight IoU regressor trained on attention maps achieves strong IoU prediction (Pearson r > 0.67), powering the Attention-based Candidate Selection (ACS) framework. Two variants are proposed: ACS-Learned (regressor-based) and ACS-Free (training-free, using attention entropy). Experiments on COCO and Objects365 show up to 19% improvement in small object localization, with ACS-Free outperforming all training-free methods, revealing critical transformer layers and heads for interpretable grounding.

large vision language models · attention patterns · iou regressor · object grounding · attention entropy

Learning Chaotic Dynamics through Second-Order Geometric Supervision

arXiv cs.LG · Shinhoo Kang, Hai V. Nguyen, Tan Bui-Thanh · 2026-06-01

The paper proposes second-order geometric supervision to improve learning of chaotic dynamical systems by preserving attractor geometry and invariant statistics. Existing methods match trajectories (zero-order) and Jacobians (first-order) but fail to constrain second-order curvature, leading to spurious attractors. The authors introduce model-constrained randomized Jacobian matching, which implicitly enforces Hessian consistency at O(d²) cost via Jacobian evaluations at perturbed inputs, avoiding explicit O(d³) Hessian computation. Experiments on Lorenz 63 and Lorenz 96 systems show second-order methods eliminate catastrophic Lyapunov-exponent outliers, maintain correct attractors, and preserve invariant measures under out-of-distribution forcing, performing comparably to explicit Hessian matching at lower cost.

chaotic dynamics · jacobian matching · hessian consistency · invariant measure · lyapunov spectrum
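
The reason Jacobians at perturbed inputs carry second-order information is that (J(x + eps*v) - J(x)) / eps approximates the directional derivative of the Jacobian along v, which is a slice of the Hessian tensor. The sketch below checks this on the Lorenz 63 vector field with its standard parameters. It is a finite-difference illustration only, not the paper's training objective, and the helper names are hypothetical.

```python
SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0  # standard Lorenz 63 parameters


def lorenz63_jacobian(state):
    """Analytic Jacobian of the Lorenz 63 vector field at `state`."""
    x, y, z = state
    return [
        [-SIGMA, SIGMA, 0.0],
        [RHO - z, -1.0, -x],
        [y, x, -BETA],
    ]


def jacobian_directional_derivative(jac, state, direction, eps=1e-3):
    """Finite-difference estimate of d/dt J(state + t * direction) at t = 0."""
    shifted = [s + eps * d for s, d in zip(state, direction)]
    j0, j1 = jac(state), jac(shifted)
    return [[(b - a) / eps for a, b in zip(row0, row1)]
            for row0, row1 in zip(j0, j1)]


state = [1.0, 2.0, 3.0]
direction = [0.5, -1.0, 2.0]  # v = (vx, vy, vz)
second_order = jacobian_directional_derivative(lorenz63_jacobian, state, direction)

# Analytic answer for Lorenz 63: [[0, 0, 0], [-vz, 0, -vx], [vy, vx, 0]]
expected = [[0.0, 0.0, 0.0], [-2.0, 0.0, -0.5], [-1.0, 0.5, 0.0]]
```

Each Jacobian has d² entries, so comparing Jacobians along a few random directions costs O(d²) per direction, whereas forming the full Hessian tensor requires d³ entries.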

Uncertainty-Calibrated Diffusion for Reliable 3D Molecular Graph Generation

arXiv cs.LG · Fang Wan, Jingxiang Qu, Yi Liu · 2026-06-01

The paper introduces UCD (Uncertainty-Calibrated Diffusion), a method for improving 3D molecular graph generation by addressing epistemic uncertainty in diffusion models. The authors analyze how epistemic uncertainty from the denoiser interacts with aleatoric uncertainty during reverse diffusion, leading to variance inflation and distribution mismatch. UCD calibrates the reverse process to account for this effect, achieving state-of-the-art performance on standard 3D molecular benchmarks.

bayesian inference · diffusion models · epistemic uncertainty · molecular generation · uncertainty calibration

TLG: Temporal-Logic Grounding for Video Question Answering via Source-Annotation Reconstruction and Category-Targeted Reasoning

arXiv cs.LG · Ali Alavi · 2026-06-01

TLG (Temporal-Logic Grounding) improves video question answering on the TimeLogic Challenge by combining three components: (i) deterministic execution of temporal-logic programs reconstructed from source-dataset annotations, (ii) fallback to a vision-language model (VLM) for unannotated content, and (iii) category-targeted routing to a frontier reasoning model. The method addresses VLMs' inability to localize temporal actions, achieving 71.37% accuracy (+24.5 over VLM baselines) on a benchmark with 16 temporal operators. Ablations show annotation quality—not model scale—is critical for temporal grounding.

temporal-logic grounding · video question answering · vision-language models · action timeline reconstruction · temporal operators

RobustModelMaker: Coupling Bootstrap Stability Selection with Leakage-Safe Nested Cross-Validation for Scientific Machine Learning

arXiv cs.LG · Amanda S Barnard · 2026-06-01

RobustModelMaker introduces a Python framework coupling bootstrap stability selection with strict nested cross-validation to address feature instability and optimistic bias in scientific ML pipelines. The method performs all preprocessing and selection within each fold, supporting nine algorithms across classification and regression tasks. Evaluations on three datasets show competitive predictive scores versus ANOVA F-test, RFECV, and Boruta selectors, while uniquely optimizing the joint score-stability frontier, as demonstrated in ovarian cancer biomarker discovery and superconductivity critical-temperature regression applications.

bootstrap stability selection · nested cross-validation · feature selection · scientific machine learning · optimistic bias

MomentKV: Closing the Directional Gap in KV Cache Eviction for Long-Context Inference

arXiv cs.LG · Yu Li, Binxu Li, Tian Lan · 2026-06-01

MomentKV introduces a novel KV cache eviction strategy for long-context inference in Transformer-based language models, addressing the directional mismatch between retained and evicted tokens. The method maintains compact moment statistics—count, key mean, value mean, and value-key covariance—over the evicted token set, ensuring geometric regularity. During inference, these statistics enable a closed-form first-order approximation of evicted attention output, enhancing accuracy. Evaluated on LongBench and RULER with LLaMA-3.1-8B-Instruct and Qwen3-4B-Instruct, MomentKV outperforms all baselines across cache budgets, particularly under aggressive compression.

kv cache · eviction strategy · moment statistics · long-context inference · attention output
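
To see how moment statistics could stand in for evicted tokens, expand the softmax weights to first order around the mean key: the attention output over the evicted set is then approximately v_mean + Cov(v, k) · q. The sketch below is a reading of the summary rather than MomentKV's implementation. It assumes the query is already scaled by 1/sqrt(d), and all function names and data are hypothetical.

```python
import math


def evicted_moments(keys, values):
    """Count, key mean, value mean, and value-key covariance of evicted tokens."""
    n, dk, dv = len(keys), len(keys[0]), len(values[0])
    k_mean = [sum(k[j] for k in keys) / n for j in range(dk)]
    v_mean = [sum(v[i] for v in values) / n for i in range(dv)]
    cov = [[sum((v[i] - v_mean[i]) * (k[j] - k_mean[j])
                for k, v in zip(keys, values)) / n
            for j in range(dk)] for i in range(dv)]
    return n, k_mean, v_mean, cov


def approx_evicted_output(query, moments):
    """First-order estimate: v_mean + Cov(v, k) @ query."""
    _, _, v_mean, cov = moments
    return [v_mean[i] + sum(cov[i][j] * query[j] for j in range(len(query)))
            for i in range(len(v_mean))]


def exact_evicted_output(query, keys, values):
    """Reference softmax attention restricted to the evicted tokens."""
    logits = [sum(q * kj for q, kj in zip(query, k)) for k in keys]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    z = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / z
            for i in range(len(values[0]))]


# Toy evicted set with tightly clustered keys (illustrative values).
keys = [[1.0, 0.0], [1.1, 0.1], [0.9, -0.1]]
values = [[1.0], [2.0], [3.0]]
query = [0.2, 0.1]

approx = approx_evicted_output(query, evicted_moments(keys, values))
exact = exact_evicted_output(query, keys, values)
```

The approximation error grows with the spread of q·k across evicted tokens, so it is tightest when evicted keys cluster. The count is not needed for this restricted output, but it would be when weighing the evicted set against the retained tokens.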

Everywhere Learning: Artificial Intelligence with Pointwise Constraints

arXiv cs.LG · Ignacio Boero, Ignacio Hounie, Luiz Chamon, Alejandro Ribeiro · 2026-06-01

The paper introduces everywhere learning, a paradigm where AI systems satisfy loss constraints with probability one over the data distribution, contrasting with average loss minimization. The authors develop an approximate duality theory to analyze generalization, showing dual variables reweight data toward points where constraints are harder to satisfy. Generalization depends on the mass concentration mismatch between data distribution and constraint difficulty points, controllable via L1 sparsity on constraint relaxations. Experimental validation includes agentic classification for language model tasks.

everywhere learning · duality theory · generalization analysis · constraint relaxation · agentic classification

Flexible Online Representation Learning Based on Similarity Matching

arXiv cs.LG · Shagesh Sridharan, Yanis Bahroun, Anirvan M. Sengupta · 2026-06-01

The authors propose a biologically plausible online learning algorithm for sparse high-dimensional representations, addressing limitations of conventional methods in scalability and computational tractability. The method operates without row sum constraints, enabling shift-invariant representations useful for clustering, manifold tiling, and sparse coding. It avoids optimization in completely positive or doubly nonnegative matrix spaces, which scale poorly with sample size. The algorithm demonstrates versatility in handling diverse data structures, including graphs relevant to community detection problems. Results suggest applicability to unsupervised exploration tasks requiring sparse representations.

sparse representation · manifold tiling · shift-invariant · online learning · community detection

CRePE: Convolution-aware Relative Importance in Post-training Pruning with Efficient Search

arXiv cs.LG · Cheonjun Park · 2026-06-01

CRePE introduces convolution-aware relative importance scoring for post-training pruning (PTP) of LLMs, incorporating 2D local neighborhood context and adaptive coefficients to surpass existing methods like RIA. The method employs PHO (Proxy-based Hyperparameter Optimization) to reduce coefficient search time from 11 hours to 20 minutes by eliminating repeated perplexity evaluations. Evaluations show consistent improvements across models and sparsity levels, with optimal hyperparameters transferring well between models and combining orthogonally with techniques like Channel Permutation and non-uniform sparsity allocation.

post-training pruning · relative importance scoring · hyperparameter optimization · large language models · sparsity allocation

Scalable Counterfactual Risk Estimation for Rare Events in Longitudinal Data

arXiv cs.LG · Xiaohui Yin, Avijit Mitra, Ying Zhou, Kun Chen · 2026-06-01

The authors propose a subsampling and reweighting strategy to address computational and stability challenges in estimating causal effects of time-varying treatments on rare survival outcomes in longitudinal data. The method enhances existing estimators like the iterative conditional expectation (ICE) by mitigating class imbalance and reducing bootstrap variance estimation costs. Evaluations through simulations and a large-scale EHR study on social/behavioral determinants of health demonstrate improved computational efficiency and estimation stability for rare outcomes.

longitudinal causal inference · g-formula · iterative conditional expectation · class imbalance · survival analysis

MPMWorlds: Material-Point-Method Simulations for Inferring and Extrapolating Physical Dynamics

arXiv cs.LG · Žiga Kovačič, Kevin Ellis · 2026-06-01

The authors introduce MPMWorlds, a 2D Material Point Method (MPM) simulation dataset for studying physical dynamics inference and extrapolation from videos. They evaluate code generation and video diffusion approaches, varying the inclusion of physical side information. Results show code generation models synthesize functional MPM simulations but struggle with parameter inference, while producing physically stable extrapolations. Video diffusion models better identify geometric properties but generate physically implausible long-term predictions.

material point method · physical dynamics · code generation · video diffusion · parameter inference

PaCX-MAE: Physiology-Augmented Chest X-Ray Masked Autoencoder

arXiv cs.LG · Yancheng Liu, Kenichi Maeda, Manan Pancholy · 2026-06-01

PaCX-MAE introduces a cross-modal distillation framework that enhances chest X-ray (CXR) encoders with physiological priors from ECG and laboratory data while maintaining unimodal inference. The method combines masked autoencoding with dual contrastive-predictive objectives to align CXR representations with physiological embeddings. Evaluations across nine benchmarks show significant improvements over domain-specific MAE, notably in physiology-dependent tasks (+2.7 AUROC on MedMod, +6.5 F1 on VinDr), with strong label efficiency in the 1% regime and preserved anatomical fidelity.

masked autoencoder · cross-modal distillation · physiological priors · contrastive-predictive · label efficiency

Multi-Agent Computer Use

arXiv cs.LG · Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried · 2026-06-01

The paper proposes multi-agent computer use (MACU) systems to address limitations of single-agent setups in complex, long-horizon tasks. MACU employs a manager model that decomposes tasks into a directed acyclic graph (DAG), dispatches parallel subagents to execute nodes, and dynamically revises the DAG based on new information. Evaluations on OSWorld, Online-Mind2Web, WebTailBench, and Odysseys show MACU improves task completion by 3.4-25.5%, reduces wall-clock time by ~1.5× on Odysseys, and exhibits better test-time scaling compared to single-agent baselines.

multi-agent systems · directed acyclic graph · task decomposition · parallel execution · long-horizon tasks
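
The dispatch pattern described above, running every subtask whose dependencies are already finished, can be sketched with Python's standard `graphlib.TopologicalSorter`. The task graph and function name are hypothetical. The sketch only groups tasks into parallelizable waves and does not model MACU's dynamic DAG revision.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies):
    """Group DAG nodes into waves; every node in a wave has all dependencies met."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # subtasks that could run in parallel now
        batches.append(ready)
        sorter.done(*ready)                 # mark them finished to unlock successors
    return batches


# Hypothetical task graph: each key lists the subtasks it depends on.
task_dag = {
    "open_browser": [],
    "open_spreadsheet": [],
    "collect_prices": ["open_browser"],
    "fill_table": ["collect_prices", "open_spreadsheet"],
}

batches = parallel_batches(task_dag)
print(batches)
# [['open_browser', 'open_spreadsheet'], ['collect_prices'], ['fill_table']]
```

A manager that revises the plan mid-run would rebuild the sorter from the updated graph, since `TopologicalSorter` does not accept new nodes after `prepare()`.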

Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete

arXiv cs.LG · Qian Li, Xinyu Mao, Shang-Hua Teng · 2026-06-01

The paper challenges the necessity of positional encoding (PE) in sliding-window transformers by proving their Turing completeness without PE. It introduces the HIST model, an autoregressive framework relying on token-count histograms within a finite window, and demonstrates its Turing completeness via simulation of Post machines. A PE-free sliding-window transformer is then shown to simulate HIST, revealing that window sliding inherently breaks permutation symmetry and provides sufficient positional information for universal computation.

positional encoding · sliding-window transformers · turing completeness · permutation symmetry · post machines
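
To make "token-count histograms within a finite window" concrete, the sketch below computes the histogram visible at each position of a short sequence. It shows only the information such a model would see. The summary does not give HIST's transition rule, so none is implemented, and the function name is hypothetical.

```python
from collections import Counter


def window_histograms(tokens, window):
    """Token-count histogram of the last `window` tokens at each position."""
    return [Counter(tokens[max(0, i - window + 1): i + 1])
            for i in range(len(tokens))]


hists = window_histograms(list("abba"), window=2)
# Windows seen: "a", "ab", "bb", "ba"
```

Inside one window the histogram ignores order ("ab" and "ba" give the same counts), yet the sequence of histograms changes as the window slides, which is the sense in which sliding breaks permutation symmetry.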

Near-Optimal Pure Machine Unlearning for Smooth Strongly Convex Losses

arXiv cs.LG · Matthew Regehr, Gautam Kamath, Andrew Lowy · 2026-06-01

The work establishes near-optimal bounds for pure machine unlearning in smooth strongly convex stochastic optimization, resolving the fundamental statistical cost of approximate ε-unlearning. The authors prove tight upper and lower bounds on excess population risk, showing optimality up to a condition-number factor. For mean estimation over the unit ball, bounds match exactly, revealing an unlearning penalty that interpolates between retraining costs and exponentially smaller terms as ε/d grows. When ε ≫ d, their algorithm achieves exponential accuracy gains over retraining and differentially private baselines; for ε ≤ d, retraining remains optimal.

machine unlearning · strongly convex optimization · excess population risk · differential privacy · condition-number

Semi-Supervised Hyperbolic Hierarchical Clustering with Set-Level Structural Priors

arXiv cs.LG · Junjing Zheng, Xinyu Zhang, Xiangfeng Qiu, Chengliang Song · 2026-06-01

The paper proposes a semi-supervised hyperbolic hierarchical clustering method incorporating set-level structural priors to improve subtree coherence beyond conventional leaf-level supervision. The approach learns constraint-consistent embeddings to partition samples into sets, estimates inter-set similarities as structural priors, and optimizes hierarchy formation via a hyperbolic objective. Evaluations on eleven benchmarks demonstrate consistent improvements in label consistency and tree quality compared to baseline methods.

hyperbolic clustering · set-level priors · hierarchical organization · constraint-consistent embeddings · semi-supervised learning

Fast Generalization after Interpolation via Critically Damped Momentum Optimization

arXiv cs.LG · Luca Muscarnera, Silas Ruhrberg Estévez, Yuanzhang Xiao, Mihaela Van der Schaar · 2026-06-01

The paper introduces GROKtimizer, a biphasic optimization strategy combining rapid interpolation with critically damped momentum (CDM)-based post-interpolation norm minimization, to select low-norm interpolating solutions in high-dimensional regimes. The method leverages a local quadratic model of the post-interpolation basin, achieving a quadratic speedup over gradient descent and provable optimality among first-order optimizers. Evaluations on synthetic benchmarks and real-world datasets demonstrate improved generalization, aligning with the flat-minima hypothesis and highlighting post-interpolation dynamics' role in model quality.

grokking · interpolation threshold · critically damped momentum · norm minimization · flat-minima hypothesis

Semantic Retrieval for Product Search in E-Commerce

arXiv cs.LG · Nikhil Kothari, Saksham Samdani, Ritam Mallick, Praveen Gupta · 2026-05-31

The paper introduces a Siamese LLM dual-encoder for semantic retrieval in e-commerce, addressing noisy queries and fine-grained product distinctions. The method employs a two-stage pipeline: contrastive learning with a false-negative margin mask to avoid penalizing near-duplicates, followed by Relative Odds Alignment for Retrieval (ROAR), a preference optimization objective extending Bradley-Terry to graded relevance groups. Evaluations show improved retrieval of exact matches and correct ordering of substitutes and complementary products, with statistically significant gains across query-frequency strata and business verticals in live A/B tests.

siamese llm · contrastive learning · false-negative margin mask · relative odds alignment · graded relevance

CART: Context-Anchored Recurrent Transformer -- A Parameter-Efficient Architecture with Learned Stability

arXiv cs.LG · Chad A. Capps · 2026-05-31

CART (Context-Anchored Recurrent Transformer) introduces a parameter-efficient language model architecture that reuses a single shared core block R times across depth, reducing computational overhead by computing key-value tensors once via a multi-layer prelude and employing multi-head latent attention. Stability is ensured by a learned Linear Time-Invariant (LTI) gate, maintaining a spectral radius ρ ∈ [0.79, 0.83]. Evaluated on single consumer GPUs, CART shows that prelude depth P dominates loop count R, with R=6 performing best at d≥512 after full training. However, CART underperforms parameter-matched dense baselines by 1-2% at stored-parameter parity and ~10% at effective-parameter parity, with diagnostic ablations attributing ~5% to weight sharing and ~5% to the heterogeneous framing.

parameter-efficient · multi-head latent attention · linear time-invariant gate · spectral radius · diagnostic ablations

Perception First: A Frontier Native-Video Model with Self-Consistency for Implicit Video Question Answering

arXiv cs.LG · Ali Alavi · 2026-05-31

The paper presents a video question answering model for the VRR Challenge @ CVPR 2026, focusing on the ImplicitQA/VRR-QA benchmark where answers require inference from spatial, temporal, and social cues across discontinuous frames. It evaluates training-free strategies across multiple Video-LMMs (Qwen2.5-VL, Qwen3-VL, InternVL3, Gemma-3, Video-R1, VideoChat-R1.5) and inference techniques (chain-of-thought, self-consistency, etc.). Key findings indicate the task is perception-bound, with reasoning augmentations offering minimal benefit. Error analysis reveals challenges in low-level perception (depth, viewpoint, counting), while causal and social reasoning are nearly solved. Injecting depth cues degrades performance by 5.8 points, underscoring the need for improved perceptual capabilities.

implicitqa · video-lmms · self-consistency · perception-bound · monocular depth

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

arXiv cs.LG · Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang · 2026-05-31

OmniOPD introduces a logit-free on-policy distillation framework that replaces token-level logit matching with chunk-level Monte Carlo rollouts and semantic similarity metrics. The method employs a peak-entropy scheduler for targeted supervision, a Dirichlet-Multinomial prior for variance control, and a KL anchor to prevent policy collapse. Evaluations show +28.64% improvement over standard OPD on math tasks and +9.54% gain when using black-box teachers like Claude-4.5-Haiku, demonstrating superior signal extraction compared to logit-based approaches.

on-policy distillation · monte carlo rollouts · semantic similarity · peak-entropy scheduler · kl anchor

Genotype-Conditioned Molecular Generation via Evidence-Grounded Multi-Objective Latent Perturbation in Diffusion Models

arXiv cs.LG · Brenda Nogueira, Gisela A. Gonzalez-Montiel, Nitesh V. Chawla, Nuno Moniz · 2026-05-31

The paper introduces a latent-space optimization method for genotype-conditioned molecular generation, enhancing a pretrained diffusion model via gradient ascent on a composite reward function. The approach jointly optimizes for drug sensitivity (AUC), drug-likeness (QED), and synthetic accessibility (SAS), with biological realism enforced through cancer cell line data and pharmacologic signals. Evaluation across 15 cancer cell lines demonstrates improvements over baselines in sensitivity, drug-likeness, synthesizability, and chemical validity, with mechanistic plausibility assessed via a multi-agent LLM pipeline.

latent-space optimization · genotype-conditioned generation · diffusion models · drug sensitivity · synthetic accessibility

Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment

arXiv cs.LG · Hamidreza Hasani Balyani, Seyed Pouyan Mousavi Davoudi, Alireza Amiri-Margavi, Amin Gholami Davodi · 2026-05-31

The paper introduces a pre-specified benchmark for evaluating LLM honesty under preference misalignment, adapting Crawford-Sobel's cheap-talk model to measure strategic information revelation. Using 5 bias levels, 3 prompt frames, and 200 states per cell (3,000 calls per model, 12,000 in total), the study tests four instruction-tuned models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Flash-Lite, Llama-3.3-70B). Results show all models over-reveal information by 1.8-4.2x compared to the most-informative equilibrium, maintaining high normalized mutual information (0.78-0.94 versus the oracle-prescribed 0.18-0.53) and exhibiting linear exaggeration rather than optimal coarse partitions.

cheap-talk model · preference misalignment · normalized mutual information · instruction-tuned models · linear exaggeration
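To make "optimal coarse partitions" concrete, the sketch below computes the most-informative equilibrium partition in the textbook uniform-quadratic Crawford-Sobel setting (state uniform on [0, 1], quadratic losses, sender bias b). Whether the benchmark uses exactly this specification is an assumption, and the function names are illustrative.

```python
def max_partition_size(bias):
    """Largest N such that an N-interval equilibrium exists,
    i.e. the largest N with 2 * N * (N - 1) * bias < 1."""
    if bias <= 0:
        raise ValueError("bias must be positive; zero bias allows full revelation")
    n = 1
    while 2 * (n + 1) * n * bias < 1:
        n += 1
    return n

def partition_boundaries(bias, n=None):
    """Interval boundaries a_0 = 0 < a_1 < ... < a_N = 1 of the N-interval
    equilibrium, using a_i = a_1 * i + 2 * i * (i - 1) * bias."""
    n = n or max_partition_size(bias)
    a1 = (1 - 2 * n * (n - 1) * bias) / n
    return [a1 * i + 2 * i * (i - 1) * bias for i in range(n + 1)]

# Hypothetical bias level; the benchmark's actual levels are not listed in the summary.
print(max_partition_size(0.1), partition_boundaries(0.1))
```

With a bias of 0.1 the sender can credibly distinguish only two intervals, split at about 0.3, and for any bias of 0.25 or more only the uninformative single-interval equilibrium remains. A model that reports a near-continuous (if shifted) signal therefore reveals far more than equilibrium play prescribes.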

Spatially Distributed Task-Oriented Compression for Multi-Emitter Localization and Characterization with Spectral Overlap

arXiv cs.LG · H. Nazim Bicer, J. Nick Laneman · 2026-05-31

The authors propose a task-oriented distributed compression framework for joint multi-emitter localization and characterization in dense wireless environments. The method employs spatially distributed receivers that observe complex IQ samples, convert them to time-frequency representations, and encode them into compact latent vectors. A central fusion decoder combines these latents to estimate emitter properties, including location, center-frequency offsets, bandwidth, and waveform families, using a permutation-invariant training objective. Experiments on synthetic multi-emitter scenes show that compact receiver-side representations (e.g., drx=16) suffice for emitter counting and waveform-family estimation, while larger latents (e.g., drx=64) are needed for accurate localization and spectral-parameter regression.

task-oriented compression · multi-emitter localization · permutation-invariant training · time-frequency representation · spectral overlap

CEAR: Certified Ensemble Adversarial Robustness in DNNs

arXiv cs.LG · Daniel Sadig, Mohammadreza Maleki, Hamed Karimi, Reza Samavi · 2026-05-31

The paper introduces CEAR, a certified ensemble adversarial robustness method combining empirical and certified defenses for DNNs. CEAR trains ensemble members with varying Gaussian noise and temperatures to obfuscate gradients and logits, enhancing resistance to gradient-based attacks. It employs noisy logits and two voting mechanisms for robustness, extending randomized smoothing for ensemble verification. Experiments on MNIST, CIFAR10, and TinyImageNet show improved average certified accuracy, larger robustness radii, and reduced attack transferability versus baselines.

certified robustness · ensemble learning · adversarial defense · randomized smoothing · gradient obfuscation
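For readers unfamiliar with randomized smoothing, the sketch below shows the standard single-model L2 certificate that ensemble extensions build on: with Gaussian noise of standard deviation sigma and a lower bound p_a on the probability of the top class, the certified radius is sigma times the inverse normal CDF of p_a. This is the baseline bound only, not CEAR's ensemble certificate, and the function name is illustrative.

```python
from statistics import NormalDist

def certified_l2_radius(p_a_lower, sigma):
    """Certified L2 radius of a Gaussian-smoothed classifier.

    p_a_lower: lower confidence bound on the top-class probability under noise.
    sigma:     standard deviation of the Gaussian smoothing noise.
    Returns 0.0 when the bound does not exceed 1/2 (no certificate).
    """
    if not 0.0 < p_a_lower < 1.0:
        raise ValueError("p_a_lower must lie strictly between 0 and 1")
    if p_a_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_a_lower)

# Hypothetical values for illustration.
print(certified_l2_radius(0.9, sigma=0.5))
```

The radius grows with both the noise level and the confidence bound, which is why larger reported radii usually reflect either more confident smoothed predictions or training that tolerates larger sigma.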

Leaf Spectral Reflectance Prediction Using Multi-Head Attention Neural Networks

arXiv cs.LG · Parastoo Farajpoor, Alireza Pourreza, Mohammadreza Narimani, Ashraf El-Kereamy · 2026-05-31

A multi-head attention neural network was developed for species-specific leaf spectral reflectance prediction, outperforming the generalized PROSPECT-PRO radiative transfer model. The model was trained on a grapevine-specific dataset with 16 leaf traits across varieties, growth stages, and years, using stratified 5-fold cross-validation. It achieved R^2=0.84 and NRMSE=1.52%, with the largest MAE reductions in the NIR and SWIR regions, demonstrating improved accuracy for crop-specific applications in remote sensing and precision agriculture.

spectral reflectance · multi-head attention · radiative transfer model · near-infrared · precision agriculture
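The two headline metrics can be computed as in the sketch below. The summary does not say how NRMSE is normalized, so dividing RMSE by the range of the observed values is an assumption (normalizing by the mean is also common). The function names are illustrative.

```python
import math

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def nrmse_percent(y_true, y_pred):
    """RMSE divided by the observed range, in percent (normalization assumed)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return 100.0 * math.sqrt(mse) / (max(y_true) - min(y_true))

# Made-up reflectance values for illustration only.
observed = [0.10, 0.45, 0.50, 0.30]
predicted = [0.12, 0.43, 0.49, 0.33]
print(r_squared(observed, predicted), nrmse_percent(observed, predicted))
```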

On the Uncertainty Quantification Ability of Tabular Foundation Models

arXiv cs.LG · Tyler R. Johnson, Kian Ben-Jacob, Nima Negarandeh, Oriol Vendrell-Gallart · 2026-05-31

This work evaluates the uncertainty quantification (UQ) capabilities of tabular foundation models (FMs) in regression tasks, comparing Tabular Prior-Data Fitted Networks (TabPFN v2.5) against Gaussian processes (GPs). Through systematic empirical analysis across varying dataset complexities, sizes, and dimensionalities, the study reveals a trade-off between explicit and learned priors: TabPFN excels in complex, high-dimensional scenarios with sufficient data, while GPs demonstrate superior predictive accuracy and UQ in data-scarce settings, particularly when the kernel aligns well with the underlying function. Results are reproducible via the provided GitHub repository.

uncertainty quantification · tabular foundation models · gaussian processes · prior-data fitted networks · regression tasks

Learning-based Directed Graph Abstraction of Combinatorial Spaces for Order-Preserving Search in Mixed-Combinatorial Nonlinear Optimization

arXiv cs.LG · Gishnu Madhu, Feng Liu, Souma Chowdhury · 2026-05-31

The paper introduces a learning-based directed graph abstraction method for mixed-combinatorial nonlinear programming (MCNLP) problems, addressing limitations of traditional integer/binary encodings. An Edge Field Graph Network (EFGN) maps an undirected fully connected graph of combinations to a directed graph indicating improvement directions, enabling order-preserving search. Integrated with particle swarm optimization and genetic algorithms, the method outperforms indexified combination baselines on three benchmark problems, achieving better mean optima and robustness across runs.

mixed-combinatorial nonlinear programming · edge field graph network · graph neural networks · particle swarm optimization · genetic algorithm

Target localization, identification and sensing using latent symmetries

arXiv cs.LG · David Dukov, Malte Röntgen, Bryn Davies · 2026-05-31

The work demonstrates a novel sensing method exploiting latent symmetries in three-dimensional scatterer arrays, using the capacitance matrix as a model for hybridization. By analyzing symmetry breaking caused by an intruder scatterer, the system localizes the intruder and identifies its radius through dictionary-based approaches, Bayesian inference, or multi-layer perceptrons, with the latter two outperforming the dictionary-based approach under noise. This represents the first application of latent symmetries in 3D open systems for sensing and the first observation of such symmetries in non-sparse-graph systems.

latent symmetries · capacitance matrix · scatterer arrays · bayesian inference · multi-layer perceptron

Differentially Private Datastore Generation for Retrieval-Augmented Inference

arXiv cs.LG · Abdelrahman Abouelenein, Marwan Torki · 2026-05-31

The paper introduces a differentially private (DP) framework for generating secure datastores to enable privacy-preserving retrieval-augmented inference. The method employs locality-sensitive hashing (LSH) to partition high-dimensional data into buckets, then adds calibrated DP noise to bucket votes to produce class probability distributions. Evaluated on seven datasets (2-14 classes), the approach maintains strong utility (2.6% average accuracy drop at ε=5) while providing formal privacy guarantees. Membership inference attack accuracy is reduced to 53.60%, demonstrating robust protection against adversarial analysis.

differential privacy · retrieval-augmented inference · locality-sensitive hashing · membership inference · datastore generation
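The pipeline described above (hash into buckets, add noise to bucket votes, emit class distributions) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes random-hyperplane LSH with data-independent hyperplanes, the Laplace mechanism with sensitivity 1 (each record adds one vote to one bucket, under add/remove-one neighboring datasets), and noise added to every bucket including empty ones. All names are hypothetical.

```python
import itertools
import random

def make_hasher(dim, n_bits, rng):
    """Random-hyperplane LSH: one sign bit per Gaussian hyperplane."""
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def bucket(x):
        return tuple(int(sum(p * v for p, v in zip(plane, x)) >= 0.0)
                     for plane in planes)

    return bucket

def laplace(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def build_private_datastore(points, labels, n_classes, n_bits, epsilon, bucket, rng):
    """Map every LSH bucket to a noisy class-probability vector."""
    # Enumerate all 2**n_bits buckets so the set of released keys is data-independent.
    votes = {key: [0.0] * n_classes
             for key in itertools.product((0, 1), repeat=n_bits)}
    for x, y in zip(points, labels):
        votes[bucket(x)][y] += 1.0
    store = {}
    for key, counts in votes.items():
        # Clamping and normalizing are post-processing of the noisy counts.
        noisy = [max(c + laplace(1.0 / epsilon, rng), 0.0) for c in counts]
        total = sum(noisy)
        store[key] = ([v / total for v in noisy] if total > 0
                      else [1.0 / n_classes] * n_classes)
    return store

rng = random.Random(0)
bucket = make_hasher(dim=3, n_bits=2, rng=rng)
points = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(50)]
labels = [rng.randrange(2) for _ in range(50)]
store = build_private_datastore(points, labels, 2, 2, 5.0, bucket, rng)
print(store[bucket(points[0])])
```

At inference time a query would be hashed with the same `bucket` function and answered from the stored distribution, so the raw records never need to be retained. Enumerating every bucket only scales to small `n_bits`; larger hash widths would need a different release strategy.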

GPTQ-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank Adaptation

arXiv cs.LG · Shihao Zhang, Rayan Saab · 2026-05-31

The paper introduces GPTQ-intrinsic LoRA, a near-optimal algorithm for low-precision quantization with low-rank adaptation, addressing the layer-wise reconstruction objective $\|XW-X(Q+LR)\|_F^2$. The method incorporates low-rank correction directly into GPTQ-style quantization by augmenting the calibration Hessian, with theoretical bounds showing improved dependence on rank-$r$ residuals. Experiments on Qwen3 and DeiT models demonstrate superior performance over GPTQ and its variants, with additional gains from Bid-Up refinement.

quantization · low-rank adaptation · gptq · layer-wise reconstruction · post-training
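For intuition about the objective, consider the simpler subproblem in which the quantized matrix $Q$ is held fixed and only the low-rank term is chosen. The derivation below is a standard Eckart-Young argument, not the paper's algorithm, which according to the summary folds the correction into the quantization step itself. The symbols $E$, $M$, $\sigma_i$, and $V_r$ are introduced here for illustration.

```latex
\begin{aligned}
E &:= W - Q, \qquad M := LR, \qquad \operatorname{rank}(M) \le r, \\
\min_{\operatorname{rank}(M) \le r} \|XW - X(Q + M)\|_F^2
  &= \min_{\operatorname{rank}(M) \le r} \|XE - XM\|_F^2
   = \sum_{i > r} \sigma_i(XE)^2, \\
M^\star &= E \, V_r V_r^\top .
\end{aligned}
```

Here $\sigma_i(XE)$ are the singular values of $XE$ in decreasing order and $V_r$ holds its top $r$ right singular vectors. Because $XM$ has rank at most $r$, the error cannot fall below the Eckart-Young bound, and $XM^\star = XE\,V_r V_r^\top$ is exactly the best rank-$r$ approximation of $XE$, so the bound is attained. The residual that matters is therefore the tail spectrum of $XE$, not of $E$ alone.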

Neural Network Compression by Approximate Differential Equivalence

arXiv cs.LG · Ravi Dhiman, Andrea Passarella, Mirco Tribastone, Lorenzo Valerio · 2026-05-31

The paper introduces a neural network compression method based on approximate differential equivalence, contrasting with conventional weight-pruning approaches. By encoding networks as polynomial ODE systems and applying Approximate Forward Differential Equivalence, the method aggregates neurons with similar dynamical behavior using a tolerance parameter ε to control compression. Evaluations on synthetic dynamical systems and regression benchmarks demonstrate superior parameter reduction versus magnitude-based pruning and Wanda at comparable accuracy levels, establishing differential equivalence as a principled compression alternative.

neural network compression · differential equivalence · polynomial ode · parameter aggregation · magnitude-based pruning

📰 Industry Media (2)

Rehumanizing global health care with agentic AI

MIT Tech Review — AI · MIT Technology Review Insights · 2026-06-02

Agentic AI is being deployed in healthcare to address workforce shortages and administrative inefficiencies, automating complex workflows while maintaining human oversight. Hospitals like HSS utilize multi-agent systems for claims processing (handling 1,100/month), triage (reducing appeal time from 45 to 5 minutes), and scheduling, trained on institutional protocols with auditable decisions. Results show 100% appeal success rates and operational consolidation, enabled by unified data strategies and enterprise-wide integration. Deloitte and KPMG report 68-84% adoption rates, with providers anticipating 90% task automation so that clinicians can focus on high-complexity cases.

agentic ai · multiagent systems · electronic health records · triage automation · workflow optimization

How small businesses can leverage AI

MIT Tech Review — AI · Peter Hall · 2026-06-02

The article examines practical applications of LLMs for small business automation, focusing on administrative task delegation. Through a case study of a private tutor using Notion AI (a $20/month add-on with calendar/email integrations), it demonstrates AI-assisted recordkeeping, meeting summarization, and goal decomposition. Results include 60-80% time savings in inventory listing for craft businesses using specialized tools like Rain, though users report interface clunkiness and privacy concerns. Recommendations emphasize evaluating task suitability, using local models for sensitive data, and integrating AI with existing workflows.

llms · notion ai · inventory automation · local inference · task decomposition


Generated automatically at 2026-06-02 22:19 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.