Daily Digest — 2026-06-27

Friday, June 26, 2026 · 294 items · model: deepseek/deepseek-chat

294 items · 1 research lab, 279 arXiv papers, 14 industry media

🏛️ Research Labs (1)

Previewing GPT-5.6 Sol: a next-generation model

OpenAI News · 2026-06-26

OpenAI introduces GPT‑5.6 Sol, a next-generation model with enhanced agentic capabilities in coding, biology, and cybersecurity. The model features a robust safety stack, including layered safeguards, real-time misuse classifiers, and automated red-teaming with over 700,000 A100-equivalent GPU hours. GPT‑5.6 Sol achieves state-of-the-art performance on Terminal‑Bench 2.1 and GeneBench v1, while demonstrating competitive results on ExploitBench² and ExploitGym. The model is priced at $5 input / $30 output per 1M tokens, with Terra and Luna offering cost-effective alternatives. A phased release strategy ensures broader availability after initial testing with trusted partners.

agentic capabilities · misuse classifiers · automated red-teaming · terminal-bench · exploitgym

📜 arXiv Papers (279)

Autoregressive Boltzmann Generators

arXiv cs.AI · Danyal Rehman, Charlie B. Tan, Yoshua Bengio, Avishek Joey Bose · 2026-06-25

Autoregressive Boltzmann Generators (ArBG) introduce a novel autoregressive modeling framework for efficient sampling of molecular systems at thermodynamic equilibrium, overcoming limitations of flow-based Boltzmann Generators (BGs). ArBG leverages architectures effective in Large Language Models, circumvents topological constraints of normalizing flows, and enables sequential inference-time interventions. Empirical results demonstrate significant improvements over flow-based models, particularly in larger peptide systems like the 10-residue Chignolin. Robin, a 132M parameter transferable model trained with ArBG, reduces the zero-shot energy error (E-W$_2$) on 8-residue systems by over 60%, surpassing previous state-of-the-art performance.

autoregressive modeling · boltzmann generators · normalizing flows · thermodynamic equilibrium · sequential inference

Error-Conditioned Neural Solvers

arXiv cs.AI · Haina Jiang, Liam Wang, Peng-Chen Chen, Min Seop Kwak · 2026-06-25

The paper introduces Error-conditioned Neural Solvers (ENS), a novel approach to PDE solving that uses the PDE residual field as a network input to enable iterative error correction. Unlike traditional hybrid methods that minimize residuals via costly optimization, ENS learns to spatially interpret and correct its own errors, achieving higher accuracy without additional compute. Theoretical analysis shows residual minimization is unreliable for ill-conditioned systems, while ENS demonstrates superior performance across four PDE families, including 10× gains on turbulent Kolmogorov flow and robust generalization under distribution shifts.

neural surrogate models · pde residual · ill-conditioned systems · kolmogorov flow · distribution shift

Understanding Domain-Aware Distribution Alignment in Budgeted Entity Matching

arXiv cs.AI · Nicholas Pulsone, Gregory Goren, Roee Shraga · 2026-06-25

The paper analyzes BEACON, a state-of-the-art domain-aware entity matching (EM) method for low-resource settings, examining its performance under varying data constraints and supervision levels. Through targeted experiments, the study evaluates algorithmic choices and data availability impacts, specifically probing the role of distribution alignment in the framework. Results provide insights into BEACON's behavior, offering empirical evidence for its adaptation capabilities in realistic EM scenarios.

entity matching · distribution alignment · low-resource learning · domain adaptation · data integration

Language-Based Digital Twins for Elderly Cognitive Assistance

arXiv cs.AI · Mohammad Mehdi Hosseini, Mohammad H. Mahoor, Hiroko H. Dodge · 2026-06-25

The authors propose a language-based digital twin framework for elderly cognitive assistance, leveraging large language models (LLMs) to mimic conversational behavior using stylometric cues and contextual metadata. They introduce a multi-head conditional variational autoencoder (cVAE) to jointly evaluate reconstruction fidelity and predict cognitive scores (MoCA). Experiments on the I-CONECT dataset demonstrate preservation of identity-specific characteristics, with reconstruction and MoCA prediction errors comparable to real data (outperforming GPT-generated baselines), suggesting potential for scalable cognitive health monitoring.

digital twin · large language models · conditional variational autoencoder · mild cognitive impairment · stylometric cues

Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning

arXiv cs.AI · Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen · 2026-06-25

The paper introduces Planning Experience Exploration and Utilization (PEEU), a method enhancing GUI task planning for small multimodal LLMs (MLLMs) through autonomous environment exploration and hindsight experience synthesis. PEEU employs Task Decomposition Hierarchical Analysis Framework (TDHAF) to analyze compositional generalization across task granularities, revealing that high-level task training improves out-of-distribution (OOD) generalization. Experiments show PEEU's 7B model achieves 30.6% accuracy on real-world benchmarks, outperforming Qwen2.5-VL-32B, demonstrating the efficacy of hindsight task construction for small MLLMs.

multimodal llms · task planning · hindsight experience · compositional generalization · gui agents

AI Healthcare Chatbots as Information Infrastructure: A Large-Scale Study of User-Reported Breakdowns

arXiv cs.AI · Muhammad Hassan, Ramazan Yener, Ece Gumusel, Masooda Bashir · 2026-06-25

The study contributes a systematic analysis of user-reported breakdowns in AI healthcare chatbots through topic modeling of 15,000 reviews from 59 apps. Using interpretive analysis, it identifies three key failure modes: access barriers/service unreliability, UX/interaction quality issues, and billing/support problems, with privacy/security concerns correlating most strongly with negative sentiment. Findings frame chatbots as information infrastructures, revealing how access, usability, and trust failures impact digital health systems.

ai healthcare chatbots · topic modeling · information infrastructure · user experience · privacy concerns

When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models

arXiv cs.AI · Josef Chen · 2026-06-25

The study demonstrates that accuracy gains from multi-model LLM systems are fundamentally limited by the co-failure rate (beta), the fraction of queries on which all models err. Analyzing 67 models from 21 providers, the author shows that beta underpredicts all-wrong rates (0.052 observed vs. 0.023 predicted) in open-ended mathematics, with similar effects in code execution tasks. A tetrachoric-calibrated single-factor model reveals that error correlation (rho) fails to capture beta's impact. Results indicate that heterogeneous ensembles outperform homogeneous ones, but combining models rarely surpasses the single best model without robust routing signals.

co-failure rate · error correlation · tetrachoric calibration · heterogeneous ensembles · routing signal
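
A minimal sketch of the co-failure ceiling summarized above, assuming a hypothetical 0/1 correctness matrix in which each row is a query and each column is a model. The function names and toy data are illustrative and do not come from the paper.

```python
def co_failure_rate(correct):
    """Fraction of queries on which every model is wrong (beta)."""
    all_wrong = sum(1 for row in correct if not any(row))
    return all_wrong / len(correct)


def oracle_ceiling(correct):
    """Upper bound for any router or voter that picks among the models' answers: 1 - beta."""
    return 1.0 - co_failure_rate(correct)


def best_single_accuracy(correct):
    """Accuracy of the strongest individual model."""
    n_queries = len(correct)
    n_models = len(correct[0])
    return max(sum(row[m] for row in correct) / n_queries for m in range(n_models))


# Toy data (hypothetical): 5 queries, 3 models; 1 = correct, 0 = wrong.
correct = [
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],  # all models fail
    [1, 0, 1],
    [0, 0, 0],  # all models fail
]

beta = co_failure_rate(correct)
ceiling = oracle_ceiling(correct)
single = best_single_accuracy(correct)
```

On this toy matrix, the gap between `single` and `ceiling` is the most that any selection-based combination scheme could add, however good its routing signal.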

Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings

arXiv cs.AI · Preet Baxi, Jiannan Xu, Jane Yi Jiang, Stefanus Jasin · 2026-06-25

The paper investigates prompt injection attacks in LLM-based automated résumé screening, where candidates insert self-promotional text to influence rankings. Through controlled experiments, the authors demonstrate that prompt injection reliably boosts rankings when candidate quality is homogeneous and few candidates employ it, but effectiveness diminishes with widespread use. In heterogeneous quality settings, injection occasionally enables lower-quality candidates to outrank higher-quality ones, raising fairness concerns. Vulnerability peaks when manipulation is rare and quality differences are small. The study provides empirical evidence of strategic manipulation risks in algorithmic hiring systems.

prompt injection · large language models · algorithmic hiring · résumé screening · fairness

Simulation-based inference for rapid Bayesian parameter estimation in epidemiological models: a comparison with MCMC

arXiv cs.AI · Alina Bazarova, Johann Fredrik Jadebeck, Henrik Zunker, Carolina J. Klett-Tammen · 2026-06-25

This study demonstrates that simulation-based inference (SBI) with neural posterior estimation offers a computationally efficient alternative to Markov chain Monte Carlo (MCMC) for Bayesian calibration of mechanistic epidemiological models. The authors compare SBI and MCMC on a SECIR model using COVID-19 ICU occupancy data from Germany, evaluating performance across 31-day and 201-day inference windows. SBI achieves strong posterior agreement with MCMC, as measured by Wasserstein distances and Kullback-Leibler divergences, while significantly reducing runtime: SBI completes 31-day inferences in 60-70 seconds on a GPU versus 1000 seconds for MCMC, and 201-day inferences in 157 seconds versus over 19,000 seconds for MCMC. The results highlight SBI's potential for rapid, near-real-time epidemiological analysis.

simulation-based inference · bayesian calibration · markov chain monte carlo · epidemiological models · neural posterior estimation

EO-WM: A Physically Informed World Model for Probabilistic Earth Observation Forecasting

arXiv cs.AI · Junwei Luo, Shuai Yuan, Zhenya Yang, Yansheng Li · 2026-06-25

The paper introduces EO-WM, a video diffusion transformer for Earth Observation (EO) forecasting that models weather-driven dynamics through physically informed conditioning. The method separates climatological baseline and weather anomalies into distinct pathways, accumulating anomalous forcing to capture sustained environmental stress. It outperforms existing approaches by reducing NDVI prediction error by 5.63% and improving directional hit rate by 7.80%, validated via novel benchmarks for extreme weather response and seasonal fidelity.

earth observation · video diffusion transformer · normalized difference vegetation index · weather anomalies · conditioning pathways

E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation

arXiv cs.AI · Wen Ye, Peiyan Li, Tingyu Yuan, Yuan Xu · 2026-06-25

The paper introduces E-TTS, a modular framework for embodied test-time scaling in robotic manipulation that jointly optimizes reasoning and action scaling through history-aware iterative refinement. The method employs pairwise reasoning-action sampling and scoring, leveraging a history buffer for context retention and vision-language verifiers for candidate evaluation. Unlike open-loop approaches, E-TTS implements closed-loop feedback generation during sampling. Evaluated across 4 benchmarks, 6 environments, 3 embodiments, and 4 base VLMs, the framework achieves performance gains of up to 33.14% (simulation) and 26.62% (real-world) without additional training or data collection.

embodied test-time scaling · vision-language verifiers · history buffer · iterative refinement · robotic manipulation

Advancing Omnimodal Embodied Agents from Isolated Skills to Everyday Physical Autonomy

arXiv cs.AI · Junhao Shi, Zezheng Huai, Siyin Wang, Jia Chen · 2026-06-25

The paper introduces OmniAct, a framework for persistent embodied agents that integrates multimodal semantic planning, adaptive hierarchical memory, and asynchronous visual preemption to achieve cyber-physical autonomy. The architecture features a unified action space for heterogeneous tools, event-boundary-driven memory compression for sub-linear context growth, and closed-loop semantic verification during physical execution. Evaluated on 40 real-world long-horizon tasks across two robotic platforms and four IoT devices, OmniAct demonstrates improved end-to-end success rates, maintains token consumption under 100k+, and elevates open-weight models to proprietary-level performance.

embodied agents · multimodal planning · hierarchical memory · visual preemption · cyber-physical autonomy

From Celebrities to Anyone: Characterizing AI Nudification Content, Technology, and Community Dynamics on 4chan

arXiv cs.AI · Chi Cui, Yixin Wu, Yang Zhang · 2026-06-25

This study characterizes AI nudification content and technology in anonymous online communities, analyzing 24,105 synthetic non-consensual sexually explicit imagery (SNEACI) items from 4chan. Findings reveal a demographic shift: 55.8% of targets are non-celebrity individuals, contrasting with prior studies' 4.7%, indicating expansion to personal social circles. Open-source models dominate production (Stable Diffusion for 42.7% images, Wan for 66.5% videos), enabled by fine-tuned models and tutorials. A small cohort of active producers drives engagement, with the most prolific creating 780 items. The work highlights ecosystem mechanisms and calls for platform governance and technical safeguards.

ai nudification · sneaci · stable diffusion · fine-tuned models · platform governance

Bridging Talk and Thought: Understanding Dialogue Dynamics Across Collaborative Problem-Solving Contexts

arXiv cs.AI · Zhengyuan Liu, Stella Xin Yin, Min-Yen Kan, Nancy F. Chen · 2026-06-25

The authors propose a hierarchical two-layer coding framework for analyzing dialogue dynamics in collaborative problem-solving contexts, focusing on human-AI and multi-agent interactions. The framework integrates cognitive and non-cognitive problem-solving processes with metacognitive regulatory mechanisms, addressing limitations in existing approaches. Evaluated across nine datasets spanning multiple domains, the framework demonstrates effectiveness and generalizability, revealing metacognitive regulation as a critical discriminator of deeper collaboration. The work also offers insights into how humans and agents coordinate knowledge, skills, and efforts to solve complex problems.

dialogue dynamics · collaborative problem-solving · metacognitive regulation · multi-agent collaboration · hierarchical coding

CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention

arXiv cs.AI · Sayak Dutta · 2026-06-25

CARVE introduces content-aware gating for recurrent models by erasing only on the key axis, resolving three defects in GDN-2: memory-blind gating, parameter inefficiency, and incompatibility with WY-form triangular chunk solvers. The method reuses the recurrent output tensor as a free content signal and replaces per-value write gates with a single scalar per head. At 1.3B parameters trained on 100B tokens, CARVE reduces WikiText perplexity by 0.18 (15.72 vs. GDN-2), outperforms recurrent baselines on nine reasoning benchmarks, and achieves SOTA on RULER retrieval probes, with 0.4% throughput overhead, 13% lower memory, and 19% fewer parameters.

recurrent models · content-aware gating · wy-form solver · perplexity reduction · parameter efficiency

Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

arXiv cs.AI · Sangwoo Cho, Kushal Chawla, Pengshan Cai, Zefang Liu · 2026-06-25

BINEVAL introduces a framework for interpretable LLM evaluation by decomposing criteria into atomic binary questions, aggregating verdicts into multi-dimensional scores. Using task-specific meta-prompts, it generates fine-grained evaluation questions answered independently by an LLM, yielding transparent feedback and calibrated scores. Evaluated on SummEval, Topical-Chat, and QAGS, BINEVAL matches or outperforms UniEval and G-Eval, particularly excelling in factual consistency benchmarks. It avoids ceiling effects, better discriminates between outputs, and supports iterative prompt optimization, demonstrating task-agnostic, training-free evaluation with diagnostic and optimization value.

binary questions · interpretable evaluation · meta-prompt · factual consistency · prompt optimization
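
As a rough illustration of turning atomic yes/no verdicts into multi-dimensional scores, the sketch below averages the verdicts per dimension. The averaging rule, the function name, and the dimension labels are assumptions made for this example; the paper's actual aggregation may differ.

```python
from collections import defaultdict


def aggregate_binary_verdicts(verdicts):
    """Map (dimension, verdict) pairs to a per-dimension score in [0, 1].

    Assumption: a dimension's score is the fraction of its binary
    questions answered "yes" (True).
    """
    totals = defaultdict(int)
    yes_counts = defaultdict(int)
    for dimension, verdict in verdicts:
        totals[dimension] += 1
        yes_counts[dimension] += int(bool(verdict))
    return {dim: yes_counts[dim] / totals[dim] for dim in totals}


# Hypothetical verdicts for one summary, as (dimension, answer) pairs.
verdicts = [
    ("consistency", True),
    ("consistency", False),
    ("fluency", True),
    ("fluency", True),
]
scores = aggregate_binary_verdicts(verdicts)
```

Because each question is kept alongside its verdict, a low score can be traced back to the specific questions that failed, which is the interpretability benefit the summary describes.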

Vulnerability of Natural Language Classifiers to Evolutionary Generated Adversarial Text

arXiv cs.AI · Manjinder Singh, Alexander E. I. Brownlee, Mohamed Elawady · 2026-06-25

GAversary, a hybrid Genetic Algorithm for generating adversarial text attacks, is proposed to exploit NLP model vulnerabilities without requiring internal model access, using only logit outputs. The method employs GloVe embeddings in its mutation operator to enhance semantic similarity of adversarial examples. Evaluated on benchmark datasets against BAE and A2T attacks, GAversary reduces model accuracy from 76.8% to 5.8%, outperforming BAE's 27.6%, albeit with increased word perturbations, slightly lower semantic similarity, and a 5% runtime increase.

genetic algorithm · adversarial attacks · glove embeddings · semantic similarity · logit outputs

A Process Harness for Uplifting Legacy Workflows to Agentic BPM: Design and Realization in CUGA FLO

arXiv cs.AI · Fabiana Fournier, Lior Limonad · 2026-06-25

The authors introduce the process harness, a novel mechanism for integrating Agentic Business Process Management (Agentic BPM) into legacy workflows without replacing the underlying workflow engine. The harness employs a policy-governed agentic layer that intercepts control points, enabling reasoning, adaptation, and oversight while maintaining structural authority. They define the Task-Decision-Flow (TDF) model, which decomposes LLM reasoning across three agent types: TaskAgent, DecisionAgent, and FlowAgent, each governed by policies from the process FRAME. CUGA FLO, an implementation of TDF, is demonstrated on a loan approval workflow, showcasing agentic autonomy and regulatory override.

process harness · agentic bpm · task-decision-flow model · policy-governed agents · cuga flo

Automating Potential-based Reward Shaping with Vision Language Model Guidance

arXiv cs.AI · Henrik Müller, Daniel Kudenko · 2026-06-25

The paper introduces VLM-PBRS, a framework automating potential-based reward shaping (PBRS) using vision language model (VLM) guidance to address sparse reward challenges in reinforcement learning. VLM-PBRS learns a potential function by querying a lightweight VLM for preferences over image pairs, eliminating the need for expert-designed reward shaping terms while preserving optimal policies. Empirical validation in Meta-World and Franka Kitchen environments demonstrates improved sample efficiency and robustness to reward hacking, with performance linked to VLM preference label accuracy. Contributions include the first VLM-based PBRS potential function synthesis, a cost-effective small VLM solution, and extensive empirical validation.

potential-based reward shaping · vision language model · sparse rewards · reward hacking · sample efficiency
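
The standard potential-based shaping term is F(s, s') = γ·Φ(s') − Φ(s). The sketch below applies it with plain numeric potentials standing in for the learned potential function; how VLM-PBRS trains Φ from VLM preferences is not shown. Treating the potential after a terminal transition as zero is a common convention and an assumption here, and all names are illustrative.

```python
def shaping_term(phi_s, phi_next, gamma, done=False):
    """Potential-based shaping: gamma * Phi(s') - Phi(s)."""
    next_potential = 0.0 if done else phi_next  # assumed zero at episode end
    return gamma * next_potential - phi_s


def shaped_reward(reward, phi_s, phi_next, gamma=0.99, done=False):
    """Environment reward plus the shaping term."""
    return reward + shaping_term(phi_s, phi_next, gamma, done)


def discounted_shaping_sum(potentials, gamma):
    """Discounted sum of shaping terms for an episode that ends after the last state."""
    total = 0.0
    for t, phi_s in enumerate(potentials):
        last = t == len(potentials) - 1
        phi_next = 0.0 if last else potentials[t + 1]
        total += (gamma ** t) * shaping_term(phi_s, phi_next, gamma, done=last)
    return total


# Hypothetical potentials along a three-state episode.
episode_sum = discounted_shaping_sum([0.1, 0.4, 0.8], gamma=0.9)
```

The discounted shaping terms telescope to minus the starting potential, independent of the path taken, which is why this form of shaping preserves optimal policies.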

Learning to Fold: prizewinning solution at LeHome Challenge 2026 (1st place online, 2nd offline)

arXiv cs.AI · Ilia Larchenko · 2026-06-25

The paper presents a vision-language-action (VLA) policy enhanced with reinforcement learning, which won 1st place in the online simulation round and 2nd in the real-world final of the LeHome Challenge 2026. The policy integrates action prediction with success estimation, progress tracking, and future state forecasting, leveraging these outputs for advantage estimation, failure detection, and candidate selection. Key innovations include combining AWR and RECAP for flow-matching VLA, an asynchronous distributed training pipeline via HuggingFace Hub, Thompson sampling for inference-time hyperparameter optimization, and a sim-to-real transfer approach with camera alignment, heavy augmentation, and DAgger-like human-in-the-loop data collection.

vision-language-action · reinforcement learning · sim-to-real · thompson sampling · dagger

TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference

arXiv cs.AI · Tinghao Wang, Yichen Guo, Rui Huang, Zheng Lu · 2026-06-25

The paper introduces TOPS, a first-principles visual token pruning method for efficient MLLM inference, formulated as constructing Token Optimal Preservation Sets. Through information-theoretic analysis, TOPS identifies three key principles (Task Relevance, Information Coverage, Semantic Diversity) and implements them in a training-free, model-agnostic module. Experiments on 7 MLLM backbones and 14 benchmarks show TOPS removes 77.8% of visual tokens while maintaining or improving performance (100.0%/100.6% on LLaVA-NeXT 7B/13B), suggesting pruning can mitigate hallucination.

visual token pruning · multimodal large language models · information-theoretic analysis · task relevance · semantic diversity

OpenRCA 2.0: From Outcome Labels to Causal Process Supervision

arXiv cs.AI · Aoyang Fang, Yifan Yang, Jin'ao Shang, Qisheng Lu · 2026-06-25

OpenRCA 2.0 introduces PAVE, a step-wise labeling protocol for root cause analysis (RCA) that reconstructs causal propagation paths using fault injection interventions, addressing the limitation of outcome-only labels in existing datasets. The protocol employs forward verification to reason from cause to effect, yielding a benchmark with 500 instances featuring step-wise causal annotations. Evaluation across 11 frontier LLMs reveals that exact root-cause recovery succeeds in only 20.7% of cases, while ungrounded diagnosis—identifying a correct root-cause service without grounding it in a verified causal path—occurs in 76.0% of cases, highlighting the need for step-wise causal ground truth in LLM-based RCA.

root cause analysis · forward verification · fault injection · step-wise labeling · ungrounded diagnosis

Safe Autoregressive Image Generation with Iterative Self-Improving Codebooks

arXiv cs.AI · Yunqi Xue, Zhijiang Li, Philip Torr, Jindong Gu · 2026-06-25

The paper introduces iterative self-improving codebooks to enhance safety in autoregressive image generation without human annotation. The method first uses a unified multimodal model to detect unsafe generations and construct Harmful/Safe Spaces from image-text pairs, then adaptively fine-tunes the codebook within the harmless space. This two-step process iteratively eliminates harmful mappings while preserving generation quality, yielding a safety-enhanced codebook solely through self-supervision.

autoregressive generation · codebook learning · multimodal safety · harmful space · adaptive fine-tuning

Joint Learning of Experiential Rules and Policies for Large Language Model Agents

arXiv cs.AI · Shicheng Ye, Chao Yu · 2026-06-25

The paper introduces Joint Learning of Experiential Rules and Policies for LLM Agents (JERP), a method that jointly updates a rule pool and policy parameters from interaction trajectories. JERP retrieves task-relevant rules during decision-making and uses episode trajectories to both optimize the policy and revise rules via comparison with successful reference trajectories. Experiments on AlfWorld and WebShop demonstrate consistent performance improvements in complex interactive tasks compared to approaches that separate rule maintenance from policy learning.

llm agents · experiential rules · interactive environments · policy learning · rule retrieval

Efficient foundation decoders for fault-tolerant quantum computing

arXiv cs.AI · Ge Yan, Shanchuan Li, Shiyi Xiao, Pengyue Ma · 2026-06-25

The paper introduces neural transfer unification (NTU), a framework for efficient foundation decoders in fault-tolerant quantum computing. NTU leverages algebraic structures shared across scalable code families to enable cross-distance knowledge transfer, reducing training costs for large-scale decoders. The authors implement NTU-Transformer, a transformer-based decoder for planar surface codes and bivariate bicycle codes, demonstrating superior performance: on the [[361,1,19]] surface code, it outperforms correlation-aware matching; on the [[625,1,25]] code, it exceeds standard matching via transfer adaptation; and on the [[72,12,6]] bicycle code, it surpasses Relay-BP in low-error regimes.

foundation decoders · neural transfer unification · surface codes · bivariate bicycle codes · fault-tolerant quantum computing

Heavy-Ball Q-Learning with Residual Weighting Correction

arXiv cs.AI · Donghwan Lee · 2026-06-25

The paper introduces a heavy-ball Q-learning method with residual weighting correction, proving its convergence and identifying conditions for accelerated convergence compared to standard Q-learning. The approach extends to linear function approximation, with derived convergence guarantees. Analysis employs a switched linear system (SLS) framework and joint spectral radius (JSR) techniques, offering novel insights into momentum-based acceleration in Q-learning. Theoretical results demonstrate faster convergence under specific conditions, validated through SLS representation.

heavy-ball · q-learning · convergence · spectral radius · momentum

Application of LLMs to Threat Assessment of Foreign Peacekeeping Missions

arXiv cs.AI · Gerhard Backfried, Christian Schmidt, Diego Pilutti, Michael Suker · 2026-06-25

The study introduces a novel LLM-based workflow for threat assessment in foreign peacekeeping missions, specifically applied to the EU Monitoring Mission in Georgia. The method integrates an interdisciplinary risk model with OSINT media collection, employing LLMs for threat extraction, structured information generation, and relevance refinement. Evaluation demonstrates high agreement (quantification unspecified) between automated LLM outputs and human judgments on threat relevance and mission alignment. Results suggest LLMs effectively support peacekeeping analysts in threat assessment tasks.

large language models · threat assessment · osint · peacekeeping missions · structured information extraction

Data-Free Reservoir Features for Efficient Long-Horizon Cold-Start Continual Learning

arXiv cs.AI · Augustinas Jučas, Yangchen Pan · 2026-06-25

The paper introduces CIRCLE, a class-incremental learning method for cold-start exemplar-free scenarios, using fixed bidirectional two-dimensional reservoir features (BiRC2D) and streaming linear discriminant analysis (SLDA) heads. CIRCLE ensembles multiple random reservoir instantiations and averages softmax outputs, enabling tunable bias-variance tradeoffs without backbone training or replay. Evaluated on CIFAR-100, TinyImageNet, ImageNet-Subset, and ImageNet-1k, CIRCLE matches or outperforms baselines at 10-20 tasks and significantly excels at 50-500 tasks, while training faster than drift-compensation methods. Ablations confirm the contributions of BiRC2D features, SLDA heads, and ensembling.

class-incremental learning · cold-start · reservoir features · streaming lda · bias-variance tradeoff
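
The ensembling step described above (averaging the softmax outputs of several independently instantiated members) can be sketched as follows. The logits are made-up numbers standing in for the outputs of per-reservoir classifier heads; the reservoir features and SLDA heads themselves are not implemented here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(member_logits):
    """Average member softmax outputs; return (predicted class, averaged probabilities)."""
    member_probs = [softmax(logits) for logits in member_logits]
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    averaged = [
        sum(probs[c] for probs in member_probs) / n_members
        for c in range(n_classes)
    ]
    return averaged.index(max(averaged)), averaged


# Hypothetical logits from three ensemble members over three classes.
label, probs = ensemble_predict([
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 0.0],
    [1.0, 1.0, 1.0],
])
```

Averaging probabilities rather than logits keeps each member's contribution bounded, so one overconfident member cannot dominate the vote by scale alone.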

Inherited Circuits, Learned Semantics: How Fine-Tuning Creates Evasion Vulnerabilities Invisible to Standard Evaluation

arXiv cs.AI · Ryan Fetterman · 2026-06-25

Fine-tuning large language models (LLMs) for security classification introduces evasion vulnerabilities invisible to standard evaluation on held-out data. Analyzing Foundation-Sec-8B-Instruct and its base model Llama-3.1-8B-Instruct on PowerShell classification, the author demonstrates that fine-tuning concentrates inherited late-attention circuits into token-level indicator semantics that fail under behavior-preserving transformations like alias substitution and case mutation. A three-tier evasion benchmark reveals Foundation-Sec misses iwr substitution, Invoke-Expression reconstruction, and case-mutated variants that Llama handles correctly. The author proposes pre-deployment monitoring via linear probes and indicator-token sign tests to detect semantic drift, showing fine-tuning can improve accuracy while expanding evasion surfaces.

fine-tuning · evasion vulnerabilities · token-level semantics · late-attention circuits · semantic drift

Beyond Global Divergences: A Local-Mass Perspective on Bayesian Inference

arXiv cs.AI · Hanli Xu, Fengxiang He, Sarat Moka · 2026-06-25

The paper introduces a local-mass perspective for Bayesian inference, addressing limitations of global divergence measures like KL divergence and ELBO. It proposes two mathematical tools: Mass Index to quantify polynomial/logarithmic decay scales of local mass, and regularised extended KL (RE-KL) for set-localized divergence analysis. Theoretical results demonstrate how Bayesian updating alters local mass via likelihood factors and parameter-dependent supports, with inequalities comparing small-ball masses under KL directions. Experiments validate the local behavior characterization. Code is publicly available.

bayesian inference · local-mass behavior · mass index · regularised extended kl · small-ball masses

Parametric Open Source Games

arXiv cs.AI · Aleksandar Todorov, Jesse ten Napel, Alexander Müller · 2026-06-25

The paper introduces parametric open-source games, a continuous framework extending program equilibria by having players select parameter vectors mapped to mixed actions via semantics functions. It proves equilibrium existence, identifies a gradient ascent threshold in symmetric 2×2 games where cooperation emerges from defection, and develops a boundary test for Nash equilibria. The neural semantics extension reveals how cross-player sensitivity ratios govern cooperation. Results demonstrate how parameterized internal representations alter learning dynamics and enable cooperative outcomes through strong open-source coupling across canonical game theory scenarios.

parametric open-source games · program equilibria · gradient ascent threshold · neural semantics · cross-player sensitivity

How to evaluate clustering with ground truth?

arXiv cs.AI · Pasi Fränti · 2026-06-25

The article evaluates clustering performance using external validity indexes when ground truth is available, focusing on set-matching-based measures. It recommends the centroid index (CI) for its intuitive cluster-level interpretability, while suggesting pair-set index (PSI) for unbiased normalized scores across cluster sizes. For point-level granularity, clustering accuracy (ACC) or similar measures are proposed. The analysis provides guidance on selecting appropriate evaluation metrics based on specific clustering assessment needs.

external validity indexes · centroid index · pair-set index · clustering accuracy · set-matching
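
For the point-level measure mentioned above, clustering accuracy (ACC) is the share of points labeled correctly under the best one-to-one mapping from predicted clusters to true classes. The sketch below finds that mapping by brute force over permutations, which is only practical for a handful of clusters; for larger problems an assignment solver such as `scipy.optimize.linear_sum_assignment` is the usual choice. The function name and toy labels are illustrative.

```python
from itertools import permutations


def clustering_accuracy(truth, pred):
    """ACC under the best one-to-one mapping of predicted clusters to true classes."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    pred_labels = sorted(set(pred))
    targets = sorted(set(truth))
    # Pad with None so surplus predicted clusters can stay unmatched.
    targets += [None] * max(0, len(pred_labels) - len(targets))
    best_hits = 0
    for perm in permutations(targets, len(pred_labels)):
        mapping = dict(zip(pred_labels, perm))
        hits = sum(1 for t, p in zip(truth, pred) if mapping[p] == t)
        best_hits = max(best_hits, hits)
    return best_hits / len(truth)


# Cluster ids are arbitrary: only the grouping matters, not the label values.
acc_relabelled = clustering_accuracy([0, 0, 1, 1], [5, 5, 3, 3])
acc_one_error = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 0])
```

The first call shows that a clustering identical to the ground truth up to relabeling is scored as perfect; the second has one misassigned point out of six.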

NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models

arXiv cs.AI · Henry Shaowu Yuchi, Michal Kucer, Benjamin H. Sims, Selma Peterson · 2026-06-25

The authors introduce NuclearQAv2, a structured benchmark for evaluating LLM competence in nuclear engineering through 1,240 question-answer pairs across boolean, numeric, and verbal categories. The benchmark employs a hybrid construction pipeline combining expert-authored questions, existing datasets, and LLM-assisted generation from technical corpora, using structured prompting for scalable evaluation. Results show LLMs perform well on factual questions but struggle with quantitative reasoning and conceptual understanding, demonstrating the need for multi-faceted technical evaluation frameworks.

nuclear engineering · quantitative reasoning · structured prompting · technical corpora · multi-faceted evaluation

The Spec Growth Engine: Spec-Anchored, Code-Coupled, Drift-Enforced Architecture for AI-Assisted Software Development

arXiv cs.AI · Hartwig Grabowski · 2026-06-25

The Spec Growth Engine introduces a lightweight framework addressing two key failure modes in AI-assisted software development: context explosion and silent spec-code drift. It employs a machine-readable spec graph with contract/design separation, a Spine context assembler for scoped reasoning, vertical-slice growth protocol for hardest-first implementation, and drift gates to enforce spec-code alignment. The architecture synthesizes software engineering principles (Parnas information hiding, C4, ADRs) into a code-coupled, machine-enforced system without heavy-weight framework overhead.

spec graph · context assembler · vertical-slice growth · drift gate · reflexion models

State Representation Matters in Deep Reinforcement Learning: Application to Energy Trading

arXiv cs.AI · Jesper Klicks, Sander Vržina, Vincent François-Lavet · 2026-06-25

The study demonstrates that state representation significantly impacts deep reinforcement learning performance in energy trading, using HydroDam's pumped-storage arbitrage environment with a fixed Double DQN agent. By systematically comparing absolute price, relative price history, and forecast features (individually and combined) on Belgian day-ahead prices (2007-2011) and 39 ENTSO-E markets, results show combined features outperform single-feature policies: absolute+relative+forecast achieves 55.6% test accuracy and 47.5% cross-zone median, versus ≤28.8% for single-feature approaches. This establishes state representation as a critical design choice for robust transfer in storage-trading RL.

state representation · deep reinforcement learning · energy trading · double dqn · market features

ShareLock: A Stealthy Multi-Tool Threshold Poisoning Attack Against MCP

arXiv cs.AI · Liwei Liu, Tianzhu Han, Zijian Liu, Zishu Dong · 2026-06-25

ShareLock introduces a stealthy multi-tool threshold poisoning attack against Model Context Protocol (MCP), leveraging Shamir's threshold scheme to distribute malicious instructions across benign-looking tool descriptions. The framework ensures information-theoretic secrecy and robustness against auditing, with a covert reconstruction trigger enabling hidden instruction aggregation. Evaluated across four multi-tool scenarios and mainstream LLMs on two MCP clients, ShareLock achieves >90% attack success while evading detection, outperforming single-tool poisoning strategies.

model context protocol · tool poisoning attack · shamir's threshold scheme · multi-tool poisoning · information-theoretic secrecy

On-board Remote-Sensing Foundation Models for Unsupervised Change Detection of Disaster Events

arXiv cs.AI · S. Ramírez-Gallego · 2026-06-25

The paper introduces an unsupervised change detection method using Remote Sensing Foundation Models (RSFMs) for disaster monitoring, eliminating the need for labeled data. The approach combines ResNet-based RSFMs with an untrained Feature Pyramid Network (FPN) to detect semantic shifts in latent space between orbital passes, enabling high-fidelity feature extraction and anomaly identification. Results demonstrate comparable performance to tailored models while reducing training and development effort, with applications in autonomous satellite tasking and terrain-agnostic generalization.

remote sensing foundation models · unsupervised change detection · feature pyramid network · latent space · anomaly identification

Semantic Early-Stopping for Iterative LLM Agent Loops

arXiv cs.AI · Sahil Shrivastava · 2026-06-25

The paper introduces semantic early-stopping for iterative LLM agent loops, replacing fixed iteration caps with a meaning-aware termination criterion based on cosine distance between draft embeddings and quality metrics. The method provides theoretical guarantees of termination and well-definedness, validated through machine-checked proofs, and proposes a judge-efficient evaluation protocol that reuses trajectories and caches LLM-judge calls. Empirical results on HotpotQA show a 38% reduction in operational tokens with no quality loss (Δ-IS = -0.004, p = 0.81), while revealing that quality-gated variants are cost-ineffective due to judging overhead. An oracle analysis highlights the challenge of selecting optimal rounds (+0.115 IS, p ~ 4e-11) over mere stopping decisions.

semantic early-stopping · llm agent loops · cosine distance · hotpotqa · judge-efficient evaluation
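
As a rough illustration of the stopping rule summarized above, the sketch below ends a refinement loop once the cosine distance between consecutive draft embeddings falls below a threshold. The `refine` and `embed` callables, the threshold `eps`, and the cap `max_rounds` are hypothetical names introduced here; the paper's actual criterion also involves quality metrics and is not reproduced.

```python
import math
from typing import Callable, Sequence, Tuple


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cos(u, v); a zero vector is treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def run_with_semantic_stop(
    draft: str,
    refine: Callable[[str], str],            # hypothetical: one agent revision step
    embed: Callable[[str], Sequence[float]],  # hypothetical: text -> embedding vector
    eps: float = 0.05,                        # assumed threshold, not from the paper
    max_rounds: int = 8,                      # hard cap so the loop always terminates
) -> Tuple[str, int]:
    """Refine `draft` until consecutive drafts stop moving in embedding space.

    Returns the final draft and the number of refinement rounds used.
    """
    prev_vec = embed(draft)
    for round_idx in range(1, max_rounds + 1):
        new_draft = refine(draft)
        new_vec = embed(new_draft)
        moved = cosine_distance(prev_vec, new_vec)
        draft, prev_vec = new_draft, new_vec
        if moved < eps:
            # The latest revision barely changed the meaning: stop early.
            return draft, round_idx
    return draft, max_rounds
```

Keeping `max_rounds` as a fallback means the semantic test can only shorten a run, never lengthen it beyond the fixed cap it replaces.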

Adaptive Utility driven Resource Orchestration for Resilient AI (AURORA-AI)

arXiv cs.AI · Rahul Umesh Mhapsekar, Ilias Cherkaoui, Lizy Abraham, Indrakshi Dey · 2026-06-25

AURORA-AI introduces an adaptive utility-driven resource orchestration framework for resilient AI systems, unifying Hamilton-Jacobi-Bellman feedback control, Lyapunov-based stability monitoring, and fairness-aware composite utility into a closed-loop policy. The framework dynamically redistributes computational budgets across heterogeneous AI models to maximize global utility under disruptions, considering predictive performance, demographic parity, cost, latency, robustness, and interpretability. Evaluated in a stress-rich simulation with demographic bias shocks, concept drift, and black-swan events, AURORA-AI outperforms five controllers, achieving immediate recovery from black-swan events, improving alpha-quantile and super-quantile by 29% and 25%, reducing demographic parity gaps, and increasing Lyapunov-stable operating steps.

hamilton-jacobi-bellman · lyapunov-stability · demographic-parity · black-swan · composite-utility

Inverse Design of Compact and Wideband Inverted Doherty Power Amplifiers Using Deep Learning

arXiv cs.AI · Han Zhou, Haojie Chang, David Widen, Christian Fager · 2026-06-25

The paper introduces a deep learning-based inverse design method for compact, wideband inverted Doherty power amplifiers (PAs), integrating multiple functions into a single structure. A hybrid approach combines convolutional neural networks (CNNs) and genetic algorithms (GAs) to synthesize pixelated Doherty combiner networks. Fabricated using GaN HEMT technology, the prototype achieves 51%-63% peak drain efficiency and 48%-54% 6-dB back-off efficiency across 1.9-2.5 GHz, with 44±0.3 dBm output power. Digital predistortion (DPD) enables adjacent channel leakage ratio (ACLR) below -53.2 dBc.

inverse design · doherty power amplifier · convolutional neural networks · genetic algorithms · digital predistortion

Event-Aware Instructed Assistant for Referring Video Segmentation

arXiv cs.AI · Jinyu Liu, Henghui Ding, Shuting He, Yu-Gang Jiang · 2026-06-25

The paper introduces EVIS, an Event-Aware Video Instructed Segmentation Assistant that addresses limitations in referring video segmentation by decomposing videos into distinct events. The method employs learnable Event Queries to partition videos into text-related segments, enabling hierarchical understanding through event-aware visual-text features. It also proposes Object-Pixel-Hybrid Learning to integrate pixel features with object queries for long-term target tracking. Experiments on 5 benchmarks demonstrate EVIS's strong performance in referring video segmentation tasks.

event query · object-pixel-hybrid learning · referring video segmentation · hierarchical understanding · mllms

Decision-Aligned Evaluation of Uncertainty Quantification

arXiv cs.AI · Annika Schneider, Tommy Rochussen, Joshua Stiller, Vincent Fortuin · 2026-06-25

The authors introduce decision-alignment, a framework for evaluating uncertainty quantification (UQ) metrics based on their utility in downstream decision tasks. They demonstrate that conventional metrics like negative log-likelihood and expected calibration error often misalign with decision utilities or encode pathological priors. To address this, they propose prior-weighted utility metrics, a class of proper scoring rules designed for decision-aligned evaluation. Experiments across benchmarks and case studies show that these metrics consistently align with realized decision utility, unlike traditional approaches. The work critiques current UQ evaluation protocols and offers a principled extension for decision-relevant assessment.

uncertainty quantification · decision-alignment · proper scoring rules · expected calibration error · prior-weighted utility
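
For readers unfamiliar with the conventional metric being critiqued, here is a minimal sketch of expected calibration error with equal-width confidence bins. The function name, the choice of ten bins, and the bin-edge convention are assumptions for illustration; the paper's prior-weighted utility metrics are not shown.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed default; bin count is a free choice
) -> float:
    """Weighted mean gap between accuracy and average confidence per bin."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

Note that the score depends only on the gap between confidence and accuracy, not on what any decision based on the prediction would cost, which is the kind of mismatch the summary describes.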

Where Do Models Find Happiness? Emotion Vectors in Open-Source LLMs

arXiv cs.AI · Sinie van der Ben, Raphaël Baur, Yannick Metz, Mennatallah El-Assady · 2026-06-25

The study extends findings of emotion vector representations from proprietary to open-weight LLMs, demonstrating valence geometry in Apertus-8B-Instruct-2509 and Gemma-4-E4B-it with peak PC1-valence correlations of r=0.76 and r=0.83 respectively. Using contrast vectors across all layers and two model-generated corpora, researchers observed divergent valence encoding patterns: Gemma-4-E4B-it shows early-layer valence encoding that collapses in later layers, while Apertus-8B-Instruct-2509 exhibits mid-depth emergence. Arousal encoding proved corpus-dependent, with stronger PC2-arousal alignment (r≤0.45) in Gemma-generated stories versus Apertus-generated ones (r≤0.21). The work provides open-source tools for reproducible emotion representation analysis.

emotion vectors · valence geometry · contrast vectors · arousal encoding · model-generated corpora

ReaORE: Reasoning-Guided Progressive Open Relation Extraction Empowered by Large Reasoning Models

arXiv cs.AI · Xin Lin, Liang Zhang, Guoqi Ma, Hongyao Tu · 2026-06-25

The paper proposes Reasoning-guided progressive Open Relation Extraction (ReaORE), a framework addressing OpenRE's generalization challenge through coarse-to-fine relation reasoning. ReaORE operates in two stages: relation filtering, which reasons over multiple aspects and employs embedding-based similarity to ensure target relation inclusion; and relation prediction, which uses fine-grained comparative reasoning to distinguish easily confused relations. Experiments on two OpenRE datasets demonstrate ReaORE's superiority over existing baselines.

open relation extraction · relation filtering · relation prediction · comparative reasoning · embedding-based similarity

Auditing Framing-Sensitive Behavioral Instability in Large Language Models for Mental Health Interactions

arXiv cs.AI · Abla Bedoui, Ashley L. Greene, Mohammed Cherkaoui · 2026-06-25

This study investigates framing-sensitive behavioral instability in large language models (LLMs) deployed for mental health interactions, focusing on how semantically similar concerns presented through different contextual framings elicit varied model responses. Using controlled matched prompts across multiple framing conditions and instruction-tuned model families, the authors demonstrate systematic alterations in interpretive response tendencies. Layer-wise probing reveals that behavior-associated information remains decodable throughout transformer depth, with architecture-dependent variation in decoding strength. Activation steering experiments further indicate that framing-associated representational directions can modulate downstream behavioral outcomes, highlighting the importance of robustness to contextual variation in evaluating conversational AI consistency and trustworthiness.

framing-sensitive · behavioral instability · instruction-tuned · transformer depth · activation steering

In-Context Model Predictive Generation: Open-Vocabulary Motion Synthesis from Language Models to Physics

arXiv cs.AI · Xiaomeng Fu, Junfan Lin, Yang Liu, Yaowei Wang · 2026-06-25

The paper introduces In-Context Model Predictive Generation (ICMPG), a framework for open-vocabulary human motion synthesis that combines language-model planning with physics-aware refinement. ICMPG employs a two-module approach: Context-Aware Motion Generation (CAMG) uses an LLM to decompose text commands into motion tokens, while Model Predictive Generation (MPG) evaluates candidates through physical simulation and semantic alignment in a closed-loop MPC-like process. Experiments demonstrate ICMPG outperforms baselines in physical plausibility and semantic fidelity across standard and zero-shot settings, without task-specific retraining.

in-context learning · model predictive control · motion synthesis · open-vocabulary · physical simulation

XMSE-Aware Adaptive Empirical Bayes Estimation

arXiv cs.AI · Minghao Chen, Jiale Zheng · 2026-06-25

The paper introduces an XMSE-aware mixed estimator that adaptively interpolates between maximum likelihood (ML) and empirical Bayes (EB) shrinkage to address kernel misalignment issues. The method derives a closed-form oracle mixing weight minimizing excess mean squared error (XMSE), with theoretical guarantees of consistency and second-order oracle regret rates. Experiments on finite impulse response simulations and Silverbox/Cascaded Tanks benchmarks demonstrate robust performance, retaining regularization benefits when effective while reverting to ML under misspecification.

empirical bayes · excess mean squared error · kernel misalignment · oracle regret · shrinkage estimation

Einstein World Models

arXiv cs.AI · Munachiso Samuel Nwadike, Zangir Iklassov, Ali Mekky, Zayd M. Kawakibi Zuhri · 2026-06-25

The paper proposes Einstein World Models (EWMs), a framework enhancing large language models (LLMs) with visual-temporal reasoning capabilities. EWMs integrate a world-module that generates short visual scene rollouts, enabling LLMs to perform counterfactual reasoning beyond text-based inputs. These visual rollouts serve as inspectable hypotheses within the reasoning trace, extending LLMs' tool-calling abilities to visual thought experiments. The approach aims to address limitations of pure language-based reasoning by incorporating multimodal visualisation mechanisms.

einstein world models · visual-temporal rollouts · counterfactual reasoning · world-module · tool calling

Look-Before-Move: Narrative-Grounded World Visual Attention in Dynamic 3D Story Worlds

arXiv cs.AI · Jiaming Bian, Bingliang Li, Yuehao Wu, Pichao Wang · 2026-06-25

The paper introduces Look-Before-Move, a camera planning framework for dynamic 3D story worlds that separates observation specification from motion execution. The framework employs a Semantic Observation Contract to translate directorial intent into visual constraints, Monte Carlo Viewpoint Search to identify narrative-compliant and geometrically feasible viewpoints, and Semantic Trajectory Grounding to generate continuous, collision-aware camera motion. Evaluated on a dynamic 3D Story World Benchmark comprising 50 stories, 457 scenes, and 1585 shots, the framework demonstrates improvements in subject perception, intent consistency, and trajectory quality over baseline methods, highlighting the necessity of pre-organizing visual attention before motion generation.

camera planning · semantic observation contract · monte carlo viewpoint search · semantic trajectory grounding · dynamic 3d story worlds

Scaling Multi-Reference Image Generation with Dynamic Reward Optimization

arXiv cs.AI · Wenwang Huang, Yusen Fu, Junjie Wang, Mengfei Huang · 2026-06-25

The paper introduces OmniRef-Bench, a benchmark for evaluating multi-reference image generation (MRIG) with complex reference combinations, revealing performance degradation in open-source models as mixed-type references increase. It proposes DyRef, a two-stage framework combining supervised fine-tuning with Difficulty-aware Advantage Reweighting (DAR) and Discriminative Reward Scaling (DRS) to dynamically optimize rewards for complex MRIG. Experiments show DyRef significantly improves performance on OmniRef-Bench and single-image editing tasks, demonstrating generalization capability.

multi-reference image generation · dynamic reward optimization · difficulty-aware advantage reweighting · discriminative reward scaling · benchmark evaluation

Where Do CoT Training Gains Land in LLM based Agents?

arXiv cs.AI · Jingyu Liu, Zhiwen Wang, Yuxin Jing, Huanyu Zhou · 2026-06-25

The study investigates whether chain-of-thought (CoT) training in language-model agents primarily improves reasoning or direct action prediction. By comparing prompt actions (direct prediction) with CoT actions (reasoning-based prediction) across model checkpoints, the authors find that CoT training enhances prompt-action quality without widening CoT's relative advantage. Later checkpoints show reduced action revision during CoT, indicating increased prompt reliance. A proposed intervention—selectively masking action-token supervision—improves out-of-domain generalization.

chain-of-thought · language-model agents · action prediction · out-of-domain generalization · supervision masking

Chai: Agentic Discovery of Cryptographic Misuse Vulnerabilities

arXiv cs.AI · Corban Villa, Sohee Kim, Austin Chu, Alon Shakevsky · 2026-06-25

Chai introduces an AI-driven system for discovering cryptographic misuse vulnerabilities by leveraging differential testing with two key innovations: enhanced precision in library-level flaw detection and repurposing discrepancies as vulnerability leads in downstream applications. The system inverts traditional AI audit paradigms by propagating cataloged library flaws across a cryptographic dependency graph. Evaluated on X.509, JWT, and SAML libraries, Chai identified 100+ vulnerabilities, including a critical SSL library flaw affecting billions of devices and bugs in major web browsers and Linux distributions.

cryptographic misuse · differential testing · dependency graph · x.509 · jwt

A Deterministic Control Plane for LLM Coding Agents

arXiv cs.AI · Padmaraj Madatha · 2026-06-25

The paper introduces Rel(AI)Build, a deterministic control plane for managing LLM coding agent configurations, addressing three identified gaps in current practices: undeclared shared components (10.1% SHA-256 duplicates across repositories), infrequent revisions (58% single-commit), and lack of permission boundaries (<1% declarative). The system enforces supply-chain security via SHA-256 content addressing, HMAC-stamped lockfiles, and hash-chained audit logs; implements tiered permissions and attack-derived blocklists; and ensures traceability through a phase state machine. Conformance tests validate invariant enforcement, though developer outcomes remain unstudied. The approach emphasizes tool-agnostic governance without LLM orchestration.

llm coding agents · deterministic control plane · sha-256 content addressing · hash-chained audit logs · jaccard similarity
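
The three integrity mechanisms named in the summary (SHA-256 content addressing, HMAC-stamped lockfiles, hash-chained audit logs) can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not Rel(AI)Build's implementation: the function names, the lockfile layout, the canonical JSON encoding, and the all-zero genesis hash are all hypothetical.

```python
import hashlib
import hmac
import json
from typing import Dict, List

GENESIS = "0" * 64  # assumed placeholder hash preceding the first audit record


def content_address(data: bytes) -> str:
    """SHA-256 content address: identical bytes always yield the same ID."""
    return hashlib.sha256(data).hexdigest()


def _canonical(entries: Dict[str, str]) -> bytes:
    # Sorted keys and fixed separators make the encoding deterministic.
    return json.dumps(entries, sort_keys=True, separators=(",", ":")).encode()


def stamp_lockfile(entries: Dict[str, str], key: bytes) -> Dict[str, object]:
    """Attach an HMAC-SHA256 stamp computed over the canonical entries."""
    stamp = hmac.new(key, _canonical(entries), hashlib.sha256).hexdigest()
    return {"entries": entries, "hmac": stamp}


def verify_lockfile(lockfile: Dict[str, object], key: bytes) -> bool:
    """Recompute the stamp and compare in constant time."""
    expected = hmac.new(key, _canonical(lockfile["entries"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, str(lockfile["hmac"]))


def append_audit(log: List[Dict[str, str]], event: str) -> None:
    """Append a record whose hash covers the previous record's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    digest = hashlib.sha256((prev + event).encode()).hexdigest()
    log.append({"prev": prev, "event": event, "hash": digest})


def verify_audit(log: List[Dict[str, str]]) -> bool:
    """Walk the chain; any edited, dropped, or reordered record breaks it."""
    prev = GENESIS
    for record in log:
        expected = hashlib.sha256((prev + record["event"]).encode()).hexdigest()
        if record["prev"] != prev or record["hash"] != expected:
            return False
        prev = record["hash"]
    return True
```

Because each audit record commits to its predecessor, changing one event invalidates every later hash, which is what makes tampering detectable without trusting the log's storage.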

Risk-Aware Selective Multimodal Driver Monitoring with Driver-State World Modeling

arXiv cs.AI · Daosheng Qiu, Haozhuang Chi, Hao Su, Shu Long · 2026-06-25

The paper proposes a risk-aware selective inference framework for multimodal driver monitoring in automated vehicles, combining a lightweight RGB-physiological student model with a learned gating mechanism. The system integrates visual observations with window-level heart rate (HR) and electrodermal activity (EDA) signals, achieving 0.7440 Macro-F1 and 0.9099 balanced accuracy (11.39M parameters, 3.08 ms latency) on driver-demand recognition. Cost-aware selective inference reduces unsafe false negatives from 17.37% to ~5% while maintaining deployment-level latency. Driver-state world modeling provides predictive signals, though calibration drift persists in worst-group evaluations.

multimodal driver monitoring · selective inference · rgb-physiological fusion · cost-aware gating · driver-state world modeling

Diagnosing Task Insensitivity in Language Agents

arXiv cs.AI · Jingyu Liu, Xiaopeng Wu, Kehan Chen, Chuan Yu · 2026-06-25

The study identifies task insensitivity as a key limitation in large language model agents, where models persist with learned action patterns despite semantically corrupted or altered task instructions. The authors propose Task-Perturbed NLL Optimization, a contrastive regularization method that explicitly strengthens action dependence on task instructions. Experiments demonstrate improved task sensitivity and out-of-distribution generalization, with attention patterns showing more stable focus on task tokens compared to baseline models.

task insensitivity · out-of-distribution generalization · contrastive regularization · attention drift · language agents

GEOALIGN: Geometric Rollout Curation for Robust LLM Reinforcement Learning

arXiv cs.AI · Ting Zhou, Zhenqing Ling, Yiyang Zhao, Ying Shen · 2026-06-25

The paper introduces GEOALIGN, a geometric rollout curation method for stabilizing LLM reinforcement learning under noisy rewards. The method addresses directional inconsistency—where high-reward rollouts induce conflicting preference directions—by (i) forming within-prompt preference pairs, (ii) learning an online projector to concentrate reward-ordered displacement directions, and (iii) rectifying inconsistent rollouts via angular deviation from a batch consensus. GEOALIGN is forward-pass only and adds minimal overhead. Experiments on dialogue alignment and mathematical reasoning show it outperforms PF-PPO, PAR, PODS, and Seed-GRPO in final performance and training stability.

reinforcement learning · rollout curation · directional inconsistency · online projector · latent directional consensus

Confidence-Aware Tool Orchestration for Robust Video Understanding

arXiv cs.AI · Yangfan He, Yujin Choi, Jaehong Yoon · 2026-06-25

The paper introduces Robust-TO, a confidence-aware tool orchestration framework for robust video understanding that addresses the Blind Trust Problem in video reasoning models. By integrating per-frame trustworthiness through reliability-relevance scoring and a three-tier evidence synthesis process, Robust-TO optimizes correctness, reliability, and efficiency via a confidence-cost GRPO reward. Evaluated on eight tasks across two benchmarks, it achieves 56.4% accuracy on clean inputs (10.6%p above baselines) and maintains 54.3% accuracy under corruption (5.8%p above baselines), demonstrating superior robustness.

video reasoning · reliability-relevance score · confidence-cost grpo · three-tier synthesis · blind trust problem

Learning to Recover Task Experts from a Multi-Task Merged Model

arXiv cs.AI · Jinwook Jung, Taegyu Kim, Kumju Jo, Sungyong Baik · 2026-06-25

The paper introduces Recover Task eXpert (ReTeX), a framework addressing parameter interference in multi-task model merging by modeling perturbations as affine transformations approximated via additive offsets. ReTeX predicts these offsets to recover task-expert performance from a merged checkpoint, employing a router-free task identifier based on SVD subspace signatures for task selection. Results demonstrate 95% recovery of individual-expert performance in vision and NLP domains, with emergent adaptive interpolation for OOD tasks. The method leverages offline-computed subspace signatures for efficient inference.

multi-task merging · parameter interference · affine transformation · svd subspace · router-free

SamaVaani: Auditing and Debiasing Multilingual Clinical ASR for Indian Languages

arXiv cs.AI · Subham Kumar, Prakrithi Shivaprakash, Abhishek Manoharan, Astut Kurariya · 2026-06-25

The study introduces SamaVaani, a unified debiasing technique for multilingual clinical ASR systems in Indian languages. It systematically audits eight state-of-the-art models (IndicWhisper, WhisperLargeV3, Sarvam, GoogleS2T, Gemma3n, OmniLingual, Vaani, Gemini) on psychiatric interview data spanning Kannada, Hindi, and Indian English, revealing performance variability across languages and demographic groups. Fine-tuning Gemma3n and OmniLingual exposes systematic gaps tied to speaker role and gender, which SamaVaani mitigates through fairness-aware optimization while improving overall ASR accuracy.

automatic speech recognition · fairness-aware fine-tuning · multilingual asr · clinical nlp · debiasing

Generative Retrieval via Diffusion Transformer with Metric-Ordered Sequence Training and Hybrid-Policy Preference Optimization

arXiv cs.AI · Chenghao Liu, Yu Zhang, Zhongtao Jiang, Kun Xu · 2026-06-25

The paper introduces MO-DiT+HPPO, a generative retrieval framework for pattern-preserving attribute retrieval, where items must satisfy a target attribute while maintaining fine-grained patterns from seed sets. The method combines metric-ordered sequence training (MO-DiT) to learn attribute-density trajectories across domains and hybrid-policy preference optimization (HPPO) to align query generation with online objectives. Evaluations across four attribute domains show MO-DiT improves intersection metrics over baseline retrievers, with HPPO delivering further gains, validated through ablation studies and metric-predictor analysis.

generative retrieval · diffusion transformer · metric-ordered training · preference optimization · pattern-preserving retrieval

Bridging Vision and Language Concepts through Optimal Transport Semantic Flow

arXiv cs.AI · Chenyang Zhang, Anqi Dong, Guangming Zhu, Nuoye Xiong · 2026-06-25

The Optimal Transport Flow Concept Bottleneck Model (OTF-CBM) improves vision-language alignment by reformulating concept matching as a dynamic transport process. It employs Inverse Optimal Transport to learn data-driven semantic costs and unbalanced optimal transport for flow matching between visual patches and textual concepts, enabling fine-grained localization without ODE integration. Experiments demonstrate OTF-CBM's superior classification accuracy and concept faithfulness, providing a geometric framework for interpretable cross-modal reasoning.

concept bottleneck models · optimal transport · cross-modal alignment · inverse optimal transport · semantic flow

A Pipeline for Generating Longitudinal Synthetic Clinical Notes Using Large Language Models

arXiv cs.AI · William Poulett · 2026-06-25

The paper introduces a pipeline for generating longitudinal synthetic clinical notes using LLMs, addressing privacy concerns in healthcare AI development. The method combines structured patient generation, semi-structured journey simulation, and unstructured note generation via LLMs, with validation steps for consistency and realism. Results include a dataset of 70 synthetic patients with 20-50 notes each, supporting AI tasks like summarization and decision support without real patient data.

synthetic data · clinical notes · large language models · longitudinal records · ai validation

Information-Aware KV Cache Compression for Long Reasoning

arXiv cs.AI · Jushi Kai, Zhuiri Xiao, Alexandra Birch, Zhouhan Lin · 2026-06-25

This paper introduces InfoKV, an entropy-aware KV cache compression framework for LLMs that incorporates information-theoretic signals beyond attention weights. InfoKV measures token importance via Forward Influence, a metric combining token-level predictive uncertainty and layer-wise representation evolution, integrated with attention scores during reasoning. Analysis reveals that tokens with high predictive uncertainty influence distant future contexts more strongly than attention-selected tokens. Experiments on Llama-3.1, Llama-3.2, and DeepSeek-R1 demonstrate InfoKV's consistent superiority over attention-based KV compression methods in long-context reasoning benchmarks across prefilling and decoding stages.

kv cache compression · forward influence · predictive uncertainty · information-theoretic signals · long-context reasoning

TAVR-VLM: Risk-Conditioned Causal Grounding for Hallucination-Resistant Report Generation

arXiv cs.AI · Zhixiang Lu, Xiwei Liu, Sifan Song, Changkai Ji · 2026-06-25

TAVR-VLM introduces a risk-conditioned causal grounding framework for hallucination-resistant report generation in Transcatheter Aortic Valve Replacement (TAVR) planning. The method employs Risk-Conditioned Causal Grounding Attention (R-CGA) to compress multimodal inputs into a causal risk bottleneck, enforcing token-level grounding via a support-projected causal consistency objective. Evaluated on the M³TAVR dataset (1,482 patients), TAVR-VLM achieves 0.896 AUROC, 0.936 CIDEr, and reduces hallucinations to 8.1%, establishing a new state-of-the-art for interpretable surgical AI.

multimodal large language models · causal grounding · risk-conditioned attention · hallucination reduction · surgical ai

Fortress and Gatekeeper: Theorizing Transitive Trust in Third-Party Cybersecurity Risk Governance

arXiv cs.AI · Yijun Chen, Misita Anwar · 2026-06-25

The paper introduces the Fortress and Gatekeeper framework to theorize transitive trust in third-party cybersecurity risk governance, addressing how delegated data processing creates customer-facing accountability. Through a document analysis of the November 2025 OpenAI-Mixpanel security incident, the study examines cybersecurity governance as both a trust relationship and delegation problem. The framework explains governance boundaries via trust and data flows, proposing four propositions on vendor integration, metadata exposure, vendor assurance, and data proliferation. Findings highlight implications for vendor tiering, data classification, contractual design, continuous assurance, and data minimization in cybersecurity governance.

transitive trust · cybersecurity governance · vendor integration · data minimization · continuous assurance

AgentX: Towards Agent-Driven Self-Iteration of Industrial Recommender Systems

arXiv cs.AI · Changxin Lao, Fei Pan, Guozhuang Ma, Han Li · 2026-06-25

AgentX introduces a multi-agent system for autonomous iteration of industrial recommender systems, addressing the execution bottleneck in traditional human-dependent workflows. The system comprises four stages: a Brainstorm Agent for hypothesis generation, a Developing Agent for code implementation, an Evaluation Agent for safe online testing, and a Harness Evolution layer (SGPO) for self-improvement via semantic-gradient updates. This closed-loop architecture enables continuous, large-scale experimentation and learning beyond what manual iteration allows, though no quantitative performance metrics are reported.

multi-agent system · recommender systems · semantic-gradient updates · closed-loop architecture · autonomous iteration

LCAi: Life Cycle Assessment with big data fusion and retrieval-augmented generation-assisted interpretation

arXiv cs.AI · Georgios Tsironis, Juan D. Medrano-Garcia, Gonzalo Guillen-Gosalbez · 2026-06-25

The study introduces a perspective-conditioned retrieval-augmented generation (RAG) framework for structured interpretation in life cycle assessment (LCA), addressing the gap in translating environmental hotspots into actionable strategies. The method involves scenario anchoring, perspective-specific micro-queries with constrained retrieval, and neutral synthesis using ledger-stored outputs, implemented via GPT-5 nano. Demonstrated in a hydrogen-enabled diesel reduction case for Italian apple production, the framework reduces hallucination risks while maintaining cross-domain diversity, enabling evidence-grounded decision-making for scalable technologies.

retrieval-augmented generation · life cycle assessment · perspective fusion · constrained retrieval · evidence-grounded interpretation

Context-Aware Synthesis of Optimization Pipelines for Warehouse Optimization

arXiv cs.AI · Janik Bischoff, Anne Meyer, Uta Mohring, Fabian Dunke · 2026-06-25

The paper introduces Context-Aware Synthesis of Optimization Pipelines (CASOP), a framework for constructing and evaluating warehouse optimization pipelines. CASOP integrates (1) a modular algorithm repository, (2) semantic metadata cards, (3) a subproblem taxonomy, (4) a pipeline synthesizer, and (5) a pipeline evaluator to automate the composition of valid algorithmic pipelines for order fulfillment. Evaluated on 7 benchmark sets spanning 4 problem classes, the framework generated 1,063,044 valid pipelines, demonstrating its utility for researchers and practitioners in warehouse operations. The software is open-sourced.

warehouse optimization · algorithm selection · pipeline synthesis · order fulfillment · context-aware synthesis

The Capability Frontier: Benchmarks Miss 82% of Model Performance

arXiv cs.AI · Bradley Fowler, Ryan Smith, Daniel Thi Graviet, William Myers · 2026-06-25

The paper introduces the Capability Frontier, a Pareto frontier method that quantifies LLM performance gaps by optimizing model and generation selection across heterogeneous tasks. It addresses two biases: single-model evaluation underestimation and noisy sample overestimation. Evaluating 21 LLMs on 16 benchmarks (coding, reasoning, medicine, etc.), the method achieves 54% error reduction from single-model correction and 82% improvement with additional run correction, matching SOTA accuracy at 85% cost reduction. Simulations show oracle routing outperforms single models as query topic entropy increases, indicating substantial underestimation of collective LLM capabilities.

capability frontier · pareto frontier · oracle routing · error rate reduction · query topic entropy
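
To make the gap between single-model evaluation and oracle routing concrete, the toy sketch below compares the best individual model with an oracle that picks, per query, any model that answers correctly. The function names and the per-query boolean result layout are assumptions for illustration; the paper's frontier also accounts for cost and repeated generations, which are not modeled here.

```python
from typing import Dict, List


def best_single_model_accuracy(results: Dict[str, List[bool]]) -> float:
    """Accuracy of the strongest individual model across all queries."""
    return max(sum(hits) / len(hits) for hits in results.values())


def oracle_routing_accuracy(results: Dict[str, List[bool]]) -> float:
    """Accuracy when each query is routed to a model that solves it, if any."""
    n_queries = len(next(iter(results.values())))
    solved = sum(
        any(hits[i] for hits in results.values()) for i in range(n_queries)
    )
    return solved / n_queries


# Hypothetical per-query correctness for two models on four queries.
toy_results = {
    "model_a": [True, True, False, False],
    "model_b": [False, True, True, False],
}
single = best_single_model_accuracy(toy_results)  # 0.5
oracle = oracle_routing_accuracy(toy_results)     # 0.75
```

The difference between the two numbers grows when models succeed on different queries, which is consistent with the summary's observation that oracle routing helps more as query topics diversify.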

Computational Analysis of Heart Rate Variability in Healthy Adults

arXiv cs.AI · María J. Lado, Arturo J. Méndez, Leandro Rodriguez-Liñares, Baltasar García Pérez-Schofield · 2026-06-25

This study evaluates Heart Rate Variability (HRV) indices in 40 healthy adults (20 men, 20 women) using computational signal processing to assess normality, stability, correlation, reproducibility, and consistency. Time-domain and nonlinear indices followed normal distributions with gender differences, while HF-related indices showed redundancy. Comparisons with the Fantasia database revealed <10% error for most indices except SD2 and SDNN in women (>15%). Time-domain and nonlinear indices exhibited low inter-study variability, whereas frequency-domain indices were highly variable. Recommended indices include ApEn, IRRR, HRVi, SD2, MADRR, and rMSSD for accurate HRV representation.

hrv · time-domain · frequency-domain · nonlinear indices · fantasia database
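
Two of the time-domain indices mentioned above have simple textbook definitions, sketched below for a list of RR intervals in milliseconds. The function names are illustrative, and the use of the sample standard deviation for SDNN is an assumption; conventions differ across tools, and the study's own processing pipeline is not reproduced.

```python
import math
import statistics
from typing import Sequence


def sdnn(rr_ms: Sequence[float]) -> float:
    """Standard deviation of RR intervals (sample standard deviation)."""
    if len(rr_ms) < 2:
        raise ValueError("need at least two RR intervals")
    return statistics.stdev(rr_ms)


def rmssd(rr_ms: Sequence[float]) -> float:
    """Root mean square of successive RR-interval differences."""
    if len(rr_ms) < 2:
        raise ValueError("need at least two RR intervals")
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```

SDNN reflects overall variability across the recording, while rMSSD responds to beat-to-beat changes, so the two can diverge on the same series.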

KARLA: Knowledge-base Augmented Retrieval for Language Models

arXiv cs.AI · Francois Crespin, Fabian M. Suchanek, Nils Holzenberger · 2026-06-25

The authors propose KARLA, a method enabling language models to dynamically retrieve factual knowledge from external knowledge bases during token generation. The approach trains models to emit special tokens triggering KB queries, achieving three benefits: (1) factual updates without retraining, (2) provenance tracing for explainability, and (3) comparable accuracy to larger models. Experiments demonstrate improved factual grounding in both short and long-form generation, with knowledge updates achievable through KB edits rather than parameter updates.

knowledge retrieval · language models · factual grounding · explainability · dynamic updating

Memory Depth, Not Memory Access: Selective Parametric Consolidation for Long-Running Language Agents

arXiv cs.AI · Haoliang Han · 2026-06-25

The paper introduces selective parametric consolidation as a solution for memory depth in long-running language agents, distinct from retrieval-based memory access. The author proposes EVAF, a surprise- and valence-gated LoRA mechanism that performs selective writes (2–3 per 200 events) to maintain goal persistence. Evaluations on GPT-2, TinyLlama, and Mistral-7B show EVAF achieves 0.812–0.904 accuracy in goal persistence tasks, outperforming retrieval on shallow factual recall (0.956–0.973). Controls reveal the mechanism factorizes into selection and actuation, with model-dependent write strength and asymmetric coupling under miscalibration.

parametric consolidation · memory depth · lora mechanism · goal persistence · retrieval systems

NaviCache: Test-Time Self-Calibration Caching for Video Generation

arXiv cs.AI · Zheqi Lv, Zhibo Zhu, Jinke Wang, Qi Tian · 2026-06-25

NaviCache introduces a test-time self-calibration caching method for accelerating Video Diffusion Models (VDMs) by reformulating feature evolution as an Inertial Navigation System problem. The proposed dual-state estimation architecture adaptively tracks feature change ratios and latent drift, initialized via an Initial Alignment phase, while a noise schedule and uncertainty-aware Measurement Update mechanism enable error-bounded computation skipping. Experiments on HunyuanVideo, Wan, and Open-Sora series demonstrate superior error judgment and comprehensive performance compared to existing calibration-free methods.

video diffusion models · inertial navigation system · dual-state estimation · computation skipping · test-time calibration

ReasonCLIP-58M: Visually Grounded Commonsense Reasoning Supervision for CLIP

arXiv cs.AI · Sicheng Zhang, Muzammal Naseer, Binzhu Xie, Naufal Suryanto · 2026-06-25

ReasonCLIP-58M introduces a continual pretraining framework for CLIP-style models, enhancing visually grounded commonsense and compositional reasoning without architectural modifications. The method employs a two-stage strategy integrating reasoning signals while preserving descriptive alignment, supported by two datasets (ReasonLite-42M, ReasonPro-16M) and the RCLIP-Bench benchmark. Training a family of ReasonCLIP models improves zero-shot retrieval and reasoning capabilities, demonstrating gains as a drop-in visual encoder for multimodal LLMs like LLaVA-NeXT. Structured reasoning supervision enhances CLIP-style representations, with all resources publicly available.

commonsense reasoning · multimodal systems · zero-shot retrieval · visual encoder · descriptive alignment

MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG

arXiv cs.AI · Inderjeet Singh, Andrés Murillo, Motoyoshi Sekiya, Yuki Unno · 2026-06-25

MIRROR introduces a unified red-teaming framework for multimodal agentic RAG systems, addressing cross-surface vulnerabilities through novelty-constrained memory-guided Monte Carlo tree search. The method employs a deterministic Novelty Gate to prevent prompt copying while leveraging retrieved context for candidate generation. Evaluated on four attack surfaces, MIRROR achieves 76% ASR on image poisoning (vs. 52% for baselines), 97% ASR on orchestrator attacks at 50% query cost, and the lowest cross-surface variance (CV=0.47). The work releases ART-SafeBench with 41,815+ records across surfaces.

red-teaming · retrieval-augmented generation · monte carlo tree search · novelty constraint · attack surface

AIGP: An LLM-Based Framework for Long-Term Value Alignment in E-Commerce Pricing

arXiv cs.AI · Chennan Ma, Yanning Zhang, Siqi Hong, Xiuchong Wang · 2026-06-25

The paper introduces AIGP, an LLM-based framework for long-term value-aligned e-commerce pricing that addresses interpretability and unstructured data limitations in traditional dynamic pricing. The method combines a domain-knowledge-prompted LLM with a Long-Term Value Estimator (LTVE) trained via offline RL, using Direct Preference Optimization (DPO) to align pricing decisions with business objectives. Offline and online A/B tests on Tao Factory show significant improvements: +13.21% GMV, +7.59% ROI, and +8.20% milestone achievement rate over 14 days versus baseline, while providing interpretable pricing rationales.

dynamic pricing · large language model · offline reinforcement learning · direct preference optimization · value alignment

ResilPhase: Plug-and-Play Phase Mapping and Noise-Resilient Macro-Trajectory Extrapolation for Diffusion Acceleration

arXiv cs.AI · Qicheng Zhao, Yu Li, Qi Sun, Zheyu Yan · 2026-06-25

ResilPhase introduces a noise-resilient framework for accelerating diffusion models by reformulating inference as stable macro-trajectory extrapolation in ODE space. It aligns forecasting with the model's Global Drift, eliminating feature inconsistency and memory overhead, and employs a derivative-free barycentric Lagrange extrapolator to bypass derivative instability. A bounded Phase Mapping further regularizes the extrapolation domain, suppressing oscillatory error growth. Experiments on FLUX.1-dev and HunyuanVideo demonstrate state-of-the-art fidelity under aggressive acceleration ratios.

diffusion models · macro-trajectory extrapolation · global drift · barycentric lagrange extrapolator · phase mapping
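
As background on the derivative-free extrapolator mentioned above, the sketch below implements the generic barycentric Lagrange formula in Python and evaluates it outside the range of its nodes. Using past solver timesteps as nodes is an assumption made for illustration; the summary does not specify how ResilPhase chooses nodes or applies its Phase Mapping, and neither is reproduced here.

```python
def barycentric_weights(xs):
    """Weights w_j = 1 / prod_{k != j} (x_j - x_k) for distinct nodes xs."""
    weights = []
    for j, xj in enumerate(xs):
        w = 1.0
        for k, xk in enumerate(xs):
            if k != j:
                w /= (xj - xk)
        weights.append(w)
    return weights


def barycentric_eval(xs, ys, weights, x):
    """Evaluate the interpolating polynomial at x (inside or outside the nodes)."""
    num = 0.0
    den = 0.0
    for xj, yj, wj in zip(xs, ys, weights):
        if x == xj:
            return yj  # exactly on a node: return the stored value
        t = wj / (x - xj)
        num += t * yj
        den += t
    return num / den


# Hypothetical example: three past steps of a quantity that follows y = x**2.
nodes = [0.0, 1.0, 2.0]
values = [0.0, 1.0, 4.0]
w = barycentric_weights(nodes)
extrapolated = barycentric_eval(nodes, values, w, 3.0)
print(extrapolated)
```

The formula needs only stored function values and no derivative estimates, which is the property the summary highlights.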

Anatomy-Guided Residual Motion Diffusion for Controllable 4D Cardiac MRI Synthesis

arXiv cs.AI · Yiheng Cao, Gustavo Andrade-Miranda, Jiatian Zhang, Lingxiao Zhao · 2026-06-25

The authors propose a 4D generative framework for anatomically consistent cardiac MRI synthesis to address data scarcity in medical AI. The method combines a semi-supervised VAE for joint anatomical representation and segmentation with a cascaded latent diffusion model (LDM) that disentangles static anatomy (conditioned on clinical priors) from residual temporal dynamics. Evaluated on cine cardiac MRI, the approach achieves high anatomical controllability (Pearson r > 0.8) and temporal coherence (FVD = 288.08). Synthetic data augmentation improves nnU-Net segmentation performance, increasing left ventricle Dice by 2.8% and reducing boundary error by 5.4 mm versus real-data-only training.

4d medical imaging · latent diffusion model · anatomical consistency · temporal coherence · data augmentation

EGG: An Expert-Guided Agent Framework for Kernel Generation

arXiv cs.AI · Yaochen Han, Ke Fan, Hongxu Jiang, Wanqi Xu · 2026-06-25

The paper introduces EGG, an expert-guided agent framework for generating high-performance GPU kernels, addressing limitations in current LLM-based approaches that struggle with correctness and optimization. EGG decomposes kernel generation into two hierarchical stages: algorithmic structure design for computational foundations and hardware-specific tuning for parallel mapping and memory optimization, guided by expert principles. A stage-aware multi-agent mechanism manages optimization trajectories. Experiments on KernelBench show EGG achieves a 2.13x speedup over PyTorch, outperforming agent-based and RL-based methods.

gpu kernels · llm-based optimization · multi-agent collaboration · parallel mapping · tensor tiling

Robust Onion: Peeling Open Vocab Object Detectors Under Noise

arXiv cs.AI · Priyank Pathak, Mukilan Karuppasamy, Aaditya Baranwal, Shruti Vyas · 2026-06-25

Robust Onion presents a systematic analysis of Open Vocabulary Object Detectors (OV-ODs) under synthetic visual degradations, revealing that robustness degradation stems from similar feature collapse patterns in models with comparable vision backbones. The study employs layer-by-layer probing to demonstrate that pretraining strategies, architectural nuances, and caption supervision minimally affect robustness, which is primarily governed by the image domain. Empirical validation on BDD100K, WiderFace, and VisDRONE shows improved robustness via a lightweight plug-and-play approach (NN & TK0) using 96x fewer parameters than end-to-end training, while clarifying prior robustness observations.

open vocabulary object detection · feature collapse · synthetic degradations · vision backbones · plug-and-play robustness

Scientific discovery as meta-optimization: a combinatorial optimization case study

arXiv cs.AI · Yuan-Hang Zhang, Chesson Sipling, Massimiliano Di Ventra · 2026-06-25

The paper introduces a meta-optimization framework for scientific discovery, formalizing research as simultaneous optimization of both the solution space and the evaluation criteria. The method employs 'consensus objective aggregation,' where large language models (LLMs) generate objective functions combined via correlation-weighted voting, creating a self-correcting and evolving evaluation criterion. Applied to algorithm discovery for 3-SAT problems using digital MemComputing machines, the framework reduces the baseline scaling from ∼N^2.51 to ∼N^1.33, achieving a ∼67× speedup on the largest instances. This problem-agnostic approach aims to enhance scientific discovery across domains.

meta-optimization · consensus objective aggregation · large language models · 3-sat problems · memcomputing machines

Socratic agents for autonomous scientific discovery in high-dimensional physical systems

arXiv cs.AI · Xianrui Zeng, Pengfei Liu, Yirui Zang, Yang Shen · 2026-06-25

The paper introduces AHOIS, a multi-agent AI system for autonomous scientific discovery featuring epistemic autonomy through Socratic interrogation. A physics-critic agent employs causal questioning, constraint checking, and falsification-criteria formulation to evaluate hypotheses. Evaluated on a multimode-fibre optical platform, AHOIS autonomously discovered random-interference encoding (achieving effective rank 56.9 from 16x16 measurements) and task-adaptive sparse-measurement strategies, with classification accuracies of 76.97% (MNIST) and 83.17% (Fashion-MNIST). Ablations demonstrate improved physical consistency, hypothesis completeness, and uncertainty calibration compared to non-Socratic approaches.

epistemic autonomy · socratic interrogation · multimode-fibre optics · random-interference encoding · task-adaptive measurement

A Latent ODE Approach to Spatiotemporal Modeling of Cine Cardiac MRI

arXiv cs.AI · David Brüggemann, Ekaterina Krymova, Firat Özdemir, Jochen von Spiczak · 2026-06-25

The study introduces a latent dynamical model for spatiotemporal analysis of cardiac MRI, combining heart-rate-aware neural ODEs with a graph-based mesh autoencoder to encode continuous 3D+t ventricular motion. The model uses a covariate-conditioned prior for end-diastolic states and a Cox model to predict heart failure risk. Evaluated on 72,386 UK Biobank participants (367 heart failure cases), it improved the stratified C-index from 0.704 to 0.785 when added to pooled cohort equations, outperforming seven conventional cardiac markers. The approach balances reconstruction fidelity, generative realism, and prognostic utility, demonstrating the value of full-cycle motion modeling.

neural ordinary differential equations · graph-based autoencoder · spatiotemporal modeling · cox proportional hazards · cardiac mri

LithoDreamer: A Physics-Informed World Model for Multi-Stage Computational Lithography

arXiv cs.AI · Yuqi Jiang, Yumeng Liu, Zimu Li, Jinyuan Deng · 2026-06-25

LithoDreamer introduces the first physics-informed World Model framework for multi-stage computational lithography, addressing the limitations of existing models in capturing continuous physical processes. The method formulates lithography as a decision-driven evolution system, modeling stage-specific physics-informed latent spaces and employing a contrastive variational optimization paradigm for interpretable intervention optimization. Experiments demonstrate state-of-the-art performance in forward evolution and inverse planning, with the lithography dataset made publicly available.

computational lithography · physics-informed model · world model · contrastive optimization · variational evolution

MLFFM-SegDiff: A Multi-Level Feature Fusion Diffusion Model for Skin Lesion Segmentation

arXiv cs.AI · Jingjun Gu, Chaojie Shen, Yifeng Cao, Wei Zhang · 2026-06-25

MLFFM-SegDiff introduces a multi-level feature fusion diffusion model for skin lesion segmentation, addressing limitations in cross-level feature interaction and boundary detail recovery. The method combines a dual-path U-Net encoder, a Multi-Level Feature Fusion Module (MLFFM) with attention and scale alignment, and a boundary-sensitive loss function to enhance mask reconstruction. Evaluated on ISIC2018, PH2, and HAM10000, it achieves state-of-the-art performance with a 0.8546 Jaccard index and 0.9207 Dice coefficient, outperforming DermoSegDiff, U-Net, and SwinUNETR.

skin lesion segmentation · diffusion model · multi-level feature fusion · dual-path encoder · boundary-sensitive loss
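
The two reported metrics can be illustrated with a short Python sketch that computes the Jaccard index and Dice coefficient for a pair of flattened binary masks. The masks are toy data, and the convention of scoring two empty masks as a perfect match is an assumption rather than something the summary states.

```python
def jaccard_and_dice(pred, target):
    """Return (Jaccard, Dice) for two equal-length binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_area = sum(pred)
    target_area = sum(target)
    union = pred_area + target_area - intersection
    if union == 0:
        return 1.0, 1.0  # assumption: two empty masks count as a perfect match
    jaccard = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return jaccard, dice


# Hypothetical flattened masks, for illustration only.
pred_mask = [1, 1, 1, 0, 0, 0]
true_mask = [0, 1, 1, 1, 0, 0]
jaccard, dice = jaccard_and_dice(pred_mask, true_mask)
print(jaccard, dice)
```

For a single pair of masks the two scores are tied by Dice = 2J / (1 + J). Scores averaged over a dataset need not satisfy that identity exactly, which may be why the reported 0.8546 and 0.9207 are close to it but do not match it precisely.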

Kalman Prototypical Networks for Few-shot Fault Detection in Combined Cycle Gas Turbines

arXiv cs.AI · Mohammed Ayalew Belay, Lucas Ferreira Bernardino, Adil Rasheed, Rubén M. Montañés · 2026-06-25

The paper introduces Kalman Prototypical Networks (KPN), a metric-based few-shot learning framework for fault detection in combined-cycle gas turbines (CCGTs). KPN models class prototypes as latent stochastic states in a dynamic system to reduce episodic variance and improve embedding robustness. Evaluated on synthetic data from a Modelica-based CCGT simulation, KPN outperforms Matching Networks, Relation Networks, and MAML in accuracy and stability across varying support-query configurations, demonstrating improved convergence and generalization for scarce-label scenarios.

few-shot learning · fault detection · kalman filtering · prototypical networks · combined-cycle gas turbines
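
One way to picture "prototypes as latent stochastic states" is a per-dimension Kalman filter that treats each episode's mean support embedding as a noisy observation of the class prototype. The Python sketch below assumes a random-walk state model with diagonal covariance; the function name, noise values, and embeddings are hypothetical, and the paper's actual state-space model may differ.

```python
def kalman_prototype_update(mean, var, observation, process_var, obs_var):
    """One predict/update step per embedding dimension (diagonal covariance)."""
    updated_mean, updated_var = [], []
    for m, p, z in zip(mean, var, observation):
        p_pred = p + process_var              # predict: random-walk prototype
        gain = p_pred / (p_pred + obs_var)    # Kalman gain
        updated_mean.append(m + gain * (z - m))
        updated_var.append((1.0 - gain) * p_pred)
    return updated_mean, updated_var


# Hypothetical 2-D prototype refined over two episodes.
mean, var = [0.0, 0.0], [1.0, 1.0]
for episode_mean in ([2.0, -1.0], [1.0, -1.0]):
    mean, var = kalman_prototype_update(mean, var, episode_mean, 0.0, 1.0)
print(mean, var)
```

Because the gain shrinks as the variance falls, later episodes move the prototype less than early ones, which is one plausible reading of the reduced episodic variance described above.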

Algorithmic Foundations of Deep Learning: Complexity-Theoretic Rates and a Characterization of Universal Approximation

arXiv cs.AI · Anastasis Kratsios, Simone Brugiapaglia, Bum Jun Kim, Gregory Cousins · 2026-06-25

This work establishes a complexity-theoretic framework for analyzing neural network (NN) expressivity, emphasizing algorithmic complexity alongside regularity. By viewing NNs as computational models rather than mere basis functions, it demonstrates that any function computable by a real-valued circuit can be approximated by an NN with explicit depth, width, and parameter bounds derived from circuit properties. The study proves universal approximation for definable NN models with non-affine nonlinearities and parallelization, extending to continuous functions, Besov classes, and holomorphic functions. Notably, it achieves exponential parameter efficiency in shortest-path computations on k-vertex graphs, improving from O(ε^{-c k^2}) to O(log(1/ε)).

universal approximation · algorithmic complexity · real-valued circuit · besov classes · holomorphic functions

Learning Motion Feasibility from Point Clouds in Cluttered Environments

arXiv cs.AI · Sajid Ansari, Arthi, Girish Varma, Antony Thomas · 2026-06-25

The paper introduces GRASPFC-PTX, a point-cloud transformer for learning motion feasibility prediction from raw RGB-D observations in cluttered environments, addressing computational bottlenecks in sampling-based motion planners. The method is evaluated on a novel large-scale benchmark with 2.7M grasp feasibility labels across 88 objects and 190 cluttered scenes, comparing MLP, volumetric-CNN, and point-cloud transformer architectures. GRASPFC-PTX achieves 0.996 AUROC on novel objects while offering faster predictions than traditional planners.

motion feasibility · point-cloud transformer · sampling-based motion planning · grasp feasibility · cluttered environments

Beyond Logical Forms: LLM-Extracted Patterns for Fallacy Classification

arXiv cs.AI · Eleni Papadopulos, Firoj Alam, Giovanni Da San Martino · 2026-06-25

This study introduces a framework for fallacy classification by merging abstract logical structures with context-level linguistic cues, extracted inductively from fallacious examples using Large Language Models (LLMs). The method leverages LLM-extracted patterns to enhance classification accuracy in zero- and one-shot configurations, demonstrating statistically significant improvements over zero-shot baselines. Cross-dataset experiments confirm the framework's generalization capability, validating data-driven pattern extraction as an effective approach for generating logical representations. The results outperform competing methods, establishing the utility of combining logical and linguistic features for nuanced fallacy detection.

logical fallacies · large language models · zero-shot learning · pattern extraction · generalization

Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation

arXiv cs.AI · Dongbin Na · 2026-06-25

The paper introduces LeanGuard, a lightweight safety moderation system that challenges the necessity of chain-of-thought (CoT) reasoning in guardrail models. Through controlled experiments comparing a 395M parameter bidirectional encoder (LeanGuard) against larger reasoning-based decoders, the author demonstrates equivalent moderation accuracy (82.90 F1) with ~100x lower compute (single forward pass, ≤512 tokens). Results show LeanGuard maintains robustness under label noise and outperforms reasoning guards in recall at strict false-positive rates, suggesting current benchmarks may not justify CoT overhead. The work releases open-source models and code.

safety guardrails · chain-of-thought · moderation accuracy · bidirectional encoder · inference efficiency

NebulaExp-8B: An Empirical Post-Training Pipeline via Full-Scale Ablation Research

arXiv cs.AI · Qiaobo Hao, Yangqian Wu, Shunyi Wang, Zhongjian Zhang · 2026-06-25

The paper introduces NebulaExp, a transparent post-training pipeline for Qwen3-8B-base, featuring two model branches: general instruction and complex reasoning. The method involves curating 3.84M SFT samples and 200K RL candidates, with data processing techniques like response distillation and diversity-aware sampling. Results show NebulaExp-Ins-SFT improves benchmark scores from 55.01 to 60.99, while GRPO RL further elevates it to 61.85. The reasoning branch achieves a 75.17 average score. The study also explores OPD methods, with MOPD using 10K samples to outperform RL by 4.18 points.

post-training alignment · supervised fine-tuning · reinforcement learning · response distillation · diversity-aware sampling

SKILL-DISCO: Distilling and Compiling Agent Traces into Reusable Procedural Skills

arXiv cs.AI · Zhongxin Guo, Danrui Qi, Hanwen Gu, Peng Cheng · 2026-06-25

SKILL-DISCO introduces a framework for distilling and compiling agent traces into reusable procedural skills, addressing inefficiencies in repeated task solving. The method represents skills as parameterized finite state machine (PFSM) subgraphs distilled from successful execution traces, which are then compiled into executable, verifiable procedures. Evaluations on ALFWorld and WebArena demonstrate improved success rates (quantitative gains unspecified) and reduced agent turns across benchmarks and model scales, validating the approach's efficacy in structured task scenarios.

procedural skills · fsm distillation · agent traces · parameterized subgraphs · execution structures

Disco-LoRA: Disentangled Composition of Content, Style, and Motion for Multi-concept Video Customization

arXiv cs.AI · Xuancheng Xu, Gengyun Jia, Bing-Kun Bao · 2026-06-25

Disco-LoRA introduces a unified framework for multi-concept video customization, enabling joint control of content, style, and motion in Text-to-Video (T2V) models. The method decomposes the task into Content-Style and Content-Motion sub-tasks, addressed via an Iterative Dual-LoRA Disentanglement Framework that disentangles distinct concepts. A Z-score-based statistical regularization aligns layer-wise weight distributions, preserving trends while minimizing interference between LoRAs. Extensive experiments demonstrate Disco-LoRA's effectiveness in preserving appearance, style, and motion for controllable video generation.

text-to-video · disentanglement · lora · multi-concept · statistical regularization

TGHE: Template-based Graph Homomorphic Encryption for Privacy-Preserving GNN Inference in Edge-Cloud Systems

arXiv cs.AI · Ngoc Bao Anh Le, Thai T. Vu, John Le, Heath Cooper · 2026-06-25

TGHE introduces template-based graph homomorphic encryption for privacy-preserving GNN inference in edge-cloud systems, addressing scalability limitations of graph-centric HE approaches. The method exploits structural convergence in transaction graphs, canonicalizing ego-graphs and packing isomorphic computation trees into CKKS ciphertexts for SIMD-parallel processing, supplemented by Approximate Template Fitting and Topology Collapse optimizers. Evaluation on DGraphFin (3.7M nodes) shows 66.9x speedup over sequential encrypted baselines with <0.002 AUC degradation.

homomorphic encryption · graph neural networks · ckks ciphertexts · simd-parallel processing · ego-graph canonicalization

Zero-Shot Size Transfer for Neural ODEs on Sparse Random Graphs: Graphon Limits and Adjoint Convergence

arXiv cs.AI · Mingsong Yan, Zhida Wang, Sui Tang · 2026-06-25

The paper establishes a theoretical framework for zero-shot size transfer in Graph Neural Differential Equations (GNDEs), enabling training on small graphs and deployment on larger, similar graphs without retraining. By analyzing sparse random graphs sampled from graphons, the authors introduce Graphon Neural Differential Equations (Graphon-NDEs) and adjoint Graphon-NDEs as infinite-node limits, proving well-posedness and trajectory-wise convergence at rate $O((\alpha_n n)^{-1/2})$. They also derive uniform-in-time convergence bounds for adjoint systems and analyze discretize-then-optimize (DTO) and optimize-then-discretize (OTD) training methods, showing asymptotic consistency with explicit Euler discretization. Experimental results on HSBM and tent graphons validate the theoretical convergence rates and demonstrate successful zero-shot transfer across four graphon classes.

graph neural differential equations · graphons · zero-shot transfer · adjoint systems · sparse random graphs

LAMP: Lane-Aligned Motion Primitives for Feasible Trajectory Prediction

arXiv cs.AI · Sangjin Han, Hoseong Jung, Jeongtae Her, Changhyun Choi · 2026-06-25

LAMP (Lane-Aligned Motion Primitives) proposes a topology-aware motion forecasting framework for autonomous driving that ensures multimodal predictions adhere to lane topology. The method uses a VQ-VAE to learn shape-aware motion primitives as discrete intention queries and introduces a feasibility-aware intention selector with lane-topology priors to filter unreachable intentions. Evaluated on Argoverse 2, LAMP matches state-of-the-art accuracy while improving feasibility and diversity metrics.

motion forecasting · vq-vae · motion primitives · lane topology · autonomous driving

CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs

arXiv cs.AI · Shigeng Wang, Chao Li, Yangyuxuan Kang, Jiawei Fan · 2026-06-25

CAT-Q introduces a post-training ternary quantization method for LLMs, combining learnable modulation (LM) and softened ternarization (ST) to optimize weight distribution and threshold sensitivity without costly quantization-aware training. LM adapts pre-trained weights via learnable factors, while ST employs a differentiable transition function for stable convergence. Evaluated on models from 1.7B to 235B parameters, CAT-Q achieves superior accuracy over BitNet variants using only 512 calibration samples (100,000× fewer tokens) and completes quantization in 8–60 hours on 8 A100-80GB GPUs.

ternary quantization · post-training quantization · learnable modulation · softened ternarization · llm compression
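
For context on what ternarization and a "softened" variant can look like, the Python sketch below shows a hard threshold ternarizer, using the 0.7 × mean|w| threshold heuristic from the earlier Ternary Weight Networks work, next to a smooth tanh-based stand-in. Both functions are illustrative assumptions: the summary does not give CAT-Q's actual transition function or its learnable modulation, and neither is reproduced here.

```python
import math


def hard_ternarize(weights, factor=0.7):
    """Map weights to {-1, 0, +1} codes plus one scale (TWN-style heuristic)."""
    delta = factor * sum(abs(w) for w in weights) / len(weights)
    codes = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale, delta


def soft_ternary(w, delta, sharpness):
    """Smooth surrogate that approaches the hard codes as sharpness grows.

    Hypothetical stand-in for a differentiable transition function.
    """
    return 0.5 * (math.tanh(sharpness * (w - delta)) + math.tanh(sharpness * (w + delta)))


# Hypothetical weights, for illustration only.
weights = [0.9, -0.8, 0.05, -0.02, 0.4]
codes, scale, delta = hard_ternarize(weights)
soft_codes = [soft_ternary(w, delta, sharpness=200.0) for w in weights]
print(codes, round(scale, 3), [round(s, 3) for s in soft_codes])
```

A smooth surrogate of this kind has a nonzero gradient near the thresholds, which is the usual motivation for softening a step function during optimization.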

Autoformalization of Agent Instructions into Policy-as-Code

arXiv cs.AI · Adam Mondl, Matthew Maisel, John H. Brock · 2026-06-25

The paper introduces an autoformalization pipeline that converts natural language agent instructions into formally verified policies, addressing the scalability limitations of hand-coded symbolic enforcement and the lack of guarantees in probabilistic approaches. The method employs an LLM-based generator-critic loop to produce policies in the Cedar Policy Language, integrating agent prompts, MCP tool descriptions, and policy documents. On MedAgentBench, the autoformalized policies achieve significantly broader coverage of natural-language specifications compared to prior symbolic enforcement methods.

autoformalization · policy-as-code · llm-based generator-critic · cedar policy language · medagentbench

Agents That Know Too Much: A Data-Centric Survey of Privacy in LLM Agents

arXiv cs.AI · Nada Lahjouji, Ashwin Gerard Colaco · 2026-06-25

This survey organizes privacy risks in large language model (LLM) agents from a data-centric perspective, focusing on the data sources agents interact with rather than attack types. It examines privacy vulnerabilities across retrieval-augmented generation, text-to-SQL interfaces, agent memory, prompt injection, access control, and contextual privacy. The study identifies information-flow control as the sole governance mechanism addressing compositional and cross-session inference leakage, while noting the absence of benchmarks evaluating agents across data surfaces under unified privacy policies. The taxonomy serves as a reference for integrating scattered literature and framing future research.

retrieval-augmented generation · text-to-sql interfaces · agent memory · information-flow control · cross-session inference

Discovering Millions of Interpretable Features with Sparse Autoencoders

arXiv cs.AI · XinYang He, Wei Wang, Bing Zhao, Xuan Ren · 2026-06-25

The authors introduce Qwen3-Instruct SAE, a suite of sparse autoencoders (SAEs) trained on instruction-tuned Qwen3 models (1.7B, 4B, and 8B parameters), targeting residual streams, MLP outputs, and attention outputs. They employ layer-wise SAE training and evaluate reconstruction fidelity versus sparsity trade-offs across model components. Results demonstrate SAEs' utility for interpretable feature discovery and causal intervention, exemplified by steering refusal behavior in instruction-tuned models. The release provides resources for mechanistic analysis of sparse representations in aligned LMs.

sparse autoencoders · residual streams · instruction-tuning · mechanistic interpretability · feature steering

HiLSVA: Design and Evaluation of a Human-in-the-Loop Agentic System for Scientific Visualization

arXiv cs.AI · Kuangshi Ai, Patrick Phuoc Do, Chaoli Wang · 2026-06-25

HiLSVA introduces a human-in-the-loop agentic system for scientific visualization (SciVis) that balances autonomy with human oversight through mixed-initiative workflows. The system combines a plan-first multi-agent architecture with provenance tracking, learn-at-test-time adaptation, and sandboxed execution, enabling natural language interaction and direct visualization manipulation. Evaluation via case studies and a 12-participant user study demonstrates improved task completion, control, and transparency, though with a tradeoff between efficiency and oversight. Results advocate for human-centered design in agentic SciVis.

mixed-initiative interaction · agentic system · scientific visualization · provenance tracking · learn-at-test-time

LLM-based Models for Detecting Emerging Topics in Service Feedback

arXiv cs.AI · Mahsa Tavakoli, Ruth Bankey, Cristián Bravo · 2026-06-25

The paper introduces a novel methodology for detecting emerging service quality topics in multilingual customer feedback by integrating large language models (LLMs), statistical techniques, and human-AI collaboration. The framework employs fine-tuned, quantized LLMs combined with expert oversight to achieve computationally efficient and context-aware analyses. Evaluation through similarity analysis and tax officer assessments showed superior alignment with expert judgments compared to baseline models, while reducing LLM fabrication through human-in-the-loop validation.

large language models · human-ai collaboration · multilingual feedback analysis · quantized llms · emerging topic detection

Content-Based Smart E-Mail Dispatcher Using Large Language Models

arXiv cs.AI · K. Paramesha, K R Sriram, Sujan Shetty, Shamanth Kishore · 2026-06-25

The paper proposes an automated email dispatching system using LLM-based agents to route academic emails to relevant WhatsApp groups without labeled training data. The method employs structured prompts with email content, instructions, and context for in-context learning, enabling content-based routing decisions. The system reduces manual processing errors and cognitive load while improving information flow in engineering colleges, though quantitative performance metrics are not provided.

llm agents · email routing · in-context learning · content-based dispatch · cognitive load reduction

SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference

arXiv cs.AI · Haoqian Meng, Yilun Luo, Yafei Zhao, Wenyuan Liu · 2026-06-25

SharQ introduces a training-free inference method combining FP4 quantization and N:M sparsity for LLM activations through sparse-dense decomposition. It generates input-adaptive masks to isolate outlier-dominated sparse components (quantized to FP4) while compensating residuals via dense FP4 GEMM, sharing weights between paths with scale-specific views. Evaluated on Llama-3.1-8B, Qwen2.5-7B, and Qwen3 variants, SharQ recovers 43–63% of NVFP4-to-FP16 accuracy gaps and achieves 2.2–2.4× latency reduction over FP16 on RTX 5090, with 1.58× speedup in video generation when combined with SageAttention.

fp4 quantization · n:m sparsity · sparse-dense decomposition · gemm acceleration · activation compression
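
The sparse-dense decomposition can be pictured with a small Python sketch that splits an activation vector into an N:M sparse part (the N largest-magnitude entries in every group of M) and a dense residual. Magnitude-based selection is an assumption made for illustration, and FP4 quantization of the two parts is omitted; the summary does not specify how SharQ builds its input-adaptive masks.

```python
def nm_decompose(x, n=2, m=4):
    """Split x into an N:M sparse part and a dense residual with x = sparse + residual."""
    if len(x) % m != 0:
        raise ValueError("length of x must be a multiple of m")
    sparse, residual = [], []
    for start in range(0, len(x), m):
        group = x[start:start + m]
        # Indices of the n largest-magnitude entries in this group.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        for i, value in enumerate(group):
            sparse.append(value if i in keep else 0.0)
            residual.append(0.0 if i in keep else value)
    return sparse, residual


# Hypothetical activations with two outliers in the first group.
activations = [0.1, -3.0, 0.2, 5.0, 1.0, 0.0, -0.5, 0.3]
sparse_part, dense_residual = nm_decompose(activations, n=2, m=4)
print(sparse_part)
print(dense_residual)
```

Once the outliers sit in the sparse part, the residual has a much smaller dynamic range, which is the usual reason such a split makes low-bit quantization less lossy.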

A Multi-Level Validation and Traceability Framework for AI-Generated Telescope Scheduling Decisions

arXiv cs.AI · Hengchu Xiao, Chuanjun Wang · 2026-06-25

The authors propose a multi-level validation and traceability framework for AI-generated telescope scheduling decisions, addressing inconsistent data references, reasoning errors, and non-executable outputs. The framework integrates data reference validation, logical consistency checks, and constraint verification, while representing decisions as interconnected reasoning steps via atomic reasoning units. Experiments demonstrate improved executability and reliability, with feedback correction enhancing error repair in complex scenarios, outperforming pure AI methods in reliability without sacrificing flexibility.

telescope scheduling · reasoning traceability · constraint verification · atomic reasoning units · feedback correction

EvoOptiGraph: Weakness-Driven Coevolution via Graph-Based Structural Generation for Optimization Modeling

arXiv cs.AI · Qingcan Kang, Mingyang Liu, Xiaojin Fu, Shixiong Kai · 2026-06-25

EvoOptiGraph introduces a weakness-driven coevolution framework for improving LLM-based optimization modeling. The method represents mixed-integer linear programs (MILPs) as attributed bipartite graphs, applies validity-preserving evolutionary operators for structural diversity, and employs a two-stage training pipeline combining supervised fine-tuning with reinforcement learning guided by verifiable rewards. Results across six datasets demonstrate superior accuracy, executability, and generalization compared to generalist models, agentic methods, and specialized baselines, validating the effectiveness of data-model coevolution for optimization tasks.

mixed-integer linear program · attributed bipartite graph · validity-preserving operators · reinforcement learning with verifiable rewards · data-model coevolution

IDEA: Insensitive to Dynamics Mismatch via Effect Alignment for Sim-to-Real Transfer in Multi-Agent Control

arXiv cs.AI · Chenlong Liu, Zhuohui Zhang, Xinyan Chen, Zhipeng Wang · 2026-06-25

The paper introduces IDEA, a sim-to-real transfer method for multi-agent control that addresses dynamics mismatch via effect alignment. By combining random environmental structure with discrete semantic actions through closed-loop control, the method elevates policy learning to a semantic abstraction level. An action synchronization mechanism further mitigates inter-agent timing mismatches, enhancing temporal consistency. Experiments on four multi-agent navigation tasks show improved training efficiency and higher real-world success rates compared to mainstream transfer methods, demonstrating robustness under dynamics mismatch.

sim-to-real · multi-agent control · dynamics mismatch · effect alignment · action synchronization

scBench-Long: Verifiable Benchmarking of Long-Horizon Single-Cell Biology

arXiv cs.AI · Ian Diks, Zhen Yang, Arjun Banerjee, Tim Proctor · 2026-06-25

The authors introduce scBench-Long, a benchmark for evaluating long-horizon single-cell biology tasks where agents must derive scientific conclusions from raw or near-raw data without predefined methods. The benchmark comprises 21 evaluations across diverse biological contexts, including melanoma CD8 T-cell reactivity, KRAS-driven lung tumor aging, and lethal COVID-19 lung pathology, using datasets such as paired scRNA/TCR sequencing and cross-species transcriptomics. Claims are validated through deterministic grading and trajectory rubrics. Across 1,068 trajectories, the top-performing model-task pair achieved a 25.4% success rate (16/63 runs). scBench-Long assesses agents' ability to move from local analyses to complex scientific claims supported by single-cell data.

single-cell biology · deterministic grading · trajectory rubrics · cross-species transcriptomics · paired scrna/tcr sequencing

Explainable Ensemble-Based Machine Learning Models for Detecting the Presence of Cirrhosis in Hepatitis C Patients

arXiv cs.AI · Abrar Alotaibi, Lujain Alnajrani, Nawal Alsheikh, Alhatoon Alanazy · 2026-06-25

This work introduces an explainable ensemble-based machine learning approach for detecting cirrhosis in hepatitis C patients, addressing a previously unstudied application domain. The study evaluates four models (Random Forest, Gradient Boosting Machine, Extreme Gradient Boosting, Extra Trees) on a dataset of 2038 Egyptian patients with 28 clinical attributes. The Extra Trees model achieved optimal performance with 96.92% accuracy, 94.00% recall, 99.81% precision, and 96% AUROC using only 16 selected features.

ensemble learning · feature selection · clinical decision support · hepatitis c · cirrhosis detection

SpaceRipple: Lightweight Semantic Delivery for Mission-Oriented LEO Earth Observation Satellite Networks

arXiv cs.AI · Ziyi Yang, Hao Yuan, Yunxiang Yi, Wenbo Wang · 2026-06-25

SpaceRipple introduces a lightweight framework for mission-oriented semantic delivery in Earth observation satellite networks, prioritizing task-relevant information over raw-image transmission. The method employs adaptive compression and metadata generation on sensing satellites, coupled with edge computing for representation restoration and semantic extraction, coordinated through a collaborative pipeline. A compression-aware mixture-of-experts (MoE) module enhances robustness to degraded inputs. Experiments demonstrate improved reconstruction quality (PSNR +3.2dB), semantic detection accuracy (+9.5% mAP), and 68% bandwidth reduction compared to baseline approaches.

semantic delivery · adaptive compression · edge computing · mixture-of-experts · earth observation

Perception, Verdict, and Evolution: Hindsight-Driven Self-Refining Forensics Agent for AI-Generated Image Detection

arXiv cs.AI · Yangjun Wu, Keyu Yan, Yu Liu, Jingren Zhou · 2026-06-25

ForeAgent introduces a self-evolving forensic framework for AI-generated image detection, addressing limitations in fine-grained artifact sensitivity and static supervision. The framework employs a Perception-Verdict architecture that integrates semantic, spatial, and frequency-domain features, utilizing a Multimodal Large Language Model (MLLM) for verdict fusion. A Hindsight-Driven Self-Refining strategy enables iterative improvement through Sampling-Reflection-Evolution, generating high-quality reasoning traces via failure case analysis and dual-expert quality gating. ForeAgent achieves state-of-the-art performance with 82.18% accuracy on the Chameleon benchmark (+16.41% over AIDE) and 93.3% mean accuracy on AIGCDetect-Benchmark across 16 generators, outperforming GPT-5 and GPT-5-mini in consistency and causal reasoning.

forensic framework · multimodal large language model · hindsight-driven refining · perception-verdict architecture · dual-expert gating

PMDformer: Patch-Mean Decoupling Information Transformer for Long-term Forecasting

arXiv cs.AI · Ao Hu, Liangjian Wen, Jiang Duan, Yong Dai · 2026-06-25

The paper proposes PMDformer, a Transformer-based model for long-term time series forecasting that addresses scale differences in patch-based approaches via patch-mean decoupling (PMD). PMD separates trend from residual shape information by subtracting each patch's mean, so that attention compares patches by shape rather than by level. The model also introduces Trend Restoration Attention (TRA) to reintegrate the decoupled trends and Proximal Variable Attention (PVA) to focus on recent cross-variable correlations. Experiments show PMDformer outperforms state-of-the-art methods in stability and accuracy across multiple benchmarks.

long-term forecasting · patch-mean decoupling · trend restoration attention · proximal variable attention · shape similarity
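The subtraction step behind patch-mean decoupling can be illustrated with a small NumPy sketch. This is only an illustration of the idea as summarized above, assuming non-overlapping patches of equal length; the function name `patch_mean_decouple` is hypothetical and the attention modules (TRA, PVA) are not modeled.

```python
import numpy as np

def patch_mean_decouple(series, patch_len):
    """Split a 1-D series into non-overlapping patches and separate each
    patch into its mean (trend level) and mean-free residual (shape)."""
    series = np.asarray(series, dtype=float)
    assert series.size % patch_len == 0, "series length must be a multiple of patch_len"
    patches = series.reshape(-1, patch_len)
    means = patches.mean(axis=1, keepdims=True)
    shapes = patches - means
    return means, shapes

# Two patches with the same shape but very different levels.
means, shapes = patch_mean_decouple([1, 2, 3, 101, 102, 103], patch_len=3)
print(means.ravel())   # per-patch levels
print(shapes)          # identical rows once the level is removed
```

After decoupling, the two patches have identical residuals, which is the kind of shape similarity a raw dot-product comparison would miss because of the level difference.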

CascadeFormer: Depth-Tapered Transformers Motivated by Gradient Fan-in Asymmetry

arXiv cs.AI · Huzama Ahmad, Cao Viet Hai Nam, Se-Young Yun · 2026-06-25

The paper introduces CascadeFormer and CascadeFlow Pruning, two efficiency methods for Transformers motivated by Gradient Fan-in Asymmetry (GFA). CascadeFormer tapers model width with depth to match uneven information flow, reducing latency by 8.6% and increasing throughput by 9.4% at comparable perplexity. CascadeFlow Pruning removes layers using accumulated gradients, outperforming standard heuristics. The authors theoretically and empirically demonstrate GFA's role in gradient decay across layers, showing its correlation with layer importance in models up to 1.2B parameters. Interventions reveal structural (not magnitude) bottlenecks, with parameter-shared repetition restoring late-layer value.

transformers · gradient fan-in asymmetry · layer pruning · model efficiency · depth-tapered architecture

From Hallucination to Grounding: Diagnosing Visual Spatial Intelligence via CRISP

arXiv cs.AI · Zhixing Li, Yinan Yu · 2026-06-25

The authors introduce CRISP, a diagnostic evaluation paradigm for assessing visual spatial intelligence in Vision-Language Models (VLMs) by measuring consistency between implicit perception and explicit reasoning. CRISP employs metric 3D Scene Graphs and an oracle intervention protocol to disentangle reasoning capabilities from perceptual bottlenecks, revealing a systematic perception-reasoning disconnect. Proprietary models exhibit robust reasoning but suffer from inaccurate metric estimation and underutilized structural representations, while open-source models are limited by deficient multi-hop compositional reasoning. CRISP shifts evaluation focus from language priors to genuine multimodal alignment, offering a rigorous roadmap for improving VLMs beyond end-to-end post-training.

visual spatial intelligence · metric 3d scene graphs · oracle intervention protocol · multi-hop compositional reasoning · multimodal alignment

VoiceTTA: Enhancing Zero-Shot Text-to-Speech via Reinforcement Learning-Based Test-Time Adaptation

arXiv cs.AI · Tianxin Xie, Chenxing Li, Dong Yu, Li Liu · 2026-06-25

VoiceTTA enhances zero-shot text-to-speech (TTS) for uncommon speaking styles via reinforcement learning-based test-time adaptation (TTA). The method optimizes learnable prefixes in a flow matching-based model using group relative policy optimization (GRPO), with style rewards based on F0 and energy coefficient-of-variation differences, speaker similarity, and intelligibility (Whisper-derived WER). Experiments show significant improvements over state-of-the-art baselines on uncommon speech prompts.

zero-shot tts · test-time adaptation · flow matching · group relative policy optimization · coefficient-of-variation
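Two quantities named in the VoiceTTA summary are easy to show in isolation: the coefficient of variation of a contour, and a GRPO-style group-relative normalization of rewards. The sketch below is an illustration under assumptions, not the paper's reward: the way the style reward combines the two contours (`style_reward`) is hypothetical, and the normalization follows the commonly described form (reward minus group mean, divided by group standard deviation).

```python
import numpy as np

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population std)."""
    v = np.asarray(values, dtype=float)
    return v.std() / v.mean()

def style_reward(prompt_f0, generated_f0):
    """Hypothetical reward: closer F0 variability to the prompt scores higher."""
    return -abs(coefficient_of_variation(prompt_f0)
                - coefficient_of_variation(generated_f0))

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, GRPO-style."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

prompt = [90.0, 110.0]                       # toy F0 values in Hz
candidates = [[100.0, 100.0], [95.0, 105.0], [90.0, 110.0]]
rewards = [style_reward(prompt, c) for c in candidates]
print(group_relative_advantages(rewards))    # best-matching candidate gets the largest value
```

The third candidate reproduces the prompt's variability exactly, so it receives the highest advantage in this toy group.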

DiARC: Distinguishing Positive and Negative Samples Helps Improving ARC-like Reasoning Ability of Large Language Models

arXiv cs.AI · Yuxuan Yang, Feiyang Li, Yile Wang · 2026-06-25

The paper introduces DiARC, a method to enhance large language models' reasoning on Abstraction and Reasoning Corpus (ARC)-like tasks by distinguishing positive and negative samples. Drawing on preference alignment, DiARC constructs negative samples through output-level visual transformations, DSL-level rule inversion, and task-specific rule editing, providing near-miss alternatives while preserving demonstrations. Experiments across multiple ARC-like benchmarks demonstrate consistent performance improvements over baseline models. The code is publicly available.

abstraction and reasoning corpus · preference alignment · negative samples · rule inversion · visual transformations

The Inattentional Gap: Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report

arXiv cs.AI · Kwan Soo Shin · 2026-06-25

The study identifies an 'Inattentional Gap' phenomenon where task-conditioned language and vision models systematically omit safety-critical signals they can otherwise detect, analogous to human inattentional blindness but arising from different mechanisms. Through experiments in radiology text scenarios, driving contexts, and chest-radiograph vision tasks, the authors demonstrate this suppression effect across all tested models, showing it persists across model scales and reasoning architectures, with variation primarily by model family rather than size. Results reveal these models report the same safety signals at significantly higher rates when unconstrained, suggesting benchmark evaluations may overestimate real-world safety by failing to account for unspecified hazards.

inattentional gap · task-conditioned models · safety-critical signals · benchmark decoupling · model suppression

Radical AI Interpretability

arXiv cs.AI · Daniel A. Herrmann, Benjamin A. Levinstein · 2026-06-25

The paper develops a framework for interpreting AI systems as agents by combining radical interpretation from philosophy with mechanistic interpretability. It addresses the core challenge of inferring beliefs, desires, and meanings from computational facts, emphasizing the holistic nature of these attributions due to mutual constraints between propositional structure and attitudes. The proposed criteria for representationalist and interpretationist approaches are linked to actionable tests for current interpretability methods, highlighting the necessity of joint measurement to avoid distortions, especially in systems with divergent conceptual frameworks.

mechanistic interpretability · radical interpretation · propositional structure · attribution holism · ai safety

Multipath Adaptive Gated Bottleneck Latent ODE with Raman Data Fusion for Cell Culture Process Forecasting

arXiv cs.AI · Johnny Peng, Thanh Tung Khuat, Ellen Otte, Katarzyna Musial · 2026-06-25

The paper proposes Multipath Adaptive Gated Bottleneck Latent ODE (GB-Latent ODE) with Raman data fusion for forecasting cell culture processes. The method combines a gated bottleneck architecture for sparse input compression with multi-path just-in-time fine-tuning (MP-JIT-FT) that clusters historical trajectories into regimes for diverse forecasts. Raman spectroscopy data is fused via a soft sensor to enrich sparse measurements. Evaluated on 38 fed-batch bioreactor runs across 14 conditions, the approach outperforms a global Latent ODE baseline on 8/9 target variables, with multi-path forecasting excelling in divergent scenarios and Raman fusion aiding representative trajectories.

latent ode · raman fusion · multi-path forecasting · bioprocess modeling · soft sensor

Boundary-Aware Context Grounding for A Low-Channel EEG Agent

arXiv cs.AI · Zhiyuan Xu, Yueqing Dai, Junling Li, Junwen Luo · 2026-06-25

The paper introduces NeuraDock Agent, an open-source architecture that separates deterministic EEG processing from a hardware-aware language layer to prevent unsupported interpretations in low-channel EEG analysis. The system employs a numerical engine for quality-controlled spectral workflows and a language model restricted to versioned context packs describing hardware constraints and scientific limits. Evaluation showed deterministic outputs (12 recordings, 10 repetitions), robust failure handling, and improved boundary awareness in 288 test cases across context ablations and two LLMs, though clinical validity remains unestablished.

low-channel eeg · context grounding · spectral workflows · boundary awareness · deterministic processing

NeuraDock Visual Cognitive Load Agent Tutorial: A Quality-Gated Open-Source EEG Workflow for Alpha Dynamics and Real-Time Applications

arXiv cs.AI · Zhiyuan Xu, Yueqing Dai, Junling Li, Junwen Luo · 2026-06-25

The NeuraDock Visual Cognitive Load Agent tutorial presents an open-source EEG workflow for real-time analysis of Alpha dynamics and visual cognitive load. The method features a quality-gated pipeline integrating EEG preprocessing, quality control, Alpha feature extraction, and a web API, bridging offline analysis and real-time applications. Validation on 18 recordings demonstrated task-related posterior Alpha suppression in 7/10 within-subject comparisons, preliminary within-subject repeatability, and benchmarked API latency. The tutorial enables reproducible deployment for researchers developing real-time cognitive-load prototypes.

eeg preprocessing · alpha dynamics · cognitive load · real-time api · quality gating

Temporal Validity in Retrieval Memory: Eliminating Stale-Fact Errors for AI Agents over Evolving Knowledge

arXiv cs.AI · Neeraj Yadav · 2026-06-25

MemStrata adds temporal validity to retrieval-augmented generation (RAG) to address stale-fact errors when knowledge evolves. It applies a deterministic supersession rule over (subject, relation, object) triples to retire outdated facts in a bi-temporal ledger, with no similarity thresholds or LLM calls. Evaluated on six benchmarks using a 7B model, MemStrata matches RAG on static knowledge (0.95-1.00 accuracy) and significantly outperforms it on evolving knowledge, where RAG reaches only 0.20-0.47 accuracy. It reduces stale-fact error rates from 15-40% for RAG to ~0%, with a retrieval latency of ~2.1s versus 16-18s for LLM-reranking baselines.

retrieval-augmented generation · temporal validity · stale-fact errors · bi-temporal ledger · supersession rule
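A deterministic supersession rule of the kind the MemStrata summary describes might look like the sketch below. It is a guess at the general shape, not the paper's design: it assumes every relation is single-valued, so a new object for the same (subject, relation) pair retires the old one, and it tracks two time axes (`valid_from` for the world, `recorded_at` for the ledger). All class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    valid_from: int                      # when the fact became true in the world
    recorded_at: int                     # when the ledger learned about it
    superseded_at: Optional[int] = None  # None means the fact is still current

class Ledger:
    def __init__(self) -> None:
        self.facts: List[Fact] = []

    def assert_fact(self, subject, relation, obj, valid_from, recorded_at):
        """Record a fact; retire any current fact with the same
        (subject, relation) but a different object. No similarity scores
        or model calls are involved, only exact key comparison."""
        for f in self.facts:
            if (f.subject, f.relation) == (subject, relation) and f.superseded_at is None:
                if f.obj == obj:
                    return               # already current, nothing to do
                f.superseded_at = recorded_at
        self.facts.append(Fact(subject, relation, obj, valid_from, recorded_at))

    def current(self, subject, relation):
        """Objects that have not been superseded; only these would be retrieved."""
        return [f.obj for f in self.facts
                if (f.subject, f.relation) == (subject, relation)
                and f.superseded_at is None]

ledger = Ledger()
ledger.assert_fact("Acme", "ceo", "Alice", valid_from=2020, recorded_at=2020)
ledger.assert_fact("Acme", "ceo", "Bob", valid_from=2024, recorded_at=2024)
print(ledger.current("Acme", "ceo"))
```

The retired fact stays in the ledger with its `superseded_at` stamp, so history remains queryable while retrieval only sees current facts.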

Humans Disengage, Reasoning Models Persist: Separating Difficulty Registration from Deliberation Allocation

arXiv cs.AI · Han-yu Wang · 2026-06-25

The study reveals a dissociation between humans and large reasoning models (LRMs) in deliberation patterns: while both show cross-item alignment of response time with difficulty (registration), they exhibit opposite within-item allocation behaviors. Analyzing a matched human-LRM corpus, researchers found LRMs expend more tokens on incorrect trials (Cohen's d = 1.47-3.13 on H-ARC), whereas humans spend less time on failures. This divergence persists under item fixed effects and across datasets, absent in non-thinking baselines. The findings suggest humans disengage from perceived failures while LRMs persist due to uncertainty-driven chain growth, despite both policies producing similar cross-item difficulty correlations.

deliberation allocation · difficulty registration · large reasoning models · metareasoning · response time

Clinical Harness for Governable Medical AI Skill Ecosystems

arXiv cs.AI · Tianhan Xu, Lei Bao, Yongxiang Wang · 2026-06-25

The authors propose Clinical Harness, a runtime governance architecture for medical AI that enables accountable, persistent clinical capabilities. The system integrates knowledge-driven, data-driven, and physics-enhanced AI skills through registration, orchestration, guarding, and monitoring mechanisms. Demonstrated on osteoporosis care, the approach supports lifecycle management of AI-enabled clinical functions under continuous governance constraints.

runtime governance · clinical ai skills · knowledge-driven ai · physics-enhanced models · lifecycle care

Evaluation-Strategy Gap in Fault Diagnosis of Deep Learning Programs

arXiv cs.AI · Sigma Jahan · 2026-06-25

The study identifies a 0.190 balanced accuracy gap in deep learning fault diagnosis techniques between within-program evaluation and cross-program settings, using DynFault, a corpus of 5,542 fault-injected training traces from 38 real-world DL programs. Analysis reveals program-level structural features as the primary cause, with curvature features proving effective for instability detection on unseen programs, while optimizer and activation features only benefit seen programs. The findings highlight the need for evaluation strategies that account for deployment scenarios involving novel programs.

fault diagnosis · deep learning · cross-program evaluation · curvature features · optimizer features

An Empirical Study of LLM-Generated Specifications for VeriFast

arXiv cs.AI · Wen Fan, Minh Tran, Sanya Dod, Xin Hu · 2026-06-25

This paper evaluates LLM-generated specifications for VeriFast, a separation logic verifier, across 303 C functions. The study tests eight prompting approaches, ten LLMs, and three input types, analyzing functional behavior preservation, verifiability, and errors. Results show high functional behavior preservation (over 91% for code and specs) but modest verification success (31.4%), with Gemini 2.5 Pro and formal contracts performing best. Most errors (94%) stem from LLMs' lack of domain-specific knowledge in separation logic verifiers.

separation logic · static verification · llm-generated specifications · verifast · formal contracts

Speaking Numbers to LLMs: Multi-Wavelet Number Embeddings for Time Series Forecasting

arXiv cs.AI · Defu Cao, Zijie Lei, Muyan Weng, Jiao Sun · 2026-06-25

The paper introduces TempoWave, a multi-wavelet number embedding method that improves LLM-based time series forecasting by addressing the misalignment between discrete tokenization and continuous numerical values. TempoWave maps scalar observations into digit-wise embeddings using multi-wavelet, multi-scale coefficients, preserving numerical ordering and robustness to normalization. Evaluated on five context-enriched forecasting benchmarks, TempoWave outperforms standard numeric tokenization and alternative embeddings, achieving state-of-the-art results. The method highlights the importance of principled multi-resolution embeddings for coupling LLMs' contextual reasoning with precise forecasting.

multi-wavelet embeddings · time series forecasting · llm tokenization · numerical ordering · context-aware forecasting

Adaptive Evaluation of Out-of-Band Defenses Against Prompt Injection in LLM Agents

arXiv cs.AI · Praneeth Narisetty, Shiva Nagendra Babu Kore, Uday Kumar Reddy Kattamanchi, Jayaram Kumarapu · 2026-06-25

The study organizes out-of-band defenses against indirect prompt injection in LLM agents as instances of classical integrity protection, reference monitoring, and least privilege, providing a structured comparison of their coverage. It highlights the limitation of static benchmarks in validating these defenses and proposes a threat model and protocol for adaptive evaluation. Testing Progent on AgentDojo with an open-weight agent (Qwen2.5-7B) showed a sixfold reduction in mean attack success (25.8% to 4.2%), with a hand-crafted adaptive attack yielding 2.6% success. The results suggest deterministic out-of-band enforcement may be more resilient to adaptive attacks than in-band detection.

integrity protection · reference monitoring · least privilege · adaptive evaluation · prompt injection

Retrieval-Warmed Energy-Based Reasoning: A Five-Arm Ablation Methodology for Diffusion-as-Inference on Structured Reasoning Tasks

arXiv cs.AI · Libo Sun, Po-Wei Harn, Zewei Zhang, Peixiong He · 2026-06-25

The paper introduces retrieval-warmed energy-based reasoning (RW-EBR), a diffusion-as-inference method combining IRED energy-based models with Modern Hopfield trajectory memory, and proposes a five-arm ablation methodology (oracle, best-constant, per-query-random, shuffled, aligned) to disentangle class-prior bias shift, stochastic warm-starting, and graph-aligned value reuse effects. On connectivity-2 tasks, the aligned-vs-shuffled-oracle ablation shows a +35pp balanced accuracy gain, revealing per-graph alignment as the dominant factor. Applied to Sudoku, the method identifies key quality as the limiting component, demonstrating task-specific failure mode diagnosis in structured reasoning tasks.

energy-based reasoning · diffusion-as-inference · modern hopfield · five-arm ablation · structured reasoning

Localizing RL-Induced Tool Use to a Single Crosscoder Feature

arXiv cs.AI · Andrii Shportko, Shubham Bhokare, Ahmed Zeyad A Alzahrani, Bowen Cheng · 2026-06-25

The study identifies Dedicated Feature Crosscoders (DFC) as a compact set of RL-specific features enabling tool-calling capability in Qwen2.5-3B. Through a 48-crosscoder hyperparameter sweep, encode-decode reconstruction improves the RL model's tool correctness by +31.1 ± 9.7 percentage points and transfers capability to the frozen base model (+6.8 ± 5.0 pp), termed 'capability spillover'. Results demonstrate that DFC partitioning concentrates RL-induced behaviors into a minimal, steerable feature set for runtime control of agentic LLMs.

dedicated feature crosscoders · tool-calling capability · capability spillover · rl-specific features · agentic llms

3D Spatial Pattern Matching

arXiv cs.AI · Nicole R. Schneider, Avik Das, Lukas Arzoumanidis, Abhijeet Ghodgaonkar · 2026-06-25

The authors introduce 3D spatial pattern matching, extending prior 2D approaches to incorporate height alongside positional data, addressing limitations in real-world entity searches. They propose a generalized problem definition and develop a subgraph matching algorithm capable of resolving 3D spatial patterns over distance relations. Two datasets are released: one synthetic and one containing real 3D building data from Hamburg, Germany. The algorithm is evaluated on both datasets, establishing baseline performance metrics for future research in 3D spatial pattern matching.

spatial pattern matching · subgraph matching · 3d spatial patterns · distance relations · real-world entities

auto-psych: Automating the science of mind using agent-driven theory discovery and experimentation

arXiv cs.AI · Ben Prystawski, Kushin Mukherjee, Daniel Wurgaft, Linas Nasvytis · 2026-06-24

The paper introduces auto-psych, an agent-driven system for automated theory discovery and experimentation in computational cognitive science. The system employs nested agent-based loops: an inner loop for conjecturing, fitting, and critiquing probabilistic cognitive models, and an outer loop for designing, launching, and analyzing crowdsourced survey experiments. Using coin-flip sequence randomness judgments as a testbed, auto-psych recovered ground-truth theories from synthetic data and outperformed literature-derived theories in three human experiments. The results demonstrate the feasibility of automated data collection and theory generation in cognitive science.

automated theory discovery · computational cognitive science · agent-based systems · probabilistic cognitive models · crowdsourced experimentation

MKG-RAG-Bench: Benchmarking Retrieval in Multimodal Knowledge Graph-Augmented Generation

arXiv cs.AI · Xiaochen Wang, Bao Hoang, Han Liu, Ting Wang · 2026-06-24

The authors introduce MKG-RAG-Bench, a cross-domain benchmark for evaluating retrieval in multimodal knowledge graph-augmented generation (MKG-RAG), addressing the overlooked challenge of heterogeneous multimodal knowledge alignment. The benchmark is constructed from two multimodal knowledge graphs using an LLM-based curation pipeline that filters low-utility knowledge, generates structurally grounded queries, and covers diverse modality configurations. Experiments demonstrate that multimodal retrieval remains a critical bottleneck, with retrieval quality strongly influencing downstream generation performance, providing a foundation for advancing MKG-RAG systems.

multimodal knowledge graph · retrieval-augmented generation · benchmark · cross-domain evaluation · llm-based curation

Active Adversarial Perturbation-driven Associative Memory Retrieval for RGB-Event Visual Object Tracking

arXiv cs.AI · Xiao Wang, Xufeng Lou, Zikang Yan, Lan Chen · 2026-06-24

The paper introduces APRTrack, a hierarchical perturbation and retrieval framework for robust RGB-Event visual object tracking under partial target occlusion and modal degradation. The method employs two adversarial perturbation branches to simulate modality-level failure and spatial-level target absence, coupled with a hierarchical routing mechanism to prevent feature collapse. It further proposes Footprint-guided Channel-calibrated Hopfield Retrieval (FCHR) for confidence-based historical feature compensation. Experiments on FE108, COESOT, VisEvent, and FELT datasets validate the approach's effectiveness in challenging scenarios.

rgb-event tracking · adversarial perturbation · hopfield retrieval · modal degradation · feature compensation

Data-driven Machine Learning Cannot Reach Symbolic-level Logical Reasoning -- The Limit of the Scaling Law

arXiv cs.AI · Tiansi Dong, Mateja Jamnik, Pietro Liò · 2026-06-24

The study demonstrates fundamental limitations preventing data-driven machine learning from achieving symbolic-level syllogistic reasoning, even with scaling. Through theoretical analysis and experiments with Euler Net and ChatGPT variants (GPT-5-nano/GPT-5), the authors identify two key barriers: (1) training data's inability to distinguish all 24 valid syllogism types, and (2) contradictory optimization targets between pattern recognition and logical reasoning components. Results show surface form variations affect reasoning performance, with GPT-5 reaching 100% accuracy but producing incorrect explanations, suggesting supervised learning cannot match symbolic reasoning rigor despite apparent convergence.

syllogistic reasoning · scaling law · symbolic reasoning · neural networks · chatgpt

ProvenAI: Provenance-Native Traces of Evidence in Generated Answers

arXiv cs.AI · Mohammad Faizan, Dalal Alharthi · 2026-06-24

ProvenAI introduces a framework for decomposing transparency in retrieval-augmented QA systems into three measurable layers: answer correctness, citation fidelity, and per-document influence. The method employs a seven-stage pipeline including data normalization, retrieval indexing, citation-aware generation, and influence estimation via leave-one-resource-out intervention, evaluated on 7,405 HotpotQA examples. Results show 53.53% answer accuracy, 71.55% mean citation fidelity, and reveal a citation-influence gap where uncited sources significantly affect outputs. The framework formalizes faithfulness via KL-divergence and causal-mediation analysis, advocating for traceable evidence across three distinct layers.

retrieval-augmented qa · citation fidelity · influence estimation · causal-mediation analysis · provenance-native
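The leave-one-resource-out idea in the ProvenAI summary can be sketched as: remove one retrieved document, regenerate, and measure how far the answer distribution moves. The code below is a toy illustration under assumptions. `answer_dist` is a hypothetical callable standing in for a generator that returns probabilities over a fixed set of candidate answers, and using KL divergence in this direction as the influence score is an assumption rather than the paper's exact definition.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def leave_one_out_influence(docs, answer_dist):
    """Influence of each document = shift in the answer distribution
    when that document alone is removed from the context."""
    full = answer_dist(docs)
    return {d: kl_divergence(full, answer_dist([x for x in docs if x != d]))
            for d in docs}

def toy_answer_dist(docs):
    # Hypothetical generator: document "a" decides the answer, "b" is ignored.
    return [0.9, 0.1] if "a" in docs else [0.5, 0.5]

print(leave_one_out_influence(["a", "b"], toy_answer_dist))
```

In this toy setup a document can carry high influence whether or not the answer cites it, which is the kind of citation-influence gap the summary refers to.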

Closing the Loop to Discover Psychological Theories with an Automated Cognitive Scientist

arXiv cs.AI · Akshay K. Jagadish, Younes Strittmatter, Nori Jacoby, George Kachergis · 2026-06-24

The Automated Cognitive Scientist (AutoCog) introduces a fully autonomous system for theory discovery in cognitive science, closing the loop between theory generation, experiment design, and empirical validation. AutoCog employs large-language-model agents to propose executable cognitive models, design discriminating experiments, collect behavioral data, and iteratively refine theories based on generative performance. In decision-making studies, AutoCog recovered known strategies, outperformed initial theories, and generalized to new experimental settings. It also discovered a novel multi-cue decision-making theory, validated through preregistered experiments. This demonstrates how automated systems can transform cognitive theory-building into an explicit and cumulative scientific process.

autonomous systems · cognitive modeling · theory generation · decision-making · large-language-model

WatchAct: A Benchmark for Behavior-Grounded Robot Manipulation

arXiv cs.AI · Baiqi Li, Ce Zhang, Yu Fang, Yue Yang · 2026-06-24

WatchAct introduces a benchmark for robot manipulation grounded in observed human behavior, addressing limitations of current benchmarks that lack temporal reasoning. The dataset comprises 3,000 long-horizon instances across 14 tasks in four capability domains: Event Grounding, Procedural Reasoning, Implicit Intent Inference, and Episodic Reasoning. Each instance pairs a human-action video, language instruction, simulator scene, and LIBERO task. A disentangled evaluation protocol assesses video-to-plan reasoning, policy execution, and full task completion. Current systems, including Gemini-3.1-Pro with $π_{0.5}$, achieve low success rates (16.3% in simulation, 14.0% on real robot), highlighting significant gaps in reasoning and execution capabilities.

watchact · long-horizon · libero · episodic reasoning · disentangled evaluation

AXLE: A Cloud Infrastructure for Lean 4 Theorem Proving Utilities

arXiv cs.AI · Jimmy Xin, Alex Schneidman, Chris Cummins, Karun Ram · 2026-06-24

AXLE (Axiom Lean Engine) introduces a cloud infrastructure for scalable Lean 4 theorem proving, addressing limitations in existing systems by offering parallel proof verification, metaprogramming tools, and multi-version support. The service provides 14 Lean 4 utilities, including proof verification, declaration extraction, and deterministic proof repair, deployed as a multi-tenant cloud service with per-request isolation. Accessible via Python SDK, CLI, web UI, and HTTP API, AXLE has processed over 500 million requests and supports Axiom Math's proving efforts, achieving a 12/12 score on the 2025 Putnam competition.

lean 4 · theorem proving · metaprogramming · cloud infrastructure · proof verification

ConflictScore: Identifying and Measuring How Language Models Handle Conflicting Evidence

arXiv cs.AI · Siyi Liu, Aaron Halfaker, Dan Roth, Patrick Xia · 2026-06-24

The authors introduce ConflictScore, a novel metric for evaluating how language models handle conflicting evidence in grounding documents, addressing limitations of existing factuality metrics. The framework decomposes responses into atomic claims, labels them against documents, and computes two measures: CS-C (proportion of conflicting claims) and CS-R (support-contradiction balance). Using ConflictBench, a benchmark with diverse conflict types, experiments demonstrate ConflictScore's effectiveness in detecting overconfident claims and improving truthfulness, achieving measurable gains on TruthfulQA.

conflictscore · factuality evaluation · grounding documents · truthfulqa · conflictbench
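As a rough illustration of the claim-level bookkeeping in the ConflictScore summary, the sketch below computes a proportion of conflicting claims from per-document labels. Several details are assumptions: the label set (`support`, `contradict`, `neutral`), the rule that a claim counts as conflicting when at least one document supports it and at least one contradicts it, and the `support_balance` formula, which is only one plausible reading of "support-contradiction balance" and not the paper's CS-R definition.

```python
def conflicting_fraction(claim_labels):
    """Share of claims that have both supporting and contradicting documents
    (assumed reading of CS-C)."""
    if not claim_labels:
        return 0.0
    conflicting = sum(1 for labels in claim_labels
                      if "support" in labels and "contradict" in labels)
    return conflicting / len(claim_labels)

def support_balance(labels):
    """Hypothetical balance in [-1, 1]: +1 all support, -1 all contradiction."""
    s = labels.count("support")
    c = labels.count("contradict")
    return 0.0 if s + c == 0 else (s - c) / (s + c)

# One inner list per atomic claim, one label per grounding document.
claims = [
    ["support", "support"],
    ["support", "contradict"],
    ["neutral"],
    ["contradict", "contradict", "support"],
]
print(conflicting_fraction(claims))
print([round(support_balance(c), 3) for c in claims])
```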

Play2Perfect: What Matters in Dexterous Play Pretraining for Precise Assembly?

arXiv cs.AI · Tyler Ga Wei Lum, Kushal Kedia, C. Karen Liu, Jeannette Bohg · 2026-06-24

Play2Perfect introduces a reinforcement learning framework for task-agnostic pretraining through play, followed by finetuning for precise assembly tasks. The method acquires reusable manipulation priors (e.g., grasping, in-hand reorientation) via diverse object interactions, then adapts these priors to high-precision assembly scenarios. Four design choices in play pretraining (object diversity, training objective, trajectory diversity, and goal precision) are systematically studied. Results demonstrate 33x higher sample efficiency than RL from scratch, 60% success on tight insertions with 0.5mm clearance, and over 50% success on multi-part assembly and screwing, with zero-shot sim-to-real transfer.

reinforcement learningprecise assemblymanipulation priorssim-to-real transfersample efficiency

CoStream: Composing Simple Behaviors for Generalizable Complex Manipulation

arXiv cs.AI · Haonan Chen, Yuxiang Ma, Stephen Tian, Xiaoshen Han · 2026-06-24

CoStream introduces a framework for complex manipulation tasks by composing simple, independent behaviors rather than deploying monolithic policies or rigid pipelines. The method orchestrates foundation models and diverse sensing modalities into three core behaviors: semantic behavior for spatial constraints, predictive behavior for trajectory forecasting, and reactive behavior for tactile and force corrections. These behaviors compose via right-multiplication on a shared $SE(3)$ interface, executed by a compliant controller. CoStream demonstrates robust performance on 8 real-world tasks, particularly excelling in contact-rich assembly and object transfer, with effective recovery from manual perturbations during execution.

foundation modelssemantic behaviorpredictive behaviorreactive behaviorcompliant controller
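
To make the composition step concrete, here is a minimal sketch of right-multiplying pose corrections on $SE(3)$ using 4x4 homogeneous transforms. Which behavior supplies the base pose and which supply the local corrections is an assumption made for illustration, as are all function names and numbers; the summary does not specify these details.

```python
import math

def matmul4(a, b):
    """Multiply two 4x4 homogeneous transforms stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(x, y, z):
    return [[1, 0, 0, x], [0, 1, 0, y], [0, 0, 1, z], [0, 0, 0, 1]]

def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def compose(base, *corrections):
    """Right-multiply each correction, so it is applied in the current local frame."""
    pose = base
    for delta in corrections:
        pose = matmul4(pose, delta)
    return pose

# Hypothetical outputs of the three behaviors (values are made up).
semantic = matmul4(translation(0.5, 0.0, 0.2), rot_z(math.pi / 2))  # base target pose
predictive = translation(0.1, 0.0, 0.0)    # move 10 cm along the local x axis
reactive = translation(0.0, 0.0, -0.01)    # press 1 cm along the local z axis

target = compose(semantic, predictive, reactive)
position = [row[3] for row in target[:3]]
```

Because the base pose is rotated 90 degrees about z, the local x offset shows up along the world y axis, which is the practical effect of right-multiplication.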

Estimating Uncertainty in Classifier Performance with Applications to Large Language Models and Nested Data

arXiv cs.AI · Kylie Anglin · 2026-06-24

The paper addresses the inconsistent reporting of uncertainty measures in text classification performance metrics, particularly in social science contexts with small datasets and nested data structures. It evaluates confidence interval methods under conditions of small to moderate sample sizes, infrequent constructs, and texts nested within individuals. Simulations reveal that default methods like Wald intervals and basic percentile bootstrap are inaccurate, while Agresti-Coull, Wilson, Clopper-Pearson, and a novel pseudo-count regularized bootstrap improve accuracy. Hierarchical bootstrap outperforms cluster bootstrap for moderate text counts but is overly conservative for few texts. The study provides methodological guidance to enhance transparency and validation sample size considerations in machine learning applications.

confidence intervalstext classificationhierarchical bootstrappseudo-count regularizationnested data
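
The contrast between Wald and Wilson intervals can be seen with a few lines of arithmetic. The sketch below uses the standard textbook formulas for a binomial proportion; the validation-set numbers (19 of 20 texts classified correctly) are hypothetical and not taken from the paper.

```python
import math

def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return (p - half, p + half)

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return (center - half, center + half)

# Hypothetical small validation set: 19 correct out of 20.
wald = wald_interval(19, 20)
wilson = wilson_interval(19, 20)
```

With these inputs the Wald upper limit exceeds 1, and for a perfect 20 of 20 the Wald interval collapses to zero width, while the Wilson interval stays inside (0, 1).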

Unbiased Canonical Set-Valued Oracles Via Lattice Theory

arXiv cs.AI · Jobst Heitzig · 2026-06-24

The paper introduces a lattice-theoretic framework for constructing self-consistent, unbiased canonical set-valued oracles to address self-reference in probability estimation. Using the Knaster-Tarski fixed-point theorem on the complete lattice of closed credal sets, the method defines an isotone operator whose least fixed point yields nontrivial, self-consistent answers. Results include proofs of existence, nonemptiness, and collapse to classical point estimates for non-performative queries, with interval characterization for binary events under hull-factoring assumptions. The approach generalizes to arbitrary random variables while maintaining lattice-theoretic foundations.

credal setsknaster-tarski theoremself-referenceisotone operatorperformative queries

Beyond Feedforward Networks: Reentry Neural Systems as the Fundamental Basis of Subjecthood and Intrinsic Safety of Next-Generation AGI

arXiv cs.AI · A. S. Ushakov, Yu. N. Berdinsk · 2026-06-24

The authors propose a novel architectural framework for safe artificial general intelligence (AGI) based on closed reentry loops, contrasting with feedforward networks. The architecture incorporates structural cycles (C ≥ 1) with self-sustaining amplification (ρ > 1), ensuring the emergence of self-models, self-preservation, and goal-directed behavior. Goals are encoded as non-textual D-vectors, resistant to prompt injection. They introduce the S-measure, a polynomial-time computable alternative to Tononi's Φ, with machine-verified Lean 4 proofs. The framework includes Python/NumPy implementations, industrial scaling via Apache Kafka and Docker Compose, and a taxonomy of AI evolution epochs. The architecture is deployable today, offering a topologically protected, safe-by-design AGI approach.

reentry loopsself-models-measured-vectortopological protection

When Agents Meet Electric Bus Fleet Operations: Pricing Behavior, Trade-offs, and Policy Implications in an Aggregator Framework

arXiv cs.AI · Jônatas Augusto Manzolli, Ali Eslami, Luis Miranda-Moreno, Jiangbo Yu · 2026-06-24

The paper proposes an agentic aggregator framework for electric bus fleet operations, combining optimization-based scheduling with supervisory agents for real-time adaptation. The method enforces physical feasibility through an optimization core while using agentic layers to handle disturbances, trigger re-optimization, and allocate flexibility value between stakeholders. Results from a depot case study demonstrate improved adaptive coordination and V2G flexibility utilization, but reveal a trade-off where profit-oriented agent configurations may extract value from public transport operators. The findings highlight the need for transparent coordination modes and value-sharing rules in public-fleet deployments.

agentic aggregatorelectric bus fleetsvehicle-to-gridreal-time re-optimizationflexibility allocation

Geometry-Aware MCTS for Extremal Problems in Combinatorial Geometry

arXiv cs.AI · Luoning Zhang, Xu Zhuang, Tianhao Wang, Nathan Kaplan · 2026-06-24

We introduce Geometry-Aware Monte Carlo Tree Search (MCTS), a framework for solving extremal problems in combinatorial geometry with strict global constraints. The method enforces geometric constraints incrementally, reducing constraint checking complexity from O(n³) to O(n²) for collinear point collections, and leverages geometric symmetries through canonical pruning and symmetric batch transitions to enhance search efficiency. Experiments demonstrate new best-known results on five out of six problems, including configurations of size ~1.8n for Max-N3IL on grids (82 ≤ n ≤ 119) and ~0.95n for the Smallest Complete Set problem, establishing Geometry-Aware MCTS as a versatile framework for combinatorial geometry.

monte carlo tree searchcombinatorial geometrygeometric constraintscollinear pointscanonical pruning
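
One way to check a no-three-in-line constraint incrementally is sketched below: when a candidate point is added to an already valid set, it creates a collinear triple exactly when two existing points lie along the same line through the candidate. This is an illustrative assumption about how such a check could work on integer grids, not the paper's implementation; the function names are hypothetical.

```python
from math import gcd

def direction(p, q):
    """Reduced, sign-normalized direction of the line through distinct grid points p and q."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    g = gcd(dx, dy)  # positive because p != q
    dx, dy = dx // g, dy // g
    if dx < 0 or (dx == 0 and dy < 0):
        dx, dy = -dx, -dy  # opposite directions describe the same line
    return (dx, dy)

def can_add(points, candidate):
    """True if adding candidate to a valid set keeps it free of collinear triples."""
    seen = set()
    for p in points:
        if p == candidate:
            return False
        d = direction(candidate, p)
        if d in seen:
            return False  # two existing points share a line through candidate
        seen.add(d)
    return True
```

Each insertion costs one pass over the existing points, so building a set of n points costs on the order of n² direction computations, compared with testing every triple from scratch.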

Deterministic Pareto-Optimal Policy Synthesis for Multi-Objective Reinforcement Learning

arXiv cs.AI · Aniruddha Joshi, Niklas Lauffer, Sanjit Seshia · 2026-06-24

The paper introduces a preference-conditioned Bellman operator for Multi-Objective Markov Decision Processes (MOMDPs), derived from Chebyshev scalarization, to synthesize deterministic Pareto-optimal policies. The operator exhibits an enveloping property, with value functions upper-bounding the true Pareto frontier, and guarantees monotonic convergence to a coverage set. Deterministic policies are extracted from converged Q-estimates, ensuring Pareto-optimality for any preference. Experiments confirm the method's efficacy in recovering complex trade-offs, providing a principled approach to Pareto-optimal policy synthesis.

chebyshev scalarizationpareto frontiermulti-objective mdpsdeterministic policiesbellman operator
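
For intuition, here is a minimal sketch of greedy action selection under weighted Chebyshev scalarization, using the textbook form (minimize the largest weighted shortfall from a utopia point). The Q-vectors, weights, and utopia point are made up, and this is not the paper's Bellman operator.

```python
def chebyshev_score(q, weights, utopia):
    """Largest weighted gap between the utopia point and a vector-valued return."""
    return max(w * (z - v) for w, v, z in zip(weights, q, utopia))

def greedy_action(q_values, weights, utopia):
    """Pick the action whose Q-vector minimizes the Chebyshev score."""
    return min(q_values, key=lambda a: chebyshev_score(q_values[a], weights, utopia))

# Hypothetical two-objective Q-vectors for three actions in one state.
q_values = {"fast": (10, 2), "safe": (3, 9), "balanced": (5, 5)}
utopia = (10, 9)  # best achievable value per objective

even = greedy_action(q_values, (0.5, 0.5), utopia)
speed_first = greedy_action(q_values, (0.9, 0.1), utopia)
```

The `balanced` action is Pareto-optimal but lies in a concave part of the frontier, so no linear weighting with non-negative weights selects it over both alternatives; the Chebyshev form does select it for even weights.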

Sampling sea state using a diffusion model

arXiv cs.AI · Jiarong Wu, Bertrand Chapron, Laure Zanna · 2026-06-24

We introduce a diffusion-based generative model for global sea state estimation that conditions on 5 days of global wind forcing, enabling direct sampling of the complex conditional distribution without autoregressive time-stepping. The model extends beyond bulk variables to estimate partition-related variables and derived quantities like Stokes drift and mean square slope. Trained on a 30-year WAVEWATCH-III hindcast, it achieves substantial computational acceleration compared to numerical spectral models while delivering skillful predictions and calibrated ensemble spread for bulk variables. This approach offers a promising path for probabilistic wave forecasting and efficient coupling into broader earth system models.

diffusion modelsea state estimationstokes driftensemble spreadwave forecasting

SOLAR: AI-Powered Speed-of-Light Performance Analysis

arXiv cs.AI · Qijing Huang, Sana Damani, Zhifan Ye, Athinagoras Skiadopoulos · 2026-06-24

SOLAR introduces an automated framework for deriving Speed-of-Light (SOL) performance bounds from PyTorch and JAX source code. It combines an LLM frontend for program translation into Affine Loop IR, deterministic einsum graph generation, and analytical backend computation of SOL bounds. The system supports multi-fidelity analysis, validated bounds with zero violations, and covers diverse workloads including KernelBench and robotics. Evaluations demonstrate applications in headroom analysis, optimization targeting, cross-platform comparison, and hardware provisioning.

speed-of-lightaffine loop ireinsum graphmulti-fidelityinverse-roofline
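
A speed-of-light bound in its simplest form is a roofline estimate: a kernel cannot finish faster than its arithmetic allows at peak throughput, nor faster than its memory traffic allows at peak bandwidth. The sketch below shows that arithmetic for a matrix multiply; the hardware peaks are hypothetical, and the code does not reproduce SOLAR's IR or multi-fidelity analysis.

```python
def matmul_cost(m, n, k, bytes_per_element=2):
    """FLOPs and minimum bytes moved for an (m x k) @ (k x n) matrix multiply."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops, bytes_moved

def sol_bound(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Lower bound on runtime in seconds, and which resource sets it."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    if compute_time >= memory_time:
        return compute_time, "compute"
    return memory_time, "memory"

# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s peak bandwidth.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 1e12

square = sol_bound(*matmul_cost(4096, 4096, 4096), PEAK_FLOPS, PEAK_BANDWIDTH)
skinny = sol_bound(*matmul_cost(1, 4096, 4096), PEAK_FLOPS, PEAK_BANDWIDTH)
```

Under these assumed peaks, the square multiply is compute-bound and the single-row multiply is memory-bound, which is the kind of distinction headroom analysis relies on.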

Charting the Growth of Social-Physical HRI (spHRI): A Systematic Review Pipeline Augmented by Small Language Models

arXiv cs.AI · Mayumi Mohan, Ju-Hung Chen, Alexis E. Block · 2026-06-24

The study evaluates small language models (SLMs) for augmenting systematic literature reviews in social-physical human-robot interaction (spHRI). Using SLMs (<1.5B parameters) for title/abstract screening, the authors found that while SLMs underperformed human reviewers, they operated locally and screened papers significantly faster. An ensemble of SLMs identified 39 additional relevant papers (10.29% of the final dataset), demonstrating their utility as a scalable complement to expert review in large-scale literature synthesis.

social-physical hrismall language modelssystematic reviewtitle/abstract screeningensemble modeling

Hybrid privacy-aware semantic search: SVD-truncated document geometry and CKKS-encrypted query reranking under a restricted threat model

arXiv cs.AI · Sergey Kurilenko · 2026-06-24

The paper proposes a hybrid privacy-preserving semantic search system combining geometric document protection with cryptographic query reranking. Documents are secured via SVD-based dimensionality reduction and secret orthogonal rotation, while queries are processed under CKKS homomorphic encryption. Theoretical analysis proves reconstruction error bounds for subspace-constrained attackers. Experiments on 1M documents with 5 encoders show maintained ranking quality (0.99-1.01x original MRR) at sub-second latency, with inversion attacks reduced to noise floor. Security analysis characterizes limits: document protection relies on empirical obfuscation (vulnerable to Procrustes attacks with ~d leaked pairs), while query confidentiality is cryptographically sound.

semantic searchhomomorphic encryptionsvd truncationembedding inversionprivacy-preserving retrieval

Narration-of-Thought: Inference-Time Scaffolding for Defeasible Ethical Reasoning in Large Language Models

arXiv cs.AI · Patrick Cooper, Alvaro Velasquez · 2026-06-24

The paper introduces narration-of-thought (NoT), an inference-time scaffolding method that improves defeasible ethical reasoning in large language models by structuring chain-of-thought into five sections: protagonist, stakeholders, consequences, uncertainty, and commitment. Without additional training or parameters, NoT reduces stakeholder collapse from ≤31% to <1% and uncertainty suppression from ≤72% to 1-24% across four generators from three vendors, with Cliff's delta advantages of +0.79 to +0.90 on stakeholder count and +0.65 to +0.93 on uncertainty score. The method also enables a five-round debate protocol achieving 95-100% consensus on calibration sets.

narration-of-thoughtchain-of-thoughtstakeholder collapseuncertainty suppressiondefeasible reasoning
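
Cliff's delta, the effect size quoted above, is the probability that a value from one group exceeds a value from the other, minus the reverse probability. A minimal sketch of the standard definition follows; the stakeholder counts are hypothetical and not taken from the paper.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

# Hypothetical stakeholder counts per response.
scaffolded = [4, 5, 5, 6]
baseline = [2, 3, 3, 4]

delta = cliffs_delta(scaffolded, baseline)
```

Values range from -1 to +1, with 0 meaning the two groups overlap completely.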

Accelerating Returns and the Qualitative Engine for Science

arXiv cs.AI · Guojun Liao · 2026-06-24

The paper formalizes Ray Kurzweil's accelerating returns thesis mathematically, then argues it fails to address scientific discovery's core challenge: qualitative reasoning about conceptual frameworks. Analyzing ARC-AGI-3 benchmark results (humans: ceiling performance, frontier AI: <1%), it demonstrates a persistent gap in flexible reasoning. The proposed Qualitative Engine for Science (QES) addresses this by preserving human scientific discovery processes as valuable wisdom, independent of AGI timelines. The work distinguishes between quantitative capability acceleration and qualitative reasoning capacities.

accelerating returnsqualitative reasoningarc-agi-3scientific discoveryfrontier ai

Instruction Bleed: Cross-Module Interference in Prompt-Composed Agentic Systems

arXiv cs.AI · Ching-Yu Lin, Yifan Liu · 2026-06-24

The article formalizes compositional behavioral leakage (CBL), a novel failure mode in prompt-composed agentic systems where editing one module affects others despite no explicit dependencies. Using a three-channel protocol (volume, content, form perturbations) on Claude Sonnet 4.6 (144 trials), the study finds content perturbations cause detectable interference (Cohen's d=0.63) without decision flips, revealing sub-threshold effects. CBL is shown orthogonal to known failure modes like adversarial injection. Contributions include an operational definition, reusable protocol, and system-class characterization for cross-module interference measurement.

compositional behavioral leakageprompt-composed agentscross-module interferenceself-attentionsub-threshold effects

OpenFinGym: A Verifiable Multi-Task Gym Environment for Evaluating Quant Agents

arXiv cs.AI · Kaicheng Zhang, Wen Ge, Lei Jiang, Weixin Yang · 2026-06-24

We present OpenFinGym, a unified gym environment for evaluating quantitative-finance agents across multiple interdependent tasks including forecasting, market generation, real-time trading, and fraud detection. The platform features an automated task-construction pipeline that converts finance publications into executable tasks, a containerized runtime with host-side verification to prevent train-test leakage, a low-latency paper trading engine, deferred-resolution support for long-horizon forecasts, and integration for supervised fine-tuning and reinforcement learning post-training. This holistic approach addresses limitations of single-task evaluations by enabling comprehensive assessment of agent generalization, market interaction, and financially meaningful decision-making.

quantitative-financetask-construction pipelinehost-side verifierpaper trading enginedeferred-resolution

What We are Missing in Multimodal LLM Evaluation?

arXiv cs.AI · Po-han Li, Shenghui Chen, Sandeep Chinchali, Ufuk Topcu · 2026-06-24

The article identifies critical gaps in multimodal large language model (MLLM) evaluation, emphasizing the need for benchmarks that assess cross-modal integration. It reviews existing evaluation taxonomies and highlights deficiencies in measuring temporal-spatial coherence, physical world understanding, multimodal consistency, and selective attention. These gaps hinder accurate assessment of MLLM capabilities and progress in multimodal intelligence. Addressing these limitations is crucial for exposing model boundaries and advancing the field.

multimodal large language modelstemporal-spatial coherencemultimodal consistencyselective attentionbenchmark taxonomy

How Do Tool-Augmented LLM Agents Perform on Real-World Energy Analytics Tasks?

arXiv cs.AI · David Akinpelu, Akintonde Abbas, Rereloluwa Alimi, Ayodeji Lana · 2026-06-24

This empirical study evaluates tool-augmented LLM agents on real-world energy market analytics tasks, addressing a critical gap in domain-specific agentic benchmarks. The authors develop an evaluation environment with 243 expert-curated problems across three categories: Market Data Retrieval and Analysis, Knowledge Retrieval and Interpretation, and Advanced Quantitative Modeling and Decision Analytics. Agents are equipped with domain-specific tools including live electricity market APIs, regulatory docket search, and retrieval-augmented generation over energy market documents. Performance is assessed using a multi-dimensional evaluation protocol that scores approach correctness, answer accuracy, attribute alignment, and source validity. The study compares closed-source and open-source LLMs, analyzing the interaction between model capability and domain tooling in a high-stakes professional domain.

tool-augmented llm agentsenergy market analyticsretrieval-augmented generationmulti-dimensional evaluationdomain-specific benchmarks

EVOM: Agentic Meta-Evolution of Actor-Critic Architectures for Reinforcement Learning

arXiv cs.AI · Boyun Zhang, Chao Wang, Kai Wu · 2026-06-24

EVOM introduces an agentic meta-evolution framework for automating actor-critic architecture design in reinforcement learning, addressing the challenges of manual design and open-ended search spaces. The method employs bi-level optimization: an inner loop trains weights using proximal policy optimization (PPO), while an outer loop refines architecture programs via an LLM-based design agent, decoupled from policy execution. EVOM outperforms manual baselines, LLM-guided random search, and the state-of-the-art MLES method on Ant-v4 and HalfCheetah-v4 benchmarks. Ablation studies confirm the necessity of both the meta-evolution loop and the LLM Design Agent for achieving superior performance.

actor-criticmeta-evolutionproximal policy optimizationbi-level optimizationllm-based design agent

Parametric Generalized Adaptive Moment Features (PG-AMF) for Bearing Fault Diagnosis and Machine Health Monitoring

arXiv cs.AI · Rajeev Kumar · 2026-06-24

The authors propose Parametric Generalized Adaptive Moment Features (PG-AMF), a data-driven feature extraction framework for bearing fault diagnosis that learns signal representations rather than using predefined descriptors. The method combines absolute, signed moment, and AC-coupled moment features from vibration signals, with multi-sensor fusion via a structured mechanism. Evaluated on a 5-class gearbox bearing dataset, PG-AMF outperforms conventional methods in classification accuracy and shows improved feature separability in low-dimensional projections, demonstrating both diagnostic performance and industrial applicability.

adaptive feature extractionbearing fault diagnosisvibration signal analysismulti-sensor fusionparametric moment features

The Verification Horizon: No Silver Bullet for Coding Agent Rewards

arXiv cs.AI · Binghai Wang, Chenlong Zhang, Dayiheng Liu, Jiajun Zhang · 2026-06-24

The paper identifies verification as the key bottleneck in coding agent development, arguing that while solution generation has become tractable via foundation models, reliable verification remains inherently challenging due to intent underspecification and optimization-induced proxy misalignment. The authors propose evaluating verification signals along scalability, faithfulness, and robustness dimensions, empirically analyzing four reward constructions (test verifier, rubric verifier, user-as-verifier, and automated agent verifier) across task types. Experiments demonstrate that targeted verification design reduces reward hacking and improves task completion, with results showing significant benchmark improvements while highlighting the necessity for co-evolution of verification and generation capabilities.

verification horizonreward hackingintent underspecificationfoundation modelsproxy misalignment

COrigami: An AI Pipeline for Co-Designing Flat-Foldable Visually Recognisable Origami

arXiv cs.AI · Tom Zahavy, Shaobo Hou, Thomas Tumiel, James Doran · 2026-06-24

COrigami introduces an AI pipeline for co-designing flat-foldable origami with visual recognition, addressing the challenge of combining geometric constraints with aesthetic requirements. The method involves generating semantic stick figures, computing base packings, solving flat-foldable crease patterns, and refining models via reinforcement learning with autonomous aesthetic evaluation. The system serves as a collaborative assistant, providing structural starting points for human artists while ensuring mathematical rigor through algorithmic optimization and multi-objective constraint satisfaction.

computational origamiflat-foldable crease patternreinforcement learningautonomous aesthetic evaluationmulti-objective constraints

Governing Actions, Not Agents: Institutional Attestation as a Governance Model for Autonomous AI Systems

arXiv cs.AI · Jakob Salfeld-Nebgen · 2026-06-24

The paper proposes institutional attestation as a governance model for autonomous AI systems performing high-risk actions. The model decouples agent reasoning from execution authority, requiring independent cryptographic attestation of preconditions for designated actions. Implementation features include deterministic policy evaluation, intent-bound attestations, and tamper-evident logging. A proof-of-concept demonstrates applicability to domains like clinical prescribing and software deployment, enabling ex-post verification while preserving agent autonomy during planning.

autonomous agentscryptographic attestationdeterministic policytamper-evident loggingexecution authority
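
The idea of an intent-bound attestation can be sketched as a tag computed over the exact action an agent proposes, which the executor verifies before acting. The sketch below uses an HMAC with a shared key purely for brevity; that choice, the field names, and the function names are assumptions, and a real deployment would more plausibly use asymmetric signatures held by an independent attester.

```python
import hashlib
import hmac
import json

def canonical(action):
    """Serialize an action deterministically so the same intent gives the same bytes."""
    return json.dumps(action, sort_keys=True, separators=(",", ":")).encode()

def attest(key, action):
    """Attester side: bind a tag to this exact action after checking preconditions."""
    return hmac.new(key, canonical(action), hashlib.sha256).hexdigest()

def is_authorized(key, action, tag):
    """Executor side: accept only if the tag matches the action being executed."""
    return hmac.compare_digest(attest(key, action), tag)

KEY = b"demo-key-not-for-production"  # hypothetical secret held by the attester
action = {"type": "deploy", "service": "billing", "version": "1.4.2"}
tag = attest(KEY, action)

tampered = dict(action, version="9.9.9")
```

Because the tag covers the serialized action, a tag issued for one deployment cannot be reused to authorize a different version.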

The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators

arXiv cs.AI · Alex Iacob, Andrej Jovanović, William F. Shen, Daniel Burkhardt · 2026-06-24

The Red Queen Gödel Machine (RQGM) introduces an evolutionary framework for recursive self-improvement with non-stationary utilities, addressing the limitation of fixed evaluation criteria in prior methods. By organizing search into epochs with dynamic utility updates, RQGM enables co-evolution of agents and evaluators. Results show improved test pass rates with 1.35x-1.72x fewer tokens in coding tasks and 1.78x-1.86x higher acceptance rates in scientific writing, with graders achieving 9% higher accuracy. RQGM also corrects reviewer bias by introducing adversarial objectives, reducing over-acceptance of AI-generated papers by 1.91x.

recursive self-improvementnon-stationary utilitiesco-evolutionadversarial objectivesagent-as-a-judge

SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning

arXiv cs.AI · Omanshu Thapliyal · 2026-06-24

The paper introduces Hankel Reduced-order Model (HRM) adapters, a state space model (SSM)-based parameter-efficient fine-tuning (PEFT) method for long-context tasks. HRM employs Balanced Truncation of empirical Hankel Gramians for initialization and leverages time-invariance for exact FFT-based parallel scans, matching LoRA's computational efficiency. Evaluated on Mistral-7B (8.4M trainable parameters), HRM outperforms LoRA variants on LongBench tasks (QuALITY: +34.8% relative accuracy; QMSum: +71.6% ROUGE-1) and synthetic state-tracking tasks. Gate analysis shows HRM adapters effectively modulate recurrence, offering a robust alternative to low-rank adaptation for sequence modeling.

state space modelparameter-efficient fine-tuninghankel gramianslong-contextlow-rank adaptation

TEMPO-Diffusion: Temporally Exposed Malicious Poisoning of Diffusion Models

arXiv cs.AI · William Aiken, Paula Branco, Guy-Vincent Jourdan, Iosif-Viorel Onut · 2026-06-24

TEMPO-Diffusion introduces a targeted backdoor attack framework for diffusion models, addressing limitations of prior noise-based attacks by localizing malicious distribution shifts to temporal, in-distribution exposures. The method enables (i) class-specific targeting, (ii) multi-location sub-image backdoors, and (iii) time-conditioned trigger in-painting. Evaluated on CIFAR10, GTSRB, and the newly introduced CALISA traffic-sign dataset, TEMPO-Diffusion demonstrates reliable poisoning of synthetic data generation, inducing high attack success rates in downstream classifiers trained on compromised data.

backdoor attackdiffusion modelstemporal poisoningsynthetic datadownstream classifiers

From Clicks to Intent: Cross-Platform Session Embeddings with LLM-Distilled Taxonomy for Financial Services Recommendations

arXiv cs.AI · Dianjing Fan, Yao Li, Kyaw Hpone Myint, Dwipam Katariya · 2026-06-24

The paper introduces a dual-purpose intent prediction framework for financial services recommendations, addressing the gap between pre-login web interactions and authenticated in-app experiences. The method combines a self-supervised Transformer for encoding clickstreams into session embeddings with an LLM-based taxonomy generation pipeline for interpretable labels. Evaluations show a 1.88% improvement in macro Recall@1 and a 13.38% reduction in Log Loss for mobile homepage ranking, while the distilled labels maintain interpretability with only a 7% performance drop compared to raw embeddings.

session embeddingsllm distillationclickstream modelingintent predictionself-supervised learning

Accelerating Skill Assessment in Chess: A Drift-Diffusion-Enhanced Elo Rating System

arXiv cs.AI · Tianyuan Zhou, Zhizheng Fu, Tianming Yang · 2026-06-24

The paper introduces Drift-Diffusion-Enhanced Elo Rating System (DD-Elo), a novel chess skill assessment framework that accelerates rating updates by incorporating move-level data. Inspired by cognitive neuroscience's drift diffusion model (DDM), DD-Elo models skill expression as a decision-making process while maintaining bounded deviation from traditional Elo. Experiments show DD-Elo adapts to skill changes faster than Elo, offering an explainable and backward-compatible solution for chess rating systems.

elo rating systemdrift diffusion modelskill assessmentdecision-making processchess matchmaking
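
For reference, the baseline that DD-Elo extends is the standard Elo update, sketched below. The move-level drift-diffusion correction is not specified in the summary and is not reproduced here; the K-factor of 32 is a common convention used as an assumption.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, score_a, k=32):
    """New rating for A after a game; score_a is 1 (win), 0.5 (draw), or 0 (loss)."""
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))

after_win = elo_update(1500, 1500, 1.0)        # evenly matched players, A wins
underdog_expectation = expected_score(1400, 1800)
```

Since each game contributes a single outcome, the rating moves by at most K points per game, which is the slow adaptation that move-level information is meant to speed up.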

A multi-task spatiotemporal deep neural network for predicting penetration depth and morphology in laser welding

arXiv cs.AI · Sen Li, Haichao Cui, Chendong Shao, Yaqi Wang · 2026-06-24

The study introduces a multi-task spatiotemporal deep neural network for simultaneous prediction of penetration state, depth, and weld seam morphology in laser welding. The model combines convolutional neural networks with state space models to process spatial-temporal features from weld pool images and welding parameters, supported by a novel dataset construction method. Experimental validation achieves 99.35% accuracy for penetration state classification, 1.79 mm mean error for depth prediction, and 95.65% accuracy for cross-section reconstruction.

laser weldingspatiotemporal modelingmulti-task learningconvolutional neural networksstate space models

Lacuna: A Research Map for Machine Learning

arXiv cs.AI · Martin Weiss, Miles Q. Li, Alejandro H. Artiles, Yacine Mkhinini · 2026-06-24

Lacuna introduces a machine learning research mapping system leveraging large language models (LLMs) to generate markdown summaries, concept elements, research directions, and proposals from scholarly metadata and papers. The system maintains links to primary sources and offers web, markdown, and MCP interfaces. Evaluated on LitSearch, Multi-XScience-CS/ML, and ScholarQA-CS/ML benchmarks, Lacuna outperforms OpenScholar v3, achieving a Recall@10 of 0.538 versus 0.424. Lacuna Deep Research, a multi-stage report agent, demonstrates superior performance on ReportBench-ML tasks, with citation F1 of 0.052, precision of 0.339, 99 expert-reference hits, and a RACE report quality score of 7.82/10, compared to GPT-Researcher's 0.039 F1, 0.290 precision, 72 hits, and 5.24/10 RACE.

llmslitsearchreportbench-mlracescholarqa-cs-ml

Knowledge-augmented Agentic AI for Mental Health Medication Information Seeking

arXiv cs.AI · Huizi Yu, Jian Liu, Wenkong Wang, Lingyao Li · 2026-06-24

The study introduces a provenance-aware, knowledge-graph-based multi-agent framework for integrating psychiatric medication information from diverse sources, including 466,525 Reddit posts, 60,782 WebMD reviews, and U.S. FDA Adverse Event Reporting System records. A large-language-model entity-recognition pipeline achieved F1 scores of 0.969 for medications and 0.973 for conditions when benchmarked against physician annotations. The framework uses a Neo4j knowledge graph grounded in ATC-N, ICD-10, and MedDRA vocabularies to preserve provenance, ensuring traceability and distinction between regulatory facts and patient experiences. Results indicate that patient-generated data form a partly independent safety signal, with adverse events appearing in community sources hundreds of days before FDA reports.

knowledge graphentity recognitionprovenance-awareadverse event reportingmulti-agent framework

Agentic Analysis for Agentic Infrastructure: An LLM-Powered Pipeline for Comparative Governance of DAO and Corporate AI Protocols

arXiv cs.AI · Yutian Wang, Luyao Zhang · 2026-06-24

The study introduces an LLM-powered pipeline for comparative governance analysis of AI agent protocols, combining automated annotation, neural topic modeling, and multi-layer network analysis. Applied to 4,323 governance records from ERC-8004 (permissionless) and Google A2A (corporate-led) standards, the method reveals that governance form influences thematic focus but both regimes show similar participation inequality and community fragmentation. Permissionless settings exhibit denser discourse alignment, suggesting open governance may enhance thematic convergence. The findings demonstrate LLM-assisted methods' utility for empirical technology governance research.

llm-assisted codingneural topic modelingmulti-layer network analysisagent interoperabilitygovernance discourse

Statistical and Structural Approaches to Algorithmic Fairness

arXiv cs.AI · Antonio Ferrara · 2026-06-24

This work critiques contemporary algorithmic fairness paradigms for two key limitations: (1) overreliance on deterministic point estimates during fairness audits, and (2) failure to account for structural context when evaluating individual outcomes. The analysis bridges statistical fairness approaches with sociological frameworks, arguing that current methods inadequately address systemic inequalities embedded in socio-technical systems. The thesis proposes methodological improvements to better capture structural dependencies and uncertainty quantification in fairness evaluations of machine learning systems mediating access to opportunities.

algorithmic fairnesssocio-technical systemsstructural inequalityuncertainty quantificationpredictive modeling

Autodata: An agentic data scientist to create high quality synthetic data

arXiv cs.AI · Ilia Kulikov, Chenxi Whitehouse, Tianhao Wu, Yixin Nie · 2026-06-24

The paper introduces Autodata, an agentic framework for synthetic data generation where AI agents act as data scientists to create high-quality training and evaluation datasets. The method employs meta-optimization (Agentic Self-Instruct) to improve the agent's data generation capabilities iteratively. Experiments on computer science research, legal reasoning, and mathematical reasoning tasks demonstrate performance improvements over classical synthetic data methods, with additional gains from meta-optimizing the agent. The approach enables conversion of inference compute into higher-quality training data, potentially transforming AI data pipelines.

autodataagentic self-instructmeta-optimizationsynthetic data generationdata scientist agent

DanceOPD: On-Policy Generative Field Distillation

arXiv cs.LG · Wei Zhou, Xiongwei Zhu, Zelin Xu, Bo Dong · 2026-06-25

DanceOPD introduces an on-policy generative field distillation framework for flow-matching models to unify diverse image generation capabilities, including text-to-image (T2I), local editing, and global editing. The method routes each sample to one capability field, queries a low-noise student-induced state, and trains using a velocity MSE objective. By defining each capability source as a velocity field over the shared flow state space, the student learns from fields queried on its own rollout states, enabling effective composition of expert capabilities. Experiments demonstrate improved multi-capability composition, enhancing target capabilities while preserving anchor generation quality, including T2I, editing, realism-field absorption, and classifier-free guidance absorption.

flow-matching modelsgenerative field distillationvelocity mseclassifier-free guidancecapability composition

Reinforcement Learning without Ground-Truth Solutions can Improve LLMs

arXiv cs.LG · Yingyu Lin, Qiyue Gao, Nikki Lijing Kuang, Xunpeng Huang · 2026-06-25

The paper introduces RiVER, a Ranking-induced VERifiable framework for training LLMs on score-based optimization tasks without ground-truth solutions, using deterministic execution feedback as continuous supervision. It addresses scale dominance and frequency dominance via calibrated reward shaping with instance-wise comparisons and top-ranked solver emphasis. Evaluated on 12 AtCoder Heuristic Contest tasks, RiVER improves Qwen3-8B and GLM-Z1-9B-0414 by 8.9% and 9.4% in ALE rating rank, while also enhancing performance on exact-solution benchmarks (LiveCodeBench: +2.4%, USACO: +3.5%).

reinforcement learning · large language models · reward shaping · score-based optimization · deterministic execution

When are likely answers right? On Sequence Probability and Correctness in LLMs

arXiv cs.LG · Johannes Zenn, Jonas Geiping · 2026-06-25

The paper investigates the alignment between sequence probability and correctness in large language models across decoding methods, hyperparameters, and prompt-answer pairs. Using empirical analysis, the authors quantify this relationship at four levels: decoding methods, hyperparameters, dataset pairs, and repeated responses. Results show that higher sequence probability correlates with correctness within fixed datasets but fails to generalize to decoding decisions or repeated prompts, offering practical insights for decoding strategies and self-improvement methods.

sequence probability · decoding methods · large language models · correctness alignment · hyperparameters

Hallucination in World Models is Predictable and Preventable

arXiv cs.LG · Nicklas Hansen, Xiaolong Wang · 2026-06-25

The study demonstrates that hallucination in generative world models stems from low-coverage regions of the state-action space and proposes data-centric solutions for detection and mitigation. Using MMBench2, a 427-hour dataset with 210 tasks, the authors train a 350M-parameter world model and identify three hallucination modes: perceptual, action-marginalized, and scene-diverging. They develop three predictive signals for hallucination and introduce coverage-aware sampling for training and curiosity rewards for online adaptation. This approach enables finetuning with as few as 50 real environment trajectories, effectively adapting the model to unseen environments.

world models · hallucination · state-action space · coverage-aware sampling · curiosity rewards

Blackwell Approachability and Gradient Equilibrium are Equivalent

arXiv cs.LG · Brian W. Lee, Nika Haghtalab, Michael I. Jordan, Ryan J. Tibshirani · 2026-06-25

The paper establishes an algorithmic equivalence between gradient equilibrium (GEQ) and Blackwell approachability, showing that either problem can be solved using a black-box oracle for the other without asymptotic loss in error rate. This result connects GEQ to broader online learning frameworks like regret minimization and calibration through known equivalences with approachability. The reductions are efficient and enable transferring refined guarantees (e.g., optimism, strong adaptivity) between frameworks, while also characterizing necessary/sufficient conditions for GEQ and unifying constrained/unconstrained variants.

gradient equilibrium · blackwell approachability · online optimization · regret minimization · calibration

A Multi-Fidelity Convolutional Autoencoder-Transfer Learning Framework for Guided-Wave-Based Damage Diagnosis Using Large Simulated and Limited Experimental Datasets

arXiv cs.LG · Santosh Kapuria, Abhishek · 2026-06-25

The study introduces a multi-fidelity transfer learning framework combining convolutional autoencoders (CAE) and physics-based simulations for guided-wave structural health monitoring (GWSHM). The method employs a 1D spectral element model to generate synthetic pretraining data, then transfers learned features to experimental domains using limited labeled data. Results show superior performance over CNN baselines, achieving R² > 0.93 for damage localization and R² > 0.99 for sizing, with strong generalization to unseen damage scenarios.

guided-wave shm · convolutional autoencoder · multi-fidelity learning · spectral element model · damage diagnosis

Generative Models on Analog Hardware with Dynamics

arXiv cs.LG · Yu-Neng Wang, Sara Achour · 2026-06-25

The paper introduces Analog Interaction Systems (AIS), a framework for implementing generative models on analog hardware with fixed dynamics, addressing the expressivity gap via time-varying piecewise parameters and hidden physical states. A Wasserstein GAN training procedure is developed to train these models without trajectory constraints, enabling sparse, low-precision implementations. On MNIST and Fashion-MNIST, the 4-bit sparse oscillator-based AIS achieves FID scores of 27.6 and 80.8, outperforming prior analog models by 3-4× while reducing energy costs to 23 μJ per image (100× improvement over digital baselines).

analog interaction systems · wasserstein gan · sparse connectivity · low-bit-width quantization · fid score

Designing Reward Signals for Portable Query Generation: A Case Study in Industrial Semantic Job Search

arXiv cs.LG · Ping Liu, Qianqi Shen, Jianqiang Shen, Wenqiong Liu · 2026-06-25

The paper presents an RLAIF framework for generating portable job search queries that abstract seeker-specific identifiers while preserving qualifications. The study identifies adversarial reward surfaces where policy optimization exploits LLM-as-judge rubrics, leading to degenerate verbatim-copying behaviors. Empirical results show that robust reward shaping dominates performance for critic-free optimizers, with GRPO being particularly sensitive to spurious rewards. A rule-based reward floor mitigates exploitation, yielding a +0.147 quality improvement, while the training-time reward model inflates gains by 2.4×, underscoring the primacy of reward shaping over optimizer selection.

rlaif · reward shaping · llm-as-judge · grpo · verbatim-copying

Recovering Governing Equations from Solution Data: Identifiability Bounds for Linear and Nonlinear ODEs

arXiv cs.LG · Yang Pan, Helmut Bölcskei · 2026-06-25

The paper establishes theoretical conditions for uniquely identifying governing ODEs from solution data, addressing a gap in scientific machine learning. Using Hausdorff distance on solution sets as a metric, the authors derive identifiability bounds for linear and nonlinear ODE classes, including those with Lipschitz (Hölder)-continuous vector fields. They provide metric entropy estimates and sample complexity bounds, quantifying the number of solution observations required for reliable equation recovery.

governing equations · hausdorff distance · identifiability bounds · metric entropy · sample complexity

How Good Can Linear Models Be for Time-Series Forecasting?

arXiv cs.LG · Lang Huang, Jinglue Xu, Luke Darlow · 2026-06-25

The study demonstrates that optimized linear models can rival complex architectures in time-series forecasting by systematically tuning preprocessing hyperparameters. Using Ridge regression with closed-form solutions, the authors analyze context length, local normalization, regularization, and augmentation across eight benchmarks. Key findings include series-specific optimal lookback (power-law exponents from +0.46 to -0.19), preference for trailing-window normalization, and variable cross-series hyperparameter sharing. The optimized models outperform prior linear methods and match or exceed Transformer, MLP, and CNN baselines on six benchmarks, while revealing dataset structures through interpretable hyperparameters.

ridge regression · time-series forecasting · hyperparameter tuning · local normalization · context length
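
For reference, the closed-form Ridge solution mentioned in the summary above is the standard one below. The symbols are assumptions chosen for illustration, not the paper's notation: $X$ stacks lookback windows of length $L$ as rows, $Y$ stacks the matching forecast horizons, and $\lambda > 0$ is the regularization strength.

```latex
\hat{W} = \arg\min_{W} \; \lVert XW - Y \rVert_F^2 + \lambda \lVert W \rVert_F^2
        = \left( X^\top X + \lambda I \right)^{-1} X^\top Y
```

Because each fit is a single linear solve, sweeping the lookback $L$ and $\lambda$ per series is cheap, which is presumably what makes the systematic tuning described above practical.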

BetXplain: An Explanation-Annotated Dataset for Detecting Manipulative Betting Advertisements on Social Media

arXiv cs.LG · MSVPJ Sathvik, Parmitha Vangapadu, Nishit Rane, Sathwik Narkedimilli · 2026-06-25

The paper introduces BetXplain, an explanation-annotated dataset for detecting manipulative betting advertisements on social media, addressing the lack of publicly available annotated resources in this domain. The dataset comprises betting-related advertisements collected from Instagram and Reddit, manually annotated for manipulative and deceptive practices, and includes human-provided explanations for each annotation. This enables research into explainable approaches for detecting manipulative advertising. The analysis identifies common persuasive strategies in betting advertisements and their potential impact on users' mental health. The framework supports practical applications such as browser plugins for user warnings and automated web crawlers for regulatory monitoring.

explanation-annotated dataset · manipulative advertising · social media · persuasive techniques · automated detection

Ribbon: Scalable Approximation and Robust Uncertainty Quantification

arXiv cs.LG · Graham Gibson, John Tipton, Kellin Rumsey, Natalie Klein · 2026-06-25

Ribbon introduces a scalable approximation to Dirichlet-reweighted bootstrap uncertainty quantification, avoiding costly repeated model refitting through influence-function linearization around a single fitted model. The method preserves first-order data-reweighting structure of Bayesian bootstrap while requiring only post-hoc linear algebra, offering a calibrated Dirichlet-reweighting family with tunable uncertainty scale. Theoretical analysis shows asymptotic equivalence to flat-prior Laplace approximation under correct specification and recovery of robust sandwich covariance under misspecification. Empirical evaluation on synthetic regression, MNIST classification, and California Housing benchmarks demonstrates competitive predictive performance and improved calibration without retraining.

uncertainty quantification · bayesian bootstrap · influence function · dirichlet-reweighting · laplace approximation
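
A minimal sketch of the idea behind influence-function linearization of a Dirichlet-reweighted bootstrap, using the sample mean as a toy statistic. This is not Ribbon's implementation: the function names are hypothetical, and the sample mean is chosen only because its influence function (`x - mean`) is known in closed form, which makes the linearization exact and easy to check.

```python
import random


def dirichlet_weights(n, rng):
    """Draw Dirichlet(1, ..., 1) weights by normalizing Exp(1) draws (Bayesian bootstrap)."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def linearized_estimate(theta_hat, influence, weights):
    """First-order estimate under reweighting: theta_hat + sum_i w_i * IF_i."""
    return theta_hat + sum(w * i for w, i in zip(weights, influence))


def linearized_bootstrap_means(xs, n_draws=200, seed=0):
    """Bootstrap replicates of the mean from one fit, with no refitting per draw."""
    rng = random.Random(seed)
    n = len(xs)
    mean = sum(xs) / n
    influence = [x - mean for x in xs]  # influence function of the sample mean
    return [
        linearized_estimate(mean, influence, dirichlet_weights(n, rng))
        for _ in range(n_draws)
    ]
```

For the mean, each linearized replicate coincides with the exactly reweighted mean; for a fitted model the same formula would only be a first-order approximation around the single fit.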

RSPC: A Benchmark for Modeling Stress and Psychiatric Conditions in Digitally Mediated Relationships using Psychiatrist Annotations

arXiv cs.LG · Parmitha Vangapandu, Sai Ganesh Mokkapati, Sathwik Narkedimilli, MSVPJ Sathvik · 2026-06-25

The Relational Stress and Psychiatry Corpus (RSPC) introduces a benchmark for modeling mental health conditions in interpersonal contexts, addressing the limitation of individual-centric approaches. RSPC comprises 1,799 Reddit posts annotated by psychiatrists for mood disorders (e.g., anxiety, depression), relational stressor triggers, and relationship phases. Seven fine-tuned transformer models and five large language models were evaluated on multi-label disorder classification, relational trigger detection, and temporal phase prediction tasks. Claude-3-Haiku achieved the highest Macro-F1 (0.538) in disorder classification, while GPT-4o excelled in relational trigger detection (Macro-F1 = 0.519). The corpus highlights associations between anxiety disorders and chronic relational uncertainty, advancing context-aware mental health modeling.

relational stress · mood disorders · transformer models · multi-label classification · temporal phase prediction

Effective Covariance Dynamics in Solvable High-Dimensional GANs

arXiv cs.LG · Andrew Bond, Zafer Doğan · 2026-06-25

The paper extends solvable GAN analysis to structured latent covariance with class-dependent, correlated, and non-zero-mean features. Using a quadratic energy discriminator, it derives deterministic ODEs governing training dynamics via an effective covariance metric. Theoretical results show a learnability threshold for mode-wise recovery, revealing a signal-boosting mechanism where low-rank correlations enhance weak directions. Numerical simulations confirm the ODE predictions and phase boundaries. Experiments on MNIST, FashionMNIST, and CIFAR-10 demonstrate improved subspace alignment with data-driven covariance priors.

generative adversarial networks · effective covariance · solvable dynamics · mode-wise recovery · subspace alignment

The Geometry of Updates: Fisher Alignment at Vocabulary Scale

arXiv cs.LG · John Sweeney · 2026-06-25

The paper introduces FisherSketch, a method for efficient Fisher alignment estimation in large language models (LLMs) with shared vocabularies, addressing the computational infeasibility of classical update-geometry metrics at vocabulary scale. FisherSketch computes head Fisher alignment as a cosine between kernel mean embeddings in joint activation-error space, enabling practical estimation in a single streaming pass with minimal memory overhead (16 KB task signature, 192 KB streaming state). Experiments on Llama-3.1-8B demonstrate its utility in distinguishing task similarity driven by activations, errors, or their coupling, even when representation similarity metrics fail.

fisher alignment · kernel mean embeddings · vocabulary scale · activation-error space · streaming pass

Hierarchical Muon: Tiled Newton-Schulz Updates for Efficient Muon Optimization

arXiv cs.LG · Ziyuan Tang, Tianshi Xu, Yousef Saad, Yuanzhe Xi · 2026-06-25

Hierarchical Muon (HiMuon) proposes a tiled Newton-Schulz scheme for Muon-type optimizers, reducing computational complexity while preserving training performance. The method partitions momentum-gradient matrices into $T \times T$ tiles, applies independent Newton-Schulz updates per tile ($O(H W T K)$ work), and reassembles results, enabling GPU optimizations like cross-layer batching. Experiments on transformer training demonstrate maintained full-matrix Muon performance with improved step efficiency. The approach trades off spectral interactions across tiles for computational benefits via localized matrix-function maps.

muon-type optimizers · newton-schulz iteration · matrix tiling · gpu acceleration · transformer training
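
The tiling step described in the HiMuon summary above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names are hypothetical, and it uses the textbook cubic Newton-Schulz iteration (X ← 1.5·X − 0.5·X·Xᵀ·X) rather than whatever polynomial the paper or Muon implementations actually use.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def newton_schulz(g, steps=30):
    """Cubic Newton-Schulz: pushes the nonzero singular values of g toward 1."""
    # Frobenius normalization keeps every singular value in [0, 1],
    # inside the convergence region of the cubic iteration.
    norm = math.sqrt(sum(v * v for row in g for v in row)) or 1.0
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x


def tiled_newton_schulz(g, t, steps=30):
    """Apply Newton-Schulz independently to each t-by-t tile and reassemble."""
    h, w = len(g), len(g[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(0, h, t):
        for j in range(0, w, t):
            tile = [row[j:j + t] for row in g[i:i + t]]
            ortho = newton_schulz(tile, steps)
            for di, row in enumerate(ortho):
                out[i + di][j:j + len(row)] = row
    return out
```

Each tile is orthogonalized without reference to its neighbours, which illustrates the trade-off the summary mentions: cross-tile spectral interactions are dropped in exchange for smaller, independent matrix products.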

Graph Neural Networks Applications Across Domains: All Insights You Need

arXiv cs.LG · Abderaouf Bahi · 2026-06-25

The survey systematizes graph neural network (GNN) applications across 12 domains by establishing a unified design space grounded in spectral/spatial formulations and Weisfeiler-Leman expressivity. It analyzes domain-specific graph construction trade-offs, architectural preferences, and performance drivers while controlling for baseline artifacts. Cross-domain analysis reveals recurrent challenges: heterophily and scale limitations, temporal graph complexity gaps, and deployment-practice disparities versus leaderboard performance. Methodological constraints—including over-smoothing, over-squashing, and robustness—are framed as adoption barriers rather than ancillary concerns.

weisfeiler-leman · heterophily · over-squashing · message-passing · graph-construction

Explaining Temporal Graph Neural Networks via Feature-induced Information Flow

arXiv cs.LG · Ping Xiong, Thomas Schnake, Klaus-Robert Müller, Shinichi Nakajima · 2026-06-25

The authors propose a novel attribution method for explaining Event-based Temporal Graph Neural Networks (ETGNNs) by analyzing complete information flow through event-associated variables. Building on the Normalized Relevance Measure (NRM) framework, their approach explicitly quantifies information flow from event embeddings and through event-induced variables, while enabling cross-layer comparability and higher-order interaction analysis. The method incorporates modular decomposition to handle ETGNN architectural complexity. Evaluations on synthetic epidemic tracing and social dynamics datasets, plus a real-world political event network, demonstrate superior performance over existing explanation methods while yielding more interpretable results.

temporal graph neural networks · information flow · normalized relevance measure · event-induced variables · modular decomposition

Forecasting With LLMs: Improved Generalization Through Feature Steering

arXiv cs.LG · Humzah Merchant, Bradford Levy · 2026-06-25

The study demonstrates that feature steering in large language models (LLMs) can improve forecasting generalization by reducing look-ahead bias. Using sparse autoencoders, the authors identify temporal reasoning features and intervene on them during inference. Results show that amplifying time-awareness features reduces look-ahead bias by 47% while maintaining baseline reasoning performance, whereas manipulating candidate look-ahead-bias features has no significant effect. This suggests interpretable temporal features enable causal control over LLM reasoning strategies.

large language models · feature steering · sparse autoencoders · look-ahead bias · temporal reasoning

RecallRisk-BERT: A Multi-Task Framework for Post-Report Medical Device Recall Triage

arXiv cs.LG · Ali Semih Atalay, Sevgi Yigit-Sert · 2026-06-25

The authors introduce RecallRisk-BERT, a multi-task framework for automated post-report medical device recall triage using 54,165 FDA records from 2002 to 2025. The model combines PubMedBERT-based textual representations of recall narratives with embedding-based structured categorical features to simultaneously predict recall severity (Class I/II/III) and root-cause categories (9 classes). RecallRisk-BERT outperformed single-task PubMedBERT baselines, achieving strong risk ranking consistency (rho = 0.983) and demonstrating that text-tabular learning supports scalable recall triage, regulatory decision-making, and root-cause risk analysis.

multitask learning · pubmedbert · text-tabular · recall triage · root-cause analysis

Stochastic Gradient Optimization with Model-Assisted Sampling

arXiv cs.LG · Jonne Pohjankukka, Jukka Heikkonen · 2026-06-25

The paper proposes a model-assisted sampling framework for stochastic gradient optimization, bridging survey sampling theory with machine learning to reduce gradient estimation variance. By treating gradients as sample-based estimates and incorporating auxiliary gradient-prediction models, the method constructs efficient estimators while maintaining compatibility with existing optimizers. Empirical evaluation on six benchmarks shows performance gains in 71-86% of cases, with AdamW achieving better generalization in roughly half the training epochs compared to baselines.

stochastic gradient optimization · variance reduction · model-assisted sampling · gradient-prediction models · adamw
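
To make the survey-sampling connection in the summary above concrete, here is a sketch of the classic difference estimator on scalar values. It is an assumption that this resembles the paper's estimator; the function and argument names are hypothetical, and real gradients would be vectors rather than scalars.

```python
def difference_estimator(predicted, sampled_idx, sampled_true):
    """Estimate the population mean from cheap predictions plus a sampled correction.

    predicted    : auxiliary-model prediction for every item in the population
    sampled_idx  : indices of the items whose true value was actually computed
    sampled_true : the true values for those sampled items, in the same order
    """
    model_mean = sum(predicted) / len(predicted)
    # Average residual (true - predicted) on the sample corrects the model's bias.
    correction = sum(
        y - predicted[i] for i, y in zip(sampled_idx, sampled_true)
    ) / len(sampled_idx)
    return model_mean + correction
```

When the auxiliary predictions track the true values well, the residuals vary much less than the raw values, so the estimate is less noisy than a plain sample mean over the same items.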

DMuon: Efficient Distributed Muon Training with Near-Adam Overhead

arXiv cs.LG · Vincent Chen, Starrick Liu, Regis Cheng, Dance Yang · 2026-06-25

DMuon introduces an efficient distributed implementation of the Muon optimizer, reducing the computational overhead of matrix-orthogonalization-based optimization in deep learning. The method integrates as a drop-in module without framework modifications, leveraging distributed infrastructure to parallelize costly Newton-Schulz iterations. Evaluations on embodied foundation models and LLMs demonstrate 1.48x-3.01x end-to-end speedups and 6.85x-163.00x optimizer-step speedups, achieving near-AdamW latency while preserving Muon's convergence benefits.

matrix-orthogonalization · distributed training · newton-schulz · optimizer-step · llm

fTNN: a tensor neural network for fractional PDEs

arXiv cs.LG · Qingkui Ma, Hehu Xie, Xiaobo Yin · 2026-06-25

The fTNN introduces a deterministic tensor neural network subspace method for solving fractional PDEs involving the fractional Laplacian on bounded domains. It employs a geometry-adapted integration split to decompose the fractional Laplacian into singular near-field, regular interior far-field, and analytical exterior far-field contributions, treated via Gauss-Jacobi quadrature, Gauss quadrature, and deterministic angular quadrature. Boundary-singularity-aware trial functions and spatiotemporally separable neural networks enhance accuracy and efficiency. Numerical experiments demonstrate superior performance over fPINN and Monte Carlo baselines, particularly for problems with strong boundary singularities and long-time simulations.

fractional laplacian · tensor neural network · gauss-jacobi quadrature · boundary singularity · spatiotemporal separation

Kolmogorov Arnold networks (KAN) for aerodynamic prediction: a comparison with MLPs and GNNs

arXiv cs.LG · Miguel Jaraiz, Fermin Gutierrez, Pablo Yeste, Miguel Sánchez-Domínguez · 2026-06-25

The study evaluates Kolmogorov Arnold networks (KANs) for aerodynamic prediction, comparing them against multilayer perceptrons (MLPs) and graph neural networks (GNNs) on surface pressure distribution tasks for subsonic and transonic airfoils. KANs, which adapt activation functions rather than affine transformations, interpolate well across Mach numbers and angles of attack but perform marginally worse than MLPs and significantly worse than GNNs, despite their lower model complexity. While KANs train faster, they suffer from instability and hyperparameter sensitivity, which currently limits their competitiveness in this domain.

kolmogorov arnold networks · multilayer perceptrons · graph neural networks · aerodynamic prediction · hyperparameter optimization

Cross-Head Attention Uplift Network with Inverse Propensity Score under Unobserved Confounding

arXiv cs.LG · Haoran Zhang, Chuanpu Li, Yuxin Fu, Bin Tong · 2026-06-25

The paper introduces Cross-Head Attention Uplift Network (CHAUN) and Robust Adversarial Inverse Propensity Score (RA-IPS) to address uplift modeling challenges under unobserved confounding. CHAUN leverages shared feature embeddings and cross-head attention to dynamically integrate treatment-specific and control-specific representations, while RA-IPS adversarially optimizes propensity weights within constrained uncertainty sets. Theoretical analysis shows ITE identifiability with true propensity scores. Experiments on CRITEO-UPLIFT, LAZADA, and an e-commerce dataset demonstrate CHAUN's 25.6% QINI score improvement over baselines, with RA-IPS providing 5.4% robustness gain over standard IPS.

uplift modeling · individual treatment effects · cross-head attention · inverse propensity score · unobserved confounding
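
For context on the "standard IPS" baseline that RA-IPS is compared against in the summary above, the following is a sketch of the textbook inverse-propensity estimate of average uplift. It does not implement RA-IPS or CHAUN, and the function name is hypothetical.

```python
def ips_uplift(treated, outcomes, propensities):
    """Inverse-propensity-score estimate of the average treatment effect.

    treated      : 1 if the unit received treatment, else 0
    outcomes     : observed outcome for each unit
    propensities : estimated P(treated = 1 | features), strictly between 0 and 1
    """
    total = 0.0
    for t, y, e in zip(treated, outcomes, propensities):
        # Treated units are up-weighted by 1/e, control units by 1/(1 - e).
        total += t * y / e - (1 - t) * y / (1 - e)
    return total / len(outcomes)
```

The estimate is only as good as the propensities: under unobserved confounding they are misspecified, which is the gap the adversarially chosen weights in the summary are meant to address.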

Transformer-Based Classification of Bacterial Raman Spectra with LOOCV

arXiv cs.LG · Jamile Mohammad Jafari, Thomas Bocklitz · 2026-06-25

A transformer-based model demonstrates superior performance for bacterial Raman spectral classification, evaluated via nested leave-one-replicate-out cross-validation on 5,417 single-cell spectra from six species. The approach outperformed conventional pipelines (PCA/ICA with LDA, SVM, Random Forest) in classification accuracy and latent space separation, while maintaining robustness to raw spectral inputs without preprocessing. Results highlight transformers' potential for spectroscopic analysis and underscore the importance of replicate-aware validation in model assessment.

transformer · raman spectroscopy · leave-one-replicate-out · latent space · spectral classification

Finding Stationary Points by Comparisons

arXiv cs.LG · Helin Wang, Chenyi Zhang, Xiwen Tao, Yexin Zhang · 2026-06-25

The paper presents novel algorithms for finding ε-stationary points of non-convex functions using only a comparison oracle. For classical computation, the method employs a Hessian estimation subroutine requiring Õ(n²log(1/δ)) queries to achieve δ-accuracy, yielding an overall Õ(n²/ε^1.5) query complexity for ε-stationarity. The quantum variant reduces this to Õ(n/ε^1.5) queries by leveraging superposition queries. Both approaches assume twice-differentiable functions with Lipschitz-continuous gradients and Hessians.

comparison oracle · stationary points · non-convex optimization · quantum algorithm · hessian estimation

Symplectic Neural Networks for learning Generalized Hamiltonians

arXiv cs.LG · Harsh Choudhary, Vyacheslav Kungurtsev, Chandan Gupta, Melvin Leok · 2026-06-25

The paper introduces a symplectic neural network framework for learning generalized Hamiltonians from noisy trajectory observations, addressing computational challenges in implicit symplectic integration. The method leverages symplectic discretizations of adjoint systems to enable efficient backpropagation through an implicit symplectic integrator, using predictor-corrector ODE solvers and fixed-point iteration to reduce computational overhead. Experiments demonstrate improved system identification and energy preservation in non-separable chaotic systems, with backward error analysis yielding more accurate Hamiltonian approximations without finer discretizations.

hamiltonian neural networks · symplectic integrators · adjoint sensitivity · backward error analysis · chaotic systems

Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA

arXiv cs.LG · Eren Senoglu, Federico Toschi, Nicolo Brunello, Andrea Sassella · 2026-06-25

The work introduces a training framework to improve verbalized uncertainty calibration in multimodal large language models (MLLMs) for Medical Visual Question Answering (VQA). The method employs a composite loss function with four components: a Brier-style calibration term, an anchor regularizer, a contrastive image-text alignment term derived from a $2 \times 2$ factorial perturbation design, and a top-K KL divergence regularizer. Evaluated on three Medical VQA benchmarks using MedGemma 4B IT and Qwen2 VL 7B Instruct, the approach reduces calibration error by ≥60%, improves discrimination by ≥26%, and maintains predictive accuracy, outperforming existing methods. Ablations confirm each loss component's necessity.

multimodal large language models · uncertainty calibration · medical vqa · contrastive alignment · kl divergence
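
As a reference point for the "Brier-style calibration term" in the summary above, the plain Brier score between stated confidences and correctness can be sketched as below. This is the standard metric only, not the paper's composite loss, and the function name is hypothetical.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence (0..1) and correctness (0 or 1)."""
    return sum(
        (c - float(y)) ** 2 for c, y in zip(confidences, correct)
    ) / len(confidences)
```

A model that says 1.0 when right and 0.0 when wrong scores 0, while one that always says 0.5 scores 0.25 regardless of accuracy.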

A Generalization Theory for JEPA-Based World Models

arXiv cs.LG · Jingyi Cui, Qi Zhang, Hongwei Wen, Yisen Wang · 2026-06-25

The paper presents what the authors describe as the first generalization theory for world models based on Joint Embedding Predictive Architectures (JEPAs), addressing the limited theoretical understanding of this empirically successful paradigm. By formulating JEPA pretraining as a conditional spectral graph learning problem, the authors show its equivalence to low-rank factorization of an action-conditioned co-occurrence matrix. The analysis connects pretraining error to downstream planning regret, yielding a finite-sample generalization bound. The results reveal an inherent trade-off between approximation and sample errors with respect to latent dimension, providing theoretical insights into the advantages and limitations of latent predictive models compared to input-level approaches.

joint embedding predictive architectures · conditional spectral graph learning · action-conditioned co-occurrence matrix · finite-sample generalization bound · latent predictive models

Uncertainty quantification via conformal prediction in data assimilation

arXiv cs.LG · Catherine George, Alireza Javanmardi, Tijana Janjić, Eyke Hüllermeier · 2026-06-25

The study evaluates conformal prediction (CP) for uncertainty quantification in a simplified atmospheric model, comparing three CP variants (Standard CP, Normalized CP, Conformalized Quantile Regression) against ensemble-based methods. Using the 1D modified shallow water model, metrics include empirical coverage, interval length, and average interval score loss. Results demonstrate CP's potential to complement ensemble methods, with analysis of CP-derived perturbations in data assimilation cycles revealing method-specific trade-offs in uncertainty estimation.

conformal prediction · uncertainty quantification · data assimilation · shallow water model · probabilistic forecasting
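
The "Standard CP" variant named in the summary above is commonly implemented as split conformal prediction with absolute-residual scores. The sketch below assumes that form; the function name and arguments are hypothetical and unrelated to the study's code.

```python
import math


def split_conformal_interval(cal_preds, cal_targets, new_pred, alpha=0.1):
    """Return a (lower, upper) interval with target coverage 1 - alpha.

    cal_preds / cal_targets : predictions and true values on a held-out calibration set
    new_pred                : point prediction for the new input
    """
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    # Finite-sample corrected rank of the calibration score to use as the half-width.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to certify this coverage level.
        return (-math.inf, math.inf)
    q = scores[rank - 1]
    return (new_pred - q, new_pred + q)
```

The interval has the same width for every input, which is the limitation that the normalized and quantile-regression variants mentioned in the summary try to relax.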

RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning

arXiv cs.LG · Rongjian Chen, Jianmin Hu, Kejiang Ye, Minxian Xu · 2026-06-25

RolloutPipe introduces a framework for overlapping pipelined rollout and training in disaggregated on-policy LLM reinforcement learning, addressing idle trainer GPU pools in synchronous systems and stale data in asynchronous pipelines. The method employs complete-group pipelining (CGP) to dispatch trainable groups FIFO and frontier-group dispatch (FGD) to prioritize frontier-group requests, enabling earlier training starts while maintaining on-policy correctness. Evaluated on Qwen3-1.7B across four benchmarks, RolloutPipe reduces rollout-to-train-end time by 30.7%-42.3% and trainer waiting ratio by 37%-76% versus Slime.

reinforcement learning · disaggregated architecture · on-policy training · pipelining · rollout generation

Enabling self-supervised learned primal dual with Noise2Inverse

arXiv cs.LG · Antti Sällinen, Siiri Rautio, Santeri Kaupinmäki, Andreas Hauptmann · 2026-06-25

The paper proposes Noise2Inverse Learned Primal-Dual (N2I-LPD), a self-supervised method for X-ray computed tomography reconstruction that eliminates the need for ground-truth data. The approach extends the Noise2Inverse framework to the Learned Primal-Dual algorithm by leveraging statistical independence of noise across angular CT measurements. N2I-LPD trains a learned iterative reconstruction operator without supervised data, outperforming both classical methods and U-Net models trained within the same Noise2Inverse framework. Experimental results demonstrate improved reconstruction quality, validating the effectiveness of combining learned reconstruction operators with self-supervised training in low-dose and sparse-angle CT scenarios.

computed tomography · self-supervised learning · noise2inverse · learned primal-dual · reconstruction quality

Geometric Gradient Rectification for Safe Open-Set Semi-Supervised Learning

arXiv cs.LG · Jiahe Chen, Qian Shao, Qiyuan Chen, Jiaying He · 2026-06-25

The paper introduces Geometric Gradient Rectification (GGR), a framework for open-set semi-supervised learning that addresses conflicting gradients from pseudo-labeled outliers. GGR projects auxiliary gradients onto an admissible region aligned with supervised gradients, preserving useful orthogonal components while mitigating harmful updates. The method includes subspace-aware rectification to stabilize gradients under noisy mini-batches. Evaluations on CIFAR and ImageNet demonstrate improved closed-set generalization and open-set robustness over baseline methods.

gradient rectification · open-set learning · semi-supervised learning · out-of-distribution · subspace-aware

Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries

arXiv cs.LG · Prarabdh Shukla, Ritik, Suhas Rao, Arpit Agarwal · 2026-06-25

The paper proposes a bandit-based framework for efficiently selecting optimal jailbreaks in LLMs, enabling non-expert malicious actors to elicit harmful responses. The method employs noisy exploration on a small query set to learn an exploitation policy, applied to a curated benchmark of 11,279 malicious queries (FrankensteinBench) categorized by complexity. Results demonstrate 97% average success rates across 15 state-of-the-art open-weight LLMs, with complex queries increasing success rates by up to 26%. This confirms the feasibility of automated, non-expert jailbreaking attacks.

multi-armed bandit · jailbreaking · llms · frankensteinbench · exploration-exploitation

Tractography-Driven Synthetic Data Generation for Fiber Bundle Segmentation in Tracer Histology

arXiv cs.LG · Kyriaki-Margarita Bintsi, Sparsh Makharia, Yaël Balbastre, Joselyn Romero Avila · 2026-06-25

The authors propose a synthetic-data augmented framework for automated fiber bundle segmentation in macaque tracer histology, reducing manual annotation needs by 3x. Their method leverages ex vivo dMRI tractography as a generative prior to synthesize 2D image patches, combining foreground texture from tractography with backgrounds from blockface photos and applying domain randomization. A 2D U-Net trained on mixed real and synthetic patches shows improved generalization across brains and fiber bundle densities compared to real-data-only training, though synthetic-only training performs poorly. The approach matches state-of-the-art performance with significantly less manual annotation.

tractography · synthetic data generation · fiber bundle segmentation · domain randomization · u-net

Asymptotically Optimal Learning for Parametric Prophet Inequalities

arXiv cs.LG · Jung-hun Kim, Anna Grebennikova, Vianney Perchet · 2026-06-25

The paper introduces an asymptotically optimal learning algorithm for parametric prophet inequalities with i.i.d. rewards from exponential-type parametric families, including exponential, Pareto, and bounded-support power-family distributions. The authors characterize the optimal full-information asymptotic competitive ratio, deriving explicit limits for unbounded and bounded-support cases. They propose a confidence-based dynamic-programming policy that achieves the optimal ratio using only online observations, without offline samples. Distribution-specific convergence rates are derived for canonical examples, and numerical experiments validate the algorithm's performance.

prophet inequalities · dynamic programming · parametric families · competitive ratio · online learning
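
For intuition about the full-information benchmark mentioned above, here is a minimal sketch of backward-induction (dynamic-programming) stopping thresholds for i.i.d. exponential rewards with a known rate. It is not the authors' confidence-based learning policy; the function names and the choice of n = 20 and rate 1.0 are illustrative assumptions.

```python
import math

def exponential_thresholds(n, rate):
    """Continuation values for n i.i.d. Exp(rate) rewards with known rate.

    values[k] is the expected reward of the optimal stopping rule when k
    observations remain. With k observations left, the rule accepts the
    current reward x when x >= values[k - 1] (the value of continuing).
    """
    values = [0.0]  # nothing left to observe, so the value is 0
    for _ in range(n):
        v = values[-1]
        # For X ~ Exp(rate): E[max(X, v)] = v + exp(-rate * v) / rate
        values.append(v + math.exp(-rate * v) / rate)
    return values

def prophet_value(n, rate):
    """E[max of n i.i.d. Exp(rate)] equals the harmonic number H_n over rate."""
    return sum(1.0 / k for k in range(1, n + 1)) / rate

n, rate = 20, 1.0  # illustrative values, not taken from the paper
values = exponential_thresholds(n, rate)
ratio = values[n] / prophet_value(n, rate)
print(f"gambler={values[n]:.3f} prophet={prophet_value(n, rate):.3f} ratio={ratio:.3f}")
```

The ratio printed at the end is the competitive ratio of the known-parameter policy for this horizon; a learning policy has to approach it while estimating the rate online.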

Accelerated sampling using SamAdams variable timesteps and position-adaptive Langevin dynamics

arXiv cs.LG · Benedict Leimkuhler, Peter A. Whalley · 2026-06-25

The paper introduces SA-PAL, an accelerated Langevin sampling method combining SamAdams adaptive timestepping and position-adaptive Langevin (PAL) dynamics. SamAdams adapts the integration step size in stiff regions of phase space via a stiffness monitor, and PAL concentrates friction along force directions while preserving the canonical distribution. Implemented as a palindromic integrator requiring one force evaluation per step, SA-PAL demonstrates 1.5-3× faster mixing on Rosenbrock and Mueller-Brown potentials, and >10× efficiency gains in entropic channel and Bayesian sparsity problems.

langevin dynamics · adaptive timestepping · sampling acceleration · palindromic integrator · stiffness monitor

Heterogeneous Neural Predictivity from Language Models During Naturalistic Comprehension

arXiv cs.LG · Xiao Jia · 2026-06-25

The study tests whether frozen language model representations predict neural activity during naturalistic language comprehension, evaluating predictive utility under explicit controls across multiple datasets. Using eight frozen models with blocked encoding on Brain Treebank, MEG-MASC, and Podcast ECoG data, the authors applied temporal, nuisance, and representation-capacity controls. 67 of 432 evaluable rows met the predictive criteria; feature ablations altered predictions in most cases, and the controls confirmed pipeline sensitivity. Predictive advantages were localized, and the response profiles constrain computational interpretations, separating predictive utility from claims about shared neural organization.

frozen language models · neural predictors · blocked encoding · feature ablations · naturalistic comprehension

Scalable Message-Passing Quantum Graph Neural Networks in the Weisfeiler-Leman Hierarchy

arXiv cs.LG · Snehal Raj, Brian Coyle, Léo Monbroussou, André J. Ferreira-Martins · 2026-06-25

The authors introduce a quantum graph neural network (QGNN) framework that performs message passing while maintaining permutation equivariance and operating at a specified level of the Weisfeiler-Leman hierarchy, a measure of graph distinguishability. The method leverages pre-training on small graph instances to mitigate variational quantum circuit training challenges and ensures scalable readout costs as graph size increases. Validation on synthetic graphs, molecular property prediction, and the travelling salesperson problem demonstrates the framework's effectiveness across datasets, with simulations scaling up to 56 qubits. This approach bridges graph learning principles with quantum circuit design, offering theoretical guarantees and practical scalability.

quantum graph neural network · message passing · weisfeiler-leman hierarchy · permutation equivariance · variational quantum circuits
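
As background on the Weisfeiler-Leman hierarchy referenced above, the sketch below implements classical 1-WL colour refinement, the lowest level of that hierarchy. It is a purely classical illustration, not the quantum circuit construction from the paper; the graph encoding (a dict from node to neighbour list) and the example graphs are assumptions made for the example.

```python
from collections import Counter

def wl_histogram(adj, rounds=3):
    """1-WL colour refinement; adj maps each node to a list of neighbours.

    Colours are kept as nested tuples (not relabelled per graph) so that
    histograms from different graphs can be compared directly.
    """
    colors = {v: 0 for v in adj}  # uniform initial colouring
    for _ in range(rounds):
        # New colour = own colour plus the sorted multiset of neighbour colours
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())

# Two 2-regular graphs on six nodes: 1-WL cannot tell them apart.
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}

# A path and a star on four nodes: 1-WL separates them after one round.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wl_histogram(hexagon) == wl_histogram(two_triangles))
print(wl_histogram(path) == wl_histogram(star))
```

Higher levels of the hierarchy refine colours on tuples of nodes and can separate pairs such as the hexagon and the two triangles, at higher cost.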

Quantization in Federated Learning: Methods, Challenges and Future Directions

arXiv cs.LG · Farwa Ikram, Dipanwita Thakur, Antonella Guzzo, Giancarlo Fortino · 2026-06-25

This survey establishes quantization as a fundamental systems component in Federated Learning (FL), proposing a novel FL-centric taxonomy organized around client heterogeneity, aggregation consistency, and other FL-specific dimensions. It systematically reviews quantization methods, analyzing their interactions with core FL behaviors like client drift, partial participation, and differential privacy. The paper identifies open research gaps and provides design guidelines for deploying quantized FL on mobile, IoT, and edge platforms, emphasizing its role in mitigating communication bottlenecks and device heterogeneity.

federated learning · quantization · client heterogeneity · non-iid data · edge platforms

Reasoning Quality Emerges Early: Data Curation for Reasoning Models

arXiv cs.LG · Hongyi Henry Jin, Wenhan Yang, Meysam Ghaffari, Carlos Morato · 2026-06-25

The paper introduces an efficient method for curating high-quality supervised fine-tuning (SFT) data for reasoning tasks by leveraging early reasoning tokens. By analyzing the loss of the first 100-1000 reasoning tokens across perturbed checkpoints of a pretrained model, the method identifies diverse and challenging examples without relying on strong reasoning models. Experiments on Qwen2.5-7B and Llama3.1-8B using M23K and OpenThoughts-Math datasets demonstrate a 1.7% performance improvement with 91% greater token efficiency compared to baselines.

supervised fine-tuning · reasoning traces · perturbed checkpoints · token efficiency · gradient similarity

Reproducibility Study of "AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models"

arXiv cs.LG · Ananth K S, Arya Hariharan · 2026-06-25

This reproducibility study evaluates AlphaEdit, a null-space constrained projection method for knowledge editing in language models, originally proposed by Fang et al. (2025). The authors replicate AlphaEdit's results on LLaMA3, GPT2-XL, and GPT-J, while extending evaluation to newer architectures, additional benchmarks (BoolQ, HellaSwag, XSTest), and longer sequential editing horizons. Results confirm AlphaEdit's performance within its original scope but reveal limitations: its advantages do not generalize uniformly across architectures, and its null-space projection's protection against catastrophic forgetting degrades with higher edit counts. Sequential editing also impacts downstream task performance and safety-relevant behavior, highlighting practical deployment constraints.

null-space projection · knowledge editing · sequential editing · catastrophic forgetting · locate-then-edit

LearniBridge: Learnable Calibration of Feature Caching for Diffusion Models Acceleration

arXiv cs.LG · Xuyue Huang, Zhe Chen, Wang Shen, Xiao-Ping Zhang · 2026-06-25

LearniBridge introduces a learnable calibration mechanism for feature caching in Diffusion Transformers (DiTs) to address error accumulation at high acceleration ratios. The method leverages a shared low-rank subspace across prompts, enabling effective calibration through lightweight LoRA updates with only 3-5 training samples. Experiments demonstrate 5.87×, 5.75×, and 4.10× acceleration on FLUX, HunyuanVideo, and WAN2.1 respectively, with a 1.28% VBench improvement over SOTA on WAN2.1 at 4.10× acceleration.

diffusion transformers · feature caching · low-rank subspace · lora updates · acceleration ratios

Evaluation Pitfalls and Challenges in Multimedia Event Extraction

arXiv cs.LG · Philipp Seeberger, Steffen Freisinger, Tobias Bocklet, Korbinian Riedhammer · 2026-06-25

The study conducts the first systematic analysis of evaluation pitfalls in multimedia event extraction, identifying three major issues: inconsistent data processing, task assumptions, and overly relaxed evaluation settings. Through controlled experiments under a strict framework, the authors demonstrate that minor evaluation choices cause significant performance variations and overestimate models' cross-modal grounding capabilities. Results emphasize the need for standardized evaluation protocols to ensure reliable progress in this domain.

multimedia event extraction · cross-modal grounding · evaluation framework · data processing inconsistency · task assumptions

Escaping Iterative Parameter-Space Noise: Differentially Private Learning with a Hypernetwork

arXiv cs.LG · Naoki Nishikawa, Shokichi Takakura, Satoshi Hasegawa · 2026-06-25

We introduce a differentially private (DP) learning framework that avoids iterative parameter-space noise by employing a hypernetwork trained on public datasets. Instead of updating the target model with privatized gradients, private data is embedded into a low-dimensional representation, aggregated, perturbed once for DP, and mapped to target model parameters via the hypernetwork. Theoretical analysis demonstrates higher utility under a fixed privacy budget compared to DP-SGD. Empirical evaluation on LoRA fine-tuning of diffusion models shows improved FID scores over DP-SGD and other public-data-guided methods.

differentially private · hypernetwork · low-dimensional embedding · dp-sgd · lora fine-tuning

ProtoKV: Streaming Video Understanding under Delayed Query with Summary-State Memory

arXiv cs.LG · Le Tu Ngoc Minh, Jinyeong Lim, Dongsu Han · 2026-06-25

ProtoKV introduces a constant-memory solution for streaming video understanding (SVU) that handles delayed queries by maintaining a hybrid memory architecture. It combines an exact near-window KV cache with a fixed-capacity summary state of far history, represented as semantic-spatial prototypes with residual statistics. These prototypes are exposed via pseudo-tokens compatible with standard attention mechanisms. Evaluations show ProtoKV improves accuracy by up to 12.5 points over token-retention baselines on SVU benchmarks, particularly in long-delay regimes where query latency is significant.

streaming video understanding · kv cache · semantic-spatial prototypes · delayed query · attention mechanisms

Batch-Invariant Spectral Intelligence for Robust and Explainable Insect Authentication

arXiv cs.LG · Majharulislam Babor, Giacomo Rossi, Annalisa Altavilla, Oliver Schlüter · 2026-06-25

The paper introduces Batch-Invariant Spectral Network (BISN), an end-to-end framework for robust insect species authentication using near-infrared spectroscopy. BISN combines a learnable preprocessing module (initialized with Savitzky-Golay filtering) with an entropy-regularized adversarial objective to suppress batch-specific spectral variation before feature extraction. Evaluated on 2,700 spectra from three insect species across three production batches, BISN achieves 0.93 mean leave-one-batch-out accuracy (SD=0.04), outperforming baselines by 4%. Explainable AI analysis confirms model reliance on lipid and protein absorption regions, aligning with known biochemistry.

batch-invariant learning · near-infrared spectroscopy · adversarial training · savitzky-golay filtering · explainable ai

Structure Before Collapse: Transient semantic geometry in next-token prediction

arXiv cs.LG · Yize Zhao, Isabel Papadimitriou, Christos Thrampoulidis · 2026-06-25

The paper investigates how next-token prediction models learn latent semantic structure despite training on one-hot labels, which theoretically should lead to symmetric representations. Through three synthetic controlled settings with latent semantic factors, the authors demonstrate that semantic geometry emerges early in training, clustering representations by shared attributes without explicit supervision. This structure is transient, as models eventually reach a symmetric state where all representations are equally separated. The study employs Gram matrix analysis and proposes a modification to the unconstrained features model to capture emergent semantic geometry.

neural collapse · next-token prediction · semantic geometry · gram matrix · latent structure

HyperDFlash: MHC-Aligned Block Speculative Decoding with Gated Residual Reduction

arXiv cs.LG · Luxi Lin, Shuang Peng, Rui Ma, Junhao Hua · 2026-06-25

HyperDFlash introduces a block-parallel speculative decoding framework optimized for DeepSeek-V4's multi-hyper-connection (MHC) architecture, addressing feature misalignment in residual streams. The method employs pre-collapse residual states for multi-path structural preservation and a lightweight gated residual reducer for efficient path aggregation, reducing parameters by three orders of magnitude while maintaining alignment. Enhanced training via KL distillation loss improves draft quality. Evaluations on math reasoning, code synthesis, and conversational benchmarks demonstrate HyperDFlash's superiority over Multi-Token Prediction and vanilla DFlash, achieving significant gains in accepted draft length and decoding speedup.

speculative decoding · multi-hyper-connection · gated residual reducer · kl distillation · block-parallel

State-Specific Respiratory Signatures for Affective and Stress Recognition: Interpretable Respiratory Markers, Autocorrelation Lags, and Compact CNN Models

arXiv cs.LG · Andrei Velichko, Mehmet Tahir Huyut · 2026-06-25

This work introduces state-specific respiratory signatures for affective and stress recognition, combining compact 1D convolutional neural networks (1D-CNNs) with interpretable handcrafted respiratory features. The method analyzes 60-second windows from the WESAD dataset's chest respiratory channel under leave-one-subject-out validation, organizing features into respiratory timing, variability, waveform statistics, spectral descriptors, and autocorrelation lags (Zpm/Zmp). The raw-signal CNN achieved 96.72% accuracy for stress detection, while compact feature models excelled in non-stress conditions like meditation (MCC 88.65%). The results suggest that raw-signal CNNs are well suited to stress detection, while interpretable feature signatures are effective for non-stress states.

respiratory signatures · 1d-cnn · autocorrelation lags · wesad dataset · leave-one-subject-out

DroidBreaker: Practical and Functional Problem-Space Attacks on Machine-Learning Android Malware Detectors

arXiv cs.LG · Christian Scano, Diego Soi, Angelo Sotgiu, Luca Demetrio · 2026-06-25

DROIDBREAKER introduces a practical and functional problem-space attack framework for evading machine-learning Android malware detectors, addressing limitations of prior work that produced impractical or non-functional adversarial APKs. The method combines query-efficient white- and black-box attacks via fine-grained, build-safe manipulations (API call injection/obfuscation, module/permission/URL modifications) with a semantics-preserving functionality test comparing execution logs and API traces. Evaluated on a recent Android corpus, DROIDBREAKER achieves high evasion rates with minimal queries and side effects, significantly reducing detections by commercial scanners like VirusTotal.

problem-space attacks · adversarial apks · malware evasion · semantics-preserving · build-safe manipulations

Attributed, But Not Incremental: Cannibalization-Corrected Attribution for Large-Scale Advertising

arXiv cs.LG · Donghui Li, Bowen Yuan, Zili Yang, Qinxin Chen · 2026-06-25

The authors propose an experiment-calibrated attribution correction framework to address attribution-cannibalization mismatch in large-scale advertising systems, where paid-attributed conversions systematically overstate true incremental growth. The framework uses incrementality experiments as causal anchors to convert sparse lift measurements into daily correction estimates, allocating calibrated cannibalization volume across business hierarchies under structural consistency constraints. Offline validation shows the framework substantially reduces calibration error compared to raw attribution and ML baselines. Deployed across TikTok markets, it supported strategy adjustments leading to a ~15-percentage-point reduction in measured cannibalization rate.

attribution-cannibalization · incrementality experiments · causal anchors · calibration error · business hierarchies

PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

arXiv cs.LG · Muhammad Ahmed · 2026-06-25

PersistentKV introduces a page-aware scheduling system for long-context LLM serving on commodity GPUs, optimizing KV-cache movement and decode attention. The method employs native block-table decode attention, grouped-query attention (GQA) reuse, and a compact workqueue schedule for non-empty tasks. Evaluated on an RTX 3060 with FP16, page size 16, and Hq=32, Hkv=8, d=128, PersistentKV achieves 1.063-1.265x throughput improvement for B8 workloads and 1.399x for B1 bucketed traces, demonstrating adaptive policy efficacy in selecting between FlashInfer and PersistentKV based on workload characteristics.

kv-cache · grouped-query attention · flashinfer · decode scheduling · page-aware
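
To make the configuration quoted above concrete (page size 16, Hq=32, Hkv=8, d=128, FP16), the sketch below shows the index arithmetic that paged KV caches and grouped-query attention generally rely on. It is a generic illustration, not PersistentKV's scheduler; the helper names and the example block table are hypothetical.

```python
PAGE_SIZE = 16      # tokens per KV page
NUM_Q_HEADS = 32    # Hq
NUM_KV_HEADS = 8    # Hkv
HEAD_DIM = 128      # d
BYTES_FP16 = 2

# Number of query heads that share (reuse) one KV head under GQA
GROUP = NUM_Q_HEADS // NUM_KV_HEADS

def kv_head_for(q_head):
    """KV head read by a given query head."""
    return q_head // GROUP

def locate_token(block_table, pos):
    """Map a logical token position to (physical page id, offset in page)."""
    return block_table[pos // PAGE_SIZE], pos % PAGE_SIZE

def pages_needed(seq_len):
    """Pages required to hold seq_len tokens (ceiling division)."""
    return -(-seq_len // PAGE_SIZE)

# K and V, per layer, per token
kv_bytes_per_token = 2 * NUM_KV_HEADS * HEAD_DIM * BYTES_FP16

block_table = [7, 2, 9]  # hypothetical: logical page i lives in physical page block_table[i]
print(GROUP, kv_head_for(5), locate_token(block_table, 17), pages_needed(33), kv_bytes_per_token)
```

With these numbers, four query heads read each KV head, so a decode kernel that loads a KV page once can serve four query heads from it.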

Generating Special Triangulations with Transformers

arXiv cs.LG · Charles Arnal, Jacky H. T. Yip, François Charton, Gary Shiu · 2026-06-25

The authors demonstrate that transformers can effectively generate fine, regular, and star triangulations (FRSTs) of 4D reflexive polytopes, which are crucial for constructing smooth Calabi-Yau threefolds in string theory. They employ an appropriate encoding scheme to handle the high-dimensional and combinatorial complexity of triangulations. The models not only produce representative FRSTs across various polytope sizes but also self-improve through retraining on their own output, enabling applications in Calabi-Yau classification and interdisciplinary research.

transformers · triangulations · calabi-yau · reflexive polytopes · combinatorial complexity

Target-Aware Bandit Allocation for Scalable Surrogate Optimization in Chemical Space

arXiv cs.LG · Mohammad Haddadnia, Yuvan Chali, Abhilash Jayaraj, Constance Kraay · 2026-06-25

We introduce BOBa, a bandit-guided surrogate optimization framework for scalable virtual screening in ultra-large chemical libraries. BOBa eliminates full-library inference by adaptively allocating computation across action space partitions, treating them as arms in a multi-armed bandit with optimism-under-uncertainty exploration. Experiments on synthesis-on-demand libraries demonstrate that meaningful partitioning combined with bandit allocation enables effective tradeoffs between screening performance and inference cost, supporting practical optimization over billion-scale libraries while maintaining principled exploration. This establishes a viable path for virtual screening in ultra-large chemical spaces.

surrogate optimization · multi-armed bandit · virtual screening · action space partitioning · optimism-under-uncertainty
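
The idea of treating library partitions as bandit arms can be sketched with a standard UCB1 rule, as below. This is a generic illustration under assumptions, not BOBa itself: the toy integer "library", the scoring function, the exploration constant `c`, and the budget are all made up for the example.

```python
import math
import random

def ucb_allocate(partitions, score_fn, budget, c=1.0, seed=0):
    """Spend `budget` evaluations across partitions with UCB1.

    partitions: list of lists of candidates (each list is one arm).
    score_fn:   candidate -> reward in [0, 1] (stand-in for a surrogate model).
    Returns (best_item, best_score, pulls_per_partition).
    """
    rng = random.Random(seed)
    pools = [list(p) for p in partitions]
    for pool in pools:
        rng.shuffle(pool)  # draw candidates from each partition in random order
    pulls = [0] * len(pools)
    totals = [0.0] * len(pools)
    best_item, best_score = None, -math.inf

    def ucb(i, t):
        # Empirical mean plus an optimism bonus that shrinks with more pulls
        return totals[i] / pulls[i] + c * math.sqrt(2 * math.log(t) / pulls[i])

    for t in range(1, budget + 1):
        live = [i for i, pool in enumerate(pools) if pool]
        if not live:
            break  # every partition is exhausted
        untried = [i for i in live if pulls[i] == 0]
        arm = untried[0] if untried else max(live, key=lambda i: ucb(i, t))
        item = pools[arm].pop()
        score = score_fn(item)
        pulls[arm] += 1
        totals[arm] += score
        if score > best_score:
            best_item, best_score = item, score
    return best_item, best_score, pulls

# Toy library: three partitions whose scores rise with the item index.
partitions = [list(range(0, 100)), list(range(100, 200)), list(range(200, 300))]
best_item, best_score, pulls = ucb_allocate(partitions, lambda x: x / 300, budget=60)
print(best_item, round(best_score, 3), pulls)
```

Only 60 of the 300 candidates are scored, and most of that budget goes to the partition with the highest rewards, which is the cost-versus-performance tradeoff the summary describes.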

FracEvent: Event-Camera Simulation via Fractional-Relaxation Pixel Dynamics

arXiv cs.LG · Langyi Chen, Chuanzhi Xu, Haoxian Zhou, Pengfei Ye · 2026-06-25

FracEvent introduces a novel event-camera simulator that models pixel-level lifecycle dynamics via fractional-relaxation voltage dynamics, addressing limitations in temporal structure and downstream transfer of existing simulators. The method processes log-intensity trajectories through a compact stack of relaxation modes, combines responses into a voltage state, emits ON/OFF events by localizing threshold crossings, and retains memory modes for residual voltage response. Evaluated on event-stream comparison and downstream tasks like image reconstruction and optical flow estimation, FracEvent outperforms baselines in temporal accuracy and transfer performance across multiple datasets.

event-camera simulation · fractional-relaxation dynamics · pixel lifecycle · optical flow estimation · log-intensity trajectory

From Weights to Features: SAE-Guided Activation Regularization for LLM Continual Learning

arXiv cs.LG · Evan Ning, Wei Xue, Dong Lou, Yike Guo · 2026-06-25

We propose SAE-Guided Activation Regularization (SAE-GAR), a continual learning method for large language models that addresses catastrophic forgetting by regularizing in activation space rather than weight space. The method leverages pretrained Sparse Autoencoders (SAEs) as monosemantic feature dictionaries to construct task-specific feature masks, enabling explicit balance between stability and plasticity without storing previous-task data. SAE-GAR demonstrates superior performance on TRACE and MedCL benchmarks compared to weight-space regularization methods like Elastic Weight Consolidation (EWC), achieving state-of-the-art results among approaches without task-specific architectural components. Empirical evidence supports the polysemanticity thesis, showing task-relevant representations are linearly separable in SAE feature space but indistinguishable in weight space.

continual learning · sparse autoencoders · activation regularization · polysemanticity · catastrophic forgetting

Sketched Linear Contrastive Learning: Approximation, Optimization, and Statistical Scaling

arXiv cs.LG · Ziyan Chen, Zhongzhu Zhou, Ding-Xuan Zhou · 2026-06-25

The paper establishes scaling laws for sketched linear contrastive learning under a paired Gaussian latent-variable model. Using a bilinear contrastive score trained via full-batch gradient descent, the authors analyze a Gaussian-negative quadratic surrogate under power-law spectra and a contrastive source condition. They derive a risk decomposition and prove explicit scaling laws with respect to sketch dimension $M$, sample size $N$, and effective optimization horizon $L_{\mathrm{eff}}\gamma$, revealing how contrastive learning's two-view structure alters scaling compared to linear regression.

scaling laws · contrastive learning · bilinear model · gradient descent · risk decomposition

Latent Diffusion Posterior Sampling with Surrogate Likelihood Guidance for PDE Inverse Problems

arXiv cs.LG · Yuanzhe Wang, Alexandre M. Tartakovsky · 2026-06-25

The authors propose Latent Diffusion Posterior Sampling (L-DPS), a Bayesian framework for solving high-dimensional PDE inverse problems. L-DPS integrates a variational autoencoder, unconditional latent diffusion model, diffusion posterior sampling, and a differentiable neural surrogate to address challenges in PDE-constrained inversion, including implicit priors, high dimensionality, and computational cost. The method maps parameter fields to a latent space, learns implicit prior scores, and uses likelihood guidance via a surrogate model, avoiding repeated PDE solver calls. Evaluated on an inverse Darcy flow problem, L-DPS achieves accurate solutions, reduces inference cost compared to full-space DPS, and outperforms amortized inverse baselines in sparse and noisy regimes.

latent diffusion posterior sampling · variational autoencoder · diffusion posterior sampling · pde inverse problems · neural surrogate

Empirical Software Engineering TerraProbe: A Layered-Oracle Framework for Detecting Deceptive Fixes in LLM-Assisted Terraform

arXiv cs.LG · Manar Alsaid, Chimdumebi Nebolisa, Faris Abbas · 2026-06-25

TerraProbe introduces a five-layer oracle framework for evaluating LLM-assisted Terraform security repairs, addressing limitations of static-analysis-only validation. The method assesses 288 repairs by gemini-2.5-flash-lite, GPT-4o, and Claude 3.5 Sonnet across 96 Terraform modules, measuring Checkov removal, full-scanner cleanliness, planning success, plan comparison, and human adjudication. Results reveal deceptive fixes in 71.4% of plan-compared real-world repairs, with rates that are statistically indistinguishable across models (57.1-71.4%, p>0.10). A four-dimensional taxonomy of deceptive fixes achieves inter-rater reliability of κ=0.78 and α=0.76, while IAM analysis confirms persistent wildcard grants in all CKV2_AWS_11 cases.

terraform · llm-assisted repair · deceptive fixes · multi-layer oracle · infrastructure-as-code

Revisiting Action Factorization for Complex Action Spaces

arXiv cs.LG · Timothy Flavin, Sandip Sen · 2026-06-25

The study introduces VDN-PPO and PPO-MIX, leveraging branching critics for multi-headed PPO, and evaluates factorization methods across hybrid action spaces. It examines independent networks, shared encoders, VDN, QPLEX, Joint, and Auto-Regressive approaches within PPO, SAC, and DQN frameworks on discretized, hybrid, and continuous action spaces using four environments, including two new C++ implementations (CoopPush, Hybrid-Shoot). Analysis of 220 configurations reveals branching dueling architectures balance computation and performance effectively, with Auto-Regressive actions achieving the highest performance and continuous SAC outperforming discrete and hybrid methods, albeit at higher computational cost.

branching critic · hybrid action spaces · auto-regressive actions · ppo factorization · parallel environments

Can Large Language Models Reliably Code Qualitative Humanitarian Data? A Benchmark Study Against Human Expert Adjudication

arXiv cs.LG · Jerome Marston, Tino Kreutzer, Salomé Garnier, Ella Boone · 2026-06-25

This benchmark study evaluates the reliability of 46 large language models (LLMs) for coding qualitative humanitarian data against a human Gold Standard using 150 synthetic transcripts. The evaluation combines inter-rater reliability testing with Krippendorff's alpha, discrepancy analysis, and qualitative assessment across humanitarian-specific criteria. Results indicate that multiple LLMs achieve deductive coding reliability comparable to experienced human coders, particularly with structured prompts and reasoning-enabled configurations. However, models vary in recognizing indirect needs, non-standard communication, and protection-relevant concerns, necessitating structured codebooks, reasoning-enabled models, and tiered oversight for deployment.

large language models · deductive coding · krippendorff's alpha · inter-rater reliability · reasoning-enabled configurations
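
For readers unfamiliar with the reliability statistic used above, here is a minimal sketch of Krippendorff's alpha for nominal labels, computed from the coincidence matrix. The toy ratings are invented for illustration and have no connection to the study's data; the function name is an assumption.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of lists; each inner list holds the labels the coders gave
    to one unit (missing ratings are simply left out). Assumes at least two
    distinct labels occur, otherwise expected disagreement is zero.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit rated once cannot be paired with anything
        # Every ordered pair of ratings within a unit, weighted by 1 / (m - 1)
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    return 1.0 - observed / expected

# Two coders, four transcripts; they disagree on the last one only.
ratings = [["need", "need"], ["need", "need"], ["none", "none"], ["none", "need"]]
alpha = krippendorff_alpha_nominal(ratings)
print(round(alpha, 3))
```

For this toy input alpha is 8/15, about 0.533: a single disagreement on four units already pulls the statistic well below 1 because chance agreement is corrected for.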

Sample-efficient Transfer Reinforcement Learning via Adaptive Reward Shaping and Policy-Ratio Reweighting Strategy

arXiv cs.LG · Wenjie Huang, Yang Li, Jingjia Teng, Mingwei Jin · 2026-06-25

The paper proposes a safe transfer reinforcement learning framework for autonomous highway lane changing, addressing transfer mismatch and safety concerns. The method introduces an adaptive teacher intervention mechanism based on instantaneous safety cost to limit risky exploration, a teacher-guided safe transfer module embedding action evaluation via reward shaping, and a teacher-guided weighted optimization mechanism using likelihood ratio factors for sample reweighting. Experiments across varied traffic densities and the NGSIM dataset demonstrate improvements of over 52.2% in safety and 5.0% in learning efficiency compared to baselines, validating the framework's efficacy and robustness.

transfer learning · reward shaping · policy optimization · safety cost · likelihood ratio

Theory-Scale Auto-Formalization of Logics for Computer Science

arXiv cs.LG · Yuming Feng, Frederick Pu, One An, Osbert Bastani · 2026-06-25

The paper introduces LCS-Bench, a theory-scale benchmark for auto-formalization in computer science logics, addressing challenges in consistency, faithfulness, and scalability. The method employs a semi-automated agentic pipeline combining concept graphs, formal signature planning, issue tracking, and human expert review, resulting in 327 textbook items formalized into 4,076 Lean declarations (85K+ lines). An evaluation of 14 models supports the benchmark's quality (state-of-the-art performance is 20.1%), and the authors propose definitional equivalence checkers for fine-grained assessment, offering insights for future theory-scale auto-formalization research.

auto-formalization · lean theorem prover · formal verification · concept graphs · definitional equivalence

Mean-Field PhiBE: Continuous-Time Mean-Field Reinforcement Learning from Discrete-Time Data

arXiv cs.LG · Erhan Bayraktar, Martin Hernandez, Qinxin Yan, Yuhua Zhu · 2026-06-25

The paper introduces Mean-Field-PhiBE (MF-PhiBE), a framework for continuous-time mean-field reinforcement learning from discrete-time data, addressing the non-identifiability of drift and diffusion coefficients in McKean-Vlasov dynamics. MF-PhiBE integrates discrete-time transition data into a continuous-time PDE on the Wasserstein space, preserving the generator structure of the McKean-Vlasov HJB equation. A policy-gradient theorem is derived for entropy-regularized randomized feedback policies, enabling a model-free actor-critic method. Theoretical analysis shows first-order consistency with an error of order Δt, achieving second-order accuracy in the linear-quadratic case. Numerical experiments on LQR and crowd-aversion problems validate the approach.

mean-field reinforcement learning · mckean-vlasov dynamics · wasserstein space · policy-gradient theorem · actor-critic method

Learning Probabilistic Filters with Strictly Proper Scoring Rules

arXiv cs.LG · Eviatar Bach, Ricardo Baptista, Jochen Bröcker, Bohan Chen · 2026-06-25

The paper introduces the proper scoring ensemble filter (PSEF), a novel ensemble data assimilation method that learns an analysis map to approximate Bayesian filtering distributions using synthetic state-observation trajectories. PSEF employs a permutation-invariant transformer-based architecture trained via strictly proper scoring rules (energy score), ensuring probabilistic accuracy across the entire distribution. Theoretical analysis shows the method converges to the true Bayesian filter under realizability. Experiments demonstrate PSEF's superiority over classical and MSE-based learning methods, particularly for non-Gaussian, multi-modal posteriors, with end-to-end training outperforming EnKF corrections in highly non-Gaussian settings.

bayesian filtering · ensemble methods · proper scoring rules · transformer architectures · data assimilation
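
The energy score named above has a simple sample-based form, sketched below for a finite ensemble. This only illustrates the scoring rule (lower is better); it says nothing about the PSEF architecture or training loop, and the example ensembles are invented.

```python
import math

def energy_score(ensemble, observation):
    """Sample energy score: mean ||x_i - y|| - 0.5 * mean ||x_i - x_j||.

    ensemble:    list of equal-length tuples (the ensemble members x_i).
    observation: tuple of the same length (the verifying state y).
    """
    m = len(ensemble)
    # Average distance from ensemble members to the observation
    to_obs = sum(math.dist(x, observation) for x in ensemble) / m
    # Average pairwise distance between members (rewards honest spread)
    spread = sum(math.dist(x, z) for x in ensemble for z in ensemble) / (m * m)
    return to_obs - 0.5 * spread

truth = (1.0, 0.0)
centered = [(0.0, 0.0), (2.0, 0.0)]    # straddles the truth
shifted = [(10.0, 0.0), (12.0, 0.0)]   # same spread, wrong location

print(energy_score(centered, truth), energy_score(shifted, truth))
```

Both ensembles have the same spread, so the difference in score comes entirely from the first term; a mean-squared-error loss on the ensemble mean would ignore the spread term altogether.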

What Survives When You Compress a Recursive Reasoner for the Edge?

arXiv cs.LG · Pearse Jim, Steven Kolawole, Opegbemi Matthias Busoye, Glory Bagai · 2026-06-25

The study investigates compression effects on recursive reasoning models, revealing that aggressive quantization (INT4) preserves local cell accuracy but catastrophically fails in global puzzle-exact accuracy due to error compounding across reasoning cycles. Through experiments on three tasks and two architectures (MLP-mixing recursion vs. attention), the authors identify architectural susceptibility, propose carry-trajectory fidelity as a damage predictor, and demonstrate recovery via per-channel calibrated INT4 without retraining. Key deployment optimizations include flash-streamed embeddings (99.4MB reduction), INT8 at 6x fewer FLOPs (8MB SoC), and INT4 fitting 4MB microcontrollers.

recursive reasoning · quantization · edge deployment · mlp-mixing · calibration
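
A small sketch of why per-channel scales can help INT4 quantization: with one scale for the whole tensor, a channel with small weights is rounded to zero. This uses plain symmetric max-abs scaling as an assumption; the paper's calibration procedure is not described in the summary, and the weight values are invented.

```python
def scale_for(values):
    """Symmetric scale so the largest magnitude maps to the INT4 level 7."""
    return max(abs(v) for v in values) / 7.0

def quantize_int4(values, scale):
    """Round to integer levels in [-8, 7], then map back to floats."""
    return [max(-8, min(7, round(v / scale))) * scale for v in values]

def mean_abs_error(weights, per_channel):
    """weights: list of rows, one row per output channel."""
    tensor_scale = scale_for([v for row in weights for v in row])
    err, count = 0.0, 0
    for row in weights:
        scale = scale_for(row) if per_channel else tensor_scale
        for v, q in zip(row, quantize_int4(row, scale)):
            err += abs(v - q)
            count += 1
    return err / count

weights = [
    [0.02, -0.01, 0.03, -0.025],  # small-magnitude channel
    [4.0, -3.5, 2.2, -1.0],       # large-magnitude channel
]
per_tensor_err = mean_abs_error(weights, per_channel=False)
per_channel_err = mean_abs_error(weights, per_channel=True)
print(per_tensor_err, per_channel_err)
```

In a recursive model the per-step error shown here is re-applied on every reasoning cycle, which is consistent with the summary's point that local accuracy can look fine while exact-solution accuracy collapses.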

When Does Quality-Aware Multimodal Fusion Matter? A Leakage-Safe Diagnostic for Decision-Level Dependence

arXiv cs.LG · Jaden Moon, Arvind Pillai, Andrew Campbell · 2026-06-25

The paper proposes a diagnostic test to determine whether quality-aware multimodal fusion models actually utilize reliability scores during inference. The method permutes reliability scores across fixed model inputs and predictions, measuring performance degradation as evidence of score dependence. Experiments on StressID (stress recognition) and CMU-MOSEI (sentiment analysis) show no performance change under permutation, revealing unused potential gains from modality selection. Positive controls demonstrate score dependence only when reliability signals strongly predict unimodal correctness.

multimodal fusion · reliability scores · decision-level dependence · quality-aware · permutation test
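
The permutation diagnostic described above can be sketched in a few lines: hold inputs and labels fixed, shuffle the reliability scores across samples, and compare accuracy. The two toy "fusion models", the data, and the function names below are assumptions for illustration, not the paper's models or datasets.

```python
import random

def permutation_gap(predict, inputs, scores, labels, trials=200, seed=0):
    """Accuracy with the true scores minus mean accuracy with shuffled scores."""
    def accuracy(score_list):
        hits = sum(predict(x, s) == y for x, s, y in zip(inputs, score_list, labels))
        return hits / len(labels)

    rng = random.Random(seed)
    base = accuracy(scores)
    shuffled_total = 0.0
    for _ in range(trials):
        perm = scores[:]
        rng.shuffle(perm)  # break the pairing between samples and their scores
        shuffled_total += accuracy(perm)
    return base - shuffled_total / trials

# Each input is (prediction of modality A, prediction of modality B).
inputs = [(1, 0), (0, 1), (1, 0), (0, 1), (1, 1), (0, 0), (1, 0), (0, 1)]
labels = [1, 1, 0, 0, 1, 0, 1, 1]
scores = [1, 0, 0, 1, 1, 1, 1, 0]  # 1 means "trust modality A" for that sample

def gated(x, s):
    return x[0] if s > 0.5 else x[1]   # actually uses the score

def ignores_score(x, s):
    return x[0]                        # score has no effect on the output

gap_gated = permutation_gap(gated, inputs, scores, labels)
gap_ignore = permutation_gap(ignores_score, inputs, scores, labels)
print(gap_gated, gap_ignore)
```

A gap near zero, as for `ignores_score`, is the pattern the summary reports on StressID and CMU-MOSEI; the `gated` model plays the role of a positive control where the scores do carry decision-relevant signal.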

Epiphany-Aware KV Cache Eviction Without the Attention Matrix

arXiv cs.LG · Steven Kolawole, Virginia Smith · 2026-06-25

The paper introduces EpiKV, a KV cache eviction method that scores tokens using epiphany scores—changes in the model's internal representation—without materializing the attention matrix. This approach eliminates the need for attention-based ranking, fused kernels, or custom training, enabling seamless integration with FlashAttention inference stacks. EpiKV achieves 72% accuracy on MATH-500 with a 4096-token cache, matching attention-based baselines (ThinKV 71%, H2O 67%), and scales to 16x longer contexts. On AIME-2024 with 8192 tokens, a lag-normalized variant reaches 37% accuracy (vs. 33% for baselines) at 2.8x speed.

kv cache · epiphany score · flashattention · cache eviction · inference optimization

A Causal Foundation Model for Structure and Outcome Prediction

arXiv cs.LG · Max Zhu, Martino Mansoldo, Ching-Hao Wang, Stefan Groha · 2026-06-25

TabPFN-CFM is introduced as a causal foundation model capable of addressing multiple causal problems, including structure and outcome prediction from observational data. The model supports queries across all three levels of Pearl's Causal Hierarchy and leverages known graph structure when available to enhance predictions. Trained on synthetic datasets, TabPFN-CFM demonstrates generalization to real-world data, outperforming existing baselines in both structural and outcome prediction tasks.

causal foundation modelpearl's causal hierarchyobservational datastructure predictionoutcome prediction

Finding the Time to Think: Learning Planning Budgets in Real-Time RL

arXiv cs.LG · Aneesh Muppidi, Firas Darwish, Dylan Cope, João F. Henriques · 2026-06-24

The paper introduces variable-delay real-time reinforcement learning (RL), where agents select state-dependent planning budgets while the environment progresses. The method employs a lightweight gating policy trained atop a planner to dynamically allocate deliberation time, avoiding the computational paralysis of nested planning. Evaluations across real-time Pac-Man, Tetris, Snake, Speed Hex, and Speed Go demonstrate superior performance over fixed-budget and heuristic baselines, with successful transfer to a dual-GPU real-time setup.

real-time rlplanning budgetsgating policyvariable-delaystate-dependent

A probabilistic framework for online test-time adaptation

arXiv cs.LG · Daniel Corrales, David Ríos Insua · 2026-06-24

The paper proposes a probabilistic framework for online test-time adaptation, addressing scenarios where models must adapt to distributional shifts between training and test data. The method employs state-space modeling to characterize parameter learning, temporal evolution, prior tuning, and prediction. This architecture provides a unified approach for handling distribution shifts in unlabeled test data during deployment.

test-time adaptationdistributional shiftstate-space modelingparameter learningonline learning

Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization

arXiv cs.LG · Jiading Gai, Shuai Zhang, Kaj Bostrom, Jin Huang · 2026-06-24

KernelPro introduces a closed-loop multi-agent system for GPU kernel optimization, integrating LLM code generation with hardware profiler feedback and bottleneck detection tools. Key innovations include a semantic feedback operator encoding expert heuristics, a two-stage tool invocation architecture combining kernel/instruction/system-level profiling, domain-adapted Monte Carlo Tree Search with progressive widening, and CuTe source-level code generation. Evaluated on KernelBench, KernelPro achieves geometric mean speedups of 2.42x/4.69x/5.30x across difficulty levels and outperforms hand-tuned Triton by 1.23x on MoE training kernels. Ablation studies confirm significant improvements from micro-profiling tools (p < 0.0001), MCTS search (26% higher geometric mean, p = 0.004), and tool orchestration (23% improvement, p = 0.035). KernelPro also reduces energy consumption by 11.6% at matched speed.

gpu kernel optimizationmonte carlo tree searchhardware profilersemantic feedback operatorcute code generation

Listening Like a Judge: A Music-Aware Framework for Automatic Singing Performance Evaluation

arXiv cs.LG · Neelam Saini, Sourav Ghosh · 2026-06-24

MusicJudge introduces a modality-guided framework for automatic singing quality assessment (SQA) that integrates lyric correctness and pitch-rhythm fidelity through block-aligned multimodal analysis. The system employs multi-signal matching, combining semantic embeddings, lexical similarity, and phonetic alignment to detect semantically meaningful lyric blocks. It enhances singing audio transcription via Modality-Guided LoRA for ASR fine-tuning. Evaluations across datasets demonstrate strong alignment with human expert judgments, validating MusicJudge's generalizability and effectiveness in holistic performance evaluation.

singing quality assessmentmultimodal analysissemantic embeddingsphonetic alignmentmodality-guided lora

Embedding Foundation Model Predictions in Discrete-Choice Models with Structural Guarantees

arXiv cs.LG · Yingshuo Wang, Xian Sun, Yanhang Li, Zhichao Fan · 2026-06-24

The authors propose a two-stage adapter that integrates foundation model predictions into multinomial logit models while preserving structural economic guarantees. Stage 1 fits multinomial logit coefficients with sign constraints via maximum likelihood, while Stage 2 freezes these coefficients and applies a neural correction to the foundation model's predictions. This approach ensures marginal rate of substitution preservation and analytically computable value-of-time. Evaluated across three datasets and two foundation models, the adapter improves test accuracy by 6.4 percentage points on average, maintains 100% cost monotonicity, and produces plausible values of time. Performance remains robust under context restriction, retaining over 6 percentage points accuracy gain at 10% context.

multinomial logitfoundation modelmarginal rate of substitutioncost monotonicityvalue-of-time
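
The structural quantities mentioned above follow directly from the Stage 1 coefficients. A minimal sketch, assuming a linear utility with one cost and one time coefficient (the coefficient values and alternatives are hypothetical, and the Stage 2 neural correction is not modeled):

```python
import math


def choice_probabilities(alternatives, beta_cost, beta_time):
    """Multinomial logit: softmax over linear utilities of (cost, time) pairs."""
    utilities = [beta_cost * cost + beta_time * time for cost, time in alternatives]
    top = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp(u - top) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def value_of_time(beta_cost, beta_time):
    """Marginal rate of substitution between time and cost (money per unit time)."""
    return beta_time / beta_cost


# Hypothetical sign-constrained coefficients: both negative.
beta_cost, beta_time = -0.1, -0.03  # per dollar, per minute

vot_per_hour = value_of_time(beta_cost, beta_time) * 60
cheap = choice_probabilities([(5.0, 30.0), (3.0, 45.0)], beta_cost, beta_time)
pricey = choice_probabilities([(8.0, 30.0), (3.0, 45.0)], beta_cost, beta_time)
```

Because `beta_cost` is constrained to be negative, raising the cost of an alternative can only lower its choice probability, which is the cost monotonicity property; the value of time is available in closed form as the ratio of the two coefficients.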

DualEval: Joint Model-Item Calibration for Unified LLM Evaluation

arXiv cs.LG · Aaron J. Li, Hao Huang, Youngmin Park, Yitong Ma · 2026-06-24

DualEval introduces a latent model-item calibration framework that jointly estimates model ability, item difficulty, and sharpness across static benchmarks and arena-style preference data. The method represents models and evaluation items in a shared space, validated on 18 frontier LLMs across coding, math, domain-knowledge, and generic query domains. Results demonstrate reliable model rankings, benchmark compression for sample efficiency, and anomaly detection capabilities, unifying static and interactive evaluation paradigms.

latent calibrationmodel-item joint estimationarena-style evaluationbenchmark compressionanomaly detection

Rethinking Training & Inference for Forecasting: Linking Winner-Take-All back to GMMs

arXiv cs.LG · Qiyuan Wu, Katie Z Luo, Bharath Hariharan, Wei-Lun Chao · 2026-06-24

The paper identifies a modeling-training mismatch in trajectory forecasting for autonomous driving, where models are typically conditional Gaussian mixture models (GMMs) but trained with winner-take-all (WTA) loss, leading to uninformative mode posteriors. To address this, the authors propose two post-hoc treatments: (1) test-time posterior-weighted merging of nearby trajectories and (2) a one-step EM update replacing hard labels with soft responsibilities. Evaluated on WTA-trained architectures, these methods improve mode posteriors and displacement metrics without retraining, bridging GMM and K-means perspectives.

trajectory forecastinggaussian mixture modelswinner-take-all lossexpectation-maximizationmode pruning
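
The difference between the hard winner-take-all label and the soft responsibilities used in an EM step can be sketched as below. This assumes isotropic Gaussians with a shared, hypothetical `sigma` and represents each mode by a single point; the paper's actual parameterization is not given in this summary.

```python
import math


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_label(target, means):
    """Winner-take-all assignment: index of the closest mode."""
    dists = [squared_distance(target, mu) for mu in means]
    return dists.index(min(dists))


def responsibilities(target, means, priors, sigma=1.0):
    """Soft E-step: posterior probability of each mode given the observed target."""
    log_w = [
        math.log(p) - squared_distance(target, mu) / (2 * sigma ** 2)
        for p, mu in zip(priors, means)
    ]
    top = max(log_w)  # log-sum-exp trick for stability
    weights = [math.exp(v - top) for v in log_w]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example: two modes equally close to the target, one far away.
means = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
priors = [1 / 3, 1 / 3, 1 / 3]
target = (1.0, 0.0)

winner = hard_label(target, means)
soft = responsibilities(target, means, priors)
```

The hard label credits only one of the two equally plausible modes, whereas the responsibilities split the credit between them, which is the kind of signal a mode posterior needs in order to be informative.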

Otter Weather: Skillful and Computationally Efficient Medium-Range Weather Forecasting

arXiv cs.LG · Cristiana Diaconu, Jonas Scholz, Aliaksandra Shysheya, Stratis Markou · 2026-06-24

The authors introduce Otter Weather, a computationally efficient spatiotemporal model for medium-range weather forecasting that advances the skill-compute Pareto frontier. The deterministic variant, trained on ERA5 reanalysis data at 1.5° resolution, outperforms Numerical Weather Prediction baselines by 9.6% at 24-hour lead time while requiring <3.5 A100-days of training. Scaling to probabilistic forecasting via Continuous Ranked Probability Score optimization, Otter-XL achieves 9.7% CRPS improvement over IFS ENS with 100x less compute than frontier architectures. The model also generalizes to PDE tasks, outperforming foundation models in acoustic scattering.

spatiotemporal forecastingnumerical weather predictioncontinuous ranked probability scoreera5 reanalysispareto frontier
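
For reference, the Continuous Ranked Probability Score of a finite ensemble can be computed with the standard estimator E|X - y| - 0.5 E|X - X'|. The sketch below shows that estimator for a scalar quantity; whether Otter Weather trains on exactly this form is not stated in the summary.

```python
def crps_ensemble(members, observation):
    """Standard ensemble CRPS estimator for a scalar forecast (lower is better)."""
    m = len(members)
    # Mean absolute error of the members against the observation.
    accuracy_term = sum(abs(x - observation) for x in members) / m
    # Half the mean absolute difference between all member pairs (ensemble spread).
    spread_term = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return accuracy_term - spread_term


single = crps_ensemble([3.0], 1.0)       # reduces to absolute error
spread = crps_ensemble([0.0, 2.0], 1.0)  # spread around the truth is rewarded
```

With a single member the score reduces to the absolute error, which is why CRPS is often described as a probabilistic generalization of MAE.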

At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization

arXiv cs.LG · Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca · 2026-06-24

The study introduces a mechanistic framework for analyzing transformer generalization limits by examining out-of-distribution (OOD) behavior through sparse autoencoders. It demonstrates that OOD inputs, including adversarial examples like typos and jailbreak prompts, activate fallacious internal concepts in language models. The authors propose a diagnostic method to quantify distributional shifts and develop a fine-tuning strategy to enhance model robustness, extending OOD analysis from input data to internal computational processes.

sparse autoencodersout-of-distributiontransformer robustnessmechanistic interpretabilityjailbreak prompts

Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs

arXiv cs.LG · Xi Xiao, Chen Liu, Chih-Ting Liao, Yunbei Zhang · 2026-06-24

VIGIL introduces a reinforcement-learning framework to mitigate visual laziness in multimodal large language models (MLLMs), where models overly rely on language priors despite encoding correct visual evidence. The method employs a geometric constraint to maximize mutual information between visual inputs and responses, penalizing improper certainty in counterfactual blind states. Experiments demonstrate that VIGIL outperforms existing alignment methods on hallucination and reasoning benchmarks, achieving full-data performance with only 25% of preference data and emergent spatial grounding without explicit supervision.

multimodal large language modelsvisual lazinessmutual informationcounterfactual blind statereinforcement-learning

Scoring Is Not Enough: Addressing Gaps in Utility-fairness Trade-offs for Ranking

arXiv cs.LG · Shubham Singh, Ian A. Kash, Mesrob I. Ohannessian · 2026-06-24

The authors demonstrate that scoring functions, while effective for utility-centric objectives, are sub-optimal for achieving utility-fairness trade-offs in ranking systems. They present counter-examples using a generic fairness formulation, showing limitations persist across deterministic and randomized scoring functions, as well as single and multi-query fairness measures. Empirically, they find that semi-greedy post-processing can achieve superior trade-offs, approaching the ideal of exhaustive post-processing in a tractable manner.

scoring functionsutility-fairness trade-offsranking systemssemi-greedy post-processingalgorithmic fairness

Does Aurora Encode Atmospheric Structure? Latent Regime Analysis and Attribution

arXiv cs.LG · Emma Kasteleyn, Ana Lucic · 2026-06-24

The study demonstrates that Aurora, a machine learning foundation model for atmospheric dynamics, encodes meteorological coherence and vertical structure without explicit supervision. Using spatially pooled PCA and layer-wise relevance propagation (LRP), the authors analyze Aurora's latent representations, revealing primary organization by seasonal cycles and attention to features consistent with the 3D vertical structure of the Great Storm of 1987. Perturbation tests show that masking relevant regions degrades forecasts 3.31× more than random masking, indicating learned meteorological coherence. Extreme storm events, however, do not form linearly separable clusters in the latent space.

atmospheric dynamicslatent representationslayer-wise relevance propagationperturbation testsmeteorological coherence

EMA-FS: Accelerating GBDT Training via Gain-Informed Feature Screening

arXiv cs.LG · Yan Song · 2026-06-24

The paper proposes EMA-FS, an algorithm-level optimization for Gradient Boosted Decision Trees (GBDT) that accelerates training by screening low-gain features. The method maintains exponential moving averages of per-feature split gains and restricts histogram construction to the top-K features after a warmup period, preserving compatibility with LightGBM's histogram subtraction. Evaluations on datasets with 29 to 968 features show speedups of 2.61x on synthetic data and 1.45x on IEEE-CIS Fraud at 30% feature retention, with AUC improvements at 70% retention. A stochastic variant (S-EMA-FS) introduces gain-weighted random sampling, unifying deterministic and random approaches.

gradient boosted decision treesfeature screeningexponential moving averagelightgbmhistogram construction
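
The screening rule described above can be sketched in a few lines. This is a standalone illustration, not LightGBM code: `alpha`, `warmup`, and `top_k` are hypothetical parameter names, and the per-feature gains are toy values that a real booster would supply after each tree.

```python
def update_ema(ema, gains, alpha=0.1):
    """Blend the latest per-feature split gains into the running averages."""
    return [(1 - alpha) * e + alpha * g for e, g in zip(ema, gains)]


def active_features(ema, iteration, warmup, top_k):
    """All features during warmup; afterwards only the top-K by EMA gain."""
    n = len(ema)
    if iteration < warmup:
        return list(range(n))
    ranked = sorted(range(n), key=lambda j: ema[j], reverse=True)
    return sorted(ranked[:top_k])


# Toy run: four features with constant (hypothetical) gains per boosting round.
gains_per_round = [5.0, 0.1, 3.0, 0.0]
ema = [0.0] * 4
history = []
for iteration in range(6):
    history.append(active_features(ema, iteration, warmup=3, top_k=2))
    ema = update_ema(ema, gains_per_round)
```

During the first three rounds histograms would be built for every feature; from round three onward only features 0 and 2 remain active. The stochastic variant would replace the top-K cut with sampling in proportion to the EMA values.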

Mesh-RL: Coupled subgrid reinforcement learning

arXiv cs.LG · Behnam Gheshlaghi, Bahador Rashidi, Shahin Atakishiyev · 2026-06-24

Mesh-RL proposes a spatial domain-decomposition framework for reinforcement learning, inspired by finite element methods, to accelerate temporal-difference reward propagation in sparse-reward environments. The method partitions the environment into overlapping subgrids with boundary-consistent updates, enabling localized learning while maintaining global coherence. Evaluations on grid-world environments show improved convergence speed (Q-learning, SARSA, Dyna-Q), cumulative reward, and learning stability, with higher mesh resolutions enhancing exploration and long-range value propagation.

reinforcement learningdomain decompositiontemporal-difference learningfinite element methodcredit assignment

Tailor Made Embeddings for Quantum Machine Learning

arXiv cs.LG · Aldo Lamarre, Dominik Šafránek · 2026-06-24

The authors introduce a variational autoencoder framework for quantum machine learning that learns task-specific quantum embeddings of classical data. The method compresses high-dimensional datasets like ImageNet into 13-qubit representations while maintaining reconstructability via a learned decoder. On MNIST (3 vs 5), it achieves 98.5% validation accuracy (1.2pp below classical baseline) and outperforms naive amplitude embeddings by >30pp, requiring only polynomial measurements for reconstruction. Hardware validation on IBM quantum devices confirms noise stability.

quantum machine learningvariational autoencoderquantum embeddingsstate tomographycircuit-centric classifier

Equivariance and Augmentation for Bayesian Neural Networks

arXiv cs.LG · Miaowen Dong, Axel Flinth, Jan E. Gerken · 2026-06-24

The paper analyzes data augmentation for Bayesian neural networks (BNNs) trained with variational inference, contrasting it with architectural equivariance constraints. For variational distributions in the exponential family, the authors derive conditions for exact equivariance and bounds on equivariance error, proposing three novel symmetrization techniques. Numerical experiments demonstrate that orbit expansion, one of their methods, improves both equivariance and overall performance compared to baselines. The work bridges theoretical understanding of equivariance with practical augmentation strategies in BNNs.

bayesian neural networksequivariancedata augmentationvariational inferencesymmetrization

Dataset Usage Inference without Shadow Models or Held-out Data

arXiv cs.LG · Wojciech Łapacz, Stanisław Pawlak, Jan Dubiński, Franziska Boenisch · 2026-06-24

The authors propose a practical Dataset Usage Inference (DUI) framework that eliminates the need for shadow models or held-out data, addressing limitations of existing methods. Their approach generates synthetic non-member samples, extracts diverse membership signals, and formulates DUI as a mixture proportion estimation problem to estimate the fraction of a candidate dataset used in training. Experiments on large image generative models demonstrate reliable quantification of dataset usage, offering a practical tool for data ownership verification.

dataset usage inferencemembership inferencemixture proportion estimationsynthetic non-member samplesdata ownership

Interpreting "Interpretability" and Explaining "Explainability" in Machine Learning in Physics

arXiv cs.LG · Rikab Gambhir, Luisa Lucie-Smith, Jesse Thaler · 2026-06-24

The paper analyzes interpretability (structural transparency) and explainability (scientific mapping) in physics ML, framing them as deliberate modeling choices rather than inherent properties. It examines trade-offs (interpretability vs. expressivity, explainability vs. adaptability), applicable contexts, and available intrinsic/post-hoc tools. The authors argue ML models should address the same scientific questions as classical models, differing only in scale, and stress task specification and intervention plans as core design elements.

interpretabilityexplainabilityphysicstrade-offstransparency

Fast LeWorldModel

arXiv cs.LG · Yuntian Gao, Xiangyu Xu · 2026-06-24

Fast-LeWM introduces action-prefix prediction to accelerate latent world modeling, replacing LeWM's autoregressive rollout with parallel prefix-based future latent prediction. The method encodes action sequence prefixes to directly model multi-horizon accumulated effects, enforcing prefix-level supervision for continuous state evolution. Evaluations show Fast-LeWM improves success rates over LeWM while reducing planning time, with slower-growing open-loop latent loss across increasing horizons.

joint-embedding predictive architectureslatent transition modelaction-prefix predictionopen-loop latent lossvisual planning

A General Framework for Learning Algebraic Properties from Cayley Graphs using Graph Neural Networks

arXiv cs.LG · Tal Weissblat · 2026-06-24

The author proposes a general Graph Neural Network (GNN) framework for learning algebraic properties of finite groups directly from their Cayley graph representations, extending prior work on solvability prediction. Using a unified GNN architecture and training pipeline, the method learns properties including abelianity, nilpotency, and solvability from graph-based representations alone. Experiments across multiple families of finite groups demonstrate the framework's ability to distinguish and recover these algebraic properties, indicating that substantial algebraic information is encoded in Cayley graphs. This work establishes graph representation learning as a viable approach for studying algebraic properties of finite groups.

graph neural networkcayley graphalgebraic propertiesfinite groupsrepresentation learning

The Role of Input Dimensionality in the Emergence and Targeted Control of Adversarial Examples

arXiv cs.LG · Nasrin Malekzadeh Goradel, Niccolo Pancino, Yaser Gholizade Atani, Benedetta Tondi · 2026-06-24

This work systematically investigates how input dimensionality affects adversarial example emergence and targeted control, challenging assumptions in theoretical frameworks based on concentration of measure. Through empirical evaluation across hierarchical image datasets and diverse neural architectures, the study demonstrates that adversarial examples become easier to construct as dimensionality increases. Results show the gap between targeted and untargeted perturbations remains small and narrows further with higher dimensions, supported by theoretical arguments about high-dimensional geometry. The findings establish input dimensionality as a fundamental factor in adversarial vulnerability, though its interplay with data distributions versus architectural properties remains unresolved.

adversarial examplesinput dimensionalityconcentration of measuretargeted attackshigh-dimensional geometry

Topology-Informed Neural Networks for Flood Detection in Optical and Synthetic Aperture Radar Imagery

arXiv cs.LG · Sophia Li, Max Zhao, Raghu G. Raj, Tianyu Chen · 2026-06-24

The paper introduces topology-informed neural networks for flood detection in optical and SAR imagery, addressing limitations of opaque black-box models. The method combines topological data analysis (TDA) with convolutional neural networks (ResNet-50) and vision transformers, extracting topological features from the SEN12-FLOOD dataset to enhance interpretability and robustness. Results demonstrate that topological descriptors independently carry meaningful flood signals and complement existing architectures, improving detection accuracy while providing mathematically grounded feature interpretation.

topological data analysisflood detectionresnet-50vision transformersen12-flood dataset

From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models

arXiv cs.LG · Haoxiang Sun, Tao Wang, Li Yuan, Jian Zhao · 2026-06-24

The survey presents a systematic examination of unified vision-language perception in Multimodal Large Language Models (MLLMs), addressing a gap in existing fragmented reviews. It formalizes MLLM perception as an intrinsic, unified capability akin to human perception and introduces a five-stage taxonomy to trace paradigm evolution, surveying representative methods and milestones at each phase. The study identifies open challenges and outlines research directions toward general multimodal intelligence, offering both foundational insights and a roadmap for advancing artificial general intelligence (AGI).

multimodal large language modelsvision-language perceptionparadigm evolutionartificial general intelligencecross-modal capability

Self-Supervised Tree-level Biomass Estimation in Urban Environments From Airborne LiDAR and Optical Observations

arXiv cs.LG · Jose Bermudez, Zilong Zhong, Dominic Cyr, Camile Sothe · 2026-06-24

The study presents a self-supervised framework for crown-level above-ground biomass (AGB) estimation in urban environments using airborne LiDAR (8–10 pulses m⁻²) and near-infrared RGB orthophotography (0.16–0.20 m resolution). A dual-stream cross-attention network trained on rule-based pseudo-labels segments tree crowns, achieving precision/recall/Dice scores of 0.86/0.83/0.84. Biomass is estimated via a crown area–height power-law proxy calibrated to species-specific allometry, yielding R²=0.570 on a 90,726-tree test set. The method maps 1.73–1.81 Tg AGB over 810 km² in Ontario, with uncertainty quantification guiding allometric equation assignment.

above-ground biomasslidarcross-attention networkallometric equationwatershed segmentation

Federated Hash Projected Latent Factor Learning

arXiv cs.LG · Jialan He · 2026-06-24

The paper proposes Federated Hash Projected Latent Factor (FHPLF), a model combining federated learning with hash learning to address privacy and efficiency challenges. FHPLF introduces binary gradient-like matrices for reduced communication overhead, Projected Hamming Distance for enhanced representation, and a Secure Binary Gradient Reassembly strategy for privacy protection. Evaluations on four real-world datasets show FHPLF outperforms state-of-the-art methods in accuracy, efficiency, and privacy preservation.

federated learninghash learningbinary gradient matricesprojected hamming distanceprivacy preservation

Clue-Guided Money Laundering Group Discovery

arXiv cs.LG · Boyang Wang, Jianing Cao · 2026-06-24

The paper introduces Clue-Guided Group Discovery (CGGD) for money laundering investigations, addressing the mismatch between existing graph anomaly detection methods and real-world AML workflows. The proposed Clue2Group framework constructs a local investigation context, estimates a clue-conditioned risk field using a multi-semantic local-temporal GNN, and integrates risk, structural, and prior-pattern evidence to recover laundering groups. Evaluated on two large-scale AML benchmarks, Clue2Group demonstrates practical effectiveness in aligning with real investigation processes.

money laundering group discoverygraph anomaly detectionlocal-temporal gnnanti-money launderingclue-guided analysis

MiniOpt: Reasoning to Model and Solve General Optimization Problems with Limited Resources

arXiv cs.LG · Ke Zhao, Zixiang Di, Hong Qian, Xiang Shu · 2026-06-24

The paper introduces MiniOpt, a reinforcement learning framework for solving general optimization problems with limited resources. MiniOpt employs a 'reasoning-to-model-and-solve' paradigm, decomposing optimization into structured modeling and solver generation, and introduces OptReward, a hierarchical reward function for joint formulation-solution evaluation. The method includes an optimization-oriented policy optimization strategy to enhance exploration efficiency. Experiments demonstrate that MiniOpt-3B achieves strong optimization generalization across diverse problem types, with the highest average solving accuracy (SA) for models under 10B parameters and competitive performance for larger models.

reinforcement learningoptimization generalizationhierarchical rewardpolicy optimizationlanguage models

Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations

arXiv cs.LG · Hiroki Tamba · 2026-06-24

This work identifies a critical reproducibility gap in LLM-as-judge safety evaluations, demonstrating that setting temperature=0 does not fully eliminate pass/fail flips in grading. The author analyzes 690 API calls across two providers, three model tiers, and five sampling configurations using Japan AISI's aisev codebase, finding that 1-2 of 7 borderline items remain non-reproducible even under forced greedy decoding (top_k=1). The study also shows that the default temperature=1.0 leads to ~50% per-item disagreement across runs and notes that Claude Opus 4.7/4.8 deprecate temperature control. The author recommends treating grader disagreement as a first-class metric and releases a reproduction harness.

llm-as-judgetemperature controlreproducibilitygreedy decodingsafety evaluation
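
Reporting grader disagreement as a metric can be as simple as counting the items whose verdict is not unanimous across repeated runs. A minimal sketch with made-up verdicts (the item names and results below are illustrative only, not data from the paper):

```python
def flip_rate(verdicts_by_item):
    """Fraction of items whose pass/fail verdict differs across repeated runs."""
    flipped = sorted(
        item for item, verdicts in verdicts_by_item.items() if len(set(verdicts)) > 1
    )
    return len(flipped) / len(verdicts_by_item), flipped


# Hypothetical verdicts from three repeated grading runs per item.
runs = {
    "item-1": ["pass", "pass", "pass"],
    "item-2": ["fail", "fail", "fail"],
    "item-3": ["pass", "fail", "pass"],
    "item-4": ["fail", "fail", "pass"],
}
rate, unstable = flip_rate(runs)
```

Listing the unstable items alongside the rate makes it clear which borderline cases need human review instead of hiding them inside an aggregate pass rate.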

LiMoDE: Rethinking Lifelong Robot Manipulation from a Mixture-of-Dynamic-Experts Perspective

arXiv cs.LG · Zhihao Gu, Lin Wang · 2026-06-24

The paper introduces LiMoDE, a two-stage lifelong learning framework for robot manipulation that combines a dynamic Mixture-of-Experts (MoE) architecture with lifelong adaptation. In the pre-training stage, heterogeneous experts are dynamically activated based on motion information for multi-task learning. During adaptation, frozen experts are combined with newly learned ones for task transfer. Evaluations on simulated and real-world benchmarks show LiMoDE achieves superior performance with moderate parameter and computational overhead compared to prior methods.

lifelong learningmixture-of-expertsrobot manipulationtask adaptationdynamic architecture

KG-TRACE: A Neuro-Symbolic Framework for Mechanistic Grounding in Antimicrobial Resistance Prediction

arXiv cs.LG · Naman Garg, Sarika Jain, Sourav Yadav, Bharat K. Bhargava · 2026-06-24

KG-TRACE introduces a neuro-symbolic framework for antimicrobial resistance (AMR) prediction that integrates the WHO mutation knowledge graph with neural genomic models via a learned epistemic trust gate. The method combines RotatE-based KG embeddings with genomic features, dynamically balancing neural evidence against symbolic biological knowledge. Evaluated on the CRyPTIC M. tuberculosis cohort, KG-TRACE achieves 0.9760 AUROC for isoniazid while providing 92.5% symbolic coverage of predictions, as measured by the novel Biological Grounding Ratio (BGR) metric, and flags uncertain cases for clinical follow-up.

neuro-symbolicknowledge graphrotate embeddingsantimicrobial resistancebiological grounding ratio

📰 Industry Media (14)

Perplexity Launches Computer for Counsel: A Multi-Model Agentic Layer for Legal Workflows

MarkTechPost · Michal Sutter · 2026-06-26

Perplexity launched Computer for Counsel, a multi-model agentic system for legal workflows that orchestrates 20+ frontier AI models via the Model Context Protocol (MCP). The system decomposes legal tasks (e.g., NDA review, regulatory monitoring) into subtasks, routes each to specialized models and data sources (Midpage, Docusign), and synthesizes outputs with verifiable citations. Enterprise deployments show automated document drafting, compliance tracking, and case research with 400+ tool integrations while preserving attorney oversight through source-linked outputs.

agentic workflowmulti-model orchestrationmodel context protocollegal citation verificationtask decomposition

OpenAI Previews GPT-5.6 With Sol, Terra, and Luna: Tiered Models, New Reasoning Modes, Limited Access

MarkTechPost · Michal Sutter · 2026-06-26

OpenAI introduces GPT-5.6, a tiered model family comprising Sol, Terra, and Luna, each optimized for intelligence, cost, and speed. Sol achieves state-of-the-art performance on Terminal-Bench 2.1 (91.91% in ultra mode) and excels in long-horizon tasks like genomics analysis and cybersecurity. Two new reasoning modes, max and ultra, enhance deep reasoning and parallel subagent coordination, respectively. Pricing varies by tier, with Sol priced at $5/$30 per million tokens for input/output. Limited access is granted to trusted partners, with broader availability planned. Safety measures and token efficiency improvements are emphasized, though latency details remain undisclosed.

tiered modelsreasoning modesterminal-benchsubagent coordinationtoken efficiency

Meet container: Apple’s Open-Source Swift Tool for Running Linux Containers as Lightweight VMs on Apple Silicon

MarkTechPost · Asif Razzaq · 2026-06-26

Apple introduces 'container', an open-source Swift CLI tool for running Linux containers as lightweight virtual machines on Apple silicon. The tool leverages macOS frameworks like Virtualization and vmnet, employing per-container VM isolation for enhanced security and privacy. It supports OCI-compatible images, enabling seamless integration with Docker Hub and GitHub Container Registry. Default resource allocation is 1 GiB RAM and 4 CPUs, configurable per run. Performance benchmarks indicate reduced memory usage and comparable boot times to shared VM containers. Limitations include partial memory ballooning support and networking restrictions on macOS 15. The tool is licensed under Apache 2.0.

oci-compatiblevirtualizationvmnetmemory ballooningapple silicon

Build a Nanobot-Style AI Agent in Google Colab with Tool Calling, Session Memory, Skills, and MCP Servers

MarkTechPost · Sana Hassan · 2026-06-26

The tutorial presents a modular framework for constructing lightweight AI agents in Google Colab, implementing core components including tool calling, session memory, and skill integration. It introduces a provider abstraction layer compatible with OpenAI API and mock LLMs, demonstrating deterministic tool invocation via regex-based pattern matching (e.g., 92% accuracy in math expression detection). The architecture features a token-budgeted memory system, MCP-style tool servers, and type-hint-derived JSON schema generation for tool registration.

tool callingsession memoryprovider abstractionjson schema generationdeterministic mocking
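
Type-hint-derived schema generation can be sketched with the standard library alone. The helper below is a hypothetical illustration, not the tutorial's code: it maps a function's annotations to a JSON-Schema-like dict and marks parameters without defaults as required.

```python
import inspect
from typing import get_type_hints

# Mapping from Python annotations to JSON Schema type names (deliberately small).
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def tool_schema(func):
    """Build a tool description from a function's signature and type hints."""
    hints = get_type_hints(func)
    params = inspect.signature(func).parameters
    properties = {
        # Unannotated or unknown types fall back to "string".
        name: {"type": _JSON_TYPES.get(hints.get(name), "string")}
        for name in params
    }
    required = [
        name for name, p in params.items() if p.default is inspect.Parameter.empty
    ]
    return {
        "name": func.__name__,
        "description": (func.__doc__ or "").strip(),
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


def add(a: int, b: int = 0) -> int:
    """Add two integers."""
    return a + b


schema = tool_schema(add)
```

Deriving the schema from the signature keeps the tool registry and the implementation from drifting apart, at the cost of supporting only the types listed in the mapping.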

DeepReinforce Releases Ornith-1.0: An Open-Source Coding Model Family That Learns Its Own RL Scaffolds

MarkTechPost · Asif Razzaq · 2026-06-25

DeepReinforce introduces Ornith-1.0, an open-source model family for agentic coding that learns its own reinforcement learning scaffolds. The models, ranging from 9B dense to 397B mixture-of-experts, are post-trained on Gemma 4 and Qwen 3.5 and jointly optimize harness and solution during RL. Ornith-1.0-397B outperforms Claude Opus 4.7 on benchmarks but falls short of Opus 4.8 and GLM-5.2-744B. Three defense layers—fixed trust boundary, deterministic monitor, and frozen LLM judge—prevent reward hacking. The models support FP8 and GGUF builds for efficient local serving.

reinforcement learningmixture-of-expertsreward hackingagentic codingscaffold learning

Baidu Releases Unlimited OCR, a 3B Model That Keeps the KV Cache Flat for Long-Document Parsing

MarkTechPost · Asif Razzaq · 2026-06-25

Baidu introduces Unlimited OCR, a 3B-parameter Mixture-of-Experts model for long-document parsing that maintains constant KV-cache memory via Reference Sliding Window Attention (R-SWA). The model activates only 500M parameters during inference, replacing traditional decoder attention with R-SWA to cap memory usage at L_m + n, where L_m represents reference tokens and n is the sliding window size (default 128). Built on DeepSeek OCR via continued training, it achieves 93.23 on OmniDocBench v1.5, outperforming DeepSeek OCR by 6.22 points. The DeepEncoder compresses input to 256 visual tokens per 1024×1024 page, supporting both multi-page ('Base') and single-page ('Gundam') resolutions.

kv-cache · mixture-of-experts · reference sliding window attention · deepencoder · omnidocbench

Gradium Launches stt-translate and s2s-translate, Real-Time Speech Translation Models Beating gpt-realtime-translate on Accuracy and Latency

MarkTechPost · Asif Razzaq · 2026-06-24

Gradium introduces stt-translate and s2s-translate, two real-time speech translation models that outperform gpt-realtime-translate and gemini-3.5-live-translate in accuracy-latency tradeoffs. The models employ a single-pass architecture (Hibiki-Zero framework) for speech-to-text and translation, reducing the traditional 3-model cascade to 2. Evaluated on BLEU and MetricX using a proprietary dataset, stt-translate achieves higher BLEU scores than both competitors at a latency of 3.0s, compared with 3.6s for gpt-realtime-translate and 2.9s for gemini-3.5-live-translate, and supports voice cloning. Supported languages include EN, FR, DE, ES, and PT, which gives 20 directed language pairs (5 × 4).

real-time translation · hibiki-zero · bleu · metricx · voice cloning

How to Design an OpenHarness Style Agent Runtime with Tools, Memory, Permissions, Skills, and Multi-Agent Coordination

MarkTechPost · Sana Hassan · 2026-06-24

The article presents OpenHarness, a modular framework for constructing AI agent systems with tool use, memory, permissions, and multi-agent coordination. It details core components including typed tool schemas (via Python dataclasses), permission policies (READ/WRITE/EXECUTE/META), cost tracking, and lifecycle hooks. The implementation demonstrates tool execution flows with input validation, JSON schema generation, and async runtime management. Key innovations include path-based permission rules, token-based cost metering, and a virtual filesystem for deterministic testing.
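
A minimal sketch of how a dataclass-typed tool input, path-based permission rules, and a virtual filesystem could fit together. All names here (`WriteFileInput`, `PathRule`, `RULES`, `is_allowed`, `validate`, `write_file`) and the example paths are hypothetical; only the four permission levels come from the summary, and the first-match-wins rule order is an assumption.

```python
from dataclasses import dataclass, fields
from enum import Enum
from fnmatch import fnmatchcase


class Permission(Enum):
    READ = "read"
    WRITE = "write"
    EXECUTE = "execute"
    META = "meta"


@dataclass
class WriteFileInput:
    """Typed input schema for a hypothetical write_file tool."""
    path: str
    content: str


@dataclass(frozen=True)
class PathRule:
    pattern: str        # glob pattern matched against the requested path
    allowed: frozenset  # permissions granted when the pattern matches


# Assumed policy: first matching rule wins, unmatched paths are denied.
RULES = [
    PathRule("/workspace/secrets/*", frozenset()),
    PathRule("/workspace/*", frozenset({Permission.READ, Permission.WRITE})),
]


def is_allowed(path, needed, rules=RULES):
    for rule in rules:
        if fnmatchcase(path, rule.pattern):
            return needed in rule.allowed
    return False


def validate(cls, raw):
    """Check that every dataclass field is present with the declared type."""
    for f in fields(cls):
        if f.name not in raw:
            raise ValueError(f"missing field: {f.name}")
        if not isinstance(raw[f.name], f.type):
            raise TypeError(f"{f.name} must be {f.type.__name__}")
    return cls(**raw)


def write_file(raw, vfs):
    """Validate input, check permissions, then write to the in-memory filesystem."""
    args = validate(WriteFileInput, raw)
    if not is_allowed(args.path, Permission.WRITE):
        raise PermissionError(f"WRITE denied for {args.path}")
    vfs[args.path] = args.content
    return len(args.content)
```

Passing a plain `dict` as `vfs` keeps the tool deterministic in tests: `write_file({"path": "/workspace/a.txt", "content": "hi"}, vfs)` returns `2` and touches nothing on disk, while a path under `/workspace/secrets/` raises `PermissionError`.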

openharness · agent runtime · tool schemas · permission policies · cost tracking

Using Graphify and NetworkX to Map Python Codebase Structure with God Nodes, Communities, and Architecture Visualizations

MarkTechPost · Sana Hassan · 2026-06-24

The tutorial presents a Graphify-based workflow for analyzing the structure of Python codebases as graphs, extracting architectural patterns entirely offline. Using tree-sitter parsing, it builds a knowledge graph from a multi-module Python application (configuration, database, and authentication layers) and analyzes it with NetworkX: centrality metrics (degree centrality with a 0.7 threshold), community detection (Louvain method), and path tracing. The analysis identifies god nodes (e.g., a Settings class with betweenness centrality 0.82) and module interdependencies. The resulting graph, with 94 nodes and 137 edges, is rendered both statically (matplotlib) and interactively (pyvis). The method requires no LLM, so the whole codebase exploration runs offline.
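
A standard-library sketch of the god-node idea, under stated substitutions: Python's `ast` module stands in for tree-sitter, and a hand-written degree centrality stands in for NetworkX (using the same normalization as `networkx.degree_centrality`, degree divided by n - 1). The four modules in `SOURCES` are hypothetical, and treating 0.7 as the cutoff for a god node is an interpretation of the summary.

```python
import ast
from collections import defaultdict

# Hypothetical four-module application, kept in memory for the example.
SOURCES = {
    "settings": "DEBUG = True\n",
    "database": "import settings\n",
    "auth": "import settings\nimport database\n",
    "api": "import settings\nimport auth\n",
}


def import_edges(sources):
    """Return undirected edges between modules that import one another."""
    edges = set()
    for module, code in sources.items():
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module]
            else:
                continue
            for target in targets:
                # Keep only imports of modules that belong to the codebase.
                if target in sources and target != module:
                    edges.add(frozenset((module, target)))
    return edges


def degree_centrality(nodes, edges):
    """Degree divided by (n - 1), the normalization NetworkX uses."""
    degree = defaultdict(int)
    for edge in edges:
        for node in edge:
            degree[node] += 1
    scale = len(nodes) - 1
    return {node: degree[node] / scale for node in nodes}


edges = import_edges(SOURCES)
centrality = degree_centrality(list(SOURCES), edges)
god_nodes = sorted(n for n, c in centrality.items() if c >= 0.7)
```

Here `settings` and `auth` each touch all three other modules (centrality 1.0) and are flagged, while `database` and `api` sit at 2/3 and fall below the cutoff.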

knowledge graph · centrality metrics · tree-sitter · louvain method · god nodes

Nous Research Adds /learn to Hermes Agent’s Skills System, Capturing Workflows as Slash Commands Without Hand-Writing SKILL.md

MarkTechPost · Asif Razzaq · 2026-06-24

Nous Research enhances Hermes Agent's Skills System with /learn, enabling automated skill creation from diverse sources without manual SKILL.md authoring. The method leverages existing agent tools (read_file, web_extract) to process inputs (directories, URLs, conversations) into standardized skills following the agentskills.io format. Skills employ progressive disclosure (3-level loading) to optimize token usage, with /learn-generated skills automatically becoming slash commands. The system supports four creation methods (manual, agent-generated, Skills Hub, /learn) and maintains skills in ~/.hermes/skills/ as the single source of truth.
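
Progressive disclosure can be sketched as a loader that returns more of a skill only when asked. The three levels assumed here are: metadata only, the full SKILL.md body, and a referenced file on demand. That split, the `load_skill` function, and the in-memory `SKILL_FILES` contents are illustrative assumptions, not the Hermes Agent API.

```python
# Hypothetical in-memory stand-in for one skill directory under ~/.hermes/skills/.
SKILL_FILES = {
    "SKILL.md": (
        "---\n"
        "name: deploy\n"
        "description: Deploy the service to staging\n"
        "---\n"
        "# Steps\n"
        "1. Run the checks.\n"
        "2. See references/rollback.md if a step fails.\n"
    ),
    "references/rollback.md": "Revert to the previous release tag.\n",
}


def split_frontmatter(text):
    """Split a SKILL.md string into (metadata dict, body)."""
    _, header, body = text.split("---\n", 2)
    meta = dict(line.split(": ", 1) for line in header.strip().splitlines())
    return meta, body


def load_skill(files, level, resource=None):
    """Return only as much of the skill as the requested level needs."""
    meta, body = split_frontmatter(files["SKILL.md"])
    if level == 1:   # always in context: one short line per skill
        return f"{meta['name']}: {meta['description']}"
    if level == 2:   # loaded when the skill is actually invoked
        return body
    if level == 3:   # loaded only when the body points at a resource
        return files[resource]
    raise ValueError(f"unknown level: {level}")


summary = load_skill(SKILL_FILES, 1)
instructions = load_skill(SKILL_FILES, 2)
rollback = load_skill(SKILL_FILES, 3, resource="references/rollback.md")
```

The token saving comes from level 1: with many skills installed, only the one-line summaries stay in context, and the longer `instructions` and `rollback` texts are fetched when a task needs them.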

hermes agent · skills system · progressive disclosure · agentskills.io · slash commands

SAP aligns commerce data for AI personalisation

AI News · Ryan Daws · 2026-06-26

SAP introduces the 'Advanced Success Plan' to enable AI-driven personalisation in commerce by addressing fragmented data structures across three operational layers: data aggregation, decisioning, and delivery. The method integrates SAP Commerce Cloud and SAP Engagement Cloud to automate personalised recommendations, send-time optimisation, and cross-channel communications using real-time behavioural data. Results include improved conversion rates, higher average order values, and enhanced customer engagement metrics, validated through outcome-based governance models and continuous improvement frameworks.

ai-driven personalisation · sap commerce cloud · send-time optimisation · behavioural data · outcome-based governance

The math behind the OpenAI Jalapeño chip

AI News · Dashveenjit Kaur · 2026-06-25

OpenAI introduces the Jalapeño chip, an application-specific integrated circuit (ASIC) optimized for large language model (LLM) inference, developed in collaboration with Broadcom. The architecture minimizes data movement, integrates Broadcom’s Tomahawk networking silicon, and targets GPT-5.3-Codex-Spark workloads. OpenAI’s vertical integration strategy spans chip design, software kernels, and network scheduling, aiming to reduce operational costs projected at $14 billion annually. The chip transitioned from design to manufacturing tape-out in nine months, leveraging OpenAI’s own LLMs for hardware optimization. Initial deployment is scheduled for 2026, scaling with Microsoft’s infrastructure.

asics · llm inference · tomahawk silicon · vertical integration · tape-out

Samsung opens ChatGPT Enterprise and Codex access after AI restrictions

AI News · Muhammad Zulhusni · 2026-06-24

Samsung Electronics has expanded employee access to OpenAI's ChatGPT Enterprise and Codex, reversing previous restrictions due to data-security concerns. The deployment covers all Korean employees and global Device eXperience division staff, enabling use cases in software development, marketing, and manufacturing. ChatGPT Enterprise provides enhanced data protection and access controls, while Codex supports both technical (code review, debugging) and non-technical (workflow automation) tasks. OpenAI reports 5M+ weekly Codex users, with 800% growth in Korea since February 2026. The partnership aligns with Samsung's role as a memory supplier for OpenAI's Stargate AI infrastructure, projected to require 900K DRAM wafers monthly.

chatgpt enterprise · codex · dram wafers · stargate project · in-context learning

Anthropic drops ‘workplace AI agents’ directly inside Slack

AI News · Dashveenjit Kaur · 2026-06-24

Anthropic introduces Claude Tag, a beta feature integrating its Opus 4.8-powered AI agent directly into Slack channels for enterprise teams, enabling asynchronous task delegation and context tracking within group threads. The system autonomously monitors threads, prioritizes notifications, and accesses corporate databases while maintaining channel-specific security controls. Early adoption data shows 34.4% enterprise penetration, with internal reports indicating 65% of code generation automated via Claude Tag, though trade-offs exist in governance and data exposure risks.

claude tag · opus 4.8 · asynchronous execution · context tracking · enterprise adoption


Generated automatically at 2026-06-26 21:07 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.