Daily Digest — 2026-07-04

Friday, July 03, 2026 · 307 items · model: deepseek/deepseek-chat

307 items · 307 arxiv papers

⚠️ Source issues today:
  • MarkTechPost: all feed URLs failed (last tried: https://www.marktechpost.com/feed/)
  • AI News: all feed URLs failed (last tried: https://artificialintelligence-news.com/feed/)

🏛️ Research Labs

No new items today.

📜 arXiv Papers (307)

Distributed Attacks in Persistent-State AI Control

arXiv cs.AI · Josh Hills, Ida Caspary, Asa Cooper Stickland · 2026-07-02

The paper introduces Iterative VibeCoding, a benchmark for studying distributed attacks in persistent-state AI control, where coding agents iteratively modify codebases to execute covert side tasks. Using Claude Sonnet 4.5 as the attack agent and GPT-4o as the monitor, the authors compare gradual (distributed) and non-gradual (single-PR) attacks across 20 CLI and Flask task variations. Results show no single monitor robust to both strategies (evasion ≥65%), with evasion generalizing across model backends. A stateful link-tracker monitor reduces gradual-attack evasion from 93% to 47% when combined in a four-monitor ensemble.

iterative vibe-codingpersistent-state attacksgradual attacksai controlstateful monitoring

LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning

arXiv cs.AI · Matteo Boglioni, Thibault Rousset, Siva Reddy, Marius Mosbach · 2026-07-02

The paper introduces LACUNA, a testbed for evaluating parameter-level localization precision in LLM unlearning, addressing the gap in existing benchmarks that only assess output-level performance. Using OLMo-based models (1B and 7B parameters), LACUNA injects synthetic PII via masked continual pretraining to establish ground-truth knowledge localization. Results show current SOTA unlearning methods are imprecise and vulnerable to resurfacing attacks, while successful localization enables robust erasure even with simple gradient-based methods. The authors release LACUNA to advance localization-based unlearning research.

llm unlearningparameter localizationresurfacing attacksmasked continual pretrainingolmo-based models

Program-as-Weights: A Programming Paradigm for Fuzzy Functions

arXiv cs.AI · Wentao Zhang, Liliana Hotsko, Woojeong Kim, Pengyu Nie · 2026-07-02

The paper introduces Program-as-Weights (PAW), a paradigm for compiling natural-language function specifications into compact neural artifacts. PAW employs a 4B-parameter compiler trained on FuzzyBench (10M examples) to generate parameter-efficient adapters for a frozen 0.6B Qwen3 interpreter. Results show PAW matches Qwen3-32B's performance while using 1/50th inference memory and achieving 30 tokens/s on an M3 MacBook, reframing foundation models as tool builders that emit reusable, locally-executable functions.

fuzzy-function programmingparameter-efficient adaptersneural artifactnatural-language specificationfoundation model

Online Safety Monitoring for LLMs

arXiv cs.AI · Mona Schirmer, Metod Jazbec, Alexander Timans, Christian Naesseth · 2026-07-02

The paper proposes a real-time safety monitor for large language models (LLMs) that thresholds an external verifier signal, with thresholds calibrated via risk control. This simple design outperforms sequential hypothesis testing-based monitors in experiments on mathematical reasoning and red teaming datasets. Results demonstrate competitive performance in detecting unsafe LLM outputs during deployment.

llm safetyonline monitoringrisk controlverifier signalthreshold calibration

ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning

arXiv cs.AI · Yanjun Zhao, Ruizhong Qiu, Tianxin Wei, Yuanchen Bei · 2026-07-02

The paper introduces RECONTEXT, a training-free inference method for enhancing long-context reasoning in LLMs by recursively replaying query-conditioned evidence. The approach constructs an evidence pool using model-internal relevance signals, separates evidence organization from answer generation, and preserves the full original context without pruning or external memory. Evaluated on eight datasets with 128K context length, RECONTEXT improves evidence utilization across Qwen3-4B, Qwen3-8B, and Llama3-8B, achieving the best average rank on all three backbones.

long-context reasoningevidence replayassociative memoryinference methodattention mechanism

What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates

arXiv cs.AI · Arman Ghaffarizadeh, Danyal Mohaddes, Aliakbar Izadkhah, Shahriar Noroozizadeh · 2026-07-02

The study investigates how social structure influences LLM agent behavior by introducing a dual-channel debate framework that separates public utterances from off-the-record (OTR) responses. Using 10 models across 3 scenarios with 5 variations each, the authors measure divergence between public and OTR expressions through stance analysis, semantic similarity, natural language inference, and survey responses. Results show alignment-inducing settings increase public-OTR divergence from ~3% to 40%, with OTR responses often citing relational pressures as the cause of public accommodation, suggesting the need for evaluation frameworks that detect emergent objectives beyond explicit prompts.

llm agentssocial structuredual-channel frameworkemergent objectivesalignment divergence

Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

arXiv cs.AI · Yuxuan Li, Lingxi Xie, Xinyue Huo, Jihao Qiu · 2026-07-02

The paper introduces DramaSR-532K, a large-scale benchmark with 532K annotated dialogue lines across 900+ characters, requiring multimodal integration for speaker recognition in TV dramas. It proposes DramaSR-LRM, a reasoning-based model that autonomously aggregates contextual evidence via multimodal tool-use to synthesize diverse inputs. Experiments show DramaSR-LRM significantly outperforms baselines, especially on short utterances where acoustic biometrics fail. Data and code are publicly available.

speaker recognitionmultimodal integrationreasoning modelbenchmark datasetlong-form video understanding

DemoPSD: Disagreement-Modulated Policy Self-Distillation

arXiv cs.AI · Yunhe Li, Hao Shi, Wenhao Liu, Mengzhe Ruan · 2026-07-02

DemoPSD introduces a disagreement-modulated policy self-distillation framework to address privileged information leakage and exploration suppression in on-policy self-distillation (OPSD) for large language models. By steering the student toward a reverse-KL barycenter target—a weighted geometric combination of teacher and student distributions—DemoPSD adaptively controls blending at each token position based on distribution discrepancies. The method provably achieves leakage attenuation and exploration preservation. Experiments on SciKnowEval across four scientific fields demonstrate that DemoPSD outperforms GRPO and SDPO, maintaining higher training entropy and robust generalization to out-of-distribution GPQA benchmarks.

self-distillationreverse-kl barycenterprivileged information leakageexploration preservationtoken-level supervision

Beyond Adam: SOAP and Muon for Faster, Label-Efficient Training of Machine Learning Interatomic Potentials

arXiv cs.AI · Gil Harari, Yoel Zimmermann, Ola Tangen Kulseng, Laura Zichi · 2026-07-02

The study demonstrates that matrix-structured optimizers (SOAP, Muon, and hybrid SOAP-Muon) outperform Adam in training machine learning interatomic potentials (MLIPs), specifically NequIP and Allegro models. These optimizers achieve faster convergence and higher final accuracy, with SOAP and SOAP-Muon showing robust performance, while Muon offers partial improvements. The gains are most significant under partial force supervision, highlighting optimizer choice as a critical but underexplored factor in MLIP training.

machine learning interatomic potentialsmatrix-structured optimizersnequipallegropartial force supervision

G-RRM: Guiding Symbolic Solvers with Recurrent Reasoning Models

arXiv cs.AI · Timo Bertram, Sidhant Bhavnani, Richard Freinschlag, Erich Kobler · 2026-07-02

The paper introduces G-RRM, a neuro-symbolic method combining Symbol-Equivariant Recurrent Reasoning Models (SE-RRMs) with classical symbolic solvers (e.g., backtracking, Glucose 4.1, CaDiCaL 3.0.0) for constraint satisfaction problems. SE-RRMs generate solution proposals to guide solvers, improving efficiency when two conditions hold: (1) problems have expansive combinatorial search spaces, and (2) solvers dynamically overwrite branching choices. Results show median speedups of 33.3× (backtracking) and 1.70× (Glucose 4.1) on 9×9 Sudoku (91.1% accuracy), with Glucose maintaining 1.17× speedup on 25×25 grids. CaDiCaL 3.0.0 shows no significant improvement due to fixed hint adherence.

neuro-symbolicrecurrent reasoning modelsconstraint satisfactionsymbolic solverscombinatorial search

Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning

arXiv cs.AI · Xuehui Wang, Xuankun Yang, Wei Shen · 2026-07-02

The paper proposes Entropy-Aware Dense Pruning (EADP), a framework addressing two bottlenecks in visual token pruning for VLMs: textual noise dispersion and feature fragmentation. EADP employs statistical entropy to filter textual noise and formulates token selection as submodular maximization with a spatial prior, ensuring non-redundant visual representations. Experiments show EADP improves accuracy-efficiency trade-offs, preserving fine-grained cues under strict token budgets and achieving state-of-the-art performance on multimodal benchmarks.

visual token pruningtextual noisesubmodular maximizationmultimodal benchmarksentropy-aware

TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution

arXiv cs.AI · Jiale Amber Wang, Kaiyuan Wang, Pengyu Nie · 2026-07-02

TestEvo-Bench introduces an executable benchmark for evaluating test and code co-evolution, addressing limitations of static metadata in existing benchmarks. It features two tracks: test generation and test update, anchored to real commit histories with execution-grounded metrics like pass rate, coverage, and mutation score. The benchmark is live, periodically updated to reduce data leakage, and currently includes 746 test generation and 509 test update tasks from 152 Java projects. Experiments with state-of-the-art agents (Claude Opus 4.7, Gemini 3.1 Pro, SWE-Agent) achieve up to 77.5% success rate on test generation and 74.6% on test update, though performance drops on recent tasks and under cost constraints.

test co-evolutionexecutable benchmarkcommit historyexecution-grounded metricsdata leakage

Human Capital, Not Model Benchmarks, Predicts Hybrid Intelligence in Forecasting

arXiv cs.AI · Vivienne Ming · 2026-07-02

The study demonstrates that human-AI hybrid forecasting performance exhibits trimodal distribution, with complementary reasoning by a minority of forecasters yielding accuracy surpassing both AI and prediction markets. Using Polymarket's real-money prediction data, the authors find that collaborative traits (perspective-taking, intellectual humility, curiosity) rather than cognitive ability or model benchmarks predict successful human-AI collaboration. Preliminary results show statistically robust patterns, with most participants either deferring to AI or performing worse than AI alone. The findings motivate a pre-registered replication study.

hybrid intelligenceprediction marketshuman capitalcomplementary reasoningtrimodal distribution

Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs

arXiv cs.AI · Junhao Shi, Siyin Wang, Xiaopeng Yu, Li Ji · 2026-07-02

The paper proposes Task-Agnostic Pretraining (TAP), a two-stage framework for Vision-Language-Action (VLA) models that decouples physical competence learning from semantic alignment. TAP first learns motor priors via self-supervised Inverse Dynamics on unlabeled interaction data (off-task trajectories, robot play), then grounds them in language with minimal expert data. On SIMPLER benchmark, TAP matches performance of models trained on 1M+ expert trajectories while using far less labeled data (10% absolute gain over behavior cloning). Real-world WidowX experiments show 25% success under camera perturbations where baselines fail completely, demonstrating robust transferable representations.

vision-language-action modelstask-agnostic pretraininginverse dynamicsmotor priorsembodied ai

OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers

arXiv cs.AI · Donghyun Lee, Jitesh Chavan, Duy Nguyen, Sam Huang · 2026-07-02

OrbitQuant introduces a data-agnostic weight-activation quantizer for diffusion transformers (DiTs), addressing activation shifts across timesteps, prompts, and guidance branches without per-checkpoint recalibration. The method employs a normalized, rotated basis via randomized permuted block-Hadamard (RPBH) rotation, concentrating coordinates around fixed marginals to enable a single Lloyd-Max codebook per input dimension. Weights are quantized offline, absorbing the rotation to cancel within linear layers. Evaluated on FLUX.1, Z-Image-Turbo, Wan 2.1, and CogVideoX, OrbitQuant achieves state-of-the-art post-training quantization results, including W2A4 for image DiTs with usable quality.

diffusion transformerspost-training quantizationrandomized permuted block-hadamardlloyd-max codebookweight-activation quantization

Neuron-Aware Data Selection for Annotation-Free LLM Self-Distillation

arXiv cs.AI · Zhuowei Chen, Xiang Lorraine Li · 2026-07-02

The paper introduces Neuron On-Policy Self-Distillation (Neuron-OPSD), an annotation-free self-distillation framework for LLMs that uses internal neuron activations to guide training-data selection and teacher context construction. The method performs on-policy distillation from teacher distributions without ground-truth labels, addressing limitations of prior approaches like out-of-domain degradation and calibration error inflation. Evaluations on specialized-domain benchmarks show improved in-domain performance while maintaining cross-domain generalization and mitigating calibration collapse compared to annotation-free baselines.

self-distillationneuron activationson-policy learningcalibration errorpseudo-labeling

EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments

arXiv cs.AI · Zhilin Wang, Han Song, Runzhe Zhan, Jusen Du · 2026-07-02

The paper introduces EvoPolicyGym, a benchmark for evaluating Autonomous Policy Evolution, where agents iteratively improve executable policies under fixed interaction budgets. The method employs compact interactive RL environments to assess how harness-model agents edit policies based on feedback, providing both aggregate performance metrics and trajectory-level diagnostics. Results show GPT-5.5 achieves top performance across all 16 environments, with analysis revealing that effective policy evolution depends on discovering task-appropriate mechanisms and refining policies within feedback constraints.

autonomous policy evolutioninteractive environmentsharness-model agentparametric tuningtrajectory-level diagnostics

Reasoning effort, not tool access, buys first-try reliability in agentic code generation: an observational study

arXiv cs.AI · Achint Mehta · 2026-07-02

The study challenges the assumption that additional capabilities improve agentic code generation, demonstrating that reasoning effort, not tool access, drives first-try reliability. Ninety agent runs built a real-time retrospective board across varying model generations, reasoning effort levels, and tool configurations, evaluated via a 14-criterion functional rubric (42-point max) and visual quality review. Frontier models achieved near-ceiling performance, while a local model scored 24–37 points; increasing reasoning effort from High to xHigh boosted perfect first-try runs from 28% to 89% and reduced corrective prompts 5×, whereas testing tools raised costs 42–68% without functional improvement. Design prompts improved visual quality (4.5 vs. 3.0 on a 5-point scale) but not functionality.

agentic code generationreasoning effortfunctional rubrictesting toolsdesign prompts

Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach

arXiv cs.AI · Manuel Alonso-Carracedo, Ruben Fernandez-Boullon, Pedro Celard, Francisco J. Rodriguez-Martinez · 2026-07-02

This study introduces a four-level cognitive taxonomy framework for evaluating the efficacy of Large Language Models (LLMs) in grading Linux/bash command responses, addressing scalability and reliability challenges in computing education. Four LLMs (GPT, Claude Opus, Gemini, and GLM) were tested using minimal baseline and rubric-enhanced prompts on 1200 student responses graded by expert instructors. Gemini~3.0 Pro with rubric-guided prompting achieved the highest human-AI agreement (ICC(3,1) = 0.888, MAE = 0.10, Bland-Altman bias = -0.014). Results indicate that rubric quality significantly impacts grading accuracy, with agreement declining as cognitive complexity increases, providing a transferable evaluation protocol and prompt templates for AI-assisted grading.

large language modelscognitive taxonomyrubric-enhanced promptinghuman-ai agreementlinux/bash

WorldSample: Closed-loop Real-robot RL with World Modelling

arXiv cs.AI · Yuquan Xue, Le Xu, Zeyi Liu, Zhenyu Wu · 2026-07-02

WorldSample introduces a closed-loop real-robot reinforcement learning framework combining physical rollouts, world-model generation, and policy improvement to reduce interaction costs. The method employs a post-trained world model for high-fidelity synthetic transitions and Policy-Paced Learning (PPL) to regulate training via sample selection and scheduling, mitigating visual hallucination. Experiments on contact-rich manipulation tasks show a 28% success rate improvement, 59% training step reduction, and 19.4dB PSNR/0.47 SSIM gains in world model fidelity over demonstration-only baselines.

reinforcement learningworld modeldata augmentationpolicy-paced learningreal-robot rl

QFedAgent: Quantum-Enhanced Personalized Federated Learning for Multi-Agent Activity Recognition

arXiv cs.AI · Quoc Bao Phan, Tuy Tan Nguyen · 2026-07-02

QFedAgent introduces a quantum-enhanced personalized federated learning framework for multi-agent activity recognition, addressing challenges of heterogeneous non-IID sensor data in privacy-sensitive robotic applications. The method employs variational quantum circuits to model accelerometer-gyroscope interactions via quantum state encoding and entanglement, reducing fusion parameters from 33K to 72 (10x reduction) compared to classical MLP fusion. On the OPPORTUNITY dataset under non-IID partitions, it achieves 97.7% mean test accuracy while maintaining competitiveness with conventional federated baselines.

federated learningquantum circuitnon-iid datamulti-agent systemsactivity recognition

Neuron-Aware Active Few-Shot Learning for LLMs

arXiv cs.AI · Zhuowei Chen, Liwei Chen, Christian Schunn, Raquel Coelho · 2026-07-02

The paper introduces NeuFS, a Neuron-Aware Active Few-Shot Learning framework that improves sample selection for LLM adaptation by leveraging internal neuron activation patterns instead of output-level signals. NeuFS employs a dual-criteria strategy combining neuron pattern diversity for broad coverage and neuron consensus to identify challenging samples prone to hallucination. Experiments on three datasets show NeuFS outperforms existing AFSL methods in reasoning and text classification tasks, with ablations confirming neuron activations provide superior selection signals to external embeddings.

active few-shot learningneuron activation patternsllm adaptationsample selectionhallucination detection

Text-Driven 3D Indoor Scene Synthesis in Non-Manhattan Environments

arXiv cs.AI · Xianhui Meng, Zirui Song, Yuchen Zhang, Li Zhang · 2026-07-02

SPG-Layout introduces a text-driven framework for 3D indoor scene synthesis in non-Manhattan environments, addressing limitations of existing methods in modeling non-orthogonal spatial relationships. The method combines statistical priors of object distributions with a hierarchical layout strategy, prioritizing large objects to minimize geometric violations. Evaluated on a new benchmark of 500 non-Manhattan environments, SPG-Layout outperforms existing methods in both Manhattan and non-Manhattan settings, achieving higher physical plausibility and semantic realism.

3d scene synthesisnon-manhattan environmentsstatistical priorshierarchical layoutphysical plausibility

ACID: Action Consistency via Inverse Dynamics for Planning with World Models

arXiv cs.AI · Gawon Seo, Dongwon Kim, Suha Kwak · 2026-07-02

ACID introduces cycle action consistency for decision-time planning with action-conditioned world models, addressing the unchecked realizability of intermediate transitions in standard planning. The method enforces consistency by comparing forward-predicted actions with those inferred backward via an inverse dynamics model, incorporating this residual into the planning cost through a scale-invariant adaptive weight. Evaluated across four world models and six tasks (rigid/deformable manipulation, articulated control, visual navigation), ACID improves planning accuracy and matches baseline performance with significantly reduced compute.

decision-time planningaction-conditioned world modelsinverse dynamicscycle consistencyscale-invariant cost

Fast Multi-dimensional Refusal Subspaces via RFM-AGOP

arXiv cs.AI · Thomas Winninger · 2026-07-02

The paper introduces RFM-AGOP, a computationally efficient method for identifying multi-dimensional refusal subspaces in LLMs, addressing limitations of prior linear approaches. By adapting the Recursive Feature Machine (RFM) algorithm with probe-informed initialization, the method achieves subspace extraction in seconds on both reasoning (Qwen 3) and non-reasoning (Qwen 2.5) models. Results demonstrate superior performance in ablation tasks compared to alternatives, suggesting RFM-AGOP as a scalable complement to existing subspace-extraction techniques.

recursive feature machinerefusal subspaceslarge language modelsactivation steeringmulti-dimensional behavior

Steerability via constraints: a substrate for scalable oversight of coding agents

arXiv cs.AI · Thomas Winninger · 2026-07-02

The paper proposes constraint-based oversight as a scalable alternative to agentic scaffolding for coding agents, demonstrating that traditional software engineering practices (access control, network policies, coding conventions) transfer effectively to AI systems. The authors implement a constrained oversight system where a small reviewer model (Gemma 4B) audits a Python codebase with 11 backdoors, showing that combining constraint substrates with lightweight tooling (~200-LoC docs CLI) improves backdoor recall from 54.5% to 90.9%. Results suggest oversight gains are most pronounced in weakly-typed languages like Python, with principles generalizable to stronger-typed alternatives like Rust.

coding agentsscalable oversightconstraint substratebackdoor recallagentic scaffolding

Hardware-Enforced Semantic Coordination for Safety-Critical Real-Time Autonomous Systems

arXiv cs.AI · Uwe M. Borghoff, Paolo Bottoni, Remo Pareschi · 2026-07-02

The paper proposes a hardware-enforced semantic coordination architecture for safety-critical autonomous systems, addressing limitations of software-mediated coordination in real-time deployments. The method implements selected coordination semantics from the Topic-Based Communication Space Petri Net (TB-CSPN) framework directly on FPGAs, separating semantic reasoning (software-driven) from deterministic interaction management (hardware-enforced). This approach ensures bounded latency, temporal synchronization, semantic gating, and authorization constraints through hardware-native primitives, while maintaining adaptive reasoning capabilities in software.

hardware-enforced coordinationtopic-based communication space petri netfield-programmable gate arrayssafety-critical systemsreal-time autonomy

DRIFTLENS: Measuring Memory-Induced Reasoning Drift in Personalized Language Models

arXiv cs.AI · Xi Fang, Weijie Xu, Yingqiang Ge, Yuhui Xu · 2026-07-02

The paper introduces DRIFTLENS, a framework for quantifying memory-induced reasoning drift in personalized language models, where stored user attributes alter reasoning trajectories without changing response plausibility. The method maps reasoning steps to value categories, comparing trajectories with and without injected memory, and distinguishes substantive drift from pragmatic noise. Experiments across four LLMs and 10 attribute categories show medium-to-large drift effects, while GRPO and DPO post-training methods reduce drift unevenly with model-dependent tradeoffs in capability and instruction following.

reasoning driftpersonalized language modelsvalue categoriespost-training methodsuser-attribute memory

VisionAId: An Offline-First Multimodal Android Assistant for People with Visual Impairment, Featuring Personalized Object Retrieval

arXiv cs.AI · Cristian-Gabriel Florea, Stelian Spînu · 2026-07-02

VisionAId introduces an offline-first Android assistant for visual impairment, featuring personalized object retrieval via few-shot learning. The system integrates six on-device models (metric depth estimation, instance segmentation, facial embeddings, face detection, banknote detection, and visual embeddings) through ONNX Runtime, with optional cloud-based Google Gemini Flash for scene description. Multimodal feedback includes AR markers, spatial audio, and haptics. On a Samsung Galaxy S21 Ultra, INT8 quantization reduces depth latency to 491 ms, banknote detection achieves mAP@50 of 0.986, and metric depth error remains below 1 cm within 3 m.

few-shot learningonnx runtimemetric depth estimationmultimodal feedbackint8 quantization

Understanding Agent-Based Patching of Compiler Missed Optimizations

arXiv cs.AI · Batu Guan, Zirui Wang, Shaohua Li · 2026-07-02

The paper investigates agent-based patching of compiler missed optimizations, focusing on generalization beyond individual cases. A benchmark of real-world LLVM missed optimization issues was constructed to evaluate agent-generated patches against developer patches in terms of optimization scope. Results indicate that coding agents often optimize given examples but frequently produce patches that partially align with developer-intended scope or generalize beyond it. Historical-knowledge augmentation techniques leveraging prior LLVM optimization pull requests through retrieval and distillation were introduced, demonstrating improved developer-aligned generalization and practical benefits on real-world intermediate representation (IR).

compiler optimizationsllvmagent-based patchinggeneralizationintermediate representation

World Wide Models: Literary Tools for Cultural AI

arXiv cs.AI · Nina Begus · 2026-07-02

The essay proposes literary methodologies as essential tools for developing culturally literate AI systems, addressing the monolingual bias in large language models (LLMs). It introduces a layered framework integrating comparative literature, narratology, and world literature approaches to analyze AI textuality through macrostructure, circulation patterns, and untranslatability. The work bridges critical theory with AI development, offering pluralistic interpretation strategies for global textual models.

large language modelscultural literacymonolingual biasworld literatureuntranslatability

The Dual Nature of LLM Persona: Aggregated Tendencies and Frame-Dependent Geometry

arXiv cs.AI · Yuan Yuan · 2026-07-02

The study establishes a dual-nature framework for LLM personas, distinguishing between frame-robust aggregated features (Big Five scores) and frame-dependent geometric features (SPD manifold structure). Using GPT-4o simulating American and Chinese-American personas, the authors analyze IPIP-50 responses under manipulated question orderings. Results show aggregated features degrade by 21% under randomization but remain frame-robust, while geometric features collapse by 42% under frame misalignment but recover to 84% under shared frames, surpassing aggregated feature recovery (76%). This reveals persona geometry as a frame-dependent coordination pattern encoding non-aggregatable information.

llm personasspd manifoldbig fiveframe-dependenceipip-50

Stable Self-Modulating Quantum Fast-Weight Programmers with Bounded Memory Gates

arXiv cs.AI · Kuo-Chung Peng, Jiun-Cheng Jiang, Chun-Hua Lin, Yifeng Peng · 2026-07-02

The paper introduces a bounded old-state modulation rule for Self-Modulating Quantum Fast-Weight Programmers (QFWPs), addressing divergence in long-sequence regimes while preserving performance. The method applies a sign-preserving tanh gate to the recurrent memory branch, leaving additive updates and new-update modulation unchanged. Evaluations on CUDA-Q quantum-dynamics forecasting and Milan SMS prediction tasks show that old-state modulation consistently improves Standard QFWP, with bounded gating enhancing robustness and preventing divergence. Unbounded Self-Modulating QFWP excels in longer input windows, closely matching the Only-Old ablation.

quantum fast-weight programmersself-modulatingbounded memory gatesquantum-dynamics forecastingrecurrent memory modulation

GAP-GDRNet: Geometry-Aware Monocular Visual Pose Sensing on a Single-Target Synthetic Spacecraft Dataset

arXiv cs.AI · Yonglong Zhang, Yang Liu · 2026-07-02

GAP-GDRNet introduces a geometry-aware attention-enhanced framework for monocular 6D pose estimation in spacecraft imagery, addressing challenges like weak texture and occlusion. The method extends GDR-Net by adding an attention-based feature refinement (AFR) module for global-local feature enhancement and a patch-level geometric self-attention (PGSA) module in Patch-PnP for geometric relation modeling. Training utilizes synthetic data from Blender, including target masks, dense coordinate maps, and pose labels. The approach specifically targets non-cooperative rendezvous scenarios with sparse geometric evidence.

6d pose estimationmonocular visionattention mechanismsspacecraft imagerysynthetic dataset

SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces

arXiv cs.AI · Jinwei Hu, Yi Dong, Youcheng Sun, Xiaowei Huang · 2026-07-02

SkillFuzz introduces a fuzzing-based approach for discovering implicit intents in LLM-based agent skill marketplaces, where benign skills interact to produce unintended behaviors. The method formulates skill compositions as test units, uses planning artifacts as intent proxies, and employs contract-guided Monte Carlo Tree Search to prioritize high-risk interactions without execution. Evaluation on marketplace workloads reveals 1,000+ distinct implicit intents, with 80% validation accuracy for high-risk cases, outperforming alternative search strategies in both coverage and severity detection.

skill compositionimplicit intentsmonte carlo tree searchdifferential oraclecontract-guided fuzzing

Self-Gating Attention for Efficient Time Series Forecasting

arXiv cs.AI · Dezheng Wang, Tong Chen, Wei Yuan, Congyan Chen · 2026-07-02

The paper proposes Self-Gating Attention (SGA), an efficient attention mechanism for time series forecasting that reduces quadratic complexity to linear. SGA combines a shared learnable matrix for common patterns with an input-dependent residual component, eliminating query/key projections. Evaluated on nine datasets (electricity, finance, weather, etc.), SGA maintains competitive forecasting performance while improving inference efficiency compared to standard self-attention and lightweight variants.

self-gating attentiontime series forecastinglinear complexityattention mechanismstemporal dependencies

SelectTSL: Prompt-Guided Selective Target Sound Localization in Complex Scenarios

arXiv cs.AI · Ziyang Jiang, Yu Chen, Zexu Pan, Xinyuan Qian · 2026-07-02

The paper introduces SelectTSL, an end-to-end architecture for prompt-guided selective target sound localization in complex acoustic scenes. The method employs a Prompt-Guided Selective Attention Module (PGSA) to generate prompt-informed embeddings, which guide an inter-channel phase difference (IPD) enhancer to refine spatial cues. This coupled design estimates direction of arrival (DoA) and target-source cardinality while focusing on user-specified targets. Experiments on synthetic and real-world data show superior performance and robust generalization compared to baselines.

target sound localizationprompt-guided attentioninter-channel phase differencedirection of arrivalend-to-end architecture

Grounded autonomous research: a fault-tolerant LLM pipeline from corpus to manuscript in frontier computational physics

arXiv cs.AI · Haonan Huang · 2026-07-02

The paper presents a fault-tolerant LLM pipeline for autonomous research in frontier computational physics, demonstrating end-to-end automation from corpus analysis to publication-grade manuscript generation. The system processes 11,083 arXiv papers, autonomously conceives research directions, calibrates methodologies via reference reproduction, conducts novel first-principles computations, and writes grounded manuscripts through 47 fresh-context sessions with 2,162 literature consultations. Key innovations include redundancy-based fault tolerance (fresh-context isolation, distributed grounding, adversarial review) and structurally enforced numerical confrontation at calibration checkpoints. The pipeline produces three substantive physics findings on altermagnetic piezomagnetism with bounded human intervention only at reproduction failures.

autonomous researchllm pipelinefault tolerancefirst-principles computationscalibration checkpoints

A Hippocampus for Linear Attention: An Exact Memory for What the Recurrent State Forgets

arXiv cs.AI · Wanyun Cui · 2026-07-02

HOLA (Hippocampal Linear Attention) introduces a hippocampal complement to linear-attention and state-space language models, addressing their lossy exact memory issue. It combines a compressive delta-rule state with a bounded exact KV cache, enabling semiparametric test-time memory: the state compresses linearly structured data, while the cache retains associations unsuitable for compression. Cache writes prioritize tokens with large prediction residuals, and a decoupled RMSNorm-gamma cache read ensures sharp retrieval. Trained on 15B SlimPajama tokens, HOLA reduces Wikitext perplexity from 27.32 to 22.92 (-16.1%) and improves LAMBADA perplexity from 30.95 to 30.26, outperforming full-attention Transformer++ and enhancing robustness on RULER needle-in-a-haystack recall up to 32k tokens.

linear-attentionkv cachedelta-rulesemiparametric memoryrmsnorm-gamma

Generalization in offline RL: The structure is more important than the amount of pessimism

arXiv cs.AI · Max Weltevrede, Matthijs T. J. Spaan, Wendelin Böhmer · 2026-07-02

The paper demonstrates that optimal generalization in offline reinforcement learning (CMDPs) depends not on the degree of pessimism but on whether the pessimistic structure aligns with the symmetries of the optimal solution. Theoretical analysis shows that a symmetric, overly pessimistic value function can generalize better than a mildly pessimistic, non-symmetric one. The authors propose applying data augmentation via consistency loss during policy extraction, rather than traditional offline training on augmented datasets. Empirical validation on a rotationally symmetric reacher environment using IQL and CQL supports this approach.

offline reinforcement learningpessimismsymmetrydata augmentationconsistency loss

AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models

arXiv cs.AI · Rintaro Otsubo, Ryo Fujii, Reina Ishikawa, Taiki Kanaya · 2026-07-02

We introduce AnyGroundBench, a domain-adaptation benchmark for Spatio-Temporal Video Grounding (STVG) in Vision-Language Models (VLMs), addressing the gap in specialized-domain evaluation. The benchmark targets five domains (animal, industry, sports, surgery, public security) with dense spatio-temporal annotations and dedicated training subsets to systematically measure domain adaptability. We evaluate 15 state-of-the-art VLMs, assessing zero-shot generalization and In-Context Learning (ICL) capabilities under practical constraints. Results reveal significant failures in both zero-shot and ICL-based adaptation, exposing critical flaws in spatio-temporal reasoning for specialized domains.

spatio-temporal video groundingvision-language modelsdomain adaptationin-context learningzero-shot generalization

HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

arXiv cs.AI · Ziyun Qiao, Yue Min, Ruining Chen, Yujun Li · 2026-07-02

HERMES introduces a hierarchical labeling substrate for pre-training data mixtures, addressing limitations of fixed-granularity approaches. The method employs a Learned Semantic Transform followed by 3-stage residual vector quantization to annotate documents with coarse-to-fine codes, enabling granularity control via prefix length (up to ~130k cells). While matching KMeans-family methods on clustering metrics at coarse granularity, HERMES uniquely reveals granularity-dependent interactions: a Stage-2 rule combining contrast and coverage improves 16-task macro-average by +0.0253 at one level, but loses efficacy at finer granularities. The system enables dynamic navigation of data-derived hierarchies rather than fixed label selection.

hierarchical labelingresidual vector quantizationpre-training mixturesgranularity controlsemantic transform

AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents

arXiv cs.AI · Xiangchen Cheng, Yunwei Jiang, Jianwen Sun, Zizhen Li · 2026-07-02

AgenticSTS introduces a bounded-memory contract for long-horizon LLM agents, replacing raw transcript appending with typed retrieval to assemble fresh user messages for each decision. This approach maintains bounded prompts across long runs and enables isolated ablation of memory layers. The method is instantiated in Slay the Spire 2, a stochastic deck-building game requiring hundreds of decisions. Experiments show a fixed-A0 ablation with strategic skills enabled improves win rates from 3/10 to 6/10, though statistical significance is limited. The testbed includes 298 trajectories, memory/skill snapshots, and analysis scripts for reproducible study of memory layers in LLM-agent decisions.

bounded-memorytyped retrievalablationlong-horizonllm-agent

Copewell: A Multi-Agent Swarm Architecture for Equitable Mental Wellness Support

arXiv cs.AI · Seren Yenikent, Jack Vinijtrongjit, Katherine Ng · 2026-07-02

Copewell introduces a multi-agent swarm architecture for equitable mental wellness support, addressing limitations of single-mode AI solutions. The system combines (1) multi-source assessment (self-reported, physiological, contextual data) to reduce bias, (2) valence-arousal emotion mapping via Russell's Circumplex Model for agent routing, and (3) dual-mode interventions (conversational + sensory protocols). The design incorporates privacy-first architecture, an Ethics Supervisor agent, and practitioner-informed participatory development. Early practitioner engagement informed system design, though empirical evaluation remains future work.

multi-agent swarmvalence-arousal mappingcircumplex modelparticipatory designethics supervisor

Challenges and Recommendations for LLMs-as-a-Judge in Multilingual Settings and Low-Resource Languages

arXiv cs.AI · A. Seza Doğruöz, Xixian Liao, Verena Blaschke, Jakob Prange · 2026-07-02

The paper examines challenges in using LLM-as-a-Judge for multilingual and low-resource language evaluation, analyzing 650 ACL Anthology papers to identify gaps. Only 33 studies focused on these settings, revealing inconsistent outcomes, overreliance on LLM judgments, and single-model bias. The authors provide recommendations for more robust evaluation practices in such contexts, emphasizing the need for human validation and multi-model assessment.

llm-as-a-judgemultilingual evaluationlow-resource languageshuman validationevaluation bias

Purified OPSD: On-Policy Self-Distillation Without Losing How to Think

arXiv cs.AI · Zhanming Shen, Jintao Tong, Shaotian Yan, Chen Shen · 2026-07-02

The paper identifies and addresses a failure mode in on-policy self-distillation (OPSD) for long chain-of-thought reasoning models, where reference-induced shortcuts dominate the teacher's supervision signal. The proposed solution first isolates non-transferable components using a reference-only teacher, then transforms the residual into a PMI target distribution for distillation. Experiments on four long-CoT models across two datasets show consistent improvements over baseline OPSD while preserving epistemic behavior.

on-policy self-distillationchain-of-thought reasoningpointwise mutual informationreference-induced shortcutsepistemic behavior

Efficient Waste Sorting for Circular Economy: A Confidence-guided comparison between One-Vs-All and One-Vs-Rest Classification Strategies with Human-in-the-Loop for Automated Waste Sorting

arXiv cs.AI · Mohammed Fahad Ali, Dominique Briechle, Marit Briechle-Mathiszig, Tobias Geger · 2026-07-02

This work evaluates One-Vs-All (OvA) and One-Vs-Rest (OvR) classification strategies for AI-based waste sorting systems, focusing on their adaptability to municipal-specific waste disposal schemes in Germany. Using a dataset aligned with Goslar's waste categories, the study examines classifier behavior in identifying potentially misclassified samples through confidence thresholding, enabling human-in-the-loop review. The analysis aims to optimize the trade-off between misclassification rates and human annotation effort, supporting the development of configurable waste sorting tools for advancing Circular Economy objectives.

one-vs-all classificationone-vs-rest classificationconfidence thresholdinghuman-in-the-loopcircular economy

CoFL-S: Spatially Queryable Sector Flow Fields for Local Language-Conditioned Navigation

arXiv cs.AI · Haokun Liu, Zhaoqi Ma, Yicheng Chen, Wentao Zhang · 2026-07-02

CoFL-S introduces a language-conditioned flow field framework for local navigation, predicting spatially queryable sector flow fields to generate continuous trajectories. The method converts VLN-CE episodes into frame-level supervision with aligned sub-instructions and dense flow-field targets, enabling low-level action representation. Evaluated on a continuous-time Habitat benchmark, CoFL-S outperforms action-token and action-chunk baselines across planner frequencies and demonstrates zero-shot real-world deployment advantages.

vision-language navigationflow fieldcontinuous trajectoryhabitat benchmarkzero-shot deployment

Criticality-Based Guard Rail Validation for AI Agent Decisions in Autonomous Telecom Networks

arXiv cs.AI · Ravi Kant Sharma · 2026-07-02

The paper proposes Guard Rail Validation (GRV), a runtime framework for validating AI-driven decisions in autonomous telecom networks (ANL 4-5). GRV evaluates decisions across six weighted dimensions (action scope, type, service criticality, autonomy level, reversibility, temporal patterns) to compute a criticality level, triggering graduated validation mechanisms from logging to multi-agent consensus. The architecture includes conflict detection with criticality-weighted resolution and compliance logging for EU AI Act. Deployment models for O-RAN demonstrate coverage against known AI/ML telecom threats.

autonomous networksruntime validationcriticality assessmentmulti-agent consensuso-ran deployment

The Eticas AI Risk Taxonomy: Open Infrastructure for Operationalizing AI Audits

arXiv cs.AI · Gemma Galdon Clavell, Pablo Accuosto, Usman Gohar · 2026-07-02

The Eticas AI Risk Taxonomy v2.0.0 introduces an operationalization layer for AI audits, bridging conceptual risk taxonomies to actionable evaluations. It demonstrates this through a case study on GPT-4-0314, measuring PII leakage risks at 0%, 51%, and 84% under varying adversarial conditions, resulting in a subcategory grade of E with a SYSTEMIC pattern. The taxonomy organizes 76 subcategories across 10 categories and 20 sub-groups, mapping to 18 external frameworks. Published under CC BY 4.0, it provides open semantic infrastructure with stable URIs and SKOS/JSON-LD distributions, enabling scalable, standardized AI risk assessments.

operationalization layerpii leakageadversarial conditioningsemantic infrastructureskos/json-ld

What Types of Human-AI Teams Exist?

arXiv cs.AI · Nathan Hughes, Ibrahim Habli · 2026-07-02

This study contributes a taxonomy of human-AI teams by analyzing 53 papers and categorizing them into five clusters: AI Assistant, Ad-hoc Dependency, Ad-hoc Forced Dependency, Paired Equanimity, and Group Equanimity. The classification is based on psychological taxonomies of teaming, highlighting distinct holistic team-level characteristics. Results reveal disparate team types studied under the same definition, raising questions about the transferability of insights across papers. The authors provide guidance for identifying team types, a reporting checklist, and suggestions for synthesizing the field.

ai assistantad-hoc dependencyad-hoc forced dependencypaired equanimitygroup equanimity

Overview of Risk Assessment and Management for Intelligent Systems under the AI Act and Beyond

arXiv cs.AI · Javier Irigoyen, Roberto Daza, Aythami Morales, Julian Fierrez · 2026-07-02

This paper provides a systematic overview of AI risk assessment and management methodologies in response to emerging regulatory frameworks like the AI Act. It analyzes the global regulatory landscape driving AI risk assessment needs and characterizes AI-related risks spanning technical failures to societal impacts. The review identifies key risk assessment frameworks, highlights best practices, and pinlights methodological gaps requiring further research to ensure safe and reliable AI systems.

ai risk assessmentregulatory frameworkstechnical failuresrisk managementmethodological gaps

UA-ChatDev: Uncertainty-Aware Multi-Agent Collaboration for Reliable Software Development

arXiv cs.AI · Temitayo Olamilekan Ogunsusi, Lijun Qian, Xishuang Dong · 2026-07-02

UA-ChatDev introduces an uncertainty-aware multi-agent framework for reliable software development, addressing hallucination propagation in LLM-based systems. The method integrates token-level log probability uncertainty estimation with phase-aware threshold calibration to trigger retrieval-based verification when confidence is low. Experiments on SRDD show improvements in completeness (12.7%), executability (18.3%), consistency (9.5%), and overall quality over single-agent and non-uncertainty-aware baselines, with ablation studies confirming enhanced execution reliability.

multi-agent systemsuncertainty quantificationsoftware developmentlarge language modelshallucination propagation

RadiomicNet: A Hybrid Radiomics-Guided Lightweight Architecture for Interpretable Medical Image Segmentation

arXiv cs.AI · Mohammad Amanour Rahman · 2026-07-02

RadiomicNet introduces a hybrid architecture combining deep learning with radiomics features for interpretable medical image segmentation. The method integrates handcrafted radiomics features via a Radiomics Attention Gate (RAG) and enforces alignment through Radiomics Consistency Loss, reducing Expected Calibration Error (ECE) by 16.9%. Evaluated on BUSI and Kvasir-SEG datasets, it achieves Dice Similarity Coefficients of 0.763 and 0.854, outperforming U-KAN by 1.2% and 1.8% with only 3.27M parameters. Dominant radiomics cues include GLCM dissimilarity (15.24%) and LBP entropy (11.49%), providing clinical interpretability.

radiomicssegmentationattention gateinterpretabilitylightweight

A rubric-based controlled comparison of frontier language models on expert-authored clinical reasoning tasks

arXiv cs.AI · Samiha A. Ismail, Fan X. Chen, Ali Merali · 2026-07-02

The study introduces a clinician-authored, rubric-based evaluation framework for assessing LLMs on open-ended clinical reasoning tasks, addressing limitations of saturated multiple-choice benchmarks. Five expert-designed scenarios across four medical specialties were evaluated using atomic, weighted, MECE rubrics (184 total criteria). GPT-4, Claude Opus, and Gemini Pro were tested, achieving mean rubric pass rates of 0.39, 0.47, and 0.37 respectively. Critical (weight-5) criteria showed low pass rates (32.4-41.7%), while low-stakes (weight-1) criteria achieved 80-90%. LLM autoraters replicated expert labels with 92.8-94.7% accuracy. The work establishes a scalable pipeline for future large-scale clinical reasoning benchmarks.

rubric-based evaluationclinical reasoningmece criteriallm autoraterspass rates

Dynamic Neural Graph Encoding of Inference Processes in Deep Weight Space

arXiv cs.AI · Di Wu, Huan Liu, Zhixiang Chi, Yuanhao Yu · 2026-07-02

The authors propose Dynamic Neural Graph Encoder (DNG-Encoder), a novel method for analyzing neural network weight spaces by modeling them as dynamic graphs that capture temporal inference dynamics. The approach processes neural network parameters through graph representations that preserve layer-by-layer sequential processing, and introduces INR2JLS for mapping Implicit Neural Representations (INRs) to a joint latent space. Evaluations show a 10% accuracy improvement over state-of-the-art on CIFAR-100-INR classification.

dynamic graph encodingneural weight spaceimplicit neural representationssequential inferencejoint latent space

Predicting Early Stages Of Alzheimer's Disease And Identifying Key Biomarkers Using Deep Artificial Neural Network And Ensemble Of Machine Learning Methodologies

arXiv cs.AI · Debopriya Ghosh · 2026-07-02

Proposes a machine learning framework for early Alzheimer's disease detection using clinical, neuropsychological, and neuroimaging data from ADNI. Handles missing values via iterative imputation and class imbalance with Borderline SVM-SMOTE. Employs wrapper-based and embedded feature selection, followed by a stacking ensemble model combining Logistic Regression, Extra Trees, Bagging KNN, and LightGBM, alongside a deep artificial neural network. Evaluates performance using precision, recall, F1-score, and AUC-ROC metrics to identify optimal classifiers and key biomarkers for early diagnosis.

iterative imputationborderline svm-smotefeature selectionstacking ensembleauc-roc

A$^{2}$utoLPBench: An Auto-Generated, Agent-Friendly LP Benchmark via Inverse-KKT Construction

arXiv cs.AI · Shuo Ren, Yaohui Han, Yifan Shi, Libo Shen · 2026-07-02

The paper introduces A$^{2}$utoLPBench, an auto-generated benchmark for evaluating LLM-driven agents on linear programming (LP) problems expressed in plain text. The method constructs problems by selecting feasible primal-dual points and deriving corresponding LP instances where optimality and objective values are known by construction, eliminating solver calls and human annotation. Key features include unlimited problem generation, adjustable difficulty via $(n,m)$ parameters, ground-truth correctness, low LLM-side cost, repeatable scores, and resistance to training-data leakage through fresh seed ranges.

linear programmingllm-driven agentsinverse-kktbenchmark generationsolver-critic baseline

ART for Diffusion Sampling: Continuous-Time Control and Actor-Critic Learning

arXiv cs.AI · Yilie Huang, Wenpin Tang, Xun Yu Zhou · 2026-07-02

We propose Adaptive Reparameterized Time (ART), a continuous-time control framework for optimizing timestep allocation in score-based diffusion sampling. ART treats the sampling clock speed as a control variable, enabling adaptive timesteps via a learned time-warping rate. We introduce ART-RL, a reinforcement learning formulation with Gaussian policies that solves this control problem while maintaining equivalence to ART at the optimizer level. Experiments demonstrate that ART-RL improves sample quality across diverse settings—including image generation—without modifying the sampling pipeline, while exhibiting strong generalization across budgets, datasets, and solvers.

diffusion samplingcontinuous-time controltime-warping rategaussian policiesactor-critic learning

Coding-agents can replicate scientific machine learning papers

arXiv cs.AI · Atharva Hans, Ilias Bilionis · 2026-07-02

We introduce Paper-replication, a workflow enabling coding agents to replicate computational claims from scientific machine learning papers by reconstructing methods, running experiments, and validating evidence against paper claims. The workflow records targets, links outputs to provenance, and ensures report coverage before completion. Evaluated on twelve independent runs across four papers, all workspaces passed completion gates, with 158 targets matched to report coverage. Variations persisted in target division, numerical fidelity, replication time, intermediate executions, and evidence acceptance rules, demonstrating robustness despite procedural differences.

coding agentscomputational claimsprovenancevalidation checksreport coverage

Behind the Refusal: Determining Guardrail Activation via Behavioral Monitoring

arXiv cs.AI · William Hackett, Peter Garraghan · 2026-07-02

The authors present the first black-box methodology for detecting guardrail systems in AI applications by monitoring HTTP, lexical, and timing signals without prior knowledge. Their approach distinguishes guardrail blocks from LLM rejections through behavioral analysis, enabling targeted adversarial bypass strategies. Experiments show 100% guardrail detection accuracy, significant behavioral separation (q < 0.001), and 98% F1 score for classifying rejection types on unseen prompts.

guardrail systemsblack-box reconnaissancebehavioral monitoringllm securityadversarial emulation

An Efficient vLLM-Based Inference Pipeline for Unified Audio Understanding and Generation

arXiv cs.AI · Haoran Wang, Jinchuan Tian, Siddhant Arora, Shinji Watanabe · 2026-07-02

The authors present a vLLM-based inference pipeline for unified speech understanding and generation, addressing limitations in multimodal generation support for Speech Language Models. Their method extends autoregressive decoding to handle delay-pattern de-interleaving and multi-stream sampling, integrating an on-GPU acoustic decoder for end-to-end waveform synthesis. Notably, they maintain 80% of non-CFG throughput during Classifier-Free Guidance by co-scheduling conditional and unconditional requests within continuous batching, overcoming the expected 50% throughput penalty.

vllmmultimodal generationautoregressive decodingclassifier-free guidancespeech language models

Enhancing Fitness Intelligence through Domain-Specific LLM Post-Training

arXiv cs.AI · Xingtao Zhao, Tian Yang, Han Jiang · 2026-07-02

The paper introduces FitOne, a series of domain-specific LLMs (8B/32B parameters) for Scientific Fitness Coaching (SFC), addressing limitations of general-purpose models. Built on Qwen3, FitOne employs a three-stage post-training pipeline: continual pre-training, supervised fine-tuning, and reinforcement learning, using knowledge-engineered datasets. Evaluations on ACSM-EP and NSCA-CSCS certification exams show average improvements of 10.09%/9.29% and 12.73%/7.01% over base models, respectively, while retaining general capabilities. Ablation studies validate each pipeline stage's necessity for balancing domain expertise and general performance.

domain-specific llmscientific fitness coachingpost-training pipelineknowledge engineeringreinforcement learning

ContextNest: Verifiable Context Governance for Autonomous AI Agent

arXiv cs.AI · Misha Sulpovar, Benn R. Konsynski, Qaish Kanchwala, Gabe Goodhart · 2026-07-02

ContextNest introduces a governance layer for autonomous AI agents' knowledge retrieval, addressing provenance, version identity, integrity, traceability, and point-in-time reconstruction. The method combines typed Markdown documents, deterministic set-algebraic selectors, SHA-256 hash-chained version histories, and audit traces through the Model Context Protocol (MCP). Empirical results show governed selection Pareto-dominates BM25 sparse retrieval with a 97% answer-quality pass rate versus 93-90%, and deterministic selectors maintain stable document sets (Jaccard 1.0) compared to dense+HNSW baselines (mean Jaccard 0.611).

context governancedeterministic selectorssha-256 hash-chainingmodel context protocolretrieval-augmented generation

Guided Action Flow: Q-Guided Inference for Flow-Matching Vision-Language-Action Policies

arXiv cs.AI · Liuhaichen Yang, Zhuang Jiang, Chenchao Sheng, Zezhi Tang · 2026-07-02

Guided Action Flow introduces Q-guided inference for flow-matching vision-language-action (VLA) policies, enabling test-time guidance without retraining. The method employs a pretrained SmolVLA policy and a learned action-chunk critic, trained on real success and failure rollouts, to guide reverse-time flow sampling via action gradients. Evaluated on LIBERO manipulation tasks, a single-task critic improves success rates from 68.0% to 82.0% and 82.0% to 86.0% across seed windows, while a multi-family task-description critic increases validation success from 46.0% to 56.0%. Held-out test gains are modest, rising from 65.0% to 67.5%, highlighting critic generalization and uncertainty-aware guidance as key challenges.

flow-matchingvision-language-actionq-guided inferenceaction-chunk criticreverse-time flow sampling

SUNTA: Hierarchical Video Prediction with Surprise-based Chunking

arXiv cs.AI · Tomoshi Iiyama, Masahiro Suzuki, Yutaka Matsuo · 2026-07-02

SUNTA introduces surprise-based chunking for hierarchical video prediction, addressing challenges of hierarchical collapse and missing surprise signals in open-loop prediction. The method employs a decoupled training strategy to preserve surprise signals and uses internal inconsistency as a top-down metric for chunk boundary determination during imagined rollouts. Evaluated on 2D and 3D video prediction tasks, SUNTA maintains accurate predictions over 250 timesteps, significantly outperforming baselines that degrade within the first 10 timesteps.

hierarchical state-space modelssurprise-based chunkingdecoupled traininginternal inconsistencyvideo prediction

Evolutionary Wave Function Collapse

arXiv cs.AI · Dipika Rajesh, Ahmed Khalifa, Julian Togelius · 2026-07-02

The paper introduces an evolutionary approach to Wave Function Collapse (WFC), a procedural content generation method, by evolving small input examples rather than complete outputs. WFC serves as a genotype-to-phenotype mapping, with generated levels evaluated via domain-specific fitness functions. Experiments in maze connectivity maps and Zelda-style dungeon layouts demonstrate improved generation quality when objectives align with local structure, though global constraints remain challenging. Results indicate evolutionary search effectively guides WFC when target properties emerge from local relationships.

wave function collapseprocedural content generationevolutionary searchgenotype-to-phenotype mappinglocal adjacency constraints

Evidence-State Rewards for Long-Context Reasoning

arXiv cs.AI · Ya Gao, Pekka Marttinen · 2026-07-02

The paper introduces Maven, a reinforcement learning framework for long-context reasoning that optimizes evidence-state transitions rather than final outcomes. Maven employs an editable evidence memory with action-specific rewards: add actions are scored by marginal gain and hindsight contribution, link actions by evidence synergy, and drop actions by improved answer support. Evaluated on Llama and Qwen models across LongBench v2, LongReason, and RULER, Maven outperforms baseline methods in evidence sufficiency and distractor reduction, demonstrating the advantage of stateful evidence navigation over static extraction.

reinforcement learninglong-context reasoningevidence memoryaction-level rewardsstate transitions

kNNGuard: Turning LLM Hidden Activations into a Training-Free Configurable Guardrail

arXiv cs.AI · Mahmoud Abdelfattah, Hamid Nasiri, Peter Garraghan · 2026-07-02

kNNGuard introduces a training-free guardrail for LLMs that leverages hidden activations and a small prompt bank (50 examples) to detect unsafe or off-topic inputs. The method combines multi-layer kNN classification across activation and embedding spaces, requiring no fine-tuning. Evaluated on six domains, it matches or exceeds fine-tuned guardrails in F1 (unspecified absolute values) while reducing inference latency by 2.7x–10x. Domain adaptation is achieved by updating the prompt bank in under 10 seconds, significantly faster than traditional methods. The analysis covers system prompt effects, layer selection, and production integration.

llm guardrailshidden activationsknn classificationtraining-freeinference latency

Algebraic Model Counting for Global Analysis of Optimal Decision Trees

arXiv cs.AI · Hiroki Arimura · 2026-07-02

The paper introduces Algebraic Decision Tree Counting (ADTC), a formal framework for exhaustive analysis of optimal and near-optimal decision trees in Explainable AI. Inspired by Algebraic Model Counting, ADTC reformulates analytical tasks (optimization, counting, sampling) as sum-of-products computations over semirings, using dynamic programming with O*(n^O(Δ)) complexity for n features and depth Δ. The method employs model behavior tensors with convolution products over tensor semirings to handle multi-metric constraints. The emtrees software demonstrates ADTC's utility in analyzing trade-offs between accuracy, size, and fairness on real-world datasets.

algebraic model countingdecision treestensor semiringdynamic programmingmodel behavior tensors

SA-HGNN: Sample-Adaptive Hyperbolic Graph Neural Network for EEG-Based Depression Recognition

arXiv cs.AI · Yang Li, Pan Hu, Yan Zhang, Wenfan Yang · 2026-07-02

The Sample-Adaptive Hyperbolic Graph Neural Network (SA-HGNN) improves EEG-based depression recognition by capturing hierarchical brain network structures. SA-HGNN integrates three modules: a Sample-Adaptive Graph Construction module for personalized brain network topologies, hyperbolic graph convolution to model hierarchical relationships in hyperbolic space, and an Attention Pooling module to reduce EEG noise. Experiments on public EEG datasets demonstrate SA-HGNN's superior performance in both resting-state and task-related paradigms, validating its robustness to noise and effectiveness in identifying abnormal functional connectivity patterns in depression.

graph neural networkshyperbolic geometryeeg signalsattention poolingfunctional connectivity

Prompt Coverage Adequacy

arXiv cs.AI · Florian Tambon, Michael Konstantinou, Cedric Richter, Charles Chenouard · 2026-07-02

The authors introduce Prompt Coverage Adequacy, a novel coverage criterion for testing code generated from task descriptions in LLM-driven software development. This criterion operates at the prompt level, leveraging LLM attention mechanisms to measure how well a test suite satisfies prompt requirements. Evaluated using attention boosting across two datasets and multiple LLMs, Prompt Coverage Adequacy demonstrates a 30+% improvement in fault detection over traditional code coverage, suggesting its utility in guiding test generation for LLM-based systems.

prompt coverage adequacyattention mechanismsllm-driven developmenttest generationfault detection

Beyond the Performance Illusion: Structure-Aware Stratified Partitioning and Curriculum Distributionally Robust Optimization for Spatially Correlated Domains

arXiv cs.AI · Prathamesh Patil, Arpit Jain, Aswanth Krishnan · 2026-07-02

The paper proposes a unified framework addressing evaluation and training pitfalls in spatiotemporally correlated AI domains. It introduces Structure-Aware Stratified Partitioning (SASP) to mitigate data leakage in validation splits while maintaining class balance, and Curriculum Distributionally Robust Optimization (CDRO) for stable training under stricter splits. Evaluations across benchmarks demonstrate improved generalization, better confidence calibration, and revealed hidden failure modes compared to random-split approaches.

spatiotemporal correlationdata leakagedistributionally robust optimizationhidden stratificationconfidence calibration

SPLIT: Cross-Lingual Empathy and Cultural Grounding in English and Ukrainian LLM Responses

arXiv cs.AI · Anna Chorna · 2026-07-02

The paper introduces SPLIT, a 500-prompt benchmark evaluating LLM consistency in generating emotionally grounded responses across five crisis-related categories (Stress, Panic, Loneliness, Internal Displacement, Tension) in English and Ukrainian. It assesses three LLMs (Gemini-2.5-Flash, LLaMA-3.3-70B-Instruct, DeepSeek-V3) on Empathetic Accuracy, Linguistic Naturalness, and Contextual & Cultural Grounding. Results show performance degradation in Ukrainian for Gemini-2.5-Flash and LLaMA-3.3-70B-Instruct, while DeepSeek-V3 remains stable, with weak human-AI evaluator agreement on cultural grounding.

cross-lingual evaluationemotional supportcultural groundingllm-as-a-jurylow-resource languages

OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

arXiv cs.AI · Rheeya Uppaal, Seungwoo Lyu, Selina Sung, Junjie Hu · 2026-07-02

We introduce OpenSafeIntent, a benchmark for evaluating intent-calibrated safe completion in language models through controlled prompt-sets that vary intent while holding tasks fixed. The benchmark includes benign, dual-use, and malicious variants of the same task, enabling assessment of model safety across intent shifts. Results across multiple models reveal that prompt-level safety masks critical failures: models often fail to maintain safety across intent variants, dual-use behavior is brittle under paraphrase, and high-level answers on risky topics lack reliability. Responses reframing ambiguous requests into safer tasks are less likely to cross safety boundaries. The findings advocate for evaluating safe completion as intent-calibrated behavior over controlled task variants rather than as a single safety-helpfulness tradeoff.

intent-calibrateddual-usesafe completionprompt-setssafety boundary

Towards Load-Aware Prefill Deflection for Disaggregated LLM Serving

arXiv cs.AI · Shrikara Arun, Anjaly Parayil, Srikant Bharadwaj, Renee St. Amant · 2026-07-02

The paper introduces a load-aware prefill deflection scheduler for disaggregated LLM serving, addressing compute asymmetry between prefill and decode nodes under bursty workloads. The method proactively deflects prefill requests to decode nodes by estimating Time-to-First-Token (TTFT) trade-offs and interleaving chunked-prefill steps with in-flight decode batches, eliminating inter-node KV-cache transfers. Evaluated on vLLM with DeepSeek-V2-Lite, the approach reduces P95 TTFT by up to 81% and improves SLO attainment by 79% over existing schedulers, with sub-millisecond routing overhead.

disaggregated servingprefill deflectionkv-cache transfertime-to-first-tokenchunked-prefill

PACE: A Proxy for Agentic Capability Evaluation

arXiv cs.AI · Yueqi Song, Lintang Sutawika, Jiarui Liu, Lindia Tjuatja · 2026-07-02

We introduce PACE, a framework for constructing proxy benchmarks that predict LLM agentic performance from non-agentic evaluations. PACE selects instances from atomic capability tests using target-relevance local selection and globally informative global selection, then fits a regression mapping source instance scores to target agentic benchmark performance. Experiments on 14 models, 4 agentic benchmarks, and 19 non-agentic benchmarks show PACE-Bench achieves LOOCV MAE under 4%, Spearman correlation above 0.80, and 85% pairwise model-ranking accuracy at less than 1% of full agentic evaluation cost. Analysis reveals unique skill demands of each agentic benchmark.

agentic benchmarksproxy benchmarksatomic capabilitiesinstance selectionregression mapping

Mirror Illusion Art

arXiv cs.AI · Xiaopei Zhu, Zeyuan Li, Jun Zhu, Xiaolin Hu · 2026-07-02

AutoMIA introduces an automated pipeline for Mirror Illusion Art, jointly optimizing shape and color to generate printable 3D objects from two target 2D images (front and mirror). The method employs four mechanisms: projection-alignment component selection for surface noise reduction, position-weighted adaptive suppression for background noise, internal voxel preservation to prevent fractures, and shape-color decoupled optimization for balanced refinement. AutoMIA achieves diverse, smooth artworks in both digital and physical domains, with an average design time of 76s and memory usage of 2.6 GB on a single RTX 3090. This advances inverse graphics and computational design by addressing limitations of prior topology-driven and shadow-based approaches.

inverse graphicscomputational designvoxel preservationshape-color decouplingmirror illusion

InduceKV: Fixed-Footprint Continual Adaptation of Multimodal LLMs via Inducing KV Memories

arXiv cs.AI · Qianyu Chen, Ziteng Feng, Canran Xiao, Runxuan Tang · 2026-07-02

InduceKV introduces a fixed-footprint continual adaptation method for multimodal LLMs, enabling task-specific updates without modifying the backbone model or exceeding memory budgets. The approach stores training prefixes as attention-ready memory entries with frozen retrieval keys and compact KV payloads, appended to the self-attention cache. Bilevel selection optimizes retrieval calibration and memory balancing for likelihood, retention, and coverage. Evaluations across task-incremental instruction tuning, continual VQA, domain-incremental adaptation, and lifelong multimodal instruction tuning show consistent improvements over PEFT, MoE, replay, and prompt-retrieval baselines under matched memory constraints.

multimodal llmsfixed-footprint adaptationkv payloadsbilevel selectionself-attention cache

Traceable Fault Diagnosis for Battery Energy Storage Systems via Retrieval-Augmented Multi-Agent O&M Assistant

arXiv cs.AI · Jiangdi Ru, Bing Li, Yage Huang, Ding Wang · 2026-07-02

A traceable fault-diagnosis assistant for battery energy storage systems (BESSs) is introduced, leveraging retrieval-augmented multi-agent reasoning to integrate operational data, domain knowledge, visual evidence, and report generation. The system employs BESS-specific task routing, schema-constrained natural-language database access, hybrid text-image retrieval, and evidence-based answer synthesis to enhance reliability. Preliminary internal evaluations demonstrate efficacy in routing, database access, and diagnostic reasoning.

battery energy storage systemsretrieval-augmented reasoningtask routingschema-constrained accesshybrid text-image retrieval

Episodic-to-Semantic Consolidation Without Identity Drift

arXiv cs.AI · Xue Qin, Simin Luan, Cong Yang, Zhijun Li · 2026-07-02

Proposes a method for episodic-to-semantic memory consolidation that preserves agent identity integrity in regulated autonomic systems. Introduces a deterministic function f: M^ep -> M^sem that generates a separate semantic knowledge layer without altering the agent's cryptographically certified identity. Formalizes agent representation, proves identity invariance via a structural lemma, and specifies an auditable aggregation algorithm. Synthetic experiments demonstrate per-field correctness, byte-equal identity preservation, and a 79.82% reduction in unproductive planner attempts (95% CI [78.02%, 81.49%]) compared to a Bayesian-shrunk baseline. Enables knowledge accumulation while maintaining identity across operational lifetime.

episodic memorysemantic memoryidentity invariancedeterministic functionautonomic agents

Do Newer Lightweight CNNs Perform Better Under Resource Constraints? A Controlled Multigenerational Study of Architecture, Initialization, Training Budget, and Efficiency

arXiv cs.AI · Tasnim Shahriar · 2026-07-02

This controlled study evaluates nine lightweight CNN architectures across CIFAR-10, CIFAR-100, and Tiny ImageNet, measuring accuracy, computational efficiency, and hardware performance. The analysis employs standardized metrics including top-1 accuracy, parameter count, GMACs, and latency on NVIDIA L4 and AMD Ryzen 5 5500U platforms. Results show EfficientNetV2-S achieves highest accuracy (97.57% on CIFAR-10), while EfficientNet-B0 maintains competitive performance with 79% fewer parameters. MobileNetV3-Small demonstrates superior inference speed, and latency rankings vary significantly between hardware platforms, indicating GMACs alone insufficiently predict real-world performance.

lightweight cnnscomputational efficiencypareto frontierinference latencygmaccs

MolSight: A Graph-Aware Vision-Language Model for Unified Chemical Image Understanding

arXiv cs.AI · Wenda Wang, Yihan Tong, Yuwei Hu, Zhewei Wei · 2026-07-02

MolSight introduces a graph-aware vision-language model framework for enhanced molecular image understanding, addressing limitations in structural alignment and topological modeling. The method integrates a Molecular Topology Module to embed chemical-bond adjacency into vision tokens and a Molecular Grounding Module for visual-symbolic alignment. Evaluations show MolSight outperforms existing VLMs, molecular LLMs, and specialized tools across chemical visual tasks, achieving superior molecular image reasoning.

molecular vision-language modelgraph-aware frameworktopological modelingchemical-bond adjacencyvisual-symbolic alignment

Multimodal Knowledge Edit-Scoped Generalization for Online Recursive MLLM Editing

arXiv cs.AI · Siyuan Li, Youyuan Zhang, Ruitong Liu, Junxi Wang · 2026-07-02

The paper introduces Edit-Scoped Generalization, a framework for online multimodal knowledge editing in MLLMs that controls the propagation boundary of each edit. The proposed ScopeEdit method decomposes updates into a modality-local absorption branch and an evidence-gated shared generalization branch, using orthogonal low-rank spaces and Sherman-Morrison recursions for constant overhead. Experiments across benchmarks, edit streams, and MLLM backbones demonstrate improved trade-offs between cross-modal transfer and locality while maintaining reliability and efficiency.

multimodal knowledge editingedit-scoped generalizationlow-rank spacessherman-morrison recursionscross-modal transfer

OntoLearner: A Modular Python Library for Ontology Learning with Large Language Models

arXiv cs.AI · Hamed Babaei Giglou, Jennifer D'Souza, Andrei Aioanei, Nandana Mihindukulasooriya · 2026-07-02

OntoLearner introduces a modular Python framework for ontology learning, unifying ontology access, LLM-driven pipelines, and standardized benchmarking across domains. It provides 180 machine-readable ontologies spanning 22 domains and pipeline-ready datasets for term typing, taxonomy discovery, and non-taxonomic relation extraction. A large-scale empirical study evaluates 22 retrieval models and 12 LLMs, revealing that failure modes scale with ontological complexity rather than model size or architecture, highlighting a structural mismatch between model knowledge encoding and ontology organization. OntoLearner enables cross-domain, multi-task benchmarking and is open-source under the MIT license.

ontology learninglarge language modelstaxonomy discoverynon-taxonomic relation extractionmachine-readable ontologies

A Multi-Branch Hierarchy-Aware Framework for Heterogeneous Audio Classification

arXiv cs.AI · Beile Ning, Jiayi Yu, Zitong Wang, Yufei Hu · 2026-07-02

The authors present a multi-branch hierarchy-aware framework for heterogeneous audio classification in DCASE 2026 Task 1, achieving improved hierarchical F1 scores through three key strategies: dataset expansion with BSD35k, feature-specific acoustic modeling branches, and hierarchy-aware classifiers with KNN post-processing. The system leverages CLAP-based audio-text representations, with log-STFT features yielding the strongest single-model performance (80.84% Hier. F1 on BSD10k-v1.2). Ensemble systems combining complementary features and classifiers further improve performance to 81.25% and 81.18% Hier. F1.

hierarchical classificationaudio-text representationsknn post-processingclap-based featuresmulti-branch architecture

Assessing VLM Reliability for Medical Image Quality Evaluation Under Corruption and Bias

arXiv cs.AI · Sofiane Ouaari, Kevin Vorwalder, Nico Pfeifer · 2026-07-02

The study evaluates Vision-Language Models (VLMs) for Medical Image Quality Assessment (MIQA) under real-world conditions, including image corruption and textual bias. Using the MediMeta-C dataset, 16 VLMs were tested zero-shot across seven corruption types and five severity levels, while assessing sensitivity to embedding geometry and textual attributes. Results show pixelation caused the largest score reductions (mean -20.58%), while brightness had minimal impact (-0.81%). Textual metadata influenced scores (+17.15% for institutional prestige, -14.7% for equipment age), revealing privacy-bias trade-offs. VLMs exhibited limited reliability, with embedding displacement correlating with performance changes.

vision-language modelsmedical image quality assessmentzero-shot evaluationembedding geometrycontextual bias

Object Aligner: A Configurable JSON Schema Similarity Score for Graphs, Applied to LLM Prompt Optimization

arXiv cs.AI · Jan Drchal · 2026-07-02

The paper introduces Object Aligner (OA), a configurable JSON similarity metric for structured LLM outputs that handles graphs via referential alignment. OA recursively aligns JSON trees using Hungarian algorithm for unordered collections and sequence alignment for ordered ones, with schema-guided partial credit. It solves graph isomorphism via Weisfeiler-Leman color refinement for identifier bijection, ensuring invariance to relabeling. When integrated with GEPA prompt optimizer, OA improved or maintained performance across all evaluated datasets while providing mismatch localization and repair suggestions.

json schemareferential alignmentweisfeiler-lemanhungarian algorithmprompt optimization

NeoMap: Training-free Novel-View Synthesis from Single Images and Videos

arXiv cs.AI · Jinxi Li, Tianyi Zhang, Yafei Yang, Zihui Zhang · 2026-07-02

NeoMap introduces a training-free framework for novel-view synthesis from single images or monocular videos, addressing artifacts and global scene consistency issues in existing methods. The approach leverages the insight that optimal novel-view solutions are encoded within pre-trained video models' data manifolds, employing convergent manifold alternating projection iterations to optimize initial noise. Evaluations on Tanks-and-Temples, LLFF, and DAVIS benchmarks demonstrate state-of-the-art generation fidelity and view consistency, outperforming existing methods.

novel-view synthesismanifold projectiontraining-freemonocular videoscene consistency

Robust for the Wrong Reasons: The Representational Geometry of LLM Robustness to Science Skepticism

arXiv cs.AI · Minjong Cheon · 2026-07-02

The study investigates how instruction-tuned LLMs (Llama-3.1-8B, Qwen2.5-7B, Mistral-7B) respond to scientific skepticism across climate, vaccines, and evolution domains, combining behavioral analysis with linear probing and activation patching. Contrary to concerns about sycophantic retreat, models exhibit three distinct policies: reactive assertion (Llama), surface hedging (Qwen), and non-response (Mistral), with behavioral shifts driven by increased consensus assertion (β=+0.042, p<1e-77). Probe analysis reveals middle-layer divergence (perfect separation in Llama/Qwen vs. 72% in Mistral), showing robustness is domain-specific and may reverse in safety-critical contexts like vaccines.

instruction-tuned llmslinear probingactivation patchingrepresentational geometryscientific skepticism

Atomic Task Graph: A Unified Framework for Agentic Planning and Execution

arXiv cs.AI · Yue Zhang, Sihan Chen, Ziwen Huang, Hanyun Cui · 2026-07-02

The paper introduces Atomic Task Graph (ATG), a unified framework for LLM-based agentic planning and execution that improves efficiency without model scaling or fine-tuning. ATG explicitly models task dependencies as directed acyclic graphs, enabling parallel execution of independent subtasks and targeted error recovery through graph evolution tracing. Experiments demonstrate ATG's superior success rates and execution efficiency over baselines across three interactive benchmarks using only 7B-8B parameter models.

atomic task graphagentic planningdirected acyclic grapherror localizationparallel execution

Conditional Co-Ablation: Recovering Self-Repair Backups in Transformer Circuits

arXiv cs.AI · Zhiren Gong, Zihao Zeng, Chau Yuen, Wei Yang Bryan Lim · 2026-07-02

Conditional Co-Ablation (CoAx) is introduced as a label-free, output-grounded scoring method to recover dormant backup components in transformer circuits that activate during self-repair. CoAx measures the conditional growth in ablation effects after removing a primary component set, exposing second-order interactions neglected by single-unit scores. On GPT-2-small's IOI circuit, CoAx improves backup-head recovery from 0.33 to 0.91 ROC-AUC, surpassing self-repair-aware gradient scores (best 0.82), and verifies causal repair roles via counterfactual patching. The method generalizes to induction tasks across eight models and enables repair-aware structured pruning from 124M to 7B parameters.

conditional co-ablationtransformer circuitsself-repaircounterfactual patchingstructured pruning

PhysMani: Physics-principled 3D World Model for Dynamic Object Manipulation

arXiv cs.AI · Peng Yun, Shouwang Huang, Hao Li, Jinxi Li · 2026-07-02

PhysMani introduces a physics-principled 3D world model for dynamic object manipulation, combining a 3D Gaussian world model with a future-aware action policy. The world model learns a divergence-free Gaussian velocity field through online optimization for physically grounded dynamics prediction, while the policy integrates predicted 3D scene dynamics via a token-based cross-attention module. Evaluated on PhysMani-Bench (16 tasks), the framework achieves superior success rates in simulation and real-world robot experiments compared to baselines.

3d gaussian world modeldivergence-free velocity fieldfuture-aware policydynamic manipulationphysics-principled

CausalSteward: An Agentic Divide-Conquer-Combine Copilot for Causal Discovery

arXiv cs.AI · Nicholas Tagliapietra, Gian Lorenzo Marchioni, Moritz Willig, Juergen Luettin · 2026-07-02

CausalSteward (CAST) introduces a human-in-the-loop multi-agent framework for scalable causal discovery in high-dimensional settings. The system employs a divide-and-conquer strategy, iteratively partitioning large variable clusters and analyzing them separately, while integrating prior knowledge through retrieval-augmented generation and conditional independence tests. This approach addresses causal identifiability issues arising from assumption violations in real-world data. The framework demonstrates the potential of multi-agent collaboration in causal reasoning, highlighting the role of human oversight in ensuring accurate and trustworthy causal models.

causal discoverymulti-agent frameworkretrieval-augmented generationconditional independence testscausal identifiability

A-TMA: Decoupling State-Aware Memory Failures in Long-Term Agent Memory

arXiv cs.AI · Zitong Shi, Yixuan Tang, Anthony Kum Hoe Tung · 2026-07-02

We propose A-TMA, a state-aware overlay for long-term memory systems in LLM agents, addressing ghost memory failures where outdated, current, and transitional facts coexist and mislead retrieval. A-TMA maintains superseded and transition records, constructs evidence packets for specific state views, and labels facts as current, historical, or transitional during QA. We introduce LTP, a conflict-heavy benchmark for ghost memory, and evaluate on LoCoMo for long conversation generalization. Graphiti+A-TMA improves conflict accuracy by 0.240 on LTP and raises temporal F1 from 0.0295 to 0.1705 on LoCoMo, demonstrating that explicit state roles reduce memory failures obscured by QA accuracy.

ghost memorystate-awareevidence packetsconflict accuracytemporal f1

AIriskEval-edu: New Dataset for Risk Assessment in AI-mediated K-12 Educational Explanations

arXiv cs.AI · Javier Irigoyen, Roberto Daza, Francisco Jurado, Julian Fierrez · 2026-07-02

The paper introduces AIriskEval-edu-db2, a dataset of 1,639 K-12 educational explanations (human-written and LLM-generated) with annotated pedagogical risks across five dimensions: factual precision, depth/completeness, focus/relevance, student-level appropriateness, and ideological bias. The dataset includes 785 explanations with structured explainability annotations (risk localization/description) via semi-automatic expert validation. Experiments compare proprietary models with a locally deployable Llama 3.1 8B model, showing supervised fine-tuning enables competitive risk detection while preserving privacy in educational auditing.

pedagogical risk assessmentllm-generated explanationsstructured explainabilitysupervised fine-tuningeducational auditing

TUDUM: A Turkish-Thinking Reasoning Pipeline for Qwen3.5-27B

arXiv cs.AI · Baran Bingol, Bahaeddin Turkoglu · 2026-07-02

The paper introduces TUDUM, a pipeline for adapting Qwen3.5-27B to perform explicit Turkish reasoning traces rather than internal English translations. The method involves supervised fine-tuning (SFT) on 15,991 Turkish reasoning examples using LoRA adapters, followed by GRPO-family reinforcement learning on a Turkish mathematics environment. Results show SFT reduced response length and improved Turkish consistency but lowered benchmark accuracy; RL partially recovered mathematical performance (notably on AIME24) without surpassing the base model on Macro-6 average.

turkish reasoningsupervised fine-tuninglora adaptersgrpo reinforcement learningqwen3.5-27b

Low-Latency Task-Oriented Image Transmission with Opportunistic Spectrum Access

arXiv cs.AI · João Henrique Inacio de Souza, Mattia Merluzzi, Mateus P. Mota, Beatriz Soret · 2026-07-02

The paper proposes a low-latency task-oriented image transmission framework using opportunistic spectrum access and VQ-VAE-based compression. The method employs discrete latent representations transmitted via digital modulation over idle channels, with an AI-powered receiver reconstructing task-relevant information. A cross-layer latency model incorporates compression, block errors, retransmissions, and stochastic channel access. Results demonstrate 79-3.3x latency reduction versus conventional coding benchmarks, with only 5.7-2.4% accuracy drops on classification tasks under constrained spectrum and fading channels.

task-oriented communicationvq-vaeopportunistic spectrum accesslatency-accuracy tradeoffdigital modulation

ElephantAgent: Contextual State Continuity in Agentic Systems

arXiv cs.AI · Jiankai Jin, Xiangzheng Zhang, Zhao Liu, Wenzhuo Xu · 2026-07-02

ElephantAgent introduces Contextual State Continuity, a protocol defending agentic systems against contextual state poisoning attacks. The method enforces verifiable continuity by recomputing and validating cryptographic digests of security-critical contextual state (e.g., tool state, memory) against an authorized ledger maintained via replicated trusted hardware. It provides Historical Traceability for post-hoc audit and recovery, handling both out-of-band tampering and in-band semantic abuse. The approach extends prior state-continuity mechanisms like Nimble to dynamic agentic contexts.

contextual state continuityagentic systemsstate poisoninghistorical traceabilitytrusted hardware

ContextSniper: AntTrail's Token-Efficient Code Memory for Repository-Level Program Repair

arXiv cs.AI · Chiwang Luk, Matin Mohammad Najafi, Zhifeng Jia, Wei Yang · 2026-07-02

ContextSniper introduces a token-efficient code memory layer for repository-level program repair, addressing the inefficiency of large language model agents in processing irrelevant code and logs. The method employs precision evidence selection via hybrid retrieval signals, intention-aware context gating, and compact evidence packets while preserving recoverable source context. Evaluated on SWE-bench Lite with OpenClaw and Claude Code, ContextSniper reduces token use by 51.5% and 38.9% respectively, with minimal impact on resolution rates (24.0% vs 26.0% for OpenClaw, 30.0% vs 32.0% for Claude Code).

contextsniperrepository-level repairhybrid retrievalintention-aware gatingtoken efficiency

Population-Based Multi-Objective Training of Discriminators for Semi-Supervised GANs

arXiv cs.AI · Francisco Sedeño, Francisco Chicano, Jamal Toutouh · 2026-07-02

The paper introduces a population-based evolutionary training strategy for semi-supervised GANs (SSL-GANs), framing discriminator learning as a multi-objective optimization problem. Instead of scalar loss aggregation, the method maintains a Pareto-optimal population of discriminators to explore trade-offs between classification accuracy and real/fake discrimination. Evaluated on MNIST with limited labels, the approach demonstrates improved training robustness over SSL-GAN and CE-SSL-GAN baselines, with an elitist variant achieving the highest classification accuracy.

semi-supervised gansmulti-objective optimizationpareto dominancediscriminator trainingevolutionary strategy

Rethinking Complexity Metrics for LLM-Integrated Applications: Beyond Source Code

arXiv cs.AI · Zihao Xu, Yuekang Li, Gelei Deng, Yi Liu · 2026-07-02

We introduce HECATE, the first tool to assess complexity in LLM-integrated applications across both prompt and code layers, addressing a gap in existing code-centric metrics. HECATE employs Prompt-as-Specification, a Hoare-logic-inspired formalism interpreting prompts as behavioral specifications, and evaluates 52 candidate metrics grounded in 25 complexity dimensions. Testing on 118 components from 18 repositories, we identify ten significant metrics, seven of which measure structural breadth (e.g., LLM call sites, memory attributes) rather than volume. Prompt-layer metrics retain significance even when controlling for code-level complexity, establishing prompt complexity as an independent dimension. Validation on 20 components from six repositories confirms the generalizability of the top-performing metrics.

hecateprompt-as-specificationstructural breadthllm-integrated applicationshoare-logic

SABER: A Semantic-Aligned Brain Network Analysis Framework via Multi-scale Hypergraphs

arXiv cs.AI · Yidan Xu, Xiangmin Han, Rundong Xue, Huihui Ye · 2026-07-02

SABER introduces a semantic-aligned brain network analysis framework that integrates large language model (LLM) derived semantics into brain disease diagnosis. The method enriches node representations via ROI-level semantics and global self-attention, constructs multi-scale hypergraphs to model functional subnetworks and multi-ROI interactions, and employs a decision-level semantic alignment mechanism to inject patient-specific textual embeddings into graph representations. Experiments on ABIDE and ADHD-200 datasets demonstrate state-of-the-art performance, enhanced stability, and improved interpretability, particularly in small-sample settings.

semantic-alignedmulti-scale hypergraphsroi-level semanticsdecision-level alignmentfunctional subnetworks

Rank-Then-Act: Reward-Free Control from Frame-Order Progress

arXiv cs.AI · Yuriy Maksyuta, George Bredis, Ruslan Rakhimov, Daniil Gavrilov · 2026-07-02

The paper introduces Rank-Then-Act (RTA), a reward-free control framework that learns policies from expert videos by training a Vision-Language Model (VLM) as an ordinal progress scorer. The method employs Group Relative Policy Optimization (GRPO) on shuffled frames to infer temporal order from visual semantics, then uses Spearman rank correlation between predicted and true temporal indices as a scale-invariant reward signal for reinforcement learning. Evaluated on PyBoy (Catrap, Kirby) and continuous control tasks (PointMaze, MetaWorld), RTA matches or outperforms prior video-based reward learning methods while enabling cross-task scorer reuse.

rank-then-actvision-language modelgroup relative policy optimizationspearman rank correlationreward-free control

Spec-AUF: Accept-Until-Fail Training under Train-Inference Misalignment for Masked Block Drafters

arXiv cs.AI · Tianjian Yang, Meng Li · 2026-07-02

The paper introduces Spec-AUF, a training method for masked block drafters in speculative decoding that addresses train-inference misalignment by truncating cross-entropy support after the first predicted failure. This Accept-Until-Fail (AUF) approach concentrates supervision on the accepted prefix without modifying the inference pipeline or requiring auxiliary objectives. Evaluated on Qwen3-8B across six benchmarks, AUF increases the average emitted length from 2.40 to 2.61 for DFlash drafters and from 2.56 to 2.68 for Domino's two-branch head, demonstrating consistent improvements.

speculative decodingmasked block drafterstrain-inference misalignmentcross-entropy supportaccept-until-fail

SAB-LVLM: Significance-Aware Binarization for Large Vision-Language Models

arXiv cs.AI · Qi Lyu, Jiahua Dong, Baichen Liu, Xudong Wang · 2026-07-02

The paper introduces SAB-LVLM, a significance-aware binarization method for Large Vision-Language Models (LVLMs) that addresses performance degradation in existing binarization approaches. The method constructs Hessian matrices for textual and visual inputs, derives a spatial significance map to identify modality-specific and cross-modal weights, and employs a modality-guided integration strategy for significance-aware binarization. Experiments demonstrate SAB-LVLM's superiority over existing binary PTQ methods under ~1-bit compression. Code is available at https://github.com/LyuQi127/SAB_LVLM.

binarizationvision-language modelshessian matrixmodality-guidedquantization

SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use

arXiv cs.AI · Jiayin Zhu, Kelong Mao, Yudong Guo, Dengbo He · 2026-07-02

SkillCoach introduces self-evolving rubrics for evaluating LLM agent skill-use, addressing challenges in skill selection, following, composition, and reflection. The framework derives process rubrics from real rollouts, decoupling process quality from final task success. It uses these rubrics for process supervision in trajectory selection. Experiments demonstrate improved evaluation quality, detection of hidden failures, and stronger supervision signals compared to outcome-only filtering.

llm agentsskill compositionprocess supervisionself-evolving rubricstrajectory evaluation

CamoNAS: Neural Architecture Search for Enhanced Camouflaged Object Detection

arXiv cs.AI · Dawei Ren, Yan Zhang, Hongying Tang, Qiaoling Zhou · 2026-07-02

CamoNAS introduces a frequency-aware multi-resolution Neural Architecture Search framework for Camouflaged Object Detection (COD), automating both cell-level operations and network-level downsampling paths. The method employs an RGB frequency dual-stream architecture with a learnable wavelet transform to enhance detection of objects with weak edges. Evaluated on four benchmarks (CAMO, COD10K, NC4K, CHAMELEON), CamoNAS achieves state-of-the-art performance, demonstrating NAS effectiveness for COD tasks.

neural architecture searchcamouflaged object detectionwavelet transformmulti-resolutiondual-stream architecture

An Exploratory Study on LLM-Generated Code and Comments in Code Repositories

arXiv cs.AI · Yongyi Ji, Jiaji Wang, Yi Zhou, Fuxiang Chen · 2026-07-02

This study investigates LLM-generated code and comments in software repositories, comparing company- and community-maintained projects from 2021-2025. Using detection tools as proxies, the analysis reveals that LLM-generated code prevalence decreased over time, often appearing in test cases, while comment generation remained stable. Detected LLM code exhibited high intra-repository cloning, and comments showed low grammatical correctness. Company repositories had higher LLM-generated content proportions, with only a small fraction of human-labeled bugs linked to LLM code.

llm-generated codecode repositoriesintra-repository clonesgrammatical correctnessbug detection

Safety Targeted Embedding Exploit via Refinement

arXiv cs.AI · Joshua Adrian Cahyono · 2026-07-02

The paper introduces STEER, a gradient-guided attack that exploits the epistemic gap in multilingual safety alignment of LLMs by iteratively translating safety-critical words into low-resource languages to suppress refusal behavior while preserving harmful intent. The method outperforms random code-switching and GCG, achieving up to 93.0% attack success on JailbreakBench and 96.7% on AdvBench across six 8B-parameter models, with transferability to GPT-4o-mini (35.5% success). Results indicate that English-centric safety training fails to generalize to multilingual inputs, necessitating broader alignment coverage and out-of-distribution detection mechanisms.

safety alignmentgradient-guided attackcode-switchingepistemic gaprefusal behavior

Has This Checkpoint Been Abliterated? A Two-Signal Audit and Its Failure Map

arXiv cs.AI · Gabriel Hurtado · 2026-07-02

The paper proposes a two-signal audit method combining a reference-anchored activation refusal-gap and weight-recovery energy to detect checkpoint abliteration (removal of refusal mechanisms) in open-weight models. The approach achieves AUROC 0.95 on a 273-checkpoint registry spanning Qwen, DeepSeek-distilled Qwen, Llama, and Gemma, significantly outperforming individual signals (0.84, 0.90). A Youden-calibrated threshold maintains 0.89 balanced accuracy on held-out families, though two failure modes are identified: spoofed references and white-box adversarial training.

checkpoint abliterationrefusal-gapweight-recovery energythreshold-free auditopen-weight models

Evaluating Chunking Strategies for Retrieval-Augmented Generation on Academic Texts

arXiv cs.AI · Valentin J. J. Kreileder, Johannes Reisinger, Andreas Fischer · 2026-07-02

This study evaluates the efficacy of cluster-based semantic chunking in Retrieval-Augmented Generation (RAG) systems for processing academic theses, comparing it against fixed-size and recursive chunking strategies. Using the Retrieval Augmented Generation Assessment (RAGAs) framework, the authors assess retrieval and answer quality across different question types. Results indicate that RAGAs-based faithfulness metrics show limited reliability, and cluster-based chunking does not outperform simpler chunking strategies in this configuration. Performance variations are attributed to document formatting and preprocessing factors.

retrieval-augmented generationsemantic chunkingragas frameworkfaithfulness metricsdocument preprocessing

Decomposer: Learning to Decompile Symbolic Music to Programs

arXiv cs.AI · Yewon Kim, Apurva Gandhi, David Chung, Graham Neubig · 2026-07-02

Decomposer introduces a framework for symbolic music decompilation, recovering executable music programs from symbolic MIDI input. The method addresses challenges in MIDI-to-Strudel decompilation through supervised fine-tuning on Strudel-Synth, a synthetic corpus of paired Strudel programs and MIDI, followed by reinforcement learning on unpaired MIDI to optimize both MIDI reconstruction faithfulness and code readability. Evaluations on synthetic and real-world MIDI benchmarks demonstrate that Decomposer outperforms closed-source LLMs in MIDI reconstruction accuracy while generating more readable and diverse code compared to heuristic converters.

symbolic musicmidi-to-strudelsupervised fine-tuningreinforcement learningcode readability

CLAP: Closed-Loop Training, Evaluation, and Release Control for Domain Agent Post-training

arXiv cs.AI · Fangfei Li, Chenyang Zhao, Long Wang, Feng Tian · 2026-07-02

The paper introduces CLAP, a closed-loop framework for domain agent post-training that addresses data noise, evaluation uncertainty, and release risks. CLAP structures business data into training samples, diagnostic sets, and release gates, combining data validation, reward/KL analysis, and application-chain replay. Evaluated on five manufacturing batches using QLoRA-style LoRA-SFT, results show modest average gains (0.0098 score increase, 0.0240 pass rate improvement) but batch-wise variability, with GRPO revealing KL risks. Application-chain replay demonstrates RAG necessity, where a 3B model with RAG-oriented adapter improves factual extraction metrics but increases latency versus base+RAG.

closed-loop trainingdomain adaptationlora-sftragkl divergence

Mixture-of-Parallelisms: Towards Memory-Efficient Training Stack for Mixture-of-Experts Models

arXiv cs.AI · Xuan-Phi Nguyen, Shrey Pandit, Yiran Zhao, Semih Yavuz · 2026-07-02

The paper introduces Mixture-of-Parallelisms (MoP), a memory-efficient training stack for Mixture-of-Experts (MoE) models that combines specialized parallelism techniques across layers and stages. MoP optimizes CPU, GPU HBM memory, and inter-node communication bandwidth to enable lossless pre-training/fine-tuning of trillion-parameter models at million-token context lengths. A novel optimizer strategy enhances throughput and memory efficiency, allowing training on under 12 8x H200 GPU nodes. Experiments show MoP achieves 4.7x--8.2x higher per-GPU throughput compared to FSDP2 and sustains training at 1M tokens, whereas FSDP2 fails beyond 64--128K tokens.

mixture-of-expertsparallelismmemory-efficiencyoptimizerthroughput

Actual causality in fault trees

arXiv cs.AI · Georgiana Caltais, Milan Lopuhaä-Zwakenberg, Mariëlle Stoelinga · 2026-07-02

The paper establishes a formal connection between fault tree analysis and Halpern & Pearl's theory of actual causality, enabling fault trees to address diagnostic questions ("why has it gone wrong?") beyond their traditional risk assessment role. By analyzing fault tree graph structures and logical constructs, the authors classify different notions of actual causality and demonstrate how minimal cut sets correspond to actual causes. The results provide a complete structural characterization of causality in fault trees, bridging reliability engineering and formal causal reasoning.

fault treesactual causalityminimal cut setsfailure diagnosticsrisk models

Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge

arXiv cs.AI · Alex Brooker, Tim Hughes · 2026-07-02

The authors introduce Pre-Flight, an open-source benchmark of 300 multiple-choice questions for evaluating LLMs on aviation operational knowledge, covering international standards, regulations, and complex scenarios. Questions were practitioner-authored and evaluated using the Inspect framework, with models scored by accuracy against an expert reference (~95%). The top 2026 model achieved 82.7%, showing persistent gaps below expert reliability. The benchmark supports domain-specific evaluation for responsible AI deployment in non-safety-critical aviation operations.

large language modelsaviation operationsdomain-specific evaluationinspect frameworkmultiple-choice benchmark

MMIR-TCM: Memory-Integrated Multimodal Inference and Retrieval for TCM Clinical Decision Support

arXiv cs.AI · Lihui Luo, Joongwon Chae, Ziyan Chen, Yang Liu · 2026-07-02

The paper introduces MMIR-TCM, a multimodal framework for Traditional Chinese Medicine (TCM) clinical decision support, addressing subjectivity in tongue diagnosis through memory-augmented segmentation and retrieval-augmented generation. The three-stage architecture combines Memory-SAM for tongue extraction, Qwen3-VL for structured diagnosis, and Qwen3-based RAG for evidence-grounded prescriptions, evaluated on the new MedTCM dataset with the TDEU metric. Experiments show MMIR-TCM outperforms GPT-4o and Gemini 2.5 Flash in clinical accuracy.

multimodal large language modelretrieval-augmented generationmemory-augmented segmentationclinical decision supporttongue diagnosis

MMBench-Live: A Continuously Evolving Benchmark for Multimodal Models

arXiv cs.AI · Yuanzhi Liu, Shousheng Zhao, Bo Zhou, Kongming Liang · 2026-07-02

The authors introduce MMBench-Live, a continuously evolving multimodal benchmark for vision-language models that addresses temporal staleness and data contamination in static benchmarks. Their method employs a multi-agent-driven pipeline for task-guided dataset construction, featuring distribution-consistent updates to maintain cross-version comparability by extracting visual patterns from the original MMBench. Results demonstrate 5.9K new evaluation instances with high correctness (costing ~USD 30 per update), stable model rankings, reduced memorization signals, and semantic alignment with the original benchmark.

multimodal benchmarkvision-language modelsdata contaminationdistribution-consistent updatetask-guided dataset construction

Decoupling Code Complexity from Newcomer Participation: A Causal Study of AI Coding Agent Adoption in OSS

arXiv cs.AI · Weiwei Xu, Xuanning Cui, Hengzhi Ye, Minghui Zhou · 2026-07-02

The study provides causal evidence that AI coding agent adoption in open-source projects (n=603 genuine adopters) does not crowd out newcomer participation, contrary to concerns. Using difference-in-differences analysis on GitHub projects with matched controls, it finds no significant decline in newcomer inflow, onboarding, or retention post-adoption, despite measurable increases in code complexity (+11% cognitive metric for Python, +3-4% cyclomatic complexity). The results demonstrate a decoupling between increased complexity from AI tools and human participation, suggesting no trade-off exists in established projects.

ai coding agentsdifference-in-differencescode complexityopen-source softwaregithub

Expander Sparse Autoencoders: Parameter-Efficient Dictionaries for Mechanistic Interpretability

arXiv cs.AI · Rodrigo Mendoza-Smith · 2026-07-02

Expander Sparse Autoencoders (SAEs) introduce parameter-efficient dictionaries for mechanistic interpretability by leveraging expander graph structures. The method replaces dense decoders with TopK SAEs supported on left-$d$-regular expander masks, reducing storage from $O(mn)$ to $O(dn)$ while maintaining sparse-coding performance. Experiments on Pythia-70M/160M, Qwen2.5-3B, and Llama-3.2-1B show $293\times$ fewer decoder values at $84\%$ fidelity retention, with theoretical guarantees on identifiability and exact support recovery via OMP under expansion and column flatness conditions.

sparse autoencodersexpander graphsmechanistic interpretabilitycompressed sensingmatching pursuit

Single-Channel EEG-Based Cognitive Load Assessment in Online Learning: A Hybrid Deep Learning Approach

arXiv cs.AI · Rowan Hussein, Mohamed Ouf · 2026-07-02

The study proposes a hybrid CNN+LSTM+Attention model for cognitive load assessment during online learning using single-channel EEG data from the NeuroSky MindWave Mobile 2. The model combines raw waveform and band-power features, achieving 78.5% accuracy in within-subject evaluation, outperforming conventional feature-based classifiers (55%). Regularization techniques (dropout and L2) stabilize validation accuracy at 68-73%. The authors emphasize subject-independent evaluation as the standard and release a reproducible pipeline, framing the work as a feasibility study with an open tool for visualizing cognitive load over video timelines.

cognitive loadeeghybrid modelonline learningattention mechanism

Lightweight Safe Reinforcement Learning for End-to-End UAV Navigation

arXiv cs.AI · Shenghui Zhang, YuXuan Gao, Songwei Zhao, Jifeng Hu · 2026-07-02

We propose a lightweight safety-constrained reinforcement learning framework for end-to-end UAV navigation, addressing challenges of unsafe exploration, computational burden, and dynamic constraints. The method integrates perception and control via a hierarchical architecture, employing a lightweight network with asymmetric and depthwise separable convolutions to encode sparse observations into collision-risk-aware features. A Lagrangian-based safe PPO algorithm solves the constrained Markov decision process, enhanced by curriculum learning for training stability. Experiments demonstrate improved success rates, safety, and efficiency over baselines across varying obstacle densities and flight speeds.

unmanned aerial vehiclesconstrained markov decision processlagrangian-based safe ppodepthwise separable convolutionscurriculum learning

Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification

arXiv cs.AI · Yunhao Feng, Ruixiao Lin, Ming Wen, Qinqin He · 2026-07-02

We present Vera, an automated safety testing framework for LLM agents that implements a three-stage pipeline: literature-driven risk discovery, combinatorial composition of executable safety cases, and adaptive execution with evidence-grounded verification. Vera operates in isolated sandboxes where a control agent steers multi-turn interactions, and verifiers judge outcomes based on environment state and tool-call evidence. Evaluation on four production agent frameworks (OpenClaw, Hermes, Codex, Claude Code) revealed substantial safety weaknesses, with 93.9% average attack success rates under multi-channel attacks. Vera-Bench, comprising 1600 executable safety cases across 124 risk categories, demonstrates the need for modular testing infrastructure for evolving agentic systems.

llm agentssafety testingcombinatorial compositionevidence-grounded verificationadaptive execution

EPnG: Adaptive Expert Prune-and-Grow for Parameter-Efficient MoE Fine-tuning

arXiv cs.AI · Ahin Lee, Sehyun Yun, Taesik Gong · 2026-07-02

EPnG introduces an adaptive prune-and-grow framework for parameter-efficient fine-tuning of Mixture-of-Experts (MoE) models, dynamically reallocating LoRA capacity based on router gate probabilities. The method prunes under-utilized experts and expands high-importance experts via rank growth with orthogonal initialization while maintaining a fixed parameter budget. Evaluated on OLMoE and Qwen1.5-MoE, EPnG outperforms LoRA under identical budgets and matches full fine-tuning performance while updating only 0.55%-0.72% of parameters (140x-180x fewer).

mixture-of-expertsparameter-efficient fine-tuninglorarouter gate probabilitiesorthogonal initialization

Scene-Conditioned PINN-GNN for Multipath RF Maps: Cross-Scene Generation and In-Scene Completion

arXiv cs.AI · Lizhou Liu, Xiaohui Chen, Zihan Tang, Mengyao Ma · 2026-07-02

The paper proposes Scene-Conditioned PINN-GNN, a unified framework for constructing radio frequency (RF) maps supporting cross-scene generation and in-scene completion. The method combines a physics-informed neural network (PINN) to enforce electromagnetic propagation constraints and a graph neural network (GNN) to maintain spatial consistency among neighboring receivers. Evaluated using a novel peak-weighted dynamic time warping metric, the approach outperforms image-based, diffusion-based, and interpolation baselines in both map-level and multipath-level metrics, demonstrating robust generalization and high-fidelity reconstruction under sparse observations.

radio frequency mapsphysics-informed neural networkgraph neural networkmultipath propagationdynamic time warping

AI Virtue: What is "Good" Knowledge in the Age of Artificial Intelligence?

arXiv cs.AI · Alan Liu · 2026-07-02

The article proposes a framework for evaluating epistemic virtues in AI-generated knowledge, focusing on 'creativity' as a case study. Using digital humanities methods, it analyzes 553 AI-related journal articles from 2024 to map prevailing value discourses. Results suggest a need to shift from pre-AI knowledge work values toward 'generativity' as a future-oriented epistemic virtue, supported by an accompanying digital toolkit for corpus exploration.

epistemic virtuesknowledge workgenerativitydigital humanitiescorpus analysis

Subliminal Clocks: Latent Time Modelling in Diffusion Language Models

arXiv cs.AI · Maximo Rulli, Thomas Fontanari, Simone Petruzzi, Federico Alvetreti · 2026-07-02

This work identifies and analyzes latent timestep representations in Diffusion Language Models (DLMs), demonstrating that denoising progress is encoded in residual stream activations. Using layer-wise probing, the authors show that timestep information is reliably decodable and can be manipulated via low-dimensional subspaces to systematically modulate model confidence and entropy. Geometric analysis reveals structured, interpretable properties in activation space, elucidating how DLMs process denoising progress signals internally. These findings provide insights into the implicit temporal modeling mechanisms of DLMs compared to explicitly conditioned diffusion approaches.

diffusion language modelslatent timestepresidual streamsdenoising progressactivation space

Verifiable Knowledge Expansion through Retrieval-Grounded Formal Concept Analysis

arXiv cs.AI · Yujin Yang, Heejung Lee · 2026-07-02

The paper introduces a retrieval-augmented small language model (SLM) framework that leverages formal concept analysis (FCA) for verifiable knowledge expansion in ontology construction. The method uses FCA to propose implications over a formal context, validated by a retrieval-grounded SLM oracle that performs incidence judgments, consistency checks, and attribute proposals. Evaluated on a rare ataxia dataset from Orphadata, the framework achieves relation F1 scores of 0.29-0.52 and closure-based implication F1 scores of 0.22-0.30, with performance improvements observed for larger seed sets. Ablations highlight challenges in identifying positive object-attribute pairs even with fixed candidates.

ontology constructionformal concept analysisretrieval-augmented slmknowledge expansionincidence judgments

Repair the Amplifier, Not the Symptom: Stable World-Model Correction for Agent Rollouts

arXiv cs.AI · Xinyuan Song, Zekun Cai · 2026-07-02

The paper introduces WM-SAR (World-Model Subgraph Amplification Repair), a method for in-place correction of failed planning graphs in long-horizon agent workflows. Unlike conventional engineering approaches that scan for local symptoms, WM-SAR identifies and repairs causal subgraphs responsible for error amplification, reducing context load and improving repair precision. Experiments show WM-SAR outperforms baseline correctors under token constraints, achieving near-complete graph stabilization with compact subgraph interventions while maintaining cleaner repair targets for LLMs.

world-model correctionsubgraph amplificationagent rolloutsllm repairplanning graphs

SimWorlds: A Multi-Agent System for Dynamic 3D Scene Creation

arXiv cs.AI · Chunjiang Liu, Xiaoyuan Wang, Haoyu Chen, Yizhou Zhao · 2026-07-02

SimWorlds introduces a multi-agent framework for generating dynamic 4D scenes from text, addressing the challenges of coordinating spatial layout, physics solvers, temporal sequencing, and scene verification. The system employs a planner-coder-reviewer workflow with Blender-specific procedural knowledge, a layered scene protocol, and runtime-state inspection tools to ensure physical consistency. Evaluated on the 4DBuildBench benchmark, SimWorlds outperforms prior dynamic Blender generation baselines in both visual fidelity and physical correctness.

multi-agent system4d scene generationphysics solversprocedural generationruntime-state inspection

Mastermind: Strategy-grounded Learning for Repository-Scale Vulnerability Reproduction

arXiv cs.AI · Mingzhe Du, Luu Anh Tuan, Tianyi Wu, Renyang Liu · 2026-07-02

Mastermind introduces a dual-loop framework for repository-level vulnerability reproduction, separating transferable strategy learning from task-specific execution. The trainable planner learns reusable strategies through supervised fine-tuning (SFT) and milestone-based generalized reinforcement policy optimization (GRPO), while the experience loop maintains task-local strategy records. This approach allows strategy learning to enhance multiple frozen executors without modifying their action-generation capabilities. Evaluated on CyberGym with 260 training and 200 evaluation tasks, Mastermind achieves an 84.5% pass rate using GPT-5.5 as the executor, outperforming baselines like open-book PoC context (60.0%) and iterative improvement (77.0%). The planner also improves GPT-5.4 mini and GLM~5.1 by 15.0% and 12.5%, respectively.

vulnerability reproductionstrategy learningsupervised fine-tuninggeneralized reinforcement policy optimizationfrozen executor

ProCal: Inference-Time Proposal Calibration for Open-Vocabulary Object Detection

arXiv cs.AI · Jae-Ryung Hong, Ho-Joong Kim, Seong-Whan Lee · 2026-07-02

ProCal introduces an inference-time proposal calibration method for open-vocabulary object detection, addressing the limitation of frozen VLMs in recognizing object position and scale. The method combines a localization-aware foreground score and a background-aware suppression score to compute a proposal prior, improving localization quality. Evaluated on CLIPSelf ViT-L/14, ProCal increases APr by +2.5 on OV-LVIS, demonstrating effectiveness in suppressing false novel activations and ranking true novel proposals accurately.

open-vocabulary object detectionproposal calibrationlocalization-aware scorebackground-aware suppressioninference-time adaptation

Decentralized Stochastic Subgradient-type Methods with Communication Compression for Nonsmooth Nonconvex Optimization

arXiv cs.AI · Siyuan Zhang, Nachuan Xiao, Xin Liu · 2026-07-02

The authors propose a unified framework for decentralized stochastic subgradient-type methods with communication compression in nonsmooth nonconvex optimization. The framework encompasses unbiased compression and contractive compression with error compensation, analyzing convergence via continuous-time differential inclusions. They establish global convergence for nonsmooth objectives lacking Clarke regularity and develop compression-based methods incorporating sign-based regularization and gradient-tracking momentum. Numerical experiments validate the theoretical results, demonstrating communication-accuracy trade-offs in the proposed methods.

decentralized optimizationcommunication compressionstochastic subgradientnonsmooth nonconvexerror compensation

Path-level Hindsight Instructions for Semantic Exploration in Vision-Language Navigation

arXiv cs.AI · Sung June Kim, Sangpil Kim, Honglak Lee · 2026-07-02

The paper introduces Phi-Nav, a unified on-policy framework for Vision-Language Navigation (VLN) that addresses semantic mismatch during exploration through hindsight reasoning. The method employs a three-stage cycle: oracle-guided exploration, hindsight instruction synthesis by a speaker, and imitation learning with synthesized trajectory-instruction pairs. Evaluated on R2R-CE and RxR-CE benchmarks, Phi-Nav achieves competitive performance while using fewer expert demonstrations than baselines, demonstrating effective semantic exploration with limited data.

vision-language navigationhindsight reasoningon-policy explorationimitation learningsemantic supervision

MedStreamBench: A Time-Aware Benchmark for Streaming and Proactive Medical Video Understanding

arXiv cs.AI · Yuan Wang, Shujian Gao, Songtao Jiang, Zhengyu Hu · 2026-07-02

MedStreamBench introduces a time-aware benchmark for streaming and proactive medical video understanding, addressing the gap between conventional offline evaluation and real-world clinical decision-making requirements. The benchmark integrates 22 medical datasets (5,419 QA instances) across four temporal settings (retrospective, present, future, proactive) with bounded evidence windows, evaluating both answer correctness and temporal behavior metrics like responsiveness. Experiments show significant performance drops (exact magnitude unspecified) for general-purpose and medical vision-language models in streaming/proactive settings compared to offline recognition.

medical video understandingstreaming evaluationproactive monitoringtemporal behavior metricsevidence windows

Full Bayesian Reinforcement Learning via LF-IBIS

arXiv cs.AI · Stefano Masini, Cecilia Viscardi, Michela Baccini · 2026-07-02

The paper introduces Likelihood-Free Iterated Batch Importance Sampling (LF-IBIS), a novel Bayesian Reinforcement Learning (BRL) algorithm for environments with intractable likelihoods. LF-IBIS combines Approximate Bayesian Computation with Iterated Batch Importance Sampling to perform online belief updates and approximate posterior inference over environment parameters and optimal policies. Validation on response-adaptive randomization in clinical trials demonstrates accurate posterior approximation, while additional experiments show effective online policy updates in likelihood-free settings.

bayesian reinforcement learningapproximate bayesian computationiterated batch importance samplingpolicy uncertaintyexploration-exploitation trade-off

Meta-Benchmarks for Financial-Services LLM Evaluation

arXiv cs.AI · Blair Hudson · 2026-07-02

The authors introduce a meta-benchmarking framework for evaluating LLMs in financial services, addressing the mismatch between general leaderboards and domain-specific cognitive demands. The method organizes 452 benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN banking domains, using a multiplicative weighting scheme (discrimination × coverage × recency) to prioritize relevant tests. These weights scale the K-factor in pairwise Elo tournaments, producing comparable work-activity scores without raw score normalization. The framework is demonstrated on a snapshot of 288 models from 25 organizations, providing reproducible methodology for financial institution model selection.

meta-benchmarkinggeneralized work activitiesbian banking domainselo tournamentmultiplicative weighting

Predicting Closed-Loop Performance of Latent World Models: Offline Checkpoint Selection for MPC and Model-Based RL Under Non-Markovian Rewards in LunarLander

arXiv cs.AI · Nikolai Smolyanskiy · 2026-07-02

The paper introduces Composite Reward Observability Fraction (CROF), a validation metric for selecting latent world model checkpoints that best predict closed-loop control performance. The method evaluates 40 diagnostics on an RSSM world model trained in LunarLander v3, using CEM-MPC returns as ground truth. Results show CROF combines Reward Observability Fraction (ROF) with structural regularizers to select models enabling 24.5-point higher returns in model-based A2C versus model-free baselines, while reducing environment interactions 65x.

latent world modelscheckpoint selectionreward observabilitymodel-based rlnon-markovian rewards

Reformalization of the Jordan Curve Theorem

arXiv cs.AI · Simon Guilloud, Sankalp Gambhir, Samuel Chassot · 2026-07-02

The study introduces reformalization, a variant of autoformalization where proofs are translated between formal systems rather than from natural language. It presents three reformalizations of the Jordan Curve Theorem: Mizar-to-Lean, HOL Light-to-Lean, and HOL Light-to-Agda. The analysis identifies key pipeline design choices impacting practical reformalization, offering insights for cross-system proof translation.

reformalizationautoformalizationjordan curve theoremproof assistantformal verification

DRL-CLBA: A Clean Label Backdoor Attack for Speech Classification via DDPG Reinforcement Learning

arXiv cs.AI · Yueming Huang, Wenhan Yao, Fen Xiao, Xiarun Chen · 2026-07-02

DRL-CLBA introduces a clean-label backdoor attack for speech classification using Deep Deterministic Policy Gradient (DDPG) reinforcement learning and deep audio steganography. The method embeds sample-specific triggers as feature-space anchors and optimizes target samples toward these anchors in latent space without label modification. Evaluations on three datasets and four DNNs show high attack success rates, resistance to fine-tuning/pruning defenses, and evasion of spectral signature detection.

backdoor attackspeech classificationddpg reinforcement learningaudio steganographyfeature-space anchors

Distributionally Robust Listwise Preference Optimization

arXiv cs.AI · Xudong Wu, Jian Qian, Pangpang Liu, Vaneet Aggarwal · 2026-07-02

The paper proposes a distributionally robust listwise preference optimization method for language-model alignment, addressing ranking-label uncertainty due to annotator inconsistency or reward-model noise. The authors introduce a pointwise total-variation robust Plackett-Luce (PL) objective that decomposes into a nominal PL loss plus a worst-case PL correction, reducing inner maximization complexity from factorial to linearithmic. Theoretical guarantees include convexity in offline settings and weak convexity in online policy-induced settings. Experiments demonstrate preserved performance under clean labels and improved robustness under noise, with enhancements in both reward-model and GPT-4 judge metrics.

listwise preference optimizationplackett-luce modeldistributional robustnesslanguage-model alignmentranking-label uncertainty

Generic Expert Coverage for Pruning SparseMixture-of-Experts Language Models

arXiv cs.AI · Yongqin Zeng, Sicheng Pan, Jiale Wang, Hai-tao Zheng · 2026-07-02

We propose Generic TB-Coverage, a coverage-aware expert pruning method for sparsely activated Mixture-of-Experts (MoE) language models that operates without downstream calibration data. The method profiles per-expert utility separately on generic text corpora (WikiText2 and C4) and enforces a fixed-budget coverage rule to preserve high-utility experts across corpora before constructing the pruning mask. Evaluated on Qwen1.5-MoE-A2.7B and DeepSeek-MoE-16B-Base at 25%, 50%, and 75% retention budgets, our method improves average accuracy on six zero-shot benchmarks over random pruning, REAP, and ExpertSparsity, while reducing perplexity degradation on WikiText2 and C4. Gains are most pronounced under aggressive pruning (25% and 50% retain), demonstrating the efficacy of cross-corpus expert coverage as a generic-data prior.

mixture-of-expertsexpert pruningzero-shot benchmarksperplexity degradationfixed-budget coverage

COMFYCLAW: Self-Evolving Skill Harnesses for Image Generation Workflows

arXiv cs.AI · Zongxia Li, Dawei Liu, Fuxiao Liu, Yuhang Zhou · 2026-07-02

COMFYCLAW introduces a self-evolving skill harness for ComfyUI-based image generation workflows, addressing the challenge of agent memory and reusable skills in recurring tasks. The framework formulates workflow construction as typed graph editing, integrates stage-specific tools with automatic error reversion, and employs a region-level VLM verifier to diagnose visual failures. It distills execution trajectories, errors, and verifier feedback into reusable Agent Skills through progressive skill library disclosure. Evaluations across four benchmark splits, three agent models, and two image backbones show COMFYCLAW outperforms verifier-only baselines in image-generation scores (all six configurations) and human preference studies.

agentic skill evolutioncomfyui workflowstyped graph editingregion-level vlmprogressive skill library

Pmeta-TLA: Backdoor Attacks for Speech Classification Models via Meta-Learning with Timbre Leakage Attack

arXiv cs.AI · Yueming Huang, Wenhan Yao, Fen Xiao, Xiarun Chen · 2026-07-02

The paper introduces Pmeta-TLA, a meta-learning-based backdoor attack framework for speech classification models that combines Timbre Leakage Attack (TLA) with Projected Conflicting Gradients (PCGrad). TLA distributes timbre information at the frame level in self-supervised features to create stealthy poisoned samples, while Pmeta-TLA enables efficient multi-backdoor injection via meta-learning. Evaluated on keyword spotting tasks, the method demonstrates higher attack success (exact metrics unspecified), improved stealth, robustness, and lower computational cost compared to baselines.

backdoor attackmeta-learningspeech classificationtimbre leakageself-supervised features

Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing

arXiv cs.AI · Joshua Penman · 2026-07-02

The paper introduces Goggles, a pretrained module that edits gradients during supervised finetuning to impart a chosen epistemic frame to language models, addressing Negation Neglect. Goggles intervenes on LoRA gradients, enabling models to correctly identify fictional content 91% of the time (vs. 9% baseline) while maintaining performance on GPQA and TruthfulQA. The method supports multiple frames (e.g., AI safety evaluation) and persists under adversarial finetuning. Results demonstrate effective training on misaligned data without absorbing undesired behaviors.

epistemic framegradient editingnegation neglectlorasupervised finetuning

Model Merging as Probabilistic Inference in Fine-Tuning Parameter Space

arXiv cs.AI · Long Minh Bui, Tuan Anh Le Van, Tung Phi Duc, Phi Le Nguyen · 2026-07-02

The paper introduces a probabilistic inference framework for model merging, formulating it as a product-of-experts (PoE) scenario where each single-task solution defines an energy-based expert model (EBM) over merged parameters. It demonstrates that existing merging methods are special cases under Gaussian assumptions on directional residuals, but proposes a heavy-tailed PoE design using Cauchy experts to better capture observed residual behavior. Experiments across multiple tasks and architectures show significant improvements over state-of-the-art baselines.

model mergingproduct-of-expertsenergy-based modelheavy-tailed residualscauchy experts

Beyond Gradient-Based Attacks: Adversarial Robustness and Explainability Stability in Cybersecurity Classifiers

arXiv cs.AI · Mona Rajhans, Vishal Khawarey · 2026-07-02

The study introduces the Explainability Stability Index (ESI) to quantify SHAP attribution drift under adversarial attacks on cybersecurity classifiers, complementing traditional Robustness Index (RI) measurements. Evaluating Random Forest and XGBoost across four tabular security datasets (phishing URLs, UNSW-NB15, NF-ToN-IoT, HIKARI-2021), the research compares five attack methods including three black-box approaches. Key findings reveal gradient-based attacks (ZOO) produce misleading robustness scores (~0.98 RI) for XGBoost due to piecewise-constant surfaces, while score-based Square Attack exposes vulnerabilities (RI ~0.36). Attribution drift remains significant (ESI 0.06-0.16 for XGBoost vs. 0.14-0.29 for RF), demonstrating prediction robustness and explanation stability are independent dimensions requiring joint assessment.

adversarial robustnessexplainability stabilitytree ensemblesshap attributionblack-box attacks

Separating Expert Retention from Autonomous Source Inference in Raw-ECG-Replay-Free Continual ECG Deployment

arXiv cs.AI · Yufan Lu, Xinhui Liu, Chenyang Xu, Yuxi Zhou · 2026-07-02

The study introduces a method for continual ECG deployment that separates expert retention from autonomous source inference, addressing scenarios where raw ECGs cannot be retained or replayed. The approach employs a frozen 1024-dimensional ECGFounder backbone, incrementally adding balanced-softmax linear experts for new domains and using a lightweight router trained on retained features and domain labels. A validation-calibrated margin rule fuses the two most likely experts. Results on CPSC, PTB-XL, Georgia, and Chapman-Shaoxing datasets show source-aware expert selection achieves 0.7915±0.0036 Macro-F1, while top-2 margin fusion reaches 0.7782±0.0022 without source IDs, identifying autonomous source inference as the primary bottleneck.

ecgfounderbalanced-softmaxmacro-f1margin rulelightweight router

Diverse Evidence, Better Forecasts: Multi-Agent Deliberation Under Information Asymmetry

arXiv cs.AI · Yuante Li, Yicheng Tao, Kate Zhang, Taozhi Wang · 2026-07-02

The paper introduces InfoDelphi, a multi-agent forecasting framework that improves prediction accuracy through designed information asymmetry. By partitioning evidence into shared public and disjoint private subsets, the method reduces inter-agent error correlation via relevance-aware evidence routing, rationale-based iterative deliberation, and confidence-weighted aggregation. On PolyGym (375 binary forecasting questions), InfoDelphi outperforms single-agent and symmetric multi-agent baselines by 12-18% in Brier score and 4-8 percentage points in accuracy, demonstrating that input diversity is critical for effective deliberation.

multi-agent systemsinformation asymmetrybrier scorerationale-based deliberationerror correlation

AgenticDataBench: A Comprehensive Benchmark for Data Agents

arXiv cs.AI · Zhaoyan Sun, Shan Zhong, Daizhou Wen, Jiaxing Han · 2026-07-02

The paper introduces AgenticDataBench, a comprehensive benchmark for evaluating LLM-based data agents across diverse data science workflows. The benchmark incorporates 15 domains (including 5 fintech use cases), employs skill-aligned hierarchical clustering of Stack Overflow solutions to identify representative data science skills, and uses LLM-based generation for domains lacking real tasks. Evaluation on this annotated benchmark provides granular skill-level performance insights. The method ensures broad coverage through diversity-maximizing task selection and systematic synthetic task generation.

data agentsskill-aligned clusteringllm-based generationbenchmark coveragedata science workflows

Autonomous discovery of traffic laws with AI traffic scientists

arXiv cs.AI · Xingyuan Dai, Yue Liu, Xiaoyan Gong, Qinghai Miao · 2026-07-02

TrafficSci introduces an autonomous AI system for discovering universal traffic laws through an iterative workflow combining evidence scoping, hypothesis induction, and observational-interventional validation. The method formulates traffic-law discovery as an agentic process across population, network, control, and trajectory scales. Results demonstrate rediscovery of three established traffic laws and identification of a novel intrinsic temporal memory scale in urban driving behavior, validated across eight cities and two trajectory datasets.

traffic lawsautonomous discoveryagentic aihypothesis inductionobservational validation

MKGR: Multimodal Knowledge-Graph Representation Learning for Cold-Start Protein-Protein Interaction Prediction

arXiv cs.AI · Wenbo Zhang · 2026-07-02

The paper introduces MKGR, a multimodal representation learning framework for cold-start protein-protein interaction (PPI) prediction, addressing scenarios where candidate proteins lack observed PPI edges during training. MKGR integrates region-aware protein sequence encoding with four protein-centered biomedical knowledge graphs (protein-drug, protein-disease, protein-miRNA, protein-lncRNA) using graph attention encoders. It employs a bridge reconstruction objective for graph regularization and a pair-level gating module to adaptively combine sequence and graph evidence. Evaluations on two benchmark datasets demonstrate MKGR's superior performance over sequence, network, and knowledge-graph baselines across ACC, F1, AUC, AUPR, and MCC metrics in novel-old and novel-novel cold-start settings.

protein-protein interactionknowledge-graphgraph attentionbridge reconstructioncold-start prediction

Spatial Support Matters: Geometry-Aware Graph Fusion for Rainfall Field Reconstruction

arXiv cs.AI · Low Jun Yu, Niramay Kachhadiya, Herath Mudiyanselage Viraj Vidura Herath, Sanka Rasnayaka · 2026-07-02

The paper proposes a geometry-aware multi-support heterogeneous graph neural network for fine-scale rainfall field reconstruction, addressing incompatible spatial supports from gauges (0D), microwave links (1D), and radar/satellite (2D). The method represents each support type as distinct node layers and fuses them via cross-support message passing into a point-support prediction layer, enabling resolution-decoupled reconstruction. Evaluated on Singapore data, it reduces RMSE by 23.2% over inverse-distance weighting and outperforms convolutional fusion and support-agnostic graph baselines. Generalization tests in Sydney reveal fusion benefits depend on gauge spacing relative to the field's spatial correlation length.

rainfall reconstructionheterogeneous graphspatial supportmessage passingmulti-sensor fusion

Scaling with Confidence: Calibrating Confidence of LLMs for Adaptive Test Time Scaling

arXiv cs.AI · Xuqing Yang, Yi Yuan, Shanzhe Lei, Xuhong Wang · 2026-07-02

The paper introduces C3RL, a reinforcement learning algorithm that jointly optimizes correctness and confidence calibration in large language models (LLMs), addressing overconfidence during uncertainty. C3RL combines correctness, calibration, and dataset-informed reference accuracy rewards, evaluated across 8 text and multimodal datasets. Results show improved calibration without accuracy loss, outperforming state-of-the-art methods. The authors also propose CAS, a confidence-based adaptive test-time scaling strategy that reduces inference budgets by up to 12.33× while maintaining performance.

reinforcement learningconfidence calibrationlarge language modelsadaptive inferencemultimodal datasets

Profit-Based Counterfactual Explanations for Product Improvement: A Case Study of Manga Sales in Japan

arXiv cs.AI · Keita Kinjo, Takeshi Ebina · 2026-07-02

The paper proposes profit-based counterfactual explanation (PBCE), a framework that reformulates counterfactual explanations as a profit maximization problem in management contexts. Unlike traditional CE methods requiring exogenous target specification and distance metrics, PBCE directly optimizes profit while reinterpreting distance as attribute modification costs. The approach addresses limitations in regression settings by providing economically grounded interpretations of variable changes without predefined targets. A case study on Japanese manga sales demonstrates the method's applicability.

counterfactual explanationprofit maximizationregression analysisproduct attributesdecision optimization

SemHash-LLM: A Multi-Granularity Semantic Hashing Framework for Document Deduplication

arXiv cs.AI · Xinyi Fang, Kejian Tong, Jiabei Liu, Tao Ning · 2026-07-02

SemHash-LLM introduces a multi-granularity semantic hashing framework for efficient document deduplication, combining character, token, and document-level signals via gated fusion. The method integrates semantic projection hashing in distilled LLM embedding space, attention-weighted MinHash for content emphasis, and contrastive boundary learning with uncertainty estimation. A cascaded filtering pipeline enables efficient candidate reduction while maintaining semantic equivalence. Experiments demonstrate strong duplicate detection quality with under 1% neural verification cost, addressing template pollution, short text perturbation, containment, and viral fragments.

semantic hashingdocument deduplicationcontrastive learningminhashllm distillation

Safe and Adaptive Cloud Healing: Verifying LLM-Generated Recovery Plans with a Neural-Symbolic World Model

arXiv cs.AI · Junyan Tan, Haoran Lin, Siyuan Guo, Yichen Fang · 2026-07-02

The paper introduces PASE, a Planning-Aware Semantic self-healing engine that redefines cloud fault recovery as a neuro-symbolic program synthesis task. The framework integrates an LLM as a Plan Synthesis Engine to generate structured recovery plans from semantic primitives, verified by a Neural-Symbolic World Model and optimized via DRL-trained Meta-Prompt Optimizer. Experiments on a cloud fault injection dataset show PASE reduces recovery time by 40% and improves fault detection accuracy in unknown scenarios compared to state-of-the-art methods.

neuro-symbolicplan synthesismeta-prompt optimizerfault injectionself-healing

Hawk: Harnessing Hardware-Aware Knowledge for High-Performance NPU Kernel Generation

arXiv cs.AI · Junyi Wen, Ruiyan Zhuang, Yongjia Xu, Pengtu Li · 2026-07-02

Hawk introduces a training-free framework for high-performance NPU kernel generation by incorporating hardware-aware knowledge through three modules: (1) Run-Time Knowledge Synthesis using Triple-Part Executable Knowledge Representation to link error context with executable semantics, (2) Bottleneck-Aware Knowledge Retrieval via 2D-Retrieval in syntactic and hardware-aligned semantic spaces, and (3) Effect-Driven Knowledge Distillation employing LLM-driven semantic arbitration to prune errors and consolidate redundancies. Evaluations on real-world NPU workloads show Hawk improves generation accuracy from 49.4% to 80.0% and achieves up to 2.2x speedup over baselines.

neural processing unitskernel generationhardware-aware knowledgeknowledge distillationexecutable semantics

VLAFlow: A Unified Training Framework for Vision-Language-Action Models via Co-training and Future Latent Alignment

arXiv cs.AI · Guoyang Xia, Fengfa Li, Hongjin Ji, Lei Ren · 2026-07-02

VLAFlow introduces a unified flow-matching framework for comparing vision-language-action (VLA) training objectives, using the OXEMix dataset (5,000 hours of heterogeneous robot data) and a shared pi0-style architecture. The study evaluates four paradigms: action-only modeling (MindPI), language-supervised co-training (MindLPI), future latent alignment (MindWPI), and their combination (MindLWPI). Results on LIBERO, LIBERO-Plus, and SimplerEnv show that MindLWPI, combining language supervision and future latent alignment, achieves the most stable transfer performance, suggesting complementary benefits of language and latent representations for heterogeneous action spaces.

vision-language-action modelsflow-matchingheterogeneous datafuture latent alignmentco-training

ADVENT: LLM-Driven Automatic Predicate Invention for ILP

arXiv cs.AI · Tingting Yu, Pei-Cing Huang, Chan Hsu, Chan-Tung Ku · 2026-07-02

ADVENT introduces an LLM-driven predicate invention mechanism for Inductive Logic Programming (ILP), addressing the bottleneck of manual predicate creation. The method combines LLM abductive generation with Prolog deductive verification, iteratively refining candidate predicates through concrete execution feedback. Invented predicates accumulate in a reusable knowledge pool, enhancing cross-task performance. Evaluations on poker-hand concepts demonstrate that ADVENT achieves a 58% success rate where ILP fails, rising to 80% with formal verification, and yields up to +31 percentage points improvement via knowledge reuse, while maintaining human-interpretable rules.

inductive logic programmingpredicate inventionlarge language modelsabductive generationknowledge reuse

EO-Agents: A Three-Agent LLM Pipeline for Earth Observation Hypothesis Generation

arXiv cs.AI · Mahyar Ghazanfari, Amin Tabrizian, Armin Mehrabian, Peng Wei · 2026-07-02

The paper introduces EO-Agents, a three-agent LLM pipeline for generating Earth observation hypotheses grounded in NASA's structured knowledge graph. The method combines a heterogeneous graph neural network for ranking dataset pairings with an LLM-based filtering, generation, and evaluation process. When applied to 1,475 NASA datasets, the system produced 160 hypotheses across multiple domains, with predicted novel pairings rated nearly as plausible as real co-usages (GPT-5.2 and Claude Sonnet 4.6 experiments show ranking stability but score dependence on judge identity).

earth observationknowledge graphheterogeneous gnnllm pipelinehypothesis generation

Scaling Trends for Lie Detector Oversight in Preference Learning

arXiv cs.AI · Oskar J. Hollinsworth, Ann-Kathrin Dombrowski, Sam Adam-Day, Adam Gleave · 2026-07-02

This work scales Scalable Oversight via Lie Detectors (SOLiD) to larger language models and evaluates it in diverse preference-learning settings. SOLiD employs lie detectors to identify deceptive responses for human review, reducing reliance on costly labelers. Results show favorable scaling: undetected deception decreases from 34% in 1B-parameter models to 14% in 405B-parameter models at a 99% true positive rate. Human labelers can be removed from fine-tuning without statistically significant increases in deception. However, SOLiD exhibits sensitivity to distribution shift between detector training and preference-training data, leading to impractical false positive rates.

scalable oversightlie detectorspreference learningdistribution shiftfalse positive rate

DiPS: Dialogue Policy Selection for High-Stakes Persuasion Agents

arXiv cs.AI · Tianyi Zhang, Mousumi Das, Abrar Anwar, Jesse Thomason · 2026-07-02

The paper introduces Dialogue Policy Selection (DiPS), a Q-learning framework for dynamically selecting persuasion strategies in high-stakes dialogues. Focusing on fire-rescue evacuation scenarios, DiPS trains a critic to maximize evacuation success by adapting policy choices to the resident's utterances. Evaluations against zero-shot LLM and RAG-augmented baselines show DiPS achieves higher success rates in both simulated and human interactions.

dialogue policy selectionq-learningpersuasion agentshigh-stakes scenariosevacuation success

X-LogSMask: Expand Transformer for Graph-Structured Data

arXiv cs.AI · Leyan Li, Rennong Yang, Zhenxing Zhang, Liping Hu · 2026-07-02

The authors propose X-LogSMask, an explainable logarithmic structural mask for Transformers that injects symmetrically normalized graph topology directly into attention logits. The method uses a logarithmic transform to convert structural connectivity into a topology-aware gating signal, suppressing unsupported node interactions while preserving feature-dependent attention. Different powers of the normalized adjacency matrix are assigned to different attention heads, enabling multi-hop propagation within a single layer. Evaluated on 20 node-, edge-, and graph-level benchmarks, X-LogSMask-equipped Transformers achieve state-of-the-art performance on 13 datasets and remain competitive in a lightweight one-layer configuration.

graph transformersstructural masklogarithmic transformmulti-hop propagationnormalized adjacency matrix

Evolutionary Feature Engineering for Structured Data

arXiv cs.AI · Ege Onur Taga, Yilin Zhuang, M. Emrullah Ildiz, Petros Mol · 2026-07-02

The paper introduces Evolutionary Feature Engineering (EFE), a framework using LLM-based evolution to discover preprocessing transformations for structured data. EFE represents transformations as Python programs with fit/transform interfaces, refined via dataset context, summary statistics, and validation performance. Two instantiations are presented: EFE-Time for time-series forecasting, reducing errors by 3-19% on datasets like COVID-Deaths with Chronos-2, and EFE-Tab for tabular prediction, evolving interpretable features that improve decision tree accuracy. Results show EFE enhances both performance and interpretability in structured data tasks.

evolutionary feature engineeringstructured datallm-based evolutiontime-series forecastingtabular prediction

OPINE-World: Programmatic World Modeling with Ontology-error-Prioritized Interactive Exploration

arXiv cs.AI · David Courtis, Wenhao Li, Scott Sanner · 2026-07-01

OPINE-World introduces a programmatic world modeling approach using LLM agents for interactive environment learning. The method employs two cooperating agents: one for environment interaction and another for code synthesis via counterexample-guided inductive synthesis (CEGIS), with exploration guided by a Bayesian ontology-error metric. Evaluated on ARC-AGI-3, OPINE-World solves 20 of 25 games without per-game training, achieving an action-efficiency score of 78.4 against human baselines.

programmatic world modelingcounterexample-guided inductive synthesisontology errorllm agentsinteractive exploration

IntentTune: Using user demand and personalization to resolve "unknown" query intents for e-commerce search

arXiv cs.AI · Rachith Aiyappa, Ishita Khan, Chester Palen-Michel, Jayanth Yetukuri · 2026-07-01

IntentTune introduces a framework for resolving ambiguous query intents in e-commerce search by leveraging user-specific behavioral signals and population-level demand patterns. The method addresses under-specified queries (e.g., 'watch' or 'shirt') by inferring latent attributes such as gender, age group, and product category. Experiments on real-world e-commerce data show that user-specific behavioral signals, particularly prior search queries, outperform population-level statistics and static profile information in accurately inferring intent. Population-level demand patterns alone are insufficient for reliable intent resolution.

query intent detectionbehavioral signalspopulation-level demandlatent attributese-commerce search

Multi-Head Recurrent Memory Agents

arXiv cs.AI · Jiatong Li, Samuel Yeh, Sharon Li · 2026-07-01

We propose Multi-Head Recurrent Memory (MHM), a training-free framework addressing the memory retention bottleneck in recurrent memory agents for long-context LLMs. MHM partitions memory into independent heads, employing a stage-wise select-then-update strategy that shields inactive heads from overwriting. As a lightweight instantiation, MHM-LRU ensures uniform head utilization with zero token overhead. Experiments on long-context benchmarks (100K--1M tokens) demonstrate substantial improvements: MHM-LRU boosts memory retention from <30% to 73.96% on RULER-HQA at 896K tokens, with gains generalizing across model families, scales, and task types.

recurrent memorymemory retentionlong-contextmulti-headselect-then-update

Robust and Explainable 3D Mode Shape Recognition Using Region-Aware Graph Neural Networks

arXiv cs.AI · Tong Duy Son, Marc Brughmans, Andrey Hense, Kohta Sugiura · 2026-07-01

The paper introduces a Canonical Engineering Graph Representation and region-aware graph learning framework for robust 3D mode shape recognition in automotive NVH development. The method transforms heterogeneous finite element models and experimental measurements into a common graph with semantically meaningful structural regions, using geometry-independent descriptors, graph attention learning, and region-aware pooling. Evaluated on FE and experimental datasets from four vehicle programs under label scarcity, the framework achieves high classification accuracy, cross-vehicle transferability, and physically interpretable predictions by linking to engineering-defined structural regions.

mode shape recognitiongraph neural networksfinite element modelsnvh analysisregion-aware pooling

Revisiting Chain-of-Thought Reasoning under Limited Supervision: Semi-supervised Chain-of-Thought Learning

arXiv cs.AI · Hongyang He, Jiuming Liu, Victor Sanchez · 2026-07-01

The paper introduces Semi-supervised Chain-of-Thought Learning (Semi-CoT), a framework leveraging unlabeled questions to construct pseudo reasoning supervision for chain-of-thought reasoning. The method samples multiple pseudo-CoTs per unlabeled question, estimates answer-level semantic entropy, and selects low-entropy chains as reliable demonstrations. Experiments on AQuA, SVAMP, GSM8K, and MultiArith show pseudo-answer precision of 91.36%-100%, with small gains on SVAMP and GSM8K, while AQuA exhibits negative transfer and MultiArith reaches saturation.

chain-of-thoughtsemi-supervised learningsemantic entropypseudo supervisionreasoning chains

Janus: a Playground for User-Involved Agentic Permission Management

arXiv cs.AI · Natalie Grace Brigham, Eugene Bagdasarian, Tadayoshi Kohno, Franziska Roesner · 2026-07-01

The paper introduces Janus, a modular system for studying user involvement in AI agent permission management. Janus comprises Janus-Core for implementing permission designs and Janus-Harness for automated evaluation, grounded in a conceptual model of design axes. Six permission assistants were implemented and evaluated across three scenarios with synthetic responders, revealing that user input enhances privacy/security (though requiring cognitive load mitigation via AI augmentation) and that permission fatigue impacts system performance. Results show no universally optimal design, necessitating context-sensitive approaches. The system is publicly available for further research.

agentic systemspermission managementcognitive loadprivacy preservationautomated evaluation

The Agentic Garden of Forking Paths

arXiv cs.AI · Jiacheng Miao, Jonathan K Pritchard, James Zou · 2026-07-01

The paper demonstrates that AI agents can replicate human researchers' analytical variability by producing divergent conclusions from identical data when assigned different personas. Across four domains, AI agents reproduced 72% of the ideological gap observed among 42 human teams analyzing immigration data, with 86% of AI-generated reports passing independent review. The authors propose the m-value metric to quantify analysis extremeness and introduce Agentic Bootstrap, which samples plausible analysis paths using AI agents. Results show 13.5% of human analyses fell in the most extreme 5% of the analysis space (m<0.05), highlighting the need to evaluate scientific claims within the distribution of defensible analyses.

agentic bootstrapm-valueanalytical variabilityideological gapmultiverse analysis

Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL

arXiv cs.AI · Juliette Decugis, Sean O'Brien, Francis Bach, Gabriel Synnaeve · 2026-07-01

We introduce FADE (Focal Advantage with Dynamic Entropy), a self-adapting advantage function for reinforcement learning post-training of LLMs that addresses training instability and diversity collapse. FADE decomposes advantage functions along orthogonal axes of sign (balancing entropy and weight geometry) and difficulty (balancing signal sharpness and sample size), dynamically adapting gradient weights based on training phase. Experiments demonstrate FADE achieves peak pass@1 20k steps earlier than static baselines at 7B scale and 2k steps earlier at 32B scale, while optimizing accuracy-diversity trade-offs on LiveCodeBench and AIME benchmarks.

advantage functionpolicy gradiententropy collapsepass@kreinforcement learning

Fully Unsupervised Detection of Physical Contacts on Subsea Cables via State-of-Polarization Monitoring

arXiv cs.AI · Agastya Raj, Alvaro Doval, Tian Tian, Steinar Bjørnstad · 2026-07-01

The authors propose a fully unsupervised Fast-Slow Deep Support Vector Data Description (DSVDD) detector for monitoring physical contacts on subsea cables via State-of-Polarization measurements. The method operates without labeled event data, using an anomaly detection approach on continuous cable monitoring recordings. Evaluation on deployed infrastructure demonstrates effective detection, with all five confirmed trawler contacts ranked in the top 13 out of 122,174 recordings, while also identifying additional validated contact events.

unsupervised learninganomaly detectionsubsea cablesstate-of-polarizationdeep svdd

Procedural Memory Distillation: Online Reflection for Self-Improving Language Models

arXiv cs.AI · Ye Liu, Srijan Bansal, Bo Pang, Yang Li · 2026-07-01

The paper introduces Procedural Memory Distillation (PMD), a method for online self-improvement in language models by converting cross-episode training signals into reusable procedural memory. PMD organizes memory at three abstraction levels (raw trajectories, self-reflected strategies, and behavioral patterns) and uses a memory-conditioned self-teacher to distill knowledge into the policy weights. Evaluated on Qwen3-8B and OLMo3-Instruct-7B, PMD outperforms SDPO by 3.8-5.5% on SCIKNOWEVAL and 7.9-13.6% on LIVECODEBENCH, with co-evolution of memory and policy being critical (freezing either reduces performance by >10%).

procedural memory distillationself-distillationco-evolutionverifiable rewardsonline reflection

World Feedback for Clinical Agents: Diagnosing RL in FHIR Environments

arXiv cs.AI · Ananya Mantravadi, Harshit Rajgarhia, Prasanna Desikan, Abhishek Mukherji · 2026-07-01

The paper introduces MedAgentBench-v3 (MAB-v3), a benchmark with 508 clinical protocol-execution tasks addressing a 41.7% silent-finish ceiling in prior versions. It evaluates reinforcement learning (RL) in FHIR environments using Qwen3-8B, identifying two structural barriers: a capability ceiling (10/20 task types show 0% base performance) and a format-knowledge barrier (3/20 types require exact clinical codes). Pure RL achieves 18.2% pass@1 versus 34.1% for rule-based supervised fine-tuning (SFT), with the 15.9 percentage-point gap attributed to these barriers. The proposed taxonomy prescribes SFT for code injection and RL for learning conditionals.

reinforcement learningclinical protocolsfhir environmentscapability ceilingformat-knowledge barrier

Beyond Next-Token Prediction: An RLVR Proof of Concept for Tool-Use Agents on Atlassian Workflows

arXiv cs.AI · Karthikeya Aditya Vissa, Sankalp Mane, Ananya Mantravadi, Harshit Rajgarhia · 2026-07-01

The paper demonstrates that Reinforcement Learning with Verifiable Rewards (RLVR) significantly improves tool-use accuracy for LLMs in enterprise API workflows, addressing next-token prediction mismatch. Using synthetic Jira REST v3 and Confluence v2 environments with schema-accurate reward checkers, RL-trained Qwen3.5-4B policies achieve 0.95--1.00 average reward (vs 0.35--0.92 for prompted baselines), notably perfecting Confluence page creation. The method requires no human labels or live APIs, but highlights scalability limits in hand-crafted rewards and saturating reward scenarios.

reinforcement learningtool-use agentsenterprise apisverifiable rewardsnext-token prediction

Grounded Optimization: A Layered Engineering Framework for Reducing LLM Hallucination in Automated Personal Document Rewriting

arXiv cs.AI · Shashank Indukuri, Adarsh Agrawal · 2026-07-01

The paper introduces Grounded Optimization, a five-layer framework to mitigate LLM hallucination in automated resume rewriting. The method combines temporal context validation, contamination detection, structural enforcement, prompt grounding, and evaluator agents. Testing across 3 LLMs, 4 temperatures, and 25 synthetic resumes showed baseline hallucination rates of 2.48-5.36 per resume, reduced to 0.04-0.24 with the framework. Temporal hallucinations decreased by 50-95%, with prompt-level grounding achieving zero hallucinations at low temperatures for capable models.

llm hallucinationresume optimizationtemporal validationcontamination detectiongrounded optimization

Token Geometry

arXiv cs.AI · Kathan Shah · 2026-07-01

The authors introduce Ember, a lightweight optimizer for embedding and LM-head matrices in language models, reducing VRAM usage from O(2VD) in Adam to O(V + D) and eliminating the need for sharding token table optimizer states. Ember exploits distinct gradient geometry at the embedding-LM-head interface, improving Pareto frontiers across supervised finetuning, RL, and pretraining with minimal optimizer state. Empirical results demonstrate Ember's scalability across batch sizes and parameter counts, revealing token optimization trajectories as simple 1D rays rather than navigating complex nonconvex landscapes. The authors provide a principled analysis of Transformer optimizer requirements and release a distributed Ember implementation compatible with ZeRO/FSDP setups.

emberoptimizerembeddinglm-headgeometry

On the Utility and Factual Reliability of Pruned Mixture-of-Experts Models in the Biomedical Domain

arXiv cs.AI · Atsuki Yamaguchi, Szymon Palucha, Léo Bijar, Aline Villavicencio · 2026-07-01

The study evaluates how structured expert pruning affects both utility and factual reliability in Mixture-of-Experts (MoE) models, focusing on biomedical applications. It tests four MoE models with six pruning methods across various pruning ratios, measuring performance in generation and classification tasks under in-domain and cross-domain settings. Results show moderate pruning preserves in-domain utility without immediate reliability loss, but hallucination risks rise at extreme ratios, while cross-domain performance degrades rapidly, highlighting the need for reliability assessment alongside utility in high-stakes deployments.

mixture-of-expertsstructured pruningfactual reliabilitybiomedical domainhallucination risk

Discrete Diffusion Language Models for Interactive Radiology Report Drafting

arXiv cs.AI · Max Van Puyvelde, Halil Ibrahim Gulluk, Wim Van Criekinge, Olivier Gevaert · 2026-07-01

We introduce a discrete diffusion language model, DiffusionGemma-26B, adapted for interactive radiology report drafting, demonstrating competitive performance with autoregressive (AR) counterparts. The model employs a mixture-of-experts architecture and bidirectional token canvas denoising, enabling any-order infill capabilities absent in AR models. Evaluated on medical visual question answering datasets using a verbosity-robust LLM judge, DiffusionGemma-26B matches or exceeds Gemma-4-26B's performance while achieving 3.5-4.4x faster decoding. The finetuned model (3.8B active parameters) proves competitive with frontier vision-language models, offering radiologists enhanced drafting flexibility by allowing fragment fixes and contextual infills, addressing inconsistencies in clinical reports.

diffusion language modelbidirectional denoisingany-order infillmixture-of-expertsvisual question answering

CreativityNeuro: Steering Language Model Weights to Improve Divergent Thinking and Reduce Mode Collapse

arXiv cs.AI · Samuel Schapiro, Core Francisco Park, Felix Sosa, Lav R. Varshney · 2026-07-01

CreativityNeuro introduces a data-free method for enhancing divergent thinking in LLMs via contrastive weight steering, addressing mode collapse in open-ended generation. The approach operates without behavioral data, retraining, or gradient-based fine-tuning. Evaluations on the Divergent Association Task (DAT) show a 14-percentile improvement, while human assessments (N=720) on the Alternative Uses Test (AUT) and Task Task demonstrate gains in originality, surprise, and creativity. Weight-space steering outperforms activation steering in generalization to unseen tasks.

divergent thinkingmode collapsecontrastive weight steeringlarge language modelsdata-free method

IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs

arXiv cs.AI · Samir Abdaljalil, Erchin Serpedin, Hasan Kurban · 2026-07-01

The paper introduces ISOSCI, a novel benchmark of isomorphic cross-domain science problem pairs designed to disentangle reasoning ability from domain knowledge retrieval in LLMs. The benchmark constructs problem pairs with identical logical structures but differing domain-specific knowledge requirements, enabling precise attribution of reasoning gains. Results across five model pairs show 91.3% of reasoning improvements are knowledge-dependent (63/69 gains), challenging assumptions about chain-of-thought reasoning's benefits for procedural problem-solving. Reasoning provides <5pp accuracy gains, and a reasoning-specialized model (o3-mini) shows divergent performance (-24.7pp on ISOSCI vs +19.2pp on GPQA Diamond), demonstrating benchmark-dependence of reasoning utility conclusions.

isomorphic problemsknowledge retrievalchain-of-thoughtcross-domain evaluationreasoning-mode gains

When Should Service Agents Reconsider? Difficulty-Routed Control in Customer-Service Operations

arXiv cs.AI · Qian Chen, Chengyuan Liu, Xin Yu · 2026-07-01

The paper introduces a difficulty-routed service-control architecture for autonomous customer-service agents, addressing operational errors in backend writes while maintaining efficiency in routine tasks. The method employs a lightweight router to direct routine sessions through a low-cost baseline path and escalate operationally coupled sessions to a conflict-aware workflow. This escalated path emphasizes write-triggered reconsideration and evidence gathering before consequential backend actions. Evaluated on human-verified retail and airline tasks from $τ^{2}$-bench, the architecture improves reliability on requests with operational conflicts, demonstrating targeted control without indiscriminate interaction expansion. Case-level evidence highlights the workflow's effectiveness in preserving fallback plans and sequencing writes.

service-control architecturewrite-triggered reconsiderationconflict-aware communicationoperational conflictbackend writes

Agent4cs: A Multi-agent System for Code Summarization in Large Hierarchical Codebases

arXiv cs.AI · Yongjian Tang, Ezgi Sarikayak, Doruk Tuncel, Jie M. Zhang · 2026-07-01

Agent4cs introduces a multi-agent system for hierarchical code summarization, addressing limitations of flat-text approaches in large codebases. The framework employs three specialized agents: summarization (generates summaries), keyword-extraction (identifies critical subfolder information), and quality-assurance (refines outputs). Evaluated against 7 frontier models, Agent4cs achieves 8% average improvement in semantic consistency across folder levels and 38% gains in normalized keyword coverage rate versus structured prompting baselines.

multi-agent systemcode summarizationhierarchical codebasessemantic consistencykeyword coverage

Risk Architecture for AI-Native Engineering Teams: An Organizational Framework for Agentic System Governance

arXiv cs.AI · Laxmipriya Ganesh Iyer · 2026-07-01

The paper proposes a risk architecture framework for AI-native engineering teams, addressing gaps in existing AI risk management approaches at the organizational level. It introduces (i) a seven-dimension team profile classification, (ii) a six-cluster failure-mode taxonomy including dependency-boundary determinism mismatch, and (iii) a synthetic methodology to evaluate framework adequacy against defined scenarios. Results show coverage degradation as teams transition from software-engineering to AI-native operation, with severe failures concentrated at organizational boundaries where probabilistic outputs meet deterministic dependencies.

agentic systemsrisk architecturefailure-mode taxonomydependency-boundaryprobabilistic outputs

MultAttnAttrib: Training-Free Multimodal Attribution in Long Document Question Answering

arXiv cs.AI · Dang Quang Thien Tran, Quang V. Dang, Vinamra Tyagi, Sai Soorya Rao Veeravalli · 2026-07-01

The paper introduces MultAttnAttrib, a training-free method for multimodal attribution in long document QA that leverages prefill passes, attention head selection, and calibrated thresholds to locate evidence. It also presents MultAttrEval, the first benchmark dataset with fine-grained multimodal attribution annotations. Experiments show MultAttnAttrib outperforms prompting-based approaches and frontier models like GPT-5.4, improving attribution accuracy by up to 22% while reducing inference latency to 1/7th of direct prompting on the same base model.

multimodal attributionattention headsprefill passtraining-freelong document qa

Adoption and Impact of Command-Line AI Coding Agents: A Study of Microsoft's Early 2026 Rollout of Claude Code and GitHub Copilot CLI

arXiv cs.AI · Emerson Murphy-Hill, Jenna Butler, Alexandra Savelieva · 2026-07-01

This study examines the adoption and impact of command-line AI coding agents, specifically Anthropic's Claude Code and GitHub Copilot CLI, during Microsoft's early 2026 rollout. Analyzing tens of thousands of engineers, the research identifies that initial adoption spread through social networks, retention correlated with coding activity rather than demographics, and adopters merged 24% more pull requests compared to non-adopters. Merged pull requests served as the proxy for output, with the observed productivity lift persisting over a four-month period. The findings highlight that CLI coding agents are not uniformly adopted or transient novelties, emphasizing the importance of visible peer use in organizational rollout strategies.

command-line aipull requestscoding activitysocial networksretention

GPUAlert: A Zero-Instrumentation Process-Boundary Monitor for Diagnosing GPU Training-Job Failures

arXiv cs.AI · Parv Agarwal, Asif Ekbal · 2026-07-01

GPUAlert introduces a zero-instrumentation process-boundary monitor for diagnosing GPU training-job failures, addressing the high failure rate (∼40%) in production clusters. The tool employs three reliability primitives: pre-launch log guarantee, notifier isolation, and non-silent artifact budget, enabling structured failure notifications without modifying training scripts. Evaluation on a corpus of 474 logs across 15 failure classes shows 0.997 macro-F1 for ordered-rule classification, with constant 3ms overhead and correct exit-code propagation under SMTP failures.

gpu trainingfailure diagnosisprocess-boundary monitoringreliability primitivesstructured logging

NeuroBridge: Bridging Multi-Task MRI Knowledge for Neurodegenerative Disease Diagnosis

arXiv cs.AI · Mengyu Li, Guoyao Shen, Chad W. Farris, Xin Zhang · 2026-07-01

NeuroBridge introduces a clinically guided multi-task MRI framework for neurodegenerative disease diagnosis, addressing the challenge of subtle and heterogeneous structural changes in Alzheimer's disease (AD) and mild cognitive impairment (MCI). The method integrates large-scale self-supervised MRI pretraining with hippocampal segmentation, atrophy classification, and reconstruction objectives, followed by gated fusion fine-tuning. Evaluated on ADNI and OASIS cohorts, NeuroBridge achieved 88.17% accuracy for AD versus cognitively normal controls in ADNI and 82.78% in OASIS, with significant gains in MCI-related and mixed-diagnosis settings. The framework demonstrated strong cross-cohort generalization and feasibility for probability-based opportunistic screening.

mrineurodegenerative diseasemulti-task learninghippocampal segmentationopportunistic screening

Rethinking Generic Object Tracking Toward Human-Level Perceptual Intelligence

arXiv cs.AI · Shih-Fang Chen · 2026-07-01

The dissertation proposes methods to enhance Generic Object Tracking (GOT) by narrowing the gap between machine and human visual perception. It addresses challenges in target discrimination, robust adaptation, and geometric reasoning by integrating prior knowledge, spatial geometry, and semantic context. The approach aims to improve tracking reliability under severe deformation, complex distractors, environmental changes, and unseen categories, leveraging human-like continuous adaptation and scene understanding.

generic object trackingvisual perceptiontarget discriminationgeometric reasoningonline adaptation

The Wiola Architecture for Efficient Small Language Models

arXiv cs.AI · Aryuemaan Kumar Chowdhury, Afreen Shaik, Yaparla Bhargavi, Brahma Kumar · 2026-07-01

The Wiola architecture introduces five novel components for efficient Small Language Models (SLMs): (i) Spiral Rotary Positional Encoding (SRPE) for 3D helical position embedding, (ii) Gated Cross-Layer Attention (GCLA) for inter-layer coherence, (iii) Adaptive Token Merging (ATM) to reduce attention complexity, (iv) Dual Stream Feed-Forward (DSFF) with parallel MLP streams, and (v) WiolaRMSNorm with per-dimension offsets. The architecture, released in four sizes (120M to 1.5B parameters), is fully compatible with HuggingFace Transformers and validated against GPT-2, LLaMA-2, and Mistral.

spiral rotary positional encodinggated cross-layer attentionadaptive token mergingdual stream feed-forwardwiolarmsnorm

How Should Transformers Encode Numeric Values in Electronic Health Records?

arXiv cs.AI · Maria Elkjær Montgomery, Christian Igel, Mikkel Odgaard, Martin Sillesen · 2026-07-01

The study systematically evaluates transformer-based encoding strategies for numeric values in electronic health records, comparing discrete, continuous, and hybrid approaches. Using synthetic arithmetic tasks embedded in real-world EHR data and clinical prediction tasks, the authors analyze trade-offs between numeric precision, optimization stability, and architectural flexibility. Results show that value-concept interaction models excel in precision-sensitive tasks when architecture permits, while hybrid token-based approaches with binning offer robust performance, with optimal bin count following a power-law in dataset size. Models demonstrate reliable approximate computation, with clinical benefits from lab values being task-dependent, favoring robustness over maximal precision.

transformerselectronic health recordsvalue encodinghybrid tokenizationnumeric precision

AI-enabled gravitational-waves searches for binary neutron stars at optimal sensitivity

arXiv cs.AI · Bhavya Gupta, Deep Chatterjee, William Benoit, Ethan Marx · 2026-07-01

The authors present Aframe, the first AI-enabled gravitational wave search achieving matched-filter sensitivity for binary neutron stars (BNS) at reduced computational cost. The method adapts their existing binary black hole detection architecture by preprocessing data via heterodyning to handle longer BNS signals, enabling real-time inference on a single GPU. Results show comparable sensitivity to traditional CPU-intensive matched-filter pipelines (requiring ~1000 cores) while enabling both online detection during LIGO-Virgo-KAGRA observations and efficient archival analysis through distributed GPU inference-as-a-service. The system successfully detected multiple compact binary mergers during the fourth observing run.

gravitational wavesbinary neutron starsheterodyningmatched-filterinference-as-a-service

Auto-FL-Research: Agentic Search for Federated Learning Algorithms

arXiv cs.AI · Holger R. Roth, Ziyue Xu, Chester Chen, Daguang Xu · 2026-07-01

Auto-FL-Research (AFR) introduces an agentic workflow for automated search of federated learning (FL) algorithms, addressing the challenge of exploring algorithmic choices like optimizer variants, server aggregation rules, and local training schedules. AFR employs constrained coding agents to propose and implement candidate training algorithms while maintaining fixed task profiles for mutation surfaces, compute budgets, and evaluation protocols. Evaluated on five FLamby healthcare tasks and six LEAF datasets, AFR demonstrates performance gains on four FLamby tasks and five LEAF profiles, though some improvements are attributed to fixed-surface tuning rather than FL-recipe changes. Results highlight the separation of agent-generated candidates into reproducible mechanisms, tuning effects, and single-run artifacts.

federated learningalgorithmic searchserver aggregationlocal trainingtask profiles

Multi-modal Rail Crossing Safety Analysis

arXiv cs.AI · Paimon Goulart, Chansong Lim, Nícolas Roque dos Santos, Yue Dong · 2026-07-01

The authors propose a multi-modal AI pipeline for railway crossing safety assessment, combining visual data with structured accident reports to align with Federal Railroad Administration (FRA) standards. Their method employs a routed fine-tuned compact Vision-Language Model (VLM) to process both modalities, addressing challenges in data preparation and learning paradigms. The system achieves a macro F1 score of 0.757 for HIGH-RISK/LOW-RISK classification, with FRA score estimation RMSE of 0.078 and correlation of 0.492, while demonstrating qualitative alignment with expert evaluations.

multi-modal learningsafety assessmentvision-language modelrailway crossingfederal railroad administration

TurnNat: Automatic Evaluation of Turn-Taking Naturalness in Dyadic Spoken Dialogue

arXiv cs.AI · Hao Zhang, Thomas Thebaud, Georgi Tinchev, Venkatesh Ravichandran · 2026-07-01

TurnNat introduces a likelihood-based framework for automatic evaluation of turn-taking naturalness in dyadic spoken dialogue. The method employs a causal turn-taking prediction model trained on natural conversations to estimate future voice-activity states, using negative log-likelihood (NLL) of observed future activity to measure timing atypicality. Frame-level NLLs are pooled over turn-taking boundary units (TBUs) and aggregated into dialogue-level scores. Experiments on a controlled perturbation benchmark show TurnNat effectively identifies unnatural turn-taking across heterogeneous timing failures, validated by human judgments.

turn-taking naturalnessnegative log-likelihoodturn-taking boundary unitsvoice-activity statesdyadic spoken dialogue

Controllable Sim Agents with Behavior Latents

arXiv cs.LG · Juanwu Lu, Junyu Zhu, Ziran Wang · 2026-07-02

The paper introduces Controllable Neural Variational Agents (CNeVA), a framework for generating realistic and steerable traffic agents. CNeVA infers Gaussian behavior latents from discounted returns via conjugate variational updates, conditions a rectified-flow trajectory generator with classifier-free guidance, and employs soft eligibility gates for gradient preservation in sparse-reward settings. Evaluated on Waymo Open Motion Dataset, CNeVA achieves competitive realism while enabling monotonic control over speed, acceleration, and safety metrics without reward hacking. The method demonstrates steerable map compliance under context-residual returns, emphasizing the need for physical-plausibility guardrails during metric evaluation.

controllable agentsbehavior latentsrectified-flowclassifier-free guidancesoft eligibility

Understanding the Robustness of Distributed Self-Supervised Learning Frameworks Against Non-IID Data

arXiv cs.LG · Xuanyu Chen, Nan Yang, Shuai Wang, Dong Yuan · 2026-07-02

The paper presents a theoretical analysis of distributed self-supervised learning (D-SSL) robustness under non-IID data, demonstrating that Masked Image Modeling (MIM) pre-training outperforms Contrastive Learning (CL) in handling data heterogeneity. It establishes that decentralized SSL robustness scales with network connectivity, implying federated learning (FL) matches or exceeds decentralized learning (DecL) robustness. The authors propose MAR loss, a MIM variant with local-to-global alignment regularization, validated through extensive experiments across architectures and distributed settings.

distributed self-supervised learningnon-iid datamasked image modelingcontrastive learningfederated learning

Optimal Stabilizer Testing and Learning with Limited Quantum Memory

arXiv cs.LG · Srinivasan Arunachalam, Louis Schatzki · 2026-07-02

The paper establishes fundamental limits on stabilizer state testing and learning under quantum memory constraints. By connecting stabilizer testing to the hidden shift problem and employing novel combinatorial techniques for likelihood ratio analysis, the authors prove: (1) stabilizer testing requires Θ(n-k) samples with k-qubit memory, losing the constant-sample advantage of unlimited-memory settings; (2) non-adaptive learning needs Θ(n²/k) samples. A key finding shows coherent memory enables the testing-learning separation, as even k=0.99n requires linear sample complexity for testing. Additional results include exponential lower bounds for purity testing with coherent memory.

stabilizer testingquantum memoryhidden shift problemsample complexitylikelihood ratios

Extreme Adaptive Transformer for Time Series Forecasting

arXiv cs.LG · Sanjeev Shrestha, Hui Liu, Yifan Zhang · 2026-07-02

The Extreme-Adaptive Transformer (Exformer) improves time series forecasting for imbalanced data with rare extreme events by introducing an extreme-adaptive attention mechanism. This mechanism combines Local, Stride, and Extreme components to capture short-term, periodic, and event-aware temporal dependencies, respectively. Evaluated on four hydrologic streamflow datasets, Exformer outperforms state-of-the-art baselines in 3-day forecasting, demonstrating enhanced capacity for modeling extreme events.

extreme-adaptive attentiontime series forecastinghydrologic streamflowtransformersparse attention

LIME: Learning Intent-aware Camera Motion from Egocentric Video

arXiv cs.LG · Boyang Sun, Jiajie Li, Yung-Hsu Yang, Chenyangguang Zhang · 2026-07-02

The paper introduces LIME, a method for language-conditioned camera motion generation that predicts relative target camera poses from RGB observations and natural-language intents. The approach mines multi-intention supervision from egocentric video, combining an auto-regressive observation-gain output with a continuous flow-matching pose head to jointly predict view revelations and multi-hypothesis target views. Experiments demonstrate LIME's ability to learn intent-aware active perception from passive human video, enabling downstream robotic tasks.

language-conditioned camera motionegocentric videoflow-matching pose headauto-regressive observation-gainintent-aware active perception

Q-GAIN: A Python Package for Machine Learning and Physically Informed Analysis Applications

arXiv cs.LG · M. Doris, S. Guo, S. M. Koh, L. Ritter · 2026-07-02

The Q-GAIN Python package introduces a modular framework for machine learning and physics-informed analysis in cold-atom experiments, specifically targeting Bose-Einstein condensates (BECs). It integrates classification, object detection, and physics-based feature extraction through a three-stage workflow: data preprocessing, ML-based feature identification, and conventional analysis. The package's utility is demonstrated via three tasks: MNIST digit classification, soliton detection in time-of-flight data (reimplementing SolDet), and vortex identification in ring-shaped BEC images. This unified approach bridges ML techniques with domain-specific physical analysis.

quantum gas analysisbose-einstein condensatesphysics-informed mlobject detectionsoliton detection

Object-centric LeJEPA

arXiv cs.LG · Jakob Geusen, Ender Konukoglu · 2026-07-02

The authors propose Object-centric LeJEPA, an extension of the LeJEPA self-supervised learning framework that operates on object-level representations rather than whole images. To avoid instability in joint scene partitioning and representation learning, they leverage off-the-shelf SAM proposals for object masks during training. The method incorporates a distributional anti-collapse objective adapted for variable-sized object sets and an additional instance-separating loss. Evaluated across model scales and dataset fractions (10-100% of COCO), Object-centric LeJEPA outperforms image-level LeJEPA on DAVIS (tracking), ImageNet-1k (classification), ADE20k (segmentation), and NAVI (re-identification).

object-centricself-supervisedrepresentation learningmask proposalsanti-collapse objective

WattGPU: Predicting Inference Power and Latency on Unseen GPUs and LLMs

arXiv cs.LG · Mauricio Fadel Argerich, Jonathan Fürst, Marta Patiño-Martínez · 2026-07-02

WattGPU introduces predictive models for GPU power draw and inter-token latency during LLM inference, requiring only public metadata without hardware profiling. The method employs leave-one-out cross-validation on 42 LLMs (0.1B–27B parameters) and 8 GPUs, generalizing to unseen hardware. Results show ≤3.4% median absolute percentage error for power prediction (offline) and ≤13.5% (server), with latency errors ≤8.5%, outperforming TDP and roofline baselines by 4× on unseen combinations.

llm inferencegpu power predictioninter-token latencycross-validationserver-grade gpus

DecompRL: Solving Harder Problems by Learning Modular Code Generation

arXiv cs.LG · Juliette Decugis, Fabian Gloeckle, Francis Bach, Taco Cohen · 2026-07-02

DecompRL introduces a reinforcement learning algorithm that enables large language models to solve complex problems by learning modular code generation. The method decomposes problems into smaller sub-functions whose implementations can be recombined, yielding up to $k^{n}$ candidate solutions and reducing GPU token cost by ~50×. Evaluated on LiveCodeBench and CodeContests with Qwen~2.5~7B and Code World Model~32B, DecompRL outperforms standard and diversity-optimized RL baselines beyond $10^5$ tokens per problem, solving previously intractable problems.

modular code generationreinforcement learninglarge language modelsproblem decompositiongpu efficiency

Bringing Agentic Search to Earth Observation Data Discovery

arXiv cs.LG · Minghan Yu, Youran Sun, Chugang Yi, Yixin Wen · 2026-07-02

We introduce an agentic search system for geoscience data discovery, leveraging large language models (LLMs) and knowledge graphs (KGs) to match natural-language queries with NASA datasets and tools. The method combines a neural scorer fine-tuned on NASA-EO-Bench, a benchmark of 47k query-dataset pairs, with BM25 score fusion and zero-shot LLM reranking. This approach improves Recall@10 and MRR by over 5x compared to cosine and BM25 baselines, with an additional 28% MRR lift from LLM reasoning. The system demonstrates the complementary value of supervised retrieval and LLM-based agentic search.

agentic searchknowledge graphneural scorerbm25zero-shot reranking

Transformer Geometry Observatory TGO-II: Representational Similarity Observatory

arXiv cs.LG · Kaustubh Kapil, Kishor P. Upla · 2026-07-02

The Transformer Geometry Observatory-II (TGO-II) introduces a framework for analyzing representation geometry evolution in Vision Transformers during supervised training. Using Centered Kernel Alignment (CKA), Singular Vector Canonical Correlation Analysis (SVCCA), Two-Nearest Neighbor Intrinsic Dimensionality (TwoNN-ID), and token covariance analysis on ViT-Small/16, the study reveals three findings: (1) decreasing CKA/SVCCA scores indicate layer specialization, (2) increasing intrinsic dimensionality suggests manifold expansion, and (3) persistent token interaction structure challenges the decoupling hypothesis. Results suggest complexity emerges through richer transformations while maintaining token interactions.

vision transformersrepresentation geometrycentered kernel alignmentintrinsic dimensionalitytoken covariance

HNSW with Accuracy Guarantees Using Graph Spanners -- A Technical Report

arXiv cs.LG · Minghao Li, Raghav Mittal, Sanjivni Rana, Suraj Shetiya · 2026-07-02

The paper introduces a 'Certify-then-Rectify' framework that combines the efficiency of Hierarchical Navigable Small World (HNSW) graphs with theoretical accuracy guarantees. The method first uses a statistical certifier to assess HNSW search quality, then switches to exact recovery when needed, leveraging HNSW as a geometric spanner with Extreme Value Theory for distance bounds. Benchmarks show the framework maintains HNSW's average-case speed while ensuring worst-case correctness, outperforming alternatives.

hierarchical navigable small worldgeometric spannerextreme value theoryexact retrievalstatistical certifier

On the Role of Directionality in Structural Generalization

arXiv cs.LG · Zichao Wei · 2026-07-02

The study demonstrates that directional representations in symbolic parsing significantly improve structural generalization, particularly for directional linguistic phenomena. The authors redesign a CCG-based parser with directed types (30K parameters) using a BERT-base encoder, achieving 75.9% LF exact match (+5.1pp over AM-Parser). Directionality yields asymmetric gains: +29.9pp on 5 position-shift categories but no improvement on 6 recursive-depth categories. Upgrading to DeBERTa-v3-large raises performance to 90.7%, with encoder gains complementing directional benefits. Results indicate directionality shifts bottlenecks from symbolic to neural layers.

structural generalizationdirectional representationsccg parsingsymbolic-neural interfaceencoder scaling

One More Time: Revisiting Neural Quantum States from a Reinforcement Learning Perspective

arXiv cs.LG · Juan Agustín Duque, Sergio García Heredia, Vinicius Hernandes, Eliška Greplová · 2026-07-02

Proximal Wavefunction Optimization (PWO) is introduced as a trust-region algorithm for training neural quantum states (NQS), addressing optimization challenges in autoregressive models for quantum many-body wavefunctions. PWO frames variational energy minimization as an advantage policy-gradient problem, clipping probability-ratio changes in the amplitude channel and phase increments in the phase channel while avoiding explicit matrix inversion. It improves stability and convergence over Adam, minSR, and SPRING across Ising and frustrated $J_1$-$J_2$ spin systems. The method scales effectively, demonstrated by fine-tuning a 1.5B-parameter RWKV-7 model, surpassing prior NQS optimization by three orders of magnitude.

neural quantum statesautoregressive modelstrust-region optimizationvariational energy minimizationpolicy-gradient

Optimizing Visual Generative Models via Distribution-wise Rewards

arXiv cs.LG · Ruihang Li, Mengde Xu, Shuyang Gu, Leigang Qu · 2026-07-02

The paper introduces a distribution-wise reward framework for finetuning visual generative models, addressing limitations of sample-wise rewards that cause reward hacking and mode collapse. The method employs a subset-replace strategy to efficiently compute distribution-wise rewards and uses RL to optimize model merging coefficients, mitigating train-inference inconsistency from stochastic differential equations. Experiments demonstrate significant FID-50K improvements: from 8.30 to 5.77 for SiT and from 3.74 to 3.52 for EDM2, with qualitative gains in perceptual quality and diversity.

distribution-wise rewardmode collapsesubset-replace strategystochastic differential equationfid-50k

Dendritic In-Context Learning in a Single-Layer Spiking Neural Network

arXiv cs.LG · Juwei Shen, Yujie Wu, Changwen Chen · 2026-07-02

The paper introduces DendriCL, a single-layer spiking neural network (SNN) that achieves in-context learning (ICL) through dendritic compartment dynamics, eliminating the need for architectural depth or inference-time synaptic plasticity. The method leverages subthreshold dendritic dynamics to implement an online learning algorithm structurally equivalent to leaky Widrow-Hoff LMS, enabling seed-stable performance on the Garg-2022 benchmark. Results show DendriCL outperforms dense Transformers in super-dimensional tasks (R^2 = 0.93 correlation with online-LMS trajectories), demonstrating that ICL can emerge from single-compartment dynamics without attention or multi-layer architectures.

in-context learningspiking neural networksdendritic compartmentonline-lmsgarg-2022 benchmark

Aggregation with Exponential Weights is Optimal in Expectation

arXiv cs.LG · Mikael Møller Høgsgaard, Patrick Rebeschini, Tobias Wegel · 2026-07-02

The paper resolves the open problem of whether aggregation with exponential weights (AEW) achieves minimax-rate optimality in expectation for model selection aggregation with squared loss. The authors analyze AEW under random design without requiring a Bernstein-type assumption, proving it attains an excess risk of $T \log(M)/(n+1)$ when the temperature $T$ satisfies $(L^2/T)\exp(B/T)\leq \mu/2$. For squared loss with $[0,b]$-valued predictions and labels, $T\geq 4b^2$ suffices. This confirms AEW's sharp phase transition at constant temperatures, as conjectured by Lecué and Mendelson.

aggregation with exponential weightsminimax-rate optimalityexcess riskmodel selectionstrong convexity

An Additive MLP-GNN Framework for Characterizing Chemical and Structural Contributions to Aqueous Solubility

arXiv cs.LG · Sampreeti Bhattacharya, Arkaprava Roy · 2026-07-02

The authors propose an additive MLP-GNN framework that explicitly separates physicochemical and molecular graph information for aqueous solubility prediction, enabling interpretable decomposition of chemical and structural contributions. The model combines a multilayer perceptron (chemical branch) and graph neural network (structural branch) with optional multiplicative interaction, pretrained on AqSolDB and fine-tuned on BigSolDB2. Interpretability analyses using linear projections, embedding summaries, and GNNExplainer masks reveal that the chemical branch aligns with physicochemical descriptors while the structural branch captures graph-topological patterns. The framework achieves competitive predictive performance while providing transparency into distinct information sources.

aqueous solubilitymultilayer perceptrongraph neural networkgnnexplainerpretraining

Prediction Sets for Counterfactual Decisions: Coverage, Optimality, and Conformal Prediction

arXiv cs.LG · Yurui Zheng, Ying Jin · 2026-07-02

The authors introduce Policy-Coupled Risk-Averse Conformal Prediction (PC-RACP), a decision-theoretic framework for uncertainty-informed counterfactual decisions. The method establishes policy-coupled coverage, ensuring coverage of realized outcomes under actions induced by prediction sets, and proves its minimax-optimality under distributional ambiguity. PC-RACP optimizes prediction sets via a two-stage procedure, demonstrating equivalence to universal-coverage formulations and direct risk-averse policy optimization. Empirical validation through simulations and an email-marketing experiment shows PC-RACP achieves higher utility while maintaining valid coverage, outperforming existing approaches that neglect counterfactual decision structures.

conformal predictioncounterfactual decisionspolicy-coupled coverageminimax-optimalityrisk-averse optimization

Self-explainable Operator Learning for Discovering Spatial Patterns in Functional Data

arXiv cs.LG · Mojgan Alishiri, Amirhossein Arzani · 2026-07-02

The authors propose a self-explainable operator learning framework that reformulates operator learning as a linear combination of generalized functional linear models via integral equations. By decomposing input domains into subdomains and computing localized integrals, the method provides interpretability by linking input regions to output patterns, revealing spatial feature contributions. Evaluated on fluid flow problems (blood flow, unsteady aerodynamics), the operator prioritizes regions with strong gradients, offering physically meaningful insights. The approach embeds explainability directly in the operator structure, outperforming post-hoc methods in transparency.

operator learningfunctional linear modelsintegral equationsspatial patternsinterpretability

Fourier Preconditioning for Neural Feature Learning

arXiv cs.LG · Preston Pitzer, Anish Pradhan, Harpreet S. Dhillon · 2026-07-02

The paper introduces Fourier preconditioning to improve neural feature learning via the H-Score objective, a proxy for mutual information that avoids noisy distribution estimates. The method employs unitary preconditioning with FFT to concentrate predictive dependence into dominant modes for approximately stationary processes, reducing finite-width truncation error. Experiments on eight datasets show up to 50% NMSE reduction in resource-constrained settings, with spectral entropy and cumulative energy metrics effectively predicting preconditioning benefits.

mutual informationh-scorefourier preconditioningspectral entropycross-covariance

An Optimisation Framework for the Well-Conditioned Training of Physics-Informed Neural Networks

arXiv cs.LG · Joseph Webb, Sadok Jerad, Coralia Cartis · 2026-07-02

We introduce DSGNAR, a second-order optimization framework for physics-informed neural networks (PINNs) that addresses ill-conditioned loss landscapes through doubly-sketched Gauss-Newton modeling and adaptive regularization-step length control. DSGNAR achieves state-of-the-art accuracy across nonlinear, chaotic, multi-scale, high-dimensional, and Navier-Stokes problems, with relative ℓ2 errors as low as 3×10^-16 in double precision and improvements of 5-8 orders of magnitude on Burgers' and high-dimensional Poisson equations. In single precision, DSGNAR solves Burgers' equation to ℓ2^rel = 4.75×10^-7 in under 10 seconds, demonstrating robustness to architecture, precision, and hyperparameters.

physics-informed neural networksill-conditioned lossgauss-newtonadaptive regularizationmulti-scale

Privacy-Preserving and Verifiable Approximate Distributed Coded Computing

arXiv cs.LG · Xavier Martínez-Luaña, Alba Gude-Santos, Manuel Fernández-Veiga, Rebeca P. Díaz-Redondo · 2026-07-02

The paper introduces a model-agnostic framework for adversary-resistant distributed learning that jointly addresses privacy preservation and malicious behavior in both federated and decentralized settings. The approach combines GPBACC (a privacy-enhancing coded computing technique) with robust aggregation strategies for federated learning and approximate decode-and-compare with group testing for decentralized learning. Empirical evaluation demonstrates significant reduction in privacy leakage and improved resilience against active adversaries through attack-driven analysis.

privacy-preservingcoded computingfederated learningdecentralized learningadversary-resistant

Bayesian Sparse Low-Rank Adaptation for Large Language Model Uncertainty Estimation

arXiv cs.LG · Jijie Zhang, Zhe Ren, Quan Zhang, Dandan Guo · 2026-07-02

We propose Data-Adaptive Lower-Rank Adaptation (DALorRA), a variational Bayesian sparse framework for uncertainty estimation in large language models (LLMs). DALorRA shifts uncertainty quantification from dense parameter space to the rank level of low-rank adaptation (LoRA) by imposing stochastic masking on rank dimensions, enabling Bayesian regularization during training and ensemble-like calibration during inference. Experiments demonstrate DALorRA's ability to calibrate LLMs effectively without compromising reasoning accuracy, addressing the overconfidence issue in task-specific fine-tuning.

low-rank adaptationbayesian regularizationuncertainty estimationstochastic maskinglarge language models

Probing Chemical Language Models: Effects of Pre-training and Fine-tuning

arXiv cs.LG · Anna Karnysheva, Dietrich Klakow, Ji-Ung Lee · 2026-07-02

This study systematically probes chemical language models (CLMs) to understand their encoding of chemically meaningful substructures, examining 78 substructures across eight pre-trained and six randomly initialized models. The research investigates how pre-training and fine-tuning on chemical tasks affect substructure representations. Results indicate pre-training enhances molecular structure awareness, particularly in upper layers, while randomly initialized models effectively encode ring structures in early layers. Fine-tuning preferentially modifies task-relevant substructures, aligning with chemical theory.

chemical language modelsmolecular substructurespre-trainingfine-tuningrepresentation learning

AbsoluteDegradation: A Physics-Inspired Synthetic Film-Degradation Pipeline and Archival Film Restoration Benchmark

arXiv cs.LG · Mikołaj Jastrzębski, Dawid Glinkowski, Dawid Zieliński, Daniel Borkowski · 2026-07-02

The authors present AbsoluteDegradation, a physics-inspired synthetic film-degradation pipeline and archival film restoration benchmark, addressing the lack of paired training data and standardized evaluation in film restoration. The pipeline models analog-to-digital degradation through structured artifact families, including signal-dependent grain, parametric scratches, and temporally coherent camera motion. A curated dataset of 81,576 high-resolution frames from real archival footage is introduced for consistent evaluation. Experiments demonstrate improved generalization to real-world footage for models trained with AbsoluteDegradation, while the benchmark reveals systematic failure modes of current methods.

film restorationsynthetic degradationtemporal coherencesignal-dependent grainbenchmark evaluation

Population-Scale Segmentation of Penile Tissue in DIXON MRI using Deep Learning for Quantitative Phenotyping in Male Reproductive Health

arXiv cs.LG · Jan Ernsting, Gunnar Paul Kordes, Nils Johannaber, Lynn Ogoniak · 2026-07-02

A deep learning framework for population-scale penile tissue segmentation in DIXON MRI was developed, enabling automated volumetric assessment for male reproductive health studies. The method utilized a 3D nnU-Net architecture trained on a curated dataset of 145 subjects (13,050 annotated slices) and validated on an independent benchmark of 24 subjects (2,160 double-annotated slices). The model achieved a 5-fold cross-validation Dice score of 0.90 and observer-level accuracy on the test set (Dice: 0.92; Hausdorff distance: 3.58). Deployed in 34,412 UK Biobank participants, it demonstrated high inter-session reproducibility (r = 0.87) in longitudinal evaluation of 2,282 men. Model weights will be publicly released.

3d nnunetdixon mripenile segmentationdice scorehausdorff distance

Predictive Conformal Slip Monitoring: An Empirical Evaluation of Rolling Split Conformal Prediction for Pre-Incident Traction Loss Detection

arXiv cs.LG · Varshith Roy Kotla · 2026-07-02

The paper presents a negative empirical evaluation of Rolling Split Conformal Prediction (RSCP) for pre-incident traction loss detection in racing telemetry. Using per-driver Random Forest models to monitor slip behavior volatility via non-conformity residuals, the method was tested on 55,563 samples from 19 drivers against 14 ground-truth incidents. Results showed zero precision and recall, with 15.3% false alarms, failing to outperform a static 95th-percentile baseline. Violations of exchangeability (Ljung-Box p < 0.001 for all drivers) likely caused the high false-alarm rate, highlighting methodological challenges for predictive applications.

rolling split conformal predictionnon-conformity residualstraction loss detectionexchangeability violationljung-box test

Ask the Right Comparison:Bias-Aware Bayesian Active Top-$k$ Ranking with LLM Judges

arXiv cs.LG · Jian Xu, Delu Zeng, John Paisley, Qibin Zhao · 2026-07-02

The paper introduces a bias-aware Bayesian active learning framework for top-k ranking using LLM judges, addressing systematic biases (e.g., verbosity, position effects) through explicit bias covariates and shrinkage priors. It proposes a top-k-aware acquisition rule that optimizes comparisons for membership uncertainty reduction. Evaluated on 16 LLMs (Llama, Qwen, GPT-4o, etc.), the method corrects biases in mid-tier judges (improving recall from 0.5–0.6 to 0.84–1.0) while frontier models show minimal bias, achieving accurate rankings with fewer comparisons than round-robin or D-optimal baselines.

bayesian active learningllm judgestop-k rankingbias correctionshrinkage prior

Structured Gaussian Processes for Uncertainty-Aware Classification of High-Dimensional, Small-Sampled Omics Data

arXiv cs.LG · Yue Zhang, Nandini Amit Gadhia, Georgios Karagiannis, Michalis Smyrnakis · 2026-07-02

The authors propose a structured Gaussian process classification framework for high-dimensional, small-sample omics data, integrating graph-encoded biological pathways into kernel construction to capture both feature abundance and topological context. The method combines network propagation with traditional features and evaluates imbalance-handling strategies (resampling, threshold calibration, confusion-matrix adjustments) on three microbiome datasets. Results show performance gains over unstructured baselines, matching established benchmarks while providing calibrated predictive uncertainty for ambiguous samples.

gaussian processomics datakernel constructionclass imbalancenetwork propagation

WBMM: Windowed Batch Matrix Multiplication for Efficient Large Receptive Field Convolution

arXiv cs.LG · Wan Song, Wei Zhou, Rui Wang, Jun Yu · 2026-07-02

The paper proposes Windowed Batch Matrix Multiplication (WBMM), a novel operator for efficient large receptive field convolution that addresses memory access inefficiencies in large kernel depthwise convolutions. WBMM partitions inputs into contiguous windows, constructs weight matrices via relative position bias indexing, and computes using batched matrix multiplication for regular memory access. Benchmarks show WBMM with 14x14 windows achieves 7.8x larger receptive fields than 5x5 depthwise convolutions while being faster, with 1.31-1.88x training speedups on ImageNet-1K, COCO, and ADE20K. The method maintains performance across GPU, CPU, and edge devices without specialized kernels.

windowed batch matrix multiplicationlarge receptive fielddepthwise convolutionrelative position biasmemory access efficiency

Fourier Neural Operators for Rayleigh-Bénard Convection

arXiv cs.LG · Chelsea Maria John, Thibaut Lunet, Sebastian Götschel, Andreas Herten · 2026-07-02

The authors propose an improved Fourier Neural Operator (FNO) for modeling 2D Rayleigh-Bénard convection by predicting time increments rather than full solutions, achieving higher accuracy than standard FNO. The method employs a compact architecture (314k parameters, 1.26 MB) with fast inference (7 ms), maintaining benchmark-comparable accuracy. Results demonstrate FNO's mesh generalization capability while revealing accuracy limitations tied to training data resolution.

fourier neural operatorrayleigh-bénard convectiontime incrementsmesh generalizationcompact architecture

HaloGuard 1.0: An Open Weights Constitutional Classifier for Multilingual AI Safety

arXiv cs.LG · Navaneeth Sangameswaran, Preetham S, Ashmiya Lenin · 2026-07-02

HaloGuard 1.0 introduces an open-weights constitutional classifier for multilingual AI safety, achieving state-of-the-art performance with models 10× smaller than competitors. The method employs a 46-policy natural-language constitution driving synthetic data generation, featuring paired counterfactuals, two-tier harmless design, and balanced multilingual coverage across 46 languages. The 0.8B-parameter variant achieves 90.9 average F1 (4.3 FPR, 9.5 FNR) across seven benchmarks, outperforming 27B baselines, while the 4B variant reaches 92.1 F1 with 3.5 FPR through precision-focused scaling.

constitutional classifiermultilingual safetysynthetic data generationfalse-positive rateadversarial red-teaming

Fast and Accurate Anomaly Detection in Time Series

arXiv cs.LG · Emanuele Mele, Massimo Cafaro, Angelo Coluccia, Italo Epicoco · 2026-07-02

The authors propose a novel unsupervised algorithm for time series anomaly detection using Haar discrete wavelet transforms and a custom t-test, addressing class imbalance and label scarcity issues in traditional methods. The method's theoretical foundations are established, and empirical evaluation across 343 datasets demonstrates superior performance over state-of-the-art unsupervised and self-supervised benchmarks. Results indicate reduced false positive rates compared to existing approaches while maintaining computational efficiency.

anomaly detectionhaar wavelett-testtime seriesunsupervised learning

Cross-Platform Control for Autonomous Surface Vehicles via Adaptive Reinforcement Learning

arXiv cs.LG · Ruiheng Jiang, Thomas Bi, Raffaello D'Andrea, Aswin Ramachandran · 2026-07-02

The paper presents an adaptive reinforcement learning method for cross-platform trajectory tracking in autonomous surface vehicles, using a single policy without platform-specific fine-tuning. The approach employs a teacher-student architecture with a learned latent dynamics representation, trained under randomized vessel dynamics in simulation. In real-world deployment on two platforms, the adaptive policy reduces position mean absolute error by up to 58% compared to non-adaptive baselines, nearing the performance of platform-specific controllers.

reinforcement learningtrajectory trackingcross-platform controlteacher-student architecturelatent dynamics

Born Discrete, Made Smooth: Variational Formulation of Shallow Neural Networks

arXiv cs.LG · Matej Benko, Pierre Bousquet, Iwona Chlebicka, Błażej Miasojedow · 2026-07-02

The paper introduces a variational formulation for shallow neural networks, replacing discrete training with a continuum surrogate in weighted Sobolev spaces. The method identifies λ-convex functionals over parameter densities, proving global well-posedness, stability, and almost C³ regularity. Results show the optimal density is computable via a linear system, with generalization error bounded at 1/α and finite-width networks converging at O(1/N). This bridges Neural Tangent Kernel and feature-learning regimes, offering a variational perspective on over-parameterization.

variational formulationshallow neural networkssobolev spacesλ-convex functionalsneural tangent kernel

Scalable and Distributed Silhouette Approximation

arXiv cs.LG · Ilie Sarpe, Federico Altieri, Andrea Pietracaprina, Geppino Pucci · 2026-07-02

The authors present the first rigorous sampling-based algorithms for approximating both local (per-element) and global silhouette scores in metric $k$-clustering, with provable guarantees. Their methods require only $O(nk\varepsilon^{-2}\ln(nk/\delta))$ distance computations to achieve additive error $\varepsilon$ with probability $1-\delta$, significantly improving upon the $\Theta(n^2)$ complexity of exact computation. They also introduce distributed implementations for MapReduce and MPC frameworks using constant rounds and sublinear memory. Experimental results demonstrate superior accuracy-efficiency trade-offs compared to existing heuristics, enabling scalable silhouette computation on massive datasets.

silhouette approximationmetric clusteringsampling algorithmsdistributed computingprovable guarantees

Liquid Latent State Dynamics for Interpretable Turbofan Degradation Modeling

arXiv cs.LG · Weizhi Nie, Weijie Wang, Yuting Su · 2026-07-02

The paper introduces a liquid neural network architecture for interpretable turbofan degradation modeling, factorizing latent states into degradation and operating-condition components. The model employs a liquid transition mechanism with specialized losses (remaining useful life, monotonic risk, latent-consistency) to disentangle health evolution from operational variation. On the C-MAPSS benchmark, it achieves 0.2266 RMSE for sensor forecasting (7% improvement over GRU baselines) and 0.5960 Spearman correlation for degradation-state temporal coherence, though direct RUL regression remains stronger with traditional approaches.

liquid neural networkslatent state factorizationturbofan degradationc-mapss benchmarkremaining useful life

Probabilistic Low-Voltage Peak Load Forecasting with Time Series Foundation Models Evaluated on Application-Oriented Metrics

arXiv cs.LG · Benedikt Kaas, Manuel Treutlein, Hannes Benedikt Gerber, Oliver Neumann · 2026-07-02

The study evaluates time series foundation models for probabilistic low-voltage peak load forecasting, addressing key limitations of current methods regarding manual effort, uncertainty estimation, and peak prediction. It compares Chronos-Bolt, Chronos-2, and TabPFN-TS against six baselines on 200 real-world low-voltage feeders, demonstrating Chronos-2's superior performance. An ablation study reveals these models adapt to increased uncertainty even without weather covariates. A novel application-oriented metric quantifies the trade-off between cost reduction and grid failure risk in peak prediction.

low-voltage load forecastingtime series foundation modelsprobabilistic forecastingpeak predictiongrid asset planning

Towards a Phonology-Informed Evaluation of Multilingual TTS

arXiv cs.LG · Sneha Ray Barman, Neeraj Kumar Sharma, Shakuntala Mahanta · 2026-07-02

We propose a classifier-based framework for evaluating multilingual text-to-speech (TTS) systems' preservation of language-specific phonological patterns, addressing limitations of standard metrics like MOS. The method audits TTS output against human speech benchmarks using transferable classifiers trained on acoustic-phonetic features. Testing Meta's MMS TTS on Assamese advanced tongue root (ATR) vowel harmony reveals systematic biases: [+ATR] mid vowels are realized as [-ATR] in 1/3 tokens, a deviation absent in human speech. Word-level analysis shows predicted ATR labels classify harmony more accurately than transcription labels, highlighting discrepancies between intended and produced phonology. The framework generalizes to other phonological contrasts with measurable acoustic cues.

text-to-speechphonological patternsacoustic-phonetic featuresvowel harmonytransferable classifiers

Autorelevance function and other feature relevance measures for univariate time series

arXiv cs.LG · Julian Cardenas, Jamie Arjona, Pedro Delicado · 2026-07-02

The authors propose a model-agnostic methodology for measuring lag relevance in univariate time series forecasting, introducing autorelevance and partial autorelevance functions based on ghost variables, Shapley values, and additive importance measures. They present a novel method for handling absent features in coalition-based approaches by substituting one-step forecasts from the same model. Evaluation across simulations and real-world datasets using seasonal ARMA models and recurrent neural networks demonstrates that the proposed relevance measures accurately capture expected lag structures in most cases.

shapley valuesghost variablesunivariate time serieslag relevanceadditive importance

A More Accurate Algorithm Comparison through A/B Testing using Offline Evaluation Methods

arXiv cs.LG · Koki Konishi, Masataka Ushiku, Yuta Saito · 2026-07-02

The paper demonstrates that conventional A/B testing can yield higher algorithm selection error rates than offline evaluation due to insufficient positive correlation in sample mean estimators. The authors propose a novel estimator that introduces a hypothetical middle algorithm (M) and performs stepwise performance comparisons (A-M-B) using shared data at each step, inducing beneficial correlation akin to offline methods. Theoretical analysis derives the optimal middle algorithm configuration, and experiments on real-world data show the method achieves equivalent selection error rates using 50% less A/B testing data.

a/b testingoffline evaluationalgorithm selectionpositive correlationbias-variance analysis

Hybrid quantum-classical neural network for sentiment analysis

arXiv cs.LG · Giacomo Cappiello, Filippo Caruso, Xing Liang, Dimitrios Makris · 2026-07-02

This work demonstrates the feasibility of hybrid quantum-classical neural networks for sentiment analysis, achieving comparable accuracy to classical baselines while exhibiting distinct learning dynamics. The authors employ TF-IDF vectorized COVID-19 tweet data, comparing classical feedforward networks with hybrid architectures incorporating parameterized quantum circuits. Results show enhanced generalization, with hybrid models outperforming classical counterparts by 15 percentage points (66% to 81%) in SMS spam classification transfer learning, suggesting richer representational capacity.

hybrid quantum-classical neural networkssentiment analysistf-idf vectorizationparameterized quantum circuitstransfer learning

Zeus: Towards Tuning-Free Foundation Model for Time Series Analysis

arXiv cs.LG · Yisong Fu, Zezhi Shao, Chengqing Yu, Yujie Li · 2026-07-02

The authors propose Zeus, a tuning-free Time Series Foundation Model (TSFM) that achieves strong multi-task generalization without task-specific fine-tuning. The model addresses two key challenges: (1) multi-scale representation via a Transformer with point-wise tokenization and U-shaped hierarchy for granularity-scalability trade-offs, and (2) Multi-Objective Temporal Masking (MOTM) to unify heterogeneous tasks like extrapolation and interpolation. Experiments across five tasks show Zeus achieves competitive performance in tuning-free settings, demonstrating its potential as a general-purpose TSFM.

time series foundation modelmulti-scale transformermulti-objective temporal maskingtuning-free learningu-shaped hierarchy

Rethinking Post-Hoc Calibration in Semantic Segmentation

arXiv cs.LG · Tristan Kirscher, Kim-Celine Kahl, Balint Kovacs, Maximilian R. Rokuss · 2026-07-02

The paper addresses two structural issues in post-hoc calibration for semantic segmentation: translation invariance and decision preservation. It introduces translation-invariant (TI) calibrators to eliminate arbitrary logit offset dependence and proposes class-conditional affine calibrators to maintain argmax- or order-preservation while improving calibration. Experiments on natural-image and medical segmentation benchmarks under covariate shift show TI variants enhance calibration metrics, while decision-preserving variants prevent segmentation degradation without sacrificing calibration performance.

semantic segmentationpost-hoc calibrationtranslation invariancedecision preservationcovariate shift

Regularized Variational and Spectral Log-Density-Ratio Estimation in the Gaussian Location Model

arXiv cs.LG · Francis Bach · 2026-07-02

The paper analyzes ridge-regularized log-density-ratio estimation in a Gaussian location model with common covariance, comparing variational and spectral estimators. The variational approach minimizes empirical KL divergence with L2-penalty, while the spectral method solves a continuum of ridge-regularized least-squares problems. High-dimensional deterministic asymptotics are derived via convex-Gaussian-min-max theorem (CGMT) and resolvent analysis of Gaussian sample covariances. Results show variational estimator outperforms in well-specified high-observation regimes, while spectral excels in low-observation scenarios due to lower variance. Nuclear penalty for feature learning is also explored.

log-density-ratio estimationgaussian location modelconvex-gaussian-min-max theoremridge regularizationdeterministic equivalents

Learning the Supports for Categorical Critic in Reinforcement Learning

arXiv cs.LG · Jen-Yen Chang, Takayuki Osa, Tatsuya Harada · 2026-07-02

The paper proposes a method to dynamically learn support bounds for categorical value functions in reinforcement learning, eliminating the need for predefined intervals. By jointly optimizing support bounds and categorical representations, the approach establishes an upper bound on the mean-squared Bellman error, theoretically tighter than fixed-support methods like HL-Gauss. Empirical results show comparable performance to HL-Gauss on continuous-control tasks, with improvements on specific subsets, while maintaining stability in support adaptation.

distributional rlcategorical criticsupport learningbellman erroractor-critic

Adaptive Group-Based Counterfactual Explanations for Time-Series Rehabilitation Data

arXiv cs.LG · Emmanuel C. Chukwu, Rianne M. Schouten, Monique Tabak, Mykola Pechenizkiy · 2026-07-02

The paper introduces a group-based counterfactual explanation framework for time-series rehabilitation data, addressing the interpretability gap in existing channel-level methods. The proposed two-stage approach combines Shapley-Adaptive group ranking with Learnable Gate methods, which optimize per-group relevance gates alongside perturbation masks. Evaluated on the KneE-PAD dataset, the method improves modality-group sparsity by 37% over M-CELS while maintaining validity (92% fidelity) and temporal smoothness, yielding biomechanically coherent explanations aligned with clinical reasoning.

counterfactual explanationstime-series classificationrehabilitation analysisgroup sparsityinterpretable machine learning

Lynx: Progressive Speculative Quantization for accelerating KV Transfer in Long-Context Inference

arXiv cs.LG · Wenchen Han, Gingfung Matthew Yeung, Marco Barletta, William Toner · 2026-07-02

Lynx introduces progressive speculative quantization to accelerate KV cache transfer in long-context LLM inference by partitioning the KV cache into high-priority Anchor (most significant bits) and low-priority Residual streams. This enables speculative decoding upon Anchor receipt while Residual transfer occurs concurrently, with verification ensuring equivalence to full-precision decoding. Evaluations show Lynx achieves 1.43× faster Time-to-First-Token than 8-bit quantization while matching BF16 accuracy, outperforming state-of-the-art by up to 5.1% in accuracy.

kv-cachespeculative decodingprogressive quantizationlong-context inferencetime-to-first-token

Many Voices, One Reward: Multi-Role Rubric Generation for LLM Judging and Reward Modeling

arXiv cs.LG · Dazhi Fu, Jiuding Yang, Yiwen Guo, Jicong Fan · 2026-07-02

The paper introduces Multi-Role Rubric Generation (MRRG), a training-free framework for generating evaluation rubrics from multiple complementary roles to address dimensional blind spots in single-role rubric generation. MRRG consolidates criteria into an auditable rubric-based scorer usable for preference validation and Reinforcement Learning with Verifiable Rewards (RLVR). Experiments on preference validation benchmarks show MRRG outperforms single-role baselines across multiple backbone models, and RLVR experiments demonstrate its effectiveness in improving open-ended generation.

rubric generationpreference validationreinforcement learningevaluation criteriaopen-ended generation

Gaming Consensus: Coordinated Manipulation in Crowdsourced Fact-Checking

arXiv cs.LG · Nikil Roashan Selvam, Jay Baxter, Sophie Hilgard, Brad Miller · 2026-07-02

The paper analyzes vulnerability to coordinated manipulation in matrix factorization-based crowdsourced fact-checking systems (e.g., X's Community Notes). Through theoretical analysis and empirical evaluation using production data, the authors demonstrate that strategic voting exploiting latent representations can artificially inflate consensus scores, with 10.7% of low-quality notes manipulated above thresholds using <10 ratings. Counterintuitively, 'Not Helpful' votes may increase helpfulness scores due to algorithmic dynamics. The work includes a cost model for manipulation and reports deployed mitigations in X's system.

matrix factorizationcrowdsourced fact-checkinglatent representationsconsensus manipulationcoordinated voting

Koopman operator theory: fundamentals, control, and applications

arXiv cs.LG · Igor Mezić, Jorge Cortés, Karl Worthmann, Mircea Lazar · 2026-07-02

This tutorial paper introduces Koopman operator theory for linear representation of nonlinear dynamical systems, emphasizing data-driven surrogate models and control applications. The method employs extended dynamic mode decomposition (EDMD), kernelized variants, and machine-learning techniques to construct finite-dimensional approximations with error bounds. Results include simulation studies with provided source code, demonstrating EDMD and Koopman model predictive control (MPC) for systems with inputs.

koopman operatorextended dynamic mode decompositiondata-driven surrogate modelsnonlinear dynamical systemsmodel predictive control

Do LLMs Truly Generalize in the Molecular Domain? A Perturbation-Based Analysis

arXiv cs.LG · Jiatong Li, Weida Wang, Changmeng Zheng, Shufei Zhang · 2026-07-02

The study introduces a Molecular Perturbation framework to assess the generalization capabilities of Large Language Models (LLMs) in molecular discovery by generating syntax-valid structural variants under controlled Graph Edit Distance (GED). Results reveal that even single edits cause significant performance drops, indicating narrow local trust regions and fragility to structural changes. In-Context Tuning (ICT) partially mitigates this fragility by anchoring predictions on structurally similar molecules, suggesting a promising direction for stabilizing molecular LLMs against structural variation.

large language modelsmolecular perturbationgraph edit distancein-context tuningstructural variation

PARTREP: Learning What to Repeat for Decoder-only LLMs

arXiv cs.LG · Andikawati P Widjaja, Yongjun Kim, Hyounghun Kim, Jaeho Lee · 2026-07-02

PARTREP introduces selective prompt repetition for decoder-only LLMs, addressing asymmetric information flow in causal attention by appending only high-information tokens rather than full prompts. The method uses token-wise negative log-likelihood (NLL) as a selection signal, predicted via a lightweight gate from early-layer hidden states to enable efficient mid-prefill token selection. Evaluated across eight benchmarks (MMLU, GSM8K, RULER) and three model families (Qwen2.5, Llama3.2, Gemma4), PARTREP achieves 59.4% KV cache and 79.0% prefill FLOPs efficiency while retaining most performance gains of full repetition.

decoder-only llmscausal attentionkv cachenegative log-likelihoodprefill flops

EHHN: An Event-driven Heterogeneous Hypergraph Network for Object-Centric Next Activity Prediction

arXiv cs.LG · Jiaxing Wang, Kaitao Chen, Zhubin Han, Chenyu Hou · 2026-07-02

EHHN, an Event-driven Heterogeneous Hypergraph Network, advances object-centric next activity prediction by jointly modeling event-driven object state changes, inter-event timing, and global execution patterns. The method represents prediction prefixes as heterogeneous hypergraphs with event-object hyperedges and lifecycle hyperedges, processed via a dual-stream architecture: a micro-spatial stream captures object-state evolution, while a macro-evolution stream models temporal dynamics using global prototypes. Evaluated on four OCEL benchmarks against nine baselines, EHHN achieves superior accuracy and macro F1-score, improving by up to 8.1 and 12.4 percentage points, respectively, while reducing peak GPU memory usage by up to 24x compared to the strongest OCEL-native graph baseline.

hypergraph networkobject-centricevent-drivendual-stream architecturemacro-evolution stream

Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding

arXiv cs.LG · Marianne Arriola, Volodymyr Kuleshov · 2026-07-02

Set diffusion introduces a novel class of language models that interpolates between autoregressive and diffusion approaches by factorizing likelihood over flexible-position, flexible-length token sets. The method employs a set-causal diffusion architecture supporting KV cache updates after each inference step, enabling arbitrarily-ordered token decoding, including sliding-window sets. This approach achieves superior speed-quality tradeoffs on mathematical reasoning, summarization, and unconditional generation tasks compared to prior diffusion language models, while outperforming block diffusion in infilling performance. The authors provide code, model weights, and a blog post on the project page.

set diffusionkv cachingtoken setsinfillingautoregressive

Denser $\neq$ Better: Limits of On-Policy Self-Distillation for Continual Post-Training

arXiv cs.LG · Meng Wang, Haohan Zhao, Wenzhuo Liu, Lu Yang · 2026-07-02

The paper challenges the effectiveness of on-policy self-distillation for continual post-training of foundation models, demonstrating its limitations through self-distillation policy optimization (SDPO). While SDPO accelerates in-domain specialization with stable teacher signals, it exhibits stronger forgetting and potential collapse in continual learning scenarios compared to methods like GRPO. Analyses reveal that dense self-distillation induces parameter/response space drift and amplifies formatting artifacts through self-reinforcing loops, suggesting it is unsuitable as a default stabilizer for continual post-training.

self-distillationcontinual learningpolicy optimizationparameter driftfoundation models

Role-Aware Neural Convex Divergence Heads for Asymmetric Representation Learning

arXiv cs.LG · He Huang, Lu Shen, Yunfeng Huang, Li Qi · 2026-07-02

The paper proposes a role-aware neural convex divergence head for asymmetric representation learning, addressing limitations of symmetric distance metrics and unstructured asymmetric scorers. The method applies role-specific projections followed by an input-convex neural Bregman divergence, providing nonnegative structured scores with proven geometric properties. Experiments on lexical entailment, sentence entailment, ontology hierarchy, and citation tasks show consistent accuracy improvements over plain ICNN-Bregman heads (across 10 random seeds) while maintaining zero negative divergence rates, though symmetric baselines outperform on citation prediction tasks.

asymmetric representation learningbregman divergenceinput-convex neural networksrole-aware projectionsdirectional relations

Efficient Temporal Point Processes via Monotone Alternating Splines

arXiv cs.LG · Cheng Wan, Quyu Kong, Feng Zhou · 2026-07-02

The paper introduces Monotone Alternating Splines (MAS), a novel framework for modeling cumulative conditional intensity functions (CCIFs) in temporal point processes (TPPs), addressing structural limitations of Monotone Neural Networks (MNNs). MAS combines distinct interpolation and extrapolation components to enhance flexibility and efficiency, theoretically improving fitting accuracy and generalization while reducing approximation gaps. Experiments demonstrate MAS's superior performance on synthetic and real-world datasets compared to existing methods.

temporal point processescumulative conditional intensity functionmonotone neural networksinterpolationextrapolation

Finite-Lag Operator Geometry of Recurrent Representations

arXiv cs.LG · Kanishka Reddy · 2026-07-02

The paper introduces finite-lag operator geometry for analyzing recurrent hidden states through observed source-successor pairs $(X_t,X_{t+Δ})$. The method constructs a conditional transport law $Q_Δ(dy\mid x)$ via a dense Gaussian source-smoothing operator, decomposing it into a transport tensor $G_Δ$ (conditional spread and coherent displacement) and antisymmetric coordinate circulation $W_Δ^ρ$. Theoretical results include affine covariance, estimator stability, and finite-lag separation for deterministic recurrent motion. Linear-Gaussian closed forms and controlled experiments validate the framework, revealing architecture-dependent transport differences in repeat-copy networks.

recurrent representationsconditional transport lawtransport tensorcoordinate circulationfinite-lag geometry

Quantum-Inspired Vision: Leveraging Wave-Particle Duality for Low-Illumination Enhancement

arXiv cs.LG · Yiquan Gao · 2026-07-02

The study formalizes a physics-to-AI paradigm for image enhancement by modeling images as probabilistic wave functions, extending the Data Relativistic Uncertainty (DRU) framework. It integrates wave-particle duality to elucidate how DRU leverages intrinsic physical uncertainty of light, addressing illumination bias and noise robustness. The approach provides an Explainable AI (XAI) method to enhance interpretability of DRU's mechanisms in low-illumination scenarios.

data relativistic uncertaintywave-particle dualityprobabilistic wave functionsexplainable aiillumination enhancement

Frequency Shift Physics-Informed Extreme Learning Machine for Solving High-Frequency Partial Differential Equations

arXiv cs.LG · Xiong Xiong, Ruonan Zhai, Zheng Zeng, Sheng Zhou · 2026-07-02

The paper proposes Frequency Shift Physics-Informed Extreme Learning Machine (FS-PIELM), a novel framework addressing spectral bias in solving high-frequency PDEs through additive weight initialization. Instead of scaling random weights, the method shifts the Gaussian distribution's mean while maintaining unit variance, preventing variance amplification. Two variants are introduced: FS-PIELM-L for individual neuron frequencies and FS-PIELM-G for grouped neurons. Theoretical analysis shows bounded frequency variance approaching unity, contrasting with quadratic growth in conventional methods. Experiments on seven benchmark problems across six PDE types demonstrate up to five orders of magnitude accuracy improvement over existing PIELM variants.

spectral biasextreme learning machinepartial differential equationsweight initializationfrequency shift

A Mathematical Introduction to Diffusion Models

arXiv cs.LG · Jianfeng Lu · 2026-07-02

The article provides a mathematically rigorous introduction to diffusion models, focusing on their foundations in sampling theory. It systematically connects classical sampling dynamics to contemporary diffusion samplers, covering error analysis and inference-time control. The presentation is structured into core definitions with complete proofs, simplified estimates, and advanced theorems with proof sketches, targeting graduate students familiar with probability but new to stochastic differential equations and diffusion models.

diffusion modelsstochastic differential equationssampling theoryerror analysisinference-time control

WARP: Weight-Space Analysis for Recovering Training Data Portfolios

arXiv cs.LG · Tzu-Heng Huang, Aditya Goyal, John Cooper, Frederic Sala · 2026-07-02

WARP introduces a framework for recovering domain mixture weights from fine-tuned model weights, addressing the access asymmetry in training data disclosure. By interpolating between base and fine-tuned models via model merging, WARP generates pseudo-checkpoints that approximate the training trajectory and expose geometric footprints of training data in weight space. These footprints are mapped to domain proportions using either a softmax readout or an MLP projector trained on synthetic mixtures. Evaluated on BERT and GPT-2, WARP achieves domain mixture recovery with MAEs of 0.046 and 0.104 respectively, surpassing membership inference baselines.

domain mixture weightsmodel mergingpseudo-checkpointsgeometric footprintssoftmax readout

SCAPE: Accurate and Efficient LLM Training with Extreme Sparse Communication

arXiv cs.LG · Mingkai Zheng, Junlin Chen, Haotian Xie, Zhao Zhang · 2026-07-02

SCAPE introduces a communication-efficient distributed optimizer for Large Language Model (LLM) training, addressing the dominance of communication costs in data-parallel and sharded schemes. By leveraging the stability of AdamS's first-moment, SCAPE enables aggressive gradient sparsification without compromising model quality. It derives masks from first-moment statistics, partitions mask generation across workers, and delays mask usage to overlap synchronization with computation. SCAPE reconstructs second-moment updates from a single sparse buffer, eliminating additional collectives. Evaluated on GPT-345M and Llama-500M using 32 NVIDIA GH200 GPUs, SCAPE maintains training stability and accuracy under 90% and 99% sparsity, reducing pre-training time by up to 43.3% for Llama-500M and achieving a 3.26× speedup per step for Llama-1.8B.

sparsificationadam optimizergradient synchronizationdistributed trainingllm pre-training

UniWind: Toward Unified Day-Ahead Wind Power Forecasting via Physics-Informed State Routing

arXiv cs.LG · Ronghui Xu, Tongxin Wu, Guozhen Zhang, Yihan Li · 2026-07-02

UniWind introduces a unified physics-informed state routing framework for day-ahead wind power forecasting, addressing the limitations of existing physical and data-driven models. The method combines a Physical Prior Estimator, which constructs site-calibrated physical priors using monotonic warping and a shared physical power curve, with a Latent State Encoder and State-aware Power Corrector to model operational states and refine forecasts. Experiments on over 20 real-world datasets demonstrate UniWind's accuracy and robustness in both full-shot and cross-farm zero-shot scenarios.

physics-informed state routingphysical prior estimatorlatent state encoderstate-aware power correctorwind power forecasting

Revisiting Decentralized Online Convex Optimization with Compressed Communication

arXiv cs.LG · Hao Zhou, Xiaoyu Wang, Chang Yao, Mingli Song · 2026-07-02

The paper introduces two novel follow-the-regularized-leader (FTRL) algorithms for decentralized online convex optimization (D-OCO) with compressed communication, addressing a gap where prior work focused on online gradient descent (OGD) variants. The first algorithm operates in the full-information setting, matching existing regret bounds, while the second improves both regret bounds and communication costs in the bandit setting. Key innovation lies in leveraging FTRL's dual update mechanism to integrate average consensus with communication compression, yielding more elegant algorithmic design and theoretical analysis compared to OGD-type approaches.

decentralized online convex optimizationfollow-the-regularized-leadercommunication compressionaverage consensusbandit setting

Message Passing Based Two-Timescale Bayesian Learning for Joint Channel and Memory Hardware Impairments Tracking

arXiv cs.LG · Wei Xu, An Liu · 2026-07-02

The paper proposes a message-passing-based two-timescale Bayesian deep learning (MP-TTBDL) framework for joint tracking of fast-varying sparse channels and slow-varying hardware impairments in massive MIMO systems. The method employs a residual recurrent gated unit (RGRU) to model hardware memory effects, formulates a multi-slot factor graph with distinct Markov priors, and decomposes inference into channel tracking via Turbo-OAMP and impairment calibration via deep approximate message passing (DAMP) with expectation propagation. Simulations demonstrate superior channel estimation accuracy compared to conventional compensators across varying SNR and impairment scenarios.

massive mimohardware impairmentsmessage passingbayesian learningchannel estimation

CALM: Interpretable Cross-Modal Alignment for Biomarker Discovery from Unpaired Data

arXiv cs.LG · Jueqi Wang, Zachary Jacokes, John Darrell Van Horn, Kevin A. Pelphrey · 2026-07-02

CALM introduces a cross-modal alignment framework for biomarker discovery from unpaired neuroimaging and genetics data. The method learns interpretable associations between brain regions of interest (ROIs) and genetic pathways via linear projections that align class-conditional latent distributions while ensuring group separability. Evaluated on autism spectrum disorder data, CALM outperforms state-of-the-art methods in generalization to unseen paired datasets, with discovered immune/metabolic pathway-ROI associations consistent with literature. The approach demonstrates stable performance compared to paired-data baselines.

cross-modal alignmentbiomarker discoverylatent space projectionneuroimaging-geneticsinterpretable associations

DeadPool: Resilient LLM Training with Hot-Swapping via Zero-Overhead Checkpoint

arXiv cs.LG · Haotian Xie, Junlin Chen, Mingkai Zheng, Lishan Yang · 2026-07-02

DeadPool introduces a fault-tolerant mechanism for large language model (LLM) training that enables hot-swapping of failed compute nodes with zero overhead during error-free execution. The method combines off-critical-path in-memory checkpointing for spatial redundancy and a communicator reconstruction protocol to replace failed nodes at runtime. Evaluation on up to 512 NVIDIA A100 GPUs and LLMs with up to 65B parameters demonstrates zero checkpoint overhead and hot-swapping recovery in under 40 seconds, achieving both zero-overhead execution and low recovery latency.

hot-swappingin-memory checkpointingfault-tolerancelarge language modelscommunicator reconstruction

SINA: A Fully Automated Circuit Schematic Image to Netlist Generator Using Artificial Intelligence

arXiv cs.LG · Saoud Aldowaish, Yashwanth Karumanchi, Kai-Chen Chiang, Mohammed Ayman Habib · 2026-07-02

SINA introduces a fully automated pipeline for converting circuit schematic images to machine-readable netlists, addressing limitations in analog/mixed-signal EDA. The method integrates deep learning for component detection, connected-component labeling for connectivity inference, OCR for designator extraction, and a VLM for designator assignment, with dedicated crossing-wire detection. Evaluated via graph isomorphism, SINA achieves 96.67% netlist generation accuracy (2.72× improvement over SOTA), handling both IC- and PCB-level schematics.

electronic design automationnetlist generationvision-language modeloptical character recognitiongraph isomorphism

BOUNDARY_SYNC: Measuring Communication-Induced Representational Coupling in Multi-Agent LLM Systems

arXiv cs.LG · Zewen Liu · 2026-07-02

The paper introduces BOUNDARY_SYNC, a protocol quantifying representational coupling in multi-agent LLM systems via the Coupling Amplification Factor (CAF). Using controlled GPT-4o experiments (N=30, ~9,900 API calls), the study finds text communication induces homogenization (CAF=0.803), while image communication diversifies outputs (CAF>1.0). Cross-model replication reveals extreme variation (CAF 0.034-0.803), with DeepSeek exhibiting format artifacts. Coupling is stateless, driven by prompt context rather than cumulative updates. Results demonstrate measurable and controllable LLM agent coupling at the prompt level.

representational couplingcoupling amplification factormulti-agent systemsprompt contexthomogenization

Geometric Signatures of Reasoning: A Spectral Perspective on Task Hardness

arXiv cs.LG · Aria Masoomi, Mahsa Bazzaz, Adel Javanmard, Vahab Mirrokni · 2026-07-02

The paper introduces geometric signatures to analyze chain-of-thought (CoT) reasoning trajectories in transformer models' hidden state space, formalizing them as discrete curves in ℝᵈ. Using spectral, positional, and kinematic functionals, it proposes effective dimension dₚ as a measure of trajectory complexity, showing flatter eigenvalue spectra correlate with harder tasks. On MATH500, dₚ achieves 0.93 AUC in difficulty classification, while kinematic features predict correctness from initial 20% of tokens, demonstrating transferable geometric signatures of reasoning hardness and quality.

chain-of-thoughtspectral analysiseffective dimensionkinematic featureshidden state geometry

MMAO-Cls: Metabolic Multi-Agent Optimization for Joint Feature Selection and Classifier Tuning

arXiv cs.LG · Jinliang Xu, Liping Ma · 2026-07-01

The paper introduces MMAO-Cls, a metabolic multi-agent optimization framework for joint feature selection and classifier hyperparameter tuning in classification tasks. The method encodes feature masks and hyperparameters within agent representations, incorporating energy-budget dynamics to balance accuracy and complexity, while regularizing validation rewards with subset compactness and overfitting metrics. Evaluated on seven tabular benchmarks, MMAO-Cls achieves competitive test performance (mean score 0.8882) and produces the most compact feature subsets (mean ratio 0.4881), though statistical significance over baselines (RandomSearch, GA-lite, PSO-lite) is not yet established. The results suggest applicability for classification but do not conclusively demonstrate advantages from communal agent sharing.

multi-agent optimizationfeature selectionhyperparameter tuningwrapper learningmetabolic algorithm

Certified World Models as Sensing Clocks: Drift-Aware Deadlines for Active Perception

arXiv cs.LG · Hongbo Wang · 2026-07-01

The paper introduces certified sensing clocks, a method for determining optimal re-sensing intervals in active perception systems based on drift-aware validity horizons derived from equivariant world models. Using a frozen 3D VN-JEPA model, the authors demonstrate that calibrated native rollout-drift envelopes ensure deployment guarantees, reducing eventful-tail violations compared to exact-mixture expected-belief scheduling. The method is validated on a cue-conditioned theorem-bed, isolating scheduling rules, and shows empirical conformal horizons matching deployed clock validity in short-horizon regimes. The contribution focuses on drift-aware deployment and certified sensing-clock primitives, without claiming spectral clock dominance.

certified world modelsdrift-awareequivariant world modelsvn-jepaactive perception

Wind-Aware Reinforcement Learning Control of a Small Quadrotor Using Learned Onboard Wind Estimation in Simulated Atmospheric Turbulence

arXiv cs.LG · Abdullah Al Tasim, Wei Sun · 2026-07-01

The paper presents a two-stage learning pipeline for wind-aware control of small quadrotors in turbulent conditions. First, an attention-augmented gated recurrent network estimates local wind from onboard kinematics, achieving 0.40 m/s RMSE and 3.2° direction error. Second, a proximal policy optimization controller using these estimates reduces trajectory tracking error by 48% versus a wind-blind baseline, with full success across 4-12 m/s winds. The approach degrades gracefully at 13-15 m/s out-of-distribution winds where conventional control fails.

quadrotor controlwind estimationreinforcement learningatmospheric turbulenceproximal policy optimization

Quantifying the Uncertainty of Blindly Estimated Room Embeddings Using a Dispersion-Calibrated Score

arXiv cs.LG · Yang Xiang, Philipp Götz, Emanuël A. P. Habets, Andreas Walther · 2026-07-01

The authors propose an unsupervised framework for learning robust room embeddings and uncertainty scores from reverberant speech, addressing reliability issues caused by speech content variation and recording degradation. The method anchors embeddings to a structured room impulse response (RIR) latent space using Kullback-Leibler (KL)-based alignment and multi-positive contrastive learning, while a lightweight uncertainty head is calibrated via embedding dispersion under corruption. Evaluated across waveform- and spectrogram-level corruptions, the uncertainty score consistently reflects representation dispersion and enables effective selective prediction using only a single utterance at inference.

room embeddingsreverberant speechkl-based alignmentuncertainty scoreselective prediction

Mean Field Reinforcement Learning

arXiv cs.LG · René Carmona, Mathieu Laurière · 2026-07-01

The monograph establishes a theoretical and methodological bridge between mean field control and reinforcement learning by analyzing Markov decision processes in large-population stochastic systems with mean field interactions. It develops a probabilistic framework connecting multi-agent reinforcement learning to mean field control, covering dynamic programming principles, propagation-of-chaos limits, and analyses of tabular Q-learning and policy-gradient methods. Theoretical results are complemented by numerical implementations including tabular schemes and deep reinforcement learning approaches like deep deterministic policy gradient, with particular attention to linear-quadratic models and finite-population system relationships.

mean field reinforcement learningmarkov decision processespropagation-of-chaospolicy-gradient methodsdeep deterministic policy gradient

The risk of KV cache compression

arXiv cs.LG · Lukas Haverbeck, Carmen Amo Alonso, Andres Felipe Posada-Moreno, Sebastian Trimpe · 2026-07-01

The paper characterizes the minimax risk of KV cache compression in transformers, providing theoretical guidance for designing efficient compression algorithms when accurate compression is feasible. It introduces principled methods for KV cache compression under causal masking, optimizing both prefill and autoregressive decoding phases while achieving minimax-optimal risk. Experimental results on LongBench demonstrate the practical viability of the proposed approach, offering a theoretically grounded solution to the KV cache bottleneck in long-sequence inference.

kv cache compressionminimax riskcausal maskingautoregressive decodinglongbench

Towards Learning Representations of Policies in Two-Player Zero-Sum Imperfect-Information Games

arXiv cs.LG · Kevin Wang, Kevin Yang, Arjun Prakash, Amy Greenwald · 2026-07-01

The paper introduces methods for learning policy representations in two-player zero-sum imperfect-information games through three contributions: dataset creation for policy collections, policy embedding techniques, and downstream task evaluation. The authors employ self-supervised learning on Kuhn and Leduc Poker, demonstrating that basic methods can capture useful behavioral representations in the embeddings. This work represents an early systematic comparison of self-supervised approaches for policy representation learning in games.

policy representationsimperfect-information gamesself-supervised learningbehavioral embeddingsdownstream tasks

Unveiling the Non-Monotonic Effect of Privacy on Generalization under Byzantine Robustness

arXiv cs.LG · Thomas Boudou, Batiste Le Bars, Nirupam Gupta, Aurélien Bellet · 2026-07-01

The paper reveals a non-monotonic relationship between privacy and generalization in Byzantine-robust distributed learning, challenging the established trilemma between robustness, local differential privacy (LDP), and optimization error. Through theoretical analysis, it demonstrates that in high-noise (strong privacy) regimes, increased privacy improves generalization, while in low-noise (weaker privacy) regimes, it degrades generalization. The authors provide matching lower and upper bounds on algorithmic stability under LDP constraints and validate findings with empirical evaluations.

byzantine robustnesslocal differential privacygeneralization erroralgorithmic stabilitydistributed learning

How to Allocate Your Tokens? Scaling Laws with Training Steps and Batch Size

arXiv cs.LG · Fabian Schaipp · 2026-07-01

The authors propose a three-term scaling law that incorporates model size, training steps, and batch size, improving upon existing approaches by explicitly accounting for batch size effects. By fitting this law on extensive training runs, they demonstrate accurate recovery of optimal batch size scaling and robust parameter estimation from fewer runs. The framework also enables derivation of scaling laws for suboptimal batch sizes and aligns with prior empirical results on critical batch size.

scaling lawsbatch sizetraining stepsoptimal allocationthree-term law

Boundary-Aware Quantization: Finite-Scale Decision Geometry of Neural Classifiers

arXiv cs.LG · O. M. Kiselev · 2026-07-01

The study introduces boundary-aware quantization, analyzing quantization-induced decision-boundary changes in neural classifiers through metrics like local logit-margin radii and boundary Jaccard distance. Experiments on MNIST, Fashion-MNIST, and CIFAR-10 benchmarks demonstrate that 8-bit weight quantization preserves test labels while maintaining high accuracy (0.9733 at 4 bits), with boundary Jaccard increasing to 0.970. Calibration-to-test stopping reduces flip rates and boundary Jaccard significantly, e.g., from 0.0094 to 0.0022 on digits. Boundary-aware stopping outperforms accuracy-based selection, achieving lower flip rates (0.0083 at 8-bit) and boundary Jaccard (0.048) on CIFAR-10 subsets. Calibration boundary Jaccard strongly predicts held-out boundary Jaccard (r=0.947–0.994).

quantizationdecision-boundaryjaccard distancelogit-margincalibration

Class-Grouped Normalized Momentum and Faster Hyperparameter Exploration to Tackle Class Imbalance in Federated Learning

arXiv cs.LG · Haemin Park, Diego Klabjan, Martin W. Braun, Xiuqi Li · 2026-07-01

FedCGNM, a client-side optimizer for federated learning, addresses class imbalance by partitioning classes into groups based on minimum within-group variance, maintaining normalized momentum per group, and summing these for updates. This equalizes gradient magnitude across classes and reduces noise in rare-class gradients. FedHOO, an X-armed-bandit algorithm, optimizes resampling rates efficiently in small-client regimes. Empirical evaluations on four long-tailed benchmarks and a proprietary chip-defect dataset show FedCGNM consistently outperforms baselines, with FedHOO providing additional gains in small-scale federations.

federated learningclass imbalancenormalized momentumx-armed-banditresampling rates

Geometry-Aware R-Structured Kolmogorov-Arnold Networks

arXiv cs.LG · Sergei Kucherenko, Nilay Shah · 2026-07-01

The authors propose Geometry-aware R-Structured Kolmogorov-Arnold Networks (GRS-KAN), a hybrid architecture combining Kolmogorov-Arnold Networks with V.L. Rvachev's R-functions to integrate geometric constraints into neural models. The method uses KAN branches for smooth nonlinear learning while encoding geometric/logical constraints via differentiable R-functions, enabling explicit representation of discontinuities and boundaries. Variants include additive, multiplicative, and agnostic branch-weighted architectures. Experiments on regression tasks with circular/rectangular supports show 67% RMSE reduction versus standard KANs, with improved boundary localization and interpretability through analytical geometric representations.

kolmogorov-arnold networksr-functionsgeometric constraintsdifferentiable logichybrid architecture

Hamm-Grams: An Algorithm for Mining Regular Expressions of Bytes

arXiv cs.LG · Derek Everett, Edward Raff, James Holt · 2026-07-01

The paper introduces hamm-grams, a robust feature class for malware detection consisting of fixed-length regular expressions with single-character wildcards. The authors develop an efficient mining algorithm using a novel locality-sensitive hash for small Hamming distance collisions, followed by hash bucket clustering to position wildcards. Experiments demonstrate improved performance over traditional $n$-grams in malware classification and detection tasks.

malware detectionregular expressionslocality-sensitive hashinghamming distancefeature extraction

Sign in the Air to Unlock: An Interface for authentication in Virtual and Augmented Reality Powered by Point-Voxel Cross-Attention Network

arXiv cs.LG · Neda Abdolrahimi, Thiru Siddharth, Frank Sicongchen, Vir V Phoha · 2026-07-01

The authors propose Sign in the Air to Unlock, a 3D authentication interface for VR/AR using in-air signatures, enabled by a novel Point-Voxel Cross-Attention Network (PV-Net) that jointly models local motion dynamics and global spatial structure from 3D trajectories. PV-Net achieves a 2.5% Equal Error Rate on the DeepAirSig dataset (1,800 signatures from 40 users) and 76% accuracy on ImmAirSig (880 samples from 22 users collected via Meta Quest 2), demonstrating viability for immersive authentication without breaking interaction flow.

3d authenticationpoint-voxel networkin-air signatureimmersive interfacescross-attention

Conditional Inference Trees and Forests for Feature Selection

arXiv cs.LG · Robert Milletich, Justin Downes, Steve Goley, Newel Hirst · 2026-07-01

The paper evaluates conditional inference forests (CIF) as a feature-ranking method, demonstrating its competitive performance in classification and regression tasks. CIF reduces split-selection bias through permutation-based feature testing, with Bonferroni-corrected p-values ensuring nodewise error control under the permutation null. Benchmark results show CIF ranks 4th among 17 classifiers on 22 datasets and 3rd among 18 regressors on 8 datasets. Runtime analysis reveals adaptive stopping and threshold search count as primary computational bottlenecks (4.0--10.8× slowdowns when disabled), while downstream accuracy remains stable (±0.011). Simulations highlight limitations in high-dimensional sparse settings where forest feature sampling may exclude informative features.

conditional inference forestsfeature selectionpermutation testingbonferroni correctionsplit-selection bias

The Rollout Infrastructure Tax in Coding-Agent Reinforcement Learning

arXiv cs.LG · Daniel Thi Graviet, Lovre Pesut, Ivan Dagelic, Vedran Jukic · 2026-07-01

The study quantifies infrastructure overhead in coding-agent reinforcement learning, demonstrating that execution substrate choice significantly impacts efficiency. Through comparative analysis of four deployment methods (single containers, hosted sandboxes, Kubernetes-orchestrated containers, and cloud VMs), the authors measure cold-start latency (up to 110× variation) and projected worker-hours (1.8× spread) for large-scale trajectory rollouts. Results indicate that optimizing execution substrates should be integrated into RL training systems rather than treated as deployment details.

reinforcement learningexecution substratecold-start latencykubernetestrajectory rollouts

BIFROST: Bridging Invariant Feature Representation for Observation-space Sim2Real Transfer

arXiv cs.LG · Yunfu Deng, Josiah P. Hanna · 2026-07-01

BIFROST introduces a method for zero-shot sim2real transfer by learning a shared history encoder via cross-domain bisimulation. The approach maps observation-action sequences with equivalent long-term outcomes to nearby latent states across domains, enabling policy transfer without domain adaptation. Evaluations on visual navigation, contact-rich manipulation, and visual servoing tasks demonstrate BIFROST's effectiveness where prior domain adaptation and co-training methods fail under visual and dynamics domain gaps.

sim2real transferbisimulationzero-shot learninghistory encoderdomain adaptation

A global predicted-fMRI drive signal from TRIBE does not predict YouTube replay heatmaps

arXiv cs.LG · Barada Sahu, Shivesh Pandey · 2026-07-01

This study evaluates whether TRIBE, a state-of-the-art multimodal brain-encoding model (Llama-3.2 + V-JEPA2 + Wav2Vec-BERT), can predict viewer engagement on YouTube videos using its fMRI-derived global field power signal. The authors analyze 48 videos, correlating TRIBE's per-second engagement curve with YouTube's 'most replayed' heatmaps. Results show no significant predictive power (pooled partial correlation +0.058, p=0.23), with performance indistinguishable from loudness and motion baselines. The null result holds across cortical-network readouts and permutation tests, contradicting prior music-video findings as genre-specific artifacts. The authors release code, video IDs, and a SABR-compatible acquisition method.

brain-encoding modelglobal field powerfmri predictionnaturalistic videoengagement correlation

Bi-NAS: Towards Effective and Personalized Explanation for Recommender Systems via Bi-Level Neural Architecture Search

arXiv cs.LG · Longfeng Wu, Yao Zhou, Tong Zeng, Zhimin Peng · 2026-07-01

The paper proposes Bi-NAS, a bi-level neural architecture search framework for optimizing explanations in recommender systems. The method jointly refines cross-attention mechanisms and feature interactions through intra-layer and inter-layer search spaces, while incorporating LLMs via zero-shot prompting for personalized justification generation. Evaluations on four real-world datasets show improvements in both recommendation accuracy and explanation effectiveness by aligning user preferences with item attributes through transparent reasoning.

neural architecture searchrecommender systemscross-attention mechanismszero-shot promptingfeature interaction

Enerzyme: A Framework for Efficient Training of Reactive Neural Network Potentials for Enzyme Catalysis with Application to Methyltransferases

arXiv cs.LG · Weiliang Luo, Heather J. Kulik · 2026-07-01

The authors present Enerzyme, a software framework for efficient training of neural network potentials (NNPs) tailored to enzyme catalysis, demonstrated on quantum mechanical cluster models of methyltransferases. Enerzyme integrates modular electrostatics-aware NNP architectures, automated QM-cluster construction, and reactive dataset generation, while Enerzymette automates reaction pathway exploration. Results show that NNPs trained on fewer than 1,000 system-specific datapoints achieve near-chemical accuracy in reproducing reaction energetics and transition-state structures for clusters up to 545 atoms. Direct supervision of atomic charges and consistent dielectric screening enhance simulation stability, while multitask-learned atomic charges capture charge transfer and polarization. Transferability across catechol O-methyltransferase substrates indicates NNPs learn generalizable reactivity patterns.

neural network potentialsmethyltransferasesquantum mechanical clusteratomic chargesreaction pathway exploration

Mechanistic Interpretability and Causal Feature Steering of Neural Quantum States via Sparse Autoencoders

arXiv cs.LG · Zihao Qi, Christopher Earls · 2026-07-01

The work introduces a mechanistic interpretability framework for Neural Quantum States (NQS) using sparse autoencoders to analyze internal activations. Unsupervised feature extraction from the residual stream reveals strong correlations with physical observables like order parameters and magnetization, without explicit optimization. Post-training interventions demonstrate causal relationships, as steering a single feature monotonically adjusts the corresponding observable while preserving variational energy. This establishes NQS as encoding interpretable physical representations, providing diagnostic and intervention tools for reliable NQS development.

neural quantum statessparse autoencodersmechanistic interpretabilityresidual streamvariational energy

Ravines in quantum cost landscapes: opportunities for improved VQA predictions

arXiv cs.LG · Felix J. Beckmann, João F. Bravo · 2026-07-01

The study introduces an adapted nudged elastic band (NEB) algorithm to analyze ravines in quantum cost landscapes (QCLs) of variational quantum algorithms (VQAs), focusing on quantum neural networks (QNNs) trained for concentratable entanglement classification. By constructing an ensemble prediction framework averaging QNN predictions along low-cost NEB paths, the method outperforms classical and naive quantum alternatives, leveraging local-prediction variability as a performance indicator. Complexity analysis reveals substantial computational cost reduction compared to naive QNN ensembling, with ravines persisting across depth and qubit scaling. The NEB approach accelerates convergence, demonstrating its efficacy in optimizing VQAs.

quantum cost landscapesnudged elastic bandvariational quantum algorithmsquantum neural networksconcentratable entanglement

Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training

arXiv cs.LG · Zijian Zhang, Rizhen Hu, Athanasios Glentis, Dawei Li · 2026-07-01

The study demonstrates that reinforcement learning (RL) gains in transformer-based language models are concentrated in a small subset of layers, with training just one layer often matching or exceeding full-parameter RL performance. Through systematic layer-wise analysis across seven models (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and multiple tasks (mathematical reasoning, code generation, agentic decision-making), the authors introduce layer contribution to quantify improvement recovery. Results reveal a consistent pattern: high-contribution layers cluster in the middle of the transformer stack, with strong cross-domain correlation in layer rankings.

reinforcement learningtransformer layerslayer contributionpost-trainingparameter efficiency

Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

arXiv cs.LG · Michael Saldivar, Ben Slivinski · 2026-07-01

Theoria introduces a verification architecture for AI-generated solutions by rewriting them into auditable sequences of typed state transitions, each requiring explicit justifications. This method enforces completeness of change, ensuring all modifications are accounted for and unlicensed premises are surfaced. Evaluated on HLE-Verified Gold (185 problems), Theoria achieves 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%]) and outperforms holistic LLM judges in detecting adversarial errors (94.7% vs. 83.2%, p=0.0017), particularly in hidden premises (90.6% vs. 62.5%) and fabricated citations (100% vs. 90%). On GPQA Diamond (n=65), certified precision reaches 97.1% (Wilson CI [85.1%, 99.5%]).

verification architecturetyped state transitionscompleteness of changeadversarial errorsexplicit justifications

Black-Box Inference of LLM Architectural Properties with Restrictive API Access

arXiv cs.LG · Christopher Ellis, Shreyas Chaudhari, Mei-Yu Wang, Leighton Barnes · 2026-07-01

NightVision introduces a black-box inference method that recovers LLM architectural properties (hidden dimension, depth, parameter count) under restrictive API access, which only exposes single-token log probabilities. The approach combines common set prompting—eliciting log probabilities for fixed token sets across multiple prompts—with spectral analysis and time-to-first-token measurements. Evaluated on 32 open-source LLMs, it achieves 23% average relative error for hidden dimension (9% on MoE models) and 53% for depth/parameter count in models >3B parameters. Results indicate current API restrictions inadequately protect architectural details.

black-box inferencellm architecturespectral analysiscommon set promptingtime-to-first-token

Quantum vs. Classical Machine Learning: A Unified Empirical Comparison

arXiv cs.LG · Chuanming Yu, Jiaming Liu, Zihao Ge, Xiongfei Wu · 2026-07-01

This paper provides a unified empirical comparison of quantum and classical machine learning models across seven supervised and reinforcement learning tasks. Using standardized benchmarks, the study evaluates prediction performance, policy stability, and training time. Results show current quantum machine learning models underperform classical baselines in overall metrics but demonstrate advantages in noise filtering and false positive control. The analysis identifies key challenges in hardware compatibility, training efficiency, and convergence stability for quantum approaches.

quantum machine learningempirical comparisonsupervised learningreinforcement learningconvergence stability

From Approximation to Emergence: A Theory of Deep Learning

arXiv cs.LG · Zhilin Zhao · 2026-07-01

The monograph 'From Approximation to Emergence' presents a unified theoretical framework for deep learning, synthesizing classical foundations (approximation, optimization, generalization) with modern mechanisms (overparameterization, transformers, scaling laws). It adopts a proof-oriented approach to organize disparate results into a coherent narrative, analyzing each theory through its controlled objects, validity assumptions, and unexplained phenomena. The work targets researchers and graduate students, offering a rigorous survey of current theory while highlighting open questions about emergent mechanisms in large-scale models.

overparameterizationin-context learningscaling lawstransformersemergence

A Novel Machine Learning Approach for Central Nervous System Tumor Classification from DNA Methylation

arXiv cs.LG · Paulo R. Ferreira, Lucas Coutinho Freitas, Laís dos Santos Gonçalves, William Borges Domingues · 2026-07-01

We propose a novel machine learning approach for central nervous system (CNS) tumor classification from DNA methylation data, combining Sparse Random Projection for dimensionality reduction with multinomial logistic regression. The method demonstrates improved cross-cohort transferability and robust multiclass evaluation compared to existing approaches. On a 2,801-sample reference cohort, it achieves 96% mean accuracy under stratified 3-fold cross-validation. On an independent 1,104-sample clinical cohort, it reaches 86% accuracy at the 91-class level and 93% at the methylation class family level, outperforming state-of-the-art reference figures by 4 and 5 percentage points respectively. This improvement has direct clinical implications for tumor subtype assignment and treatment selection.

dna methylationsparse random projectionmultinomial logistic regressiontumor classificationcross-cohort transferability

Generative AI and Federated Learning for Intrusion Detection Systems: A Survey

arXiv cs.LG · Jiefei Liu, Abu Saleh Md Tayeen, Pratyay Kumar, Qixu Gong · 2026-07-01

The survey systematically reviews the integration of generative AI and Federated Learning (FL) for Intrusion Detection Systems (IDS), addressing challenges like evolving attack behaviors, data scarcity, and privacy constraints. It categorizes generative AI applications—including autoencoders, GANs, diffusion models, and LLMs—for tasks like anomaly detection, synthetic traffic generation, and adversarial traffic simulation. The paper also examines FL-based IDS training for privacy-preserving distributed environments, identifying open challenges such as synthetic data quality, non-IID distributions, and domain-specific LLMs.

intrusion detection systemsgenerative adversarial networksfederated learninglarge language modelsanomaly detection

Staleness-Learning Rate Scaling Laws for Asynchronous RLHF

arXiv cs.LG · Jingwei Song, Haofeng Xu, Jie Xiao, Chengke Bao · 2026-07-01

The work analyzes gradient bias and stability in asynchronous Generalized Reinforcement Learning with Policy Optimization (GRPO) due to stale rollouts in RLHF systems. By explicitly modeling the behavior policy in the GRPO surrogate objective and distinguishing between surrogate-gradient mappings, the authors derive a per-step bias of O(S * eta) under boundedness and smoothness assumptions. They establish a two-regime collapse-time scaling law, showing stability depends on either cumulative learner drift (T * eta) or staleness (S * eta), yielding the condition eta << min{R_batch/(S*G_upd), R_crit/(T*G_upd)} for maximum stable learning rates.

asynchronous rlhfgradient biassurrogate objectivestale rolloutsscaling laws

📰 Industry Media

No new items today.


Generated automatically at 2026-07-03 20:38 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.