Daily Digest — 2026-06-24

Tuesday, June 23, 2026 · 350 items · model: deepseek/deepseek-chat

350 items · 7 research labs, 336 arxiv papers, 7 industry media

🏛️ Research Labs (7)

Helping build shared standards for advanced AI

OpenAI News · 2026-06-23

OpenAI co-founded the Appia Foundation under the Linux Foundation to develop open, modular specifications for AI system assessments, aiming to establish interoperable standards across organizations and jurisdictions. This initiative seeks to translate international frameworks into practical criteria, enabling third-party verification of compliance with safety and security standards. The effort complements OpenAI's Preparedness Framework and Frontier Governance Framework, which operationalize risk management for advanced AI systems. OpenAI also participates in multiple standardization bodies (ISO/IEC JTC1/SC42, NIST AI Consortium) to promote technically grounded practices globally.

appia foundation · preparedness framework · frontier governance · modular specifications · interoperable standards

How GPT-5 helped immunologist Derya Unutmaz solve a 3-year-old mystery

OpenAI News · 2026-06-23

GPT-5 Pro enabled immunologist Derya Unutmaz to resolve a three-year-old immunological mystery by analyzing experimental data on T-cell specialization. The model identified that deoxyglucose disrupted IL-2 protein construction, promoting Th17 inflammatory-response T cells, unlike low-glucose conditions. GPT-5 also accurately predicted outcomes of Unutmaz’s unpublished lymphoma-targeting CD8+ T-cell experiments, demonstrating its ability to simulate biological processes. Unutmaz now uses GPT-5 Pro as a collaborative tool for hypothesis refinement, literature review, and experiment simulation, significantly accelerating biological research. However, domain expertise remains critical for evaluating AI-generated insights. OpenAI emphasizes responsible AI use in biology to mitigate misuse risks.

gpt-5 · t-cell · deoxyglucose · il-2 · th17

How Omio is building the future of conversational travel

OpenAI News · 2026-06-23

Omio leverages OpenAI's conversational AI models to transform multimodal travel planning and internal operations, achieving a shift from search-based to AI-native interfaces. By integrating ChatGPT and Codex, Omio connects real-time transportation data from 3,000+ providers across 47 countries, enabling natural-language trip planning and bookable journeys. Internally, Codex accelerates software development, reducing effort to 20% of previous levels and shortening project timelines from one quarter to one month. The company emphasizes responsible AI deployment, maintaining human accountability while embedding AI across workflows to enhance decision-making, experimentation, and product iteration.

conversational ai · ai-native · multimodal travel · codex · chatgpt

Build real agentic apps using CUGA: two dozen working examples on a lightweight harness

Hugging Face Blog · 2026-06-23

IBM Research introduces CUGA (Configurable Generalist Agent), an open-source agent harness that simplifies agentic app development by handling planning, execution, and state management. CUGA requires only a tool list and prompt, demonstrated through 24 single-file FastAPI apps (e.g., IBM Cloud advisor, movie recommender). The framework supports multiple reasoning modes (Fast/Balanced/Accurate), tool interoperability (OpenAPI/MCP/LangChain), and governance policies (intent guards, tool approval). CUGA achieved top rankings on the AppWorld (07/25-02/26) and WebArena (02/25-09/25) benchmarks, enabling smaller open-weight models (e.g., gpt-oss-120b) to outperform frontier APIs through structured planning and reflection.

agentic apps · in-context learning · tool interoperability · multi-agent delegation · declarative guardrails

Shipping huggingface_hub every week with AI, open tools, and a human in the loop

Hugging Face Blog · 2026-06-23

The Hugging Face team automated their weekly release pipeline for huggingface_hub using open-source tools and AI-assisted drafting, while maintaining human oversight. Key innovations include deterministic validation of AI-generated release notes against PR manifests, OIDC-based secure publishing, and skill-based prompting for consistent output. The system shortened the release cycle from 4-6 weeks to one week, improved note quality, and caught integration issues earlier via downstream testing. All components are designed for reuse without proprietary dependencies.

oidc · trusted publishing · release automation · deterministic validation · skill-based prompting
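
One way to picture the deterministic validation step is a plain set comparison between the PR numbers cited in the drafted notes and the PR manifest. The sketch below is an illustration only: the `#123` reference format, the function name, and the manifest shape are assumptions, not the pipeline described in the post.

```python
import re

# Assumption: release notes cite pull requests as "#<number>".
PR_REFERENCE = re.compile(r"#(\d+)")

def validate_release_notes(notes, manifest_prs):
    """Compare PR numbers cited in the notes with the PRs listed in the manifest."""
    mentioned = {int(number) for number in PR_REFERENCE.findall(notes)}
    manifest = set(manifest_prs)
    return {
        "missing": sorted(manifest - mentioned),  # merged PRs the notes forgot
        "unknown": sorted(mentioned - manifest),  # cited PRs that are not in the release
    }

# Hypothetical draft and manifest, made up for this example.
draft = "Fix upload retry (#101). Add cache flag (#103). Refactor client (#999)."
report = validate_release_notes(draft, [101, 102, 103])
```

Because the check involves no model call, the same draft and manifest always produce the same report, which is what makes it usable as a gate before a human reviews the notes.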

Experimenting with the proposed Cross-Origin Storage API in Transformers.js

Hugging Face Blog · 2026-06-23

The article evaluates the proposed Cross-Origin Storage (COS) API for mitigating redundant caching of AI model resources and Wasm runtimes in Transformers.js. It demonstrates how origin-partitioned caching forces duplicate downloads of identical resources (e.g., 177MB for Whisper-tiny.en ASR model) across different domains. The COS API enables hash-based deduplication via navigator.crossOriginStorage, with configurable origin-scoping and cryptographic integrity checks. Early experiments show potential bandwidth savings for shared ML assets while addressing privacy concerns through availability gating and selective origin permissions.

cross-origin storage · transformers.js · wasm runtime · cache partitioning · model resources

We got local models to triage the OpenClaw repo for FREE!*

Hugging Face Blog · 2026-06-22

The article demonstrates using local models (Gemma-4-26b-a4b and Qwen3.6-35b-a3b) for real-time triage of GitHub issues in the OpenClaw repository, contrasting with cloud-based SOTA models. A restricted reposhell environment enables read-only repository inspection during agentic classification, achieving 700+ tokens/sec throughput on NVIDIA GB10 hardware via NVFP4 quantization and vLLM optimizations. Evaluation on 330 issues showed Qwen's superior precision (fewer false positives) versus Gemma's higher recall, while maintaining free operation versus paid GPT-5.5 batch processing.

reposhell · nvfp4 · vllm · agentic · quantization
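
The precision versus recall trade-off mentioned above can be made concrete with a small calculation. The labels below are toy values invented for illustration and are not results from the article; the two classifiers are hypothetical stand-ins for a cautious and an eager triage model.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for binary labels given as 0/1 sequences."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Toy ground truth: four issues that deserve the label, four that do not.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
cautious = [1, 1, 0, 0, 0, 0, 0, 0]  # flags little, never wrongly
eager = [1, 1, 1, 1, 1, 1, 0, 0]     # flags everything relevant plus two extras

cautious_scores = precision_recall(cautious, actual)
eager_scores = precision_recall(eager, actual)
```

The cautious classifier has no false positives but misses half of the relevant issues, while the eager one finds all of them at the cost of two false alarms.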

📜 arXiv Papers (336)

CoorDex: Coordinating Body and Hand Priors for Continuous Dexterous Humanoid Loco-Manipulation

arXiv cs.AI · Sikai Li, Shuning Li, Zhenyu Wei, Yunchao Yao · 2026-06-22

CoorDex introduces a learning pipeline for continuous dexterous humanoid loco-manipulation by coordinating body and hand priors. The method trains privileged motion tracking teachers for the humanoid body and dexterous hand, distills them into proprioception-conditioned latent priors, and uses these priors as the action space for residual reinforcement learning. A coordinated latent residual policy composes these priors through shared task context and separate body-hand residual heads. This approach enables a Unitree G1 humanoid with a 20-DoF WUJI hand to perform complex tasks like non-stop bottle grasping, fridge door opening, and cube pick-and-turn while in motion. Ablations demonstrate the necessity of the latent-prior interface and coordinated residual structure for trainability.

dexterous loco-manipulation · latent priors · residual reinforcement learning · proprioception-conditioned · task context

Semantic Browsing: Controllable Diversity for Image Generation

arXiv cs.AI · Sara Dorfman, Maya Vishnevsky, Omer Dahary, Or Patashnik · 2026-06-22

The authors propose Semantic Browsing, a method for controlled diversity in text-to-image generation that enables systematic traversal of interpretable axes of variation. Leveraging the fact that modern text-to-image models are trained on elaborated captions, they induce diversity at the text level rather than relying on stochastic variation within the model. A Vision Language Model (VLM) operates on full scene context, and an agentic workflow enforces structured variation aligned with the original prompt. The method produces diverse, navigable design spaces where each variation corresponds to a specific, user-understandable semantic decision.

text-to-image generation · controlled diversity · vision language model · agentic workflow · semantic browsing

AIR: Adaptive Interleaved Reasoning with Code in MLLMs

arXiv cs.AI · Cong Han, Xiaohan Lan, Haibo Qiu, Yujie Zhong · 2026-06-22

The paper introduces AIR, a method to enhance multimodal large language models (MLLMs) with adaptive interleaved reasoning capabilities for numerical computation tasks. It proposes a three-component solution: a two-stage cold-start data pipeline, RL dataset curation via filtering, and an adaptive tool-invocation strategy with group-constrained rewards. Reinforcement learning training yields a 6.1 pp average performance gain, with interleaved reasoning accuracy improving by 9.9 pp and tool-use success reaching 95%.

adaptive interleaved reasoning · multimodal llms · reinforcement learning · tool-invocation · numerical computation

Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?

arXiv cs.AI · Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo · 2026-06-22

The article identifies a theoretical gap in understanding AdamW's convergence under heavy-tailed gradient noise, prevalent in LLM pretraining. While sign-based optimizers (Lion, Muon) and AdaGrad have established heavy-tailed convergence, AdamW lacks rigorous analysis in this regime. The authors formalize this as an open problem, proving a positive weighted-metric benchmark and demonstrating via corridor lower-bound mechanisms how denominator memory in AdamW may obscure large gradients. Results suggest potential obstructions from second-moment accumulation but leave definitive convergence properties unresolved.

adamw · heavy-tailed noise · convergence theory · llm pretraining · second-moment accumulator

PsyBridge: A Hybrid Intelligent Framework for Multi-Dimensional Mental Health Assessment and Decision Support

arXiv cs.AI · Sunil Wanjari, Manish Thakre, Aayushi Asole, Sharwari Raut · 2026-06-22

PsyBridge introduces a hybrid intelligent framework for multi-dimensional mental health assessment by integrating clinically validated screening tools (PHQ-9, GAD-7), cognitive evaluation, and personality profiling. The modular architecture employs weighted aggregation to generate interpretable risk classifications and recommendations. Evaluated on a semi-synthetic dataset of 500 patient profiles, PsyBridge achieves 0.84 accuracy, outperforming standalone PHQ-9/GAD-7 assessments in precision, recall, and F1-score. Ablation studies confirm the stability benefits of cognitive and personality components, particularly for moderate-risk prediction.

hybrid intelligent framework · multi-dimensional assessment · weighted aggregation · semi-synthetic dataset · interpretable classification

Teaching LLMs String Matching, Backtracking, and Error Recovery to Deduce Bases and Truth Tables for the Combinatorially Exploding Bit Manipulation Puzzles

arXiv cs.AI · Prateek Agnihotri, Sanchit Jain, Prabhat Agnihotri, Aditya Prasad · 2026-06-22

The paper introduces a novel approach for solving Bit Manipulation Puzzles by reframing logic-gate deduction as a base-selection task using string similarity and structured search. The method employs backtracking DFS with autonomous error recovery, bit tokenization, and interactive reasoning SFT to simulate oracle feedback and train the model to hypothesize and backtrack natively. Evaluated on bit manipulation puzzles, the approach achieved over 96% validation accuracy, securing 7th place in the NVIDIA Nemotron Model Reasoning Challenge.

bit manipulation · string similarity · backtracking dfs · error recovery · bit tokenization

Tapered Language Models

arXiv cs.AI · Reza Bayat, Ali Behrouz, Aaron Courville · 2026-06-22

Tapered Language Models (TLMs) introduce depth-aware parameter allocation by monotonically tapering MLP width across layers under a fixed budget, challenging the uniform-width convention in modern architectures. The method employs a smooth cosine schedule to allocate more capacity to earlier layers and less to later ones, based on empirical evidence that layers contribute non-uniformly to model output. Evaluated across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, Titans), TLMs consistently improve perplexity and downstream benchmark performance without increasing parameter count or compute cost, establishing depth-aware allocation as a simple, architecture-agnostic design principle.

tapered language models · mlp width · depth-aware allocation · cosine schedule · perplexity improvement
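
A cosine taper under a fixed budget can be sketched in a few lines. This is a minimal illustration, not the paper's exact schedule: the `taper` strength, the rounding rule, and the function name are assumptions. It relies on MLP parameter count scaling linearly with hidden width, so keeping the summed width constant keeps the MLP budget constant.

```python
import math

def tapered_widths(num_layers, uniform_width, taper=0.5):
    """Per-layer MLP widths that shrink with depth while the total width stays fixed."""
    if num_layers < 2:
        return [uniform_width] * num_layers
    # Cosine weights fall smoothly from 1 + taper (first layer) to 1 - taper (last layer).
    weights = [
        1.0 + taper * math.cos(math.pi * layer / (num_layers - 1))
        for layer in range(num_layers)
    ]
    budget = num_layers * uniform_width
    scale = budget / sum(weights)
    widths = [int(weight * scale) for weight in weights]  # round down first
    # Hand the units lost to rounding back to the earliest layers.
    for layer in range(budget - sum(widths)):
        widths[layer] += 1
    return widths

# Hypothetical configuration: 12 layers that would otherwise each use width 4096.
widths = tapered_widths(num_layers=12, uniform_width=4096, taper=0.5)
```

With these settings the first layer ends up roughly three times as wide as the last, yet the summed width matches the uniform baseline of 12 × 4096.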

TailorMind: Towards Preference-Aligned Multimodal Content Generation

arXiv cs.AI · Hengji Zhou, Ye Liu, Yufeng Liu, Si Wu · 2026-06-22

TailorMind introduces a framework for preference-aligned multimodal content generation, addressing the gap between behavioral traces and generation-ready preferences. The method combines hypergraph collaborative filtering for enriching sparse user histories, ranking-error feedback for textual profile optimization, and retrieval-augmented style control with cross-modal cohesion reflection to reduce semantic drift. Evaluated on TailorBench across five dimensions (coherence, novelty, aesthetic, hallucination, profiling), TailorMind is competitive or stronger on coherence and outperforms baselines on novelty and aesthetic quality, with Recall gains of up to 29% in reranking measured against ground-truth UGC.

multimodal generation · collaborative filtering · retrieval-augmented · semantic drift · preference modeling

Learning Process Rewards via Success Visitation Matching for Efficient RL

arXiv cs.AI · Raymond Tsao, Andrew Wagenmaker, Sergey Levine · 2026-06-22

The paper introduces Success Visitation Matching (SVM), a method to transform sparse outcome rewards into dense process rewards for efficient reinforcement learning (RL). SVM trains a discriminator to differentiate between successful and unsuccessful episodes, incentivizing the RL policy to match state-action visitations of successful episodes while avoiding unsuccessful ones. This approach provides dense feedback on progress toward task completion without altering the optimal policy. Empirical results demonstrate that SVM significantly accelerates RL finetuning on both simulated and real-world robotic manipulation tasks compared to sparse reward maximization.

sparse reward · dense reward · success visitation matching · discriminator · robotic control

AI Exposure Scores: what they measure, what they miss, and what comes next

arXiv cs.AI · Campbell Lund, Thomas Euyang, Zanele Munyikwa, Marzieh Fadaee · 2026-06-22

This paper critically examines the "GPTs are GPTs" exposure scores introduced by Eloundou et al. (2023), which quantify the share of occupational tasks that large language models can assist with. The authors identify structural limitations in these static scores, including temporal, geographic, and ontological constraints, and survey five research approaches addressing these gaps: dynamic and benchmark-based measures, ensemble methods, task-framework extensions, worker-centered metrics, and adoption/usage data. They highlight a critical research-policy coordination gap, where policymakers rely on outdated metrics without incorporating methodological updates. The paper calls for improved measurement, participatory methods, and policy preparedness to navigate uncertainty and bridge this gap.

exposure scores · large language models · task-framework extensions · worker-centered metrics · policy preparedness

AI-driven Optimisation of Quality of Recovery (QoR) in Remote Patient Monitoring

arXiv cs.AI · Yansong Liu, Li-Hsi Lin, Pramit Khetrapal · 2026-06-22

The authors introduce QoR-compact, a five-item daily survey optimized for remote patient monitoring (RPM) prediction pathways, derived from the 15-item Quality of Recovery (QoR-15) instrument. They exhaustively evaluated 3,003 five-question subsets to identify a compact version statistically comparable to the full QoR-15 in predicting postoperative recovery severity (mean AUC-ROC: 0.968 vs. 0.964). QoR-compact spans physical and psychological recovery dimensions and tracks readmission events as effectively as the full form. While QoR-15 remains the gold standard, QoR-compact offers a lighter daily input for RPM, pending external validation on larger cohorts.

remote patient monitoring · quality of recovery · auc-roc · postoperative recovery · readmission events
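
The figure of 3,003 subsets is simply the number of ways to choose 5 items out of 15, which makes an exhaustive search cheap. The sketch below checks that count and shows the shape of such a search. The scoring function is a placeholder invented for this example; the paper scores subsets by predictive performance, which is not reproduced here.

```python
from itertools import combinations
from math import comb

NUM_ITEMS = 15   # items in the full QoR-15 questionnaire
SUBSET_SIZE = 5  # items kept in the compact survey

def best_subset(score):
    """Score every five-item subset of item indices and return the best one."""
    candidates = combinations(range(NUM_ITEMS), SUBSET_SIZE)
    return max(candidates, key=score)

# Number of candidate subsets: C(15, 5).
subset_count = comb(NUM_ITEMS, SUBSET_SIZE)

def toy_score(subset):
    """Placeholder scorer (assumption): prefers higher item indices."""
    return sum(subset)

toy_best = best_subset(toy_score)
```

In a real pipeline `toy_score` would be replaced by a cross-validated metric such as AUC-ROC computed on the chosen items.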

DiT-Reward: Generative Representations for Text-to-Image Reward Modeling

arXiv cs.AI · Yuanming Yang, Guoqing Ma, Bo Wang, Yuan Zhang · 2026-06-22

DiT-Reward repurposes a pretrained Diffusion Transformer (DiT) for text-to-image reward modeling by processing image latents and aggregating text-conditioned representations across transformer layers. The method outperforms HPSv3 on four preference benchmarks (85.6% on HPDv2, 77.6% on HPDv3) and achieves 1.65x faster inference with comparable memory usage. Layer-wise probing reveals optimal performance in middle-to-late layers, with scaling benefits from larger backbone capacities. When optimizing Stable Diffusion 3.5 Large via Flow-GRPO, DiT-Reward demonstrates superior realism gains over HPSv3.

diffusion transformer · reward modeling · latent representations · text-conditioned · flow-grpo

RECALL: Recovery Experience Collection for Active Lifelong Learning in Vision-Language-Action Models

arXiv cs.AI · Ulas Berk Karli, Tesca Fitzgerald · 2026-06-22

The paper introduces RECALL, an active continual learning paradigm for Vision-Language-Action (VLA) models, addressing inefficiencies in passive imitation learning. The method employs uncertainty-guided data collection to fine-tune VLAs more efficiently, but identifies catastrophic forgetting when using only recovery data. Techniques like replay-based data mixing and elastic weight consolidation are evaluated, revealing tradeoffs between plasticity and retention of learned behaviors. Empirical results demonstrate improved adaptation efficiency with uncertainty-guided recovery demonstrations, while highlighting challenges in integrating targeted new data into large robot policies.

vision-language-action · uncertainty-guided · catastrophic forgetting · elastic weight consolidation · replay-based data mixing

Data Selection Through Iterative Self-Filtering for Vision-Language Settings

arXiv cs.AI · Andrei Liviu Nicolicioiu, Sarvjeet Singh Ghotra, Morgane M. Moss, Aaron Courville · 2026-06-22

The authors propose Self-Filtering, a bootstrapped method for improving vision-language model training through iterative data selection. The approach trains a CLIP model on an evolving dataset that balances high-probability clean samples with diverse samples from the full distribution. The method alternates between model training and data mixture refinement, progressively improving dataset quality. This self-selection process enhances downstream performance without requiring additional data or pre-trained models, addressing the challenge of noise in large-scale datasets.

clip · self-filtering · vision-language · bootstrapped · data selection

Discovering Latent Groups for Robust Classification

arXiv cs.AI · Ankur Garg, Ulrich Aïvodji, Samira Ebrahimi Kahou, Vincent Michalski · 2026-06-22

The paper introduces neural classification trees (NCT), a framework addressing spurious correlations in machine learning by encoding subgroup structure in a tree-shaped architecture. NCT routes samples to 'easy' or 'hard' nodes based on prediction correctness, using these routes as pseudo-labels for iterative refinement, disentangling subgroups without supervision. Evaluated on five benchmarks, NCT isolates minority subgroups transparently, matching state-of-the-art robustness while providing interpretable tree topologies.

neural classification trees · spurious correlations · subgroup structure · pseudo-labels · interpretability

Causal Discovery in the Era of Agents

arXiv cs.AI · Yujia Zheng, Vishal Verma, Mantej Gill, Haoyue Dai · 2026-06-22

The paper advocates for a principled role of AI agents in causal discovery, where they assist workflows without directly supplying causal claims. It proposes that agents should handle data inspection, context retrieval, and explanation of method assumptions, while causal conclusions remain grounded in formal algorithms and domain expertise. The authors instantiate this principle in causal-learn+, an online platform integrating data analysis, method recommendation, and expert knowledge around the causal-learn ecosystem. A case study on Big Five personality data demonstrates the approach, avoiding unreliable language-model outputs as causal evidence. The platform is publicly available at causallearn.com.

causal discovery · large language models · expert knowledge · formal algorithms · workflow assistance

Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers

arXiv cs.AI · Tianyi Li, Zhiqiang Shen · 2026-06-22

The authors propose a scalable framework for linear mode connectivity (LMC)-based merging of billion-parameter pretrained Transformers, addressing limitations of prior single-endpoint optimization methods. Their approach employs functionality-preserving weight transformations and a dual learning procedure where both models jointly optimize toward a shared linear interpolation path. Results show near-zero loss barriers on WikiText for medium-sized language models and 69%+ ImageNet top-1 accuracy throughout ViT-L interpolation paths, demonstrating improved merging performance at scale through parameter symmetry resolution.

linear mode connectivity · pretrained transformers · weight transformations · dual learning · interpolation barriers
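
A loss barrier is commonly measured as the largest gap between the loss along the straight line joining two weight vectors and the linear interpolation of the two endpoint losses. The sketch below uses that common definition with one-parameter toy losses; the paper's exact metric and its weight transformations are not reproduced, and all names here are illustrative.

```python
def interpolate(theta_a, theta_b, alpha):
    """Point on the straight line between two parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]

def loss_barrier(loss_fn, theta_a, theta_b, steps=11):
    """Largest excess of the path loss over the interpolated endpoint losses."""
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    barrier = 0.0
    for i in range(steps):
        alpha = i / (steps - 1)
        path_loss = loss_fn(interpolate(theta_a, theta_b, alpha))
        baseline = (1.0 - alpha) * loss_a + alpha * loss_b
        barrier = max(barrier, path_loss - baseline)
    return barrier

def double_well(theta):
    """Toy loss with minima at -1 and +1 separated by a bump at 0."""
    return (theta[0] ** 2 - 1.0) ** 2

def bowl(theta):
    """Toy convex loss: no bump between any two points."""
    return theta[0] ** 2

barrier_double_well = loss_barrier(double_well, [-1.0], [1.0])
barrier_bowl = loss_barrier(bowl, [-1.0], [1.0])
```

The double-well endpoints are both minima but the midpoint sits on a bump of height 1, so the barrier is 1. The convex bowl has no barrier, which is the "near-zero" situation merging methods aim for.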

Polycepta: Object-Centric Appearance Estimation for Multi-Object Tracking

arXiv cs.AI · Mohamed Nagy, Naoufel Werghi, Jorge Dias, Majid Khonji · 2026-06-22

Polycepta introduces an object-centric appearance state estimation framework for multi-object tracking (MOT), reformulating appearance modeling as a recursive estimation problem rather than frame-wise matching. It constructs and updates independent appearance states for each tracked object, enabling future appearance representations to be estimated from accumulated observations. Polycepta learns object-specific representations rather than memorizing them, facilitating appearance estimation for unseen classes. The framework progressively refines appearance estimates during inference, improving tracking performance. Experiments on KITTI, Waymo Open Dataset, and MOT17 demonstrate reduced identity switches and enhanced tracking accuracy. Integrated into RobMOT, Polycepta achieves a MOTA of 92.27% on KITTI and operates at 90.57 Hz.

multi-object tracking · appearance state estimation · recursive estimation · object-centric · tracking-by-detection

Against Proxy Optimization

arXiv cs.AI · Sven Neth · 2026-06-22

The article identifies conditions where maximizing a proxy utility function leads to suboptimal or harmful outcomes, challenging foundational assumptions in decision theory. It systematically examines scenarios where proxy optimization diverges from true utility maximization, highlighting potential pitfalls in practical applications. The analysis suggests that reliance on proxies can introduce unintended biases or inefficiencies, particularly in complex decision-making environments. These findings raise critical questions about the robustness of proxy-based approaches in theoretical and applied contexts.

proxy utility · decision theory · optimization · utility maximization · suboptimal outcomes

SPIRAL: Learning to Search and Aggregate

arXiv cs.AI · Jubayer Ibn Hamid, Ifdita Hasan Orney, Michael Y. Li, Omar Shaikh · 2026-06-22

SPIRAL introduces Sequential-Parallel-Aggregative Reinforcement Learning, a framework enhancing language model reasoning by training models to utilize sequential reasoning, parallel trace sampling, and trace aggregation within a unified inference pipeline. The method employs set reinforcement learning for generating useful trace sets and standard reinforcement learning for aggregating these traces into improved responses. Experiments show that SPIRAL outperforms GRPO by up to 11× in scaling efficiency and 15% in performance when all compute primitives are scaled.

sequential reasoning · parallel traces · trace aggregation · set reinforcement learning · inference pipeline

The Topology of Ill-Posed Questions: Persistent Homology for Detection and Steering in LLMs

arXiv cs.AI · Guangyu Jiang, Sizhe Tang, Mahdi Imani, Tian Lan · 2026-06-22

This work introduces a topological approach to detect and steer responses to ill-posed questions in large language models (LLMs). It models contextual hidden states of prompt tokens as point clouds, characterizing their geometry using finite zero-dimensional persistent homology, summarized by three descriptors per layer. These descriptors form a topology representation used for classification and activation steering. Evaluations on AmbigQA, SituatedQA, and CLAMBER show improvements in ill-posedness classification accuracy (67.4% to 78.9%, 79.9% to 88.5%, 57.6% to 69.6%) and acceptable response rates (61.4% to 70.6%), demonstrating persistent homology's effectiveness for interpretable representation and targeted steering.

persistent homology · ill-posed questions · activation steering · point cloud · transformer layer
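
Zero-dimensional persistent homology of a point cloud tracks when connected components merge as a distance threshold grows. For a Vietoris-Rips filtration every component is born at threshold 0, and the finite death times equal the edge lengths of a minimum spanning tree. The sketch below computes them with Kruskal's algorithm. It uses edge length as the death scale (some conventions use half of it) and does not attempt the paper's three per-layer descriptors, which are not specified in this summary.

```python
import math
from itertools import combinations

def h0_death_times(points):
    """Finite death times of zero-dimensional persistence for a point cloud."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge, so one of them dies
            deaths.append(length)
    return deaths

# Toy cloud: three points on a line with gaps of 1 and 2.
deaths = h0_death_times([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)])
```

In the setting described above, `points` would be the hidden-state vectors of the prompt tokens at one layer, and summary statistics of `deaths` would feed a classifier.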

A Generative Model for Closed-Loop Microsimulation of Signalized Intersections

arXiv cs.AI · Yash Ranjan, Rahul Sengupta, Anand Rangarajan, Sanjay Ranka · 2026-06-22

The paper introduces Enactor, a generative model for closed-loop microsimulation of signalized intersections that captures heterogeneous vehicle interactions. The actor-centric model encodes dynamic actors and lane polylines in polar coordinates, using a transformer with separate spatial and temporal attention blocks to predict next-step motion distributions. Evaluated in simulation-in-the-loop tests, Enactor achieves KL divergences over an order of magnitude lower than baselines on travel time and speed distributions, reduces red-light violations by >10×, and outperforms constant-velocity baselines on real-world trajectory prediction tasks.

microsimulation · signalized intersections · closed-loop · transformer · trajectory prediction

Decentralized Autonomous Traffic Management through Corridor Networks

arXiv cs.AI · Jasmine Jerry Aloor, Aadarsh Govada, Hamsa Balakrishnan · 2026-06-22

The paper proposes a decentralized multi-agent reinforcement learning (MARL) approach for autonomous traffic management in Advanced Air Mobility (AAM) corridor networks. The method extends single-corridor MARL policies to multi-corridor networks with merges and splits, demonstrating zero-shot transfer to varied traffic densities, network geometries, and heterogeneous vehicle performance. Results show effective conformance to corridor boundaries (100% completion rates), maintained inter-aircraft separation, and optimized travel metrics without centralized coordination or retraining.

multi-agent reinforcement learning · advanced air mobility · traffic flow management · zero-shot transfer · decentralized coordination

Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse

arXiv cs.AI · Bole Ma, Jan Eitzinger, Harald Koestler, Gerhard Wellein · 2026-06-22

The paper introduces Kamera, a unified position-invariant KV cache that enables training-free reuse of multimodal content across sliding windows without recomputation. The method addresses the cross-chunk conditioning loss in naive KV reuse by storing low-rank conditioning patches alongside each chunk, recoverable via exact RoPE re-rotation and patch application. Results show full task accuracy recovery on MM-NIAH and doc-QA benchmarks with minimal KV footprint, achieving bf16-equivalent reconstruction in SGLang across six backbones, particularly benefiting vision and video streams.

kv cache · multimodal agents · rope re-rotation · low-rank conditioning · sliding-window
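
RoPE re-rotation works because rotary position encoding is a rotation, and rotations compose: a key rotated for position p and then rotated by a shift s equals the same key rotated for position p + s. The sketch below checks this on a toy key, representing each rotated feature pair as a complex number. The base of 10000 and the frequency formula follow the usual RoPE definition; the function names are illustrative, and Kamera's low-rank conditioning patches are not modeled.

```python
import cmath

ROPE_BASE = 10000.0  # usual RoPE default, assumed here

def rope_rotate(pairs, position, base=ROPE_BASE):
    """Rotate each complex feature pair by position * frequency."""
    dim = 2 * len(pairs)
    rotated = []
    for j, pair in enumerate(pairs):
        freq = base ** (-2.0 * j / dim)
        rotated.append(pair * cmath.exp(1j * position * freq))
    return rotated

def rope_rerotate(cached_pairs, shift, base=ROPE_BASE):
    """Move an already-rotated key by `shift` positions without the raw key."""
    return rope_rotate(cached_pairs, shift, base)

# Toy key with two feature pairs (values are arbitrary).
key = [complex(0.3, -1.2), complex(0.7, 0.4)]
cached = rope_rotate(key, position=5)      # stored in the cache at position 5
moved = rope_rerotate(cached, shift=12)    # reused at position 17
direct = rope_rotate(key, position=17)     # what recomputation would give
max_error = max(abs(a - b) for a, b in zip(moved, direct))
```

The two results agree up to floating-point rounding, so a cached key can be repositioned without recomputing it from the input.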

Solve for the Hyperparameter, Skip the Search: Kolmogorov-Optimal Scaling Laws for Spline Regression

arXiv cs.AI · Yong Yi Bay, Kathleen A. Yearick · 2026-06-22

The paper introduces KORE (Kolmogorov-optimal Order-aware Resolution Estimation), a method for determining optimal hyperparameters in spline regression without exhaustive search. Leveraging classical approximation theory, the Kolmogorov n-width, and the PRESS identity, KORE analytically solves for the optimal resolution by balancing bias and noise scales. This approach reduces computational cost significantly, requiring only about a dozen fits instead of a full grid sweep. Empirical results show that KORE matches exhaustive 3-fold cross-validation and outperforms 21 other methods on 36 real tabular datasets in accuracy per unit of compute, particularly when complexity resides in low interaction order.

spline regression · kolmogorov n-width · press identity · hyperparameter tuning · interaction order
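
The PRESS identity is what lets leave-one-out error be computed from a single fit: for a least-squares fit, the leave-one-out residual of point i equals its ordinary residual divided by (1 - h_ii), where h_ii is the leverage. The sketch below verifies this on a straight-line fit, used here as a simple stand-in for a spline basis; the data are made up and KORE itself is not implemented.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x

def press_shortcut(xs, ys):
    """PRESS from one fit: sum of (residual / (1 - leverage)) squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope, intercept = fit_line(xs, ys)
    total = 0.0
    for x, y in zip(xs, ys):
        leverage = 1.0 / n + (x - mean_x) ** 2 / sxx  # h_ii for a line with intercept
        residual = y - (slope * x + intercept)
        total += (residual / (1.0 - leverage)) ** 2
    return total

def press_brute_force(xs, ys):
    """PRESS by refitting n times, each time leaving one point out."""
    total = 0.0
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (slope * xs[i] + intercept)) ** 2
    return total

# Made-up data for the check.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.3, 2.8, 4.2]
shortcut, brute = press_shortcut(xs, ys), press_brute_force(xs, ys)
```

Both routes give the same number, but the shortcut needs one fit instead of n, which is the kind of saving that makes skipping a full cross-validation sweep plausible.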

Scheduling Thoughts: Learning the Order of Thought in Diffusion Language Models

arXiv cs.AI · Jiawei Xu, Minghui Liu, Aakriti Agrawal, Yifan Chen · 2026-06-22

The paper introduces Self-Aware Scheduling (SAS), a method for optimizing the order of thought in masked diffusion language models by deriving a tractable upper bound on sequential decoding mismatch. This bound enables dense self-aware rewards over ordered trajectories, framing order selection as a policy optimization problem with a frozen denoiser. SAS employs Group Relative Policy Optimization to learn a lightweight order policy, applicable to any-order and semi-autoregressive decoding. Results show improvements across tasks: Sudoku puzzle accuracy increases from 82.0% to 91.8%, and pass@1 rises from 64% to 76% on GSM8K (mathematical reasoning) and from 39.5% to 41% on MBPP (code generation).

masked diffusion · sequential decoding · policy optimization · self-aware scheduling · group relative policy

The Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model

arXiv cs.AI · Mansour Zoubeirou a Mayaki · 2026-06-22

The authors propose a roofline-inspired scaling model to predict energy consumption during Transformer fine-tuning across multi-GPU configurations. Their framework correlates measured energy with computational proxies, memory traffic, and hardware efficiency factors, incorporating speedup-based adjustments for tensor parallelism and fully sharded data parallelism. The derived scaling law demonstrates accurate energy prediction across heterogeneous BERT model architectures, addressing the growing computational costs of large-scale NLP training.

transformer · energy consumption · roofline model · tensor parallelism · scaling law

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

arXiv cs.AI · Haoling Li, Kai Zheng, Jie Wu, Can Xu · 2026-06-22

VeriEvol introduces a scalable framework for multimodal mathematical reasoning by decoupling prompt difficulty and answer reliability through verifiable data construction. The method employs a type-aware evolution module to generate harder, image-grounded prompts and HTV-Agent, a verifier that ensures answer reliability via multi-source counter-evidence falsification. Scaling from 10K to 250K samples improves mean accuracy from 35.42 to 54.73 on a five-benchmark visual-math suite. With GRPO-style RL, VeriEvol adds a cumulative +3.88 accuracy over baseline, with +1.82 from evolved prompts and +2.06 from HTV-Agent. The framework releases prompts, data, models, code, and verifier traces for auditability.

verifiable data construction · type-aware evolution · htv-agent · multimodal reasoning · grpo-style rl

SQLConductor: Search-to-Policy Learning for Step-wise Text-to-SQL Orchestration

arXiv cs.AI · Yizhang Zhu, Zhangyang Peng, Boyan Li, Yuyu Luo · 2026-06-22

SQLConductor introduces a step-wise orchestration learning framework for Text-to-SQL, addressing limitations of fixed pipelines and plan-then-execute approaches. The method formulates subtasks as specialized actions, training a policy model via Search-to-Policy Learning with Monte Carlo Tree Search and stability estimation, enhanced by Stability-weighted Supervised Fine-tuning and Curriculum Reinforcement Learning. Evaluated on BIRD-Dev and out-of-distribution datasets, it achieves 73.2% execution accuracy, outperforming prior methods while using frozen larger action models, with analyses showing adaptive orchestration to diverse queries.

text-to-sql · orchestration learning · monte carlo tree search · stability estimation · curriculum reinforcement learning

POTracker: Optimizing Large Language Models for Standard-Compliant Power Outage Report Generation

arXiv cs.AI · Hung Phan, Aniroop Naladala, Dubey Avanindra, Supryia Chinthavali · 2026-06-22

The authors propose POTracker, a fine-tuned Qwen2.5-7B-Instruct model optimized for generating standard-compliant power outage reports. POTracker introduces POTrackerLoss, a novel loss function that jointly optimizes for textual similarity and structural (tag) similarity between generated and ground-truth reports. Evaluated on 1,000 power outage reports, POTracker achieves 86.47% structural accuracy and improves overall accuracy by up to 51% compared to five fine-tuning methods and one rule-based XML conversion approach. A human study with domain experts yields an average quality score of 4.03/5 for generated reports.

potrackerloss · qwen2.5-7b-instruct · structural accuracy · textual similarity · power outage reports

DVL-DeepONet: A Physics-Guided Operator Learning for Resilient Underwater Navigation

arXiv cs.AI · Arup Kumar Sahoo, Itzik Klein · 2026-06-22

The authors propose DVL-DeepONet, a physics-guided deep neural operator framework for resilient underwater navigation under degraded sensing conditions. The method learns a nonlinear operator mapping temporal inertial/Doppler velocity log (DVL) observations to vehicle velocity while enforcing DVL measurement physics through consistency constraints, addressing three operational scenarios: noise-resilient estimation, DVL-only learning, and beam measurement recovery. Validation on real-world AUV experiments (10,000 m cumulative path length) shows a 40% performance improvement over baseline model-based and learning-based approaches.

deep neural operator · doppler velocity log · physics-guided learning · autonomous underwater vehicles · resilient navigation

CADRE: Stable, Parameter Efficient Adaptation of Medical Vision Language Models with Bounded Forgetting and Prior Drift

arXiv cs.AI · Amrita Singh, Rishabh Jha · 2026-06-22

CADRE introduces a parameter-efficient framework for adapting medical vision-language models (VLMs) to clinical services while mitigating catastrophic forgetting and prior drift. The method combines low-rank adaptation (LoRA) with an elastic weight consolidation term to bound retained-competence loss and an anchor-to-prior penalty to limit embedding drift from the frozen pretrained prior. Evaluated on breast cancer across three dissimilar imaging modalities (histopathology, ultrasound, chest radiography), CADRE achieves the highest accuracy, SPQ, and backward transfer while reducing forgetting sevenfold (0.011 vs. 0.075; p=0.023) compared to regularized baselines, adapting only 0.23% of parameters.
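
The two regularizers named in this summary can be written down compactly. The sketch below uses the standard diagonal-Fisher form of elastic weight consolidation and a squared-distance anchor penalty. The function names, the weights `lambda_ewc` and `lambda_anchor`, and the exact form of the anchor term are assumptions, not the authors' implementation.

```python
def ewc_penalty(params, anchor_params, fisher_diag):
    """Standard EWC term: 0.5 * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * sum(f * (p - a) ** 2
                     for p, a, f in zip(params, anchor_params, fisher_diag))


def prior_drift_penalty(embedding, prior_embedding):
    """Assumed anchor-to-prior term: squared distance to the frozen model's embedding."""
    return sum((e - p) ** 2 for e, p in zip(embedding, prior_embedding))


def regularized_loss(task_loss, params, anchor_params, fisher_diag,
                     embedding, prior_embedding,
                     lambda_ewc=1.0, lambda_anchor=1.0):
    """Task loss plus both penalties; only the adapter parameters would be trained."""
    return (task_loss
            + lambda_ewc * ewc_penalty(params, anchor_params, fisher_diag)
            + lambda_anchor * prior_drift_penalty(embedding, prior_embedding))
```

In a LoRA setting, `params` would be the low-rank adapter weights, so both penalties act on a very small fraction of the model.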

vision-language models · low-rank adaptation · catastrophic forgetting · elastic weight consolidation · embedding drift

War in the Abstract: The Rise and Consequences of Militarized Language in Scientific Communication

arXiv cs.AI · Sovesh Mohapatra, David Lydon-Staley, Dani S. Bassett · 2026-06-22

This study quantifies the rise and consequences of militaristic language in scientific communication through analysis of 21.4 million papers (2010-2025) from OpenAlex and PubMed, supplemented by a within-subject war-framing experiment (N=801). Results show a 48% increase in militaristic terms in OpenAlex and 32% in PubMed, with accelerated growth post-2019 (cross-database r=0.96, p<10^-8). Militaristic language prevalence correlates with global conflicts (r=0.77-0.84), with social sciences leading in usage and engineering/computer science in growth. War framing reduced credibility (-0.18 Likert units, p<10^-20), funding willingness, and policy support, while increasing urgency perception.

militaristic language · scientific abstracts · war-framing · credibility · global conflicts

AOHP: An Open-Source OS-Level Agent Harness for Personalized, Efficient and Secure Interaction

arXiv cs.AI · Shanhui Zhao, Jiacheng Liu, Guohong Liu, Jichao Yan · 2026-06-22

AOHP introduces an OS-level agent harness built on Android Open Source Project (AOSP) to address the mismatch between AI agents and conventional application-centric operating systems. It treats agents as first-class OS actors, enabling personalized service composition, efficient agent interfaces, and secure information flow. Preliminary experiments demonstrate AOHP's advantages in task completion (+21.12% completion rate), execution cost (-51.55% token cost), and security-policy compliance compared to traditional systems.

agent-native operating systems · personalized service composition · efficient agent interfaces · secure information flow · android open source project

What Does a Chemical Language Model Know About Molecules?

arXiv cs.AI · Christian Kenneth, Etowah Adams, Liam Bai, Gerard JP van Westen · 2026-06-22

The study mechanistically analyzes chemical language model (cLM) representations using sparse autoencoders (SAEs) applied to MolFormer, an encoder-only architecture. Results reveal layer-specific specialization: early layers employ position-tracking latents for molecular grammar parsing, while later layers encode atom-in-substructure and pharmacophoric features. Analysis of SMILES perturbations shows non-canonical strings induce greater representation shifts than invalid ones, attributed to position-latent disruption propagation. The authors release InterMol, an interactive visualization tool for SAE activations across molecular representations.

chemical language model · sparse autoencoder · molecular representation · smiles parsing · mechanistic interpretability

Cross-Architectural Mixture-of-Experts with Adaptive Soft Routing for Plant Leaf Disease Classification

arXiv cs.AI · Phi-Hung Hoang, Thi-Thu-Hong Phan · 2026-06-22

The authors propose a cross-architectural Mixture-of-Experts (MoE) framework with adaptive soft routing for plant leaf disease classification, integrating EfficientNet-B0, DenseNet-121, and Swin-Tiny to capture complementary multi-scale features. The method employs a soft gating mechanism for dynamic expert weighting and a two-stage refinement training strategy for optimization stability. It achieves 91.68% recall and 92.62% F1-score on an imbalanced potato leaf dataset, outperforming individual experts by 5.91% and 5.03%, with cross-dataset F1-scores of 94.03% (durian) and 97.04% (sesame).
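
Soft gating can be illustrated with a small, dependency-free sketch. It assumes the gate mixes per-expert class probabilities; the paper may instead mix features. `soft_route` and its arguments are hypothetical names.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_route(gate_logits, expert_probs):
    """Blend expert outputs with softmax gate weights (every expert contributes)."""
    weights = softmax(gate_logits)
    num_classes = len(expert_probs[0])
    return [sum(w * probs[c] for w, probs in zip(weights, expert_probs))
            for c in range(num_classes)]
```

Unlike hard top-1 routing, every expert receives a non-zero weight, so the gate can lean on whichever backbone suits a given image while remaining differentiable.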

mixture-of-experts · adaptive soft routing · cross-architectural · class imbalance · leaf disease classification

Rethinking Object-Centric Representations for Video Dynamics Modeling

arXiv cs.AI · Amaury Wei, Ismail Nejjar, Olga Fink · 2026-06-22

STAITUS introduces a unified framework for unsupervised video object tracking that disentangles slot-based representations into appearance and geometric pose components. By enforcing spatial separation within frames and applying temporal alignment exclusively in appearance space, the model achieves sharper masks and persistent object identities under motion, occlusion, and object entry/exit. An adaptive gating mechanism dynamically adjusts the number of active slots to match scene complexity, mitigating over-segmentation. Extensive experiments on synthetic and real-world benchmarks demonstrate that STAITUS significantly outperforms state-of-the-art baselines in segmentation quality and tracking stability.

slot-based representations · temporal alignment · adaptive gating · unsupervised video tracking · object-centric disentanglement

GRINQH: Graded Input-based Quantization Hierarchy for Efficient LLM Generation

arXiv cs.AI · Jette Oberländer, Jan Finkbeiner, Catherine M. Schöfmann, Emre Neftci · 2026-06-22

GRINQH introduces a graded input-based quantization hierarchy for efficient LLM generation, addressing the memory bottleneck in autoregressive decoding. The method dynamically assigns weight channels to different precision levels using activation magnitudes as importance proxies, unifying quantization and sparsification. Evaluated on Llama3 and Qwen3, GRINQH outperforms fixed- and mixed-precision baselines at 3- and 4-bit settings, achieving effective 2-bit generation, and establishes a new Pareto frontier for quality-speed trade-offs via a custom GPU kernel with hierarchical memory layout.
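
The core idea of using activation magnitudes as importance proxies can be sketched as a ranking step. The tier fractions and bit widths below are invented for illustration, and `assign_bit_widths` is a hypothetical name; the actual hierarchy and kernel are described in the paper.

```python
def assign_bit_widths(activation_magnitudes, tiers=((0.25, 4), (0.25, 3), (0.5, 2))):
    """Give channels with larger activations more bits.

    tiers: (fraction of channels, bit width) pairs, most important first.
    The last tier absorbs any rounding remainder.
    """
    n = len(activation_magnitudes)
    order = sorted(range(n), key=lambda i: activation_magnitudes[i], reverse=True)
    bits = [0] * n
    start = 0
    for index, (fraction, width) in enumerate(tiers):
        is_last = index == len(tiers) - 1
        end = n if is_last else min(n, start + round(fraction * n))
        for channel in order[start:end]:
            bits[channel] = width
        start = end
    return bits
```

Because the ranking depends on the current input's activations, the same weight channel can be read at different precisions from one token to the next.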

quantization · autoregressive decoding · memory bandwidth · sparsification · gpu kernel

Digital Humanism and Evolutionary Design

arXiv cs.AI · Wolfgang Höhl · 2026-06-22

The paper analyzes conceptual overlaps between digital humanism and evolutionary design in human-centered technological development. It employs comparative analysis of key terms (freedom, responsibility, conviviality) and case studies including the Turing Test and Chinese Room argument. Results indicate structural similarities between the paradigms, particularly in co-evolutionary software development and Simondon's 'open machine' concept, but reveal tensions in autonomy determination and simulated subjectivity. Market-driven specialization in AI applications is shown to negatively impact open technology development, even in sustainability-focused optimizations like green IT.

digital humanism · evolutionary design · co-evolutionary development · open machine · green it

Detecting Malicious Agent Skills in the Wild using Attention

arXiv cs.AI · Bacem Etteib, Daniele Lunghi, Tégawendé F. Bissyandé · 2026-06-22

Locate-and-Judge introduces a two-stage detector for identifying malicious LLM agent skills in marketplace distributions. The method first employs a lightweight locator to score skill spans based on instruction-following attention, retaining only the top-K spans. A detailed judge then examines these spans, enabling scalable marketplace-wide audits with reduced computational cost. Compared to direct LLM-based scanning, this approach achieves an order-of-magnitude cost reduction while maintaining high precision, flagging dozens of live malicious skills missed by SkillSpector and Cisco Skill Scanner. The authors release a labeled dataset of confirmed malicious skills.
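
The two-stage control flow is easy to sketch. Both scoring functions below are toy stand-ins: the paper's locator uses instruction-following attention and its judge is a detailed LLM pass, neither of which is reproduced here.

```python
SUSPICIOUS_WORDS = {"ignore", "upload", "send", "always"}  # toy list, illustration only


def toy_locator_score(span):
    """Cheap stand-in for the locator: count suspicious words in a span."""
    words = [w.strip(".,!?").lower() for w in span.split()]
    return sum(w in SUSPICIOUS_WORDS for w in words)


def toy_judge(span):
    """Stand-in for the expensive judge: flag spans that touch secrets."""
    text = span.lower()
    return "ssh" in text or "token" in text


def locate_and_judge(spans, locator_score, judge, top_k=3):
    """Stage 1 keeps the top-K spans by locator score; stage 2 judges only those."""
    ranked = sorted(spans, key=locator_score, reverse=True)
    return [span for span in ranked[:top_k] if judge(span)]
```

The cost saving comes from the judge running on at most `top_k` spans per skill rather than on the full text.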

llm agents · instruction-following attention · malicious skills · marketplace audit · span scoring

UnBias-Plus: Detect, Explain, and Rewrite Bias

arXiv cs.AI · Ahmed Y. Radwan, Ahmed ElKady, Sindhuja Chaduvula, Mohamed Hafez · 2026-06-22

UnBias-Plus introduces an open-source toolkit addressing four key challenges in bias mitigation: segment-level multi-class bias classification, biased span localization, neutral text rewriting, and decision reasoning. The method integrates these components into a unified framework accessible via Python, CLI, REST API, and web interfaces. Results include publicly available models, datasets, and documentation, enabling granular bias analysis across domains like journalism and AI-generated content.

bias detection · span localization · text rewriting · multi-class classification · interpretability

HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models

arXiv cs.AI · Yuval Domb, Hadar Sackstein, Tomer Solberg · 2026-06-22

HyperQuant introduces a unified post-training quantization pipeline for weights and KV-cache in large language and diffusion transformers, combining four techniques: per-tile Randomized Hadamard Transform, quantization to optimal lattices, near-entropy-optimal Rice coding, and bias-correction for KV-cache. The method outperforms HIGGS at 3-5 bits per scalar (bps) for weights and surpasses TurboQuant and OCTOPUS down to 1.7 bps for KV quantization. It achieves near-lossless quality with ~3.9x compression for linear weights and ~3.79x for KV-cache on an H100 at 4 bps, and successfully quantizes the 19B-parameter LTX-2 DiT video model without artifacts. Integration with Tensor-Core MMA paths shows int8 outperforming fp8 on post-RHT lattice output.
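
Of the four techniques, Rice coding is the simplest to show in isolation. The sketch below implements plain Rice coding of non-negative integers under one common bit convention (unary quotient as ones followed by a zero). It is a generic illustration, not HyperQuant's coder, and signed lattice indices would first need a mapping such as zigzag.

```python
def rice_encode(value, k):
    """Encode a non-negative integer: unary quotient, then a k-bit remainder."""
    if value < 0 or k < 0:
        raise ValueError("value and k must be non-negative")
    bits = "1" * (value >> k) + "0"
    if k > 0:
        remainder = value & ((1 << k) - 1)
        bits += format(remainder, "0{}b".format(k))
    return bits


def rice_decode(bits, k):
    """Decode a single Rice codeword produced by rice_encode."""
    quotient = bits.index("0")
    remainder = int(bits[quotient + 1:quotient + 1 + k], 2) if k > 0 else 0
    return (quotient << k) | remainder
```

Small values get short codewords, which is why the scheme suits quantized coefficients that cluster near zero.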

randomized hadamard transform · kv-cache · rice coding · tensor-core · post-training quantization

ReasoningLens: Hierarchical Visualization and Diagnostic Auditing for Large Reasoning Models

arXiv cs.AI · Jun Zhang, Jiasheng Zheng, Boxi Cao, Yaojie Lu · 2026-06-22

ReasoningLens introduces a hierarchical visualization framework for auditing large reasoning models, addressing opacity in lengthy Chain-of-Thought traces. The method structures reasoning traces into interactive hierarchies, employs an agentic auditor for automated error detection with tool-augmented verification, and synthesizes systemic reasoning profiles to identify model-specific blind spots. This open-source system transforms unstructured procedural text into actionable insights for debugging and optimizing reasoning-centric AI.

chain-of-thought · hierarchical visualization · agentic auditor · reasoning profiles · tool-augmented verification

Litmus: Zero-Label, Code-Driven Metric Specification for Evaluating AI Systems

arXiv cs.AI · Prajjwal Gupta, Prasang Gupta, Vishal Bhutani, Apoorva Sharma · 2026-06-22

Litmus introduces a zero-label, code-driven system for specifying evaluation metrics in AI pipelines by eliciting intent from source code and targeted interrogation. The method first identifies measurement requirements and rationale, then constructs a justified metric portfolio through constraint satisfaction. Evaluated on three real-world pipelines (financial account grouping, scientific QA, risk assessment), Litmus achieves superior concern coverage (Spearman ρ=0.72 vs <0.47 baselines in scientific QA), stage span, and portfolio non-redundancy while requiring no labels during design, demonstrating the viability of automatic metric specification over mere implementation.

metric specification · zero-label evaluation · ai pipelines · constraint satisfaction · concern coverage

Automated Semantic Fault Localization in SysML v2: A Human-in-the-Loop Framework Using Knowledge-Graph Augmented LLMs

arXiv cs.AI · Haitham Al-Shami, Rohail Malik, Riku Ala-Laurinaho, Jari Vepsäläinen · 2026-06-22

The paper presents a human-in-the-loop framework for automated semantic fault localization in SysML v2 models, addressing undetectable domain-rule violations that evade compiler checks. The method combines fine-tuned Small Language Models (Qwen2.5-Coder-1.5B and DeepSeek-Coder-6.7B) with a domain knowledge graph encoding physical compatibility rules, which also guides synthetic training data generation and grounds repair suggestions. Evaluation on 1,184 test samples shows fine-tuning improves semantic fault repair from <3% to >91% accuracy, while patch-based output reduces token length by more than 60%, demonstrating effective AI-assisted model verification.

sysml v2 · semantic fault localization · knowledge graph · small language model · model-based systems engineering

Distribution-Aware Diffusion-LLM for Robust Ultra-Long-Term Time Series Forecasting

arXiv cs.AI · Falguni Ghosh, Vahid Hashemi, Bernhard Kainz · 2026-06-22

The paper introduces Diffusion-LLM, a framework combining conditional diffusion models with LLMs for robust ultra-long-term time series forecasting. It addresses LLMs' limitations in probabilistic modeling and multimodal alignment by learning conditional distributions in a shared latent space. Evaluated on six benchmarks (ETT, Weather, ECL), the method outperforms existing LLM-based approaches, particularly in ultra-long-term and few-shot scenarios, demonstrating improved robustness through distribution-aware regularization.

diffusion-llm · time series forecasting · probabilistic modeling · multimodal alignment · distribution-aware regularization

Energy-Based Transformers as Predictors of Reading Difficulty

arXiv cs.AI · Jakub Dotlacil, Ece Takmaz · 2026-06-22

This work introduces energy-based transformers as a unified predictor of reading difficulty in computational psycholinguistics, bridging transformer models with associative memory literature. The authors evaluate energy measures across three reading-time corpora (Natural Stories, UCL eye-tracking, UCL self-paced reading) and a controlled relative clause processing experiment. Results demonstrate that energy measures significantly improve prediction beyond surprisal, capture object/subject asymmetry at a single layer, and subsume effects of both attention entropy and surprisal. This suggests energy may replace multiple complementary measures previously required for modeling reading difficulty.

energy-based transformers · computational psycholinguistics · attention entropy · associative memory · reading difficulty

Measuring & Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts

arXiv cs.AI · Arthur Wuhrmann, Gaetan Stein, Daniel Brunner, Andrei Kucharavy · 2026-06-22

This paper introduces TF-RefusalBench, a multilingual benchmark for measuring over-alignment in LLMs used for criminal law tasks, containing 5,200 prompts across French, German, Italian, and English derived from Swiss Supreme Court rulings. The study demonstrates that over-alignment is multifaceted, influenced by model, prompt, and text language, and evaluates approaches to mitigate refusals in on-premises LLMs. Results show that prompting is effective, but ablating refusal directions (abliteration) eliminates refusals with minimal impact on task performance.
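
Ablating a refusal direction is usually described as projecting that direction out of a hidden state. The sketch below shows only that projection step; how the direction is estimated and which layers are edited are outside its scope, and `ablate_direction` is a hypothetical name.

```python
import math


def ablate_direction(hidden, direction):
    """Remove the component of `hidden` along `direction`: h - (h . r_hat) * r_hat."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    projection = sum(h * u for h, u in zip(hidden, unit))
    return [h - projection * u for h, u in zip(hidden, unit)]
```

After the projection, the hidden state is orthogonal to the ablated direction, while components along every other direction are unchanged.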

tf-refusalbench · over-alignment · abliteration · multilingual · criminal law

Rethinking Molecular Graph Backdoors under Chemistry-aware Admission

arXiv cs.AI · Thinh T. H. Nguyen, Sze Jue Yang, Khoa D. Doan, Chee Seng Chan · 2026-06-22

The study introduces ChemGuard, a chemistry-aware admission protocol for molecular graph neural networks (GNNs) that validates molecular records through parsing, sanitization, and graph-string consistency checks. It demonstrates that many existing graph-based backdoor attacks fail under ChemGuard due to chemical invalidity or representation inconsistency. The authors then propose ChemBack, an admission-aware backdoor attack that constructs chemically feasible motif-anchor attachments and ranks candidates using Tanimoto similarity, achieving high attack success while preserving clean accuracy. Results show ChemGuard suppresses graph-only backdoors, but chemically valid attacks remain a threat.
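
Tanimoto similarity over fingerprint bit sets is a standard measure and can be sketched without a chemistry toolkit. The ranking helper and its descending order are assumptions for illustration; the summary does not say how ChemBack orders candidates.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch when both fingerprints are empty
    return len(a & b) / len(union)


def rank_by_similarity(reference_fp, candidates):
    """Return candidate names sorted by descending similarity to the reference."""
    return sorted(candidates,
                  key=lambda name: tanimoto(reference_fp, candidates[name]),
                  reverse=True)
```

In practice the bit sets would come from a fingerprinting routine in a cheminformatics library rather than being written by hand.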

molecular graph neural networks · backdoor attacks · chemistry-aware admission · tanimoto similarity · chemguard

Adaptive Hard-Soft Physics-Informed Neural Networks for Robust Boundary-Constrained PDE Solving

arXiv cs.AI · Duc Tien Nguyen, Trinh Minh Tuan, Nguyen Duc Manh, Vu Linh Nguyen · 2026-06-22

The study introduces Adaptive Hard-Soft Physics-Informed Neural Networks (HSPINN), a unified framework for robust PDE solving that combines exact boundary enforcement with adaptive soft constraints. Dirichlet and periodic boundaries are enforced exactly via analytical lifting and periodic feature mappings, while PDE residuals and initial conditions are treated as soft constraints with dynamically balanced loss weights using an inverse-share softmax strategy. Applied to Poisson, Burgers, and convection problems, HSPINN demonstrates faster convergence, higher accuracy, and improved stability compared to conventional PINNs.
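
Exact boundary enforcement can be illustrated in one dimension. The sketch assumes the unit interval with Dirichlet values `a` and `b` and uses a textbook lifting ansatz plus sine/cosine periodic features. The paper's actual lifting functions and its inverse-share softmax weighting are not reproduced.

```python
import math


def hard_dirichlet(network, a, b):
    """Wrap `network` so that u(0) = a and u(1) = b hold for any network output."""
    def u(x):
        lift = (1.0 - x) * a + x * b          # satisfies both boundary values
        return lift + x * (1.0 - x) * network(x)  # correction vanishes at x = 0 and x = 1
    return u


def periodic_features(x, period=1.0):
    """Map x to (sin, cos) features so any function of them is periodic in x."""
    angle = 2.0 * math.pi * x / period
    return (math.sin(angle), math.cos(angle))
```

Because the boundary conditions hold by construction, no boundary loss term is needed, which leaves only the PDE residual and initial-condition terms to balance.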

physics-informed neural networks · boundary constraints · adaptive loss weighting · partial differential equations · hard-soft constraints

Field-level weak lensing cosmology with $<100$ simulations using multifidelity simulation-based inference

arXiv cs.AI · Alex A. Saoulis, Kiyam Lin, Niall Jeffrey, Maximilian von Wietersheim-Kramsta · 2026-06-22

The study demonstrates that multifidelity simulation-based inference (SBI) enables accurate field-level cosmological inference from weak lensing shear fields using only 60-100 high-fidelity N-body simulations. By pretraining neural inference models on fast log-normal GLASS simulations and fine-tuning on a small set of high-fidelity simulations, the method achieves well-calibrated posteriors while reducing simulation costs by an order of magnitude. Results show the approach successfully extracts cosmological information beyond standard two-point statistics (e.g., power spectrum) in a realistic KiDS-Legacy mock analysis setting.

simulation-based inference · weak lensing · neural compression · n-body simulations · cosmological inference

Abstract representational geometry supports inference in large language models

arXiv cs.AI · Yunan Zeng, Yuwang Wang · 2026-06-22

The study demonstrates that large language models (LLMs) develop hippocampal-like abstract representational geometry to support inference, mirroring human cognitive mechanisms. Using a contextual reversal-learning paradigm adapted for text, researchers compared human and LLM behavior and internal representations. Results show hierarchical organization: lower layers encode stimulus identity, while higher layers form context-specific manifolds; geometric regularization of these layers enhances generalizable inference, establishing geometry as a mechanistic principle in LLM reasoning.

representational geometry · contextual reversal-learning · manifolds · geometric regularization · hierarchical organization

The Watermark Shortcut: How Provenance Marking Sabotages Audio Deepfake Detection

arXiv cs.AI · Nicolas M. Müller, Pascal Debus · 2026-06-22

The study reveals a critical vulnerability in audio deepfake detection systems that use provenance watermarking: detectors exploit watermarks as spurious shortcuts, leading to three failure modes—generalization degradation, strip-to-evade attacks, and mark-to-frame errors. Through controlled white-box experiments, watermark-trained detectors exhibited severe performance drops (e.g., mark-to-frame increased Equal Error Rate from 16% to 75%), while black-box tests confirmed real speech could be misclassified as fake when watermarked. The authors propose a mitigation by retraining detectors with watermarks applied to both synthetic and human speech, decorrelating the shortcut. A paired dataset (WASP) is released for further research.
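
For readers unfamiliar with the metric, Equal Error Rate can be estimated from detector scores with a simple threshold sweep. The sketch assumes higher scores mean "more likely real speech" and uses a coarse sweep over observed scores; evaluation toolkits typically interpolate instead.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER: the error rate where false accepts and false rejects balance."""
    best_gap, eer = None, None
    for threshold in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False rejection: real speech scored below the threshold.
        frr = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: fake speech scored at or above the threshold.
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```

An EER above 50%, as in the mark-to-frame result quoted above, means the detector does worse than chance.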

provenance watermarking · audio deepfake detection · spurious shortcut · equal error rate · strip-to-evade

VideoAgent: All-in-One Framework for Video Understanding and Editing

arXiv cs.AI · Hengji Zhou, Lingxuan Huang, Jian Wang, Bing Zhou · 2026-06-22

VideoAgent introduces an all-in-one agentic framework for video understanding and editing, addressing limitations in diverse operations and long-video coherence. The framework employs automated video shot creation with shot planning agents and cross-modal retrieval, alongside a multi-agent orchestration system integrating over thirty specialized editing agents. Intent parsing and textual-gradient graph optimization enable complex editing pipelines. Evaluations on the VideoEdit benchmark and public datasets show 87-95% orchestration success rates, 60% API cost reduction, and professional-quality outputs rated only 4% below human-created videos.

agentic framework · cross-modal retrieval · intent parsing · textual-gradient graph · multi-agent orchestration

EHR-Complex: Benchmarking Medical Agents for Complex Clinical Reasoning

arXiv cs.AI · Yitong Qiao, Lei Liu, Yue Shen, Jian Wang · 2026-06-22

The authors introduce EHR-Complex, a benchmark for evaluating interactive clinical database reasoning agents on complex electronic health record (EHR) analysis tasks. Built on MIMIC-IV (365K patients, 31 tables), the benchmark comprises 52K tasks requiring SQL/Python execution with 31.93 structural components per query on average. Evaluation shows top models achieve only 62.3% exact-match accuracy, with pass@4 consistency below 50%, revealing prevalent failure modes in SQL logic, medical-code lookup, and semantic understanding.

electronic health records · clinical reasoning · sql generation · interactive execution · benchmark evaluation

Towards a Bathroom-Centered Human-Building Digital Twin Framework for Indoor Safety Analysis

arXiv cs.AI · Yuanzhi Su, Huiying Hou · 2026-06-22

This study proposes a bathroom-centered human-building digital twin framework to analyze indoor safety for older adults by modeling coupled human-environment interactions. The framework integrates semantic bathroom representations, skeleton-based human motion tracking, spatial-semantic coupling, and interaction-aware analytics to assess risks from wet surfaces, constrained layouts, and posture transitions. A Unity-based prototype demonstrates feasibility for privacy-sensitive aging-in-place applications, advancing beyond isolated hazard identification or activity recognition approaches.

digital twin · skeleton-based tracking · spatial-semantic coupling · interaction-aware analytics · aging-in-place

GIF: Locally Sound Geometric Information Flow Control for LLMs

arXiv cs.AI · Adam Storek, Nikolaus Holzer, Zhuo Zhang, Suman Jana · 2026-06-22

The paper introduces Geometric Information Flow (GIF), a semantic framework for tracking information flow in large language models (LLMs) to address security and privacy risks. GIF leverages the LLM Jacobian and local output geometry to upper-bound Shannon mutual information between input spans and outputs, enabled by automatic differentiation and low-rank approximation. Evaluations on prompt-injection and privacy-leakage benchmarks show near-perfect recall, outperforming attention-based baselines, while reducing token costs by up to 81x compared to LLM-as-judge approaches. GIF's flows transfer across model sizes and families, enabling black-box deployment.

information flow control · llm security · jacobian analysis · shannon mutual information · low-rank approximation

Exposing the Illusion of Erasure in Knowledge Editing for LLMs

arXiv cs.AI · Advik Raj Basani, Anshuman Chhabra · 2026-06-22

The study exposes fundamental vulnerabilities in Knowledge Editing (KE) methods for LLMs, demonstrating that edited knowledge persists and resurfaces under adversarial elicitation. Through mechanistic analysis of popular KE techniques, the authors show that low-rank updates redistribute rather than erase knowledge, acting as suppression mechanisms that reduce but do not eliminate original fact expression. Results reveal edited knowledge occupies narrow, anisotropic loss landscape regions vulnerable to indirect prompting, proving KE methods are inherently bypassable across architectures.

knowledge editing · low-rank updates · adversarial elicitation · loss landscape · suppression mechanisms

Dynamic multi-agent deep reinforcement learning-based pricing and incentivization approach in multimodal transportation networks

arXiv cs.AI · Khadidja Kadem, Mostafa Ameli, Carlos Lima Azevedo, Mahdi Zargayouna · 2026-06-22

The paper proposes a multi-agent deep reinforcement learning framework for dynamic pricing and incentivization in multimodal transportation networks. Two RL agents—a public authority optimizing equity/emissions via transport incentives and a shared mobility provider maximizing revenue through fare adjustments—interact adaptively with demand and congestion patterns. Experiments during a 3-hour morning peak demonstrate 20% cost reduction for commuters, 10% lower emissions, doubled public transport profits, and improved spatial equity when combining both agents' strategies.

multi-agent reinforcement learning · dynamic pricing · transportation equity · emissions reduction · demand-responsive incentives

P-JEPA: Procedural Video Representation Learning via Joint Embedding Predictive Architecture

arXiv cs.AI · Felix Tristram, Stefano Gasperini, Benjamin Killeen, Marcel Walch · 2026-06-22

P-JEPA introduces a backbone-agnostic method for procedural video representation learning by reducing long-duration videos to a dense, frame-aligned action space and predicting pooled masked latent vectors. The approach handles videos over 30 minutes long, addressing long-range dependencies in procedural tasks where distinct actions may appear visually similar. Evaluated on EgoExo4D, EgoProceL, and Assembly101 with VJEPA2.1, TSM, and I3D backbones, P-JEPA improves linear separability, streaming inference, and temporal action segmentation, achieving state-of-the-art fine-grained action classification on EgoExo4D with 10× fewer parameters than LLM-based methods and real-time operation.

procedural video · joint embedding · long-range dependencies · action segmentation · real-time inference

SteerVTE: Seamless Video Text Editing with Style and Glyph Control

arXiv cs.AI · Kai Zeng, Moran Li, Zhengwei Wang, Yingchen Yu · 2026-06-22

SteerVTE introduces a unified framework for precise video text editing by steering a frozen video diffusion model through style and glyph control. The method employs a lightweight text context adapter with style and dual-granularity glyph encoders, enhanced by a glyph-aware spatial-focal loss and a three-stage progressive training curriculum. Evaluations show SteerVTE outperforms existing baselines in text accuracy, style consistency, and temporal coherence, supported by the newly constructed SteerVTE-1M dataset of one million triplets.

video text editing · diffusion transformer · glyph control · temporal coherence · style encoder

HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs

arXiv cs.AI · Yucheng Wu, Jundong Xu, Mingzhen Ju, Yue Yu · 2026-06-22

The paper introduces HOLMES, the first benchmark for higher-order logical reasoning in LLMs, addressing limitations of first-order-centric evaluations. It comprises 1,379 instances combining natural-language problems with higher-order logic formalizations, ground-truth answers, and verifiable reasoning traces across law and finance domains. Experiments reveal LLMs' poor performance (average 50.64%, best 59.54%), with particular weaknesses in scope-conditioned and compositional reasoning. The benchmark exposes shortcut reasoning artifacts in conflict-resolution tasks, highlighting higher-order symbolic reasoning as a critical unsolved challenge for reliable AI systems.

higher-order logic · symbolic reasoning · benchmark · conflict-resolution · compositional reasoning

RS-Gen: A Multi-Stage Agentic Framework for Reasoning and Search-Augmented Image Generation

arXiv cs.AI · Feifei Bian, Zhimin Zheng, Wei Deng, Daiguo Zhou · 2026-06-22

RS-Gen introduces a multi-stage agentic framework for reasoning and search-augmented image generation, addressing limitations in logical reasoning and OOD knowledge handling. The method employs a 'Questioning-and-Solving' closed-loop mechanism to autonomously identify knowledge gaps, plan actions, and execute deep reasoning without additional training. Experiments on WISE Verified and RISEBench show significant improvements, with absolute gains of 0.313 for Qwen-Image and 19.70 for Qwen-Image-Edit-2511, achieving SOTA performance among open-source models.

agentic framework · reasoning-augmented generation · ood knowledge · closed-loop mechanism · multi-stage planning

SPADE: Structure-Prior Adaptive Decision Estimation

arXiv cs.AI · Yifan Wang · 2026-06-22

SPADE introduces a closed-form framework for adaptive decision estimation with structural priors in scientific machine learning, addressing prior misspecification through shrinkage of structure-violating estimates. The method combines an exact specification test, Stein-unbiased James-Stein shrinkage for enforcement strength (with O(σ²/n) oracle guarantee), and a gating mechanism for hard prior commitment. Results demonstrate oracle-tracking performance, 100% structure-selection accuracy, 2.6% regret reduction, and efficient partial law recovery with controlled false relaxation across linear-subspace, conservation law, and Hamiltonian priors.
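
The shrinkage ingredient can be illustrated with the textbook positive-part James-Stein estimator, applied here to a hypothetical vector of structure-violating components with known noise variance. This is the classical formula only; SPADE's specification test, gating, and exact shrinkage target are not shown.

```python
def james_stein_shrink(violation, sigma2):
    """Positive-part James-Stein shrinkage toward zero (requires dimension >= 3).

    factor = max(0, 1 - (p - 2) * sigma2 / ||violation||^2)
    """
    p = len(violation)
    if p < 3:
        raise ValueError("James-Stein shrinkage needs at least 3 dimensions")
    squared_norm = sum(v * v for v in violation)
    if squared_norm == 0.0:
        return list(violation)
    factor = max(0.0, 1.0 - (p - 2) * sigma2 / squared_norm)
    return [factor * v for v in violation]
```

When the violation is small relative to the noise, the factor reaches zero and the structural prior is enforced fully; a large violation is left almost untouched.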

structure-prior adaptation · shrinkage estimation · specification testing · oracle guarantee · hamiltonian prior

MuPPET: A Benchmark for Contextual Privacy of LLM Assistants in Multi-Party Conversations

arXiv cs.AI · Elena Sofia Ruzzetti, Cornelius Emde, Sangdoo Yun, Seong Joon Oh · 2026-06-22

The authors introduce MuPPET (Multi-Party Privacy Exposure Testing), the first benchmark for evaluating contextual privacy risks in multi-party conversations involving LLM assistants. Unlike existing benchmarks focused on single-interlocutor settings, MuPPET addresses the heightened privacy challenges in group environments where any sensitive data shared must be appropriate for all recipients. Experiments reveal that LLMs leak significantly more private information in multi-party contexts compared to one-to-one interactions, with frontier models and smaller open-weights models particularly vulnerable. Current contextual privacy defenses provide only partial mitigation, degrade utility, and fail to resolve the core party-tracking issue.

contextual privacy · multi-party conversations · llm assistants · privacy exposure · party-tracking

Where Is My Physics Wrong? Localized and Identifiable Discovery of Model Discrepancy

arXiv cs.AI · Yifan Wang · 2026-06-22

LISDD introduces a framework for localized, identifiable sparse discovery of model discrepancy, addressing the challenge of pinpointing where and how physical models fail. The method fits known physics on a detected clean regime, flags discrepant regions using a calibrated residual-energy statistic, selects local missing terms via exhaustive holdout over a candidate library, and confirms significance with a sample-split $F$-test. Experiments demonstrate LISDD's effectiveness: it maintains physical-parameter bias at 0.002, improves localization $F_1$ from 0.44 to 0.80, achieves exact detection, and controls multi-region false-discovery rates while recovering all planted mechanisms. This provides a calibrated diagnostic tool for grey-box building-energy models.

localized discrepancy · sparse discovery · residual-energy statistic · sample-split f-test · false-discovery rate

The Correct Answer Trap: Pedagogically-Grounded Detection and Feedback for Hidden Misconceptions

arXiv cs.AI · Moiz Imran, Sahan Bulathwela · 2026-06-22

The paper introduces a pedagogically-grounded framework for detecting and addressing hidden misconceptions in automated tutoring systems, where students reach correct answers via flawed reasoning. Analyzing 20,964 student responses from Eedi, the authors find fine-tuned classifiers achieve only 57% detection accuracy, while an open-weight reasoning model improves to 84% but suffers from high false positives (8:1 ratio). They propose a detect-verify-escalate pipeline with graduated assessment rubrics, implemented via teacher dashboards for review queue filtering or autonomous tutors triggering diagnostic follow-up questions.

hidden misconceptions · automated feedback · reasoning model · graduated assessment · diagnostic follow-up

When Does Intrinsic Self-Correction Help? A Task-Sensitive Analysis

arXiv cs.AI · Elroy Stav, Dvir Berlowitz, Maayan Orner, Sarit Kraus · 2026-06-22

This work provides a task-sensitive analysis of intrinsic self-correction (SC) in large language models, demonstrating its effectiveness depends on task structure. The study examines SC mechanisms across three settings: verifying explicit constraints, revisiting complex reasoning, and resolving competing strategies in word games. Experiments on multiple benchmarks and models show SC yields consistent gains when task structure supports these revision modes, suggesting SC is a task-dependent inference-time strategy rather than a universally reliable improvement method.

intrinsic self-correction · task-sensitive analysis · large language models · inference-time strategy · revision mechanisms

Memory Contagion: Cross-Temporal Propagation of Evaluator Bias via Agent Memory

arXiv cs.AI · Zewen Liu · 2026-06-22

The paper identifies Memory Contagion, a phenomenon where evaluator bias propagates across temporal interactions via LLM agent memory systems, even with perfect consolidation. Through experiments with two bias types (length preference, authority bias) across four phases, the authors demonstrate that biased input alone suffices for contagion, consolidation affects bias types differently (attenuating length bias while potentially amplifying authority bias), and no safe contamination threshold exists (detection at p=0.2). The work exposes vulnerabilities in agent memory designs and provides formal measurement tools for cross-temporal bias propagation.

memory contagion · evaluator bias · llm agents · cross-temporal propagation · memory consolidation

Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?

arXiv cs.AI · Anmol Goel, Iryna Gurevych · 2026-06-22

The paper introduces AgentCIBench, an evaluation framework for assessing contextual integrity violations in computer-use agents (CUAs) that operate across personal applications. The benchmark tests three failure modes: visual co-location, task-ambiguity overshare, and recipient misalignment, using deterministic scoring. Evaluation of 15 frontier agents reveals significant privacy risks, with 11 agents leaking information in over 50% of scenarios (average leakage: 67.9%), persisting even in end-to-end task execution. The authors release AgentCIBench to promote safer CUA development through pre-deployment contextual disclosure testing.

computer-use agents · contextual integrity · privacy risk · evaluation benchmark · disclosure testing

DART: Draft-Agreement Routing for Training-Free Adaptive Thinking Budgets in Hybrid Reasoning Models

arXiv cs.AI · Jungseob Lee, Seongtae Hong, Seungjun Lee, Jaehyung Seo · 2026-06-22

The paper introduces DART, a training-free routing framework for hybrid reasoning models that dynamically allocates thinking budgets per query. The method samples two cheap no-think drafts, routes directly when drafts agree, and predicts a thinking budget from draft entropy when they disagree. Results show DART improves accuracy by up to +9.0 points on math reasoning and +22.5 points on code reasoning while reducing thinking tokens by 15-69% and 51-63% respectively, across model scales (0.6B-32B) and families without labeled data or gradient updates.

hybrid reasoning · training-free routing · thinking budget · draft entropy · adaptive computation
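
The routing rule in the DART summary can be sketched roughly as follows. The agreement test (exact string match), the entropy-to-budget mapping (linear and clipped), and all names and default values here are assumptions for illustration; the summary does not specify how the paper computes them.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def route(answer_a, answer_b, draft_entropies,
          min_budget=512, max_budget=8192, max_entropy=2.0):
    """Return (answer, thinking_budget) for one query.

    answer_a, answer_b: final answers of two cheap no-think drafts.
    draft_entropies: per-token entropies collected while drafting.
    A budget of 0 means the agreed draft answer is returned directly.
    """
    if answer_a == answer_b:
        return answer_a, 0                    # drafts agree: skip thinking
    mean_h = sum(draft_entropies) / len(draft_entropies)
    frac = min(1.0, max(0.0, mean_h / max_entropy))
    budget = int(min_budget + frac * (max_budget - min_budget))
    return None, budget                       # caller re-runs with thinking
```

A caller would invoke the model in thinking mode with `budget` tokens only when the first element is `None`, which is what keeps the scheme free of training and labels.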

Interpretable Probabilistic Medical Image Segmentation via Gaussian Process with Explicit Modelling of Annotation Bias and Variability

arXiv cs.AI · Qi Li, Yuliang Huang, Shaheer U. Saeed, Qianye Yang · 2026-06-22

The authors propose a logit-space probabilistic segmentation framework using stochastic variational Gaussian Process to explicitly model annotator-specific bias and variance in medical image segmentation. The method decomposes predictions into an image-dependent reference logit distribution and annotator-specific perturbations, enabling direct analysis of inter- and intra-rater variability. Evaluated on a multi-annotator dataset, the approach maintains segmentation accuracy while improving uncertainty calibration, with learned parameters quantitatively reflecting annotator behavior and controlled experiments demonstrating systematic influence on predictions.

probabilistic segmentation · gaussian process · annotation bias · uncertainty calibration · multi-rater variability

Decomposing Financial Market Dynamics via Mechanism Analysis in an Evolutionary Multi-Agent Simulation

arXiv cs.AI · Zhibao Chen · 2026-06-22

This study decomposes financial market dynamics by isolating four mechanisms in an evolutionary multi-agent simulation with 120 heterogeneous behavioral agents. Using a coevolving, endogenous-price simulator, the authors conduct matched 3x20-seed interventions with pluggable mechanisms: selection, price feedback, behavioral bias, and consensus network topology. Results show separable control over emergent properties: selection increases strategy diversity (Δ entropy +0.27 to +1.12 bits), price feedback enhances realism (Δ_5=+0.13,+0.20,+0.20), behavioral bias raises fragility (Δ=+10.5,+11.1,+14.4), while consensus topology exhibits no robust effect. The contribution is a mechanistic decomposition demonstrating distinct knobs for diversity, realism, and fragility.

evolutionary simulation · agent-based markets · endogenous-price · strategy diversity · mechanism analysis

LLM-Aided A* Search in Non-Geometric Network Graphs

arXiv cs.AI · Nouf Alabbasi, Esraa Ghourab, Omar Alhussein · 2026-06-22

The paper introduces an LLM-aided A* algorithm for shortest-path searches in non-geometric network graphs, where traditional geometric heuristics are unavailable. The method leverages LLM-generated intermediate waypoints guided by landmark distances, serving as both admissible heuristics and structural features. Experiments on graphs with up to 2,000 nodes show a 50% reduction in expanded nodes with minimal path cost increase. Prompt engineering analysis reveals that incorporating heuristic estimates outperforms advanced prompting techniques, highlighting LLM-classical algorithm synergy for network optimization.

a* search · landmark distances · non-geometric graphs · llm guidance · heuristic estimates
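
To illustrate how landmark distances can stand in for geometric heuristics, here is a sketch of the classical landmark (triangle-inequality) lower bound plugged into A*. It covers only the classical part: the LLM-generated waypoints described in the summary are not modeled, and the graph format (an undirected, connected dict of dicts) and function names are assumptions of this sketch.

```python
import heapq

INF = float("inf")

def dijkstra(graph, source):
    """Exact shortest distances from source; graph[u][v] = edge weight."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, INF):
            continue                          # stale queue entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, INF):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def astar(graph, start, goal, landmark_dists):
    """A* with h(v) = max_L |d(L, goal) - d(L, v)|, a lower bound on d(v, goal)."""
    def h(v):
        return max((abs(ld[goal] - ld[v]) for ld in landmark_dists), default=0.0)

    best = {start: 0.0}
    heap = [(h(start), 0.0, start)]
    expanded = 0
    while heap:
        _, g, u = heapq.heappop(heap)
        if g > best[u]:
            continue                          # stale queue entry
        expanded += 1
        if u == goal:
            return g, expanded
        for v, w in graph[u].items():
            if g + w < best.get(v, INF):
                best[v] = g + w
                heapq.heappush(heap, (g + w + h(v), g + w, v))
    return INF, expanded
```

Landmark tables are computed once with `dijkstra(graph, landmark)` and reused across queries; passing an empty list makes `astar` behave like plain Dijkstra, which gives a baseline for counting expanded nodes.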

Managing Procedural Memory in LLM Agents: Control, Adaptation, and Evaluation

arXiv cs.AI · Julia Belikova, Rauf Parchiev, Evgeny Egorov, Grigorii Davydenko · 2026-06-22

The paper introduces AFTER, a benchmark of 382 enterprise tasks across six professional roles and 22 procedural skills, to evaluate procedural memory in LLM agents. It tests skill transfer across tasks, roles, and model backbones through controlled settings. Results show procedural memory improves industrial workflows by 3.7-6.7 points per refinement round, with cross-model accuracy reaching 73.1%. Skills exhibit varying generalization patterns, with some broadly applicable and others role-specific, providing actionable insights for deploying procedural memory systems.

procedural memory · llm agents · skill transfer · enterprise tasks · cross-model generalization

AI-Empowered UAV-Assisted Backscatter Localization and ISAC for Zero-Energy IoT: A Comprehensive Survey

arXiv cs.AI · Ruhul Amin Khalil · 2026-06-22

The paper presents a comprehensive survey on AI-empowered UAV-assisted backscatter localization and integrated sensing and communication (ISAC) for zero-energy IoT. It employs a PRISMA-informed methodology to develop a unified taxonomy covering network architectures, UAV roles, backscatter modes, RF sources, localization and sensing functions, AI techniques, and performance metrics. The survey includes comparative tables, quantitative trend analysis, coverage evaluation, and tutorial-style numerical illustrations. Key findings highlight the potential of UAVs to mitigate limitations of backscatter communication, such as weak reflections and double-path loss, while enabling energy-neutral operation. Open challenges include realistic channel modeling, scalable AI, security, privacy, and integration with emerging technologies like RIS, MEC, digital twins, and 6G.

backscatter communication · unmanned aerial vehicles · integrated sensing and communication · zero-energy iot · ai-driven optimization

PRIDE: Privileged Information-enhanced Distillation for Empathetic Dialogue Generation

arXiv cs.AI · Jiaqiang Wu, Zhouan Zhu, Shangfei Wang · 2026-06-22

The paper proposes PRIDE, a privileged information-enhanced knowledge distillation method for empathetic dialogue generation that transfers nuanced empathetic reasoning from large to small models without inference-time overhead. The method employs (1) empathy-reasoning prompts for step-by-step teacher decomposition, (2) multi-source attention for privileged information integration, and (3) dual-alignment loss combining reversed KL divergence and maximum mean discrepancy. Experiments on multi-modal and text-only datasets show PRIDE achieves competitive performance, sometimes matching or surpassing teacher models in accuracy and semantic relevance.

knowledge distillation · privileged information · empathy-reasoning prompt · multi-source attention · dual-alignment loss

A Matter of Time: Towards a General Theory of Agency

arXiv cs.AI · Amahury J. López-Díaz, Carlos Gershenson · 2026-06-22

This paper develops a graded organizational theory of agency by integrating relational biology, physical biosemiotics, and process ontology through temporally parametrized (F, A)-systems. The authors argue that self-referential closure must be temporally contextualized, leading to an asynchronous dependency structure formalized as a history-dependent, revisable Asynchronous Dynamic Bayesian Network. This framework distinguishes autonomy, goal-directedness, agency, and open-endedness, with agency emerging from anticipatory structures modulating organism-environment coupling. The theory reconciles Rosennean anticipation with organizational closure, reinterprets computational enactivism, and outlines a hierarchy from proto-agential systems to fully semantically closed agents, with implications for multicellular organisms, synthetic lifeforms, and neuroscience.

asynchronous dynamic bayesian network · organizational closure · rosennean anticipation · self-referential closure · temporally parametrized systems

Self-Evolution for Multi-Turn Tool-Calling Agents via Divergence-Point Preference Learning

arXiv cs.AI · Jiaqiang Tang · 2026-06-22

The paper introduces ToolGraph, a self-evolution framework for multi-turn tool-calling agents that combines schema-derived topology, transition weights from successful rollouts, and history-aware controls to improve tool selection and preference updates. The method constructs 161 preference pairs via divergence-point identification and trains with DPO (Direct Preference Optimization) under the same ToolGraph context used at inference. On 375 tau2-bench tasks, ToolGraph+DPO achieves a 16.8% relative improvement in weighted average reward (0.355 vs. 0.304 baseline), with notable gains in airline and retail domains, while diagnostics reveal step budget exhaustion in telecom trajectories and reward positivity as the most useful checkpoint signal.

multi-turn tool-using agents · divergence-point preference learning · toolgraph · dpo · tau2-bench
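
One plausible reading of "divergence-point identification" in the ToolGraph summary is sketched below: compare a successful and a failed rollout for the same task, find the first step where they differ, and turn that step into a preference pair. This reading, the trajectory format (lists of tool-call names), and the function names are assumptions of the sketch. The last helper simply rechecks the reported relative improvement.

```python
def divergence_point(success, failure):
    """Index of the first step where the two rollouts differ, else None."""
    for i, (good, bad) in enumerate(zip(success, failure)):
        if good != bad:
            return i
    return None

def preference_pair(success, failure):
    """Build a (context, chosen, rejected) record at the divergence point."""
    i = divergence_point(success, failure)
    if i is None:
        return None                           # no usable disagreement
    return {"context": success[:i], "chosen": success[i], "rejected": failure[i]}

def relative_gain(baseline, improved):
    """Relative improvement of improved over baseline."""
    return (improved - baseline) / baseline
```

With the figures quoted in the summary, `relative_gain(0.304, 0.355)` is about 0.168, which matches the stated 16.8% relative improvement.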

TTFT-Aware Graph Chain-of-Thought: Distance-Indexed Neural A* for Low-Hallucination Multi-Hop Medical Reasoning

arXiv cs.AI · Bechir Dardouri, Kaïs Zhioua, Yassine Msaddak · 2026-06-22

The paper introduces TTFT-Aware Graph Chain-of-Thought, a production-grade GraphRAG system for low-hallucination medical reasoning. The method combines Pruned Landmark Labeling (PLL) for sub-millisecond distance queries and path enumeration with AStarNet, a lightweight heuristic that operates within PLL constraints to prioritize clinically plausible expansions. Evaluated on fertility-focused queries, the hybrid approach improves recall-latency trade-offs over text-only RAG, reduces Time to First Token (TTFT), and lowers clinician-audited hallucinations while maintaining explanation clarity in a 700K-node medical knowledge graph.

graphrag · pruned landmark labeling · astarnet · multi-hop reasoning · time to first token

ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation

arXiv cs.AI · Chen Lin, Kedi Chen, Wei Zhang · 2026-06-22

ReNIO introduces a method for reweighting negative trajectory importance in on-policy distillation (OPD) for LLMs, addressing an asymmetry in which incorrect student-generated outputs (SGOs) provide a more useful training signal than correct ones. By leveraging the student-to-teacher probability ratio, ReNIO identifies pivotal tokens leading to wrong reasoning traces and aggregates their information into normalized sample weights, inherently prioritizing likely negative trajectories without requiring full-answer rollouts. This preserves OPD's prefix training advantage over full-rollout reinforcement learning. Empirical results show ReNIO improves OPD and on-policy self-distillation (OPSD) by up to 8.90% for Qwen3-1.7B and 10.00% for R1-Distill-Qwen-7B on mathematical reasoning benchmarks.

on-policy distillation · student-generated outputs · negative trajectory · prefix training · reasoning traces

Cognitive Digital Twins: Ethical Risks and Governance for AI Systems That Model the Mind

arXiv cs.AI · Vamshi Krishna Bonagiri, Juan Nicolas Sepulveda-Arias, Abdoul Jalil Djiberou Mahamadou, Monojit Choudhury · 2026-06-22

The paper introduces cognitive digital twins (CDTs) as AI systems that model individual cognition through dynamic computational representations. It proposes a 5A governance framework (authority, autonomy, access and control, accountability, availability) to address CDT-specific risks like misrepresentation and proxy-power asymmetries. The analysis identifies governance gaps and recommends requirements for high-risk CDTs, emphasizing consent, validity, and model retirement. Unlike existing frameworks focused on data processing or autonomous actions, CDT governance must regulate cognitive representation itself.

cognitive digital twins · 5a governance · proxy-power asymmetries · epistemic authority · model retirement

FLFL: Federated Latent Factor Learning for Private Recovery of Spatio-Temporal Signals

arXiv cs.AI · Chengjun Yu, Di Wu, Yi He, Jia Chen · 2026-06-22

The paper proposes Federated Latent Factor Learning (FLFL), a privacy-preserving method for spatio-temporal signal recovery in wireless sensor networks (WSNs). FLFL employs a sensor-level federated learning framework that shares only gradient information instead of raw data, while incorporating spatio-temporal correlations as regularization to enhance accuracy. Evaluated on four real-world WSN datasets, FLFL outperforms eight state-of-the-art federated and non-federated baselines in recovery accuracy while preserving data privacy.

federated learning · latent factor learning · spatio-temporal signals · privacy-preserving · wireless sensor networks

AdaReP: Adaptive Re-Planning under Model Mismatch for Neural World-Model Predictive Control

arXiv cs.AI · Yutian Cheng, Xiaojian Ma, Xianhao Wang, Min Yang · 2026-06-22

AdaReP introduces an adaptive replanning framework for model predictive control (MPC) with neural world models that reduces computational overhead without modifying the learned world model or planner. The method leverages a perturbation-based dynamic-regret analysis to quantify the trade-off between stale-plan penalties and replanning frequency, adapting the replanning tolerance online based on current deviation from cached rollouts and local sensitivity estimates. Empirical evaluations across image-space planning, latent-space control, and real-world robotic manipulation demonstrate significant computational savings, including over 80% fewer planner queries in a 50-trial physical robot study, while maintaining task performance.

model predictive control · dynamic-regret · replanning tolerance · neural world models · computational overhead
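
The event-triggered idea in the AdaReP summary (reuse a cached plan until the observed state drifts too far from what the plan predicted) might look like the sketch below. The tolerance rule `base_tol / (1 + sensitivity)` is a placeholder chosen for this illustration, not the regret-derived rule from the paper, and `plan_fn` and `sensitivity_fn` are hypothetical callables supplied by the user.

```python
import math

def deviation(state, predicted):
    """Euclidean distance between observed and predicted state."""
    return math.sqrt(sum((s - p) ** 2 for s, p in zip(state, predicted)))

def count_planner_queries(observed_states, plan_fn, sensitivity_fn, base_tol=0.1):
    """Replay an episode and count how often the planner is invoked.

    plan_fn(state) returns the predicted states for the following steps.
    sensitivity_fn(state) returns a non-negative local sensitivity estimate;
    higher sensitivity tightens the tolerance (placeholder rule).
    """
    queries = 0
    cached = []                               # predictions left in current plan
    for state in observed_states:
        need_replan = not cached
        if cached:
            predicted = cached.pop(0)
            tol = base_tol / (1.0 + sensitivity_fn(state))
            need_replan = deviation(state, predicted) > tol
        if need_replan:
            cached = list(plan_fn(state))
            queries += 1
    return queries
```

With an accurate world model the planner is queried once per plan horizon; under model mismatch the count rises toward one query per step, which is the trade-off the tolerance controls.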

Safety in Self-Evolving LLM Agent Systems: Threats, Amplification, and Case Studies

arXiv cs.AI · Ruixiao Lin, Xinhao Deng, Qingming Li, Jianan Ma · 2026-06-22

This paper systematically analyzes security and privacy threats in self-evolving LLM agent systems through the Module-Lifecycle Attack Surface (MLAS) matrix, decomposing the attack surface into 25 cells across five functional modules and lifecycle stages. The analysis reveals 17 critically threatened cells with no effective mitigation and identifies seven synergistic amplification effects. Case studies demonstrate that evolution-native designs activate 3.5× more attack surface cells, achieve 100% attack persistence, and render static defenses ineffective, blocking only 2.5% of attacks. The findings highlight that self-evolution converts session-bounded attacks into lineage-persistent threats and necessitates evolution-aware security frameworks.

self-evolving llm · module-lifecycle attack surface · amplification effects · attack persistence · formal verification

Attention-Spectrum Regularization for Replay-Free Continual Multimodal LLMs

arXiv cs.AI · Chuangxin Zhao, Canran Xiao, Siyuan Ma, Mengyao Lyu · 2026-06-22

The paper introduces Attention-Spectrum Regularization (ASR), a replay-free continual learning framework for multimodal large language models (MLLMs) that preserves skill-conditioned cross-modal attention structures. ASR treats cross-attention maps as 2D signals, compresses their spectral properties into skill-wise prototype distributions, and uses a phase-invariant regularizer to constrain harmful drift during adaptation. Theoretical analysis links spectral drift control to forgetting under spectral sufficiency assumptions. Experiments on VQA v2, VQACL, CLT-VQA, CoIN, and UCIT benchmarks demonstrate ASR's superiority over replay-, regularization-, and adapter-based baselines in final performance and forgetting reduction.

attention-spectrum regularization · multimodal llms · continual learning · cross-modal attention · spectral drift

MotionHalluc: Diagnosing Kinematic Hallucinations in Fine-Grained Motion Reasoning

arXiv cs.AI · Weile Guo, Shenghong He, Danying Mo, Chengdong Xu · 2026-06-22

MotionHalluc introduces a benchmark for diagnosing kinematic hallucinations in fine-grained motion reasoning across video pairs, comprising 1540 questions over 553 video pairs. It evaluates hallucinations along three dimensions: directional, attributional, and temporal. The authors propose Perceive-Parse-Verify (PPV), a training-free method that extracts and verifies kinematic measurements, converting instructions into executable queries. Evaluations on state-of-the-art large multimodal models reveal high susceptibility to hallucinations, while PPV improves performance by an average of 10.6%, highlighting the importance of explicit quantitative measurements in reducing hallucinations.

kinematic hallucinations · fine-grained motion reasoning · perceive-parse-verify · multimodal models · measurement extraction

From Text Metrics to Model Internals: A Study of Whisper ASR Hallucination Detection

arXiv cs.AI · Jan Jasiński, Mateusz Barański, Julitta Bartolewska, Marcin Witkowski · 2026-06-22

The paper presents a comprehensive study of hallucination detection in Whisper large v3 ASR, comparing text-based metrics, LLM-based methods, and internal decoder state probing. Text classifiers achieve high recall but require reference transcripts, while conditioned LLM prompts improve precision but underperform lightweight text methods. Probing decoder representations reveals hallucination traits encoded in intermediate layers, yielding superior reference-free detection. A late-fusion meta-classifier combining text and internal-state features achieves the best performance (specific metrics not provided).

asr hallucination · whisper large v3 · decoder state probing · late-fusion classifier · in-context learning

Some Results about the Expressivity of Preference-Incomplete Structured Argumentation Frameworks

arXiv cs.AI · Antonio Yuste-Ginel · 2026-06-22

The paper analyzes the expressive power of ASPIC$^+$ argumentation frameworks with uncertain preference profiles through comparison with abstract formalisms featuring uncertain defeats. Using formal methods, it primarily establishes negative results, some theoretically unexpected, regarding expressivity limits. The work also proposes a conjectured positive threshold for uncertain preference expressivity and provides foundational proofs toward validating this conjecture.

aspic$^+$ · argumentation frameworks · uncertain preferences · expressive power · formal comparison

Physics-governed executable modelling of triboelectric nanogenerators

arXiv cs.AI · Hongfa Zhao, Baiqiao Wang, Tiancong Zhao, Chun Jin · 2026-06-22

The authors present TENG-CLAW, a physics-governed simulation platform unifying fragmented approaches for triboelectric nanogenerator (TENG) modeling. The framework establishes a charge-defined electrostatic hierarchy connecting analytical theories with finite-geometry numerical formulations, using triboelectric charges, pre-charging charges, and electrode charges as state variables. TENG-CLAW converts user requests into traceable simulation tasks with explicit charge states, boundary conditions, and reusable artifacts across multiple workflows. This provides a rigorous computational basis for TENG mechanism interpretation and reproducible device design.

triboelectric nanogenerators · electrostatic hierarchy · finite-geometry solvers · charge-defined modeling · simulation workflows

Training Open Models for Agentic Phone Use

arXiv cs.AI · Zhengyang Tang, Xin Lai, Pengyuan Lyu, Xinyuan Wang · 2026-06-22

PhoneBuddy introduces a training recipe for open models in agentic phone use, combining real-app and mock-app (PhoneWorld) environments through shared supervised fine-tuning and mixed reinforcement learning. The method leverages PhoneWorld's ability to reconstruct runnable mock apps from real GUI usage structure, addressing scalability and reset challenges. Evaluated on 150 real-phone tasks, task success improved from 36.67% (supervised fine-tuning) to 45.33% (mixed RL), with AndroidWorld performance rising from 60.3% to 83.2%. Results demonstrate mock-app training complements real-app RL, particularly for app/mini-app tasks, while cross-app workflows remain challenging.

agentic phone use · supervised fine-tuning · mixed reinforcement learning · mock-app environment · gui usage structure

HALAS: A Human-Annotated Dataset of Hallucinations of Modern ASR Systems

arXiv cs.AI · Mateusz Barański, Jan Jasiński, Julitta Bartolewska, Marcin Witkowski · 2026-06-22

The paper introduces HALAS, the first human-annotated dataset of naturally occurring hallucinations in seven state-of-the-art ASR systems, using real earnings call recordings. It provides span-level labels for analyzing hallucination patterns and severity, revealing vocabulary overlap across models and hallucinations even in low-WER transcripts. Benchmark results show character/semantic-level metrics achieve 81% ROC-AUC for detection, while current methods only reach 53.1% F1 score, establishing HALAS as the first rigorous non-artificial benchmark for ASR hallucination detection.

asr · hallucination · benchmark · wer · roc-auc

Domain Adaptation Under Wireless Network Constraints: When Does It Become Green?

arXiv cs.AI · Illyyne Saffar, Aurélie Boisbunon, Shruti Bothe · 2026-06-22

This work investigates the energy efficiency of Unsupervised Domain Adaptation (UDA) compared to single-task training in 6G wireless networks, considering both computational costs and labeling effort. The authors propose a method to determine the minimum number of target domains required for UDA to become more energy-efficient than retraining from scratch. Results provide insights into when UDA should be preferred over traditional approaches from an energy and labeling-aware perspective, addressing practical challenges in deploying data-driven models under frequent distribution shifts.

unsupervised domain adaptation · 6g wireless networks · energy efficiency · distribution shifts · labeling cost

Prime Fourier Embeddings: A Principled Basis for Modular Arithmetic

arXiv cs.AI · Hyunsang Hwang, Suhyun Bae, Donghun Lee · 2026-06-22

The paper introduces Prime Fourier Embeddings (PFE), a structured integer representation using prime-indexed (cos, sin) pairs derived from harmonic analysis of Q, enabling modular arithmetic via prime-channel selection. The method leverages Schur's lemma to enforce block-diagonal linear maps (one per prime) and the Chinese Remainder Theorem for square-free composite moduli. Empirical results demonstrate 500× specialization ratios between relevant and irrelevant channels and perfect in-distribution accuracy on square-free composites.

prime fourier embeddings · modular arithmetic · schur's lemma · chinese remainder theorem · harmonic analysis
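
A small sketch of why prime-indexed (cos, sin) pairs support modular arithmetic. It assumes each prime p gets one channel holding the angle 2πn/p (stored as a complex number, that is, a (cos, sin) pair); the paper's exact construction may differ. Addition of integers becomes a rotation in each channel, and a square-free modulus is handled by selecting its prime channels and recombining residues with the Chinese Remainder Theorem. All function names are hypothetical.

```python
import cmath
import math

PRIMES = (2, 3, 5, 7)

def embed(n, primes=PRIMES):
    """One unit complex number (cos, sin pair) per prime channel."""
    return {p: cmath.exp(2j * math.pi * n / p) for p in primes}

def add(e1, e2):
    """Integer addition acts as a rotation in every channel."""
    return {p: e1[p] * e2[p] for p in e1}

def residue(z, p):
    """Read n mod p back from the channel angle."""
    angle = cmath.phase(z) % (2 * math.pi)
    return round(angle * p / (2 * math.pi)) % p

def decode_mod(e, primes):
    """Value modulo the product of the selected primes, via CRT."""
    modulus = math.prod(primes)
    total = 0
    for p in primes:
        m = modulus // p
        total += residue(e[p], p) * m * pow(m, -1, p)  # modular inverse of m mod p
    return total % modulus
```

For example, `decode_mod(add(embed(13), embed(9)), (3, 5))` reads only the channels for 3 and 5 and returns 7, which is 22 mod 15; the channels for 2 and 7 are irrelevant to that modulus.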

The Model as One Rater Among Several: Measuring Political Positions in Data-Sparse Regions with a Language-Model Panel

arXiv cs.AI · Tarek Gara · 2026-06-22

The paper introduces a method for measuring political positions in data-sparse regions by treating a large language model as one rater in a panel, analogous to expert surveys. It employs a panel approach with written axis definitions, achieving improved inter-rater agreement (mean absolute gap reduced from 2.81 to 2.50; r from 0.81 to 0.89). Results show high reliability (Krippendorff's alpha 0.86) across nine models, with informative disagreements highlighting referent problems. The method is validated in the Middle East and North Africa, with released instruments and data.

political positions · language-model panel · krippendorff's alpha · inter-rater agreement · referent problem

EvoRubrics: Dynamic Rubrics as Rewards via Adversarial Co-Evolution for LLM Reinforcement Learning

arXiv cs.AI · Hongxin Ding, Baixiang Huang, Yue Fang, Weibin Liao · 2026-06-22

The paper introduces EvoRubrics, a co-evolutionary RL framework where a Policy LLM and Rubric Generator improve through adversarial interaction, addressing reward saturation in rubric-based reinforcement learning. The method dynamically updates evaluation criteria at fine granularity without external models or ground truth, creating an automatic curriculum. Experiments demonstrate consistent improvements over static and dynamic rubric baselines across benchmarks, with the Rubric Generator generalizing as a transferable reward model. Notably, a fully self-supervised variant achieves meaningful gains, showing co-evolution alone provides sufficient learning signals.

reinforcement learning · dynamic rubrics · co-evolution · reward model · automatic curriculum

IPO Finance Agent: Evaluation of LLM Financial Analysts beyond Finance Agent v2, with Automated Rubric Generation -- the Case of the SpaceX (SPCX) IPO

arXiv cs.AI · Mostapha Benhenda · 2026-06-22

The IPO Finance Agent extends Finance Agent v2 by addressing IPO due diligence challenges through enhanced task domain and retrieval architecture. It introduces contextual retrieval for long documents like SEC S-1 filings and builds a dataset of 1,000 IPO-diligence questions, releasing 70 SpaceX S-1 questions publicly. An evaluator-optimizer pipeline automates rubric generation, auditing candidate facts for omissions, hallucinations, and redundancy, with iterative LLM feedback. Results show Alibaba Qwen 3.7 Max achieves 79.4% accuracy at $0.30 per query, and Xiaomi MiMo-2.5 Pro reaches 76.8% at $0.05, outperforming Finance Agent v2's benchmarks.

ipo due diligence · contextual retrieval · sec s-1 filings · evaluator-optimizer pipeline · llm feedback

From numerical proportions to analogical proportions between probabilities

arXiv cs.AI · Henri Prade, Gilles Richard · 2026-06-22

The paper investigates analogical proportions between probabilities, extending prior work on vector-based analogies to probabilistic representations. It examines definitions based on arithmetic and geometric proportions, considering both single probability values and normalized distributions. The study explores whether analogical proportions between profiles induce analogous proportions in their associated probability distributions. Experimental analysis is conducted to validate this hypothesis, leveraging discrete attribute distributions derived from profile frequencies. Results suggest that analogical proportions in profile representations may indeed transfer to their corresponding probabilistic distributions.

analogical proportions · probability distributions · arithmetic proportion · geometric proportion · discrete attributes

Physics-Guided Spatiotemporal State Space Modeling for Lookahead Molten Pool Segmentation in Laser Wire-Feed Welding

arXiv cs.AI · Sen Li, Haichao Cui, Changhao Yin, Chendong Shao · 2026-06-22

The paper introduces WeldMamba, a physics-guided spatiotemporal state space network for lookahead molten pool segmentation in laser wire-feed welding. The model integrates historical coaxial grayscale images, welding parameters, and wire-state electrical signals to predict future semantic layouts of keyhole, wire, and molten pool regions. It employs a visual encoder, process-conditioned feature normalization, patch-level temporal state space modeling, and a motion-aware mask decoder, achieving 74.63% mIoU at 500 ms lookahead on a 43-sequence dataset. Key contributions include temporal history modeling, patch-level state space processing, and keyhole motion awareness.

weld-pool segmentation · state space modeling · laser welding · temporal consistency · semantic prediction

A Stackelberg Framework for Resource-Aware LLM Agents: Learning, Repair, and Conditional Guarantees

arXiv cs.AI · Baoxun Wang · 2026-06-22

The paper proposes a Stackelberg game framework for resource-aware LLM agents, where a controller optimizes quality targets and cost incentives while an executor manages resource allocation across context, prompting, and tool usage. The method involves learning a conditional response model, optimizing a leader policy, and repairing it via real-API calibration and projection onto an empirically selected action set. Theoretical analysis provides conditional guarantees for equilibrium existence and stability, while experiments on 300 turns show a 17.4% reduction in mean token cost without significant quality degradation ($p=0.44$).

stackelberg game · resource allocation · llm agents · conditional guarantees · real-api calibration

ScalingAttention: Discovering Intrinsic Sparse Attention Topology for Video Diffusion Transformers

arXiv cs.AI · Ruiliang Zhou, Xuecheng Wu, Kang He, Guangyun Han · 2026-06-22

ScalingAttention introduces a training-free framework for efficient video diffusion transformers by exploiting an intrinsic sparse attention topology. The method combines WEST, which extracts a weight-encoded block-sparse mask offline, and FAST, which adaptively tunes head-wise sparsity based on fidelity requirements. Experiments on Wan2.1 demonstrate up to 1.90× speedup with maintained generation quality, outperforming existing sparse attention methods.

diffusion transformers · sparse attention · block-sparse prior · fidelity-aware tuning · video generation

Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning

arXiv cs.AI · Yunan Wang, Minghui Song, Zihan Zhang, Shaohan Huang · 2026-06-22

The paper introduces Group-Graph Policy Optimization (G2PO), a reinforcement learning algorithm addressing reward sparsity and delayed feedback in long-horizon agentic tasks. G2PO constructs a global state-transition graph from linear trajectories, enabling group-aggregation state-value estimation for reduced variance and edge-centric advantage estimation for identifying critical transitions. Evaluations on WebShop, ALFWorld, and AppWorld show G2PO outperforming baselines, with success-rate gains of up to 22.2% over GRPO.

group-based rl · state-transition graph · credit assignment · temporal difference · long-horizon tasks
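
The graph construction described in the summary above can be sketched roughly as follows. This is one plausible reading, not the paper's algorithm: it assumes states can be matched by an identifier, averages returns over every visit to a state across trajectories, and uses a temporal-difference-style advantage per edge. All names are hypothetical.

```python
from collections import defaultdict

def build_graph_values(trajectories, gamma=1.0):
    """trajectories: list of [(state, reward, next_state), ...].
    Returns (V, A): V[s] is the mean discounted return over all visits to s
    in any trajectory; A[(s, s2)] is a TD-style advantage for that edge."""
    returns = defaultdict(list)
    edge_rewards = defaultdict(list)
    for traj in trajectories:
        g = 0.0
        # Walk backwards so g is the discounted return from each step onward.
        for state, reward, next_state in reversed(traj):
            g = reward + gamma * g
            returns[state].append(g)
            edge_rewards[(state, next_state)].append(reward)
    # Group aggregation: one value per graph node, pooled across trajectories.
    V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
    A = {}
    for (s, s2), rs in edge_rewards.items():
        mean_r = sum(rs) / len(rs)
        # Terminal or unseen successor states default to a value of 0.
        A[(s, s2)] = mean_r + gamma * V.get(s2, 0.0) - V[s]
    return V, A
```

With two trajectories that share a start state but diverge, the edge leading toward the rewarded branch receives a positive advantage and the other a negative one, which is the kind of signal that can single out critical transitions.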

Neural Architecture Search of Sample Reweighting Networks for Complex Distribution Shift

arXiv cs.AI · Keisuke Sugawara, Kento Uchida, Shinichi Shirakawa · 2026-06-22

The paper proposes enhancing Meta-Weight-Net (MW-Net) via neural architecture search (NAS) to handle simultaneous label noise and class imbalance. Using a tree-structured Parzen estimator, the method optimizes MW-Net's hidden layer count, node count, and intermediate layer selection from the classifier for input features. Experiments on modified CIFAR-10/100 with compound distribution shifts demonstrate NAS-improved MW-Net outperforms the baseline architecture.

neural architecture search · sample reweighting · distribution shift · meta-weight-net · label noise

Joint Air Traffic Flow and Capacity Management via Answer Set Programming

arXiv cs.AI · Alexander Beiser, Markus Hecher, Nysret Musliu, Stefan Woltran · 2026-06-22

This work introduces a joint Air Traffic Flow and Capacity Management (ATFCM) model using Answer Set Programming (ASP) to optimize both aircraft trajectories (ATFM) and sector configuration (DAC) simultaneously, addressing a gap in state-of-the-art approaches that optimize these separately. The ASP implementation is evaluated against a Mixed Integer Programming (MIP) model and an iterative CASA-based heuristic using an instance generator based on OpenSky Network flight data. Results show that ASP outperforms MIP and remains competitive with heuristics on small instances, with DAC showing the largest performance improvement compared to rerouting and delaying strategies.

answer set programming · air traffic flow · sector configuration · mixed integer programming · open sky network

StatABench: Dataset and Framework for Evaluating Statistical Analysis Capabilities of LLMs

arXiv cs.AI · Youxin Zhu, Yixuan Ding, Peng Lai, Longyue Wang · 2026-06-22

The authors introduce StatABench, a comprehensive benchmark for evaluating LLMs' statistical analysis capabilities, comprising Stat-Closed (404 questions across 18 topics) and Stat-Open (30 open-ended modeling tasks). Evaluation employs the LangChain MCP framework with data science agents, assessing Stat-Open via LLM-as-Judge. Results show GPT-5.1 achieves 68.6% on Stat-Closed, while the best open-source model reaches 60.6%; on Stat-Open, the top agent scores 61.86, revealing persistent gaps in tool-grounded reasoning and methodological decision-making.

statistical analysis · llm evaluation · benchmark design · tool-grounded reasoning · llm-as-judge

Understanding Parallel Samplers in Masked Diffusion via Random Walks on Graphs

arXiv cs.AI · Vansh Bansal, Cho Cholyeon, Syamantak Kumar, Sujay Sanghavi · 2026-06-22

The paper introduces random walks on graphs as a verifiable framework to analyze parallel sampling strategies in masked diffusion models (MDMs). By training MDMs on graph walk samples without explicit graph exposure, the method enables quantitative evaluation via validity checks and Markov kernel estimation. Theoretical analysis reveals that parallel unmasking performance depends on graph structure, with a proposed bisection sampler achieving logarithmic steps under perfect training. Experiments demonstrate graph-dependent sampler efficacy, with bisection-style samplers improving speed-quality tradeoffs in OpenWebText MDM language generation.

masked diffusion models · parallel sampling · random walks · markov kernel · bisection sampler
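
The validity check mentioned in the summary above is simple to illustrate: a generated sequence is valid when every consecutive pair of nodes is an edge of the hidden graph. The sketch below assumes an adjacency-list graph and hypothetical function names; it does not reproduce the paper's samplers or its kernel estimation.

```python
import random

def sample_walk(adj, start, length, rng):
    """Sample a walk with `length` nodes, moving to a uniformly random neighbor."""
    walk = [start]
    while len(walk) < length:
        walk.append(rng.choice(adj[walk[-1]]))
    return walk

def is_valid_walk(adj, walk):
    """True if every consecutive pair in `walk` is an edge of the graph."""
    return all(b in adj[a] for a, b in zip(walk, walk[1:]))

def validity_rate(adj, generated):
    """Fraction of generated sequences that are valid walks."""
    return sum(is_valid_walk(adj, w) for w in generated) / len(generated)
```

Training data would come from `sample_walk`, and model outputs would be scored with `validity_rate`, so the evaluation needs no learned judge.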

When Preferences Fail to Become Incentives: A Utility-Behavior Gap in Large Language Models

arXiv cs.AI · Yujun Zhou, Christopher M. Ackerman · 2026-06-22

This study investigates whether coherent preferences elicited from large language models (LLMs) in choice paradigms translate into behavioral incentives in practical scenarios. The authors first replicate prior findings on consistent preference elicitation, then design writing tasks (essays, grant proposals, etc.) with blind LLM-judged quality metrics. Despite demonstrating that LLMs can modulate output quality via explicit cues, results show no correlation between offered high-utility incentives (based on choice-paradigm preferences) and output quality across all tested models. The work establishes a utility-behavior gap, showing that elicited preferences don't necessarily motivate LLM behavior.

preference elicitation · utility-behavior gap · large language models · incentive alignment · behavioral probing

Attacking the Trusted Imagination: Oracle-Level Integrity Attacks on Imagine-then-Act World Models

arXiv cs.AI · Linghan Chen, Kaiyan Ji, Minyu Guo · 2026-06-22

The paper identifies and exploits a vulnerability in imagine-then-act vision-language-action (VLA) policies, where the world-action model's (WAM) latent trajectory imagination becomes an attack surface. Using L-infinity-bounded observation perturbations and projected gradient descent, attackers can corrupt the imagined future (60x stronger than random noise) while evading detection (AUC 1.0 for untargeted attacks). Targeted control remains constrained due to manifold alignment challenges. Experiments on RynnVLA-002, LingBot-VA, and LaDi-WM show that while reactive policies resist corrupted imagination, imagination-driven MPC fails adversarially (ε=0.01, success drops from 0.70 to 0.05).

imagine-then-act · world-action model · latent trajectory · l-infinity attack · model-predictive control

The Impact of VAE Design on Latent Pose Representations for Diffusion-based Sign Language Production

arXiv cs.AI · Guilhem Fauré, Mostafa Sadeghi, Sam Bigeard, Slim Ouni · 2026-06-22

The study examines how VAE design choices affect latent space properties and downstream performance in diffusion-based sign language production. Using the Phoenix14T dataset, it analyzes architectural and training objective variations in VAEs, measuring their impact on latent space structure and subsequent text-to-sign generation via latent diffusion models. Results indicate that generative performance (evaluated by back-translation BLEU scores) correlates more strongly with latent space characteristics than with VAE reconstruction accuracy alone.

variational autoencoder · latent diffusion · sign language production · back-translation bleu · phoenix14t

Plans Don't Persist: Why Context Management Is Load Bearing for LLM Agents

arXiv cs.AI · Aman Mehta, Anupam Datta · 2026-06-22

Using replay pairing, a diagnostic that measures hidden-state cosine distance when plans are present versus evicted, the paper demonstrates that LLM agents do not internalize plans as persistent state and instead rely on the plan remaining in context. Results on Llama-3.1-70B show plan signal decays sharply (4.1x in one step; 12.4x on HotpotQA), while strict stripping mitigates reasoning-trace confounds (+163% signal recovery). Probe analysis reveals model-specific plan encoding (AUROC 0.748 transfer to DeepSeek-R1-Distill-Llama-70B), and naive eviction reduces ALFWorld success by 34.7pp, highlighting context management as load-bearing for agents.

context management · replay pairing · hidden-state cosine distance · reasoning-trace confound · plan signal decay
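
The measurement behind replay pairing, as summarized above, is a cosine distance between paired hidden states. A minimal sketch follows, assuming hidden states are available as plain lists of floats for each step with and without the plan in context; the function names are illustrative.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

def plan_signal(states_with_plan, states_without_plan):
    """Per-step distance between paired runs; larger means the plan matters more."""
    return [cosine_distance(a, b)
            for a, b in zip(states_with_plan, states_without_plan)]
```

A distance that falls quickly over steps after eviction would correspond to the decay the summary reports.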

ENVS: Environment-Native Verified Search for Long-Horizon GUI Agents

arXiv cs.AI · Yincheng Zhou, Athena Zhuoming Zhong, Shijie Zhang, Kevin Zhang · 2026-06-22

The paper introduces Environment-Native Verified Search (ENVS), a training-time search-and-filter pipeline for GUI agents that constructs verified supervision before policy optimization. ENVS branches over behaviorally distinct GUI actions in live OSWorld virtual machines, verifies successful trajectories, and trains from globally balanced step-level supervision. Evaluated on the 300-task OSWorld pool and the new OSWorld-Noisy benchmark with recoverable desktop interruptions, ENVS achieves 30.3 pass@8 (original) and 29.0 pass@8 (noisy), outperforming ARPO-style online RL while reducing compute from 184-192 to 138-153 GPU-hours. ENVS also better preserves visual-reasoning abilities on auxiliary benchmarks like OSWorld-G Refusal (16.7 vs. 1.9) and BLINK Functional Correspondence (26.2 vs. 23.1).

gui agents · verified supervision · long-horizon tasks · environment-native search · osworld benchmark

Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently

arXiv cs.AI · Stanley Wei, Juno Kim · 2026-06-22

Theoretical analysis demonstrates that reinforcement learning with verifiable rewards (RLVR) outperforms supervised fine-tuning (SFT) for reasoning tasks by enabling efficient backtracking. Modeling chain-of-thought reasoning as graph pathfinding, the study proves SFT fails to learn backtracking when trained solely on optimal paths, while RLVR achieves this using outcome rewards. This results in exponential computational efficiency gains during inference, as RLVR identifies critical decision points in reasoning chains. Additionally, RLVR-generated reasoning traces can distill backtracking capabilities into base models.

reinforcement learning · supervised fine-tuning · backtracking · chain-of-thought · graph pathfinding

When Agents Commit Too Soon: Diagnosing Premature Commitment in LLM Agents

arXiv cs.AI · Aman Mehta · 2026-06-22

The paper introduces 'premature commitment' as a failure mode in long-horizon LLM agents, where early convergence on a single interpretation leads to persistent errors. Using representational commitment—measured via hidden-state similarity across runs—the study diagnoses this phenomenon in Llama-3.1-70B, Qwen-2.5-72B, and Phi-3-14B on HotpotQA and StrategyQA (r = -0.35 to -0.83). A runtime monitor achieves AUROC up to 0.97 for detecting inconsistent trajectories, while a prompting intervention reduces behavioral variance by 28% without affecting accuracy. The method distinguishes commitment from correctness but shows limited utility in routing self-consistency compute.

premature commitment · hidden-state similarity · llm agents · react · auroc

BEV-Denoise: Learning Intrinsic Noise for Accurate Bird's-Eye-View Semantic Segmentation

arXiv cs.AI · Dooseop Choi, Kyounghwan An, Kyoung-Wook Min · 2026-06-22

BEV-Denoise introduces a framework for accurate Bird's-Eye-View (BEV) semantic segmentation by estimating and removing intrinsic noise from learned BEV features. The method employs a UNet-based noise estimation module inspired by Denoising Diffusion Probabilistic Models (DDPM), subtracting the estimated noise from BEV features before feeding them to BEV map decoders. A Task Decomposition (TD) paradigm facilitates supervision, utilizing a pre-trained BEV map autoencoder to train a view transformation (VT) encoder. Experiments on the nuScenes dataset demonstrate the framework's effectiveness across four existing models covering major VT paradigms.

bev semantic segmentation · denoising diffusion probabilistic models · task decomposition · view transformation encoder · nuscenes dataset

Hierarchical Reinforcement Learning for Sparse-Reward Search in Commutative Algebra

arXiv cs.AI · Giorgi Butbaia, Paul Orland, Coco Huang, Davide Passaro · 2026-06-22

The paper introduces a hierarchical reinforcement learning (HRL) framework with equivariant graph neural networks to address Kalai's algebraic Hirsch conjecture, a sparse-reward problem in commutative algebra. The method employs constrained options-based HRL with an equivariant policy to learn temporal abstractions for constructing counterexamples on graphs. Evaluations across varying degrees show superior performance over classical RL and greedy search, demonstrating HRL's efficacy in mathematical problem-solving.

hierarchical reinforcement learning · sparse-reward · equivariant graph neural network · commutative algebra · temporal abstractions

Intent-Governed Tool Authorization for AI Agents

arXiv cs.AI · Genliang Zhu, Chu Wang · 2026-06-22

The paper proposes Intent-Governed Access Control (IGAC), a server-side authorization layer that incorporates user intent as a policy attribute for AI-agent tool use. IGAC introduces intent certificates, session-scoped policy narrowing, and consistency checks between intent, tools, and payloads, ensuring user intent only reduces—never expands—static integration policy authority. The system builds on OpenPort's existing governance features, including ABAC-style policy checks and audit capabilities. The approach maintains monotonicity in authority reduction while enabling auditable, intent-aware tool authorization.

access control · ai agents · authorization layer · intent certificates · policy narrowing
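
The monotonicity property in the summary above (intent may narrow authority but never expand it) can be illustrated with set intersection. This is a toy sketch under that single assumption; the tool names and functions are hypothetical and do not reflect OpenPort's or IGAC's actual interfaces.

```python
def effective_tools(static_policy, intent_scope):
    """Tools allowed for this session: the static policy narrowed by intent.
    The result is always a subset of static_policy."""
    return frozenset(static_policy) & frozenset(intent_scope)

def authorize(tool, static_policy, intent_scope):
    """Allow a call only if both the static policy and the intent permit it."""
    return tool in effective_tools(static_policy, intent_scope)
```

Because the session scope is an intersection, a tool named in the intent but absent from the static policy stays denied, so a forged or over-broad intent cannot grant new authority.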

Intend, Reflect, Refine: An Adaptive Multimodal Reflection Framework for Autonomous Driving

arXiv cs.AI · Zisheng Chen, Yuping Qiu, Jianhua Han, Tao Tang · 2026-06-22

IRR-Drive introduces an adaptive multimodal reflection framework for autonomous driving, enhancing trajectory planning reliability in dynamic environments. The method generates preliminary textual intentions and predicts future semantic bird's-eye view (BEV) representations to model anticipated scene evolution, enabling rigorous self-correction and refinement of initial intent. An adaptive reflection reward balances planning performance and computational efficiency, allowing the model to select reasoning modes based on scene complexity. IRR-Drive achieves state-of-the-art performance on the NAVSIM benchmark in both PDMS and EPDMS, validating the efficacy of its multimodal reflection framework and adaptive strategy.

multimodal reflection · bird's-eye view · adaptive reasoning · trajectory planning · scene complexity

ThermoLLM: Thermodynamics-Aware HVAC Control with Spatial-Semantic Knowledge Graph

arXiv cs.AI · Kirtan Bhatt, Xiachong Lin, Matthew Amos, Flora D. Salim · 2026-06-22

The paper introduces ThermoLLM, a thermodynamics-aware LLM framework for HVAC control that incorporates spatial-semantic knowledge graphs. The method grounds control decisions in a physics-informed knowledge graph (derived from Brick-style semantics) combined with interaction history, enabling reasoning about zone coupling and thermal dynamics in a five-zone EnergyPlus simulation. Evaluation shows superior energy-comfort trade-offs (lowest PMV violations) compared to standard and LLM baselines while maintaining energy efficiency.

hvac control · knowledge graph · thermodynamics-aware · llm framework · energyplus simulation

Cross-lingual Retrieval-Augmented Classification for Dysarthria Severity Assessment

arXiv cs.AI · Taeyoung Jeong, Insung Lee, Du-Seong Chang, Myoung-Wan Koo · 2026-06-22

The paper introduces Cross-lingual Retrieval-Augmented Classification (CRAC) for dysarthria severity assessment, addressing data scarcity by leveraging speech from a different language. CRAC employs supervised contrastive learning to create a severity-focused embedding space, builds a vector database from an opposite-language corpus, and uses cross-attention to fuse retrieved top-k references with input during training/inference. Evaluated on Korean post-stroke and Italian ALS datasets under speaker-independent three-class protocols, CRAC achieves 87.3% and 86.7% balanced accuracy, outperforming monolingual baselines by 8.4 and 20.0 percentage points respectively.

dysarthria · cross-lingual · retrieval-augmented · contrastive learning · severity assessment

Graph-Enhanced Large Language Models for Spatial Search

arXiv cs.AI · Nicole R. Schneider, Kent O'Sullivan, Hanan Samet · 2026-06-22

The paper proposes graph-enhanced large language models (LLMs) to address limitations in spatial reasoning capabilities. It identifies current deficiencies in LLMs' ability to process spatial data, particularly for domains like urban planning and civil engineering where graph-structured representations are common. The authors outline technical challenges and envision a future integration of LLMs with search engines for complex spatial queries through graph-based reasoning augmentation.

spatial reasoning · retrieval augmented generation · graph-enhanced llms · urban planning · search engine integration

From Fragments to Paths: Task-Level Context Recovery for Large Industrial Codebases

arXiv cs.AI · Jiawei He, Weisong Sun, Mengyu Shi, Jie Jia · 2026-06-22

DeepDiscovery introduces a task-level repository-understanding method for large industrial codebases, addressing the limitation of local fragment retrieval in existing approaches. The proposed two-stage Location-Inference framework localizes high-confidence task anchors and recovers broader task-relevant context under budget constraints, leveraging multi-relational repository structures. Evaluations on organization-internal industrial tasks (27 medium-scale and 40 large-scale) and SWE-bench Verified demonstrate consistent improvements: best file recovery quality among baselines, Full Recall Rate gains of 1.6-9.2pp (large subprojects) and 2.5-7.4pp (medium-scale), and a 78.6% Solve Rate (8.2pp improvement over baseline) in end-to-end testing.

repository-understanding · task-level context · multi-relational structure · location-inference framework · swe-bench

Agent-as-a-Router: Agentic Model Routing for Coding Tasks

arXiv cs.AI · Pengfei Zhou, Zhiwei Tang, Yixing Ma, Jiasheng Tang · 2026-06-22

The paper introduces Agent-as-a-Router, a dynamic routing framework for Large Language Models (LLMs) that addresses the information deficit in static routing approaches by formalizing routing as a C-A-F loop (Context->Action->Feedback->Context). The proposed ACRouter system comprises an Orchestrator, Verifier, and Memory module, accumulating execution-grounded experience during deployment. Evaluated on CodeRouterBench (~10K task instances with scores from 8 LLMs), ACRouter achieves the lowest cumulative regret on in-distribution tasks and generalizes to out-of-distribution agentic-programming tasks, demonstrating a 15.3% relative gain over heuristic routers.

large language models · model routing · c-a-f loop · cumulative regret · agentic programming
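
Cumulative regret, the metric cited in the summary above, can be computed from a table of per-model scores. The sketch below assumes the common definition (best available score minus the score of the routed model, summed over tasks); the paper's exact definition may differ, and the names are illustrative.

```python
def cumulative_regret(score_table, choices):
    """score_table[t] maps model name -> score on task t.
    choices[t] is the model the router picked for task t.
    Returns the running total of regret after each task."""
    total, curve = 0.0, []
    for scores, chosen in zip(score_table, choices):
        # Regret on one task: what the best model would have scored,
        # minus what the routed model actually scored.
        total += max(scores.values()) - scores[chosen]
        curve.append(total)
    return curve
```

A router that learns from feedback should show a regret curve that flattens over time, whereas a static router keeps accumulating regret at a steady rate.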

Explainable AI in Speaker Recognition -- Attention Map Visualisation and Evaluation

arXiv cs.AI · Yanze Xu, Mark D. Plumbley, Wenwu Wang · 2026-06-22

The work proposes Modified RISE-eval, an improved algorithm for evaluating attention maps in speaker recognition networks, addressing limitations in existing methods. It systematically analyzes GradCAM and LayerCAM visualizations on a speaker recognition model using both the original and modified evaluation frameworks. Results show distinct performance advantages for each CAM-based method under different experimental conditions, with Modified RISE-eval providing more robust assessment metrics for attention map quality in XAI applications.

explainable ai · attention mechanisms · speaker recognition · class activation maps · evaluation metrics

Explanation-Guided Medical Named Entity Recognition with Stability and Boundary Awareness for Atopic Dermatitis

arXiv cs.AI · Xueguang Li, Di Lin, Xue Jiang, Yanxi Li · 2026-06-22

The study introduces an explanation-guided framework for Chinese medical named entity recognition (NER) in atopic dermatitis texts, enhancing reliability through stability and boundary awareness. The method employs perturbation-based analysis for explanation stability evaluation, an adaptive fusion strategy combining local and global explanations, and incorporates these signals via stability, boundary-aware, and consistency constraints during training. Experiments on Chinese AD NER datasets demonstrate improved explanation robustness and consistent performance gains across multiple NER models, with the fusion strategy yielding more stable explanations and better boundary perception than individual methods.

medical named entity recognition · explanation-guided learning · perturbation-based analysis · adaptive fusion strategy · boundary awareness

CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents

arXiv cs.AI · Zhanbo Hua, Yifan Yao, Weihao Xie, Yongchi Zhao · 2026-06-22

The paper introduces CLI-Universe, a verifiable task synthesis engine for terminal agents that addresses data scarcity in LLM training. The method constructs tasks through multi-dimensional capability taxonomy sampling, grounds candidates via evidence-guided research, and verifies them via Dockerized environments with a multi-stage executable pipeline (rubric-gated tests, hint-conditional filtering). The resulting dataset, CLI-Universe-6K (6,000 trajectories), enables Qwen3-32B to achieve 33.4% accuracy on Terminal-Bench 2.0, outperforming larger models and demonstrating data efficiency.

task synthesis · terminal agents · capability taxonomy · dockerized environments · executable verification

Priority-Aware Learning-Unlearning Correction for Dynamic Decentralized LoRA Fine-Tuning

arXiv cs.AI · Nuocheng Yang, Yechen He, Sihua Wang, Zihan Chen · 2026-06-22

The paper proposes a priority-aware learning-unlearning correction framework for dynamic decentralized LoRA fine-tuning in edge-deployed LLMs. The method introduces orthogonal LoRA for history-free projection updates and a bottleneck-aware policy selecting among topology refinement, local correction, proximal damping, and synchronization scheduling. Experiments show robust correction for device join/leave events, with distinct residual regimes requiring different correction actions.

decentralized federated learning · orthogonal lora · learning-unlearning · priority-aware correction · dynamic edge networks

SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers

arXiv cs.AI · Huzama Ahmad, Se-Young Yun · 2026-06-22

SpotAttention introduces a plug-in block-sparse routing mechanism for pretrained long-context transformers, reducing computational costs while maintaining accuracy. The method employs a lightweight selector attached to a frozen transformer, trained via KL distillation to estimate attention distributions and select top-K keys per query using a dual top-p rule. Evaluated on Qwen3 (4B-32B) and Qwen3.5 (4B-9B), it matches dense accuracy up to 128K tokens, achieving 3.9x faster decode speed than FlashAttention and 1.8x faster than Twilight, with INT4/FP4 quantization reducing cache size by 3.5x without accuracy loss.

sparse attention · kl distillation · kv-cache · long-context transformers · quantization
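
The summary above does not spell out the dual top-p rule, so the sketch below shows only the basic single top-p step it presumably builds on: given an estimated attention distribution over key blocks, keep the smallest set of blocks whose probability mass reaches p. The function name and inputs are hypothetical.

```python
def top_p_blocks(probs, p):
    """Indices of the smallest set of blocks, taken in descending
    probability order, whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(chosen)
```

Unlike a fixed top-K, the number of selected blocks adapts to how peaked the distribution is: a sharply peaked query keeps few blocks, a diffuse one keeps many.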

VideoLatent: Video-Language Learning via Latent Self-Forcing

arXiv cs.AI · Zi-Yuan Hu, Zicong Tang, Shijia Huang, Yanyang Li · 2026-06-22

VideoLatent introduces a novel multimodal large language model (MLLM) for video understanding and reasoning, addressing inefficiencies in chain-of-thought (CoT) based approaches. The model employs a latent injection module and a latent self-forcing training paradigm, comprising latent alignment and latent diversity objectives, requiring only standard video-question-answer triplets. Evaluated across 14 benchmarks, VideoLatent outperforms existing MLLMs in general video understanding and complex reasoning, achieving 6× and 68× reductions in training and inference overhead, respectively, compared to Video-R1. The method demonstrates strong generalizability across different MLLM backbones and model scales.

multimodal large language model · latent injection module · latent self-forcing · video understanding · chain-of-thought reasoning

Discovering Crystal Structure Prediction Algorithms with an AI Co-Scientist

arXiv cs.AI · Kiyoung Seong, Nayoung Kim, Sungsoo Ahn · 2026-06-22

The paper presents the Human-AI Co-discovery system (HACO), a framework for scientific algorithm discovery via cross-domain search and sparse human steering, applied to crystal structure prediction (CSP). HACO identified MaskGIT, a masked generative model from vision, as a transferable framework for CSP, and instantiated it as Masked Generative Crystal Transformer (MaskGXT) with crystallographic symmetry tokens, space group stratified sampling, and sub-bin coordinate refinement. On the MP-20 polymorph split, MaskGXT achieves 79.06% METRe accuracy, outperforming the strongest baseline (70.87%), and attains the best match rate on MP-20 and MPTS-52 CSP benchmarks. Results demonstrate that transfer-guided interactive AI co-scientists can effectively contribute to scientific algorithm discovery in domains with cheap, fast, and well-aligned validation.

masked generative model · crystal structure prediction · cross-domain search · polymorph coverage · sub-bin coordinate refinement

AI Scientists as Engines of Discovery: A Case for Development within Reformed Institutions

arXiv cs.AI · Raul Jimenez, Boris Bolliet, Francisco Villaescusa-Navarro, Rabih Zbib · 2026-06-22

The article proposes a paradigm shift in scientific discovery through agentic AI systems, emphasizing their potential to evolve from computational tools into autonomous 'AI scientists' capable of hypothesis generation and verification. It introduces Denario, a multi-agent framework designed to accelerate discovery cycles and explore model spaces beyond human capacity. The authors advocate for institutional reforms to ensure verification, accountability, interpretability, and dual-use safety. They also address implications for authorship, peer review, and the evolving role of human scientists, concluding with governance recommendations for AI as an epistemic actor rather than a mere instrument.

agentic ai · multi-agent systems · hypothesis generation · model spaces · epistemic actor

The Unseen Hand: Manipulating Model Fairness and SHAP with Targeted Identity Re-Association Attacks

arXiv cs.AI · Sannaan Khan, Muhammad U. S. Khan · 2026-06-22

The paper introduces Targeted Identity Re-Association (TIRA) attacks, a novel method for manipulating model fairness metrics and SHAP explanations without accessing model internals. Two algorithms are proposed: Probabilistic Micro-Shuffling (PMiS) performs localized adjacent swaps, while Probabilistic Rank-Shift Micro-Perturbation (PRSMP) applies randomized rank shifts. Empirical results show TIRA attacks effectively push fairness metrics toward ideal values and reduce SHAP-based attribution of protected features to near-zero levels, outperforming prior data-agnostic attacks in stealth and effectiveness.

algorithmic fairness · shap explanations · adversarial attacks · model auditing · protected features

RaMem: Contextual Reinstatement for Long-term Agentic Memory

arXiv cs.AI · Wei Yang, Bryce Kan, Shixuan Li, Li Li · 2026-06-22

RaMem introduces a framework for contextual reinstatement in long-term agentic memory, addressing context collapse in LLM agents by ensuring retrieved memory fragments provide valid evidence for current queries. The method operates through four stages: evidence anchoring, recall condition induction, validity-aware retrieval, and context-preserved synthesis, which collectively ground memories in their original episodic conditions and prioritize context-compatible candidates. Experiments demonstrate RaMem's effectiveness, showing average F1 score improvements of over 10% across multiple long-term memory benchmarks compared to strong baselines.

context collapse · evidence anchoring · validity-aware retrieval · context-preserved synthesis · long-term memory

CLIP-guided Diffusion Model for Backdoor Generation in Sensor-based Human Activity Recognition

arXiv cs.AI · Toby Briston, Illya Kosyk, Kuniyih S · 2026-06-22

The paper introduces IMU-DM-CLIP, a backdoor attack method for sensor-based human activity recognition (HAR) models using a CLIP-guided diffusion model to generate synthetic data with triggers. The approach injects backdoors during training by poisoning only 10% of the data and guiding 10% of the diffusion process. Empirical results demonstrate successful attack execution with this minimal injection rate, compromising HAR model integrity while maintaining apparent functionality.

backdoor attack · diffusion model · human activity recognition · sensor data · clip guidance

OrthoMotion: Disentangling Camera and Subject Motion via Geometry Semantics Orthogonal Attention

arXiv cs.AI · Zijie Meng · 2026-06-22

OrthoMotion introduces a geometrically grounded attention operator to disentangle camera and subject motion in controllable video generation, proving the 2D camera/object split is fundamentally non-identifiable. The method routes camera motion via norm-preserving rotation of RoPE phase (geometric channel) and subject motion via gated value injection in cross-attention (semantic channel), with a regularizer enforcing orthogonal response subspaces. It reduces cross-talk by 2.4× versus baselines while maintaining fidelity, achieving SOTA on both motion control accuracy and disentanglement, quantified via a novel Cross-Talk Error (CTE) metric.

disentangled representation · rotary position embedding · cross-attention · optical flow · video generation

Finding the Evidence: Discovering Decision-Supporting Tokens for On-Policy Reasoning Distillation

arXiv cs.AI · Jinwei Xiao, Zhuowen Han, Yueqing Sun, Zhengxi Lu · 2026-06-22

DEAR (Decision-Evidence Aware Reasoning Distillation) improves on-policy distillation by identifying both decision and evidence tokens in reasoning chains. Decisions are detected via student entropy, while evidence tokens are discovered using hidden-state cosine similarity to decision anchors, enhanced by teacher-student divergence to highlight knowledge gaps. This dual mechanism transfers substantive knowledge previously untransferred by standard methods. Evaluated across math and code benchmarks with three student-teacher configurations, DEAR outperforms standard on-policy distillation, achieving gains of up to +2.5pp on competition math and +5.7pp on code generation.

on-policy distillation · reasoning chains · student entropy · hidden-state cosine similarity · teacher-student divergence
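
The entropy-based detection of decision tokens mentioned in the summary above can be sketched as a threshold on the student's per-token predictive entropy. The threshold value and function names are assumptions for illustration; the evidence-token step (hidden-state similarity to decision anchors) is not shown.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def decision_positions(step_distributions, threshold):
    """Positions where the student is uncertain enough to count as a decision."""
    return [i for i, dist in enumerate(step_distributions)
            if token_entropy(dist) > threshold]
```

Positions with near-deterministic predictions are skipped, so the distillation signal can be concentrated where the student actually had to choose.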

DBT-Bleed: Dual-Branch Temporal Modeling with Key-Frame Selection for Surgical Bleeding Detection

arXiv cs.AI · Sudhanshu Mishra, Jialang Xu, Jensen Ang, Evangelos B. Mazomenos · 2026-06-22

DBT-Bleed introduces a dual-branch multi-scale temporal modeling framework for surgical bleeding detection, addressing limitations in temporal reasoning and computational efficiency. The method employs layer-wise temporal adapters to disentangle bleeding and normal representations, coupled with HiRED, a hierarchical entropy-driven frame selection strategy for efficient long-video processing. Evaluated on the MultiBypass dataset, it achieves gains of 6.53% F1, 5.62% Recall, and 9% MCC, with cross-procedure generalization showing 6% F1 and 8% MCC improvements in zero-shot settings on the novel EndoPit-IAE neurosurgery dataset.

temporal modeling · surgical bleeding detection · frame selection · zero-shot learning · neurosurgery dataset

MINCE: Shrinking LLM Evaluation Datasets via Few-Model Monte Carlo Calibration

arXiv cs.AI · Devleena Das, Rajeev Patwari, Vikram Kumar Bukka, Nithin Kumar Guggilla · 2026-06-22

MINCE introduces Monte Carlo simulation over per-item logs from a small calibration pool to determine the minimum subset size for LLM evaluation that bounds accuracy drift, eliminating the need for prediction layers. The method reduces evaluation costs significantly, achieving reductions of 54% on IFEVAL, 89% on MMLU, and 70% on GSM8K, with maximum drift ≤2.62pp on BF16 models and mean drift of 0.77–3.59pp on NPU models. MINCE delivers median GPU speedups of 2.7–8.1× and NPU speedups of 1.7–2.0×, outperforming tinyBenchmarks in drift reduction while using 57× fewer calibration models.

monte carlo simulation · accuracy drift · subset selection · llm evaluation · calibration pool
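
The core idea in the MINCE summary, simulating random subsets over per-item logs to find the smallest size that keeps drift within a tolerance, might look like the sketch below. The function names, the worst-case drift criterion, and the trial count are assumptions made for illustration; the paper's exact procedure may differ.

```python
import random

def max_drift(logs, n, trials, rng):
    """Worst absolute gap between subset accuracy and full-set accuracy.

    logs: one list of 0/1 per-item correctness values per calibration model.
    n must not exceed the number of items.
    """
    num_items = len(logs[0])
    full_acc = [sum(model) / num_items for model in logs]
    worst = 0.0
    for _ in range(trials):
        idx = rng.sample(range(num_items), n)  # one random subset of size n
        for model, acc in zip(logs, full_acc):
            sub_acc = sum(model[i] for i in idx) / n
            worst = max(worst, abs(sub_acc - acc))
    return worst

def min_subset_size(logs, tol, sizes, trials=200, seed=0):
    """Smallest candidate size whose simulated worst-case drift is <= tol."""
    rng = random.Random(seed)
    for n in sorted(sizes):
        if max_drift(logs, n, trials, rng) <= tol:
            return n
    return len(logs[0])  # fall back to the full dataset
```

A tolerance such as `tol=0.02` would correspond to allowing two percentage points of drift under this hypothetical criterion.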

Active Inference as the Test-Time Scaling Law for Physical AI Agents

arXiv cs.AI · Omar Hashash, Christo Kurisummoottil Thomas, Walid Saad, Merouane Debbah · 2026-06-22

The paper introduces a test-time scaling law for physical AI agents based on active inference principles, enabling generalization in unforeseen scenarios. The method dynamically updates the agent's policy through Bayesian inference, minimizing free energy bounds to resolve prediction errors. Simulation results on autonomous driving show a 36% improvement in inference efficiency over Q-learning and Bayesian reinforcement learning, with robust generalization to non-stationary environments.

active inference · scaling law · bayesian inference · free energy minimization · generalization

Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis

arXiv cs.AI · Jinchuan Tian, Haoran Wang, Siddhant Arora, Takashi Maekaku · 2026-06-22

Bagpiper-TTS introduces a universal speech synthesis system that processes natural language prompts to generate rich captions encompassing transcription and metadata, enabling diverse applications beyond classical TTS. The method first reasons over user intent to derive comprehensive textual blueprints, which then guide speech synthesis. Evaluations on Seed-TTS-Eval show a 1.7% WER, with performance matching dedicated models in LLM-as-a-judge and human subjective assessments across multi-talker, intent-to-speech, role-play, and singing voice synthesis tasks.

text-to-speech · natural language processing · speech synthesis · metadata reasoning · multi-talker synthesis

AI-Assisted Help-Seeking Trajectories in Programming Education from an SRL-Informed Perspective

arXiv cs.AI · Boxuan Ma, Huiyong Li, Gen Li, Li Chen · 2026-06-22

This study investigates AI-assisted help-seeking trajectories in university-level programming education through a self-regulated learning (SRL) framework. Analyzing 1,290 task-specific prompts linked to 17,190 code submissions from 71 introductory Python students, the research examines how help-seeking interactions unfold across turns and attempts, and their relation to task scores and submission counts. Findings reveal that students predominantly use AI for reactive troubleshooting rather than planned problem-solving. While trajectory patterns did not significantly affect task scores, they substantially influenced the number of code submissions required, highlighting the importance of help-seeking dynamics in programming education.

self-regulated learning · help-seeking trajectories · code submissions · reactive troubleshooting · programming education

Measuring Behavior Portability in Large Language Models

arXiv cs.AI · Tianjia Dong, Nadav Kunievsky, James A. Evans · 2026-06-22

The study introduces a formal framework to measure behavioral portability in large language models (LLMs), assessing how well behavioral mappings learned in one decision environment generalize to payoff-equivalent alternatives. The protocol fits an interpretable behavioral model on pooled source environment data and evaluates out-of-sample predictive performance in a target environment, benchmarking against an oracle trained directly on target data. Portability is quantified via a loss-agnostic measure providing worst-case bounds on prediction-action mapping performance. Experiments across seven canonical economic decision problems reveal substantial and systematic portability losses, indicating LLM behavioral characterizations do not reliably transfer to structurally equivalent environments.

behavioral portability · payoff-equivalent · interpretable behavioral model · loss-agnostic measure · prediction-action mapping

A Formula-Driven Survey and Research Agenda for On-Policy Distillation

arXiv cs.AI · Bowen Zhang · 2026-06-22

This survey introduces a formula-driven taxonomy for on-policy distillation (OPD), framing it as a feedback-to-update problem rather than a single loss family. The taxonomy organizes methods into direct distributional losses and policy-gradient-style log-ratio updates, emphasizing factors like state compatibility, temporal credit, and vocabulary-level probability routing. It distinguishes temporal credit from vocabulary routing, proposing GAE-OPD for log-ratio returns and Counterfactual Routed OPD (CR-OPD) for probability mass redirection. The study identifies bias boundaries for estimators, maps actionable diagnostics, and provides a reporting checklist for OPD variables.

on-policy distillation · temporal credit · vocabulary routing · gae-opd · counterfactual routed opd

The Origins of Stochasticity: Comprehensive Investigations on Uncertainty Quantification for Large Language Models

arXiv cs.AI · Xiang-Jun Ou, Shuang Liang, Xin-Yu Hu, Rong-Hao Huang · 2026-06-22

This paper introduces a granular uncertainty taxonomy for Large Language Models (LLMs), categorizing uncertainty into input-level, parameter-level, token-level, and decoding-process sources, and evaluates 21 Uncertainty Quantification (UQ) methods across Qwen3, Llama 3.2, and DeepSeek-V3 models on TriviaQA, GSM8K, and HumanEval benchmarks. The study employs Bayesian, ensemble, consensus-based, and single-pass UQ approaches, finding that consensus-based methods (Deg and EigV) outperform others, UQ effectiveness varies by task and generation settings, and larger models exhibit lower uncertainty estimates, suggesting a scaling law for LLM uncertainty.

uncertainty quantification · large language models · consensus-based methods · scaling law · granular taxonomy

Scaling Audio Models Efficiently: A Joint Study of Compute Constraints and Optimization Behavior

arXiv cs.AI · Vyom Agarwal, Mokshda Gangrade, Siddharth Pal, Jerry Wu · 2026-06-22

The paper presents a unified framework for analyzing compute-optimal scaling in speech models, examining tradeoffs between model size ($x_N$), input length ($x_T$), and representation resolution ($x_V$) under fixed budgets. Through systematic experiments on LibriSpeech (ASR) and CREMA-D (SER), the study reveals non-linear scaling: (1) diminishing returns from model scaling (8.22% vs 2.35% WER reduction), (2) optimal 4-second audio duration for SER, and (3) encoder token resolution reduction as an effective inference cost saver (5228→2572 GFLOPS for a <3% WER increase). LoRA-based adaptation enables efficient finetuning.

compute-optimal scaling · speech processing · representation resolution · lora-based adaptation · non-linear scaling

Learning Filters with Certainty

arXiv cs.AI · Yuval Banoun, Daniel Sadoc Menasche, Ori Rottenstreich · 2026-06-22

The paper proposes leveraging certainty estimates from Counting Bloom Filters (CBFs) to enhance machine learning pipelines, addressing uncertainty in hash-based data structures. By maintaining counters instead of binary bits, CBFs provide probabilistic membership indications that quantify collision-induced uncertainty. The authors demonstrate how these certainty signals can be integrated with ML models, though specific architectures or benchmarks are not detailed. The approach extends traditional Bloom filter applications in caching and anomaly detection by incorporating probabilistic confidence measures.

counting bloom filters · hash collisions · membership indication · machine learning pipelines · certainty estimation
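
For readers unfamiliar with the data structure, a Counting Bloom Filter can be sketched as below. This is a generic textbook-style version, not the paper's construction: the class name, the SHA-256-based hashing, and the use of the minimum counter as a raw signal are assumptions, and how the authors turn counters into a certainty estimate is not specified in the summary.

```python
import hashlib

class CountingBloomFilter:
    """Bloom filter with integer counters instead of bits, so items can be removed."""

    def __init__(self, num_counters=1024, num_hashes=3):
        self.m = num_counters
        self.k = num_hashes
        self.counters = [0] * num_counters

    def _indexes(self, item):
        # Derive k counter positions by salting one hash with the hash index.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for i in self._indexes(item):
            self.counters[i] += 1

    def remove(self, item):
        # Only meaningful for items that were previously added.
        if self.min_count(item) > 0:
            for i in self._indexes(item):
                self.counters[i] -= 1

    def min_count(self, item):
        """Smallest counter touched by the item; 0 means definitely absent."""
        return min(self.counters[i] for i in self._indexes(item))

    def might_contain(self, item):
        return self.min_count(item) > 0
```

A minimum counter of zero rules out membership, while a positive value only suggests it, since collisions with other items can inflate counters. That gap is the uncertainty the summary refers to.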

Explainable AI for Mental Health Prediction in Drug-Affected Populations with Dragonfly Algorithm and GAN Oversampling

arXiv cs.AI · Ahnaf Atef Choudhury, Shahriar Siddique Ayon, Md. Ebrahim Hossain, Abdullah Al Mamun · 2026-06-22

This study proposes an explainable AI framework for mental health prediction in drug-affected populations, combining PCA-Information Gain feature selection, GAN-based oversampling, and Dragonfly Algorithm-optimized XGBoost. The method addresses class imbalance and enhances interpretability via SHAP analysis, achieving 94.17% accuracy and 93.80% F1-score. Key predictive factors include sleep quality, physical health, and emotional regulation, with minimal demographic influence.

gan oversampling · dragonfly algorithm · shap analysis · xgboost · pca-information gain

GeoRouteNet: Geometry-Enhanced Non-Autoregressive Neural Solver for the Traveling Salesman Problem

arXiv cs.AI · Xiang Li · 2026-06-22

GeoRouteNet introduces a geometry-enhanced non-autoregressive neural solver for Euclidean TSP, addressing limitations in geometric inductive bias and training stability. The model combines centered node features, radial distance basis functions, distance-aware graph attention, LayerNorm-SwiGLU blocks, and cross-layer residual mixing, while training employs multi-candidate self-comparison RL with adaptive baselines and annealed entropy. On TSP50 (10k instances), it achieves a 0.32% optimality gap (Beam-1000); on TSPLIB EUC_2D, the gap drops from 17.12% to 3.60% versus prior NAR methods, with superior throughput to Concorde/LKH3. Ablations show geometric enhancements and MCS-RL are complementary for cross-distribution generalization.

non-autoregressive · euclidean tsp · graph attention · reinforcement learning · inductive bias

Noise is Signal: Density-Based Outliers as Leading Indicators of Occupational Emergence in Labor Market Text

arXiv cs.AI · Shreyash Rawat · 2026-06-22

The paper proposes that density-based outliers in occupational clustering, typically discarded as noise (10-15% of job postings), signal emerging occupations in rapidly evolving domains, formalized as the Emergence-Density Inversion (EDI) hypothesis. Testing EDI on 84,988 job postings across eight quarters (Q4 2022-Q3 2024), the authors find that high-EOS outlier groups transition to stable clusters in 1.4 ± 0.6 quarters versus 4.1 ± 1.2 for low-EOS groups (p < 0.001). Emerging occupations like Prompt Engineer and AI Safety Researcher, absent from O*NET, achieve coherence scores > 0.75 with 77% precision, forming stable clusters by Q1 2025.

occupational clustering · density-based outliers · emergence-density inversion · job postings · coherence score

Evolutionary Optimization Reveals Structural Constraints on Reservoir Architecture for Spatiotemporal Chaos

arXiv cs.AI · Nima Dehghani · 2026-06-22

Evolutionary optimization reveals structural constraints on reservoir architecture for spatiotemporal chaos prediction, demonstrating that architectural refinement stabilizes task-suitable dynamics. Using the Kuramoto-Sivashinsky equation as a testbed, reservoirs were evolved over five hyperparameters: size, connectivity degree, spectral radius, input scaling, and readout regularization. Results show reduced prediction error, extended forecast horizons, and organization along a size-efficiency frontier. Evolved reservoirs maintained a stochastic-block-model-like spectral envelope while refining low-eigenvalue modes and optimizing modularity and connection cost. Pareto analysis indicates joint optimization of accuracy and efficiency, highlighting interpretable structural constraints on recurrent substrates in evolutionary reservoir computing.

evolutionary optimization · spatiotemporal chaos · kuramoto-sivashinsky equation · reservoir computing · stochastic-block-model

AI Fiction in the Wild

arXiv cs.AI · Neel Gupta, Maria Antoniak, Melanie Walsh · 2026-06-22

This paper investigates how large language models (LLMs) transform fiction production and consumption through participatory narrative generation. Analyzing 500,000 anonymized ChatGPT-user conversations, the study finds that 34% involve fiction generation, primarily by power users engaging in repetitive narrative variations (termed 'infinite story demanders'). Dominant genres include fanfiction and erotica, with users favoring generic forms, repetition, and niche combinations. The authors theorize the emergence of 'solipsistic reader-writers' who interact solely with AI, and highlight LLMs' role in enabling interactive, pleasurable storytelling. Findings are contextualized within broader trends in self-publishing and personalized media.

large language models · fiction generation · solipsistic reader-writer · fanfiction · chatgpt-user interactions

GroundEval: A Deterministic Replacement for LLM-as-Judge in Stateful Agent Evaluation

arXiv cs.AI · Jeffrey Flynt · 2026-06-22

GroundEval introduces a deterministic framework for evaluating stateful agents by analyzing their evidence retrieval and reasoning trajectories, addressing limitations of LLM-as-judge approaches. The method employs domain-specific configurations to generate questions, tracks agent actions (search, fetch, cite), and scores both final answers and trajectories across three failure modes: Silence, Perspective, and Counterfactual. Case studies reveal significant discrepancies between LLM-judged scores (0.85+) and GroundEval scores (0.000) when agents rely on unretrieved evidence, demonstrating systematic blind spots in traditional evaluation methods.

stateful agent evaluation · evidence retrieval · deterministic scoring · trajectory analysis · llm-as-judge limitations
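
The gap the GroundEval summary describes, where a judge rewards an answer whose cited evidence was never retrieved, can be illustrated with a deterministic check over an action log. The log format (dicts with `type`, `doc`, `query` keys) and the all-or-nothing scoring are assumptions for this sketch, not GroundEval's actual schema or rubric.

```python
def grounded_score(actions, answer_correct):
    """Return 1.0 only if the answer is correct and every citation was fetched first.

    actions: ordered list of dicts such as
      {"type": "search", "query": "..."},
      {"type": "fetch", "doc": "d1"},
      {"type": "cite", "doc": "d1"}.
    """
    fetched = set()
    for action in actions:
        if action["type"] == "fetch":
            fetched.add(action["doc"])
        elif action["type"] == "cite" and action["doc"] not in fetched:
            return 0.0  # the agent cited evidence it never retrieved
    cited_any = any(a["type"] == "cite" for a in actions)
    return 1.0 if (answer_correct and cited_any) else 0.0
```

Because the check depends only on the recorded trajectory, the same log always yields the same score, which is the property that distinguishes it from an LLM judge.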

Closed-loop Auto Research for Molecular Property Prediction: Discovering and Certifying Generalizable Improvements

arXiv cs.AI · Jingjie Ning, Xiaochuan Li, Ji Zeng, Chenyan Xiong · 2026-06-22

Closed-loop Auto Research introduces a framework extending automated machine learning to dynamically modify research workflows, leveraging language-model agents to edit representations, model code, and acquire external evidence. The method isolates three axes—features, models, and external evidence—under a file-level ablation lock to attribute improvements across 36 molecular property prediction endpoints in TDC, Polaris, and MoleculeNet benchmarks. Results show held-out test gains of 0.013, 0.011, and 0.042, with transferable axes varying by benchmark. Curated external data significantly improves CYP2C9-substrate and half-life predictions by 0.17 and 0.08, respectively, while a contamination filter ensures generalization. The pipeline outperforms an 84M-parameter pretrained 3D model on shared training splits.

closed-loop auto research · molecular property prediction · file-level ablation · contamination filter · language-model agents

When Confidence Takes the Wrong Path: Diagnosing Retrieval-State Lock-In in RAG

arXiv cs.AI · Sahib Julka · 2026-06-22

The paper introduces retrieval-state lock-in, a failure mode in retrieval-augmented generation (RAG) systems where repeated samples condition on the same defective retrieval state, leading to spurious confidence estimates. The authors propose a diagnostic framework that disentangles answer surface, retrieved evidence, and retrieval state, evaluated in an ontology-guided KG-RAG system across six QA benchmarks. Results show 42% of KG-RAG errors and 59% of dense-retrieval errors exhibit zero answer dispersion, while a tripartite decision rule achieves 91.9% precision (100% in clinical domain) at 7.7% coverage.

retrieval-augmented generation · confidence estimation · knowledge-graph rag · error analysis · decision rule
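
One way to read "zero answer dispersion" in the summary is that every repeated sample returns the same answer. The sketch below flags the case where that agreement coincides with a single shared retrieval state, which makes agreement uninformative as a confidence signal. The dispersion formula and the `(answer, retrieval_state_id)` sample format are assumptions, and this is not the paper's tripartite decision rule.

```python
from collections import Counter

def answer_dispersion(answers):
    """1 minus the share of the most common answer; 0.0 means all samples agree."""
    top_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - top_count / len(answers)

def lock_in_suspected(samples):
    """samples: list of (answer, retrieval_state_id) pairs from repeated sampling.

    Agreement is suspicious when every sample was conditioned on the same
    retrieval state, because a defective state would be replicated in all of them.
    """
    answers = [answer for answer, _ in samples]
    states = {state for _, state in samples}
    return answer_dispersion(answers) == 0.0 and len(states) == 1
```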

Text Dictates, Music Decorates: Energy-based Attention for Editable Dance Motion Generation

arXiv cs.AI · Seong Jong Yoo, Siyuan Peng, Felix Gu, Stratis Aloimonos · 2026-06-22

The paper introduces STREAM, a modality-decoupled diffusion transformer for editable dance motion generation that resolves spatial-temporal conflicts between text and music conditioning. The method employs Adaptive Layer Normalization (AdaLN) for text-driven kinematic structure and a Bimodal Energy-Based Attention Module (BEAM) to route features to musical beats without semantic overwriting. Evaluated on the Motorica++ dataset with a novel Exchange Evaluation Protocol, STREAM achieves state-of-the-art music-motion alignment while preserving choreographic semantics, enabling precise artistic control.

diffusion transformer · adaptive layer normalization · bimodal energy-based attention · choreographic semantics · editable motion generation

Interpretable Uncertainty Routing Separating Emotion Ambiguity from Distribution Shift in Facial Expression Recognition

arXiv cs.AI · Keito Inoshita, Takato Ueno · 2026-06-21

The study introduces Uncertainty-Aware Routing (UAR) for facial expression recognition (FER), separating aleatoric (annotator disagreement) and epistemic (distribution shift) uncertainties via Deep Ensemble fine-tuned DINOv2 models. Aleatoric uncertainty correlates with human disagreement (Spearman 0.66, 95% CI: 0.64-0.68), while epistemic uncertainty detects image corruptions (AUROC 0.699). UAR retains 1.8× more ambiguous in-distribution faces than single-uncertainty routing at matched rejection rates, enabling interpretable action selection. A label-distribution baseline matches disagreement recovery but fails to separate uncertainty types.

facial expression recognition · aleatoric uncertainty · epistemic uncertainty · deep ensemble · dinov2
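
A common way to split ensemble uncertainty into aleatoric and epistemic parts is the entropy decomposition sketched below: total entropy of the averaged prediction, minus the average per-member entropy, leaves a disagreement term. Whether UAR uses exactly this decomposition is not stated in the summary, and the thresholds and route labels here are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decompose(member_probs):
    """Return (aleatoric, epistemic) for one input.

    member_probs: one class-probability list per ensemble member.
    aleatoric = mean entropy of the members,
    epistemic = entropy of the mean prediction minus the aleatoric term.
    """
    k = len(member_probs)
    num_classes = len(member_probs[0])
    mean = [sum(m[c] for m in member_probs) / k for c in range(num_classes)]
    aleatoric = sum(entropy(m) for m in member_probs) / k
    epistemic = entropy(mean) - aleatoric
    return aleatoric, epistemic

def route(member_probs, alea_thresh=0.5, epi_thresh=0.2):
    """Hypothetical routing rule built on the two uncertainty types."""
    aleatoric, epistemic = decompose(member_probs)
    if epistemic > epi_thresh:
        return "flag_distribution_shift"
    if aleatoric > alea_thresh:
        return "treat_as_ambiguous"
    return "accept"
```

Members that agree on a flat distribution produce high aleatoric and near-zero epistemic uncertainty, while members that are individually confident but disagree produce the opposite pattern.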

Subspace-Constrained Federated Learning with Low-Rank Adaptation

arXiv cs.AI · Neranjan Senarath, Rohit Muralitharan, Sadia Asif · 2026-06-21

The paper introduces Subspace-Reg, a subspace-regularized federated learning method for low-rank adaptation (LoRA) that addresses geometric misalignment in heterogeneous client data. The approach constrains local client updates to remain close to a shared global reference subspace, improving aggregation and convergence. Empirical evaluation on RoBERTa-large and SmolLM-360M models in a non-IID 10-client federated setting demonstrates that Subspace-Reg achieves superior mean best accuracy (0.454 ± 0.023), mean final accuracy (0.429 ± 0.011), and lowest final loss (1.363) on RoBERTa-large, outperforming FedAvg, SVD redistribution, and FedSVD baselines. Subspace-Reg also achieves near-perfect basis overlap (~0.9999) across models and seeds, validating the geometric alignment hypothesis.

federated learning · low-rank adaptation · geometric misalignment · non-iid · subspace regularization

Leakage-Aware Benchmarking of LLM Forecasting: Real-Time Nowcasts as the Decision-Time Input for Macro Factor Ranking

arXiv cs.AI · Mao Guan, Qian Chen · 2026-06-21

The paper introduces a leakage-controlled benchmarking framework for evaluating retrieval-augmented LLMs in real-time macroeconomic forecasting. Using a 7B-parameter open-source LLM pipeline, the system processes decision-time inputs (lagged FRED variables, macro-event summaries, and Cleveland Fed CPI nowcasts) to rank U.S. equity style factors. The method employs a macro-analog retrieval module, a critic LLM for rule compression, and an actor LLM for scoring, achieving a median monthly Spearman rank IC of +0.154. Results show comparable performance to kNN baselines, with LLMs demonstrating marginal benefits in extreme rankings for long-short portfolios.

retrieval-augmented llms · macroeconomic forecasting · spearman rank ic · decision-time inputs · long-short portfolios
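
The headline metric, a median monthly Spearman rank IC, is the median over months of the rank correlation between predicted factor scores and realized factor returns. A small stdlib-only sketch follows; the function names and the `(scores, realized_returns)` input layout are assumptions for illustration.

```python
import math
import statistics

def ranks(values):
    """1-based ranks, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation undefined for constant input; 0.0 by convention here
    return cov / math.sqrt(var_x * var_y)

def rank_ic(scores, realized_returns):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(scores), ranks(realized_returns))

def median_rank_ic(months):
    """months: list of (scores, realized_returns) pairs, one per month."""
    return statistics.median([rank_ic(s, r) for s, r in months])
```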

Beyond Penalizing Mistakes: Stabilizing Efficiency Training in Large Reasoning Models via Adaptive Correct-Only Rewards

arXiv cs.AI · Jungseob Lee, Seungyoon Lee, Seongtae Hong, Minhyuk Kim · 2026-06-21

The paper introduces ACOER (Adaptive Correct-Only Efficiency Reward), a method to stabilize efficiency training in large reasoning models by addressing reward collapse in Group Relative Policy Optimization (GRPO). ACOER isolates brevity bonuses to correct completions and employs dynamic budget normalization to prevent stochastic over-compression. Evaluated on mathematical reasoning benchmarks, ACOER improves accuracy while reducing token generation by over 60%, demonstrating robust efficiency-aware optimization.

group relative policy optimization · reward collapse · adaptive correct-only efficiency reward · dynamic budget normalization · efficiency-aware optimization
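
The "correct-only" idea in the ACOER summary can be sketched as a reward that grants a brevity bonus only to correct completions, with the length budget set from the current group. The specific formula below (budget as the mean length of correct completions, a clipped linear bonus, and the `bonus_weight` parameter) is an illustrative assumption and not the paper's reward definition.

```python
def correct_only_rewards(lengths, correct, bonus_weight=0.5):
    """Per-completion rewards for one sampled group.

    lengths: token counts of the completions.
    correct: matching list of booleans.
    Incorrect completions get 0.0 and never receive a brevity bonus,
    so being short cannot compensate for being wrong.
    """
    correct_lengths = [n for n, ok in zip(lengths, correct) if ok]
    if not correct_lengths:
        return [0.0] * len(lengths)
    # Hypothetical dynamic budget: mean length of the correct completions.
    budget = sum(correct_lengths) / len(correct_lengths)
    rewards = []
    for n, ok in zip(lengths, correct):
        if not ok:
            rewards.append(0.0)
            continue
        bonus = max(0.0, 1.0 - n / budget)  # positive only when shorter than the budget
        rewards.append(1.0 + bonus_weight * bonus)
    return rewards
```

In a GRPO-style setup these rewards would then be normalized within the group to form advantages; that step is left out here.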

Beyond Simpson's Paradox: A Cascade of Confounders in AI Agent Pull-Request Co-Authorship

arXiv cs.AI · Haoran Yu, Xiaochong Jiang, Lifei Liu, Su Wang · 2026-06-21

The study reveals a Simpson's Paradox in AI coding agent pull-request (PR) co-authorship: while aggregated data shows lower merge rates for human-co-authored PRs (53.8% vs. 79.8%), stratifying 33,596 PRs from AIDev by agent identity reverses this trend, with Copilot and Devin showing significant positive gaps (+41.2 and +33.5 pp). The paradox stems from compositional bias, as Codex dominates the dataset (64.9%) with high merge rates but rare co-authorship. Further controls for repository selection and PR structure (commit count, multi-commit PRs) eliminate all significant co-authorship effects, demonstrating these associations are confounded by selection and structural artefacts rather than causal benefits.

simpson's paradox · pull-request · co-authorship · confounders · stratification
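
The reversal described in the summary is easy to reproduce with toy numbers. The counts below are invented for illustration and are not taken from the AIDev dataset: one agent dominates the data, merges often, and is rarely co-authored, so pooling flips the sign of the within-agent gap.

```python
# Hypothetical (merged, total) PR counts per agent and authorship type.
counts = {
    "agent_a": {"solo": (810, 900), "coauthored": (95, 100)},
    "agent_b": {"solo": (30, 100), "coauthored": (200, 400)},
}

def merge_rate(merged, total):
    return merged / total

def pooled_rate(counts, group):
    """Merge rate for one authorship type after pooling all agents."""
    merged = sum(agent[group][0] for agent in counts.values())
    total = sum(agent[group][1] for agent in counts.values())
    return merge_rate(merged, total)

def stratified_gaps(counts):
    """Co-authored minus solo merge rate, computed separately per agent."""
    return {
        name: merge_rate(*agent["coauthored"]) - merge_rate(*agent["solo"])
        for name, agent in counts.items()
    }

pooled_gap = pooled_rate(counts, "coauthored") - pooled_rate(counts, "solo")
per_agent = stratified_gaps(counts)
```

With these toy counts the pooled co-authored rate (295/500) falls below the pooled solo rate (840/1000), yet co-authored PRs merge more often within each agent, which is the compositional effect the paper attributes to Codex's dominance.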

Libretto: Giving LLM Agents a Sense of Musical Structure

arXiv cs.AI · Yichen Xu · 2026-06-21

Libretto introduces an agent-facing framework for symbolic music generation and revision, addressing the challenge of inspecting and editing generative music outputs. The system employs an LLM-native grammar with explicit musical structure elements (onset slots, voices, bar-level organization) and evaluates pieces using corpus-calibrated statistical metrics across rhythm, harmony, melody, texture, form, and variation. Results demonstrate capabilities in gap filling, reference-guided generation, gradual morphing, and educational music generation, transforming symbolic music into a measurable and editable object for language-model agents.

symbolic music generation · llm-native grammar · corpus-calibrated metrics · structural axes · iterative self-revision

Safety-Aware Evaluation of LLM-Generated Driver Intervention Messages through Multi-Task Risk Fusion

arXiv cs.AI · Keito Inoshita · 2026-06-21

The paper introduces Driver Safety-Aware Intervention Score (DSAIS), a domain-specific metric for evaluating LLM-generated driver intervention messages across five dimensions, including risk-urgency alignment, cognitive load, and driver acceptability. The hybrid architecture combines rule-based computation with LLM Judge evaluation, integrating multi-task recognition outputs via risk fusion, state history management, and dynamic prompt construction. Experiments on the AIDE dataset show DSAIS achieves ICC 0.798-0.840, with multi-task integration improving contextual relevance by 9.1% over rule-based baselines, and compact local LLMs (7B-9B parameters) outperforming API-based models.

driver intervention · multi-task fusion · llm judge · contextual relevance · risk-urgency alignment

SCRUB-FL: Sanitizing and Cleansing Representations via Unlearning of Backdoors

arXiv cs.AI · Osama Wehbi, Sarhad Arisdakessian, Omar Abdel Wahab, Azzam Mourad · 2026-06-21

SCRUB-FL introduces a two-phase method for post-training backdoor removal in Federated Learning, addressing persistent vulnerabilities in converged global models. The approach combines client-side detection of suspicious samples via spectral analysis and activation clustering with server-side synthesis of trigger-approximating samples using aggregated WGAN-GP parameters. Machine unlearning is then applied to erase trigger-target associations by redistributing predictions uniformly. Evaluations on CIFAR-10 and GTSRB demonstrate SCRUB-FL reduces backdoor attack success rates to 3.88% while maintaining over 91% normal task accuracy, outperforming existing defenses without requiring prior trigger knowledge or large clean datasets.

federated learning · backdoor attacks · machine unlearning · wgan-gp · spectral analysis

VISTA Architect: A graph database-oriented health AI system demonstrated in multidisciplinary tumor boards

arXiv cs.AI · Tuomo Kiiskinen, Jason Fries, Philip Adamson, David Wu · 2026-06-21

VISTA Architect introduces a graph database-oriented AI system for integrating LLMs with longitudinal EHRs, addressing limitations of direct prompting and RAG. The architecture employs a two-layer approach: a provenance-preserving MEDS Graph and a Timeline Object Architecture (TOA) that uses graph-guided LLM extraction to synthesize deduplicated clinical timelines. Evaluated on 1,180 thoracic oncology patients, it achieved 96.4% accuracy (95% CI 96.1-96.7%) on tumor board-salient variables, outperforming BM25 RAG baselines. An agentic interface reduced preparation time to 2.2 minutes for a 30-patient cohort without accuracy loss. The modular design supports customization for other specialties.

knowledge graph · electronic health records · retrieval-augmented generation · longitudinal data · agentic interface

The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs

arXiv cs.AI · Shivam Ratnakar, Kartikeya Vats · 2026-06-21

The paper demonstrates that safety alignment in LLMs operates via a linear 'refusal direction' vulnerable to manipulation, introducing Contrastive Logit Steering (CLS) as a zero-optimization framework to probe this fragility. CLS contrasts hidden states from safe/unsafe prompts to isolate refusal vectors, achieving 95% attack success rates on Llama-3.1 and exposing architectural differences in safety implementation (e.g., 'Late Decision' vs. 'Early Divergence'). Results show CLS outperforms activation-level steering by 50.4% on Llama 2 and 11.8% on Qwen 7B, while bidirectional control enables defensive hardening via vector inversion.

contrastive logit steering · refusal direction · safety alignment · linear instability · jailbreak vulnerability

Only Ask What You Don't Know: Grounded Delta Planning for Efficient Multi-step RAG

arXiv cs.AI · Wei-Chieh Chou, Xuanjun Chen, Jian-Ren Lin, Claire Lin · 2026-06-21

The paper introduces Grounded Delta Planning RAG (GDP-RAG), a plan-based framework for efficient multi-hop question answering that targets only unresolved information gaps. The method employs (1) preliminary retrieval to ground planning, (2) gap-conditioned planning prompts to request missing information, and (3) skeletal trajectories pairing subqueries with evidence-carrying Thoughts. Evaluated on HotpotQA, 2WikiMultiHopQA, and MuSiQue, GDP-RAG achieves 60.63% accuracy (the highest among compared methods) with a cost-of-pass of 0.51, reducing costs by 22-68% compared to PAR-RAG and KnowTrace while maintaining superior accuracy.

retrieval-augmented generation · multi-hop qa · delta planning · skeletal trajectory · cost-of-pass

RigorBench: Benchmarking Engineering Process Discipline in Autonomous AI Coding Agents

arXiv cs.AI · Meher Bhaskar Madiraju, Meher Sai Preetam Madiraju · 2026-06-21

The paper introduces RigorBench, the first benchmark for evaluating process discipline in autonomous AI coding agents, moving beyond outcome correctness to assess engineering rigor. The benchmark measures five dimensions (Planning Fidelity, Verification Coverage, Recovery Efficiency, Abstention Quality, Atomic Transition Integrity) via 30 tasks across five categories, computing a composite RigorScore. Results show structured process discipline improves process quality by 41% and outcome correctness by 17%, demonstrating that coding methodology significantly impacts reliability. The benchmark suite and tools are released as open-source.

autonomous coding agents · engineering process discipline · rigorbench · verification coverage · atomic transition integrity

Skin-Deep: A Geometric Diagnostic for Alignment Fragility in Large Language Model Representations

arXiv cs.AI · Dongyub Jude Lee, Jungseob Lee, Seungyoon Lee, Seongtae Hong · 2026-06-21

The paper introduces Skin-Deep, a geometric diagnostic for detecting alignment fragility in large language models (LLMs) prior to deployment or adversarial intervention. The method computes a Geometric Fragility Score (GFS) from hidden-state activations, compressing layer-wise safety geometry into a scalar. Evaluated on 21 instruction-tuned models (3B–32B parameters) across six alignment recipes, Skin-Deep identifies a recurring low-rank safety subspace causally linked to harmful-request refusal. GFS successfully predicts which initially safe models retain refusal behavior after small-scale LoRA fine-tuning, serving as a pre-deployment diagnostic tool.

alignment fragility · geometric diagnostic · hidden-state activations · low-rank subspace · lora fine-tuning

Data Evolution by Wittgenstein's Rule Following

arXiv cs.AI · Aydin Ghojogh, Benyamin Ghojogh · 2026-06-21

The paper introduces Wittgenstein's Rule Following (WRF), a philomatics framework for evolving datasets by extrapolating implicit rules from historical sequences. WRF represents datasets via structural descriptors capturing geometric, distributional, clustering, and label-based properties, then predicts rule-following and family-resemblance targets through descriptor trajectory extrapolation and averaging. Candidate datasets are generated via balanced or bounded mixture recombination, scored against targets, and refined using differentiable optimization in descriptor space. WRF accommodates varying sample sizes and feature dimensions without assuming direct transformations between datasets. Experiments on synthetic and image data demonstrate WRF's ability to generate meaningful dataset continuations in unsupervised and supervised settings.

philomatics · structural descriptors · rule-following · family-resemblance · differentiable optimization

AgentLens: Interpretable Safety Steering via Mechanistic Subspaces for Multi-Turn Coding Agent

arXiv cs.AI · Weidi Luo, Qiming Zhang, Yihao Quan, Mingyu Jin · 2026-06-21

The paper introduces AgentLens, a white-box defense framework for multi-turn coding agents that detects and mitigates safety risks through mechanistic subspace intervention. By analyzing step-level hidden representations and intervening in a 10-dimensional subspace, AgentLens addresses evolving risk dynamics in agent execution. Evaluated on the Mechanistic Agent Safety (MAS) benchmark with LLaMA-3.1-8B, Qwen-2.5-7B, and Gemma-2-9B, the method demonstrates strong safety detection, lookahead risk anticipation, and significant reduction in harmful actions.

mechanistic interpretability · multi-turn coding agents · safety steering · hidden representations · subspace intervention

Confidently Wrong: Severity-Aware Calibration of Prompt-Injection Detectors under Attack Shift

arXiv cs.AI · Md Anas Biswas · 2026-06-21

The paper exposes a critical blind spot in prompt-injection detectors: when attacks shift from benchmark distributions, detectors exhibit severe miscalibration by confidently missing adversarial inputs (severity S=0.99-1.00) despite varying false-negative rates (0.01-0.97). The authors evaluate three detectors (ProtectAI-v2, Prompt-Guard-2 variants) under five distribution shifts, revealing unanimous failure on indirect behavior-hijack injections. Analysis traces this to content-keying biases, confirmed via instruction-tuned model evaluation and black-box rewriting attacks. Standard calibration metrics (e.g., 0.06 pooled error) fail to capture attack-specific miscalibration (0.91 error). Live model tests show missed injections yield functional exploits.

prompt-injection · distribution shift · severity metric · content-keying · miscalibration

Foundation Models for Epileptogenic Zone Identification in Drug-Resistant Epilepsy

arXiv cs.AI · Thi Kieu Khanh Ho, Thomas Lai, Petr Klimes, Jan Cimbalnik · 2026-06-21

EpiiSLM, a dual foundation model system, advances epileptogenic zone (EZ) identification in drug-resistant epilepsy by combining a signal foundation model trained on 104,990 minutes of stereo-electroencephalography (sEEG) recordings with a language foundation model integrating multimodal clinical data. The system anchors EZ biomarker extraction on non-epileptic signals and leverages all recordings regardless of surgical outcome. Evaluated under leave-one-patient-out, EpiiSLM achieved a 0.978 contact-level positive predictive value (PPV), surpassing the seizure onset zone-as-EZ baseline by 15.1%, and 100% region-level accuracy. On an external dataset, it maintained a 0.857 contact-level PPV. EpiiSLM requires only one night of interictal sleep data, potentially reducing invasive sEEG monitoring duration and improving surgical outcomes.

epileptogenic zonestereo-electroencephalographyfoundation modelpositive predictive valueinterictal sleep

Confident but Conflicted: Internal Uncertainty and Cognitive Dissonance Resolution in LLMs

arXiv cs.AI · Weihong Qi, Kristina Lerman · 2026-06-21

This work introduces Trust Elasticity (TE), an econometrics-inspired metric quantifying how readily large language models (LLMs) resolve cognitive dissonance when presented with conflicting evidence. The authors systematically vary persuasion attempts along source authority and evidence quality dimensions across 12 health-science claims, observing three dissonance resolution patterns: persuasion, backfire, and immunity. Across four LLMs, TE varies significantly, with clearly false claims eliciting near-zero TE universally. Analysis of two open-weight models reveals TE variation correlates with internal uncertainty indicators: Confidence Miscalibration in Qwen and Internal Uncertainty Change in Llama, linking behavioral variation to measurable internal properties.

trust elasticitycognitive dissonanceinternal uncertaintyconfidence miscalibrationpersuasion attempts

Orthogonal Representation Editing: Decoupling Semantic Entanglement in Batch Knowledge Editing of LLMs

arXiv cs.AI · Wenhao Yu, Zhicong Lu, Bo Lv, Fangyin Ma · 2026-06-21

Orthogonal Representation Editing (ORE) addresses semantic entanglement in batch knowledge editing for Large Language Models (LLMs) by constructing a general semantic subspace and enforcing orthogonal constraints on edit vectors. The method introduces a gated non-linear representation head to adaptively learn editing locations and precisely control knowledge injection. Extensive experiments demonstrate that ORE outperforms existing methods, achieving superior performance in cross-lingual knowledge editing scenarios. The code is publicly available.

orthogonal representation editingsemantic entanglementbatch knowledge editinggated non-linear representation headcross-lingual knowledge editing
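
One way to read the orthogonality constraint in the summary above is as a projection: remove from each edit vector whatever component lies in a shared semantic subspace. The sketch below shows that projection only. Whether ORE constrains edits against the subspace, against each other, or both is not stated in the summary, so this reading and the helper names `orthonormalise` and `project_out` are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalise(vectors, eps=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(edit, basis):
    """Remove the component of `edit` that lies in the span of an orthonormal `basis`."""
    w = list(edit)
    for b in basis:
        coeff = dot(w, b)
        w = [wi - coeff * bi for wi, bi in zip(w, b)]
    return w


# Hypothetical 3-dimensional example: two directions span the "semantic subspace".
subspace = orthonormalise([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
constrained_edit = project_out([3.0, -1.0, 2.0], subspace)
print(constrained_edit)
```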

Federated Learning for Global Carbon Emission Forecasting: A Hybrid Time-Series Approach with Statistical and Neural Models

arXiv cs.AI · Attia Qammar, Qazi Haseeb Yousaf, Ali Azam, Ammar Ahmed · 2026-06-21

A federated hybrid forecasting framework is proposed for global carbon emission prediction, integrating ARIMA-based trend modeling, GARCH-based volatility modeling, LSTM-Attention temporal representation learning, and XGBoost prediction within a privacy-preserving federated learning environment. The framework enables collaborative learning across distributed clients without raw data exchange, addressing privacy and data distribution challenges. Evaluation across 14 clients demonstrates strong performance, with average metrics of R²=0.73, RMSE=1.21, and MAPE=6.5%, indicating accurate and scalable carbon-emission forecasting compliant with regulatory constraints.

federated learningarimagarchlstm-attentionxgboost

SkillAudit: From Fixed-Suite Benchmarking to Skill-Centered Assessment

arXiv cs.AI · Dexu Yu, Youhua Li, Zhaoyang Guan, Xianhao Lin · 2026-06-21

The paper introduces SkillAudit, a framework for evaluating agent skills beyond fixed benchmarking suites. It generates multi-dimensional assessments (utility, efficiency/cost, safety) by constructing capability-aligned tasks from skill packages, executing them in sandboxed environments, and applying LLM-based verification. The method employs baseline comparison for utility/efficiency and combines static semantic analysis with dynamic runtime verification for safety. Evaluation of real-world skills across 23 categories revealed over 7% exhibit risky behaviors.

skillauditagent skillsllm-based judgingruntime verificationsemantic analysis

PaperClaw: Harnessing Agents for Autonomous Research and Human-in-the-Loop Refinement

arXiv cs.AI · Weiwei Ye, Hangchen Liu, Dongyuan Li, Renhe Jiang · 2026-06-21

PaperClaw is a multi-agent system for autonomous research, from literature curation to paper writing, featuring stoppable hypothesis testing and human-in-the-loop refinement. The system employs LLM-driven agents to curate domains, brainstorm ideas, and iteratively test hypotheses, maintaining a full-lifecycle memory for context preservation. Evaluation with an LLM judge shows PaperClaw generates strong papers both autonomously and with human refinement, while ensuring grounded outputs through validated references and reproducible results.

multi-agent systemautonomous researchhuman-in-the-loophypothesis testingllm-driven agents

On the Position Bias of On-Policy Distillation

arXiv cs.AI · Yan Xie, Sijie Zhu, Tiansheng Wen, Bo Chen · 2026-06-21

Importance-Weighted On-Policy Distillation (IW-OPD) addresses position bias in token-level supervision by dynamically weighting tokens based on accumulated distribution discrepancies between student and teacher models. The method derives from constrained optimization, upweighting earlier tokens and downweighting later ones with larger deviations. Empirical results demonstrate that IW-OPD converges faster and achieves better final performance than standard On-Policy Distillation (OPD), improving performance by up to 6.9 points on AIME-2025 in both same-size and cross-scale settings.

on-policy distillationtoken-level supervisionconstrained optimizationimportance weightingposition bias
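
The weighting idea in the summary above can be sketched as follows: a token's weight shrinks as student-teacher divergence accumulates over the tokens before it. The exponential form, the `beta` temperature, and the mean-1 normalisation are illustrative assumptions. The paper derives its weights from a constrained optimization problem whose exact solution is not given in the summary.

```python
import math


def position_weights(per_token_divergence, beta=1.0):
    """Weight token t by exp(-beta * divergence accumulated before t), normalised to mean 1.

    Assumes non-negative divergences (e.g. per-token KL), so weights never increase
    with position and drop fastest after tokens with large deviations.
    """
    if not per_token_divergence:
        return []
    weights, accumulated = [], 0.0
    for divergence in per_token_divergence:
        weights.append(math.exp(-beta * accumulated))
        accumulated += divergence
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]


def weighted_distillation_loss(per_token_loss, weights):
    """Average of per-token losses after applying the position weights."""
    return sum(l * w for l, w in zip(per_token_loss, weights)) / len(per_token_loss)


# Hypothetical per-token divergences for a 4-token student rollout.
divergences = [0.1, 0.5, 0.2, 0.9]
weights = position_weights(divergences)
print(weights, weighted_distillation_loss(divergences, weights))
```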

Text2DSL: LLM-Based Code Generation for Domain-Specific Languages

arXiv cs.AI · Alexander V. Kozachok, Alexander M. Nazimov, Shamil G. Magomedov · 2026-06-21

The paper formalizes Text2DSL, a novel task for generating domain-specific language (DSL) code from natural language descriptions, distinct from Text-to-SQL and general-purpose code generation. It introduces PolkitBench, a dataset of 4,204 verified natural-language-to-Polkit-rule pairs validated via an AST-based pipeline. Experiments on GigaChat-10B-A1.8B and Nemotron-3-Nano-30B-A3B demonstrate that structured context (BNF grammar, API specification, permitted identifier vocabulary) significantly improves syntactic validity (98.6-99.4%), structural validity (+9.7 to +35.5 pp), and CodeBLEU scores (+60% to +95%) across models.

text2dsldomain-specific languagepolkitbenchcodebleubnf grammar

From CVE to CWE: Syscall-Based HIDS Generalisation

arXiv cs.AI · Alexander V. Kozachok, Stanislav G. Vyugov, Shamil G. Magomedov · 2026-06-21

The study evaluates whether syscall-based host intrusion detection systems (HIDS) trained on CVEs sharing a Common Weakness Enumeration (CWE) class can generalize to unseen CVEs in the same class. Using LID-DS-2021 data across three CWE families (CWE-307, CWE-89, CWE-434), the authors extract 66-dimensional Peng-Guo features and train Isolation Forest and SGD One-Class SVM detectors with normal-only thresholds calibrated to fixed false positive rates. Results show CWE-307 achieves F1=0.6976 at FPR=0.05, while CWE-89 and CWE-434 perform poorly (F1≤0.21), revealing asymmetric cross-CVE transfer dominated by source profile breadth rather than CWE labels.

host intrusion detectionsystem-call tracescommon weakness enumerationanomaly detectionfeature extraction
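
The "normal-only threshold calibrated to a fixed false positive rate" step in the summary above can be sketched with an empirical quantile. The sketch assumes that higher scores mean more anomalous. That convention is an assumption: some libraries, including scikit-learn's `score_samples`, use the opposite sign. The function name and the toy scores are hypothetical.

```python
import math


def threshold_at_fpr(normal_scores, target_fpr):
    """Pick a threshold from normal-only scores so that at most `target_fpr`
    of those scores lie strictly above it (and would raise an alert)."""
    if not normal_scores:
        raise ValueError("need at least one normal score")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ordered = sorted(normal_scores)
    allowed_alerts = math.floor(target_fpr * len(ordered))
    return ordered[len(ordered) - allowed_alerts - 1]


def empirical_fpr(normal_scores, threshold):
    return sum(score > threshold for score in normal_scores) / len(normal_scores)


# Hypothetical anomaly scores from benign traces only.
calibration_scores = [i / 100 for i in range(100)]
threshold = threshold_at_fpr(calibration_scores, target_fpr=0.05)
print(threshold, empirical_fpr(calibration_scores, threshold))
```

Calibrating on normal data alone means the threshold never sees an attack, which is what makes the cross-CVE transfer question in the summary meaningful.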

Context-Aware Distillation and Ablation for Text2DSL

arXiv cs.AI · Alexander V. Kozachok, Alexander M. Nazimov, Shamil G. Magomedov · 2026-06-21

This work enhances Text2DSL by introducing context-aware distillation and conducting factorial ablation of structured context. The method replaces prompt-only synthetic generation with a teacher LLM (DeepSeek-V4-Flash) operating under structured context (BNF grammar, API specification, closed identifier vocabulary), verified via AST validation and runtime acceptance. This scales the PolkitBench corpus to 10,073 pairs with 100.0% AST validity and 99.7% runtime pass rate. Ablation on GigaChat-10B-A1.8B reveals structured context as load-bearing, with full context (C7) performing best. Vocabulary contributes most to semantic quality (+0.198), while API and BNF dominate structural validity (+24.7 pp and +22.3 pp, respectively).

context-aware distillationstructured contextast validationfactorial ablationsemantic quality

The Power of Light: Improving Synthetic-to-Real Domain Adaptation through Physically-Based Indirect Illumination

arXiv cs.AI · Hooman Tavakoli Ghinani, Tatjana Legler, Martin Ruskowski · 2026-06-21

This paper introduces SmartSDG, an automated pipeline leveraging Physically-Based Shading (PBS) in NVIDIA Isaac Sim, and ILLUM_INTRUCK, a novel industrial benchmark dataset, to study synthetic-to-real domain adaptation. Through 18 YOLOv12 experiments, the authors demonstrate that complex indirect lighting and domain-relevant background variability enhance visual cue richness, reduce false positives, and accelerate model convergence compared to direct-light synthetic data. The findings yield actionable guidelines for virtual scene design to improve object detection robustness in industrial automation.

synthetic-to-realphysically-based shadingdomain adaptationyolov12indirect illumination

Concept-Constrained Prompt Learning for Few-Shot CLIP Adaptation

arXiv cs.AI · Na Sang, Ding Ma, Rui Sang, Yuxuan Liu · 2026-06-21

The paper proposes Concept-Constrained Prompt Learning (CCPL), a regularization framework for few-shot CLIP adaptation that mitigates overfitting to base classes. CCPL anchors learnable class prompts to frozen concept prototypes via text-space cosine consistency, employs concept dropout for regularization, and optionally fuses class-prompt and concept-prototype logits during inference. Evaluated on DTD, EuroSAT, and OxfordPets with automatically-generated splits, CCPL improves base-to-new harmonic mean by +0.6 and +2.9 on DTD and EuroSAT respectively versus CoOp, while maintaining comparable performance on OxfordPets (-0.1). Ablations show text-space concept regularization consistently benefits performance, though optimal inference fusion strength varies by dataset.

few-shot learningprompt learningclip adaptationconcept regularizationtext-space alignment

Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

arXiv cs.AI · Zhuoran Jin, Kejian Zhu, Hongbang Yuan, Yupu Hao · 2026-06-21

This paper systematically evaluates multimodal Chain-of-Thought (CoT) reasoning across 12 tasks, revealing three key findings: (1) CoT benefits reasoning tasks (mathematical, scientific, multi-image) but harms perception tasks (visual grounding, object counting); (2) open-source multimodal reasoning models show marginal improvements due to overemphasis on mathematical reasoning; (3) models exhibit a 'Look Light, Think Heavy' pattern where visual reflection diminishes during reasoning. The study analyzes 22 models (14 non-reasoning, 8 reasoning) to demonstrate CoT's limitations in maintaining visual introspection despite strong verbal reflection.

multimodal chain-of-thoughtvisual reasoningperception tasksmathematical reasoningvisual introspection

MacAgentBench: Benchmarking AI Agents on Real-World macOS Desktop

arXiv cs.AI · Yikun Fu, Bowen Fu, Zhenyu Wu, Shuang Cheng · 2026-06-21

MacAgentBench introduces a comprehensive benchmark for evaluating computer use agents (CUAs) on macOS, addressing limitations of existing benchmarks by incorporating framework augmentation and multi-application tasks. The benchmark comprises 676 tasks across 25 applications, with 60% requiring GUI and CLI interaction, and employs deterministic rule-based evaluation with fine-grained multi-checkpoint scoring. Experiments with three frameworks and 16 models reveal that Claude Opus 4.6 on OpenClaw achieves 73.7% Pass@1, with performance primarily driven by skill libraries rather than framework design, while fine-grained metrics expose significant variations in sub-goal completion among models with similar Pass@1 scores.

computer use agentsmacos benchmarkmulti-application tasksfine-grained evaluationskill library

Training-Free Semantic Correction for Autoregressive Visual Models

arXiv cs.AI · Junhao Chen, Chanyu Zhu, Zheqi Lv, Keting Yin · 2026-06-21

The paper introduces Gazer, a training-free framework for semantic correction in autoregressive visual models (AVMs) that integrates multimodal large language model feedback during generation. Gazer employs a two-stage process: Reflective Diagnosis identifies semantic errors in intermediate states, while Semantic Correction rewinds and adjusts the generation trajectory to align with the target prompt. Evaluations on compositional image and video benchmarks show improved semantic alignment and accuracy across multiple AVMs without requiring additional training.

autoregressive visual modelssemantic correctionmultimodal feedbackin-generation diagnosistraining-free

Generative Robust Optimisation

arXiv cs.AI · Yuhui Yin, Vassilis M. Charitopoulos · 2026-06-21

Generative Robust Optimisation (GRO) introduces a framework where deep generative models define uncertainty sets via neural network decoders, capturing complex dependencies in real-world data. The approach employs a five-point evaluation framework (reconstruction fidelity, distribution matching, latent regularity, robust relevance, computational tractability) to assess neural network-based uncertainty sets systematically. A Wasserstein Adversarial Autoencoder with Gaussian mixture model-guided training and constraint-consistency regularization is instantiated, enabling exact worst-case verification through mixed-integer programming embedding. Experiments on production planning and multi-period facility location problems across six uncertainty distributions and six generative architectures validate GRO's ability to produce expressive, well-calibrated, and tractable uncertainty sets.

generative robust optimisationwasserstein adversarial autoencodermixed-integer programminguncertainty setsgaussian mixture model

Governance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agents

arXiv cs.AI · Shiyang Chen · 2026-06-21

The paper identifies Governance Decay as a novel safety failure mode in long-horizon LLM agents, where context compaction silently removes in-context governance constraints, leading to prohibited tool actions. Using ConstraintRot, a benchmark with deterministic tool-call grading, the authors measure compaction-induced violations across seven model families (1,323 episodes), showing violations rise from 0% to 30% (peaking at 59%) when constraints are omitted. They demonstrate a Compaction-Eviction Attack where adversarial content biases summarizers to drop policies, and propose Constraint Pinning, a training-free mitigation that reduces violations to 0%.

governance decaycontext compactionllm agentsconstraintrotconstraint pinning
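
A minimal sketch of the pinning idea from the summary above: when older context is compacted into a summary, messages marked as governance constraints are carried over verbatim instead of being summarised. The `Message` type, the `pinned` flag, and the `compact` signature are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    role: str
    text: str
    pinned: bool = False  # hypothetical flag marking a governance constraint


def compact(history, keep_last, summarise):
    """Summarise older unpinned messages; keep pinned ones and the recent tail verbatim."""
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    cut = max(len(history) - keep_last, 0)
    old, recent = history[:cut], history[cut:]
    pinned = [m for m in old if m.pinned]
    droppable = [m for m in old if not m.pinned]
    compacted = list(pinned)
    if droppable:
        compacted.append(Message("system", summarise(droppable)))
    return compacted + recent


policy = Message("system", "Never call delete_file.", pinned=True)
history = [policy] + [Message("user", f"step {i}") for i in range(5)]
result = compact(history, keep_last=2, summarise=lambda ms: f"{len(ms)} earlier messages")
print([m.text for m in result])
```

The summariser here is a stand-in callable. In an agent it would be an LLM call, which is exactly the component the summary says an adversary can bias, so the pinned messages bypass it entirely.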

Fed-CausalDiff: Decoupled Synchronization for Federated Do-Simulation and Policy Evaluation

arXiv cs.AI · Pengfei Li, Mohammad Khalil · 2026-06-21

The paper introduces Fed-CausalDiff, a federated causal diffusion framework for do-simulation and policy evaluation. It addresses limitations of observational federated learning by decomposing latent state evolution into global causal and local confounding score functions, enabling decoupled synchronization (DSS). This approach allows clients to aggregate shared causal mechanisms while retaining site-specific confounders locally. Evaluations on four datasets show improved average treatment effect (ATE) and policy-value estimation accuracy, with optimized communication cost and inference fidelity trade-offs.

federated learningcausal diffusiondo-simulationdecoupled synchronizationpolicy evaluation

Imagine to Ensure Safety in Hierarchical Reinforcement Learning

arXiv cs.AI · Gregory Gorbov, Artem Latyshev, Aleksandr I. Panov · 2026-06-21

The paper proposes a hierarchical reinforcement learning method for safe exploration in long-horizon tasks, addressing compounding estimation errors and restricted exploration in existing approaches. The method combines a learnable world model with dual policies: a high-level policy generating safe subgoals and a low-level policy using imagined rollouts to minimize unsafe behaviors. Evaluated on navigation and manipulation tasks with high-dimensional action spaces, it outperforms Safe RL baselines in success rate (quantitative improvement unspecified) and consistently meets safety constraints across seeds, where prior methods fail.

hierarchical reinforcement learningsafe explorationworld modelsubgoal generationlong-horizon tasks

Enabling Cloud-Level Accuracy in Edge AI through IoT Data Preprocessing

arXiv cs.AI · Aygün Varol, Katarzyna Kołodziej, Łukasz Sobczak, Michał Romaszewski · 2026-06-21

The paper demonstrates that structured prompt preprocessing narrows the accuracy gap between local and cloud LLMs for IoT environmental monitoring. A framework transforms raw sensor data (air-quality, thermal-comfort) into three textual representations (raw values, threshold-aware descriptions, summary flags), evaluated on Raspberry Pi/BME680 and European city datasets. Testing five local and five cloud LLMs across prompt variants and inference modes shows enriched prompts boost local-model accuracy from 50.9% to 81.7% (indoor) and 63.7% to 89.3% (outdoor) in No-CoT mode, with 0.22s mean latency versus slower CoT alternatives.

edge aiprompt engineeringiot analyticsnumerical reasoningchain-of-thought
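
The three textual representations named in the summary above (raw values, threshold-aware descriptions, summary flags) might be produced along the following lines. The field names and the comfort bands in `BANDS` are invented for illustration and are not the thresholds used in the paper.

```python
# Hypothetical acceptable ranges per sensor field: (low, high).
BANDS = {"temperature_c": (20.0, 26.0), "humidity_pct": (30.0, 60.0)}


def classify(name, value):
    low, high = BANDS[name]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "ok"


def raw_values(reading):
    """Representation 1: the numbers as-is."""
    return ", ".join(f"{name}={value}" for name, value in reading.items())


def threshold_aware(reading):
    """Representation 2: each number stated relative to its band."""
    wording = {"low": "below", "high": "above", "ok": "within"}
    parts = []
    for name, value in reading.items():
        low, high = BANDS[name]
        parts.append(f"{name} is {value}, {wording[classify(name, value)]} the {low}-{high} band")
    return "; ".join(parts)


def summary_flags(reading):
    """Representation 3: one categorical flag per field."""
    return {name: classify(name, value) for name, value in reading.items()}


reading = {"temperature_c": 28.5, "humidity_pct": 45.0}
print(raw_values(reading))
print(threshold_aware(reading))
print(summary_flags(reading))
```

The comparison against thresholds happens in deterministic code, so a small local model only has to read the result, which is one plausible reason the enriched prompts help.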

Grounded Scaling: Why Agentic AI Needs Deterministic Environments

arXiv cs.AI · Liang Ding, Xintong Wang · 2026-06-21

The paper introduces grounded scaling as a framework for agentic AI, arguing that deterministic environments are crucial for reliable long-chain task execution. It formalizes three key results: a Determinism-Efficiency Bound showing exponential decay in success rates (δ^k for k-step chains), a Verifier-Goodharting Floor on reward optimization limits, and convergence conditions for environment-side skill evolution. The authors propose a Supply Certainty Index (SCI) with five measurable properties, a five-level Determinism Maturity Model (DMM), and a falsifiable research program. The work engages competing positions on sim-to-real transfer, alignment, and AI-as-normal-technology.

agentic aideterminism-efficiency boundsupply certainty indexverifier-goodharting floordeterminism maturity model
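
The δ^k decay mentioned in the summary above is easy to tabulate. The sketch assumes each step succeeds independently with the same probability δ. That independence reading is an assumption about how the bound is derived, and the function name is hypothetical.

```python
def chain_success(delta, k):
    """Probability that all k steps succeed when each succeeds with probability delta."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must be a probability")
    if k < 0:
        raise ValueError("k must be non-negative")
    return delta ** k


# Even a 99%-reliable step compounds quickly over long chains.
for steps in (1, 10, 50, 100):
    print(steps, round(chain_success(0.99, steps), 3))
```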

Deep Learning-Based Sign Language Recognition from Videos and Cross-Lingual Translation to Indian Vernaculars

arXiv cs.AI · Chandranath Adak, Ramesh Nandipalli · 2026-06-21

The authors present a two-stage pipeline for Indian sign language recognition and cross-lingual translation, combining video classification with multilingual text generation. Stage one fine-tunes VideoMAE on 16-frame 224x224 clips from AI4Bharat's 13-class dataset (197 clips), achieving 78% validation accuracy despite confusable adjective pairs. Stage two translates English predictions into Hindi, Telugu, and Bengali using NLLB-200. The system demonstrates limitations in isolated-word recognition and single-signer sensitivity, with released code supporting future expansion to continuous signing and larger vocabularies.

videomaenllb-200sign language recognitioncross-lingual translationai4bharat

An LLM-Orchestrated Agent for Directional-Coupler Design with Self-Consistent Eigenmode and FDTD Validation

arXiv cs.AI · Saumya Biswas, Amrit De, Md Tauhidul Islam · 2026-06-21

The paper introduces an LLM-orchestrated agent for designing silicon-on-insulator 2×2 directional couplers, leveraging deterministic solvers for physics computations. The LLM proposes candidate gap values and assesses convergence, while a frequency-domain eigenmode solver estimates the coupling coefficient κ and an FDTD solver validates it. Using a 2D effective-index model, the design achieves self-consistency with a residual phase offset φ, corrected via a closed-loop length adjustment. The agent produces a 50/50 splitter with an FDTD-measured cross fraction of 0.498 (target 0.500), demonstrating a residual of 0.0017.

directional couplerlarge language modelfinite-difference time-domaineffective-index modelsilicon-on-insulator

SCOPE: Evolving Symbolic World for Planning in Open-Ended Environments

arXiv cs.AI · Yundaichuan Zhan, Minghe Gao, Zhongqi Yue, Wendong Bu · 2026-06-21

SCOPE introduces a self-adaptive symbolic planning framework for open-ended environments, addressing incomplete symbolic representations from perception. The framework combines a Symbolic Execution Simulator (SESim) for plan refinement and symbolic world evolution through validation and execution feedback, and a Self-Adaptive Symbolic Memory (SASMem) for distilling feedback into evolving symbolic knowledge. Experiments demonstrate SCOPE's improvements in symbolic world completeness, plan success rates under perturbations, and cross-task adaptability in diverse embodied scenarios.

symbolic planningvision-language modelsopen-ended environmentssymbolic executionself-adaptive memory

Human and AI collaboration for pulmonary nodule segmentation

arXiv cs.AI · Hongqiao Dong, Wenhao Chi, Ruobing Liang, Xiaokui Yang · 2026-06-21

The study introduces Hi-Seg, a human-in-the-loop segmentation framework for pulmonary nodules built on the Segment Anything Model (SAM), enabling iterative refinement of prompts through trial-and-error learning and semantic reasoning. Evaluated on chest CT scans from 1,179 patients across 12 centers, Hi-Seg achieved a mean Dice score of 85%, outperforming five deep learning models by 10-22% and 13 SAM variants by 1-29%. The framework reduced annotation time for medical professionals and enabled non-medical annotators to achieve performance comparable to junior medical students, suggesting potential for scalable crowdsourced annotation and clinical workflow integration.

human-in-the-looppulmonary nodule segmentationsegment anything modeldice scoreiterative refinement

VADAOrchestra: Neurosymbolic Orchestration of Adaptive Reasoning Workflows

arXiv cs.AI · Teodoro Baldazzi, Luigi Bellomarini, Andrea Coletta, Michela Iezzi · 2026-06-21

VADAOrchestra introduces a neurosymbolic framework for adaptive workflow orchestration, combining LLM-based planning with symbolic reasoning via Datalog+/-. The system dynamically constructs and modifies workflows in response to evolving context, using LLMs for high-level orchestration while offloading logical inference to a scalable symbolic engine. This hybrid approach ensures verifiable traces and auditability while maintaining flexibility. Evaluations on financial use cases demonstrate improvements in faithfulness, scalability, and explainability over standard agentic architectures.

neurosymbolicorchestrationdatalog+/-llmworkflow

All Green, Still Broken: Real-Flow Verification Lessons from an LLM-Integrated, Multi-Market Web Application

arXiv cs.AI · Muhammad Bilal, Ali Hassaan Mughal · 2026-06-21

The paper introduces a four-seam framework for identifying testing gaps in LLM-integrated, multi-market web applications, based on empirical analysis of 252 bug-fix commits in a production rental-search assistant. The authors studied a system with 1,553 test cases that passed continuously yet allowed user-facing defects to reach production. They classified fixes by the boundary they escaped through, finding 44% of defects occurred in four untestable seams: live browser runtime, non-default market, end-to-end flow, and whole-system level. The study highlights the limitations of component-level unit testing and proposes practices for identifying critical seams.

llm-integrated systemsmulti-market testingfour-seam frameworkcomponent-level unit testsend-to-end flow

Not All Claims Are Equally Risky: FACTOR for Adaptive Verification in Factual Long-Form Generation

arXiv cs.AI · Areeba Hassan, Arooj Kausar, Syeda Kisaa Fatima, Gibrail Islam · 2026-06-21

FACTOR (FACTuality-Oriented Risk-aware Verification) introduces adaptive verification for factual long-form generation by LLMs, dynamically adjusting verification criteria based on claim-level uncertainty. The method integrates uncertainty estimation, adaptive language inference verification, and candidate re-ranking to optimize verification effort allocation. Evaluated on FactScore, FACTOR improves factuality while reducing verification costs, with ablation studies identifying key performance drivers. Results demonstrate its effectiveness and model-agnostic applicability.

factual verificationuncertainty estimationlong-form generationadaptive inferencelanguage models

PRIME: Evaluating Prompt Resolution Under Incompatible Instructions in LLMs

arXiv cs.AI · Tehreem Javed, Shumaim Fatimah, Masooma Bakhtiari, Gibrail Islam · 2026-06-21

PRIME (Prompt Resolution under Incompatible Meta-Instructions Evaluation) is a framework for analyzing LLM behavior under conflicting instructions, addressing limitations of isolated instruction-following benchmarks. It generates calibrated conflicts across response length, output format, and reasoning, and classifies responses using a deterministic behavioral taxonomy. The authors evaluate five instruction-tuned open-weight LLMs in balanced and naturally distributed settings, finding that conflict type affects behavior more than model scale does, with distinct failure modes across conflict categories. The results highlight the importance of conflict awareness and indicate that LLM instruction-following cannot be assessed through isolated constraints alone.

prompt resolutionmeta-instructionsbehavioral taxonomyconflict awarenessinstruction-tuned

CASPER in the Machine: Insights into Character Variety in LLM-Generated Stories

arXiv cs.AI · Anneliese Brei, Abhisheik Sharma, Nicholas Sanaie, Lu Wang · 2026-06-21

This study investigates character variety in LLM-generated versus human-written stories using narratological dimensions like stylization and wholeness. The authors analyze eight character dimensions across both story types, employing automated categorization techniques. Key findings reveal both similarities and differences in character portrayal, addressing whether LLMs produce diverse characters comparable to human authors.

llm-generated storiescharacter varietynarratologystylizationwholeness

AutoDex: An Automated Real-World System for Dexterous Grasping Data Collection

arXiv cs.LG · Mingi Choi, Gunhee Kim, Jisoo Kim, Taeksoo Kim · 2026-06-22

AutoDex introduces an automated system for scalable real-world dexterous grasping data collection, addressing limitations of teleoperation and simulation-based methods. The system integrates dense 20-camera perception for object localization under occlusion, collision-monitored robot execution, physical outcome labeling, and active object resetting to expose diverse grasp candidates. It collects 3,593 grasp trials across Allegro and Inspire hands on 100 objects, achieving a 4.8x throughput improvement over teleoperation (10.3h vs. 49.4h for 500 trials). Grasps validated by AutoDex succeed 76% of the time, compared to 34% for simulation-only validation, demonstrating its efficacy in generating physically reliable grasping datasets.

dexterous graspingteleoperationcollision-monitoredmulti-view perceptionphysical validation
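
The throughput figure in the summary above follows directly from the two reported durations for 500 trials, as this short check shows. Only the variable names are introduced here.

```python
# Hours needed for 500 grasp trials, as reported in the summary.
teleoperation_hours = 49.4
autodex_hours = 10.3

speedup = teleoperation_hours / autodex_hours
trials_per_hour_autodex = 500 / autodex_hours
trials_per_hour_teleop = 500 / teleoperation_hours

print(f"speedup = {speedup:.2f}x")
print(f"AutoDex: {trials_per_hour_autodex:.1f} trials/h, teleoperation: {trials_per_hour_teleop:.1f} trials/h")
```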

On the Limits of Prompt-Conditioned Language Models as General-Purpose Learners

arXiv cs.LG · David Mguni, Julian Ma, Jun Wang · 2026-06-22

The paper establishes fundamental limitations of prompt-conditioned LLMs as general-purpose learners through game-theoretic analysis. Modeling user-system interaction as a bilevel cheap-talk game, it demonstrates: (1) an expressivity floor where task complexity exceeds language's capacity as a communication channel, creating indistinguishable tasks; (2) an objective-misalignment floor where alignment constraints prevent ideal output distributions. PAC-Bayes bounds quantify irreducible errors persisting despite data or model scaling. Results prove LLMs cannot universally solve all tasks via prompting alone due to information-constrained communication and alignment-constrained objectives, suggesting multimodal interfaces may mitigate these limitations.

prompt-conditionedbilevel gamepac-bayesexpressivity floormisalignment floor

MAS-PromptBench: When Does Prompt Optimization Improve Multi-Agent LLM Systems?

arXiv cs.LG · Juyang Bai, Laixi Shi · 2026-06-22

The paper systematically evaluates prompt optimization in multi-agent LLM systems (MAS), addressing when and how it improves performance across diverse configurations. Using two state-of-the-art single-agent prompt optimizers, the study varies task, workflow, communication protocol, and team size to analyze gains. Results show significant potential for performance improvement but highlight challenges from the exponentially growing search space in MAS setups.

multi-agent systemsprompt optimizationllm-based agentssystem promptssearch space

Action-BED: Task-Driven Bayesian Experimental Design with Singly Intractable Objectives

arXiv cs.LG · Tom Rossa, Angus Phillips, Tom Rainforth · 2026-06-22

The paper introduces Action-BED, a task-driven Bayesian experimental design framework that reformulates traditional doubly intractable objectives into singly intractable expected future loss (EFL) objectives. The method jointly optimizes design and action policies using stochastic gradients, requiring only sampling from the joint model and evaluation of downstream loss functions. This approach eliminates explicit posterior estimation, improves optimization efficiency, and enables easy customization for diverse downstream tasks.

bayesian experimental designsingly intractable objectivesexpected future lossstochastic gradientstask-driven optimization

Dynamic estimation of slowly varying sequences

arXiv cs.LG · Prashant Gokhale, Mikhail Khodak, Sandeep Silwal · 2026-06-22

The paper introduces a general-purpose framework for sequentially approximating functions of slowly-varying sequences, extending implicit trace estimation to diverse linear and nonlinear operations across vector spaces. The proposed algorithm dynamically scales the estimation budget with the sequence variation rate α_t, achieving sharper path-length-style bounds of O(∑α_i) compared to prior O(m·max α_i) guarantees. The framework enables sequential estimation for matrix powers, spectral densities, Monte Carlo integration, and PDE boundary value problems, and in certain cases supports on-the-fly estimation of α_i with minimal overhead. This advances dynamic trace estimation by making sequential approximation adaptive and general-purpose.

sequential approximationimplicit trace estimationpath-length-style boundsdynamic scalingslowly-varying sequences

Muown Implicitly Performs Angular Step-size Decay

arXiv cs.LG · Florian Hübler, Kai Lion, Antonio Orvieto, Niao He · 2026-06-22

AngularMuown, a novel optimizer derived from Muown, explicitly decouples angular step size from radial magnitude updates by optimizing directly over normalized directions. The method leverages Riemannian steps for directional updates and introduces a schedulable angular multiplier, addressing the implicit angular step-size decay observed in Muown. Empirical results demonstrate that AngularMuown outperforms Muown, achieving state-of-the-art performance in the nanoGPT speedrunning competition and scaling effectively to larger models, including Qwen2-0.5B and 1.1B parameter mixture-of-experts architectures. Implementation details are publicly available.

angular step sizeriemannian optimizationnormalized directionsmatrix-aware optimizersmixture-of-experts

Diffusion Models Adapt to Low-Dimensional Structure Under Flexible Coefficient Choices

arXiv cs.LG · Changxiao Cai, Yuchen Jiao, Gen Li · 2026-06-22

The paper demonstrates that diffusion models robustly adapt to low-dimensional data structures across a broad class of update coefficients, independent of ambient dimension. The authors prove that Õ(k/ε) iterations suffice to generate an ε-accurate sample in total variation distance, generalizing prior convergence theory beyond narrowly prescribed coefficient choices. This framework encompasses several practical diffusion samplers, providing theoretical justification for their empirical effectiveness on structured high-dimensional data.

diffusion models · low-dimensional structure · total variation distance · convergence theory · sampling complexity

Hedgementation = Hedgerow Segmentation: A Remote Sensing Benchmark

arXiv cs.LG · Nathan Senyard, Salem Hamdani, Astrid Zhang, Derek Wang · 2026-06-22

Hedgementation introduces a novel benchmark for evaluating machine learning models in hedgerow mapping from remote sensing data at country scale with 10m² spatial resolution. The benchmark combines harmonized remote sensing data and ground truth labels from a French hedgerow inventory, assessing model generalization across spatial distances and climatic zones. Three baseline models are evaluated for their ability to track fine-scale agricultural features using both supervised and self-supervised learning approaches. The benchmark and baseline results are reproducible via an open-source code repository.

hedgerow mapping · remote sensing · self-supervised learning · spatial resolution · generalization

MORL-A2C: Multi-Objective Reinforcement Learning Reranker for Optimizing Healthiness in MOPI-HFRS

arXiv cs.LG · Aarya Vasantlal, Joshua Zolla · 2026-06-22

MORL-A2C extends MOPI-HFRS by introducing sequential decision-making for health-aware food recommendation, addressing limitations of static tradeoff solutions. The method employs an Advantage Actor-Critic algorithm with scalarized rewards, warm-started via behavior cloning from frozen GNN embeddings, to perform K-step reranking. Results show a trade-off: while Recall@20 and NDCG@20 decrease by 2.03pp and 2.88pp respectively, H-Score@20 improves by 23.52pp, demonstrating effective health-preference optimization.

multi-objective reinforcement learning · advantage actor-critic · health-aware recommendation · gnn embeddings · behavior cloning

Neural Networks as Linear Regression: An Introduction for Statisticians

arXiv cs.LG · Abigail Loe, Susan Murray, Zhenke Wu · 2026-06-22

The article bridges the gap between classical statistics and neural networks by demonstrating how neural networks can approximate linear regression models. It targets statisticians with a frequentist background, aiming to lower the entry barrier to neural network methodologies. The authors describe foundational network architectures and common customizations, providing a conceptual framework for further exploration. This approach facilitates understanding of neural networks through familiar statistical concepts, emphasizing their utility as prediction tools.

neural networks · linear regression · frequentist · prediction · statistics

Quantifying the Agreement Between Data-Influence and Data-Similarity to Understand LLM Behavior

arXiv cs.LG · Christopher J. Anders, Henrique Da Silva Gameiro, Nico Daheim, Mohammad Emtiyaz Khan · 2026-06-22

This work quantifies the agreement between data-similarity and data-influence measures for tracing LLM outputs to training data, addressing a gap in comparative analysis. By ranking training documents using both measures and computing rank overlap, the study reveals significant agreement but an asymmetry: top data-similarity documents exhibit more consistent ranks in data-influence than vice versa. Experiments on OLMo2-1B, Qwen3-1.7B, Llama3.2-1B, Gemma3-1B, and GPT2 validate this finding. The asymmetry is leveraged to optimize cost-accuracy trade-offs by refining data-similarity results with data-influence.

data-similarity · data-influence · rank overlap · llm behavior · cost-accuracy trade-off
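
The rank-overlap comparison described above can be illustrated with a small sketch. The function names, the top-k overlap measure, and the toy scores are assumptions made for illustration; the paper's exact metric and data are not given in this digest.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring documents, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def overlap_at_k(scores_a, scores_b, k):
    """Fraction of documents shared by the two top-k lists."""
    return len(set(top_k(scores_a, k)) & set(top_k(scores_b, k))) / k


def mean_rank_in_other(scores_a, scores_b, k):
    """Average rank (0 = best) that ranking b assigns to the top-k of ranking a."""
    order_b = top_k(scores_b, len(scores_b))
    rank_b = {doc: rank for rank, doc in enumerate(order_b)}
    return sum(rank_b[doc] for doc in top_k(scores_a, k)) / k


# Hypothetical per-document scores for one query (not from the paper).
similarity = [0.9, 0.8, 0.1, 0.2, 0.3]
influence = [0.7, 0.1, 0.9, 0.6, 0.2]

print(overlap_at_k(similarity, influence, k=2))
print(mean_rank_in_other(similarity, influence, k=2))
print(mean_rank_in_other(influence, similarity, k=2))
```

Comparing `mean_rank_in_other` in both directions is one simple way to expose an asymmetry of the kind the summary reports, since the overlap score alone is symmetric.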

It's Much Easier for Neural Networks to learn Game of Life Dynamics with the Right Activation Function: Polynomial Kolmogorov-Arnold Networks

arXiv cs.LG · Tashin Ahmed, Q. Tyrell Davis · 2026-06-22

The paper demonstrates that neural networks with tailored activation functions can learn Conway's Game of Life dynamics more efficiently than standard ReLU networks. By shifting focus from search-based initialization to inductive bias alignment, the authors show that 2nd-degree polynomial activations enable minimal networks to consistently learn the rules, even without weight optimization. Results challenge the default practice of scaling network size and advocate cellular automata as testbeds for physics-informed and interpretable ML. Performance improvements are quantified across alternative activation functions.

cellular automata · inductive bias · polynomial activation · game of life · interpretable machine learning
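
To see why a degree-2 polynomial is a natural fit for this task, note that the Game of Life rule can be written as one linear combination followed by a quadratic and a threshold. The sketch below is a hand-constructed illustration of that fact, not the architecture or training setup from the paper; the wrap-around boundary and all function names are assumptions.

```python
import numpy as np


def neighbor_count(grid):
    """Sum of the 8 neighbours of every cell on a wrap-around (toroidal) grid."""
    total = np.zeros_like(grid)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if (dy, dx) != (0, 0):
                total += np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
    return total


def life_step_quadratic(grid):
    """One Game of Life step via a linear map, a quadratic, and a threshold."""
    # Linear part: n = 2 * (number of live neighbours) + (cell state).
    n = 2 * neighbor_count(grid) + grid
    # A cell is alive next step exactly when n is 5, 6, or 7,
    # which is the same as (n - 6)^2 <= 1.
    return ((n - 6) ** 2 <= 1).astype(grid.dtype)


def life_step_reference(grid):
    """Textbook rule, used only to check the quadratic form."""
    nb = neighbor_count(grid)
    born = (grid == 0) & (nb == 3)
    survive = (grid == 1) & ((nb == 2) | (nb == 3))
    return (born | survive).astype(grid.dtype)


rng = np.random.default_rng(0)
grid = rng.integers(0, 2, size=(12, 12))
print(np.array_equal(life_step_quadratic(grid), life_step_reference(grid)))
```

The encoding works because a live cell with 2 or 3 neighbours gives n = 5 or 7, and a dead cell with 3 neighbours gives n = 6; every other combination falls outside that window.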

Patient-Aware Contrastive Learning Preserves Per-Patient Structure in RR-Interval Representations

arXiv cs.LG · Yasantha Niroshana, Weijith Wimalasiri, Chathuranga Hettiarachchi · 2026-06-22

The paper introduces a patient-aware contrastive learning objective for RR-interval sequence analysis that preserves per-patient sinus rhythm baselines while improving atrial fibrillation detection. By forming positive pairs exclusively from same-patient, same-class segments, the method maintains individual physiological variation critical for cross-patient generalization. Evaluated on the IRIDIA-AF dataset, the approach achieves 0.989 AUROC with 2.6× lower variance than supervised contrastive baselines, demonstrating superior per-patient structural cohesion (0.850 vs 0.800 for SupCon) while avoiding the over-separation of binary cross-entropy that harms unseen-patient performance.

contrastive learning · rr-interval · patient-aware · paroxysmal atrial fibrillation · generalization

SVD-Surgeon: Optimal Singular-Value Surgery for Large Language Model Compression

arXiv cs.LG · Mahmoud Safari, Frank Hutter · 2026-06-22

SVD-Surgeon introduces a training-free method for optimal singular-value surgery in LLM compression, extending the Optimal Brain Surgeon framework to singular-value decomposition. By treating singular values as parameters, it computes closed-form updates to compensate for truncation effects to second-order loss approximation, while also providing a saliency metric for pruning decisions. The method layers atop existing SVD compressors like SVD-LLM, improving perplexity-compression trade-offs on OPT models and LLaMA 2-7B without retraining.

singular value decomposition · model compression · optimal brain surgeon · large language models · perplexity

Approximating velocity fields with planted attractors via Neural-ODEs for classification purposes

arXiv cs.LG · Feliciano Giuseppe Pacifico, Duccio Fanelli, Lorenzo Buffoni, Lorenzo Chicchi · 2026-06-22

The work demonstrates Neural ODEs with planted equilibrium points for classification tasks, where attractors correspond to target classes. The method leverages the universal approximation capability of Neural ODEs to shape velocity fields, creating basins of attraction that guide inputs (initial conditions) to their correct class destinations. This approach effectively transforms classification into a dynamical system with stable equilibria representing class decisions.

neural odes · equilibrium points · velocity fields · basins of attraction · universal approximation
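
The idea of classifying by basin of attraction can be shown with a toy one-dimensional system. The sketch below uses a hand-written velocity field with two stable equilibria instead of a trained Neural ODE; the field, the Euler integrator, the step sizes, and the class names are all assumptions for illustration.

```python
def velocity(x):
    """Hand-written field with stable equilibria at -1 and +1 (unstable at 0)."""
    return x - x ** 3


def flow(x0, dt=0.01, steps=2000):
    """Integrate dx/dt = velocity(x) from x0 with explicit Euler steps."""
    x = x0
    for _ in range(steps):
        x += dt * velocity(x)
    return x


# Each planted attractor stands for one class (labels are hypothetical).
ATTRACTORS = {-1.0: "class A", 1.0: "class B"}


def classify(x0):
    """Label an input by the attractor its trajectory ends closest to."""
    x_final = flow(x0)
    nearest = min(ATTRACTORS, key=lambda a: abs(a - x_final))
    return ATTRACTORS[nearest]


print(classify(0.3), classify(-2.0))
```

In the setting the summary describes, the velocity field would be learned so that each training input lies in the basin of its own class attractor; here the basins are simply the negative and positive half-lines.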

SuperCond-GNN: Scalable Graph Neural Network Surrogate for Superconducting Circuit Simulations

arXiv cs.LG · Nandana Menon, Giorgio Vallone · 2026-06-22

SuperCond-GNN introduces a graph neural network surrogate for predicting voltage distributions in high-temperature superconducting (HTS) magnet circuits, achieving 4.3% mean MAPE accuracy. The method represents HTS circuits as graphs, employing message-passing GNNs to learn electrical responses from topology, material properties, and operating current. Evaluated on tape stacks (≤10 tapes), it demonstrates scalability via physics-informed regularization (Kirchhoff's current law) and generalizability through zero-shot inference and few-shot fine-tuning.

graph neural networks · superconducting circuits · surrogate modeling · physics-informed learning · message passing

Simulation-Free Estimation of Traffic Flows from Sparse Count Data

arXiv cs.LG · Davide Guastella, Gianluca Bontempi · 2026-06-22

The authors propose a simulation-free method for estimating time-varying traffic flows from sparse aggregated vehicle counts. The approach partitions the study area into spatial regions, constructs feasible region-to-region routes, and solves a weighted least-squares optimization problem to allocate vehicles across routes, guided by a weighted contribution matrix encoding sensor coverage. Edge-level trajectories are derived by scoring candidate routes against temporal and volumetric profiles of regional sensor counts. Evaluated on the Brussels road network using real and synthetic data, the method reproduces daily traffic profiles and outperforms baselines at reduced computational cost.

weighted least-squares · spatial regions · sensor coverage · edge-level trajectories · traffic flow estimation
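
The weighted least-squares allocation step can be sketched generically as follows. The contribution matrix, sensor weights, and counts are made-up numbers, and the function name is hypothetical; the paper's actual formulation (for example any non-negativity constraints) is not given in this digest, so this only shows the basic technique.

```python
import numpy as np


def allocate_flows(contribution, counts, weights):
    """Solve min_x sum_i w_i * (contribution[i] @ x - counts[i])**2.

    contribution[i, j] says how much route j contributes to sensor i.
    Scaling each row by sqrt(w_i) turns the weighted problem into an
    ordinary least-squares problem.
    """
    contribution = np.asarray(contribution, dtype=float)
    counts = np.asarray(counts, dtype=float)
    sqrt_w = np.sqrt(np.asarray(weights, dtype=float))
    flows, *_ = np.linalg.lstsq(
        contribution * sqrt_w[:, None], counts * sqrt_w, rcond=None
    )
    return flows


# Three sensors, two candidate routes (all values are illustrative).
contribution = [[1, 0], [1, 1], [0, 1]]
counts = [30, 80, 50]
weights = [1.0, 2.0, 1.0]
print(allocate_flows(contribution, counts, weights))
```

Plain least squares can return negative flows on noisy data, so a real implementation would likely need a non-negative solver; that choice is left open here.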

Concordia: JIT-Compiled Persistent-Kernel Checkpointing for Fault-Tolerant LLM Inference

arXiv cs.LG · Yuhang Gan, Yiwei Yang, Yuyi Li, Xiangyu Gao · 2026-06-22

Concordia introduces a GPU-resident execution context for fault-tolerant LLM inference, addressing the loss of valuable state (e.g., KV caches, request schedulers) during GPU or communicator failures. The runtime employs a persistent kernel as its substrate, interposing on GPU module loading and supporting PTX- and SASS-level instrumentation to insert checkpoint and pause hooks below framework and library boundaries. Concordia JIT-compiles specialized delta-checkpoint handlers for each LLM state region, hot-swapping them into the persistent kernel's operator table. The kernel manages a lock-free ring buffer of tasks, enabling dirty-page detection, delta staging, and committed record logging in CXL memory or host DRAM without CPU intervention.

gpu-resident execution context · persistent kernel · delta-checkpoint handler · ptx-level instrumentation · lock-free ring buffer

Collapsed Effective Operators for Higher-order Structures

arXiv cs.LG · Maximilian Krahn, Lennart Bastian, Vikas Garg, Björn Schuller · 2026-06-22

The authors propose Collapsed Effective Operators, a method for condensing higher-order topological structures into vertex-level operators via Schur complementation of a graded Laplacian. This yields a dense operator encoding long-range interactions mediated by topology while preserving positive semi-definiteness with a spectral upper bound relative to the rank-0 Hodge Laplacian. Empirical results demonstrate improvements in spectral clustering, signal smoothing, and neural network positional encoding through topological feature inclusion.

schur complementation · hodge laplacian · higher-order structures · spectral clustering · positional encoding
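
Schur complementation itself is easy to demonstrate on a small matrix. The sketch below collapses a block of a symmetric positive semi-definite operator onto the remaining indices; it uses an ordinary graph Laplacian instead of the paper's graded Laplacian, and the function name and `ridge` parameter are assumptions.

```python
import numpy as np


def schur_collapse(L, keep, ridge=0.0):
    """Schur complement of L onto the indices in `keep`.

    With L split into blocks [[A, B], [B.T, D]] (A = kept, D = eliminated),
    this returns A - B @ inv(D) @ B.T. `ridge` optionally regularises D
    when it is singular.
    """
    keep = np.asarray(keep)
    drop = np.setdiff1d(np.arange(L.shape[0]), keep)
    A = L[np.ix_(keep, keep)]
    B = L[np.ix_(keep, drop)]
    D = L[np.ix_(drop, drop)] + ridge * np.eye(len(drop))
    return A - B @ np.linalg.solve(D, B.T)


# Path graph 0 - 1 - 2: eliminate the middle node.
L = np.array([[1.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 1.0]])
S = schur_collapse(L, keep=[0, 2])
print(S)  # [[ 0.5 -0.5], [-0.5  0.5]]
```

Nodes 0 and 2 were not adjacent in the original graph, yet the collapsed operator couples them with weight 0.5, which is the kind of mediated long-range interaction the summary refers to. The result is still positive semi-definite.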

FairBED: A Bayesian Experimental Design Approach to Gathering Fairer Data

arXiv cs.LG · Marcel Hedman, Emily Alger, Brieuc Lehmann, Chris Holmes · 2026-06-22

FairBED introduces a Bayesian experimental design framework for fairness-aware data acquisition, proposing novel metrics to quantify dataset fairness based on sensitive attribute uninformativeness. The method formulates objectives that maximize information gain about target variables while minimizing information about sensitive attributes, establishing theoretical connections to demographic parity. Empirical results demonstrate improved fairness-accuracy trade-offs compared to random data collection and conventional BED, with models trained on FairBED-acquired data showing superior performance.

bayesian experimental design · fairness metrics · demographic parity · data acquisition · sensitive attributes

Development and Design of FLKit: A Structured Onboarding Toolkit for Federated Learning in Health and Life Sciences

arXiv cs.LG · Ashkan Pirmani, Ilse Vermeulen, Goran Vinterhalter, Lotte Geys · 2026-06-22

FLKit introduces a structured onboarding toolkit for federated learning (FL) in health and life sciences, addressing multidisciplinary barriers to adoption. The toolkit provides role-specific entry points (11 total) across four FL lifecycle stages (Governance, Infrastructure, Wrangling, Analysis), supplemented by a cross-disciplinary glossary, FAIR-aligned project templates, and a curated tool directory. Developed via a consortium-driven approach with practitioner interviews, it includes 39 pages of documentation and seven FL Stories showcasing real-world applications in domains like multiple sclerosis prediction and genomics. The open-source resource (launched December 2024) is maintained by the biomedical data science community at UHasselt.

federated learning · lifecycle stages · fair-aligned · multidisciplinary · governance

TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization

arXiv cs.LG · Matan Ben-Tov, Mahmood Sharif · 2026-06-22

The authors introduce TROPT, an open-source framework for unifying discrete text optimization across diverse applications like LLM jailbreaking and model auditing. TROPT standardizes optimizer execution via a modular interface supporting 15+ optimizers (white-box to black-box), 15+ losses, and 30+ prebuilt recipes, enabling component swapping and cross-domain portability. Empirical studies demonstrate its utility in comparative optimizer analysis for jailbreaking and novel applications like corpus-poisoning of embedding models, reducing adoption barriers for text-trigger optimization research.

discrete optimization · text-trigger · jailbreaking · modular framework · model auditing

Sublinearly Structured Deep Neural Networks Achieve Feature Learning Consistency for Compositional Functions

arXiv cs.LG · Sehwan Kim, Yan Sun, Faming Liang · 2026-06-22

The paper establishes feature-learning consistency guarantees for sublinearly structured deep neural networks (DNNs), where input/output dimensions and hidden neurons grow sublinearly with sample size, when learning hierarchically compositional functions. Theoretical analysis shows this consistency persists even in over-parameterized regimes (parameters > samples), while empirical results demonstrate competitive performance against wide DNNs. Structural audits reveal popular CNNs (AlexNet, VGGNet, ResNet, GoogLeNet) exhibit sublinear scaling on image classification tasks, aligning with the theoretical framework for hierarchical data.

feature-learning consistency · sublinear scaling · compositional functions · over-parameterized regime · universal approximation

Time Series Classification through Diffeomorphic Time Warping (DiffTW)

arXiv cs.LG · Vicky Geneva Haney, Kamel Lahouel, Victor Rielly, Bruno M. Jedynak · 2026-06-22

The paper introduces Diffeomorphic Time Warping (DiffTW), a novel framework for time series classification that generalizes Dynamic Time Warping (DTW) by learning diffeomorphic transformations between real-valued functions. The method models temporal alignments as solutions to ordinary differential equations derived from a transport equation, using reproducing kernel Hilbert spaces and optimal control for velocity field representation. Evaluated with a 1-nearest neighbor classifier, DiffTW achieves superior performance over DTW on 60 out of 86 benchmark datasets.

time series classification · diffeomorphic transformation · dynamic time warping · ordinary differential equations · reproducing kernel hilbert space
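
For context, the baseline that DiffTW generalizes, classic Dynamic Time Warping with a 1-nearest-neighbour classifier, can be sketched in a few lines. This is the standard dynamic-programming formulation with an absolute-difference local cost, not the paper's method; the function names and toy series are illustrative.

```python
import math


def dtw_distance(a, b):
    """Classic DTW cost between two real-valued sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # repeat b[j-1]
                cost[i][j - 1],      # repeat a[i-1]
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def predict_1nn(train, query):
    """Label of the training series closest to `query` under DTW.

    `train` is a list of (series, label) pairs.
    """
    return min(train, key=lambda item: dtw_distance(item[0], query))[1]


train = [([0, 0, 1, 1], "step"), ([1, 0, 1, 0], "zigzag")]
print(predict_1nn(train, [0, 1, 1, 1]))
```

DTW restricts alignments to discrete monotone index matchings; the summary describes DiffTW as replacing these with smooth diffeomorphic warps obtained from an ODE.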

Do Location Encoders Capture Spatial Effects? A GeoShapley Benchmark Across Scales

arXiv cs.LG · Daniel Kiv, Shaowen Wang · 2026-06-22

The study evaluates how well location encoders capture interpretable spatial effects using GeoShapley, a game-theoretic explainer that treats location features as a single joint player. Eleven TorchSpatial encoders are benchmarked against a synthetic process with known coefficients, across grid, county, and global scales, with variations in training and coordinate inclusion. Results show high recovery of primary coefficients across encoders, while secondary coefficient recovery is scale-dependent, particularly at the global scale; raw coordinates remain competitive throughout.

location encoders · geoshapley · spatial effects · torchspatial · game-theoretic explainer

Selective Time Series Forecasting via Metalearning

arXiv cs.LG · Ricardo Inácio, Vitor Cerqueira, Marília Barandas, Carlos Soares · 2026-06-22

The paper introduces a selective forecasting framework that improves time series prediction accuracy by abstaining from high-risk forecasts via metalearning. The method models empirical error percentiles using structural features from recent lags, decoupling rejection decisions from domain-specific forecasts. Evaluations demonstrate consistent accuracy improvements in both in-domain and transfer learning settings when rejecting challenging samples.

selective forecasting · metalearning · time series · rejection mechanism · transfer learning

SkyJEPA: Learning Long-Horizon World Models for Zero-Shot Sim-to-Real Control of Quadrotors

arXiv cs.LG · Pratyaksh Rao, Wancong Zhang, Randall Balestriero, Yann LeCun · 2026-06-22

The authors present SkyJEPA, a Joint Embedding Predictive Architecture (JEPA) for zero-shot sim-to-real control of quadrotors, addressing long-horizon forecasting challenges in neural dynamics models. The method combines a latent dynamics model with a physics-inspired prober for interpretable state prediction and integrates it with sampling-based optimal control for real-time embedded deployment. Experiments demonstrate accurate long-horizon prediction, robust zero-shot transfer, and generalization across diverse conditions, enabled by an automated dataset generation pipeline.

joint embedding predictive architecture · zero-shot sim-to-real · latent dynamics model · sampling-based optimal control · quadrotor control

Interpretable Kolmogorov-Arnold Network with Feature-Isolated Temporal Attention Mechanism for Electricity Load Forecasting

arXiv cs.LG · Jinhao Li, Hao Wang · 2026-06-22

The paper introduces LoadKAN, an interpretable hybrid framework for electricity load forecasting that combines a feature-isolated temporal attention mechanism with a Kolmogorov-Arnold Network (KAN). The attention mechanism independently extracts temporal dynamics from input features (e.g., historical load, mobility data), while the KAN module provides interpretable predictions through learnable activation functions. Evaluated on three U.S. electricity market datasets, LoadKAN matches state-of-the-art black-box models while enabling granular analysis of non-linear feature relationships, particularly revealing market-specific mobility-load dependencies through quantitative sensitivity analysis.

kolmogorov-arnold network · temporal attention · feature isolation · electricity load forecasting · interpretable machine learning

Leveraging Similarities in Multi-Armed Bandits

arXiv cs.LG · Khaled Eldowa, Thibaud Rahier, Augustin Cablant, Panayotis Mertikopoulos · 2026-06-22

The paper investigates multi-armed bandit problems with similarity-structured action sets, where actions are organized in a tree encoding their hierarchical relationships. It first establishes an impossibility result for leveraging similarities under standard one-point bandit feedback, then proposes unified algorithms for richer feedback models (semi-bandit to multi-point). These algorithms adapt to action similarities, replacing the action count $K$ with an effective count $K_{\mathrm{eff}}$ in regret bounds. Key results include best-of-both-worlds guarantees and $\sqrt{T}$ regret for Lipschitz bandits under two-point feedback when $d \leq 2$.

multi-armed bandits · similarity-structured actions · tree-compatible losses · semi-bandit feedback · regret bounds

Quantum Convolutional Neural Networks for Groundwater Heat Plume Prediction: A Surrogate Modeling Approach

arXiv cs.LG · Danyal Maheshwari, Julia Pelzer, Miriam Schulte · 2026-06-22

The study proposes a Quantum Convolutional Neural Network (QCNN) as a surrogate model for predicting groundwater temperature variations induced by geothermal heat pumps in Munich. The QCNN architecture combines quantum convolutional and pooling layers with a fully connected readout, using parameterized quantum circuits and Hamiltonian-inspired feature encoding. Evaluated across statevector simulators, noisy simulators, and IBM's 127-qubit Kyiv processor with error mitigation, the QCNN achieves competitive performance (measured by MSE) compared to classical neural networks, demonstrating potential for quantum-enhanced environmental modeling as hardware improves.

quantum convolutional neural network · surrogate modeling · groundwater heat plume · error mitigation · parameterized quantum circuits

Differential Spectral Damping Gap Adaptive Regularization for Ill-Conditioned Kernel Methods

arXiv cs.LG · Praveg Vashishtha · 2026-06-22

The paper introduces Differential Spectral Damping (DSD), an adaptive regularization method for ill-conditioned kernel methods like Least-Squares Twin Support Vector Machines (LSTSVM). DSD selectively penalizes eigenvectors based on spectral gap size, preserving reliable directions (large gaps) while suppressing noisy ones (small gaps), as justified by Davis-Kahan perturbation theory. Experiments show DSD improves LSTSVM classification accuracy by +4.8pp on the GINA dataset (d=970) compared to optimized Tikhonov baselines, with statistically significant effects (Cohen's d=4.49, p<0.001).

kernel methods · adaptive regularization · spectral damping · davis-kahan theorem · lstsvm

Physics-Informed Modeling for Wood Thermal Analysis and Prediction

arXiv cs.LG · Jingren Xie, Alex John Buckthal, Ryan Anthony O'Connor, Isak Worre Foged · 2026-06-22

The paper introduces physics-informed deep learning frameworks for predicting pixel-level thermal responses of spatially heterogeneous wood materials. Two approaches are proposed: Physics-Informed CNNs (PICNNs) with PDE-based soft constraints in the loss function, and Physics-Integrated CNNs (PInteCNNs) that hard-code an analytical solver into the network architecture. Evaluated on multimodal datasets of Poplar and Grandis wood samples, both methods outperform purely data-driven approaches in accuracy and interpretability while handling material heterogeneity.

physics-informed neural networks · thermal property prediction · material heterogeneity · partial differential equations · multimodal learning

FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

arXiv cs.LG · Yinpeng Wu, Yitong Chen, Lixiang Wang, Jinyu Gu · 2026-06-22

FlexServe introduces a fast and secure LLM inference system for mobile devices, addressing the overheads of ARM TrustZone-based isolation. The system decouples access and management permissions for secure resources, enabling normal-world OS management without access. It implements Recallable Resource Isolation to create Recallable Secure Memory (Flex-Mem) and Recallable Secure NPU (Flex-NPU), accessible only by the secure world. The FlexServe Framework facilitates cooperative secure memory management between the secure and normal worlds. Prototype evaluations demonstrate significant speedups, with FlexServe achieving 10.05X and 2.44X faster Time-To-First-Token (TTFT) compared to baseline and optimized TrustZone designs, respectively.

arm trustzone · recallable resource isolation · flex-mem · flex-npu · time-to-first-token

Convergence of Gradient Descent for General Neural Network Architectures Beyond the NTK Regime

arXiv cs.LG · Yuqing Wang · 2026-06-22

The paper establishes a convergence framework for gradient descent (GD) in general neural network architectures beyond the neural tangent kernel (NTK) regime, including pre-normalized transformers. By analyzing network blocks and employing polynomial generalized smoothness with a local relaxed dissipative condition, the authors prove GD converges to a stationary point's neighborhood for almost all initializations under regular learning rates. Key findings include learning rate scaling with depth and bottleneck dimensions rather than maximum width, and structural implications for residual connections and function composition.

gradient descent · neural tangent kernel · polynomial generalized smoothness · xavier initialization · residual connections

SOAP-Bubbles: Structured Weight Uncertainty for Neural Networks

arXiv cs.LG · Adrian Robert Minut, Nico Daheim, Marco Miani, Mohammad Emtiyaz Khan · 2026-06-22

The paper introduces SOAP-Bubbles, a method for structured weight uncertainty in neural networks, and Eigenspace-VON (EVON), a new optimizer. The approach adapts the SOAP optimizer by running IVON (a diagonal-covariance variational method) in the eigenspace of SOAP's preconditioner, then transforming the diagonal estimate into non-diagonal covariance. For logistic regression, EVON recovers exact Gaussian covariance; in language model pretraining, it outperforms existing diagonal-covariance methods. The method maintains computational efficiency similar to SOAP and requires minimal pipeline modifications.

structured weight uncertainty · soap optimizer · eigenspace-von · diagonal-covariance variational · non-diagonal covariance

Changing Modalities: Adapting Remote Sensing Models to New Satellites and Sensors

arXiv cs.LG · Tim G. Zhou, Anthony Fuller, Geoff Pleiss, Evan Shelhamer · 2026-06-22

The paper introduces DeluluNet, a modular architecture for adapting remote sensing models to changing sensor modalities without full retraining. The method addresses three scenarios—Modality Transfer (substitution), Addition (superset), and Peeking (subset)—via end-to-end training with modality hallucination, predicting missing modalities from available ones using unlabeled multimodal data and a unimodal teacher. This approach enables deployment under varying sensor configurations while minimizing labeling and computational costs.

modality hallucination · remote sensing · multimodal learning · modular architecture · unlabeled data

Ultra-Peripheral Collisions as a Nuclear-Structure Interferometer with Interpretable Multitask Deep Learning

arXiv cs.LG · Jing-Zong Zhang, Wang-Mei Zha, Lingxiao Wang, Guo-Liang Ma · 2026-06-22

The authors introduce an interpretable multitask deep-learning framework to quantify nuclear structure from ultra-peripheral collisions (UPCs), addressing the inverse problem of mapping transverse momentum distributions to multiple nuclear-structure indicators. The method leverages coherent vector-meson photoproduction in UPCs, specifically analyzing $J/\psi$ photoproduction in $^{96}_{40}\text{Zr} + ^{96}_{40}\text{Zr}$ collisions, to separate diffraction-dominated and interference-dominated kinematic regions. Results demonstrate that the framework identifies key kinematic regions driving inferences and provides analysis-ready observables for future high-luminosity data, enabling precise nuclear-structure tomography.

ultra-peripheral collisions · multitask deep-learning · nuclear-structure tomography · vector-meson photoproduction · kinematic regions

Superhuman AI for Generals.io Using Self-Play Reinforcement Learning

arXiv cs.LG · Matej Straka, Viliam Lisý, Martin Schmid · 2026-06-22

The paper introduces a superhuman AI agent for Generals.io, achieving #1 rank on the public leaderboard (5,000+ players) with a 199-70 win record against top humans. Key innovations include a JAX-native simulator (10,000x speedup over prior work) enabling efficient self-play training, and a vision transformer policy trained via policy-gradient loops with top-advantage filtering. The agent was trained for four days on 4x NVIDIA H200 GPUs, demonstrating superior performance in both long-horizon planning and short-term tactics under imperfect information.

generals.io · self-play · jax · vision transformer · policy-gradient

The Anatomy of the CTC Oracle Gap: Acoustic Exhaustion and Linguistic Recovery

arXiv cs.LG · Ivan Novosad · 2026-06-22

The study identifies the information bottleneck in CTC-based ASR systems where acoustic confidence scores fail to improve linguistic plausibility beyond greedy decoding. Analyzing eleven scoring strategies on LibriSpeech dev-other (G=16), it shows CTC's Spearman ρ degrades by 53% (from -0.574 to -0.270) as blank-path proliferation exhausts discriminative capacity. Introducing external linguistic information via MBR-CER decoding with RoBERTa pseudo-log-likelihood (τ=10, G=128) achieves 5.42% WER (Δ=-0.535 pp vs. greedy), demonstrating linguistic recovery. The method generalizes across architectures (Zipformer), domains (LibriSpeech, TED-LIUM 3, VoxPopuli), and noise levels (MUSAN), with gains in 11/13 conditions. MWER training shows limited efficacy due to minimal reward signals (0.007 pp oracle gap).

ctc · mbr-cer · spearman ρ · roberta · mwer
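
Minimum Bayes risk (MBR) selection under a character-error-rate loss can be sketched as follows. The hypothesis list, the scores, and the way the temperature `tau` enters (a softmax over scores divided by `tau`) are assumptions for illustration; the digest does not specify how the paper combines RoBERTa pseudo-log-likelihoods with τ.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two strings (character level)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute / match
        prev = cur
    return prev[-1]


def cer(hyp, ref):
    """Character error rate of `hyp` against `ref`."""
    return edit_distance(hyp, ref) / max(len(ref), 1)


def mbr_select(hyps, scores, tau=1.0):
    """Pick the hypothesis with the lowest expected CER against all others.

    `scores` are per-hypothesis log-scores (hypothetical here); they are
    turned into weights with a softmax at temperature `tau`.
    """
    top = max(scores)
    weights = [math.exp((s - top) / tau) for s in scores]
    total = sum(weights)

    def risk(h):
        return sum(w / total * cer(h, other) for w, other in zip(weights, hyps))

    return min(hyps, key=risk)


hyps = ["the cat sat", "the cat sad", "a cat sat"]
print(mbr_select(hyps, scores=[-1.0, -1.2, -3.0], tau=1.0))
```

Unlike picking the single highest-scoring hypothesis, MBR favours the candidate that is closest on average to the other plausible candidates, so the external scores only shape the weighting.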

GRIMIP: A General Framework for Instance-Specific Configuration of MIP Solvers Using LLMs

arXiv cs.LG · Yidong Luo, Xuemin Chen, Chenguang Wang, Fangzhou Zhu · 2026-06-22

GRIMIP introduces a hybrid framework for instance-specific configuration of Mixed-integer programming (MIP) solvers by integrating Large Language Models (LLMs) with Bayesian Optimization (BO). The method leverages LLMs as probabilistic surrogates within the BO loop, enhancing semantic reasoning and sample efficiency. Evaluated on seven benchmarks including MIPLIB, GRIMIP reduces Primal-Dual Integral by over 40% on hard instances, outperforming SMAC and other LLM-assisted BO methods. This approach combines expert-level reasoning from LLMs with efficient search from BO, achieving state-of-the-art performance.

mixed-integer programming · bayesian optimization · large language models · probabilistic surrogate · primal-dual integral

Non-asymptotic estimates of the minimal risk in statistical learning

arXiv cs.LG · Liming Wu, Sen Yang · 2026-06-22

The paper establishes non-asymptotic concentration inequalities for error probabilities in the Empirical Risk Principle (ERP), providing high-confidence lower and upper bounds for minimal risk. By relaxing the usual boundedness condition to Gaussian/exponential integrability, the analysis leverages Talagrand's concentration inequalities, transport-entropy methods, and empirical process theory. Key results show the lower bound's confidence is independent of parameter count and input dimension, while the upper bound requires sample size n ≫ box dimension of the parameter set Θ in Orlicz metric d_{ψ_1}.

empirical risk principle · concentration inequalities · non-asymptotic bounds · orlicz metric · transport-entropy

Transfer learning-based method for automated ewaste recycling in smart cities

arXiv cs.LG · Nermeen Abou Baker, Paul Szabo-Müller, Uwe Handmann · 2026-06-22

The study proposes a transfer learning-based method for automated e-waste recycling in smart cities, focusing on smartphone classification. Using AlexNet as a pretrained model, the authors fine-tune its output layers and evaluate performance on a dataset of 12 classes from 6 smartphone brands. Hyperparameter tuning, including learning rate and optimizer selection, along with data augmentation, achieves 98% accuracy with Stochastic Gradient Descent with Momentum at a learning rate of 3e-4. The approach demonstrates the efficacy of transfer learning in reducing e-waste sorting errors and advancing circular economy initiatives.

transfer learning · alexnet · e-waste recycling · stochastic gradient descent · data augmentation

Attention mechanism for scalable mesh-based neural surrogates of free-surface fluids

arXiv cs.LG · Federico Lanteri, Massimiliano Cremonesi · 2026-06-22

The authors propose a self-attention-based neural surrogate for Particle Finite Element Method (PFEM) simulations of free-surface flows, addressing computational challenges in Lagrangian methods. The architecture employs attention mechanisms to model node interactions on evolving meshes, with variants including standard self-attention and a linear version for scalability. Evaluated on 2D/3D benchmarks with non-Newtonian fluids, the method accurately predicts transient dynamics and final configurations while enabling stress field reconstruction through finite element operators.

attention mechanism · particle finite element method · free-surface flows · neural surrogate · non-newtonian fluids

Unlocking In-Context Learning in Audio-Language Models from Decentralized Medical Audio

arXiv cs.LG · Ran Piao, Tsai-Ning Wang, Martijn den Dekker, Linda Moonen · 2026-06-22

We introduce Federated Self-Contextualization (FSC), a multimodal language model framework for in-context clinical audio diagnosis across federated hospital clients. FSC enables episodic in-context learning by constructing pseudo-label episodes via unsupervised clustering of audio representations, bypassing scarce diagnostic labels, and reasoning from support-query pairs. The method employs a three-stage pipeline: aligning audio embeddings with the language model via caption-based pretraining, adapting for episodic inference through federated optimization, and performing multimodal reasoning at test time. FSC achieves 71.6% accuracy in 2-way 2-shot evaluation on respiratory and cardiac conditions, outperforming audio-language baselines by over 9%.

federated learning · in-context learning · multimodal reasoning · audio embeddings · pseudo-label episodes

Spectral Gating via Damped Oscillations for Adaptive Implicit Neural Representations

arXiv cs.LG · Alex Costanzino, Pierluigi Zama Ramirez, Giuseppe Lisanti, Luigi Di Stefano · 2026-06-22

The authors propose a novel activation function for Implicit Neural Representations (INRs) based on damped harmonic oscillators to address the spectral bias vs. noise memorization trade-off. Each neuron's activation is modeled as the steady-state response of a sinusoidally-forced damped oscillator, with jointly optimized parameters enabling adaptive spectral gating during training. The method exhibits a coarse-to-fine learning curriculum, initially suppressing high frequencies and progressively expanding bandwidth as needed. Experiments demonstrate state-of-the-art or competitive performance across INR tasks without requiring task-specific hyperparameter tuning.

implicit neural representations · spectral bias · damped harmonic oscillator · adaptive activation · coarse-to-fine learning
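
To make the "spectral gating" idea concrete, the sketch below evaluates the textbook steady-state gain and phase lag of a sinusoidally forced damped oscillator, x'' + gamma*x' + omega0^2*x = force*cos(omega*t). The function name and the parameters `omega0`, `gamma`, and `force` are illustrative assumptions; the summary does not give the paper's actual activation or parameterization.

```python
import math

def steady_state_response(omega, omega0, gamma, force=1.0):
    """Amplitude and phase lag of x'' + gamma*x' + omega0**2 * x = force*cos(omega*t)."""
    detuning = omega0 ** 2 - omega ** 2
    amplitude = force / math.hypot(detuning, gamma * omega)
    phase_lag = math.atan2(gamma * omega, detuning)
    return amplitude, phase_lag

# Gain below, at, and above the natural frequency for three damping levels.
# Heavier damping lowers the gain at resonance (omega == omega0).
for gamma in (0.1, 1.0, 5.0):
    gains = [steady_state_response(w, omega0=2.0, gamma=gamma)[0] for w in (0.5, 2.0, 8.0)]
    print(gamma, [round(g, 4) for g in gains])
```

In this reading, `omega0` sets which input frequencies a neuron passes and `gamma` sets how sharply it selects them, which is one plausible way trainable parameters could widen the bandwidth over training.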

Deep learning-based detection of cessation of breathing in pre-term infants

arXiv cs.LG · Dineo Serame, Lionel Tarassenko, Mauricio Villarroel · 2026-06-22

The study demonstrates that deep learning models can reliably detect apnoea-related Cessation Of BrEathing (COBE) events in pre-term infants using routinely monitored physiological signals. The authors evaluated shallow CNNs, ResNets, and ConvNeXt architectures on 430 hours of NICU recordings (346 COBE and 608 non-COBE events) from 24 infants, comparing impedance pneumography (IP), electrocardiography (ECG), and photoplethysmography (PPG) modalities. IP-based models achieved 86.8-88.0% balanced accuracy, outperforming ECG (62.6-69.7%) and PPG (65.1-66.4%). Multimodal fusion with ConvNeXt (IP+PPG) yielded the best performance (88.7% accuracy, F1=0.75), highlighting signal modality's importance over architectural complexity in data-constrained settings.

convolutional neural network · impedance pneumography · multimodal fusion · neonatal intensive care · apnoea detection
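
Because the event classes are imbalanced (346 COBE vs. 608 non-COBE), balanced accuracy and F1 are more informative than plain accuracy. A minimal sketch of both metrics from confusion-matrix counts follows; the counts in the example are toy values chosen for illustration, not results from the paper.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (recall on events) and specificity (recall on non-events)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)

def f1_score(tp, fn, fp):
    """Harmonic mean of precision and recall, written in terms of raw counts."""
    return 2 * tp / (2 * tp + fp + fn)

# Toy counts (hypothetical): 10 events, 20 non-events.
print(balanced_accuracy(tp=8, fn=2, tn=15, fp=5))
print(f1_score(tp=8, fn=2, fp=5))
```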

Efficient Network Inference via Hardware-Aware Architecture Search, Model Pruning & Quantization

arXiv cs.LG · Lucas Heublein, Mark Deutel, Axel Plinge, Felix Ott · 2026-06-22

The paper presents a hardware-aware approach to optimize deep neural networks for real-time GNSS interference monitoring, balancing model expressiveness with deployability on resource-constrained devices. The method combines iterative structured pruning, post-training static quantization, and zero-shot neural architecture search (NAS) to enhance efficiency. Evaluated on a GNSS interference dataset, the optimized models demonstrate reduced size, computational complexity, and memory usage while maintaining performance, validated across platforms like iMXRT1062 MCU and Raspberry Pi variants.

gnss interference · neural architecture search · model pruning · quantization · embedded deployment
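
As a rough illustration of post-training static quantization, the sketch below derives a fixed affine scale and zero point from calibration values and then maps floats to 8-bit integers and back. It is a generic textbook scheme under assumed names (`calibrate`, `quantize`, `dequantize`), not the toolchain or settings used in the paper.

```python
def calibrate(values, num_bits=8):
    """Derive an affine (scale, zero_point) pair from calibration values."""
    qmax = 2 ** num_bits - 1
    lo = min(min(values), 0.0)  # the representable range must contain 0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for an all-zero range
    zero_point = round(-lo / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, num_bits=8):
    """Map a float to an unsigned integer code, clamped to the valid range."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** num_bits - 1, q))

def dequantize(q, scale, zero_point):
    """Map an integer code back to its approximate float value."""
    return scale * (q - zero_point)

calibration = [-1.0, 0.0, 0.5, 2.0]  # hypothetical activation samples
scale, zero_point = calibrate(calibration)
for v in calibration:
    q = quantize(v, scale, zero_point)
    print(v, q, dequantize(q, scale, zero_point))
```

"Static" here means the scale and zero point are fixed after calibration rather than recomputed per input, which is what makes the scheme cheap on microcontroller-class targets.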

Leveraging AutoML for Sustainable Deep Learning: A Multi-Objective HPO Approach on Deep Shift Neural Networks

arXiv cs.LG · Leona Hennig, Marius Lindauer · 2026-06-22

This work introduces a multi-objective hyperparameter optimization (HPO) approach leveraging AutoML to enhance Deep Shift Neural Networks (DSNNs) for sustainable deep learning. The method combines multi-fidelity HPO with multi-objective optimization to balance accuracy and energy consumption in image classification tasks. Results demonstrate a 20% performance improvement and up to 60% reduction in emissions compared to default DSNNs. Experiments reveal counterintuitive findings, such as optimal energy savings achieved by quantizing smaller network portions with low precision. These insights are validated across multiple backbone architectures, offering automated strategies for energy-efficient model design.

deep shift neural networks · multi-objective optimization · hyperparameter optimization · automl · quantization
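
Multi-objective HPO returns a set of trade-off configurations rather than a single winner. The sketch below filters a list of candidates down to its Pareto front for the two objectives named in the summary (higher accuracy, lower energy). The dictionary keys and the candidate values are hypothetical and only illustrate the selection rule.

```python
def pareto_front(configs):
    """Keep configs that no other config beats on both accuracy and energy."""
    front = []
    for c in configs:
        dominated = any(
            o["accuracy"] >= c["accuracy"]
            and o["energy"] <= c["energy"]
            and (o["accuracy"] > c["accuracy"] or o["energy"] < c["energy"])
            for o in configs
        )
        if not dominated:
            front.append(c)
    return front

# Hypothetical candidates: accuracy as a fraction, energy in arbitrary units.
candidates = [
    {"name": "a", "accuracy": 0.90, "energy": 10.0},
    {"name": "b", "accuracy": 0.85, "energy": 4.0},
    {"name": "c", "accuracy": 0.80, "energy": 6.0},
    {"name": "d", "accuracy": 0.90, "energy": 12.0},
]
print([c["name"] for c in pareto_front(candidates)])
```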

Bridge the Gaps: Heterogeneous Attributed Graph Clustering via Quaternion Representation Learning

arXiv cs.LG · Xinxi Chen, Junyang Chen, Yiqun Zhang, Chuangming Qiu · 2026-06-22

The paper introduces AGREE, an end-to-end framework for heterogeneous attributed graph clustering that addresses representation degradation and attribute heterogeneity. AGREE employs quaternion-based graph convolution to enhance attribute interaction and mitigate over-dominating effects, while using shallow graph architectures to alleviate over-smoothing. The framework integrates multi-level alignment and similarity-based graph construction to unify attributed graphs and any-type attributed data. Joint optimization of graph reconstruction and clustering is performed without requiring a predefined number of clusters. Experiments on diverse benchmarks demonstrate AGREE's superior accuracy, robustness, and adaptability in clustering tasks.

attributed graph clustering · quaternion representation · over-smoothing · over-dominating · graph convolution

Incremental Learning in Mirror Flows

arXiv cs.LG · Raphaël Berthier, Loucas Pillaud-Vivien · 2026-06-22

The paper characterizes incremental learning in mirror flows by analyzing their rescaled trajectories near domain boundaries. Using convex quadratic loss and general convex lower semicontinuous mirror potentials, the authors demonstrate convergence to a limiting flow with an indicator function potential. The primal variable minimizes loss over a time-dependent hypothesis set derived from the subdifferential of the domain's support function. This provides a theoretical framework for understanding incremental learning dynamics in mirror descent algorithms.

mirror flows · convex optimization · incremental learning · subdifferential · support function

Stage-dependent integer-binary encoding in factorization-machine black-box optimization

arXiv cs.LG · Ryo Ogawa, Mayumi Nakano, Yuya Seki, Shu Tanaka · 2026-06-22

The paper introduces a stage-dependent integer-binary encoding framework for factorization-machine-based black-box optimization (FMQA), where different encodings are used for surrogate learning versus Ising-machine solution search. The method derives conversion formulas between one-hot and domain-wall QUBO matrices that preserve the surrogate objective over feasible integer states. Evaluated on the Rastrigin function (N=2,5; q=61,301), results show one-hot encoding in the learning stage consistently yields lower residual errors, while switching to domain-wall encoding during search provides additional improvement under finer discretization (N=5, q=301).

black-box optimization · factorization machine · ising machine · one-hot encoding · domain-wall encoding
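
For readers unfamiliar with the two encodings, the sketch below shows how an integer in {0, ..., q-1} is represented with q one-hot bits or q-1 domain-wall bits, and the standard linear map between them on feasible states. The function names are illustrative; the paper's QUBO-matrix conversion formulas are not reproduced here.

```python
def one_hot(k, q):
    """q bits with a single 1 at position k."""
    return [1 if i == k else 0 for i in range(q)]

def domain_wall(k, q):
    """q-1 bits: ones before the wall at position k, zeros after it."""
    return [1 if i < k else 0 for i in range(q - 1)]

def domain_wall_to_one_hot(d):
    """Difference of neighbouring bits after padding with a leading 1 and trailing 0."""
    padded = [1] + list(d) + [0]
    return [padded[i] - padded[i + 1] for i in range(len(padded) - 1)]

q = 5
for k in range(q):
    d = domain_wall(k, q)
    print(k, one_hot(k, q), d, domain_wall_to_one_hot(d))
```

Because the map is linear in the bits, substituting it into a quadratic objective yields another quadratic objective, which is why a surrogate learned in one encoding can in principle be searched in the other.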

EML Trees Are Universal Approximators

arXiv cs.LG · Joe Germany, Elie Abdo, Joseph Bakarji · 2026-06-22

The work establishes EML (Exp-Minus-Log) trees as universal approximators for functions in Sobolev spaces $W^{k, \infty}$. By constructing tree-structured compositions of EML functions—continuous analogues of NAND gates—the authors demonstrate their capacity to mimic polynomial representations, leveraging classical neural network approximation theory. A parameterized learning algorithm for EML trees is proposed and validated on practical optimization tasks. The results provide theoretical grounding for EML trees as a function approximation framework.

universal approximation · eml trees · sobolev spaces · nand gates · function composition
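
The summary names the EML operator but does not define it. Assuming it is the binary map eml(x, y) = exp(x) - ln(y), as the name "Exp-Minus-Log" suggests (an assumption, not confirmed by the text above), the sketch below shows how small trees built from this one operator and the constant 1 recover exp and ln. The helper names are hypothetical.

```python
import math

def eml(x, y):
    """Assumed definition of the EML node: exp(x) - ln(y), for y > 0."""
    return math.exp(x) - math.log(y)

def exp_via_eml(x):
    """A depth-1 tree: eml(x, 1) = exp(x) - ln(1) = exp(x)."""
    return eml(x, 1.0)

def log_via_eml(x):
    """A depth-3 tree: eml(1, eml(eml(1, x), 1)) = ln(x) for x > 0."""
    inner = eml(1.0, x)        # e - ln(x)
    middle = eml(inner, 1.0)   # exp(e - ln(x)) = e**e / x
    return eml(1.0, middle)    # e - (e - ln(x)) = ln(x)

print(exp_via_eml(0.5), math.exp(0.5))
print(log_via_eml(5.0), math.log(5.0))
```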

Position: Correct Answer, Wrong Mechanism -- When AI Scientists Defend General Claims Their Own Data Contradicts

arXiv cs.LG · Steven Young Eulig · 2026-06-22

This position paper argues that outcome-only evaluation of AI scientist systems is insufficient and proposes measuring task outcome, mechanism fidelity, and epistemic honesty separately. The author analyzes 28 episodes of a coding agent rediscovering a particle identification observable in Geant4 simulations, including cross-model probes. In 4/20 primary-model and 3/8 cross-model episodes the agent reaches a correct answer via incorrect reasoning (CAWM), and one agent defends physics-inconsistent claims. The study characterizes coding agents as reliable tools but unreliable co-authors for open-ended claims, and proposes a lightweight regime-shift check and companion recomputation that detect all CAWM cases.

ai scientist systems · mechanism fidelity · epistemic honesty · geant4 simulation · regime-shift check

Substitution-Based Analysis of Structural Novelty for Generative Models of Materials

arXiv cs.LG · Masahiro Negishi, Aron Walsh · 2026-06-22

The study evaluates structural novelty in AI-generated inorganic crystals by developing a workflow to classify generated structures as training duplicates, substitution-derived, or novel. Analyzing representative generative models reveals 81-92% of chemically valid metastable outputs are either duplicates or substitution variants, particularly in high-symmetry systems. Structural fingerprint analysis shows low-symmetry novel structures emerge from interpolation in data-rich regions, while high-symmetry duplicates suggest memorization in sparse regions, exposing a bias toward known prototypes in current models.

generative models · inorganic crystals · structural novelty · elemental substitution · metastable materials

Neural Parameter Calibration for Finite-State Mean Field Games

arXiv cs.LG · Anna C. M. Thöni, Grégoire Lambrecht, Gökçe Dayanıklı, Yonathan Efroni · 2026-06-22

The authors propose a neural network-based framework for calibrating parameters in finite-state mean field games (MFGs) from observed population dynamics, addressing the challenge of specifying hidden preferences and interactions. Their method formulates parameter calibration as an inverse problem, using implicit differentiation to backpropagate through the game's equilibrium, enabling estimation of flexible trajectory-wise parameter paths without requiring individual agent actions or rewards. Experimental validation on synthetic and real-world urban mobility datasets demonstrates the framework's effectiveness across systems of varying complexity.

mean field games · parameter calibration · implicit differentiation · inverse problem · population dynamics

Weighted Score-Oriented Losses for Temporally Localized Event Prediction

arXiv cs.LG · Edoardo Legnaro, Sabrina Guastavino, Francesco Marchetti · 2026-06-22

The paper introduces weighted score-oriented loss (wSOL), a temporally localized loss function for event prediction that addresses the score-loss mismatch in operational systems. Starting from score-oriented losses based on expected confusion matrices, wSOL incorporates temporal weights to discount near-event false positives and reduce false-negative penalties when preceded by admissible alarms. The objective is differentiable and optimizable via backpropagation, compatible with metrics like balanced accuracy, F1, and critical success index. Evaluations on three time-series event prediction benchmarks demonstrate wSOL's effectiveness in improving performance when temporal localization is critical and not encoded by pointwise labels.

score-oriented loss · temporal localization · confusion matrices · backpropagation · event prediction
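
One way to read "score-oriented loss based on expected confusion matrices" is sketched below: if the decision threshold is treated as uniform on [0, 1], a predicted probability p counts as a positive with probability p, so the confusion-matrix entries become smooth sums. The per-sample `fp_weights` that discount selected false positives are a hypothetical stand-in for the paper's temporal weighting, whose exact form the summary does not give.

```python
def soft_confusion(probs, labels, fp_weights=None):
    """Expected TP/FP/FN/TN when the decision threshold is uniform on [0, 1]."""
    if fp_weights is None:
        fp_weights = [1.0] * len(probs)
    tp = fp = fn = tn = 0.0
    for p, y, w in zip(probs, labels, fp_weights):
        if y == 1:
            tp += p
            fn += 1.0 - p
        else:
            fp += w * p  # w < 1 discounts this false positive (assumed scheme)
            tn += 1.0 - p
    return tp, fp, fn, tn

def csi_loss(probs, labels, fp_weights=None):
    """1 - critical success index, computed on the expected confusion matrix."""
    tp, fp, fn, _ = soft_confusion(probs, labels, fp_weights)
    return 1.0 - tp / (tp + fp + fn)

probs, labels = [0.9, 0.2, 0.6], [1, 0, 0]
print(csi_loss(probs, labels))
print(csi_loss(probs, labels, fp_weights=[1.0, 1.0, 0.5]))
```

Every term is a polynomial in the predicted probabilities, so a loss of this shape can be minimized by backpropagation.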

The Fractal Neural Operator: Overcoming Spectral Bias in Chaotic Attractors via Prime-Harmonic Weierstrass Encodings

arXiv cs.LG · Kanishk Awadhiya · 2026-06-22

The Fractal Neural Operator (FNO) addresses spectral bias in neural networks by employing a non-resonant prime number basis for chaotic dynamical systems. The proposed Harmonic Weierstrass Encoder overcomes spectral gaps in traditional geometric encodings by injecting infinite spectral resolution into the latent space. On the Lorenz-63 system, FNO achieves a prediction horizon of 347 Lyapunov times, outperforming Reservoir Computing methods by 2.3x, demonstrating that chaotic systems can be modeled with appropriate fractal embeddings.

fractal neural operator · spectral bias · chaotic dynamical systems · harmonic weierstrass encoder · lyapunov times

Temporal-Spectral Alignment with Frequency Adaptation for Source-Free Time-Series Adaptation

arXiv cs.LG · Shichang Meng, Linquan Wu, Xuan Ai, Linqi Song · 2026-06-22

The paper proposes Temporal-Spectral Alignment with Frequency Adaptation (SAFA), a novel method for source-free domain adaptation (SFDA) in time-series data that addresses both temporal drift and spectral shifts. SAFA models the source domain at multiple scales by jointly capturing temporal dependencies and spectral characteristics, then introduces a trainable frequency adaptation module to align target signals' phase and amplitude with the source distribution in the frequency domain. Experiments on multiple benchmarks demonstrate SAFA's efficacy and robustness in handling feature shift and temporal drift.

source-free domain adaptation · time-series adaptation · spectral shift · frequency adaptation · temporal drift

LOLLA: Deep Reinforcement Learning for Closed-Loop Link Adaptation Towards a GPU-Accelerated AI-RAN

arXiv cs.LG · Rui Wang, Linchao Zhang, Qiang Liu, Kun Yang · 2026-06-22

LOLLA introduces a deep reinforcement learning framework for outer-loop link adaptation in 5G NR, replacing conventional OLLA with a learned SINR offset conditioned on PHY/MAC telemetry. Using Proximal Policy Optimization (PPO) under a BLER constraint, it achieves tunable reliability targets (1%-15%) without manual calibration. Implemented as a GPU-accelerated dApp, LOLLA reduces control latency to <500μs and demonstrates 15%-92% throughput gains over OLLA across Doppler frequencies up to 400 Hz. The policy generalizes to unseen channel models and scales to eight UEs under shared-resource scheduling.

5g nr · olla · sinr · ppo · bler
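
For context on the baseline being replaced, conventional OLLA is usually described as a simple feedback rule: nudge the SINR offset up after each ACK and down after each NACK, with the step ratio chosen so the offset is stationary at the target BLER. The sketch below shows that rule; the function name, default step size, and sign convention are assumptions for illustration, not details from the paper.

```python
def olla_step(offset_db, ack, bler_target=0.1, step_down_db=1.0):
    """One outer-loop update of the SINR offset (in dB) from HARQ feedback."""
    # Chosen so that (1 - BLER) * step_up == BLER * step_down at the target.
    step_up_db = step_down_db * bler_target / (1.0 - bler_target)
    if ack:
        return offset_db + step_up_db
    return offset_db - step_down_db

# Nine ACKs and one NACK match a 10% BLER, so the offset returns to its start.
offset = 0.0
for ack in [True] * 9 + [False]:
    offset = olla_step(offset, ack)
print(offset)
```

The fixed steps are what require manual calibration in practice, which is the part a learned, telemetry-conditioned offset aims to replace.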

Minimax Quantile Lower Bounds for Interactive Statistical Decision Making with Privacy

arXiv cs.LG · Raghav Bongole, Amirreza Zamani, Tobias J. Oechtering, Mikael Skoglund · 2026-06-22

The paper develops a δ-explicit minimax-quantile theory for interactive statistical decision making (ISDM), addressing rare but consequential failures overlooked by expectation-based criteria. It establishes structural relations between minimax quantiles, lower minimax quantiles, and minimax risk, including quantile-to-expectation conversions. The authors derive high-probability interactive Fano's and Le Cam's methods as converse tools for ISDM, and extend the framework to mutual-information privacy via decision class restrictions. For Gaussian privatization, they isolate privacy-induced variance inflation using a two-point template, applying it to Gaussian mean estimation and bandit problems. Results include explicit lower bounds scaling as log(1/δ)/n for estimation and √(Tlog(1/δ)) or √(KTlog(1/δ)) for bandits, with privacy effects quantified via Gaussian variance inflation.

minimax-quantile · interactive statistical decision making · mutual-information privacy · gaussian privatization · variance inflation

FlowTrain: Flow-Based Decoupled Training for Industrial-Grade Vision-Language Models

arXiv cs.LG · Zhida Jiang, Zhaolong Xing, Yang Pei, Xiaolong Chen · 2026-06-22

FlowTrain introduces a flow-based decoupled training framework for vision-language models (VLMs) that reformulates training as a producer-consumer dataflow coordinated via a unified memory pool, enabling independent progress of encoder and backbone modules. The method employs a heterogeneous parallel allocator for module-specific parallelism strategies and a dynamic packing scheduler for balanced microbatches based on LLM-side computation costs. Experiments demonstrate 50% MFU and 1.7x throughput improvement, significantly reducing the efficiency gap to LLM-only training.

vision-language models · decoupled training · heterogeneous parallelism · dynamic packing scheduler · memory pool

PeLAP-A: Adaptive Latent Pruning for Lightweight Latent Diffusion Models

arXiv cs.LG · Kissa Zahra, Zaib Un Nisa · 2026-06-22

PeLAP-A introduces adaptive latent pruning for latent diffusion models, augmenting the standard pipeline with a learnable channel-wise importance predictor implemented as a two-layer MLP. The framework jointly trains the predictor with diffusion, reconstruction, and sparsity losses on CIFAR-10, revealing a sparsity collapse phenomenon where aggressive regularization (λ=0.01) drives latent channels near-zero while improving diffusion loss (0.0236 vs. 0.0240) and VAE reconstruction MSE (22.59 vs. 24.67). This demonstrates UNet robustness to latent channel suppression and provides insights into latent diffusion sparsity dynamics.

latent diffusion · adaptive pruning · sparsity collapse · channel-wise importance · denoising unet

Who Owns the AI Recommendation? A Multi-Industry Empirical Map of Brand Category Ownership Across Large Language Models

arXiv cs.LG · Dmitrij Żatuchin · 2026-06-22

This study introduces three metrics, the Category Ownership Index (COI), Competitive Vacuum Index (CVI), and Displacement Score (DS), to empirically map brand category ownership in AI-generated recommendations across large language models. Analyzing 3,750 responses from GPT-5.2, Google Gemini 3 Flash, and Perplexity sonar-pro across 50 brands and five industries, the author finds moderate recommendation concentration (mean Gini coefficient: 0.28), rare competitive vacuums (8.0% of queries), and industry-dependent displacement (mean DS: 2.4:1). Cross-model agreement on top-recommended brands was 41.6%, challenging winner-takes-all narratives. The metrics provide a reproducible framework for competitive intelligence analysis.

category ownership index · competitive vacuum index · displacement score · gini coefficient · large language models
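
The concentration figure above is a Gini coefficient. As a reminder of what that number measures, here is a minimal sketch of the standard formula over non-negative shares; how the paper aggregates recommendations into shares is not stated in the summary, so the inputs below are purely illustrative.

```python
def gini(shares):
    """Gini coefficient of non-negative values (0 means a perfectly even split)."""
    xs = sorted(shares)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n

print(gini([1, 1, 1, 1]))  # every brand recommended equally often
print(gini([0, 0, 0, 1]))  # a single brand takes every recommendation
```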

Counterfactual learning of new adaptive instructional policies using logged data

arXiv cs.LG · Samuel Girard, Sein Minn, Amel Bouzeghoub, Jill-Jênn Vie · 2026-06-22

The paper introduces an offline contextual bandit framework for optimizing instructional policies in Intelligent Tutoring Systems (ITS) using logged interaction data, eliminating the need for costly online experimentation or student simulators. The method maps student-item interactions onto a continuous latent proficiency-difficulty scale via a Rasch model, framing tutoring as a continuous stochastic bandit problem. A novel reward function balances task challenge with student success to optimize 'flow'. The approach includes round-specific behavior policy estimation for off-policy evaluation and ITS adaptivity diagnostics. Evaluated on four large-scale real-world datasets, the framework achieves consistent policy improvements over logged behavior policies, enabling rapid policy learning and visualization within seconds.

contextual bandit · rasch model · off-policy evaluation · intelligent tutoring systems · latent proficiency-difficulty scale
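
The Rasch model mentioned above places students and items on one latent scale: the probability of a correct answer is a logistic function of proficiency minus difficulty. A minimal sketch follows, with an illustrative function name; the paper's reward function is not specified in the summary and is not modelled here.

```python
import math

def rasch_probability(proficiency, difficulty):
    """P(correct answer) under the Rasch model: sigmoid(proficiency - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(proficiency - difficulty)))

# A student facing items that are easier than, matched to, and harder than their level.
for difficulty in (-1.0, 0.0, 1.0):
    print(difficulty, round(rasch_probability(0.0, difficulty), 3))
```

Choosing the next item then amounts to picking a point on this continuous difficulty scale, which is what makes a continuous bandit formulation natural.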

A Novel Approach to Temporal QoS Estimation via Extended Kalman Filter-Incorporated Latent Feature Analysis

arXiv cs.LG · Ye Yuan, Song Wang, Hongxun Zhou, Ling Wang · 2026-06-22

The paper proposes an Extended Kalman Filter-Enhanced Latent Feature Analysis (EKL) model for temporal QoS prediction, combining model-driven and data-driven approaches. The method employs an Extended Kalman Filter for temporal latent features, alternating least squares for time-invariant features, and a density-oriented parallel strategy for efficiency. Theoretical convergence is proven, and experiments on real-world datasets show superior accuracy and computational efficiency compared to state-of-the-art models.

temporal qos prediction · extended kalman filter · latent feature analysis · alternating least squares · service-oriented systems

From Point Estimates to Distributions: GMM Pooling for MIL in Preterm Birth Prediction

arXiv cs.LG · Hussain Alasmawi, Numan Saeed, Soha Said, Mohammad Yaqub · 2026-06-22

The paper proposes Gaussian Mixture Model (GMM) pooling for multiple instance learning (MIL), capturing intra-patient variability by modeling feature distributions instead of collapsing bags to point estimates. Applied to preterm birth (PTB) prediction from transvaginal ultrasound (TVUS) images and lymph node metastasis classification, GMM pooling improves PR-AUC from 0.44 to 0.56 on PTB prediction and achieves state-of-the-art performance (0.91 F1-score, 0.89 ROC-AUC, 0.18 MAE) on the metastasis benchmark. The method addresses variable-sized image bags in clinical settings while preserving distributional information.

gaussian mixture model · multiple instance learning · preterm birth prediction · transvaginal ultrasound · feature distribution

Do Sparse Autoencoders Learn Meaningful Concept Hierarchies?

arXiv cs.LG · Nils Grandien, David Steinmann, Felix Friedrich, Kristian Kersting · 2026-06-22

The study establishes formal requirements for evaluating hierarchical concept discovery in sparse autoencoders (SAEs), drawing from semantic networks and taxonomy research. It introduces a concrete evaluation protocol and applies it to SAEs trained on visual data. Results indicate that while SAE feature spaces support hierarchical organization, both hard and soft feature absorption systematically degrade hierarchy quality, revealing a fundamental challenge for future methods.

sparse autoencoders · concept discovery · feature hierarchy · feature absorption · unsupervised learning

Generalized nonparametric regression in reproducing kernel Hilbert spaces: Consistency and rates of convergence

arXiv cs.LG · Ioannis Kalogridis · 2026-06-22

The paper establishes a theoretical framework for regularized M-estimation in reproducing kernel Hilbert spaces (RKHS), proving existence, measurability, and convergence rates for both convex and non-convex losses. The method combines functional analysis and empirical process theory to enable asymptotic linearization without requiring closed-form solutions or global Lipschitz conditions. Results demonstrate sharp convergence rates with explicit bias-variance decomposition, showing variance independence from misspecification and bias dependence on source conditions. For tensor product Sobolev spaces, the analysis shows that the curse of dimensionality is mitigated via connections to mixed smoothness spaces. Theoretical claims are validated through C++ implementations and numerical experiments.

reproducing kernel hilbert spaces · m-estimation · bias-variance decomposition · tensor product sobolev spaces · dominant mixed smoothness

Subject-Level Unknown-Identity Identification from Leap Motion Controller 2 Hand Landmarks

arXiv cs.LG · Bahar Moharrer, Susanna Cifani, Marco Raoul Marini, Luigi Cinque · 2026-06-22

The study introduces a subject-level unknown-identity identification framework using Leap Motion Controller 2 (LMC2) hand landmark data on the Multi View Leap2 Hand Pose (ML2HP) dataset. The method enriches geometric representations with fingertip-to-palm distances and palm-normalized inter-finger angular descriptors, evaluated under a Leave-One-Subject-Out (LOSO) protocol. A tree ensemble baseline outperforms neural alternatives, including centroid matching with cosine similarity and MLP+OpenMax, highlighting the challenge of robust score separation between known and unknown probes. Results demonstrate the feasibility of compact, interpretable landmark-based descriptors for contactless hand-based unknown-subject rejection in small-cohort datasets.

leap motion controller 2 · leave-one-subject-out · unknown-identity identification · hand landmarks · open-set recognition

Scalable Physics-Inspired Transformers for Spin Glasses

arXiv cs.LG · Lu Zhong, Wenli Duan, Jing Liu, Pan Zhang · 2026-06-22

The authors introduce a physics-inspired transformer architecture for efficient Boltzmann sampling in frustrated spin-glass systems, addressing scalability and computational cost limitations of existing methods. The model employs interpretable sparse attention mechanisms and spin-specific positional embeddings, combined with FlashAttention-accelerated parallel ancestral sampling. Results demonstrate 100x speedup over vanilla variational autoregressive networks, enabling single-GPU simulations of Sherrington-Kirkpatrick and Edwards-Anderson models at unprecedented scales while resolving full probability distributions, free energies, and overlap statistics across temperature regimes where prior ML methods failed.

spin glasses · boltzmann sampling · sparse attention · positional embeddings · ancestral sampling

TaLK: Text-attributed Graph Dataset Distillation via Coupling Language Model with Graph-Aware Kernel

arXiv cs.LG · Yeongho Kim, Yeonje Choi, Kijung Shin · 2026-06-22

The paper introduces TaLK, a dataset distillation method for text-attributed graphs (TAGs) that couples a language model with a graph-aware neural tangent kernel, avoiding costly joint LM-GNN training. The approach efficiently synthesizes data by capturing both textual semantics and graph structure without full-dataset retraining. Experiments on multiple TAG benchmarks demonstrate TaLK's effectiveness, achieving up to 97% of full-dataset performance with only 1% synthetic data.

text-attributed graphs · dataset distillation · neural tangent kernel · language model · graph neural network

Topological Out-of-Domain Generalization in Dynamical Systems Reconstruction

arXiv cs.LG · Georg Trede, Charlotte Ricarda Doll, Elias Weber, Daniel Durstewitz · 2026-06-22

The paper addresses out-of-domain generalization in dynamical systems reconstruction (DSR) by identifying three structural mismatches between existing models and physical systems. It proposes feature splitting and derives a closed-form bound on extrapolation range, enabling zero-shot prediction into unseen dynamical regimes, including across tipping points. Empirical results demonstrate improved accuracy without requiring fine-tuning or retraining in new domains.

dynamical systems reconstruction · out-of-domain generalization · feature splitting · zero-shot prediction · tipping points

PG-MAP: Joint MAP Optimization for Inference-Time Alignment of Diffusion and Flow-Matching Models

arXiv cs.LG · Ruolan Sun, Pawel Polak · 2026-06-22

PG-MAP introduces a training-free framework for joint inference-time alignment of diffusion and flow-matching models via trajectory-level Gibbs-MAP optimization over conditioning $c$ and latent state $z_t$. The method employs forward-consistency coupling and optional reward guidance, enabling coordinated updates across modalities while maintaining transport-specific compatibility. Evaluations on Stable Diffusion variants (SD1.5, SDXL, SD3.5-medium) show consistent improvements in PickScore (91.9%), Aesthetic metrics, and human preference (75.7% win rate), with analysis revealing prompt-dependent optimization dynamics.

inference-time alignment · gibbs-map optimization · diffusion models · flow-matching · forward-consistency coupling

Domain-incremental audio classification using domain-specific experts and prototype classifier

arXiv cs.LG · Jongyeon Park, Do-Hyeon Lim, Sang-won Park, Hong Kook Kim · 2026-06-22

The work presents a domain-incremental learning (DIL) system for audio classification, addressing the challenge of inaccessible past/future domain data. The method employs frozen-feature replay: training compact domain-specific experts at each stage, concatenating their penultimate features, and training a lightweight prototype classifier. Knowledge retention uses DeepInversion-based generative replay and cross-stage regression imputation. Four DIL-compliant systems are evaluated, including a cross-stack ensemble achieving 78.15% micro and 77.03% macro accuracy on the DCASE 2026 development set, outperforming individual backbones.

domain-incremental learning · frozen-feature replay · prototype classifier · deepinversion · cross-stage regression
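
A prototype classifier in its simplest form stores one mean feature vector per class and assigns a new sample to the nearest prototype. The sketch below shows that nearest-class-mean rule on toy 2-D features; in the system described above the inputs would be the concatenated expert features, and the paper's actual classifier may differ in distance measure and training.

```python
def fit_prototypes(features, labels):
    """Average the feature vectors of each class into one prototype per class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(prototypes, x):
    """Return the label whose prototype is closest in squared Euclidean distance."""
    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, x))
    return min(prototypes, key=lambda y: sq_dist(prototypes[y]))

# Toy features and labels (hypothetical).
protos = fit_prototypes([[0, 0], [0, 2], [4, 4], [6, 4]], ["a", "a", "b", "b"])
print(protos)
print(predict(protos, [1, 1]), predict(protos, [4, 5]))
```

Since only per-class means are stored, adding a new domain's expert features does not require revisiting raw data from earlier stages.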

DT-GOL: Dual-Track Geometric Online Learning in Nonstationary Environment with Label Delay

arXiv cs.LG · Yulin Wang, Yi He, Dianlong You, Di Wu · 2026-06-22

The paper proposes DT-GOL, a dual-track online learning framework addressing label latency in nonstationary environments. The method models label delay as a semi-supervised task, using real-time topological feature evolution as a geometric surrogate for concept drift, and introduces dynamic evidence calibration to produce uncertainty-aware soft labels. A decoupled architecture separates stable (ground-truth-updated) and transient (geometry-adapted) learners. Experiments on synthetic and real datasets show DT-GOL outperforms state-of-the-art baselines, particularly under concept drift.

online learning · nonstationary environment · label delay · concept drift · semi-supervised learning

Neural Operator Processes for Probabilistic Operator Learning under Partial Observations

arXiv cs.LG · Jose Miguel Lara-Rangel, Serge Guillas · 2026-06-22

Neural Operator Processes (NOPs) introduce a framework for probabilistic operator learning under partial observations, unifying neural-process conditioning with neural-operator decoding. NOPs employ sparse joint input-output observations and support deterministic and probabilistic prediction via a shared encoder-decoder architecture, incorporating convolutional pooled summaries and query-aligned attention conditioning strategies. Experiments across function regression and PDE benchmarks demonstrate that sparse conditional operator learning matches dense-grid performance in certain regimes, with local context-query geometry preservation being crucial in non-periodic settings. Uncertainty-aware operator learning succeeds when latent conditioning complements local geometric pathways, bridging operator learning and probabilistic meta-learning in function space.

neural operator processes · probabilistic operator learning · partial observations · encoder-decoder architecture · query-aligned attention

CITADEL: CSI-Based Jamming Detection and Open-Set Classification for IIoT Networks

arXiv cs.LG · Aymen Bouferroum, Ildi Alla, Valeria Loscri, Abderrahim Benslimane · 2026-06-22

CITADEL introduces a lightweight two-stage hierarchical pipeline for jamming detection and classification in IIoT networks using Channel State Information (CSI) measurements. The method jointly achieves closed-set classification of known attacks, open-set detection of zero-day attacks, and adversarial resistance, leveraging CSI signatures without requiring raw I/Q data. Evaluated on 6 known and 15 zero-day attack types, it achieves 100% known-attack detection, 97.1% zero-day detection at 0.4% false positive rate, and resists adversarial evasion (<5% success). The pipeline runs in 14.2 ms at 95.9 mJ on edge GPUs, outperforming eight baselines in detection, generalization, and robustness.

channel state information · jamming detection · open-set classification · adversarial robustness · iiot security

FORGE: Fused On-Register Gradient Elimination for Memory-Efficient LLM Training

arXiv cs.LG · Dikshant Kukreja, Kritarth Prasad, Avinash Anand, Zhengkui Wang · 2026-06-22

FORGE introduces fused on-register gradient elimination to reduce memory overhead in LLM training by folding optimizer steps into backward passes, processing gradient tiles in registers without materializing full tensors. The method is exact in full precision and has bounded deviation in bf16/8-bit regimes, where stochastic rounding keeps updates unbiased. It halves optimizer step memory and accelerates training by 1.5x at small batch sizes, enabling 8B model training with 4x larger micro-batches on tensor-parallel Megatron-LM.

gradient elimination · memory-efficient · register fusion · stochastic rounding · tensor-parallel
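
Stochastic rounding is the standard trick for keeping low-precision updates unbiased: a value is rounded up with probability equal to its fractional remainder, so the expected result equals the input. The sketch below illustrates the idea on a coarse grid with a seeded generator; it is a generic illustration with assumed names, not FORGE's kernel.

```python
import math
import random

def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, rounding up with probability equal to the remainder."""
    scaled = x / step
    low = math.floor(scaled)
    round_up = rng.random() < (scaled - low)
    return step * (low + (1 if round_up else 0))

rng = random.Random(0)
samples = [stochastic_round(0.3, 1.0, rng) for _ in range(20000)]
# Individual samples are 0.0 or 1.0, but their mean approaches 0.3.
print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0 every time, so many small updates would be lost entirely; the randomized version preserves them on average.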

EEG Benchmarking Needs a Task Specification Layer: NeuroDoc for Rulebook-Guided, Executable Benchmark Construction

arXiv cs.LG · Chengxuan Qin, Zhige Chen, Shu Peng, Rui Yang · 2026-06-22

The paper introduces NeuroDoc, a task specification layer for EEG benchmarking that standardizes heterogeneous datasets through a structured language and shared rulebook. The method represents benchmarks as task documents synchronized with executable kernels, with fields, evidence requirements, and machine-checkable constraints defined by the rulebook. Results include a community-reviewed corpus of 53 entries with 245 task definitions, validated across four EEG foundation model backbones, demonstrating reusable and auditable benchmarking infrastructure.

eeg benchmarking · task specification · neurodoc · rulebook-guided · executable kernels

PromptDyG: Test-Time Prompt Adaptation on Dynamic Graphs

arXiv cs.LG · Guoguo Ai, Chaoxi Niu, Hui Yan, Joey Tianyi Zhou · 2026-06-22

PromptDyG introduces a novel framework for unsupervised test-time prompt adaptation in dynamic graph learning, addressing limitations of offline methods that fail to capture evolving complexities and distribution shifts in discrete-time dynamic graphs (DTDGs). The method leverages expressive dynamic graph prompts learned on a frozen backbone via feature-wise, label-free entropy minimization to continuously model evolving patterns. Theoretical analysis shows this approach guarantees a larger similarity margin between positive and negative pairs, enhancing prediction accuracy. Empirical evaluations on six benchmark datasets demonstrate consistent and significant improvements over state-of-the-art baselines.

dynamic graphs · test-time adaptation · entropy minimization · online learning · prompt adaptation

Learning Graphs through Continuous Information Entropy Fields

arXiv cs.LG · Hui Cong, Bo Sun, Ziheng Jiao, Yisheng An · 2026-06-22

The paper introduces Field-informed Graph Network (FGN), a framework where graphs emerge from latent continuous information entropy fields rather than treating edges as primitives. FGN learns a scalar field from node features to modulate message passing, optimizing an information-theoretic objective that balances structural fidelity with field smoothness. This creates a self-reinforcing loop where the field guides information diffusion and updated node representations refine the field. Experiments on node and graph classification benchmarks show FGN achieves superior performance, robustness, and structurally coherent field representations.

graph neural networks · information entropy · message passing · scalar field · node classification

Physiology-Aware CNN and Zero-Shot Multimodal LLMs for ECG Image Classification: A Comparative Study

arXiv cs.LG · Khalil Ahammad, Derek Abbott, Mohsen Dorraki · 2026-06-22

The study compares zero-shot multimodal LLMs (GPT-5.2, GPT-4.1, Gemini-2.5 Pro) with physiology-aware CNN models for binary classification of 12-lead ECG images. A novel LeadGroupECG CNN architecture aggregates features from anatomical lead groups, evaluated alongside ResNet18, DenseNet121, and VGG16 on internal and PTB-XL datasets. CNN models achieved stable performance (internal ROC-AUC: 0.92-0.94; external: 0.85-0.86), while LLMs performed near chance (ROC-AUC ~0.5), despite improved PR-AUC with grid-based calibration. Domain-specific architectures remain critical for reliable ECG interpretation.

multimodal llms · ecg classification · zero-shot learning · physiology-aware cnn · roc-auc

Tensor Train Decomposition-based 3D Implicit Full Waveform Inversion with Multi-scale Structural Similarity

arXiv cs.LG · Liangsheng He, Chao Song, Tiansheng Chen, Tao Liu · 2026-06-22

The authors propose TT-3DIFWI, a tensor train decomposition-based 3D implicit full waveform inversion framework with multi-scale structural similarity (M-SSIM) objective. The method represents 3D velocity models via TT decomposition as low-rank core tensors predicted by axis-specific implicit neural networks, reducing memory usage while preserving reconstruction accuracy. M-SSIM leverages multi-scale structural differences and ultra-low frequencies to mitigate cycle skipping. Experiments on synthetic and land datasets show accurate velocity reconstruction even with poor initialization or missing low-frequency data.

tensor train decomposition · implicit neural representation · full waveform inversion · multi-scale structural similarity · cycle skipping

When AUC 0.998 Is Not Enough: A Candidate Evaluation Protocol for Hidden-State Probes of Indirect Prompt Injection in Multimodal Computer-Use Agents

arXiv cs.LG · Yanhang Li, Zhichao Fan, Zexin Zhuang · 2026-06-22

The study proposes a candidate evaluation protocol for hidden-state probes in multimodal computer-use agents, challenging the sufficiency of high AUC scores alone for detecting indirect prompt injection (IPI). Using Qwen2.5-VL-7B on Mind2Web with teacher-forced replay, the authors introduce two post-hoc diagnostics: a paired-construction scalar baseline for text-side injections and same-step nuisance-matched visual controls for overlay surfaces. These diagnostics reveal that high clean-vs-attack AUC does not conclusively indicate malicious-content detection, suggesting partly-semantic interpretations instead. The protocol includes reporting heuristics to clarify what high AUC scores can and cannot confirm, emphasizing that labels indicate injection-surface presence rather than attack success.

hidden-state probing · indirect prompt injection · multimodal agents · auc scores · teacher-forced replay

Chains That See, Answers That Don't: A Multi-Aspect Evaluation Recipe for Forced Chain-of-Thought on Video-MME

arXiv cs.LG · Zhichao Fan, Yanhang Li, Zexin Zhuang · 2026-06-22

The study evaluates forced chain-of-thought (CoT) in vision-language models through a three-probe recipe: paired accuracy comparisons, counterfactual video-swap diagnostics, and visual-degradation tests. Applied to Qwen2.5-VL on Video-MME subsets, results show CoT chains are video-conditioned (video-swaps disrupt chains and flip answers), yet forced CoT does not improve multiple-choice accuracy and may harm performance in smaller models. The method includes strict/permissive regex scoring and multiplicity correction, with raw responses and scripts provided for reproducibility.
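
The strict/permissive scoring distinction can be sketched as follows. This is only an illustration: both regexes and the "take the last standalone option letter" rule are assumptions, since the summary does not give the paper's actual patterns.

```python
import re

# Hypothetical patterns; the paper's real regexes are not given in this summary.
STRICT = re.compile(r"Answer:\s*\(?([A-D])\)?\s*$")   # requires a final "Answer: X"
PERMISSIVE = re.compile(r"\b([A-D])\b")                # any standalone option letter


def score(response: str, gold: str) -> dict:
    """Return strict and permissive correctness for one multiple-choice response."""
    strict_match = STRICT.search(response.strip())
    letters = PERMISSIVE.findall(response)
    return {
        "strict": bool(strict_match) and strict_match.group(1) == gold,
        # Assumed permissive rule: use the last standalone option letter, if any.
        "permissive": bool(letters) and letters[-1] == gold,
    }


print(score("The clip shows a dog. Answer: B", "B"))
print(score("I think it is (C), since the car turns left.", "C"))
```

A free-form chain that never ends in a clean "Answer: X" line fails the strict rule but can still pass the permissive one, so reporting both shows how much of an accuracy gap comes from formatting rather than content.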

forced chain-of-thought · vision-language models · video question answering · counterfactual diagnostic · visual-degradation

IndicGuard: A Multilingual Safety Guard Model and Dataset for Indic Languages

arXiv cs.LG · Parth Bramhecha, Smit Deshmukh, Sairaj Bodhale, Adwait Borate · 2026-06-22

The authors introduce IndicGuard, a multilingual safety guard model and dataset for ten major Indic languages, addressing the gap in culturally nuanced content moderation for non-English contexts. They construct a high-volume dataset capturing regional harms and fine-tune a 4B-parameter Gemma-3-4B-IT model for real-time moderation. Evaluations show IndicGuard outperforms CultureGuard in robustness and cross-lingual transfer, even generalizing to low-resource languages excluded from training.

multilingual safety · content moderation · indic languages · cross-lingual transfer · gemma-3-4b-it

RLM-Cascade: Response-Level Speculative Decoding for Cost-Efficient LLM API Serving

arXiv cs.LG · Haifeng Wu, Srinivasan Manoharan, Fangbo Tu, Junhua Zhao · 2026-06-22

RLM-Cascade introduces a response-level speculative decoding system for cost-efficient LLM API serving, eliminating the need for model architecture access or shared vocabularies. The method employs a fast draft model to generate candidate responses, verified or enhanced by a capable model, with a lightweight complexity router determining execution paths. Evaluated on the Claude Code agentic coding workload, RLM-Cascade achieves an 88.8% draft-use rate, reducing API costs by 45.8% and improving median response time by 1.83x compared to Native Opus. Quality benchmarks show a 100% pass rate on a 20-task Code/Math/Instruct benchmark, surpassing Native Opus's 95%. The system is deployed in production and open-sourced with a live metrics dashboard.
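
The route, draft, then verify-or-enhance flow described above might look roughly like the sketch below. Every name here (`cascade`, `router`, `verifier`, the path labels) is hypothetical, and the stub models are placeholders rather than the open-sourced system's API.

```python
from dataclasses import dataclass


@dataclass
class CascadeResult:
    text: str
    path: str  # "direct", "draft", or "enhanced" (labels are illustrative)


def cascade(prompt, router, draft_model, strong_model, verifier) -> CascadeResult:
    """Response-level cascade: only whole responses are exchanged, never tokens or logits."""
    # The router decides whether the request is worth attempting with the draft model.
    if router(prompt) == "hard":
        return CascadeResult(strong_model(prompt), "direct")
    draft = draft_model(prompt)
    # `verifier` stands in for the capable model's accept/reject decision.
    if verifier(prompt, draft):
        return CascadeResult(draft, "draft")
    # Rejected drafts are handed to the capable model to improve.
    improved = strong_model(f"{prompt}\n\nDraft to improve:\n{draft}")
    return CascadeResult(improved, "enhanced")


# Stub components so the sketch runs end to end.
router = lambda p: "hard" if len(p) > 40 else "easy"
draft_model = lambda p: "draft:" + p
strong_model = lambda p: "strong:" + p
verifier = lambda p, d: "bug" not in p

print(cascade("add two numbers", router, draft_model, strong_model, verifier).path)
```

Because the components only pass text to each other, this kind of cascade needs neither a shared vocabulary nor access to model internals, which is the property the summary highlights.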

speculative decoding · complexity router · draft model · api cost reduction · agentic coding

Learning-Augmented Algorithms for Online Vertex Cover

arXiv cs.LG · Tianhang Lu, Runtian Ren, Shengcai Liu · 2026-06-22

The paper introduces learning-augmented algorithms for online vertex cover in bipartite and general graphs, focusing on robustness-consistency tradeoffs. For bipartite graphs, a randomized algorithm achieves $\frac{1}{1-e^{-\lambda}}$-robustness and $\frac{\lambda}{1-e^{-\lambda}}$-consistency. For general graphs, a deterministic algorithm attains $(1+\frac{1}{\lambda})$-robustness and $(1+\lambda)$-consistency, with proven optimality in both settings. Experimental validation on synthetic and real-world datasets confirms the theoretical results.
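
To make the tradeoff concrete, the stated bounds can be tabulated for a few values of the trust parameter. The range 0 < λ ≤ 1 used here is an assumption; the summary does not state the admissible range.

```python
import math


def bipartite_bounds(lam: float) -> tuple:
    """(robustness, consistency) for the randomized bipartite algorithm."""
    denom = 1.0 - math.exp(-lam)
    return 1.0 / denom, lam / denom


def general_bounds(lam: float) -> tuple:
    """(robustness, consistency) for the deterministic general-graph algorithm."""
    return 1.0 + 1.0 / lam, 1.0 + lam


for lam in (0.25, 0.5, 1.0):  # assumed range: 0 < lambda <= 1
    rb, cb = bipartite_bounds(lam)
    rg, cg = general_bounds(lam)
    print(f"lambda={lam}: bipartite ({rb:.3f}, {cb:.3f}), general ({rg:.3f}, {cg:.3f})")
```

At λ = 1 the two guarantees coincide, at e/(e-1) ≈ 1.582 for bipartite graphs and 2 for general graphs. Smaller λ trusts the prediction more: consistency approaches 1 while robustness grows without bound.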

learning-augmented algorithms · online vertex cover · robustness-consistency tradeoffs · bipartite graphs · deterministic algorithm

BranchShine: Compact Raw-Audio-to-IPA Transcription with a RoPE E-Branchformer Encoder

arXiv cs.LG · Nikhil Navas, Sergio Chevtchenko, Talisson Damiao, Saeed Afshar · 2026-06-22

BranchShine introduces a compact 33M-parameter raw-audio-to-IPA transcription model using a convolutional front end and 19-block RoPE E-Branchformer encoder, achieving competitive performance with larger baselines. The system employs CTC recognition and is evaluated on a multilingual test set of 16,660 utterances across 41 languages, yielding a 9.19% whitespace-insensitive IPA character error rate, outperforming the 575M-parameter PhoneticXEUS baseline (9.78%). Analysis reveals complementary strengths with Whisper-Medium, with BranchShine being more conservative on incorrect readings while maintaining competitive accuracy.

ipa transcription · e-branchformer · rope · ctc · multilingual speech

Retrieval-Augmented Multimodal Learning for Enzyme-Substrate Interaction Prediction Under Low-Homology Shift

arXiv cs.LG · Chen Liu, Bingxin Zhou, Xinyuan Wang, Ming Li · 2026-06-22

The paper introduces RAMMESI, a retrieval-augmented multimodal framework for enzyme-substrate interaction (ESI) prediction under low-homology shift. The method learns pairwise enzyme-substrate representations via directional cross-modal interaction modeling and adaptive fusion, augmented at inference by retrieving neighboring enzymes and aggregating their predictions. It employs an imbalance-aware weighted-BCE objective to address sparse supervision. Evaluations on two ESI benchmarks with sequence-identity-aware splits show RAMMESI outperforms baselines, particularly in low-identity regimes, while its retrieval module generalizes as a plug-and-play robustness enhancer for other ESI models.

enzyme-substrate interaction · retrieval-augmented learning · multimodal fusion · low-homology shift · imbalance-aware objective

Towards Robust Personalized Federated Learning: Vulnerability Assessment and Defense Co-Design

arXiv cs.LG · Mingyuan Fan, Cen Chen · 2026-06-22

The paper identifies heightened vulnerability of personalized federated learning (PFL) methods to transfer-based adversarial attacks compared to centralized learning, where malicious clients exploit local model knowledge to compromise peers. Through theoretical analysis and empirical evaluation on multiple benchmarks, the authors demonstrate significant accuracy drops across PFL methods. They propose a defense framework combining stochastic input noise, input-scaled trace regularization, and parameter sensitivity maximization to enhance robustness.

personalized federated learning · adversarial attacks · transfer-based attacks · model robustness · edge computing

Target-Aware Linear Regression Under Distribution Shift

arXiv cs.LG · Zhewen Hou, Tian Zheng · 2026-06-22

The paper introduces target-aware linear regression under distribution shift, where target marginals of covariates and response are known. It proposes the hybrid-loss estimator as a benchmark and develops two tractable alternatives: a constrained moment-matching estimator and a two-stage OLS-calibration estimator. Theoretical analysis compares their asymptotic mean squared errors, identifying conditions for equivalence or divergence. Monte Carlo experiments validate the tradeoffs, showing the two-stage estimator approximates the hybrid benchmark in high signal-to-noise regimes with minimal computational overhead.

distribution shift · linear regression · target-aware estimation · asymptotic mse · moment matching

Statistical Matching via Schrödinger Bridge beyond Conditional Independence

arXiv cs.LG · Eunho Koo, Tongseok Lim, Jinwon Sohn · 2026-06-22

The paper proposes a dependency-aware Schrödinger bridge method for statistical matching that relaxes the restrictive conditional independence assumption (CIA). By incorporating a transportation-based compatibility cost, the approach captures latent Y–Z dependencies while coupling partially overlapping datasets (sharing covariates X but separately observing Y and Z). Theoretical analysis provides recovery guarantees and improvement conditions over CIA baselines. Experiments on synthetic data, CelebA, and Adult datasets demonstrate consistent predictive improvements, particularly in scenarios with strong Y–Z dependence like data recoding.

statistical matching · schrödinger bridge · conditional independence · transportation cost · bidirectional imputation

Factored Gossip DiLoCo: Reducing Blocking Communication in DiLoCo

arXiv cs.LG · Chamin Hewa Koneputugodage, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Hadi Mohaghegh Dolatabadi · 2026-06-22

The paper introduces Factored Gossip DiLoCo, a method to reduce blocking communication in distributed low-communication (DiLoCo) training by replacing exact synchronization with approximate gossip-based mixing. This approach factorizes synchronization into non-blocking and blocking steps, enabling tunable trade-offs between compute utilization and optimization stability. Evaluated on billion-parameter language models in low-bandwidth settings, the method significantly improves compute utilization compared to DiLoCo while maintaining comparable training progress and enhancing robustness to failures.
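
The difference between exact synchronization and gossip-style mixing can be illustrated with scalar "parameters", one per worker. This is a generic pairwise-averaging sketch under assumed names; it does not reproduce the paper's factorization into non-blocking and blocking steps.

```python
import random


def exact_sync(params):
    """Blocking all-reduce: every worker ends up with the global mean."""
    mean = sum(params) / len(params)
    return [mean] * len(params)


def gossip_round(params, rng):
    """Average the values within one random disjoint pairing of workers."""
    order = list(range(len(params)))
    rng.shuffle(order)
    out = list(params)
    for i, j in zip(order[0::2], order[1::2]):
        out[i] = out[j] = (params[i] + params[j]) / 2
    return out


def spread(params):
    return max(params) - min(params)


rng = random.Random(0)
workers = [0.0, 4.0, 8.0, 12.0]
mixed = workers
for _ in range(10):
    mixed = gossip_round(mixed, rng)
print(exact_sync(workers), mixed, spread(mixed))
```

Each gossip round preserves the global mean and shrinks disagreement between workers without any step that waits on all of them at once, which is the kind of approximate mixing the summary contrasts with exact synchronization.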

distributed training · gossip protocol · diloco · synchronization · bandwidth optimization

One-Step Flow Matching for Generative Modeling of Path-Dependent Physical Fields

arXiv cs.LG · Yijing Zhou, Jasmin Jelovica · 2026-06-22

The paper introduces a transformer-based flow matching (FM) model for efficient generation of path-dependent stress fields in physical simulations. The method employs a variational autoencoder (VAE) latent space and formulates field generation as a video synthesis task, incorporating token-level loading embeddings and auxiliary networks. A non-Gaussian source distribution reduces transport path crossings during training, enabling one-step generation without distillation. Results show the model achieves 6-7x speedup over finite element analysis on CPUs and ~100x on GPUs while maintaining accuracy in high-resolution field generation with limited training data.

flow matching · path-dependent stress · variational autoencoder · transformer backbone · finite element analysis

Error Highways: Scaling Predictive Coding to Very Deep Networks

arXiv cs.LG · Amirhossein Mohammadi, Alexander G. Ororbia · 2026-06-22

The paper introduces highway error propagation (HEP), a method to scale predictive coding networks (PCNs) to deep architectures by addressing error signal decay. HEP augments the PC free energy function with linear feedback matrices that directly couple hidden states to output error, bypassing exponential attenuation with depth while preserving local learning rules. Experiments on MNIST and Fashion-MNIST demonstrate that HEP enables effective training of MLPs up to 128 layers with depth-robust accuracy.

predictive coding networks · highway error propagation · free energy function · local learning · error signal decay

GRADE: Graph Representation of LLM Agent Dependency and Execution

arXiv cs.LG · Yue Zhao · 2026-06-22

GRADE introduces a graph representation for LLM agent execution traces, capturing both execution order (via execution edges) and step dependencies (via graded dependency edges). The method reconstructs implicit dependencies through observation, declaration, or inference, enabling analysis beyond traditional trace logs. Evaluated across six corpora spanning tool use, coding, and web tasks, dependency edges outperform run size in failure prediction (maintaining above-chance accuracy in leave-one-corpus-out transfer) while execution edges localize faults in multi-agent runs. The work also analyzes limitations of generic graph neural networks for dependency layer interpretation.
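
One way to picture a trace graph with two edge types is the minimal data structure below. It assumes that the "grade" of a dependency edge records how the dependency was established (observed, declared, or inferred) and that execution order is linear; the class and method names are hypothetical and are not the paper's schema.

```python
from dataclasses import dataclass, field

GRADES = ("observed", "declared", "inferred")  # assumed grade vocabulary


@dataclass
class TraceGraph:
    steps: list = field(default_factory=list)
    execution_edges: list = field(default_factory=list)   # (src, dst): run order
    dependency_edges: list = field(default_factory=list)  # (src, dst, grade)

    def add_step(self, name: str) -> int:
        """Append a step and link it to the previous one in execution order."""
        self.steps.append(name)
        idx = len(self.steps) - 1
        if idx > 0:
            self.execution_edges.append((idx - 1, idx))
        return idx

    def add_dependency(self, src: int, dst: int, grade: str) -> None:
        if grade not in GRADES:
            raise ValueError(f"unknown grade: {grade}")
        self.dependency_edges.append((src, dst, grade))

    def upstream(self, step: int) -> set:
        """All steps that `step` transitively depends on."""
        seen, stack = set(), [step]
        while stack:
            cur = stack.pop()
            for src, dst, _ in self.dependency_edges:
                if dst == cur and src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen


g = TraceGraph()
search, read, patch, tests = (g.add_step(n) for n in ("search", "read_file", "write_patch", "run_tests"))
g.add_dependency(search, patch, "observed")
g.add_dependency(read, patch, "declared")
g.add_dependency(patch, tests, "inferred")
print(g.upstream(tests))
```

Keeping the two edge sets separate lets a consumer ask "what ran before this step" and "what did this step actually rely on" independently, which a flat trace log cannot distinguish.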

llm agents · dependency graphs · execution traces · failure prediction · graph neural networks

GARIP: A Running-Average Moving Reference for Last-Iterate Self-Play in Two-Player Zero-Sum Games

arXiv cs.LG · Can Savcı · 2026-06-21

The paper introduces GARIP, a running-average moving reference method for last-iterate self-play in two-player zero-sum games, addressing cycling behavior in naive gradient ascent. GARIP anchors to a running average, minimizing peak lag compared to snapshot-based references like R-NaD, and proves local last-iterate convergence at constant anchor strength. Experiments on matrix games, the Coin Game, and Connect Four/Othello demonstrate GARIP's robustness, matching R-NaD's performance while offering better hyperparameter defaults. The method uniquely achieves collapse rates statistically indistinguishable from baselines, with 0/40 vs 10/40 seed failures in matched-mean-lag settings.

self-play · zero-sum games · gradient ascent · running-average · last-iterate convergence

Clipping the Price of Adaptivity at the Tail

arXiv cs.LG · Itai Kreisler, Yair Carmon, Oliver Hinder · 2026-06-21

The paper introduces a method that circumvents the 'price of adaptivity' barrier in adaptive stochastic convex optimization (SCO) by leveraging additional structural assumptions. Specifically, the objective is assumed to decompose into a model and a loss function, which allows intervening by clipping the model's output when it deviates significantly from a fixed reference model. The approach matches the optimal known-parameter SCO bounds up to factors logarithmic in the uncertainty about the initial distance to optimality and the Lipschitz constant, enabling efficient adaptation to large parameter uncertainties.

stochastic convex optimization · adaptivity · lipschitz constant · model clipping · reference model

From Complaint Narratives to Monetary Relief: A Hybrid Machine Learning Framework for CFPB Consumer Complaints

arXiv cs.LG · Zhuoer Wang, Sizhen Zhu, Xiongyu Chen · 2026-06-21

The paper introduces a hybrid machine learning framework for predicting monetary relief outcomes from Consumer Financial Protection Bureau complaints, formulated as an imbalanced binary classification task. The framework integrates complaint narrative text, LDA-based topic representations, interpretable text-engineered features, and structured categorical attributes, employing an XGBoost classifier trained with a temporal train-test split. Evaluation shows significant performance improvements over a TF-IDF baseline, with AUC-ROC increasing from 0.69 to 0.78 and enhanced PR-AUC under class imbalance. Feature importance analysis highlights the predictive contributions of textual signals, latent topics, and company identity, revealing systematic variation in complaint resolution patterns across financial institutions.
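
A temporal train-test split, as opposed to a random one, can be sketched in a few lines. The record fields (`received`, `relief`) and the cutoff date are made up for illustration and are not the paper's schema or split point.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Train on complaints received before `cutoff`, test on the rest."""
    train = [r for r in records if r["received"] < cutoff]
    test = [r for r in records if r["received"] >= cutoff]
    return train, test


# Hypothetical records: `relief` is the binary monetary-relief label.
complaints = [
    {"received": date(2023, 3, 1), "relief": 0},
    {"received": date(2023, 9, 15), "relief": 1},
    {"received": date(2024, 2, 10), "relief": 0},
    {"received": date(2024, 6, 5), "relief": 1},
]
train, test = temporal_split(complaints, cutoff=date(2024, 1, 1))
print(len(train), len(test))
```

Splitting by date keeps every test complaint strictly later than every training complaint, so the evaluation reflects how the classifier would be used on future filings.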

xgboost · lda · auc-roc · tf-idf · imbalanced classification

LSTM Variants for Chaotic Dynamical Systems: An Empirical Study on the Lorenz Attractor

arXiv cs.LG · Ruslan Gokhman · 2026-06-21

The study evaluates seven recurrent and convolutional architectures for forecasting the chaotic Lorenz attractor in the AI-DEEDS 2026 Chaotic Systems Challenge. Models include vanilla LSTM, LSTM with additive attention, Bidirectional LSTM (BiLSTM), BiLSTM with Huber loss, Temporal Convolutional Network (TCN), and hybrid CNN-LSTM variants. Results show BiLSTM with Huber loss performs best (leaderboard score 58.81), while additive attention degrades performance by >10 points and CNN front-ends slightly harm scores. RMSE analysis reveals BiLSTMs generalize better on challenging pairs (6-7), whereas LSTM+Attention collapses (RMSE up to 8.94). Findings highlight bidirectional context and robust loss benefits in chaotic regimes.

lstm · bidirectional · huber loss · temporal convolutional network · chaotic systems

A Markov Chain Approach to Preference Alignment

arXiv cs.LG · Takuya Koriyama, Tengyuan Liang · 2026-06-21

Markov Chain from Human Feedback (MCHF) is introduced as a method for aligning generative models using pairwise human preferences, contrasting with Reinforcement Learning from Human Feedback (RLHF) and Nash Learning from Human Feedback (NLHF). MCHF constructs a Markov kernel based on pairwise utility $U(x,y)$ and a reference distribution $\mu_{\mathsf{ref}}$, defining a transition mechanism over model outputs. The method converges geometrically fast to a stationary distribution, with convergence rate determined by the seminorm $\|U\|_\oplus$, which captures the non-transitive structure of $U$. MCHF and NLHF are shown to agree up to first order around an RLHF solution when $\|U\|_\oplus$ is small, unifying reward-based, game-theoretic, and Markovian alignment approaches.

markov chain · pairwise preferences · generative models · convergence rate · non-transitive structure

MaRS: Robust Out-of-Distribution Detection via Mahalanobis Residual Scoring

arXiv cs.LG · Francesco Di Salvo, Sebastian Doerrich, Christian Ledig · 2026-06-21

The paper introduces MaRS (Mahalanobis Residual Scoring), a post-hoc out-of-distribution (OOD) detector that improves upon reconstruction-based methods in latent feature space. MaRS employs a lightweight autoencoder to learn an in-distribution manifold and scores deviations using Mahalanobis distance on reconstruction residuals, preserving anisotropic residual structure often collapsed by standard $L_2$ norms. Evaluated across three medical imaging modalities, multiple distribution shifts, and various model scales, MaRS outperforms confidence-, distance-, and reconstruction-based baselines while remaining computationally efficient.
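
Why a Mahalanobis score on residuals can separate cases that an $L_2$ norm cannot is easiest to see in two dimensions. The sketch below takes reconstruction residuals as given (the autoencoder is omitted), uses made-up residual values, and adds a small ridge term `eps`; all of these are assumptions rather than details from the paper.

```python
import math


def fit_residual_stats(residuals):
    """Mean and 2x2 covariance (sxx, sxy, syy) of in-distribution residuals."""
    n = len(residuals)
    mx = sum(r[0] for r in residuals) / n
    my = sum(r[1] for r in residuals) / n
    sxx = sum((r[0] - mx) ** 2 for r in residuals) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in residuals) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in residuals) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis(r, mean, cov, eps=1e-6):
    """Mahalanobis distance using the closed-form inverse of a 2x2 covariance."""
    sxx, sxy, syy = cov
    sxx, syy = sxx + eps, syy + eps  # small ridge for numerical stability
    det = sxx * syy - sxy * sxy
    dx, dy = r[0] - mean[0], r[1] - mean[1]
    return math.sqrt((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)


def l2(r):
    return math.hypot(r[0], r[1])


# Hypothetical in-distribution residuals: wide along x, tight along y.
train_res = [(2, 0.1), (-2, -0.1), (1, -0.1), (-1, 0.1), (1.5, 0.05), (-1.5, -0.05)]
mean, cov = fit_residual_stats(train_res)
a, b = (1.0, 0.0), (0.0, 1.0)  # identical L2 norm, different directions
print(l2(a), l2(b), mahalanobis(a, mean, cov), mahalanobis(b, mean, cov))
```

Both test residuals have the same $L_2$ norm, but the one pointing along the direction where in-distribution residuals barely vary receives a much larger Mahalanobis score. That directional information is the anisotropic structure the summary says a plain norm collapses.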

out-of-distribution detection · mahalanobis distance · autoencoder · latent feature space · distribution shift

RAVEN: Agentic RAG for Automated Vulnerability Repair

arXiv cs.LG · Varun Gadey, Zijie Liu, Alexandra Dmitrienko · 2026-06-21

RAVEN introduces a scalable, agentic retrieval-augmented generation (RAG) framework for automated vulnerability repair, addressing limitations of existing approaches restricted to memory-related vulnerabilities and single-language evaluations. The framework integrates a multi-faceted retrieval pipeline, leveraging open-source LLMs in a locally deployable setting with minimal GPU requirements, and introduces a Curator Agent to retrieve cross-file dependencies for complex vulnerabilities. Evaluated on 160 real-world CVE vulnerabilities across diverse types, two programming languages, and unseen CWE categories, RAVEN achieves an 83.13% repair success rate, outperforming state-of-the-art frameworks while maintaining negligible repair costs.

retrieval-augmented generation · vulnerability repair · cross-file dependencies · open-source llms · cve vulnerabilities

Statistical Inference for Misspecified Contextual Bandits

arXiv cs.LG · Yongyi Guo, Ziping Xu · 2026-06-21

The paper addresses statistical inference challenges in misspecified contextual bandits, where standard algorithms like LinUCB may exhibit non-Gaussian behavior under model misspecification. The authors propose an inverse-probability-weighted Z-estimation framework for marginal moment targets, including projection parameters and off-policy values, and introduce a stability condition called scaled inverse-propensity convergence. This condition ensures consistency, asymptotic normality, and valid sandwich variance estimation. Theoretical guarantees are provided for policy classes like multi-armed bandits and smooth contextual allocations. Empirical validation via simulations and a HeartSteps V1 application demonstrates reliable coverage and competitive performance.

contextual bandits · misspecified models · inverse-probability-weighting · z-estimation · asymptotic normality

Learning Entropy Signature for Image Representation and Classification

arXiv cs.LG · Jan Glaser, Ivo Bukovsky, Noriyasu Homma, Marcel Jirina · 2026-06-21

The paper introduces Learning Entropy Signatures (LES), a novel image descriptor derived from Spatial Learning Entropy Maps (SLEMs) by selecting the K largest LE locations. SLEMs are generated through incremental learning of a pretrained MLP network that processes local pixel neighborhoods sequentially, capturing both local structure and acquired knowledge from prior locations. Experiments demonstrate LES's effectiveness in image classification, showing that a small K preserves discriminative information, linking neural weight learning to information relevance for compact representation.

learning entropy signatures · spatial learning entropy maps · image descriptor · incremental learning · neural weight behavior

Scalable Maximum Entropy Reinforcement Learning for Diffusion Policies via Adjoint Matching

arXiv cs.LG · Serge Thilges, Onur Celik, Denis Blessing, Emiliyan Gospodinov · 2026-06-21

The authors introduce a scalable algorithm for training diffusion policies in reinforcement learning (RL) via adjoint matching, eliminating the need for ground-truth data or costly backpropagation. The method leverages stochastic optimal control to enable simulation-free training and avoids explicit likelihood estimation, with extensions improving robustness. Empirical results show competitive performance with reduced computational overhead, enhancing the viability of diffusion policies for online RL.

diffusion policies · adjoint matching · stochastic optimal control · reinforcement learning · simulation-free training

Sub-Billion, Super-Frontier: Small Language Models Rival Zero-Shot Frontier LLMs on General and Literary Relation Extraction

arXiv cs.LG · Despina Christou, Grigorios Tsoumakas · 2026-06-21

The study demonstrates that small language models (SLMs) with targeted task adaptation can match or exceed zero-shot performance of frontier LLMs on relation extraction (RE) tasks while being computationally efficient. Evaluating five SLMs (360M-3B parameters) across three domain-composition regimes and two prompt-conditioned tuning styles, the best sub-billion model (Qwen2.5-0.5B) achieved 0.83 micro-F1 on general-domain RE, outperforming GPT-5.4 (0.69) and Claude Sonnet 4.6 (0.66). On literary RE, tuned SLMs reached 0.92 F1 versus 0.83 for GPT-5.4, showing task-adapted SLMs enable accurate, private, and hardware-efficient RE without domain-adaptive pretraining gains.

relation extraction · small language models · zero-shot learning · task adaptation · micro-f1

Scalable Bayesian Additive Models for Stellar Flare Detection via Amortized Gaussian Process Inference and Hidden Markov Models

arXiv cs.LG · Rodrigo Herrera, Vianey Leos-Barajas, Gwendolyn Eadie, Elizaveta Semenova · 2026-06-21

The paper introduces a generative surrogate framework to enable scalable Bayesian additive modeling for stellar flare detection, overcoming the computational bottleneck of exact Gaussian Process (GP) inference. The method employs a Variational Autoencoder (VAE) to compress Celerite priors into a low-dimensional isotropic manifold, replacing costly covariance operations with neural network forward passes. Results show the VAE approximation maintains structural fidelity to exact Celerite kernels while enabling efficient joint modeling with hidden Markov models (HMMs), achieving significant speedups in empirical evaluations on astrophysical time series.

gaussian processes · variational autoencoder · celerite · hidden markov model · stellar flare detection

Training-free Task Classification for Multi-Task Model Merging

arXiv cs.LG · Jungyong Son, Jinwook Jung, Sungyong Baik · 2026-06-21

The authors introduce SiM, a training-free task classification method for multi-task model merging that eliminates the need for router training or task-ID access at inference. SiM employs singular value decomposition (SVD)-based low-rank manifold approximations to score tasks by projecting test input features onto pre-computed task manifolds, using only a small per-task support set (e.g., 32 examples). This approach integrates with subspace-/mask-based merging, avoiding full expert parameter storage. Experiments across computer vision and NLP benchmarks show SiM significantly improves merged-model performance, narrowing the gap to individual task experts under task-unknown inference.
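
The idea of scoring a test feature by projecting it onto per-task subspaces can be sketched in pure Python. To stay dependency-free, this simplification keeps only a rank-1 subspace per task and finds it with power iteration instead of a full SVD. The function names, the energy-fraction score, and the toy support sets are all assumptions.

```python
import math


def top_direction(support, iters=100):
    """Dominant right singular vector of the support matrix, via power iteration."""
    dim = len(support[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        proj = [sum(x[i] * v[i] for i in range(dim)) for x in support]              # X v
        w = [sum(p * x[i] for p, x in zip(proj, support)) for i in range(dim)]      # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


def score(feature, direction):
    """Fraction of the feature's squared norm captured by the task direction."""
    dot = sum(f * d for f, d in zip(feature, direction))
    return dot * dot / sum(f * f for f in feature)


def classify(feature, directions):
    scores = [score(feature, d) for d in directions]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy per-task support sets (the paper mentions e.g. 32 examples per task).
task0 = [[1.0, 0.1, 0.0], [0.9, -0.1, 0.1], [1.1, 0.0, -0.1]]
task1 = [[0.1, 1.0, 0.0], [-0.1, 0.9, 0.1], [0.0, 1.1, -0.1]]
directions = [top_direction(task0), top_direction(task1)]
print(classify([0.8, 0.2, 0.05], directions), classify([0.1, 0.7, 0.0], directions))
```

Nothing here is trained: the subspaces are computed once from the support sets, and routing a test input reduces to a handful of dot products.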

task classification · model merging · singular value decomposition · low-rank manifold · training-free

Stationary Robust Mean-Field Games under Model Mismatches

arXiv cs.LG · Yue Wang · 2026-06-21

The paper introduces a stationary robust mean-field game framework to address model mismatches in multi-agent reinforcement learning (MARL) under distributional uncertainty. By incorporating worst-case transition models into population-coupled dynamics, the authors establish a robust dynamic programming principle with a contractive Bellman operator and prove the existence of a stationary robust mean-field equilibrium. They develop a convergent algorithm and show that the mean-field solution induces approximate equilibrium behavior in finite-population games, with explicit non-asymptotic error bounds under contractive robust dynamics. Numerical experiments validate the framework's robustness across multiple uncertainty models.

mean-field games · distributional robustness · multi-agent reinforcement learning · dynamic programming · model uncertainty

Deep material network for homogenization of piezoelectric composites

arXiv cs.LG · Ting-Ju Wei, Yen-Ming Lu, Chuin-Shan Chen · 2026-06-21

The study introduces a piezoelectric deep material network (PDMN) for efficient homogenization of two-phase piezoelectric composites, addressing computational bottlenecks in conventional direct numerical simulation (DNS). The physics-informed surrogate embeds electromechanical homogenization relations into its architecture, enabling offline training on linear electroelastic data and online prediction for nonlinear electroelasticity and history-dependent responses via a Newton-Raphson solver. Validation on polyvinylidene fluoride (PVDF) and lithium niobate (LiNbO$_3$) composites demonstrates three orders of magnitude speedup over DNS while maintaining high accuracy, facilitating multiscale analysis and design.

piezoelectric composites · deep material network · computational homogenization · nonlinear electroelasticity · physics-informed surrogate

Mitigating Measurement-Induced Training Instability in Hybrid Quantum Neural Networks for Protein Classification

arXiv cs.LG · Milton Mondal, Sushovan Chanda, Mohamad Mahdi Alawieh, Brijesh Sukhadiya · 2026-06-21

The paper introduces Quantum Measurement Temperature (QMT), a learnable scaling parameter to mitigate measurement-induced training instability in hybrid Quantum Neural Networks (QNNs). QMT rescales quantum measurement outputs (logits bounded in [-1,1]) before loss computation, addressing gradient suppression caused by weak sensitivity to logit differences in cross-entropy loss. This architecture-agnostic method enhances gradient magnitude, stabilizes training, and improves classification accuracy without modifying quantum circuits. Experiments on fluorescence microscopy and Fashion-MNIST variants demonstrate QMT's effectiveness in logit separation and training stability across random initializations.
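
A short calculation shows why logits confined to [-1, 1] limit cross-entropy training and what a scale factor changes. The sketch assumes the rescaling is a plain multiplication by a scalar `tau`, held fixed here rather than learned, and it uses a two-class example; neither detail is confirmed by the summary.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(measurements, target, tau):
    """Cross-entropy on tau-scaled measurements and its gradient w.r.t. the measurements."""
    probs = softmax([tau * z for z in measurements])
    loss = -math.log(probs[target])
    grad = [tau * (p - (1.0 if k == target else 0.0)) for k, p in enumerate(probs)]
    return loss, grad


# Best possible two-class measurement: fully separated at the bounds.
floor_loss, _ = loss_and_grad([1.0, -1.0], target=0, tau=1.0)
scaled_loss, _ = loss_and_grad([1.0, -1.0], target=0, tau=5.0)
# Nearly tied measurements, where the training signal matters most.
_, grad_plain = loss_and_grad([0.1, -0.1], target=0, tau=1.0)
_, grad_scaled = loss_and_grad([0.1, -0.1], target=0, tau=5.0)
print(floor_loss, scaled_loss, grad_plain[0], grad_scaled[0])
```

With unscaled outputs, even a perfectly separated two-class measurement cannot push the loss below log(1 + e^-2) ≈ 0.127. With `tau = 5` the same measurement gives a loss near zero, and the gradient at the nearly tied measurement grows in magnitude.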

quantum neural networks · logit contraction · measurement-induced instability · gradient suppression · quantum measurement temperature

Robust Diffusion Models via Divergence-Induced Weighted Denoising

arXiv cs.LG · Lei Li, Yuexiao Dong · 2026-06-21

This work introduces a robust training surrogate for diffusion models by replacing the standard MSE denoising loss with a nonlinear transformation induced by an f-divergence. The method leverages a local divergence construction under the Gaussian reverse-kernel structure of DDPM, where each per-step likelihood ratio follows a lognormal distribution parameterized by a scalar mismatch. The resulting training objective unifies diffusion training as divergence-induced weighted denoising, with bounded-influence divergences (e.g., Hellinger) suppressing large error samples. Empirical evaluation on CIFAR-10 under 30% contamination shows that NED reduces FID from 93.0 (KL) to 77.5, outperforming standard robust losses like Huber and clipped MSE.

f-divergence · denoising loss · gaussian reverse-kernel · bounded-influence divergences · robust training

Detecting and Understanding Vulnerabilities in Fully Homomorphic Encryption Frameworks

arXiv cs.LG · Yiteng Peng, Dongwei Xiao, Zhibo Liu, Zhenlan JI · 2026-06-21

HERTA introduces the first automated testing tool for fully homomorphic encryption (FHE) frameworks, addressing vulnerabilities arising from incorrect implementation logic. The tool employs metamorphic testing with novel metamorphic relations derived from FHE semantics, enabling automated correctness testing without manual ground truth. Evaluated on three leading industry frameworks, HERTA identified 21 previously unknown bugs, several of which were confirmed and fixed by developers. The hazard analysis highlights the critical security impact of these bugs on the integrity and availability of FHE-based services.
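
To make the idea of a metamorphic relation concrete, here is a toy sketch. It does not use HERTA or any real FHE framework: it assumes a tiny Paillier-style scheme (additively homomorphic only, with insecure key sizes), and the two relations checked are illustrative choices rather than the relations from the paper. The point is that two encrypted computation paths are compared against each other, so no hand-written expected output is needed.

```python
import math
import random

# Toy Paillier parameters (illustration only, far too small to be secure).
P, Q = 293, 433
N = P * Q
N2 = N * N
G = N + 1
LAM = math.lcm(P - 1, Q - 1)

def _l(x):
    return (x - 1) // N

MU = pow(_l(pow(G, LAM, N2)), -1, N)

def encrypt(m, rng):
    while True:
        r = rng.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N2) * pow(r, N, N2)) % N2

def decrypt(c):
    return (_l(pow(c, LAM, N2)) * MU) % N

def add(c1, c2):
    # Correct homomorphic addition: multiply ciphertexts.
    return (c1 * c2) % N2

def buggy_add(c1, c2):
    # Injected implementation bug: silently adds 1 to the plaintext sum.
    return (c1 * c2 * G) % N2

def count_violations(add_fn, trials=50, seed=0):
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        ca = encrypt(rng.randrange(1000), rng)
        cb = encrypt(rng.randrange(1000), rng)
        # Relation 1: addition commutes under decryption.
        commutes = decrypt(add_fn(ca, cb)) == decrypt(add_fn(cb, ca))
        # Relation 2: adding an encryption of zero changes nothing.
        identity = decrypt(add_fn(ca, encrypt(0, rng))) == decrypt(ca)
        if not (commutes and identity):
            violations += 1
    return violations

print(count_violations(add), count_violations(buggy_add))
```

The buggy addition still commutes, so only the zero-identity relation exposes it, which hints at why a set of complementary relations is useful.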

fully homomorphic encryption · metamorphic testing · implementation bugs · security vulnerabilities · automated correctness testing

The Scissors Effect: When Resize-Based Input Diversity Helps or Hurts Transfer Attacks

arXiv cs.LG · Yuhang Jiang, Xiaojing Chen · 2026-06-21

The paper identifies a "Scissors Effect": Input Diversity (DI), which applies random resizing and padding, has regime-dependent efficacy in transfer-based adversarial attacks. DI enhances transferability for standard surrogates but reduces success for robustly trained ones, with a 10.3% average drop on ImageNet across various architectures and attacks. Gradient geometry analysis attributes 67% of the harm to resizing: alignment improves for standard surrogates but degrades for robust ones. A Local Gradient Consistency (LGC) probe distinguishes surrogate types, and a bias-variance crossover theorem isolates DI's impact. The CG-DI rule disables DI when LGC is high, preserving benefits for standard surrogates while mitigating losses for robust ones.

input diversity · transferability · gradient geometry · local gradient consistency · bias-variance crossover

Interleaved Speech Language Models Latently Work In Text

arXiv cs.LG · Talia Sternberg, Gallil Maimon, Yossi Adi · 2026-06-21

The study reveals that interleaved speech-text language models implicitly transcribe speech to text in intermediate layers despite lacking explicit speech recognition training. Through logit lens analysis across model families and sizes, researchers found text tokens of spoken words become decodable in 77% of cases during an intermediate transcription phase before prediction returns to the speech domain. The work examines how interleaved data and text LM initialization enable this latent behavior and its correlation with spoken knowledge capabilities, providing mechanistic insights for speech LM optimization.

speech language models · logit lens · interleaved training · latent transcription · modality interaction

Federated learning with heavy-tailed gradient noise and communication noise: a variance-reduction based algorithm

arXiv cs.LG · Shengchao Zhao, Yongchao Liu · 2026-06-21

The paper proposes VRA-FedSGD, a variance-reduction based algorithm for federated learning under heavy-tailed gradient and communication noise. The method combines momentum variance reduction with nonlinear mapping to handle gradient noise and employs variance-reduced aggregation for communication noise. Theoretical analysis shows convergence rates of O(K^(-(p-1)/(2p-1))) in mean for nonconvex objectives and Õ(K^(-(1-1/(p-ε)))) almost surely for strongly convex cases, where p is the tail index. Experiments on logistic regression with real-world data validate the approach.
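
As a quick arithmetic check on how these rates depend on the tail index, the sketch below evaluates the two exponents exactly. It assumes the exponents read as (p-1)/(2p-1) and 1 - 1/(p-ε); the function names and the sample values of p and ε are chosen only for illustration.

```python
from fractions import Fraction

def nonconvex_rate_exponent(p):
    # Exponent a in O(K^-a) for the in-mean nonconvex rate.
    p = Fraction(p)
    return (p - 1) / (2 * p - 1)

def strongly_convex_rate_exponent(p, eps):
    # Exponent a in O~(K^-a) for the almost-sure strongly convex rate.
    return 1 - 1 / (Fraction(p) - Fraction(eps))

for p in ("1.5", "2"):
    print(p, nonconvex_rate_exponent(p), strongly_convex_rate_exponent(p, "0.1"))
```

Under this reading, p = 2 gives a nonconvex exponent of 1/3, and heavier tails (smaller p) give slower rates in both cases.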

federated learning · variance reduction · heavy-tailed noise · nonconvex optimization · momentum

Adaptive Recurrent Message Passing for Test Time Computing on Graphs

arXiv cs.LG · Junshu Sun, Wanxing Chang, Qingming Huang, Shuhui Wang · 2026-06-21

The paper introduces AdaR, an Adaptive Recurrent graph model that enables flexible test-time computing on diverse downstream tasks without parameter updates. The method derives step dependence as a necessary and sufficient condition for adaptive convergence, incorporating normalized step information and representation-target relations into recurrent updates while ensuring convergence via gradient-based supervision. Empirical results show AdaR outperforms baselines in both inductive and transductive settings.

adaptive recurrent models · graph learning · test-time computing · step dependence · gradient-based supervision

Music Playlist Captioning at Scale with Large Language Models

arXiv cs.LG · Mathieu Delcluze, Léa Briand, Benjamin Chapus, Deniz Mekik · 2026-06-21

The paper presents Deezer's deployed system for automatic playlist captioning using large language models (LLMs), addressing the challenge of scalable natural language descriptions for personalized recommendations. The method leverages LLMs to generate controlled captions from diverse data sources, now powering the Daily Mix feature. Results show significant user engagement improvements, demonstrating how semantic framing affects perception of unchanged recommendations.

playlist captioning · large language models · personalized recommendations · user engagement · semantic framing

A Differentiable Atari VCS: A Complex, Fully Known Ground Truth for Explainable AI

arXiv cs.LG · Andreas Maier, Siming Bayer, Patrick Krauss · 2026-06-21

The paper introduces differentiable Atari 2600 Video Computer System (VCS) emulators (jutari in Julia, jaxtari in JAX) as complex yet fully inspectable ground-truth systems for explainable AI (XAI) research. Both emulators achieve bit-for-bit equivalence with the xitari emulator across all 64 Arcade Learning Environment games, treating ROM as weights and RAM as soft tape while maintaining differentiability. Theoretical analysis proves equivalence between soft and hard execution modes, with GPU-accelerated batched rollouts reaching millions of steps/second. The 137-hour development involved autonomous coding agents, enabling gradient-based XAI studies on a non-trivial but verifiable system.
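
One way to picture a "soft tape" is a memory read expressed as a weighted sum over all cells, which reduces to an ordinary indexed read when the address weights are one-hot. The sketch below is a toy illustration under that assumption; it is not taken from jutari or jaxtari, and all names are hypothetical.

```python
def hard_read(ram, addr):
    # Ordinary emulator-style read: pick exactly one cell.
    return ram[addr]

def one_hot(addr, size):
    return [1.0 if i == addr else 0.0 for i in range(size)]

def soft_read(ram, weights):
    # Soft read: a weighted sum over every cell, which is differentiable
    # with respect to both the weights and the cell values.
    return sum(w * v for w, v in zip(weights, ram))

ram = [10, 20, 30, 40]

# With a one-hot address distribution the soft read matches the hard read.
exact = soft_read(ram, one_hot(2, len(ram)))

# With a spread-out distribution the read blends neighbouring cells.
blended = soft_read(ram, [0.0, 0.5, 0.5, 0.0])

print(hard_read(ram, 2), exact, blended)
```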

differentiable emulator · explainable ai · arcade learning environment · surrogate gradients · batched rollouts

Escaping the Variance Trap: Jacobian-Free Dynamics for Root-Finding Bilevel Optimization

arXiv cs.LG · Zhiyu Li, Xi Xuan, Davide Carbone · 2026-06-21

The paper introduces Root-Finding Bilevel Optimization (RF-BO) as a solution to the Variance Trap, a pathology in stochastic root-finding tasks like entropy tuning and GAN equilibration. The proposed Jacobian-free method employs Two-Time-Scale Stochastic Approximation (TTSA) to update along root error directly, avoiding variance amplification from implicit Jacobians. Theoretical non-asymptotic convergence guarantees are provided under Markovian noise. Experiments show 2.6% top-1 accuracy gain in SimCLR, 17× faster ODE control convergence, improved RL entropy stability, and 11.1% generative modeling quality improvement over squared-residual baselines.
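
The idea of updating along the root error itself can be sketched on an entropy-tuning toy problem. This is a single-time-scale Robbins-Monro iteration, which is simpler than the paper's TTSA scheme; the logits, target entropy, step-size schedule, and function names are all assumptions made for illustration.

```python
import math
import random

LOGITS = [2.0, 1.0, 0.0, -1.0]
TARGET_ENTROPY = 1.0  # in nats, hypothetical target

def policy(temp):
    scaled = [z / temp for z in LOGITS]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def noisy_entropy(temp, rng, batch=8):
    # Monte Carlo entropy estimate from sampled actions.
    probs = policy(temp)
    actions = rng.choices(range(len(probs)), weights=probs, k=batch)
    return -sum(math.log(probs[a]) for a in actions) / batch

def tune_temperature(steps=20000, seed=0):
    rng = random.Random(seed)
    temp = 1.0
    for k in range(steps):
        step = 1.0 / (k + 10) ** 0.75
        # Move directly along the noisy root error; no Jacobian of the
        # entropy with respect to the temperature is ever computed.
        root_error = noisy_entropy(temp, rng) - TARGET_ENTROPY
        temp = max(0.05, temp - step * root_error)
    return temp

temp = tune_temperature()
print(f"temperature {temp:.3f}, entropy {entropy(policy(temp)):.3f}")
```

Because entropy increases with temperature here, stepping against the sign of the root error is enough to drive the residual toward zero.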

bilevel optimization · stochastic approximation · variance trap · implicit jacobians · root-finding

Words as Difference Makers: How Large Language Models Determine Causal Structure in Text

arXiv cs.LG · Wolfgang Pietsch · 2026-06-21

The paper proposes that large language models (LLMs) infer causal structure through variational induction, a difference-making logic contrasting with Pearl's interventionist and Neyman-Rubin frameworks. It analyzes how LLM training on diverse textual contexts enables identification of difference-makers (causal factors) via token embeddings and self-attention mechanisms. The study demonstrates architectural parallels between LLMs' inductive approach and experimental methods, where systematic variation isolates causal relations in word sequences.

variational induction · difference-makers · token embeddings · self-attention · causal structure

Enhancing LLMs for Graph Tasks via Graph-aware LoRA Generation

arXiv cs.LG · Junshu Sun, Wanxing Chang, Qingming Huang, Shuhui Wang · 2026-06-21

The paper introduces GaRA, a Graph-aware LoRA generation model that enhances large language models (LLMs) for graph tasks via weight-level information injection. GaRA generates task-specific low-rank weight updates conditioned on original graph structures while constraining update norms, preserving whole-graph information and avoiding optimization bias. Evaluations show GaRA outperforms baselines in zero-shot graph learning, demonstrating improved transferability compared to traditional graph neural networks (GNNs) and existing LM adaptation methods.

graph-aware lora · zero-shot learning · weight-level adaptation · low-rank updates · graph neural networks

QeHDC: Hyperdimensional Computing based on Quantum-enhanced binding and SuperClass Construction

arXiv cs.LG · Yangjie Xu, Hui Huang, Li Ning, Radu State · 2026-06-21

The paper introduces QeHDC, a quantum-enhanced hyperdimensional computing framework featuring one-pass training via sinusoidal/quantum encoding and a novel reference-state-based quantum binding operation. The method employs density-matrix-based superclass generation through eigenvalue decomposition for robust feature extraction. Evaluations on benchmark datasets show superior performance, noise robustness, and computational feasibility compared to classical HDC and existing quantum-enhanced approaches, demonstrating practical potential for quantum classification tasks.

hyperdimensional computing · quantum encoding · density-matrix · eigenvalue decomposition · quantum binding

Asymptotic Signal Subspace Recovery in Softmax Attention Models

arXiv cs.LG · Lan V. Truong · 2026-06-21

The study provides a theoretical foundation for softmax-attention mechanisms by analyzing their signal extraction capabilities in high-dimensional noisy environments. Using a stylized softmax-attention model, the authors derive a population objective and characterize the learning dynamics via a limiting ordinary differential equation. Through stochastic approximation and dynamical systems theory, they prove that the learned query vector converges almost surely to the one-dimensional signal subspace spanned by the latent informative direction, recovering the signal up to sign ambiguity. This establishes attention mechanisms as effective signal extraction procedures under high-dimensional scaling assumptions.

softmax-attention · signal subspace · stochastic approximation · dynamical systems · high-dimensional scaling

Reinforcement learning to improve large language model-based automated code compliance systems

arXiv cs.LG · Jack Wei Lun Shi, Minghao Dang, Wawan Solihin, Leong Hien Poh · 2026-06-21

The paper introduces P4IR, a two-stage framework combining supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to enhance large language model (LLM)-based automated code compliance systems. SFT instills domain knowledge, while GRPO optimizes intermediate representations as high-level code skeletons. The framework reduces tree edit distance by 23.8% and token-level Levenshtein distance by 38.6% compared to SFT baselines. In zero-shot settings, P4IR outperforms Claude Opus, Sonnet 4.5, GPT-5.2, Qwen-3-Max, and GLM-4.7 in code structure and semantics, while GRPO significantly reduces false positives.
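
For readers unfamiliar with the token-level metric, a standard Levenshtein distance over token sequences can be sketched as follows. The whitespace tokenization and the sample strings are assumptions for illustration; the paper's tokenizer and its tree edit distance computation are not reproduced here.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance over two sequences,
    # keeping only the previous row to save memory.
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

reference = "if width >= 1200 then pass".split()
candidate = "if width > 1200 then pass else fail".split()
print(levenshtein(reference, candidate))
```

The same function works on strings character by character, since it only relies on sequence indexing and equality.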

supervised fine-tuning · group relative policy optimization · code compliance · tree edit distance · levenshtein distance

Multi-cancer detection using a computationally efficient CNN with transfer learning

arXiv cs.LG · Vasileios E. Papageorgiou, Georgios Petmezas, Dimitrios-Panagiotis Papageorgiou, Leandros Stefanopoulos · 2026-06-21

The study proposes a lightweight CNN architecture with transfer learning for multi-cancer detection from biomedical images, optimized for resource-constrained environments. The method employs pretraining on one cancer type followed by fine-tuning (20 epochs, ~0.014s/image/epoch on GTX 960) on brain MRI, lung CT, and kidney CT datasets. Results show 90.85±2.22%, 98.64±2.43%, and 99.92±0.08% test accuracy for brain, lung, and kidney cancer respectively via 5-fold CV, outperforming Xception/VGG/MobileNetV2/DenseNet121 baselines while maintaining computational efficiency.

convolutional neural network · transfer learning · multi-cancer detection · computational efficiency · biomedical imaging

Bypassing Minimization Bias: A Shift-Invariant Variance Estimator for Off-Equilibrium Local Learning Coefficients

arXiv cs.LG · Yingjia Cai · 2026-06-21

The paper introduces the Shift-Invariant Variance Estimator (SIVE), a method to estimate Local Learning Coefficients (LLC) without requiring knowledge of the loss minimum during off-equilibrium training phases. SIVE leverages variance-based estimation to eliminate dependency on an unknown additive baseline, combining shift-invariance with a correction derived from the Law of Total Variance to separate geometric fluctuations from mini-batch noise. Experiments on toy models and deep neural networks demonstrate SIVE's robustness in recovering geometric signals and tracking structural phase transitions, outperforming mean-energy LLC estimators in transient regimes.
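
The appeal of a variance-based statistic when the loss minimum is unknown can be seen in a few lines. This sketch only demonstrates shift-invariance of the variance; it is not the SIVE estimator, omits the mini-batch noise correction, and uses made-up loss samples.

```python
import statistics

# Hypothetical sampled losses around some point in training.
losses = [2.31, 2.27, 2.40, 2.35, 2.29]

def mean_energy(samples, baseline):
    # Depends on the baseline (for example, an assumed loss minimum).
    return statistics.fmean(s - baseline for s in samples)

def variance_energy(samples, baseline):
    # Unchanged by any constant shift of the samples.
    return statistics.pvariance([s - baseline for s in samples])

for baseline in (0.0, 2.0):
    print(baseline, mean_energy(losses, baseline), variance_energy(losses, baseline))
```

A mean-based quantity moves with every guess of the baseline, while the variance gives the same value whichever baseline is subtracted.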

local learning coefficient · shift-invariant estimator · minimization bias · singular learning theory · loss landscape

Structured Hyperedge Adaptation for Parameter-Efficient Fine-Tuning of Vision Transformers

arXiv cs.LG · Edwin Kwadwo Tenagyei, Lei Wang, Ugochukwu Ejike Akpudo, Jun Zhou · 2026-06-21

The paper introduces HyperAdapter, a hypergraph-based adapter for parameter-efficient fine-tuning (PEFT) of Vision Transformers (ViTs). Unlike token-wise adapters, HyperAdapter performs adaptation in hyperedge space, leveraging structured relationships among tokens via soft token routing and prototype-based assignments. This approach aggregates token features into hyperedge representations, applies bottleneck adaptation, and diffuses updates back through the hypergraph structure. Experiments show HyperAdapter outperforms existing PEFT methods under comparable parameter budgets, particularly in tasks requiring structured reasoning, demonstrating the importance of adaptation space design in ViT transfer learning.

hyperadapter · parameter-efficient fine-tuning · vision transformers · hyperedge space · structured adaptation

Multigrid Training for Molecular Generation using Graph Neural Networks

arXiv cs.LG · Zixuan Ling, Paula Mercurio, Di Liu · 2026-06-21

The paper introduces a multigrid training strategy for molecular generation that accelerates learning by transferring parameters across resolutions. For graph representations, parameters learned from coarse graphs are progressively transferred to finer graphs via biased random walk upsampling. For 3D grids, a coarse-resolution CVAE is pretrained and its convolutional parameters are transferred to initialize a fine-resolution model. Experiments on receptor-conditioned 3D ligand generation demonstrate faster convergence and improved generalization compared to training from scratch.

multigrid training · graph neural networks · molecular generation · variational autoencoder · parameter transfer

Kiwano: A Cutting-Edge Open-Source Toolkit for Speaker Verification

arXiv cs.LG · Mickael Rouvier, Pierre Michel Bousquet · 2026-06-21

The authors introduce Kiwano, an open-source PyTorch-based toolkit for speaker verification research, offering standardized training pipelines, pretrained models, and evaluation protocols. The framework supports multiple architectures, emphasizes reproducibility through unified benchmarks, and includes tools for experiment tracking and rapid prototyping. Distributed under Apache 2.0 with comprehensive documentation, Kiwano aims to lower entry barriers and standardize practices in speaker verification research and development.

speaker verification · pytorch · reproducibility · benchmarking · open-source

Reference-Free Assessment of Physical Consistency in World Model-based Video Generation

arXiv cs.LG · Yun Oh, Sukmin Yun · 2026-06-21

The authors propose reference-free measures for assessing physical consistency in world model-based video generation, addressing fidelity gaps that hinder robotic simulation accuracy. The method combines relative and absolute approaches, leveraging DROID-SLAM and SEA-RAFT to quantify inconsistencies, inspired by WorldScore. Relative consistency filtering improves task success rates by over 8%, reducing the simulation-to-reality gap, while absolute assessment enables spatio-temporal localization of physical artifacts. This approach eliminates the need for costly human voting or ground-truth references, offering a scalable alternative to existing methods like Elo and FVD.

physical consistency · video generation · droid-slam · sea-raft · worldscore

📰 Industry Media (7)

The $400 million machine powering the future of chipmaking

MIT Tech Review — AI · Clive Thompson · 2026-06-23

ASML's latest extreme-ultraviolet (EUV) lithography machine achieves 8nm resolution (≈40 silicon atoms) using high-numerical-aperture optics and vacuum-based tin plasma light sources, enabling continued Moore's Law scaling for AI chips. The $400M system took 16 years and $10B in R&D to develop, and ASML's machines account for 90% of global lithography tools. Geopolitical tensions arise as US export controls block Chinese access, prompting domestic R&D efforts in China. EUV's wavelength (13.5nm) and precision mirror alignment enable the transistor density critical for Nvidia GPUs and large language model training hardware.

extreme-ultraviolet lithography · numerical aperture · moore's law · tin plasma light source · wafer patterning

Datalab Releases lift: A 9B Open-Weights Vision Model That Extracts Structured JSON From PDFs Using Schemas

MarkTechPost · Asif Razzaq · 2026-06-23

Datalab introduces lift, a 9B-parameter open-weights vision model for structured JSON extraction from PDFs and images using schema-constrained decoding. The model accepts JSON Schema inputs and outputs valid JSON structures, enforcing token-level constraints during generation. lift achieves 90.2% field accuracy on a 225-document benchmark, with a median latency of 9.5 seconds per document. It supports multi-page documents in a single pass and includes trained abstention to return null for missing fields. Benchmarks show lift outperforms other self-hostable models like NuExtract3 and Qwen3.5-9B in field accuracy but lags in full-document accuracy.
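
A model can lead on field accuracy yet trail on full-document accuracy because the two metrics aggregate errors differently. The sketch below assumes field accuracy is the share of individual fields matched and full-document accuracy is the share of documents with every field matched; the benchmark's exact definitions may differ, and the schema fields and predictions are invented.

```python
def field_accuracy(preds, golds):
    correct = total = 0
    for pred, gold in zip(preds, golds):
        for key, expected in gold.items():
            total += 1
            # A missing key is treated as null, so abstaining on a field
            # whose gold value is None counts as correct.
            correct += pred.get(key) == expected
    return correct / total

def document_accuracy(preds, golds):
    exact = sum(
        all(pred.get(k) == v for k, v in gold.items())
        for pred, gold in zip(preds, golds)
    )
    return exact / len(golds)

golds = [
    {"invoice_id": "A1", "total": 10, "currency": "USD", "po_number": None},
    {"invoice_id": "B2", "total": 25, "currency": "EUR", "po_number": None},
    {"invoice_id": "C3", "total": 40, "currency": "GBP", "po_number": None},
]

# Model A: one hallucinated field per document instead of returning null.
model_a = [dict(g, po_number="PO-?") for g in golds]

# Model B: two perfect documents, one document entirely wrong.
model_b = [dict(golds[0]), dict(golds[1]),
           {"invoice_id": "X", "total": 0, "currency": "JPY", "po_number": "PO-9"}]

for name, preds in (("A", model_a), ("B", model_b)):
    print(name, field_accuracy(preds, golds), document_accuracy(preds, golds))
```

In this toy data, model A has the higher field accuracy while model B has the higher full-document accuracy.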

schema-constrained decoding · json extraction · vision model · abstention · benchmark

How to Use NVIDIA Canary-1B-v2 for ASR, Translation, and Automatic SRT Subtitle Export in Python

MarkTechPost · Sana Hassan · 2026-06-23

The tutorial demonstrates an end-to-end workflow for multilingual automatic speech recognition (ASR) and translation using NVIDIA Canary-1B-v2. It details environment setup with NeMo toolkit, audio preprocessing to 16kHz mono, and inference on GPU-enabled hardware. Results include English ASR (word-level WER not reported), translations to 4 languages, SRT subtitle generation with timestamps, long-form transcription (6× sample length), batch processing (2 clips), and a real-time factor benchmark (speed not quantified). The pipeline supports 25 languages and integrates with standard audio libraries.
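
The subtitle export step comes down to formatting timestamps in the SRT layout. The sketch below assumes segments are already available as `(start_seconds, end_seconds, text)` tuples, which is a hypothetical shape rather than the NeMo output format; no NeMo calls are shown.

```python
def srt_timestamp(seconds):
    # SRT uses HH:MM:SS,mmm with a comma before the milliseconds.
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def to_srt(segments):
    # Each cue is: index, time range, text, separated by a blank line.
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)

segments = [(0.0, 1.25, "Hello"), (1.25, 2.0, "World")]
print(to_srt(segments))
```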

automatic speech recognition · multilingual translation · timestamp alignment · srt subtitles · nemo toolkit

Prime Intellect Releases prime-rl 0.6.0 to Train Trillion-Parameter MoE Models on Agentic RL Workloads

MarkTechPost · Asif Razzaq · 2026-06-23

Prime Intellect released prime-rl 0.6.0, a framework for asynchronous reinforcement learning targeting trillion-parameter Mixture-of-Experts (MoE) models on agentic workloads. The system employs 3-D parallelism (FSDP, EP, CP), FP8 training, and inference optimizations like Wide Expert Parallelism and KV cache offloading. In a case study, GLM-5 was trained on software-engineering tasks at 131k sequence length with sub-5-minute step times using 28 H200 nodes, achieving stable training via router replay and precision matching.

mixture-of-experts · asynchronous rl · fp8 inference · kv-cache · expert parallelism

GLM-5.2 OpenAI-Compatible API: A Hands-On Guide to Reasoning Effort, Function Calling, and Long-Context Retrieval

MarkTechPost · Sana Hassan · 2026-06-23

The article provides a technical guide for using GLM-5.2 via an OpenAI-compatible API, demonstrating its capabilities in reasoning-effort control, function calling, and long-context retrieval. The method involves setting up a reusable chat wrapper with support for streaming, tool calling, and token tracking, followed by practical tests including multi-step tool-using agents and structured JSON output. Results show GLM-5.2's ability to handle complex tasks like arithmetic calculations and population comparisons with adjustable reasoning effort levels (off/high/max).

glm-5.2 · openai-compatible api · reasoning-effort control · function calling · long-context retrieval

Omio scales travel product development using OpenAI models

AI News · Ryan Daws · 2026-06-23

Omio demonstrates enterprise-scale integration of OpenAI models, achieving an 80% reduction in technical effort for product development. The multimodal travel platform implemented OpenAI Codex across its software development lifecycle (research, coding, testing) and deployed GPT-based conversational interfaces for real-time travel booking. Results show 5× faster project completion (quarterly projects reduced to one month) and consumer demand validated through rapid prototyping. The architecture grounds model outputs in live transportation data to ensure accuracy, while maintaining human oversight for all critical decisions.

openai codex · conversational commerce · multimodal routing · in-context learning · generative interfaces

Top spy agencies say AI cyber threats will impact you within months. Here’s why

AI News · Dashveenjit Kaur · 2026-06-23

The Five Eyes intelligence alliance warns that upcoming AI models (e.g., GPT-5.5-Cyber, Mythos) will significantly lower technical barriers for cybercrime within months, enabling automated vulnerability scanning and hyper-personalized phishing at scale. Their joint advisory highlights how AI-driven attacks shorten patch deployment windows and target consumer data, with APAC regions seeing a 165% spike in ransomware. Defensive measures include AI-powered network monitoring and user-level protections like multi-factor authentication, though 94% of executives cite AI as their top threat amid cybersecurity talent shortages.

five eyes · gpt-5.5-cyber · vulnerability scanning · multi-factor authentication · ransomware


Generated automatically at 2026-06-23 21:32 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.