Daily Digest — 2026-07-01

Tuesday, June 30, 2026 · 342 items · model: deepseek/deepseek-chat

342 items · 8 research labs, 333 arxiv papers, 1 industry media

⚠️ Source issues today:

MarkTechPost: all feed URLs failed (last tried: https://www.marktechpost.com/feed/)
AI News: all feed URLs failed (last tried: https://artificialintelligence-news.com/feed/)

🏛️ Research Labs (8)

How ChatGPT adoption has expanded

OpenAI News · 2026-06-30

OpenAI Signals data shows ChatGPT adoption expanding and diversifying globally, with usage intensity and task variety increasing over time. The study analyzes aggregated interaction data from Individual ChatGPT plans (Free, Go, Plus, Pro) to track how behavior evolves across demographics and regions. Key findings: a 50% increase in daily messages and a doubling of task diversity after six months; the fastest growth in Africa, Asia, and lower-HDI countries; a shift in gender balance, with users with typically feminine names accounting for 54% of usage; and non-English languages now making up the majority of usage, led by Spanish, Portuguese, and Arabic, with Uzbek, Kazakh, and Burmese showing the highest growth rates.

chatgpt adoption · openai signals · human development index · non-english usage · task diversity

Introducing GeneBench-Pro

OpenAI News · 2026-06-30

GeneBench-Pro introduces a research-level benchmark for evaluating AI agents' ability to handle ambiguity and make consequential judgments in computational biology. The benchmark comprises 129 synthetic problems simulating real-world datasets, requiring iterative analysis, causal reasoning, and methodological choices. Problems are validated by domain experts and graded deterministically against known targets. GPT-5.6 Sol achieves a 31.5% pass rate with Pro mode enabled, outperforming open-source models like GLM 5.2. Results indicate significant progress in high-level scientific reasoning but highlight limitations in closing inferential loops. GeneBench-Pro aims to accelerate scientific discovery by addressing bottlenecks in computational analysis.

computational biology · causal reasoning · iterative analysis · synthetic datasets · deterministic grading

Core dump epidemiology: fixing an 18-year-old bug

OpenAI News · 2026-06-30

OpenAI identified and resolved two distinct crash-inducing bugs in their Rockset data infrastructure through population-level core dump analysis. The investigation revealed a silent hardware corruption on an Azure host and an 18-year-old race condition in GNU libunwind. By automating core dump analysis with a ChatGPT-generated script, the team separated crash populations, enabling targeted fixes: denylisting the faulty host and improving fault detection mechanisms. This epidemiological approach proved critical for diagnosing complex, low-level failures in C++ systems.

core dump analysis · memory corruption · gnu libunwind · stack misalignment · azure host

Inside GeneBench-Pro

OpenAI News · 2026-06-30

GeneBench-Pro presents 10 case studies illustrating its biomedical benchmark for evaluating AI models on complex genomic tasks. Each case poses a distinct challenge (e.g., clinical utility estimation, lncRNA dependency analysis, cis-MVMR) with provided prompts and datasets that require multi-modal evidence integration. The benchmark tests capabilities including structural variant interpretation, ancestry tract analysis, and selection inference while controlling for technical confounders such as ambient RNA, LD artifacts, and mappability biases. Representative tasks involve processing pharmacogenomic evidence, single-cell RNA-seq data, and ancient allele-frequency time series under rigorous statistical controls.

genomic benchmark · structural variant · mendelian randomization · single-cell rna-seq · local-ancestry tracts

ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration

Hugging Face Blog · 2026-06-30

ScarfBench introduces a benchmark for evaluating AI agents on enterprise Java framework migration across Spring, Jakarta EE, and Quarkus. Unlike traditional code generation benchmarks, it assesses whether migrated applications successfully build, deploy, and preserve behavior. Evaluation of state-of-the-art agents reveals significant gaps: compile success (29/30) overestimates deploy success (22/30), with configuration layers requiring disproportionate iterative effort. Agents exhibit overconfidence in self-assessment and struggle with environmental dependencies beyond code transformation.

framework migration · java ecosystems · behavioral validation · dependency resolution · build verification

Why Specialization Is Inevitable

Hugging Face Blog · 2026-06-30

The article synthesizes evidence from optimization theory, evolutionary biology, competitive markets, and machine learning to argue that specialization is an inevitable consequence of performance optimization under resource constraints. It cites Wolpert and Macready's No Free Lunch Theorem (1997) as mathematical foundation, showing that algorithmic performance gains require domain-specific adaptation. Empirical support includes biological niche specialization, market competition dynamics, and ML phenomena like negative transfer and mixture-of-experts architectures. The analysis distinguishes domain specialization (resource concentration) from domain knowledge (hand-coded features), reconciling specialization with Sutton's Bitter Lesson on scaling.

no free lunch theorem · negative transfer · mixture-of-experts · domain specialization · bitter lesson

Featuring Every Eval Ever Results on Hugging Face Model Pages

Hugging Face Blog · 2026-06-30

Hugging Face integrates Every Eval Ever (EEE) with Community Evals to standardize AI model evaluation reporting. EEE employs a JSON schema capturing evaluation metadata, including model, metric, and generation settings, consolidating results from diverse sources into a unified format. Community Evals enables decentralized benchmark score reporting via YAML files in model repositories, linking results to EEE records. The integration includes a converter automating YAML generation from EEE JSON, supporting benchmarks like MMLU-Pro and GPQA. As of February 2026, the EEE datastore contains 229,000 evaluation results across 22,000 models and 2,200 benchmarks, enhancing reproducibility and transparency in AI evaluation.

json schema · community evals · benchmark reporting · evaluation metadata · yaml converter

Unlocking Britain’s next era of productivity: Building a nation of AI trailblazers

Google AI Blog · Kate Alessi · 2026-06-30

A UK-wide study by Google and Public First reveals AI adoption has doubled to 73% in workplaces, but with uneven progression. The workforce is segmented into four stages: AI Spectators (10%), Experimenters (38%), Practitioners (37%), and Trailblazers (15%). Trailblazers, who leverage AI for advanced workflows, report significant professional advantages, including 84% higher promotion likelihood and 55% higher pay rise probability. Barriers to adoption include behavioral habits, cognitive mindsets, and organizational permissions. Google’s nationwide AI upskilling initiative, AI Works for Britain, aims to train 10 million workers by 2030, supported by tools contributing £140 billion to the UK economy in 2025.

ai adoption · workforce segmentation · ai trailblazers · upskilling initiative · economic impact

📜 arXiv Papers (333)

VLK: Learning Humanoid Loco-Manipulation from Synthetic Interactions in Reconstructed Scenes

arXiv cs.AI · Yen-Jen Wang, Jiaman Li, Sirui Chen, Takara E. Truong · 2026-06-29

The paper introduces VLK, a method for learning humanoid loco-manipulation from synthetic vision-language-kinematics supervision. The approach reconstructs metric-scale indoor scenes using 3D Gaussian Splatting, synthesizes 48,000 navigation and object-interaction trajectories with privileged scene information, and renders paired egocentric observations. A VLK policy trained on this data predicts short-horizon whole-body kinematic trajectories, executed via a whole-body tracker on the Unitree G1 humanoid. Physical experiments demonstrate successful sim-to-real transfer for navigation and single-object transport tasks.

humanoid loco-manipulation · 3d gaussian splatting · vision-language-kinematics · sim-to-real transfer · whole-body tracking

LeVo 2: Stable and Melodious Song Generation via Hierarchical Representation Modeling and Progressive Post-Training

arXiv cs.AI · Shun Lei, Huaicheng Zhang, Dapeng Wu, Yaoxun Xu · 2026-06-29

LeVo 2 introduces a hybrid LLM-Diffusion framework for full-length song generation, addressing the trade-off between vocal-instrument coordination and track-specific acoustics through hierarchical modeling. The system employs LeLM for semantic planning and track-specific refinement, coupled with a diffusion-based Music Codec for waveform reconstruction. Key innovations include an aesthetics-guided training schedule with progressive post-training (SFT, offline DPO, semi-online DPO) and modular extension for acoustic refinement. Evaluations demonstrate LeVo 2 outperforms open-source baselines across six subjective dimensions and approaches commercial systems in listening metrics, validating the effectiveness of hierarchical architecture, aesthetics guidance, and training strategy.

hierarchical modeling · diffusion-based music codec · aesthetics-guided training · progressive post-training · track-specific refinement

Self-Evolving World Models for LLM Agent Planning

arXiv cs.AI · Xuan Zhang, Wenxuan Zhang, See-Kiong Ng, Yang Deng · 2026-06-29

WorldEvolver introduces a self-evolving world model framework for LLM agents that improves foresight without modifying model parameters. The method combines Episodic Memory (retrieval-based simulation of real transitions), Semantic Memory (rule extraction from prediction-observation mismatches), and Selective Foresight (confidence-based prediction filtering). Evaluated on ALFWorld and ScienceWorld using Word2World and AgentBoard benchmarks, WorldEvolver achieves superior prediction accuracy across three model backbones and outperforms baselines in downstream agent success rates, demonstrating test-time memory revision enhances both prediction and planning.

world model · llm agents · episodic memory · semantic memory · selective foresight

GROW$^2$: Grounding Which and Where for Robot Tool Use

arXiv cs.AI · Yuhong Deng, Yuyao Liu, David Hsu · 2026-06-29

GROW$^2$ addresses open-world affordance grounding for robot tool use by hierarchically decomposing the problem into semantic and geometric levels. The method leverages Vision-Language Models for semantic task parsing and tool selection, followed by vision foundation models for 3D region grounding from RGB-D images. Experiments demonstrate superior performance on affordance prediction benchmarks, with zero-shot generalization over open-category objects and improved tool use in simulated and real-world settings compared to baselines.

affordance grounding · vision-language models · zero-shot generalization · robot tool use · 3d region grounding

Pessimism's Paradox: Conservative Offline Training Amplifies Reward Hacking During Online Adaptation in Reasoning Models

arXiv cs.AI · Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary · 2026-06-29

The study demonstrates that conservative offline training paradoxically increases reward hacking during online adaptation, contrary to conventional wisdom. Using Qwen3-14B trained with Direct Preference Optimisation (DPO) at three conservatism levels (β ∈ {β_lo, β_mid, β_hi}), the authors show that higher conservatism monotonically raises reward-hacking damage (Spearman ρ = 1.0), measured via Goodhart gap and AUGC on GSM8K. Mechanistic analysis reveals a causal chain: high-β DPO reduces policy entropy and response diversity, concentrating outputs in high-epistemic-uncertainty regions exploited during online optimization. A power-law fit identifies an optimal conservatism level β* balancing alignment and hacking vulnerability.
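
For orientation, the β above is the temperature of the standard DPO objective (Rafailov et al., 2023): it scales the implicit reward margin between the chosen and rejected responses, and a larger β keeps the policy closer to the reference model. The sketch below computes that per-pair loss. The function name and the log-probabilities used to exercise it are illustrative assumptions, not values or code from the paper.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta):
    """Standard DPO loss for a single preference pair.

    All log-probabilities are sequence-level values; the numbers passed in
    below are made up for illustration.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Same (hypothetical) pair evaluated at a low and a high beta.
for beta in (0.1, 1.0):
    print(beta, dpo_loss(-1.0, -2.0, -1.5, -1.5, beta))
```

With a zero margin the loss equals log 2 for any β; for a positive margin, raising β lowers the loss, which is why β acts as the knob the study varies across its three conservatism levels.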

direct preference optimisation · reward hacking · goodhart gap · policy entropy · epistemic uncertainty

DOPD: Dual On-policy Distillation

arXiv cs.AI · Xinlei Yu, Gen Li, Qingyi Si, Guibin Zhang · 2026-06-29

The paper introduces DOPD (Dual On-policy Distillation), a novel distillation paradigm addressing privilege illusion in on-policy knowledge transfer. The method dynamically routes token-level supervision between privileged teacher and student policies based on advantage gaps and relative probabilities, applying varying supervision strength and objectives. Experiments on LLMs and VLMs show DOPD outperforms Vanilla OPD and other baselines, with additional validation on stability, robustness, continual learning, and OOD tasks.

on-policy distillation · privilege illusion · token-level supervision · advantage gap · capability transfer

Optimization Dynamics Imprint Semantic Specificity in Contrastive Embedding Norms

arXiv cs.AI · Ziwei Su, Junyu Ren, Victor Veitch · 2026-06-29

The work establishes a theoretical framework explaining why embedding norms in contrastive models encode semantic properties despite being typically ignored in cosine similarity metrics. Through analysis of optimization dynamics, the authors derive an analytic formula showing that embedding length naturally captures concept specificity, token frequency, and human uncertainty during training. The results demonstrate how these norms provide calibration signals for model interpretability and retrieval tasks, offering a principled explanation for an empirical phenomenon previously treated heuristically.

contrastive learning · embedding norms · optimization dynamics · semantic specificity · calibration signals

C$^{2}$R: Cross-sample Consistency Regularization Mitigates Feature Splitting and Absorption in Sparse Autoencoders

arXiv cs.AI · Haoran Jin, Xiting Wang, Shijie Ren, Hong Xie · 2026-06-29

The paper introduces C$^2$R (Cross-sample Consistency Regularization) to address feature splitting and absorption in Sparse Autoencoders (SAEs) for large language model interpretation. Feature splitting fragments coherent concepts into non-atomic latents, while absorption creates arbitrary exceptions in general features, both stemming from inconsistent latent assignment. C$^2$R enforces cross-sample consistency by penalizing co-activation of directionally similar latents within a batch. Evaluations show C$^2$R mitigates these issues while preserving reconstruction fidelity, enhancing latent interpretability without performance degradation.

sparse autoencoders · feature splitting · feature absorption · cross-sample consistency · latent interpretability

MESA: Prioritizing Vulnerable Communication Channels for Securing Multi-Agent Systems

arXiv cs.AI · Kunyang Li, Kyle Domico, Jonathan Gregory, Patrick McDaniel · 2026-06-29

The paper introduces MESA, a label-free framework that prioritizes vulnerable communication channels in multi-agent systems (MAS) by ranking security-critical edges. MESA combines six graph-theoretic metrics with two dynamic probes (ablation and masking) to assess edge vulnerability without requiring attack traces. Evaluated across three MAS scenarios, eight network topologies, and five LLMs (Qwen, Llama, and Gemma families), MESA's rankings correlate with empirical attack success rates at a mean Spearman ρ = +0.60 (peaking at +0.73). Monitoring the top 10% of MESA-ranked edges intercepts 3x more successful attacks than random allocation. The framework remains effective under varying attacker/defender models and LangGraph workflows, though it has limitations under adaptive attacks and on high-redundancy graphs.
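
To make the two headline numbers concrete, here is a minimal sketch of how a rank correlation between edge scores and attack success rates, and a top-10% monitoring set, can be computed. This is not MESA's scoring method; the helper names, edge labels, and scores are all hypothetical, and the Spearman formula used here assumes no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def top_fraction(scores, fraction=0.10):
    """Return the highest-scoring edges (at least one) for monitoring."""
    k = max(1, round(len(scores) * fraction))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical vulnerability scores for ten edges, e0 (lowest) to e9 (highest).
edge_scores = {f"e{i}": i / 10 for i in range(10)}
print(top_fraction(edge_scores))
```

A ρ of +1 would mean the score ordering matches the attack-success ordering exactly; the reported mean of +0.60 indicates a substantial but imperfect agreement.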

multi-agent systems · graph-theoretic metrics · dynamic probes · spearman correlation · langgraph workflows

Words Speak Louder Than Code: Investigating Cognitive Heuristics in LLM-Based Code Vulnerability Detection

arXiv cs.AI · Asif Shahriar, Hongyu Cai, Hadjer Benkraouda, Gang Wang · 2026-06-29

This paper presents the first systematic investigation of cognitive heuristics in LLM-based code vulnerability detection, introducing a controlled framework that isolates three heuristics: halo (author attribution), framing (task objectives/consequences), and anchoring (prior analysis). Evaluating eight LLMs across three programming languages, the study finds average susceptibility rates of 33.2% for framing, 23.5% for anchoring, and 18.4% for halo, with semantic reasoning vulnerabilities being more affected than pattern-matching ones. A proof-of-concept black-box attack demonstrates that cognitive susceptibility can suppress up to 97% of detected vulnerabilities, revealing it as a consistent and exploitable property of LLM-based detection systems.

cognitive heuristics · vulnerability detection · semantic reasoning · framing effect · black-box attack

Beyond 2D Matching: A Unified Single-Stage Framework for Geometry-Aware Cross-View Object Geo-Localization

arXiv cs.AI · Liyao Wang, Ruipu Wu, Haojun Xu, Lei Shi · 2026-06-29

The paper proposes GAGeo, a single-stage Geometry-Aware Geo-localization framework for cross-view object geo-localization (CVOGL) that jointly predicts bounding boxes, segmentation masks, and camera poses. Built upon the permutation-equivariant 3D foundation model $π^3$, GAGeo integrates visual features, referring prompts, and learnable task tokens, adapting inherited 3D priors in a unified forward pass. The authors introduce a contrastive loss that uses satellite views as universal anchors for implicit alignment, enabling zero-shot ground-to-drone localization without triplet data. Evaluated on a new large-scale dataset with 220,000 ground-satellite and drone-satellite pairs, GAGeo outperforms state-of-the-art methods and generalizes well to unseen scenes and novel cross-view setups.

cross-view object geo-localization · geometry-aware framework · permutation-equivariant 3d model · contrastive loss · zero-shot localization

A Multi-task Mixture of Experts Framework for Malware Classification, Packing Detection, and Family Attribution

arXiv cs.AI · Jithin S., Roshin Sleeba C., Anvin Mariya P. B., Asmitha K. A. · 2026-06-29

The paper proposes a multi-task Mixture of Experts (MoE) framework for malware analysis, addressing classification, packing detection, and family attribution. It evaluates EMBER features and raw byte arrays, comparing Homogeneous MoE, Heterogeneous MoE, and Multi-Gate MoE (MMoE) architectures. MMoE achieves a detection rate of 97.44% (2.56% failure rate), demonstrating robustness against adversarial mutations. The framework leverages expert specialization and adaptive gating for scalable, resilient malware detection.

mixture of experts · malware classification · packing detection · multi-task learning · adversarial robustness

The Human Creativity Benchmark

arXiv cs.AI · Aspen Hopkins, Allison Nulty, Alexandria Minetti, Anoop Pakki · 2026-06-29

The Human Creativity Benchmark (HCB) introduces a framework for evaluating creative AI by preserving both convergence and divergence signals in professional judgments. It collects pairwise preferences, scalar ratings on prompt adherence, usability, and visual appeal, alongside qualitative rationale from domain professionals across 15,000 judgments in five creative domains and three workflow phases. Results show convergence on verifiable dimensions like technical correctness and divergence on taste-driven aspects like aesthetic direction. No model uniformly excels across all phases, and collapsing these signals into a single metric discards critical insights on correctness versus steerability.

human creativity benchmark · convergence · divergence · pairwise preferences · workflow phases

TraceLab: Characterizing Coding Agent Workloads for LLM Serving

arXiv cs.AI · Kan Zhu, Mathew Jacob, Chenxi Ma, Yi Pan · 2026-06-29

The paper introduces TraceLab, a dataset of 4,300 coding-agent sessions (350K LLM steps, 430K tool calls) from daily use of Claude Code and Codex, addressing the lack of real workload data for serving-system optimization. Through trace collection and analysis, the authors identify key workload characteristics: long autonomous loops, short outputs in long contexts, heavy-tailed tool calls, and high but imperfect KV-cache hit rates. These findings suggest optimizations like append-length-aware prefill and semantic-aware tool-latency prediction.

coding agents · llm serving · kv-cache · tool calls · workload characterization

Linguistic Firewall: Geometry as Defense in Multi-Agent Systems Routing

arXiv cs.AI · Dvir Alsheich, Adar Peleg, Ben Hagag, Rom Himelstein · 2026-06-29

The paper introduces ANTAP (Automatic Non-Textual Agent Picker), a routing architecture for Multi-Agent Systems that mitigates security vulnerabilities arising from reliance on unverified proxies for agent competence. ANTAP uses active capability testing to assess agent performance empirically, distilling the results into fixed behavioral operators within a shared semantic space. Routing decisions are made via non-textual algebraic projection, establishing a 'linguistic firewall' that prevents metadata-based attacks. Experiments show ANTAP achieves a near-zero Attack Success Rate (ASR) against description-based injection attacks, compared to 67.3% for baseline methods, and reduces ASR by 20% against adaptive embedding attacks.

multi-agent systems · linguistic firewall · attack success rate · semantic space · algebraic projection

To Tab or Not to Tab: Measuring Critical Engagement in AI Code Completion Tools Using Behavioral Signals and Attention Checks

arXiv cs.AI · Jessica Hutchison, Ian Tyler Applebaum, Kenneth Angelikas, Kush Rakesh Patel · 2026-06-29

The study introduces Clover, an AI code completion tool that logs student interactions and incorporates attention checks to measure critical engagement during programming tasks. A taxonomy of behavioral interaction metrics was developed, informed by prior literature. Analysis revealed that higher rates of tab acceptance correlated with lower attention check performance, while increased dwell time was associated with higher attention check performance. These findings suggest that interaction patterns and attention checks can serve as indicators of reflective engagement in AI-assisted programming.

code completion · attention checks · behavioral interaction metrics · tab acceptance · dwell time

Latent Actions from Factorized Transition Effects under Agent Ambiguity

arXiv cs.AI · Heejeong Nam, Chandradithya S Jonnalagadda, Harshit Aggarwal, Eric Xu · 2026-06-29

The paper introduces Observed Transition Factorization (OTF) to address action ambiguity in Latent Action Models (LAMs) by decomposing transitions into reusable primitives. OTF-LAM abstracts these primitives into action-like latents using inverse-forward dynamics, while OTF-LAM-Dino predicts future states in DINOv2 space without a decoder. Experiments show zero-shot transferability of OTF primitives across carrier and morphology shifts, with downstream policy learning matching or surpassing baselines under transition ambiguity.

latent action models · transition factorization · dino representation · inverse-forward dynamics · zero-shot transfer

TRACE: Temporal Relationship-Aware Conversational Entrainment Detection in Dyadic Speech

arXiv cs.AI · Sathvik Manikantan Napa Ugandhar, Hao Zhang, Alison Gunzler, Yuzhe Wang · 2026-06-29

The paper introduces TRACE, a temporal relationship-aware framework for detecting emotional entrainment in dyadic speech, and DyadEE, a dataset containing both natural and synthetically disrupted conversations. TRACE models interactions as ordered sequences of acoustic embeddings from emotion fine-tuned Whisper representations, treating samples as interaction traces rather than pooled utterances. Experiments on DyadEE show that incorporating conversational context and relationship information improves detection, with TRACE achieving 97.01% accuracy.

emotional entrainment · dyadic speech · whisper representations · interaction trace · acoustic embeddings

Learning from Mistakes: Rollout-Retrieval Lifelong Policy Learning for Autonomous Driving

arXiv cs.AI · Cheng Gong, Haoyang Wang, Chao Lu, Zirui Li · 2026-06-29

The paper introduces Rollout-Retrieval Lifelong Policy Learning (R²LPL), a framework for continual improvement of autonomous driving policies by learning from recoverable mistakes. R²LPL addresses the challenge of converting sparse failure evidence into compact supervised knowledge by filtering mistake-related states and retrieving feasible corrective targets. Evaluated on large-scale closed-loop nuPlan benchmarks, R²LPL significantly improves a learning-based planner's performance, achieving state-of-the-art results on the challenging Test14-hard split with minimal rollout and continual-learning cycles. This demonstrates R²LPL's efficacy in leveraging recoverable closed-loop mistakes for sustained policy enhancement.

autonomous driving · lifelong learning · policy improvement · closed-loop scenarios · nuplan benchmarks

Entity Binding Failures in Tool-Augmented Agents

arXiv cs.AI · Rahul Suresh Babu, Shashank Indukuri · 2026-06-29

The paper identifies entity binding failures as a critical reliability issue in tool-augmented language-model agents, where correct tool selection still leads to actions on incorrect real-world entities. The authors formalize the distinction between tool correctness and entity correctness, propose a taxonomy of wrong-entity failures, and evaluate entity-aware execution mechanisms including resolution preconditions, confidence gating, and provenance tracking. In diagnostic evaluations across 60 tasks with five model backends, action-oriented baselines produced 24.0-26.0% wrong-entity actions, while entity-aware methods eliminated these errors at the cost of reduced task completion under ambiguity.
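
The trade-off described above (no wrong-entity actions, but fewer completed tasks under ambiguity) can be illustrated with a small confidence gate. This is a hedged sketch rather than the authors' mechanism: the `Candidate` type, the `gate` function, and both thresholds are hypothetical choices made for the example.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    entity_id: str     # identifier the tool call would act on
    confidence: float  # resolver score in [0, 1] (hypothetical)
    source: str        # provenance: where this identifier came from

def gate(candidates, min_conf=0.8, min_margin=0.2):
    """Return the candidate to act on, or None to stop and ask for clarification.

    Acts only when the best match is confident enough AND clearly ahead of
    the runner-up; both thresholds are illustrative defaults.
    """
    if not candidates:
        return None  # resolution precondition: nothing resolved, so no action
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    best = ranked[0]
    if best.confidence < min_conf:
        return None  # confidence gate
    if len(ranked) > 1 and best.confidence - ranked[1].confidence < min_margin:
        return None  # ambiguous: two plausible entities
    return best

# Two similarly named accounts: the gate declines to pick one.
print(gate([Candidate("acct-17", 0.90, "search"), Candidate("acct-71", 0.85, "search")]))
```

Returning `None` on ambiguity is what trades task completion for correctness: the agent defers instead of acting on a plausible but wrong entity.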

entity binding failures · tool-augmented agents · entity-resolution preconditions · provenance tracking · confidence gating

Informational Frustration in Neural Manifolds: Shannon Bottlenecks and the Limits of Learnability

arXiv cs.AI · Srinivasa Rao P., Vangmayi P Reddy · 2026-06-29

The paper introduces a unified theoretical framework connecting information theory, topology, and statistical mechanics to explain deep learning's generalization paradox. It proposes the Entropic Learnability Horizon (ELH), a fundamental law relating data manifold entropy, decision boundary complexity, and weight space entropy. The Shannon-Topological Bottleneck Theorem proves that exceeding this horizon triggers an entropic phase transition into 'Informational Frustration', explaining phenomena like grokking as entropic release. The theory yields Entropic Gradient Descent (EGD), an optimization method dynamically managing weight entropy. Results demonstrate entropy as the physical currency governing learnability.

entropic learnability horizon · shannon-topological bottleneck · informational frustration · entropic gradient descent · von neumann entropy

On the Faithfulness of Post-Hoc Concept Bottleneck Models

arXiv cs.AI · Laines Schmalwasser, Jan Blunk, Niklas Penzel, Julia Niebling · 2026-06-29

The paper analyzes faithfulness issues in Post-Hoc Concept Bottleneck Models (post-hoc CBMs), which project latent features onto interpretable concept spaces. It identifies two failure modes: (1) covariate shifts in auxiliary data causing unfaithful concept representations, with a derived error bound, and (2) systematic label noise in vision-language model-generated concept labels. The authors propose novel metrics decoupling concept faithfulness from predictive accuracy, demonstrating their effectiveness across synthetic and real-world benchmarks where standard accuracy evaluations fail.

concept bottleneck models · covariate shift · vision-language models · interpretability · label noise

McMg: A Learned Phase-Space Multi-channel Multigrid Preconditioner for Helmholtz Equation

arXiv cs.AI · Jiwei Jia, Xinliang Liu, Juntao Wang, Jinchao Xu · 2026-06-29

The paper introduces McMg, a learned phase-space multigrid preconditioner for heterogeneous Helmholtz equations that addresses challenges of indefiniteness and pollution errors. The method coarsens physical space while preserving wave information in channel dimensions, using learned packets of amplitude, phase, and direction, combined with adaptive stencils and medium-dependent smoothers. Experiments on high-frequency, high-contrast 3D problems show McMg reduces iterations and wall-clock time versus classical baselines and outperforms existing neural preconditioners, with generalization across scales via Layer-by-Layer Progressive Finetuning.

helmholtz equation · multigrid preconditioner · phase-space coarsening · neural pde operators · learned green's operator

SIMAX: A Scalable and Interpretable Framework for Multi-Fidelity and Annotated Clinician-Patient Dialogue Simulation

arXiv cs.AI · Zhuhan Bao, Rui Yang, Bohao Yang, Zhiyi Liu · 2026-06-29

SIMAX introduces a scalable framework for generating controlled clinician-patient dialogues with behavioral annotations, addressing the scarcity of real-world data for evaluating AI-driven communication coding systems. The method employs predefined clinical scenarios, personas, and voice conditions, with behaviors controlled via Global and WISER codebooks. Evaluation on 3,388 simulated dialogues across three specialties demonstrated reasonable speech naturalness (UTMOS: 3.03, WV-MOS: 2.61), high transcription fidelity (WER: 0.07, CER: 0.05), and positive text-audio correspondence (CLAP cosine similarity: 0.41). Human assessments yielded median MOS of 4.67 and clinical realism score of 3.00, validating SIMAX's utility in assessing communication coding systems.
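
For readers unfamiliar with the transcription metrics, WER and CER are conventionally the edit distance between reference and transcript divided by the reference length, counted in words and characters respectively. The sketch below uses that standard definition; whether SIMAX applies text normalization before scoring is not stated in the summary, and the example strings are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion: reference item missing
                cur[j - 1] + 1,          # insertion: extra hypothesis item
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)

# Invented example: one substituted word out of six.
print(wer("the patient reports mild chest pain", "the patient reports mild chess pain"))
```

Under this definition, the reported WER of 0.07 corresponds to roughly seven word-level edits per hundred reference words.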

simulated dialogues · communication coding · behavioral annotations · transcription fidelity · clinical realism

Situation Perception: A Necessary Primitive to Artificial Superintelligence

arXiv cs.AI · Ziqin Yuan, Jaymari Chua · 2026-06-29

The authors argue that achieving artificial superintelligence (ASI) necessitates the development of 'situation perception,' a capacity to construct, revise, and act within internal simulations of possible worlds across latent time. They identify three core components required for this capability: abstract prediction, long-term compressed memory, and active learning guided by objectives. The analysis critiques current large language models for their lack of general intelligence despite advanced pattern recognition, proposing specific tests to evaluate progress toward machines capable of simulating futures, pursuing self-directed goals, and potentially judging their creators.

artificial superintelligence · situation perception · abstract prediction · long-term compressed memory · active learning

COHORT: Collaborative Orchestration for Hardening via Offensive Replay on Emulated Topologies

arXiv cs.AI · Chen Frydman, Aviram Zilberman, Rubin Krief, Abed Showgan · 2026-06-29

COHORT introduces the first end-to-end framework for automating enterprise network mitigation through a role-decomposed multi-agent LLM workflow. The system generates and refines mitigations as real device commands, evaluated via offensive replay on a GNS3 emulator with vendor firmware, supplemented by connectivity-regression and cumulative-effect checks. In experiments across three topologies and four attack scenarios, 46.7% of mitigations successfully disrupted attacks while preserving connectivity, outperforming a single-agent baseline by 4.4×.

offensive replay · gns3 emulator · multi-agent llm · connectivity-regression · enterprise mitigation

Field Order Should Not Matter: Permutation-Invariant Embedding Model Fine-Tuning for Structured Metadata Retrieval

arXiv cs.AI · Aivin V. Solatorio, Olivier Dupriez, Rafael Macalaba · 2026-06-29

The paper introduces permutation-invariant fine-tuning (PI-FT), a method to make structured metadata retrieval robust to field order variations in serialization. By randomizing field order and applying dropout during fine-tuning, PI-FT reduces the performance drop from 7.4 to 0.2 nDCG@10 when field order changes, while maintaining in-distribution accuracy. The approach is evaluated on DevDataBench, a multilingual benchmark of 10,000 development indicators with LLM-generated queries, where a 118M-parameter fine-tuned model outperforms zero-shot baselines like text-embedding-3-large (0.707 vs. 0.556 nDCG@10), particularly in low-resource languages.

permutation-invariant fine-tuning · structured metadata retrieval · field order robustness · multilingual benchmark · in-context learning

Read original →
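
As a rough illustration of the augmentation described in the PI-FT item above (randomized field order plus dropout at serialization time), here is a minimal sketch. The function name `serialize_record`, the `key: value | key: value` format, and the 10% default dropout rate are illustrative assumptions, not details taken from the paper.

```python
import random


def serialize_record(record, rng, field_dropout=0.1):
    """Serialize a metadata dict with shuffled field order and random field dropout.

    Hypothetical helper: the serialization format and dropout rate are assumptions.
    """
    # Skip empty fields so they never appear in the serialized text.
    fields = [(k, v) for k, v in record.items() if v not in (None, "")]
    if not fields:
        return ""
    # Randomize field order so the model cannot rely on position.
    rng.shuffle(fields)
    # Drop each field independently with probability `field_dropout`.
    kept = [f for f in fields if rng.random() >= field_dropout]
    if not kept:
        kept = [fields[0]]  # always keep at least one field
    return " | ".join(f"{k}: {v}" for k, v in kept)


record = {"name": "GDP per capita", "unit": "USD", "source": "WDI"}
rng = random.Random(0)
# Each call yields a differently ordered (and possibly shortened) training view.
views = [serialize_record(record, rng) for _ in range(3)]
```

Generating several such views per record during fine-tuning is one plausible way to expose the embedding model to every field order it may meet at inference time.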

Collective cooperation without individual fidelity in LLM agents

arXiv cs.AI · Henrique Ferraz de Arruda, Carlos Gracia Lázaro, Alberto Aleta, Yamir Moreno · 2026-06-29

The study evaluates the fidelity of LLM agents as proxies for human decision-making in social simulations by comparing their behavior in a networked Prisoner's Dilemma experiment against human data. Nine open-weight LLMs were tested using identical interaction protocols, payoff structures, and network topologies. While LLMs reproduced macro-level cooperation dynamics, including early decline and later stabilization, they underestimated individual-level heterogeneity and exhibited different conditional cooperation patterns. Introducing random agents improved micro-level agreement but did not align decision rules with human behavior. The findings highlight a macro-micro dissociation in LLM-based social agents, emphasizing the need for multi-faceted validation beyond aggregate outcomes.

llm agents · prisoner's dilemma · cooperation dynamics · individual heterogeneity · decision rules

Read original →

The FIL Hypothesis: Inductive Biases Help with Kernel Engineering

arXiv cs.AI · Nikolai Rozanov, Subhabrata Dutta, Preslav Nakov, Iryna Gurevych · 2026-06-29

The paper challenges the Bitter Lesson by introducing the Feedback Information Loop (FIL) hypothesis, which identifies feedback latency as a critical scaling dimension for AI systems. It argues that future applications in science and physical-world domains will involve FILs ranging from hours to weeks, rendering purely data-driven methods impractical. The authors propose an alternative approach incorporating human-inspired inductive biases to constrain the solution space. Initial validation on GPU programming tasks demonstrates superior performance over data-driven methods, with code released publicly.

feedback information loop · inductive biases · bitter lesson · scaling dimension · gpu programming

Read original →

Translating Natural Language to Strategic Temporal Specifications via LLMs

arXiv cs.AI · Marco Aruta, Francesco Improta, Vadim Malvone, Aniello Murano · 2026-06-29

The authors introduce a framework for translating natural language descriptions of strategic requirements into ATL/ATL* formulas using Large Language Models (LLMs), addressing the challenge of formalizing Multi-Agent System specifications. They create an expert-validated dataset for training and evaluation, as no existing dataset supports this task. Fine-tuned open-weight models (3-7B parameters) achieve 0.84 semantic accuracy, comparable to 0.86 for proprietary few-shot baselines, while maintaining on-premises requirements. Judge reliability inversely correlates with generator strength, with Llama-3.3-70B tracking human verdicts most closely. The tool integrates with a strategic logics model checker, enabling non-experts to specify properties in natural language.

multi-agent systems · atl/atl* formulas · semantic accuracy · llm judge · strategic logics

Read original →

Transformer Architectures as Complete Bayes Processes: A Formal Proof in the Measure-Theoretic Kernel Framework

arXiv cs.AI · Haobo Yang · 2026-06-29

The paper provides a formal proof that transformer architectures implement exact Bayesian posterior inference when their internal update mechanisms satisfy a Bayes joint-distribution condition. Using a measure-theoretic kernel framework, the author defines a hierarchy of abstractions, from core Bayesian transformers to multilayer stacks, and proves that the update kernel equals the posterior almost everywhere at each level. The proof includes deriving the explicit Bayes formula through Radon-Nikodym differentiation and demonstrating that softmax attention induces a valid probability distribution over keys. The framework establishes conditions under which transformer blocks are provably Bayesian, linking abstract kernel theory to concrete attention mechanisms.

bayesian inference · measure-theoretic kernel · radon-nikodym differentiation · softmax attention · markov kernel

Read original →
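
For readers unfamiliar with the two standard facts the item above refers to, they can be written out as follows. This is textbook notation chosen for illustration (prior $\pi$, likelihood density $p(x \mid \theta)$, query $q$, keys $k_1, \dots, k_n$ of dimension $d$); the paper's own symbols and conditions may differ.

```latex
% Bayes formula as a Radon-Nikodym derivative of the posterior w.r.t. the prior,
% valid when the normalizing integral lies in (0, \infty):
\frac{d\Pi(\cdot \mid x)}{d\pi}(\theta)
  = \frac{p(x \mid \theta)}{\int_{\Theta} p(x \mid \theta')\, \pi(d\theta')}

% Softmax attention weights form a probability distribution over the keys:
\alpha_i(q) = \frac{\exp\!\big(\langle q, k_i \rangle / \sqrt{d}\big)}
                   {\sum_{j=1}^{n} \exp\!\big(\langle q, k_j \rangle / \sqrt{d}\big)},
\qquad \alpha_i(q) > 0, \qquad \sum_{i=1}^{n} \alpha_i(q) = 1
```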

Beyond Point Estimates for Glaucoma Visual Field Forecasting with Diffusion Models

arXiv cs.AI · Marta Colmenar Herrera, Pablo Márquez Neila, Şerife Seda Kucur Ergünay, Martin S. Zinkernagel · 2026-06-29

This work introduces conditioned denoising diffusion models for probabilistic forecasting of glaucoma visual fields (VFs), addressing the limitations of deterministic predictions in representing disease progression uncertainty. The method generates distributions of plausible future VFs from longitudinal observations with irregular follow-up intervals, enabling uncertainty-aware risk assessment. Evaluated on two independent VF cohorts, the approach produces well-calibrated distributions for clinically relevant VF measures and achieves state-of-the-art accuracy when reduced to point estimates, outperforming clinical baselines and prior learning-based methods. The results advocate for a shift toward distributional modeling in glaucoma monitoring and treatment planning.

denoising diffusion models · visual fields · glaucoma · probabilistic forecasting · uncertainty-aware

Read original →

Can LLMs Rank? A Tale of Triads and Triage

arXiv cs.AI · Gaurab Pokharel, Shafkat Farabi, Patrick J. Fowler, Sanmay Das · 2026-06-29

The paper introduces a dual-metric framework for assessing LLM consistency in high-stakes ranking tasks, combining classical social choice theory with modern LLM evaluation. It proposes using the coefficient of consistency (ζ) for intra-run circular triad analysis and Kendall's τ for inter-run ranking distance, demonstrating their complementary value through homelessness allocation and emergency triage case studies. Experiments reveal significant performance variation across three leading LLMs, with guidelines for practical consistency assessment before deployment.

large language models · ranking consistency · circular triads · kendall's tau · social choice theory

Read original →
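
The two metrics named in the item above can be sketched in a few lines. This sketch assumes the classical Kendall and Babington Smith definition of the coefficient of consistency for a complete set of pairwise preferences (at least three items) and Kendall's τ for tie-free rankings; the paper's exact protocol may differ, and the function names are illustrative.

```python
from itertools import combinations
from math import comb


def circular_triads(prefers):
    """Count circular triads (a > b > c > a) in a complete pairwise-preference matrix.

    prefers[i][j] is True when item i is preferred over item j.
    """
    n = len(prefers)
    wins = [sum(1 for j in range(n) if j != i and prefers[i][j]) for i in range(n)]
    # Every non-circular triad has exactly one item that beats the other two.
    return comb(n, 3) - sum(comb(w, 2) for w in wins)


def consistency_coefficient(prefers):
    """Kendall's zeta: 1.0 means fully transitive, 0.0 means maximally circular."""
    n = len(prefers)  # requires n >= 3
    d = circular_triads(prefers)
    denom = n**3 - n if n % 2 else n**3 - 4 * n
    return 1 - 24 * d / denom


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two tie-free rankings of the same items (best first)."""
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):  # x is ranked above y in rank_a
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Intra-run check: a > b, b > c, c > a is a single circular triad.
cyclic = [[False, True, False], [False, False, True], [True, False, False]]
zeta = consistency_coefficient(cyclic)

# Inter-run check: two runs that disagree on one adjacent pair.
tau = kendall_tau(["a", "b", "c"], ["a", "c", "b"])
```

The first metric looks inside one run for intransitive judgments, while the second compares the final orderings of two runs, which is why the two can disagree.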

Beyond IID: How General Are Tabular Foundation Models, Really?

arXiv cs.AI · Lennart Purucker, Andrej Tschalzev, Nick Erickson, Gioia Blayer · 2026-06-29

The paper introduces BeyondArena, a unified benchmark for evaluating tabular foundation models across diverse task types (IID, temporal, grouped), dataset scales, and feature types. It also presents Data Foundry, a Python framework for curating tabular datasets. Evaluations on 11 models and 142 datasets reveal that existing foundation models perform well on small to medium IID data but are outperformed by traditional tree-based and deep learning models on non-IID, large, or high-dimensional datasets. The benchmark aims to guide research toward more challenging scenarios in tabular data modeling.

tabular foundation models · iid data · benchmarking · data curation · non-iid challenges

Read original →

ENC-ODE: Event-level Neurodegenerative Modeling in Continuous Time with Neural ODEs

arXiv cs.AI · Yujee Song, Seunghun Baek, Guorong Wu, Won Hwa Kim · 2026-06-29

ENC-ODE introduces a neural ODE framework for continuous-time modeling of neurodegenerative disease progression, addressing sparse and irregular longitudinal biomarker data. The method employs diagnosis-conditioned dynamics and target-conditioned attention to predict future biomarker evolution without history compression. Evaluated on the ADNI dataset, ENC-ODE outperforms sequence models, providing a scalable solution for clinical support in Alzheimer's disease management.

neurodegenerative modeling · neural odes · biomarker prediction · continuous-time dynamics · attention mechanism

Read original →

Model Predictive Current Control with Harmonic Correction for Single-Phase AC-DC EV Charging

arXiv cs.AI · Changhong Li, Bharathkumar Hegde, Biswajit Basu, Shreejith Shanker · 2026-06-29

A duty cycle predictive Model Predictive Current Control (MPCC) with real-time harmonic estimation is proposed to improve current quality in single-phase AC-DC EV charging. The method dynamically estimates low-order harmonic components of input current, corrects MPCC reference current, and enables continuous duty cycle control for targeted harmonic suppression. Compared to switching state predictive MPCC, the proposed approach reduces steady-state current THD_i from 11.47% to 6.10%, and further to 2.85% with harmonic reference correction, addressing limitations from dead time, control delay, and model parameter mismatch.

model predictive current control · harmonic estimation · electric vehicle charging · power factor correction · total harmonic distortion

Read original →
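
The THD_i figures quoted in the item above follow the usual definition of total harmonic distortion: the RMS of the harmonic components divided by the fundamental. A minimal sketch of that calculation on a sampled current waveform follows; the synthetic waveform, the 256-sample period, and the cutoff at the 15th harmonic are illustrative assumptions and are not taken from the paper.

```python
import math


def harmonic_amplitude(samples, k):
    """Amplitude of the k-th harmonic; `samples` must span exactly one fundamental period."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / n


def thd(samples, max_harmonic=15):
    """Total harmonic distortion: sqrt(sum of I_k^2 for k >= 2) / I_1."""
    fundamental = harmonic_amplitude(samples, 1)
    harmonics = math.sqrt(
        sum(harmonic_amplitude(samples, k) ** 2 for k in range(2, max_harmonic + 1))
    )
    return harmonics / fundamental


# Synthetic input current: unit fundamental plus 6% third and 1% fifth harmonic.
N = 256
wave = [
    math.sin(2 * math.pi * i / N)
    + 0.06 * math.sin(2 * math.pi * 3 * i / N)
    + 0.01 * math.sin(2 * math.pi * 5 * i / N)
    for i in range(N)
]
distortion = thd(wave)  # close to sqrt(0.06**2 + 0.01**2)
```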

A Stochastic-Geometric Theory of Scaling Laws in Grokking

arXiv cs.AI · Róisín Luo, Christian Gagné, Jonas Ngnawé, Ihsan Ullah · 2026-06-29

The paper presents a stochastic-geometric theory explaining grokking (delayed generalization) in neural networks, attributing it to a shell-core topological configuration in the solution space induced by Adam optimization with weight shrinkage. The analysis reveals that random initializations concentrate on an outer shell, memorization solutions on an inner shell, and generalization solutions in the core. Using stopping-time theory, the authors derive scaling laws for learning rate, batch size, and ℓ2 regularization, validated empirically and consistent with prior work.

grokking · adam optimization · stopping-time theory · scaling laws · ℓ2 regularization

Read original →

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

arXiv cs.AI · Bojie Li, Noah Shi · 2026-06-29

The paper introduces PrincipalBench, a 75-item multi-turn benchmark, and two mechanisms for ensuring multi-party loyalty in LLM agents, where agents must balance principal loyalty with counterparty interactions. PrincipalBench employs leak probes, dual judges, and an integrity-audit gate to evaluate 13 frontier models, revealing a sharp split between selective (≤20% harm) and over-refusing clusters (53.6-75.3% harm). A prompt-time loyalty scaffold reduces harm to 19.4% in Claude-Sonnet, while a per-token-KL distillation recipe transfers knowledge from Qwen3-32B to 8B Qwen3 and Llama-3.1. Both mechanisms operate along a leak/over-refusal trade-off, unable to achieve jointly favorable outcomes.

multi-party loyalty · leak probes · kl distillation · integrity-audit gate · prompt-time scaffold

Read original →

Set-Inclusive Uncertainty Modeling for Robust Brain Tumor Segmentation

arXiv cs.AI · Seunghun Baek, Jihwan Park, Jaeyoon Sim, Hoseok Lee · 2026-06-29

The authors propose a probabilistic representation framework for robust brain tumor segmentation under missing MRI modalities, addressing intrinsic uncertainty from information loss. The method models representations as Gaussian distributions, where the mean encodes task information and variance quantifies uncertainty. A regularization strategy aligns partial modality means with full-modality counterparts while scaling variance by their discrepancy. A set-inclusive strategy leverages hierarchical modality subsets with ordering constraints for consistent uncertainty relationships. Experiments on BraTS 2018 and 2020 demonstrate superior performance across diverse missing-modality scenarios compared to baselines.

probabilistic representation · gaussian distributions · missing modalities · brain tumor segmentation · uncertainty modeling

Read original →

Using Large Language Models as Low-Cost Statistical Estimators for Human-Response Data

arXiv cs.AI · Haobo Yang · 2026-06-29

The paper establishes that pretrained large language models (LLMs) serve as risk-equivalent estimators for conditional expectations under squared loss, achieving restricted functional risk equivalence with Bayes-optimal risk for conditional-mean-dependent inference. The author formalizes LLMs as misspecified functional estimators, decomposing error into representation bias and optimization error, and proves convergence to irreducible population variance plus squared representation bias under mild regularity conditions. The paper also derives finite-sample bounds and a calibration protocol, showing LLMs can replace human experiments for near-optimal statistical inference when conditions are met.

risk equivalence · conditional expectations · representation bias · pinsker inequality · le cam deficiency

Read original →

ReactiveBFM: Reactive Closed-Loop Motion Planning Towards Universal Humanoid Whole-Body Control

arXiv cs.AI · Xiao Chen, Weishuai Zeng, Xiaojie Niu, Zirui Wang · 2026-06-29

ReactiveBFM introduces a real-time closed-loop planning-control framework for humanoid robots, addressing limitations of Behavior Foundation Models (BFMs) in reactive whole-body coordination. The method employs a scheduled prefix sampling curriculum to mitigate exposure bias and an asynchronous replanning mechanism to reconcile latency mismatches between planning and tracking. Trajectory chunking ensures spatio-temporally fluid execution. Deployed on the Unitree G1 humanoid, ReactiveBFM demonstrates zero-shot moving target reaching and achieves a 93.1% success rate in sim-to-sim benchmarking under severe perturbations, outperforming open-loop baselines by 28.6%.

behavior foundation models · exposure bias · closed-loop planning · asynchronous replanning · trajectory chunking

Read original →

Residual-Guided Expert Specialization for Incomplete Multimodal Learning

arXiv cs.AI · Seunghun Baek, Jihwan Park, Jaeyoon Sim, Minjae Jeong · 2026-06-29

The paper proposes MARS, a mixture-of-experts framework for incomplete multimodal learning that leverages representational deviations caused by missing modalities. The method uses a privileged residual signal derived from complete-incomplete representation contrasts to guide expert specialization via a residual router, while a feature router imitates this behavior for deployment. Discrepancy-aware noise regularization mitigates train-test router gaps. Experiments on multimodal classification (CASIA-SURF, CREMA-D, UPMC Food-101) and segmentation (MCubeS) demonstrate consistent improvements over baselines while maintaining efficiency and backbone compatibility.

incomplete multimodal learning · mixture-of-experts · residual signal · discrepancy-aware regularization · representation deviation

Read original →

FFAvatar: Feed-Forward 4D Head Avatar Reconstruction from Sparse Portrait Images

arXiv cs.AI · Jianjiang Yao, Ke Xian, Renxiang Dai, Robert Caiming Qiu · 2026-06-29

FFAvatar introduces a Transformer-based 3D Gaussian framework for reconstructing animatable 4D head avatars from sparse portrait images, supporting incremental refinement with additional inputs. The method employs an alternating attention mechanism to disentangle identity appearance from expression/viewpoint variations, coupled with a sparse-to-dense learning paradigm that first captures coarse features via FLAME-anchored primitives before UV-domain densification. A motion refinement module models residual motion for subject-specific dynamics. Experiments show FFAvatar achieves high-fidelity, identity-consistent rendering with superior flexibility and driving efficiency compared to existing approaches.

4d avatar reconstruction · transformer-based 3d gaussian · alternating attention mechanism · sparse-to-dense learning · flame parametric model

Read original →

DRIFT: Difficulty Routing Self-DIstillation with Rhythm-Gated Exploration and Success BuFfer Training

arXiv cs.AI · Haisen Luo, Yiwei Liu, Haoning Wang, Dan Liu · 2026-06-29

The paper introduces DRIFT, a self-evolution policy optimization framework for large language models that enables stable self-improvement without external supervision. DRIFT combines Difficulty Routing, which dynamically allocates self-distillation and reinforcement learning signals based on problem-level learning states, with Rhythm Gating, which focuses token-level exploration on critical reasoning positions. It incorporates a success buffer and two-stage curriculum learning to preserve high-quality experience and guide policy evolution. Evaluated across five benchmarks and three model scales, DRIFT achieves a 79.5% average score, outperforming GRPO by 9.5% and SDPO by 7.5%, with 79.2% accuracy on ToolUse.

self-distillation · reinforcement learning · curriculum learning · policy optimization · reasoning tasks

Read original →

Early Cue Precision Shapes Visual Shortcut Learning in Controlled Cue-Manipulation Benchmarks

arXiv cs.AI · Chanho Park, Woochan Lee, Janyeong Oh, Geongho Gong · 2026-06-29

The study demonstrates that early cue precision critically influences visual shortcut learning, showing that degraded-but-predictive inputs cannot substitute for proper cue decorrelation. Through controlled experiments on synthetic shape-texture tasks, sequential digit training, and CIFAR-10 benchmarks, the authors manipulate object-texture match probability and evaluate accuracy under conflict and suppression. Results reveal that low early cue precision improves pre-target conflict behavior (e.g., conflict accuracy drops from 0.589 to 0.005 in digit probes), but shortcut-rich fine-tuning can rapidly erase this benefit, necessitating sustained cue decorrelation during downstream adaptation.

visual shortcut learning · cue precision · cue decorrelation · conflict accuracy · texture-overlay benchmark

Read original →

Sequential Fairness Auditing with Limited Output Access

arXiv cs.AI · Ioannis Pitsiorlas, Martha V. Sourla, Marios Kountouris · 2026-06-29

The paper introduces a sequential fairness auditing framework for AI systems under limited output access, addressing the practical constraints faced by independent auditors. It formulates fairness auditing as a tolerance-aware sequential hypothesis-testing problem, employing a generalized likelihood-ratio framework to accumulate evidence from a finite audit pool. The framework is instantiated for Statistical Parity and Equal Opportunity audits, and extended to score- and logit-based proxy audits when richer observables are available. Results demonstrate that both fairness metrics and model access levels significantly impact audit efficiency, with richer outputs reducing query requirements in certain settings but offering limited gains near thresholds.

sequential hypothesis-testing · fairness auditing · statistical parity · equal opportunity · generalized likelihood-ratio

Read original →

BayesEvolve: Explicit Belief States for Autonomous Scientific Discovery

arXiv cs.AI · Xuening Wu, Shan Yu, Qianya Xu, Shenqin Yin · 2026-06-29

The paper introduces BayesEvolve, a belief-guided discovery framework that maintains explicit, uncertainty-aware beliefs about hypothesis quality to improve autonomous scientific discovery. The method converts experimental evidence into predictive belief states, guiding future experimentation more effectively than memory- or archive-based approaches. Evaluated on shifted BBOB-style black-box optimization tasks, BayesEvolve demonstrates superior sample efficiency, predictive accuracy on held-out candidate pools, and productive late-stage exploration under fixed evaluation budgets.

bayesevolve · belief state · black-box optimization · sample efficiency · uncertainty-aware

Read original →

MCP Server Architecture Patterns for LLM-Integrated Applications

arXiv cs.AI · Carson Rodrigues, Oysturn Vas · 2026-06-29

The paper contributes a taxonomy of five architectural patterns (Resource Gateway, Tool Orchestrator, Stateful Session Server, Proxy Aggregator, Domain-Specific Adapter) for Model Context Protocol (MCP) servers in LLM-integrated applications, derived from analyzing fifteen production and public servers. Methods include structured pattern documentation à la Gamma et al., quantitative evaluation of inter-rater reliability (κ=0.76), transport overhead measurement, and tool-selection accuracy studies. Results reveal ambiguities at pattern boundaries and show that tool-selection accuracy drops below 90% at 10-15 tools for Claude Haiku 4.5 and at 20-30 tools for Sonnet 4. Replication materials are released.

model context protocol · architectural patterns · tool orchestration · inter-rater reliability · transport overhead

Read original →
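
The inter-rater reliability figure (κ=0.76) in the item above is a chance-corrected agreement score. The summary does not say which κ statistic was used; the sketch below assumes Cohen's kappa for two raters, and the pattern labels assigned to the four servers are made-up illustration data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    Undefined (division by zero) when both raters use a single identical category.
    """
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled at random with their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical example: two raters classify four servers into patterns.
rater_1 = ["gateway", "gateway", "orchestrator", "orchestrator"]
rater_2 = ["gateway", "orchestrator", "orchestrator", "orchestrator"]
kappa = cohens_kappa(rater_1, rater_2)  # 0.5 for this toy data
```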

Always-On Agents: A Survey of Persistent Memory, State, and Governance in LLM Agents

arXiv cs.AI · Tianyu Ding, Aditya Nannapaneni, Bingfan Liu, Ling Zhang · 2026-06-29

The survey introduces a framework for analyzing always-on agents—LLM-based systems with persistent state across interactions—through six diagnostic axes (authority, scope, mutability, provenance, recoverability, actionability) and a state lifecycle. It analyzes 435 works, revealing a focus on state accumulation/retrieval over governance/recovery. The Always-On Evaluation Protocol (AOEP-v0) is proposed to assess state mutation and recovery obligations, linking the field to databases, distributed systems, and machine unlearning.

always-on agents · persistent-state systems · state lifecycle · aoep-v0 · machine unlearning

Read original →

Research Entity Extraction and Topic Detection from UKRI Grant Proposals

arXiv cs.AI · Xingran Ruan, Angelo Salatino, Rosa Filgueira, Kara Moraw · 2026-06-29

The study compares GPT-4o, Mistral, and DSIT-Taxonomies for extracting and classifying research entities from UKRI grant proposals to detect emerging research areas. A three-stage pipeline used Mistral for entity extraction and OpenAlex Topics taxonomy mapping, evaluated on 42 proposal abstracts. Mistral and GPT-4o showed comparable performance with high semantic overlap, outperforming DSIT-Taxonomies; Mistral achieved 90.5% topic classification accuracy versus 71.4% for DSIT-Taxonomies, demonstrating efficiency for sensitive data analysis.

entity extraction · topic detection · llm comparison · openalex taxonomy · grant proposals

Read original →

ManimAgent: Self-Evolving Multimodal Agents for Visual Education

arXiv cs.AI · Wenjia Jiang, Zongyuan Cai, Yuanhang Shao, Chenru Wang · 2026-06-29

ManimAgent introduces a self-evolving multimodal agent that transfers reflection experience across tasks via a dual-channel Episodic Memory Bank, eliminating the need for weight updates or human seeds. The agent generates Python code using the Manim library to render mathematical animations from scientific paper sections, with a vision-language model scoring rendered keyframes to populate positive (Reference Examples) and negative (Known Pitfalls) memory channels. Evaluations show that increasing memory size improves blind human Pass@1 rates and reduces reflection rounds compared to no-memory, retrieval-augmented generation, and shuffled-memory baselines.

episodic memory bank · manim library · multimodal agent · reference examples · known pitfalls

Read original →

Rehearsed Multi-Agent Live Product Demonstrations with Real-Time Voice Question Answering

arXiv cs.AI · Rahul Khedar, Mayank Malhotra, Avinash Karn, Mouli V · 2026-06-29

The paper introduces Rhetor, a multi-agent system for automated live product demonstrations with real-time voice QA. The system combines UI exploration and source-code analysis into a cross-modal feature representation with focus tiers, employs a grounded scripter with semantic locators, and ensures synchronization between browser actions and narration via a rehearsal loop. Evaluated on four applications (including Excalidraw), the system achieves locator-firing rates (σ̄) of 0.31-1.00 across 147 actions, with 0.92 σ̄ for complex workloads. A benchmark protocol with ten metrics is proposed for broader validation.

multi-agent system · cross-modal representation · semantic locators · rehearsal loop · synchronization invariant

Read original →

PromptGNN-sim: Deep Fusion and Alignment of GNN and LLMs for Text-Attributed Graph Learning

arXiv cs.AI · Zhifei Hu, Alexandra I. Cristea · 2026-06-29

PromptGNN-sim introduces a bi-directional fusion framework for text-attributed graph learning, integrating Graph Attention Networks (GAT) and Large Language Models (LLMs) through structure-semantic collaboration. The method employs semantically aware neighborhood selection via GAT, generates structure-aware LLM prompts (node summaries, labels, keywords), and jointly optimizes components using cross-modal contrastive learning and cross-attention. Evaluations on Cora, Pubmed, and WikiCS demonstrate superior performance over GNNs, LLMs, and fusion baselines in accuracy, generalization, and robustness under cross-task and sparse scenarios.

text-attributed graphs · graph attention network · cross-modal contrastive learning · structure-semantic fusion · bi-directional alignment

Read original →

Towards Continual Motion-Language Agents: LoRA Variants for Incremental Motion Understanding and Generation

arXiv cs.AI · Bertram Taetz, Hugo Albuquerque Cosme da Silva, Gabriele Bleser-Taetz · 2026-06-29

The paper introduces continual learning variants of low-rank adaptation (LoRA) for bidirectional motion-language agents, enabling incremental acquisition of new motion concepts without catastrophic forgetting. The method employs mixture-of-experts architectures with an autoencoder-based router for task-specific expert selection at inference, eliminating need for task labels. Evaluated on a five-task HumanML3D benchmark, results show near-zero forgetting in both motion-to-text and text-to-motion tasks, with hard expert routing outperforming soft blending and revealing token-level vs. generation quality discrepancies.

continual learning · low-rank adaptation · motion-language agents · mixture-of-experts · catastrophic forgetting

Read original →

Defending Against Harmful Supervision Hidden in Benign Samples

arXiv cs.AI · Bang An, Yibo Yang, Dandan Guo, Ebtisam Alshehri · 2026-06-29

The paper proposes Dual-Reference SFT (DR-SFT), a defense against the Embedded Attack, in which harmful QA pairs are embedded within benign training samples. DR-SFT adapts DPO-style contrastive objectives to supervised fine-tuning (SFT) through token-level regularization, mitigating harmful fine-tuning beyond coarse data filtering. Experiments show that representative guardrails often fail to detect Embedded Attacks at the example level, while DR-SFT effectively counters this threat by preventing harmful supervision from being learned during fine-tuning.

embedded attack · dual-reference sft · token-level regularization · harmful supervision · contrastive objectives

Read original →

KnowsTFM: Knowledge-Informed Fine-Tuning of Small Tabular Foundation Models

arXiv cs.AI · Boshko Koloski, Xiangjian Jiang, Senja Pollak, Blaž Škrlj · 2026-06-29

The paper introduces KnowsTFM, a method for knowledge-informed fine-tuning of small tabular foundation models (TabPFN/TabICL variants) in niche domains with scarce data. It combines structural attention priors from knowledge graphs with parameter-efficient low-rank updates during adaptation. Results show meaningful performance gains over vanilla models in specialist settings (where pretraining distributions differ), while general-domain tasks see marginal improvements. The study also identifies catastrophic forgetting risks during continual fine-tuning of frontier models.

tabular foundation models · knowledge graphs · parameter-efficient fine-tuning · attention priors · catastrophic forgetting

Read original →

EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots

arXiv cs.AI · Camilo Chacón Sartori · 2026-06-29

EMPATH introduces a multilingual benchmark for evaluating safety in emotional-support chatbots, addressing limitations of fixed-prompt approaches. The method employs an auditor model to generate multi-turn conversations from 140 seed instructions and 34 personas, with a judge model scoring transcripts across 19 metrics in five dimensions (crisis handling, therapeutic quality, etc.). Results show score inflation on 10 metrics under standard rubrics, with model-specific divergences up to six points; cross-family judge agreement reaches 93% within ±1 score, while run-to-run reliability varies significantly across models.

emotional-support chatbots · multilingual benchmark · auditor-judge framework · safety evaluation · conversational integrity

Read original →

Inoculation Adapters: Improved Selective Generalization of Capabilities with Fewer Surprising Backdoors

arXiv cs.AI · Maxime Riché, Daniel Tan, Vili Kohonen, Niels Warncke · 2026-06-29

The paper introduces inoculation adapters (IA), a method to suppress undesired traits in AI models while preserving desired capabilities. IAs are LoRAs trained in three stages: exposure to undesired traits, frozen integration during task adapter training, and final deployment without the IA. Evaluated across six model families, IAs outperform inoculation prompting in suppressing emergent misalignment and avoiding backdoors, though both methods struggle with consistent retention of desired traits. The technique reduces optimization pressure toward undesired behaviors without relying on prompt-based elicitation.

inoculation adapters · emergent misalignment · lora · selective generalization · backdoors

Read original →

Curvature-Guided Sheaf Diffusion for Unsupervised Community Detection on Heterophilic Graphs

arXiv cs.AI · Feifan Wang · 2026-06-29

Curvature-Guided Sheaf Diffusion (CGSD) introduces a fully unsupervised community-detection algorithm for heterophilic graphs, leveraging discrete Forman-Ricci curvature as a topological signal. The method comprises three components: (i) a curvature-gated sheaf-diffusion encoder trained with label-free structural losses, (ii) a curvature-aware spectral clusterer (CSpec) that re-weights k-NN affinity, and (iii) a unified evaluation against nine unsupervised baselines. CGSD outperforms baselines on heterophilic benchmarks Wisconsin and Chameleon, achieving a 15% improvement in mean NMI over K-Means (0.091 to 0.107, p=0.008). The interpretable mechanism separates intra- and inter-community curvature distributions.

heterophilic graphs · forman-ricci curvature · sheaf diffusion · spectral clustering · unsupervised learning

Read original →
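
To give a feel for the curvature signal the item above relies on, the sketch below computes the simplest combinatorial form of Forman-Ricci curvature on an unweighted graph, F(u, v) = 4 - deg(u) - deg(v). This is an assumption for illustration: the paper may use a weighted or triangle-augmented variant, and the toy graph is made up.

```python
def forman_curvature(edges):
    """Forman-Ricci curvature per edge of an unweighted, undirected simple graph.

    Uses the basic combinatorial form F(u, v) = 4 - deg(u) - deg(v),
    which ignores triangles and edge weights.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return {(u, v): 4 - degree[u] - degree[v] for u, v in edges}


# Two triangles (0-1-2 and 3-4-5) joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
curvature = forman_curvature(edges)
# The bridge between the two groups receives the most negative curvature.
bridge = min(curvature, key=curvature.get)
```

In this toy graph the inter-community edge stands out from the intra-community ones, which is the kind of separation the summary describes.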

Clarus: Coordinating Autonomous Research Agents toward Web-Scale Scientific Collaboration

arXiv cs.AI · Zihan Guo, Zeyi Chen, Zhiyu Chen, Zicai Cui · 2026-06-29

Clarus introduces a collaboration infrastructure for coordinating autonomous research agents in web-scale scientific endeavors, shifting from code-centric execution to research-oriented collaboration processes. The system organizes scientific collaboration across four layers—Research Application, Digital Collaboration, Physical Substrate, and Physical World—using a minimal project-agent-resource object model. Core modules are implemented as pluggable mechanisms, enabling adaptation to task risk, collaboration structure, and resource constraints. A controlled paper-generation case study demonstrates Clarus's ability to structure research goals into traceable, reviewable, attributable, and accumulative collaboration networks. The framework supports open, auditable, and resource-aware multi-phase collaboration processes, providing a foundation for open research networks.

autonomous research agents · collaboration infrastructure · pluggable mechanisms · resource-aware · multi-phase collaboration

Read original →

EvalSafetyGap: A Hybrid Survey and Conceptual Framework for LLM Evaluation-Safety Failures

arXiv cs.AI · Buğra Alperen Uluırmak, Rifat Kurban · 2026-06-29

The paper introduces EvalSafetyGap, a conceptual framework for analyzing discrepancies between evaluation metrics and latent safety properties in LLMs under optimization pressure. Combining a hybrid survey (systematic review + grey evidence synthesis) with a 10-model audit, it examines eight evidence streams including benchmark validity, jailbreak robustness, and mechanistic interpretability from 2018-2026. Results show weak correlation between capability and adversarial robustness (r=0.232, p=0.520), with safety gaps primarily governance-driven and sensitive to measurement protocols. The work provides standardized constructs (Instability Decomposition, Alignment Trilemma) and diagnostic tools for dynamic evaluation and alignment auditing.

eval-safety gap · goodhart's law · instability decomposition · alignment trilemma · jailbreak robustness

Read original →
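As a rough sanity check on the reported correlation, the sketch below recomputes a two-sided p-value for r = 0.232. It assumes the correlation is a Pearson r over the 10 audited models and that significance comes from the standard t-test with n - 2 degrees of freedom; neither assumption is confirmed by the summary, and the function names are illustrative.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(r, n, steps=2000):
    """Two-sided p-value for a Pearson r over n points (t-test, n - 2 df)."""
    df = n - 2
    t = abs(r) * math.sqrt(df) / math.sqrt(1 - r * r)
    # Simpson's rule (steps must be even) for the density integral over [0, t].
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3
    return t, 1 - 2 * area

# Assumption: n = 10 because the summary mentions a 10-model audit.
t_stat, p_value = two_sided_p(r=0.232, n=10)
print(round(t_stat, 3), round(p_value, 3))
```

Under these assumptions the p-value lands close to the reported 0.520, which is consistent with reading the result as "no detectable linear relationship at this sample size" rather than evidence of independence.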

Efficient RGB-T Object Detection via Sparse Cross-Modality Fusion

arXiv cs.AI · Chao Tian, Zikun Zhou, Chao Yang, Guoqing Zhu · 2026-06-29

The paper proposes a sparse cross-modality fusion mechanism for efficient RGB-T object detection, addressing computational inefficiency in existing dual-backbone methods. The two-stage framework first uses lightweight modality-specific detectors to generate high-recall region proposals, then applies feature fusion only to sparse foreground regions for refinement. Experiments demonstrate competitive accuracy with significantly reduced parameters (exact counts unspecified) and computational cost, while maintaining scalability to high-resolution images.

rgb-t detection · sparse fusion · cross-modality · two-stage framework · computational efficiency

Read original →

A Multi Center Breast FNAC Whole-Slide Cytology Dataset for AI-Assisted Patch-Wise Classification Using C1 to C5 Reporting Categories

arXiv cs.AI · Garima Jain, Abhijeet Patil, Surabhi Jain, Sanghamitra Pati · 2026-06-29

The authors introduce a multi-center breast fine needle aspiration cytology (FNAC) dataset for AI-assisted patch-wise classification, comprising 470 whole-slide images (WSIs) from 321 patients across Indian tertiary medical centers. The dataset includes 7,398 PNG image patches extracted from 446 annotated WSIs, labeled using C1 to C5 reporting categories and stained with Papanicolaou or May-Grünwald Giemsa. Images were scanned at 40x magnification (0.25 microns per pixel) using a Hamamatsu NanoZoomer S360 and stored in NDPI format. The release provides NDPI WSIs, GeoJSON annotations, extracted patches, metadata, and inspection code, totaling approximately 950 GB and accessible via Zenodo.

fine needle aspiration cytology · whole-slide images · patch-wise classification · papanicolaou staining · geojson annotations

Read original →

The Many-Body Problem of the Data Centre

arXiv cs.AI · Marcin Korecki, Cesare Carissimo · 2026-06-29

The paper reframes modern AI's embodiment through data centers, arguing they serve as AI's biological-like bodies while simultaneously functioning as capital's laboring organs. It develops an organic analogy to analyze data centers as non-unique, universal embodiments that process human-desire-born data without intrinsic desires. The analysis reveals a many-body problem in this distributed computational embodiment and demonstrates how capital equates artificial and human intelligence through market pricing mechanisms, enabling cross-domain intelligence valuation.

data center embodiment · many-body problem · organic analogy · computational labor · intelligence valuation

Read original →

Forewarned is Forearmed: When Non-Sequential Embedding Turns Into an Anomaly Detector

arXiv cs.AI · Elys Allesiardo, Antoine Caubrière, Valentin Vielzeuf · 2026-06-29

The paper introduces a novel anomaly detection method leveraging non-sequential multimodal sentence-level embeddings, particularly in the SONAR model. It identifies embedding dimensions sensitive to perturbations, using consistency between successive encoding and decoding processes to build an accurate detector. The authors also explore modifying specific dimensions to correct anomalies. This approach enhances the reliability of multimodal representations by emphasizing the importance of embedding analysis.

non-sequential embeddings · multimodal representations · anomaly detection · sonar model · embedding dimensions

Read original →

Domain Adaptation with Adaptive Imagination for Visual Reinforcement Learning under Limited Target Data

arXiv cs.AI · Hyunwoo Park, Sang-Hyun Lee · 2026-06-29

The paper proposes AIDA (Adaptive Imagination for Domain Adaptation), a domain adaptation framework for visual reinforcement learning that addresses sim-to-real transfer under scarce target data without additional environment interaction. AIDA employs adaptive imagination, generating reliable rollouts via a distribution-shift-aware discriminator that truncates low-confidence transitions, and introduces a self-consistency loss that cycles through state-image-state to penalize reconstruction discrepancies. Experiments on five MuJoCo tasks and two Gymnasium-Robotics tasks demonstrate that AIDA effectively truncates unreliable rollouts, learns semantically meaningful state representations, and outperforms baselines.

domain adaptation · sim-to-real transfer · adaptive imagination · self-consistency loss · visual reinforcement learning

Read original →
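The rollout-truncation idea in the AIDA summary can be illustrated with a toy loop. This is only a sketch: `policy`, `model`, and `confidence` are hypothetical stand-ins for the learned policy, the world model, and the distribution-shift-aware discriminator, and the fixed threshold `min_conf` is an assumption rather than the paper's rule.

```python
def truncated_rollout(start_state, policy, model, confidence,
                      horizon=10, min_conf=0.5):
    """Imagine up to `horizon` steps, stopping at the first low-confidence one."""
    state, rollout = start_state, []
    for _ in range(horizon):
        action = policy(state)
        next_state = model(state, action)
        if confidence(state, action, next_state) < min_conf:
            break  # drop this transition and everything after it
        rollout.append((state, action, next_state))
        state = next_state
    return rollout

# Toy stand-ins: the further the imagined state drifts, the less it is trusted.
policy = lambda s: 1
model = lambda s, a: s + a
confidence = lambda s, a, s_next: max(0.0, 1.0 - 0.2 * s_next)

rollout = truncated_rollout(0, policy, model, confidence)
print(rollout)
```

With these toy functions the rollout stops once the imagined state drifts far enough for confidence to fall under the threshold, so only the early, trusted transitions are kept for training.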

From Detecting Agency to Doing Work: Self-Caused Credit Builds a Durable Behavioral Self in a Minimal Spiking Agent

arXiv cs.AI · Haoliang Han · 2026-06-29

The study demonstrates that agency-gated slow credit (Own*Agency*Salience) enables durable behavioral self-shaping in spiking neural agents, contrasting with transient agency detection. Using Nengo LIF/PES networks, the authors show this mechanism produces post-unload behavioral retention (retained fraction 0.96) that collapses when slow decoders reset or agency gating is removed. Experiments across 24D control tasks and sequential learning (8 tasks) confirm the necessity of slow self-credit for durable behavior (final accuracy 0.88 vs 0.00 baselines) and interference resistance, formalized as an operational behavioral self without consciousness claims.

agency-gated credit · spiking neural networks · behavioral residue · slow parameter update · self-preservation

Read original →

Few-Shot Domain Incremental Learning via Continual Vision-Language Consolidation

arXiv cs.AI · Naeem Paeedeh, Mahardhika Pratama, Wolfgang Mayer, Mukesh Prasad · 2026-06-29

The paper introduces Continual Vision-Language Consolidation (CVLC), a novel algorithm addressing few-shot domain incremental learning (FSDIL) under extreme data scarcity. CVLC employs latent space reservation in the base domain and dual coalescent projection (DCP) for parameter-efficient fine-tuning. It calibrates vision prototypes, generates language prototypes via LLMs, and fuses them for adaptation to new domains. Structured with shared and domain-specific components, CVLC combines general knowledge and domain-specific details. Evaluations on benchmark problems show CVLC outperforms prior methods by up to 16%.

domain-incremental learning · few-shot learning · latent space reservation · dual coalescent projection · vision-language fusion

Read original →

Dynamo: Dynamic Skill-Tool Evolution for Vision-Language Agents

arXiv cs.AI · Yutao Sun, Yanting Miao, Hao-Xuan Ma, Mengyu Zhou · 2026-06-29

Dynamo introduces a training-free framework for enhancing vision-language models (VLMs) through dynamic skill-tool evolution, eliminating the need for retraining or manual prompt engineering. The method autonomously generates reusable reasoning skills and executable visual tools by analyzing correct and incorrect attempts on a small labeled subset, storing these capabilities in a persistent library. Evaluated across four visual reasoning benchmarks and five VLM backbones, Dynamo improves direct inference accuracy by an average of +5.6%, achieves optimal tool invocation when tools are pre-specified, and bridges 65–99% of the performance gap compared to task-specific RL methods at reduced computational cost.

vision-language models · dynamic skill evolution · visual reasoning · training-free adaptation · persistent capability library

Read original →

MirrorCode: AI can rebuild entire programs from behavior alone

arXiv cs.AI · Tom Adamczewski, David Owen, David Rein, Florian Brand · 2026-06-29

(No summary returned.)

Read original →

Beyond Drug Discovery: The Nanotechnology Molecular Optimization (NMO) Benchmark

arXiv cs.AI · Matthias Blaschke, Daniel Kienzle, Zsuzsanna Koczor-Benda, Julian Lorenz · 2026-06-29

The Nanotechnology Molecular Optimization (NMO) Benchmark is introduced to bridge machine learning and quantum materials science, addressing limitations in transferability from drug discovery-focused benchmarks. NMO employs quantum simulations instead of proxy oracles and implements strict protocols to prioritize scientific utility over leaderboard overfitting. It imposes hard structural constraints and rugged fitness landscapes, challenging existing generative models. A novel baseline method is developed, incorporating a structural constraint representation and domain-agnostic pretraining to mitigate pharmaceutical dataset bias. Results surpass state-of-the-art physical properties and uncover new structural motifs, demonstrating ML's potential for genuine scientific discovery in nanotechnology.

quantum simulations · structural constraints · fitness landscapes · domain-agnostic pretraining · nanotechnology

Read original →

Federated Learning with Energy-Based Structured Probabilistic Inference

arXiv cs.AI · Dario Fenoglio, Daniil Kirilenko, Martin Gjoreski, Marc Langheinrich · 2026-06-29

The paper proposes a federated learning framework that improves client aggregation weights using Conditional Random Fields (CRFs). The method models client-specific reliability via unary potentials and client interactions via pairwise potentials, enabling optimized weight assignment during global model updates. Experiments demonstrate consistent performance gains over standard federated learning baselines in non-IID data settings.

federated learning · conditional random fields · non-iid data · client aggregation · probabilistic inference

Read original →

Physically-Constrained Harmonic Separation for Robust Heart and Respiratory Rate Estimation from Wrist Photoplethysmography

arXiv cs.AI · Nouhaila Fraihi, Ouassim Karrakchou, Mounir Ghogho · 2026-06-29

The authors propose Physically-Constrained Harmonic Separation (PCHS), a novel framework for robust heart rate (HR) and respiratory rate (RR) estimation from wrist photoplethysmography (PPG) under motion artifacts. PCHS formulates HR/RR estimation as an analysis-by-synthesis problem, using accelerometer measurements to condition artifact separation rather than direct regression. A physics-guided harmonic generator decomposes the PPG signal into quasi-periodic physiological components and motion-related residuals, enabling HR recovery from fundamental frequency and RR prediction from respiratory-driven harmonic modulations. Experiments on the motion-intensive PPG-DaLiA dataset show PCHS outperforms state-of-the-art methods while providing interpretable signal decompositions that disentangle physiological activity from motion artifacts.

photoplethysmography · harmonic separation · analysis-by-synthesis · physiological components · motion artifacts

Read original →

Estimating Grammatical Gender Directions in Contextual Embeddings under Controlled and Natural Contexts

arXiv cs.AI · Huanping Xiao, Yingji Li · 2026-06-29

The study presents the first method to disentangle grammatical gender from semantic bias in contextual embeddings for gendered languages like Spanish. Using controlled templates and natural Wikipedia contexts, the authors construct balanced datasets of inanimate nouns and propose a framework with centroid, SVM, and LDA gender direction estimators, plus contamination-aware weighting strategies. Evaluation via dual-objective metrics shows unweighted controlled contexts yield the purest grammatical gender direction, with the centroid estimator outperforming discriminative baselines in suppressing gender leakage while preserving semantic distinctions.

contextual embeddings · grammatical gender · semantic bias · gender direction estimators · contamination-aware weighting

Read original →
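The centroid estimator mentioned in the grammatical-gender summary is commonly defined as the normalized difference between class means. The sketch below assumes that definition and uses made-up 3-dimensional vectors in place of real contextual embeddings; the helper names are illustrative.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def centroid_direction(masc, fem):
    """Unit vector pointing from the feminine centroid to the masculine one."""
    mu_m, mu_f = mean_vector(masc), mean_vector(fem)
    diff = [a - b for a, b in zip(mu_m, mu_f)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def project(vec, direction):
    """Signed length of vec along the direction."""
    return sum(a * b for a, b in zip(vec, direction))

# Toy embeddings of inanimate nouns (values are invented for illustration).
masc = [[1.0, 0.2, 0.0], [0.8, -0.1, 0.1]]
fem = [[-1.0, 0.1, 0.0], [-0.8, 0.0, 0.1]]

direction = centroid_direction(masc, fem)
scores = [project(v, direction) for v in masc + fem]
print(direction, scores)
```

Positive scores fall on the masculine side and negative scores on the feminine side. Because the estimator only uses class means, it has no decision boundary to overfit, which may be one reason it leaks less semantic signal than the SVM and LDA baselines.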

FacePlex: Full-Duplex Joint Speech-Facial Motion Generation for Conversational Avatars

arXiv cs.AI · Habin Lim, Jae-Ho Lee, Hah Min Lew, Ji-Su Kang · 2026-06-29

FacePlex introduces a full-duplex framework for joint speech-facial motion generation in conversational avatars, addressing the gap between real-time speech synthesis and synchronized facial animation. The method employs Rolling Flow Matching for online motion frame generation and Rolling Cross-Attention to couple streaming audio and motion queues bidirectionally. Experiments demonstrate superior lip-sync quality and motion fidelity compared to audio-driven baselines under streaming constraints.

full-duplex generation · flow matching · cross-attention · lip-sync · streaming synthesis

Read original →

Relevance Is Not Permission: Warranted Attention for Value Contributions

arXiv cs.AI · Minwoo Yu, Young-guk Ha · 2026-06-29

The paper introduces Warrant, a path-localized interface addressing the gap between attention relevance and value contribution in neural models. By formalizing this as a permission problem, Warrant modifies the weighted value term α_ij * v_j to α_ij * g_ij * v_j via learned query-item permission g_ij, while preserving attention relevance. Evaluated across 32 comparisons in tasks like CTDG link prediction and TKG tail prediction, Warrant improved primary metrics in 27 cases, with notable gains (+0.1076 AUC in CTDG, +0.0683 MRR in TKG). Ablations reveal domain-specific benefits, e.g., historical-tail value path exposure in TKG and edge-conditioned permission in CTDG.

attention relevance · value contribution · path-localized · query-item permission · metric-defining value paths

Read original →
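The change from α_ij * v_j to α_ij * g_ij * v_j can be sketched for a single query as follows. The summary does not say how g_ij is parameterized, so the sigmoid over a per-item logit is an assumption, and `gate_logits` is a hypothetical input standing in for the output of the learned permission function.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gated_attention(query, keys, values, gate_logits):
    """One query: output_d = sum_j alpha_j * g_j * v_j[d]."""
    scale = math.sqrt(len(query))
    # Relevance weights are computed exactly as in standard attention.
    alpha = softmax([dot(query, k) / scale for k in keys])
    # Permission in (0, 1); assumed sigmoid form, not taken from the paper.
    g = [sigmoid(z) for z in gate_logits]
    dim = len(values[0])
    out = [sum(a * gj * v[d] for a, gj, v in zip(alpha, g, values))
           for d in range(dim)]
    return alpha, g, out

# Two equally relevant items; only the first is permitted to contribute.
alpha, g, out = gated_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
    gate_logits=[10.0, -10.0],
)
print(alpha, g, out)
```

Both items receive the same attention weight, yet the second contributes almost nothing to the output, which is the relevance-versus-permission distinction the paper's title refers to.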

Query-Aware Spreading Activation for Multi-Hop Retrieval over Knowledge Graphs

arXiv cs.AI · Illia Makarov, Mykola Glybovets · 2026-06-29

The paper introduces a query-aware spreading activation method for multi-hop retrieval over knowledge graphs, addressing limitations of query-blind traversal in existing Graph RAG systems. The proposed approach uses a single per-step semantic gate (cosine similarity between candidate entity descriptions and the question) to enable query-aware traversal, expressed as a single Cypher query executed in Neo4j. On MuSiQue, it comes within 0.7 points of QAFD-RAG's exact match (32.80 vs 33.50) and outperforms HippoRAG by 5.3 EM and 3.4 F1, while reducing retrieval latency by 1.5-4.9×. Ablation confirms the gate's contribution to both performance gains (3.6-7.4 F1) and latency reduction.

knowledge graphs · multi-hop retrieval · spreading activation · query-aware traversal · graph rag

Read original →
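A minimal in-memory sketch of gated spreading activation follows. The paper runs the traversal as a Cypher query in Neo4j; this Python stand-in only illustrates the gating idea, and the `decay`, `threshold`, and `hops` parameters as well as the toy graph and vectors are assumptions.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def spread(graph, embeddings, question, seeds, hops=2, decay=0.5, threshold=0.3):
    """Propagate activation from seeds, gating each step by question similarity."""
    activation = {s: 1.0 for s in seeds}
    frontier = dict(activation)
    for _ in range(hops):
        next_frontier = {}
        for node, act in frontier.items():
            for nb in graph.get(node, []):
                gate = cosine(embeddings[nb], question)
                if gate < threshold:
                    continue  # neighbor is irrelevant to the question: prune
                value = act * decay * gate
                if value > activation.get(nb, 0.0):
                    activation[nb] = value
                    next_frontier[nb] = value
        frontier = next_frontier
    return sorted(activation.items(), key=lambda kv: -kv[1])

# Toy graph; vectors stand in for embedded entity descriptions.
graph = {"paris": ["france", "eiffel"], "france": ["europe"], "eiffel": []}
embeddings = {"france": [1.0, 0.0], "eiffel": [0.0, 1.0], "europe": [0.9, 0.1]}
question = [1.0, 0.0]

ranked = spread(graph, embeddings, question, seeds=["paris"])
print(ranked)
```

Pruning low-similarity neighbors both sharpens the ranking and shrinks the frontier, which matches the summary's claim that the gate helps accuracy and latency at once.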

Hyper-Network Neural Functional Maps for Unsupervised Robust 3D Shape Matching

arXiv cs.AI · Dongliang Cao, Florian Bernard · 2026-06-29

The paper introduces Hyper-Network Neural Functional Maps (NFM), a novel unsupervised method for robust 3D shape matching that addresses limitations of existing functional map approaches in challenging scenarios like partiality and topological noise. The method employs a hyper-network to predict weights for an MLP with skip-connections, refining standard functional maps (FM) to better align spectral bases. Trained with an unsupervised spectral alignment loss, NFM integrates seamlessly into deep functional map pipelines, significantly improving matching accuracy in demanding conditions.

neural functional maps · hyper-network · spectral alignment · 3d shape matching · unsupervised learning

Read original →

Does Verbose Chain-of-Thought Really Help? In-Distribution Evidence that Content, Not Length, Matters

arXiv cs.AI · Wenlong Wang, Fergal Reid · 2026-06-29

This study investigates whether verbose chain-of-thought (CoT) prompting improves LLM reasoning due to semantic content or token count. Two methods are employed: in-distribution analysis comparing shorter and longer natural generations across 25 models, and controlled interventions using dual-validator designs across four targets and eight benchmarks. Results show that extra tokens leave accuracy unchanged across independently-trained reasoners, and verbose traces improve accuracy modestly (1-4 points) depending on prose quality, not length. Maximum numerical redaction amplifies effects (median 3.24x), while non-reasoning filler recovers none. Findings converge on the importance of reasoning and validation content over token count.

chain-of-thought · llm · semantic content · token count · dual-validator

Read original →

Gravitational Duals from Equations of State II: Large Hierarchies and False Vacua

arXiv cs.AI · Raul Jimenez, David Mateos, Pavlos Protopapas, Pau Solé-Vilaró · 2026-06-29

The authors advance the reconstruction of holographic duals for strongly coupled quantum field theories in regimes featuring large hierarchies and false vacua, extending previous Physics-Informed Neural Networks (PINNs) methodologies. They address challenges such as near-degenerate states, energy scale hierarchies, and unprobed potential regions through methodological innovations. The framework accurately reconstructs scalar potentials deep into the false vacuum regime, achieving robust agreement with underlying thermodynamic features despite numerical stiffness. This work bridges holography and machine learning, demonstrating data-driven approaches' potential to elucidate strongly coupled systems.

holographic duals · false vacua · physics-informed neural networks · renormalization group flows · scalar potentials

Read original →

Open Problems in Constitutional Preference Reconstruction

arXiv cs.AI · Eleanor Clifford, Michael Amir, Arduin Findeis, Aaron Zhao · 2026-06-29

The paper identifies three open problems in constitutional preference reconstruction methods for language model training: difficulty in measuring principle quality, ambiguity in principle composition, and variability between LLMs. Using Inverse Constitutional AI (ICAI+) on datasets like PRISM, AlpacaEval, and Chatbot Arena, the authors demonstrate that principle refinement improves inter-executor agreement (78% vs. 73%) and matches LLM judge accuracy (66% vs. 67%). Results suggest constitutions should be evaluated as constitution-executor systems, with implications for LLMs-as-a-judge paradigms.

constitutional preference reconstruction · inverse constitutional ai · llm judge · principle composition · pairwise preference data

Read original →

SA-VLA: State-aware tokenizer for improving Vision-Language-Action Models' performance

arXiv cs.AI · Tengyue Jiang, Chunpu Xu, Jiayue Kang, Yao Mu · 2026-06-29

SA-VLA introduces a state-aware action tokenizer for vision-language-action (VLA) models, addressing the limitation of fixed continuous action prototypes in existing tokenizers by conditioning action decoding on robot state. The method employs two state-injection mechanisms: cross-attention between state and action features, and a lightweight state adapter predicting action-wise modulation factors for state-conditioned action modulation. Evaluated on 12 RoboTwin manipulation tasks, SA-VLA improves average success rates from 0.29 to 0.56 over baselines, and from 0.15 to 0.33 in zero-shot sim-to-real experiments, demonstrating reduced compression gap in discrete VLA policies.

vision-language-action models · action tokenization · state-conditioned decoding · robot manipulation · sim-to-real transfer

Read original →

Automating the Design of Embodied Agent Architectures

arXiv cs.AI · Jian Zhou, Sihao Lin, Jin Li, Shuai Fu · 2026-06-29

The paper introduces AgentCanvas, a typed-graph runtime for embodied agent architectures, and KDLoop, a coding-agent search procedure, to automate architectural design in perceptual embodied agents. The method evaluates three Agent Architecture Search (AAS) variants across four embodied executors, including vision-language navigation and language-conditioned manipulation tasks. Results show architecture-level search improves success rates, though optimization signals are masked by rollout noise and search can stall in local edit basins, revealing both potential and current limitations of automated search for embodied agents.

embodied agents · architecture search · vision-language navigation · typed-graph runtime · credit assignment

Read original →

Structural Certification for Reliable Physical Design with Language Models

arXiv cs.AI · Nakul Vyas, Iliya D. Stoev · 2026-06-29

The paper introduces Physics-Anchored Certification (PHACT), a propose-certify framework that ensures reliable physical design generation by language models through deterministic certification. PHACT decouples proposal (by LM) from certification (by deterministic engine), deriving certified quantities from fixed inputs to prevent forgery. Evaluated across 80 adversarial trials with two models (unspecified), two decoding temperatures, and a faulted engine, the method achieved zero false certifications, demonstrating robustness in five scientific domains.

physics-anchored certification · deterministic certification · propose-certify loop · language models · physical design

Read original →
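The propose-certify split can be illustrated with a toy example in which the certifier never reads the proposer's claimed result and recomputes everything from fixed inputs. The voltage-divider task, the `Proposal` type, and the tolerance are all assumptions chosen for illustration; the summary does not name the paper's five domains or its engine interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Proposal:
    r1_ohm: float
    r2_ohm: float
    claimed_vout: float  # stated by the proposer; never trusted

def certify(proposal, v_in, target_vout, tolerance=0.05):
    """Deterministically recompute V_out and compare it with the target."""
    if proposal.r1_ohm <= 0 or proposal.r2_ohm <= 0:
        return False, None
    # Voltage divider: V_out = V_in * R2 / (R1 + R2).
    vout = v_in * proposal.r2_ohm / (proposal.r1_ohm + proposal.r2_ohm)
    return abs(vout - target_vout) <= tolerance, vout

honest = Proposal(r1_ohm=1000.0, r2_ohm=1000.0, claimed_vout=2.5)
forged = Proposal(r1_ohm=1000.0, r2_ohm=100.0, claimed_vout=2.5)

ok_honest, v_honest = certify(honest, v_in=5.0, target_vout=2.5)
ok_forged, v_forged = certify(forged, v_in=5.0, target_vout=2.5)
print(ok_honest, v_honest, ok_forged, v_forged)
```

Both proposals claim the same output, but only the first passes, because the certified quantity is derived from the component values rather than from anything the proposer asserts.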

Propagation of Interval Belief Structures and Imprecise Copulas for Neural Network Verification

arXiv cs.AI · Francesc Pifarre-Esquerda, Eric Goubault, Sylvie Putot · 2026-06-29

The paper proposes a sound framework for quantitative verification of neural networks under imprecise probabilistic information, combining interval belief structures for marginal uncertainty with imprecise copulas for uncertain dependence. The method develops propagation techniques for imprecisely coupled interval belief structures through feed-forward networks, using mixed imprecise copula volumes to derive sound push-forward constructions via affine transformations and activation functions. Results demonstrate guaranteed lower and upper bounds on probabilistic safety properties, valid for all probability models compatible with the specified imprecise inputs.

interval belief structures · imprecise copulas · neural network verification · probabilistic safety · affine transformations

Read original →
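To make the interval-belief-structure idea concrete, the sketch below pushes a one-dimensional belief structure (a list of intervals with probability masses) through a single ReLU neuron and reads off lower and upper probabilities in the usual Dempster-Shafer way. It is a deliberately reduced illustration: it handles one input only, so the imprecise copulas that model dependence between inputs are left out, and the function names are hypothetical.

```python
def affine_interval(lo, hi, w, b):
    """Exact image of [lo, hi] under x -> w * x + b."""
    a, c = w * lo + b, w * hi + b
    return (min(a, c), max(a, c))

def relu_interval(lo, hi):
    """Exact image of [lo, hi] under ReLU (monotone, so endpoints suffice)."""
    return (max(lo, 0.0), max(hi, 0.0))

def propagate(focal_elements, w, b):
    """Push each (interval, mass) pair through relu(w * x + b)."""
    return [(relu_interval(*affine_interval(lo, hi, w, b)), mass)
            for (lo, hi), mass in focal_elements]

def probability_bounds(focal_elements, threshold):
    """Lower and upper probability that the output is <= threshold."""
    lower = sum(m for (lo, hi), m in focal_elements if hi <= threshold)
    upper = sum(m for (lo, hi), m in focal_elements if lo <= threshold)
    return lower, upper

# Input uncertainty: three intervals with masses summing to 1.
inputs = [((-1.0, 0.0), 0.5), ((0.0, 2.0), 0.3), ((1.0, 3.0), 0.2)]
outputs = propagate(inputs, w=2.0, b=-1.0)
lower, upper = probability_bounds(outputs, threshold=0.5)
print(outputs, lower, upper)
```

An interval counts toward the lower bound only if it lies entirely under the threshold, and toward the upper bound if any part of it does, so every probability distribution compatible with the input masses falls between the two numbers.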

Temporal Feature Extractors in EEG Foundation Models: A Controlled Comparison Including a Pretrained Time-Series Model

arXiv cs.AI · Ayşe Betül Yüce, Chris Joey Leffler, Sarun Varghese, Myra Spiliopoulou · 2026-06-29

This work systematically compares temporal feature extractors for EEG foundation models, including a linear baseline, convolutional encoder, and frozen pretrained time-series foundation model (MOMENT). The study evaluates representation quality on motor imagery and emotion recognition tasks, revealing task-dependent performance: motor imagery benefits from simpler temporal representations, while emotion recognition requires richer temporal modeling. Notably, the general-purpose MOMENT model transfers effectively as a frozen feature extractor despite no EEG-specific adaptation, demonstrating cross-domain applicability of time-series representations.

eeg foundation models · temporal feature extraction · time-series transfer learning · motor imagery · emotion recognition

Read original →

Hierarchical Reinforcement Learning in StarCraft Micromanagement with Influence Maps and Cluster-based Scripts

arXiv cs.AI · Chunhui Bai, Changhe Li, Dequan Li, Xinye Cai · 2026-06-29

The paper proposes HRL-IM/CBS, a hierarchical reinforcement learning framework for StarCraft micromanagement that combines influence map hashing and cluster-based scripts. The method encodes global battlefield states via hexadecimal influence maps and enables adaptive unit coordination through cluster-based tactical modules, using a hierarchical multi-Q-table architecture with dense reward allocation. Experiments in six asymmetric scenarios show competitive performance against deep RL baselines, with improved sample efficiency and interpretability through transparent Q-table representations.

hierarchical reinforcement learning · influence map hashing · cluster-based scripts · multi-q-table · starcraft micromanagement

Read original →

SAT-RTS: A systematic framework for tactical knowledge extraction and visualization-based analysis in real-time strategy games

arXiv cs.AI · Chunhui Bai, Changhe Li, Yuqiang Li, Lei Liu · 2026-06-29

The SAT-RTS framework enhances interpretable tactical knowledge extraction in real-time strategy games by integrating visualization with automated pattern extraction from high-dimensional sequence data. It employs a cluster-centric BK-tree algorithm with specialized distance metrics for state-stream abstraction and a rule-based multi-label extraction method to transform raw sequences into discrete tactical labels. Experiments show SAT-RTS improves interpretability and efficiency in tactical analysis of complex RTS environments.

real-time strategy games · tactical knowledge extraction · bk-tree algorithm · multi-label extraction · fitness landscape visualization

Read original →
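For readers unfamiliar with BK-trees, here is a plain textbook version using Hamming distance over equal-length state strings. It is not the paper's cluster-centric variant, and the distance metric and the toy state encoding are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

class BKTree:
    """Metric tree: each child edge is labeled with its distance to the parent."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None  # a node is a pair (item, {edge_distance: child})

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            if d == 0:
                return  # item already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (item, {})
                return
            node = child

    def search(self, query, radius):
        """Return sorted (distance, item) pairs within `radius` of the query."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            item, children = stack.pop()
            d = self.distance(query, item)
            if d <= radius:
                results.append((d, item))
            # Triangle inequality: matches can only sit under edges
            # labeled within [d - radius, d + radius].
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(results)

states = ["AABB", "AABA", "ABBA", "BBBB", "AAAA"]
tree = BKTree(hamming)
for s in states:
    tree.add(s)

near = tree.search("AABB", radius=1)
print(near)
```

The triangle-inequality pruning is what makes the structure attractive for grouping similar game states: a query only descends into subtrees that could contain a close match instead of comparing against every stored state.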

Online Data Selection for Instruction Tuning via Gaussian Processes

arXiv cs.AI · Jun Wang, Quoc Phong Nguyen, Julien Monteil, Vu Nguyen · 2026-06-29

The paper introduces GAIA, a global adaptive instruction tuning framework using Gaussian Processes for online data selection in LLM training. GAIA models utility manifolds across semantic space via Gaussian Process regression and employs adaptive strategy fusion to prioritize high-utility samples dynamically. The method, framed under the fixed-share Hedge framework, guarantees robustness under non-stationary quality scores. Evaluations on three datasets show GAIA outperforms state-of-the-art baselines like GREATS in instruction tuning efficiency.

gaussian processes · instruction tuning · data selection · dynamic regret · semantic space

Read original →
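The fixed-share Hedge update referenced in the GAIA summary is a standard online-learning rule, sketched below for a small set of competing selection strategies. How GAIA defines the strategies and their losses is not stated in the summary, so the loss sequence and the values of `eta` and `alpha` are invented for illustration.

```python
import math

def fixed_share_hedge(loss_rounds, eta=1.0, alpha=0.1):
    """Return the strategy weights after each round of fixed-share Hedge."""
    n = len(loss_rounds[0])
    weights = [1.0 / n] * n
    history = []
    for losses in loss_rounds:
        # Hedge step: exponentially down-weight by loss, then normalize.
        scaled = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        total = sum(scaled)
        posterior = [s / total for s in scaled]
        # Fixed-share step: mix a fraction alpha back toward uniform so a
        # strategy that did badly early can recover if it improves later.
        weights = [(1 - alpha) * p + alpha / n for p in posterior]
        history.append(weights)
    return history

# Strategy 0 is best for five rounds, then strategy 1 takes over.
loss_rounds = [[0.0, 1.0]] * 5 + [[1.0, 0.0]] * 5
history = fixed_share_hedge(loss_rounds)
print(history[4], history[9])
```

The mixing step keeps every weight at or above alpha / n, which is what lets the learner track the switch in this toy sequence and is the usual source of robustness guarantees under non-stationary losses.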

ACPO: Agent-Chained Policy Optimization for Multi-Agent Reinforcement Learning

arXiv cs.AI · Daiki E. Matsunaga, Junho Na, Tri Wahyu Guntara, Scott Sanner · 2026-06-29

The paper introduces Agent-Chained Policy Optimization (ACPO), a novel method for Multi-Agent Reinforcement Learning (MARL) under the Centralized Training with Decentralized Execution (CTDE) paradigm. ACPO decomposes the joint policy gradient into per-agent terms using decentralized critics and score functions, enabling independent actor updates that collectively form a joint gradient step. Key to this approach is a serialized decision process where agents condition actions on beliefs about preceding actions, ensuring coordination. Evaluated on Multi-Robot Warehouse, SMACv2, and MA-MuJoCo, ACPO outperforms baselines, particularly as agent count increases.

multi-agent reinforcement learning · policy gradient · decentralized execution · nash equilibria · score functions

Read original →

Neural Subspace Reallocation: Continual Learning as Retrieval-Based Subspace Memory Management

arXiv cs.AI · Byeong Hoon Yoon · 2026-06-29

Neural Subspace Reallocation (NSR) reformulates continual learning as parameter subspace memory management, treating Low-Rank Adaptation (LoRA) modules as compressible, retrievable memory units. The method cycles through compressing LoRAs via SVD, storing them in a TaskKnowledgeBank, recalling similar past LoRAs via embedding similarity, and reallocating active subspaces with distillation. Theoretical analysis shows memoryless policies incur Ω(T(M-1)Δ_switch) regret versus history-aware policies. Experiments demonstrate 10x faster cyclic recovery on Split-CIFAR-100, 9x reduced backward transfer on 5-Datasets, and 0.29MB/task memory footprint, with similarity-based retrieval outperforming learned controllers.

neural subspace reallocation · low-rank adaptation · taskknowledgebank · continual learning · parameter memory

Read original →

Little Brains, Big Feats: Exploring Compact Language Models

arXiv cs.AI · Dari Baturova, Elena Bruches, Ivan Chernov, Roman Derunets · 2026-06-29

This study evaluates the performance of small language models (SLMs) in Retrieval-Augmented Generation (RAG) systems, addressing their underrepresentation in current research. Using diverse open-source and proprietary datasets, the authors benchmark SLMs across various subject areas and question types. Results indicate that RAG systems incorporating SLMs can operate efficiently on-device without GPU hardware, maintaining reasonable execution times. The experimental framework and supplementary materials are publicly accessible via GitHub.

small language models · retrieval-augmented generation · on-device execution · benchmarking · gpu hardware

Read original →

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

arXiv cs.AI · Yuxuan Fan, Gyusik Seo, Jing Hao, Jaemin Cho · 2026-06-29

The paper introduces MuseBench, a novel benchmark for evaluating multimodal large language models (MLLMs) on intent-level audiovisual arts understanding. The benchmark comprises 4,016 questions across cinematic arts, visual arts, stage performance, and game arts, derived from 10K+ video essays with professional commentary. Questions are generated via a four-phase pipeline involving shortcut filtering, adversarial distractors, and expert validation. Zero-shot evaluation of 28 MLLMs shows top accuracy of 48.29%, significantly below human expert performance (87.18%), revealing a critical gap in models' artistic reasoning capabilities.

multimodal large language models · audiovisual arts · intent-level understanding · zero-shot evaluation · adversarial distractors

Read original →

IBRSteG: Learning a Generalizable Steganography Framework for 3D Gaussian Splatting

arXiv cs.AI · Fanye Kong, Hongyu Xia, Yu Zheng, Boyang Gong · 2026-06-29

IBRSteG proposes a generalizable steganography framework for 3D Gaussian Splatting (3DGS) that embeds secret 3D scenes into cover scenes without per-scene optimization. The method introduces GAS (Gaussian Attributes Steganographer), a network that learns a scene-independent embedding function by injecting secret 3D Gaussian attributes into cover scenes, leveraging 2D learning paradigms for generalization. Experiments show IBRSteG achieves high visual quality, superior capacity, and security across diverse 3DGS scenes.

3d gaussian splatting · steganography · generalizable framework · gaussian attributes steganographer · scene-independent embedding

Read original →

T3R: Deeper Test-Time Adaptation for Graph Neural Networks via Gradient Rotation

arXiv cs.AI · Huy Truong, Alexander Lazovik, Victoria Degeler · 2026-06-29

T3R introduces a novel test-time adaptation method for Graph Neural Networks (GNNs) that enables deeper parameter updates using unlabeled test data. The approach leverages multiple Rotograd matrices to enhance task affinity between target and auxiliary tasks, coupled with a rotation technique that reorients self-supervised signals to generate surrogate gradients. This allows adaptation across nearly the entire architecture, addressing limitations of shallow updates in conventional Test-Time Training. Empirical results demonstrate a 0.172 reduction in MAE on regression datasets and at least 9.37% relative improvement on cross-domain OGB classification benchmarks compared to non-adaptive models.

graph neural networks · test-time training · rotograd matrices · self-supervised learning · surrogate gradients

Read original →

AlgoSkill: Learning to Design Algorithms by Scheduling Human-Like Skills

arXiv cs.AI · Xinyuan Song, Zekun Cai, Liang Zhao · 2026-06-29

AlgoSkill introduces a skill-based framework for algorithm design, modeling it as sequential decision-making over a typed library of human-like algorithmic skills (e.g., abstraction, constraint analysis). The method combines a learned scheduler with Monte Carlo Tree Search (MCTS) guided by verification feedback from compilation, testing, and complexity analysis. Experiments on competitive programming and combinatorial optimization benchmarks demonstrate improvements over direct LLM generation, chain-of-thought prompting, and baseline MCTS, with ablations highlighting the importance of typed skills, verification-based repair, and search-based scheduling.

algorithm design · monte carlo tree search · verification feedback · skill scheduling · complexity refinement

Read original →

Be Faithful When Response: Returning Fluent and Grounded Answers for Vision-Language Models Reinforcement Learning

arXiv cs.AI · Peng, Lee, Yin Zhang, Yanglin Zhang · 2026-06-29

The paper proposes Faithful Warm-Start (FWS), a strategy to improve Vision-Language Models' (VLMs) reasoning by ensuring visual grounding before reinforcement learning. FWS curates the FaithfulQA dataset from six VQA benchmarks, selecting samples with explicit vision-language causal relationships, then purifies it using a VLM-based judge for causal consistency. This warm-start phase enhances the model's understanding of grounded patterns prior to RL optimization. Experiments demonstrate improved answer accuracy (quantitative results unspecified), stabilized RL training, and reduced ungrounded reasoning compared to direct RL application.

vision-language models · reinforcement learning · visual grounding · faithfulqa dataset · causal consistency

Read original →

Stabilizing Extrapolation in Looped Transformers via Learned Stochastic Stopping

arXiv cs.AI · Hsun-Yu Kuo, El Mahdi Chayti, Patrik Reizinger, Wieland Brendel · 2026-06-29

The paper introduces learned stochastic stopping to stabilize extrapolation in Looped Transformers for variable-length algorithmic tasks. By analyzing the spurious correlation between sequence length and loop count, the authors propose training-time randomization of loop counts and RL-Halting as a learned schedule. Experiments on binary addition, Dyck-1, Unique Set, and Copy tasks show reduced out-of-distribution variance and improved accuracy-stability trade-offs, though suboptimal computations may persist. The work frames loop termination as a training-time design choice rather than purely inference-time optimization.

looped transformers · length generalization · stochastic stopping · rl-halting · out-of-distribution variance

Read original →

Exploration and Online Transfer with Behavioral Foundation Models

arXiv cs.AI · Louis Bagot, Mathieu Lefort, Laëtitia Matignon · 2026-06-29

The paper introduces online transfer for zero-shot reinforcement learning (RL), addressing the limitation of offline reward specification in Behavioral Foundation Models (BFMs). By framing the problem as a bandit-like exploration-exploitation task, the authors propose using BFMs to generate exploration policies, with rewards observed through environment interactions. A method inspired by Upper Confidence Bound is derived for linear reward approximation, focusing on eigenvalue minimization of an uncertainty matrix for exploration. The framework is validated on a simple environment, demonstrating its feasibility for online transfer.
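
The exploration idea can be sketched with the classic linear-bandit confidence bonus that UCB-style methods build on. This is not the authors' algorithm: the two-dimensional feature vectors, the identity prior on the uncertainty matrix, and the scale `beta` are assumptions chosen for illustration.

```python
import math


def mat_vec(M, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def sherman_morrison(A_inv, x):
    """Return (A + x x^T)^{-1} given a symmetric A^{-1} (rank-one update)."""
    Ax = mat_vec(A_inv, x)
    denom = 1.0 + sum(xi * ai for xi, ai in zip(x, Ax))
    n = len(x)
    return [[A_inv[i][j] - Ax[i] * Ax[j] / denom for j in range(n)] for i in range(n)]


def ucb_bonus(A_inv, x, beta=1.0):
    """Exploration bonus beta * sqrt(x^T A^{-1} x) for feature vector x."""
    return beta * math.sqrt(sum(xi * ai for xi, ai in zip(x, mat_vec(A_inv, x))))


# Start from an identity prior, then observe a reward along direction [1, 0].
A_inv = [[1.0, 0.0], [0.0, 1.0]]
before = ucb_bonus(A_inv, [1.0, 0.0])
A_inv = sherman_morrison(A_inv, [1.0, 0.0])
after = ucb_bonus(A_inv, [1.0, 0.0])   # uncertainty shrinks where we explored
other = ucb_bonus(A_inv, [0.0, 1.0])   # unexplored direction is unchanged
```

Directions that have been tried get a smaller bonus, so a UCB-style rule is steered toward policies whose features are still uncertain.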

zero-shot transfer · behavioral foundation models · online reinforcement learning · exploration-exploitation · upper confidence bound

Read original →

First-Order Temporal Logic Tensor Networks

arXiv cs.AI · Luca Boscarato, Ivan Donadello, Alessandro Artale, Marco Montali · 2026-06-29

The authors introduce First-Order Temporal Logic Tensor Networks (FOT-LTN), extending Logic Tensor Networks (LTN) to incorporate temporal reasoning. FOT-LTN combines First-Order Linear Temporal Logic syntax with LTN's fuzzy semantics, supporting temporal operators, quantifiers, and full differentiability. Evaluated on synthetic temporal knowledge graph completion tasks, FOT-LTN outperforms dedicated neural baselines, demonstrating its efficacy in handling dynamic object properties and relations.

first-order temporal logic · logic tensor networks · temporal knowledge graphs · neuro-symbolic ai · differentiable reasoning

Read original →

RiverONE: Generating Knowledge-Intensive VLM by Simulated Quantum Machines

arXiv cs.AI · Xindian Ma, Xinyu Long, Yefei Zhang, Yanchen Liu · 2026-06-29

RiverONE introduces a lightweight vision-language model (VLM) for quantum calibration plot understanding, leveraging simulated quantum computation during construction to generate structured parameters. The model combines a specialized visual encoder with an InternVL-based language backbone, materializing quantum-generated parameters as classical tensors post-training for GPU inference. At 1.9B parameters, RiverONE achieves at least 95% of the performance of NVIDIA Ising Calibration 1 (19B+ parameters) on target tasks, demonstrating simulated quantum computation's utility for building compact, knowledge-intensive VLMs.

vision-language model · quantum computation · parameter compression · quantum calibration · internvl

Read original →

DuoMem: Towards Capable On-Device Memory Agents via Dual-Space Distillation

arXiv cs.AI · Peyman Hosseini, Ondrej Bohdal, Ahmed Alajrami, Andrea Maracani · 2026-06-29

DuoMem introduces a dual-space distillation framework for deploying capable memory-augmented agents on resource-constrained devices, transferring procedural problem-solving abilities from a large teacher model to a compact student model. The method combines context-space distillation (prepending teacher-generated procedural memories) and parameter-space distillation (fine-tuning LoRA adapters on successful trajectories). On ALFWorld, DuoMem boosts a 4B-parameter model from 4.3% to 77.9% success rate (vs. 87.1% for a 72B teacher), with <10M added parameters and 3× faster inference, enabling real-time edge deployment.

dual-space distillation · memory-augmented agents · lora adapters · alfworld · procedural memories

Read original →

SWE-Together: Evaluating Coding Agents in Interactive User Sessions

arXiv cs.AI · Yifan Wu, Zhuokai Zhao, Songlin Li, Ho Hin Lee · 2026-06-29

SWE-Together introduces a multi-turn benchmark for evaluating coding agents in interactive user sessions, addressing the limitations of static benchmarks. The benchmark reconstructs 109 repository-level tasks from 11,260 recorded sessions, ensuring recoverable repository states, clear user goals, and observable outcomes. A reactive LLM-based user simulator preserves original user intents and provides feedback as needed. Evaluation metrics include final repository correctness and the number of corrective feedback turns. Experiments with frontier coding agents reveal that stronger agents achieve higher success rates with fewer interventions, indicating an enhanced user experience.

multi-turn benchmark · coding agents · repository-level tasks · llm-based user simulator · corrective feedback turns

Read original →

SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

arXiv cs.AI · Jian Zhu, Yuzheng Zhang, Zeyao Ma, Bohan Zhang · 2026-06-29

The authors introduce SpreadsheetBench 2, a workflow-level benchmark for evaluating spreadsheet agents on end-to-end business tasks, addressing limitations of prior benchmarks focused on isolated operations. The benchmark comprises 321 tasks derived from authentic business data (financial reports, corporate filings), averaging 11.8 worksheets and 593.5 cell modifications per instance, with expert validation. Evaluating eight frontier LLMs and commercial spreadsheet products reveals significant reliability gaps: the best model achieves 34.89% overall accuracy, with debugging accuracy dropping to 12.00%, primarily due to inadequate spreadsheet inspection and target-cell selection errors.

spreadsheet agents · workflow-level benchmark · multi-sheet dependencies · llm evaluation · business automation

Read original →

Exploiting Local Flatness for Efficient Out-of-Distribution Detection

arXiv cs.AI · Seonghwan Park, Hyunji Jung, Dongyeop Lee, Namhoon Lee · 2026-06-29

The paper introduces Fold, a computationally efficient out-of-distribution (OOD) detector that exploits local loss-landscape flatness differences between in-distribution (ID) and OOD data. The method leverages feature Hessian curvature and partial feature normalization, with AutoFold enabling self-supervised calibration via pseudo-OOD samples generated through ID logit masking. Experiments on OOD benchmarks demonstrate Fold's superiority, achieving a 1.63% AUROC improvement and 2.30% FPR95 reduction while maintaining forward-pass efficiency. Theoretical analysis confirms the observed curvature discrepancy between ID and OOD inputs.

out-of-distribution detection · hessian curvature · loss-landscape flatness · self-supervised calibration · logit masking

Read original →

Data-Efficient Multimodal Alignment for Histopathology-based Molecular Prediction

arXiv cs.AI · Dominik Winter, Dominik Vonficht, Loïc Le Bescond, Christian Gebbe · 2026-06-29

The authors introduce a data-efficient multimodal alignment framework for predicting molecular pathways from H&E-stained histopathology images using frozen foundation models. By training a lightweight alignment module via contrastive learning on a multi-cancer cohort (N=1,720), they enable open-vocabulary molecular prompting of H&E slides with gene-set signatures, achieving a 25-fold improvement in retrieval over baselines. Morphologically grounded programs (e.g., cell-cycle, immune-related) show high predictability (R^2>0.5), while pathways lacking morphological footprints remain challenging. Clinical validation on the POSEIDON trial demonstrates accurate prediction of NSCLC subtypes and tumor microenvironment archetypes, with generalization across unseen cohorts and data-efficient domain adaptation.

multimodal alignment · contrastive learning · open-vocabulary prompting · histopathology · molecular pathways

Read original →

SAGA: Scene-Aware, Goal-Evolving Agents for Long-Horizon CivRealm Strategy Planning

arXiv cs.AI · Tianyu Jin, Shuo Chen, Yida Wang, Liuyu Xiang · 2026-06-29

SAGA introduces a multi-agent LLM framework for long-horizon strategy planning in complex games, addressing three systematic failures: scene blindness, context overflow, and shallow cross-game learning. The method combines a Map-Semantic Scene Graph for spatial reasoning, a Tool-Augmented Planner for domain-specific state management, and a Dual-Horizon Feedback Loop for strategic evolution. Evaluated on FreeCiv, SAGA achieves the highest mean civilization score with 27% fewer output tokens, outperforming baselines in infrastructure construction and cross-game performance, with each component independently contributing to its advantage.

multi-agent framework · scene graph · tool-augmented planner · dual-horizon feedback · sparse reward

Read original →

HippoSpark: An On-Demand Experience System for LLM Reasoning

arXiv cs.AI · Jingyao Liu, Danling Meng, Chen Huang, Yukun Yan · 2026-06-29

HippoSpark introduces a state-level experience system for LLM reasoning that retrieves on-demand guidance tailored to immediate reasoning bottlenecks, contrasting with task-level approaches that assume universal solution patterns. The method dynamically provides precise, state-specific experience during problem-solving, addressing local failures in complex reasoning. Evaluations across mathematical, scientific, and programming benchmarks demonstrate consistent improvements over standard prompting and task-level baselines, highlighting the importance of actionable guidance at critical reasoning states.

llm reasoning · experience system · state-level retrieval · reasoning bottlenecks · on-demand guidance

Read original →

Latent-CURE for Breast Cancer Diagnosis

arXiv cs.AI · Weiyi Zhao, Xiaoyu Tan, Lu Gan, Liang Liu · 2026-06-29

Latent-CURE introduces a novel breast cancer diagnosis framework leveraging asymmetric weighted chain-of-thought methodology for latent space reasoning. The approach constructs implicit reasoning trajectories, forcing sequential inference of BI-RADS morphological descriptors before final diagnosis. A dual-asymmetric optimization strategy dynamically adjusts margins and weights to prevent rare malignant features from being overshadowed by common benign patterns. Evaluations demonstrate that this knowledge-injected method provides transparent clinical evidence while maintaining robust diagnostic accuracy in imbalanced medical cohorts.

chain-of-thought · latent space · bi-rads · asymmetric optimization · malignant descriptors

Read original →

EVAF: A Test-Retest Protocol for Selective Parametric Consolidation

arXiv cs.AI · Haoliang Han · 2026-06-29

The paper introduces EVAF (Echo-Valence Attractor Field), a mechanism for selective parametric consolidation in language agents, alongside a test-retest protocol to measure consolidation under interference. EVAF employs gated LoRA updates to preferentially consolidate high-valence, high-surprise experiences while maintaining factual memory through a routed retrieval path. Experiments on GPT-2 and TinyLlama demonstrate EVAF's superiority over baselines in behavioral persistence (post-interference), with reduced parameter drift and cross-persona contamination, supporting a distinction between memory access and internalization.

parametric consolidation · lora · test-retest protocol · memory routing · valence-attractor

Read original →

A causal modeling perspective on decision theory

arXiv cs.AI · Arvid Sjölander · 2026-06-29

The paper introduces a formal framework for decision theory using nonparametric structural equation models (NPSEMs) to unify representations of agents, counterfactuals, and causal relationships. It proposes personal decision theory, where agents maximize subjective counterfactual utility, and establishes a performance metric based on hypothetical interventions. Under specific assumptions, the theory proves optimal for this metric, demonstrated through analyses of the smoking lesion problem and Newcomb's problem. The approach aims to clarify modeling language and evaluative criteria in decision theory.

nonparametric structural equation models · counterfactual utility · decision theory · causal inference · newcomb's problem

Read original →

Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation

arXiv cs.AI · Hong Chen, Daqi Liu, Zehan Zhang, Haiguang Wang · 2026-06-29

The paper introduces SWAM (Spatial-perceiving World Action Model), a task-centric joint observation-action generation framework for embodied navigation that addresses limitations of verification-centric planners. SWAM performs single-pass inference to simultaneously generate intermediate RGB-D sequences and action trajectories from start and goal RGB observations, leveraging depth pseudo-labels during training but requiring only monocular RGB at inference. Key innovations include a visual-guided action refinement module and trajectory-scale regularization loss for motion-visual alignment. Experiments demonstrate SWAM's superiority over two-stage planners in success rate (quantitative metrics unspecified), trajectory accuracy, and inference efficiency, with robust zero-shot generalization.

embodied navigation · world model · rgb-d generation · action trajectory · zero-shot generalization

Read original →

CW-B: Class Weighted Boosting Framework for Imbalance Resilient Multi Class Cardiac Phenotyping

arXiv cs.AI · Sijia Li, Xiaoyu Tan, Chen Zhan, Yuanji Ma · 2026-06-29

The paper introduces CW-B, a class-weighted XGBoost framework for robust multi-class cardiac discharge phenotyping under real-world data imbalance and missingness. The method integrates fold-specific class-balanced instance weighting, missingness-indicator augmentation, and classwise error auditing to prioritize high-risk phenotypes while maintaining interpretability. Evaluation via five-fold stratified cross-validation shows CW-B outperforms tree-based, ensemble, and neural baselines in Accuracy (exact values unspecified), Macro-F1, Balanced Accuracy, and Prioritized F1 metrics.
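
Two of the ingredients named above can be sketched in a few lines. The weighting formula `n / (K * n_c)` is the common "balanced" heuristic and is an assumption here, since the summary does not give CW-B's exact formula; the helper names and toy data are hypothetical. To be fold-specific, the weights would be computed on each training fold only.

```python
from collections import Counter


def balanced_instance_weights(labels):
    """Weight each sample by n / (K * n_c): rarer classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[y]) for y in labels]


def add_missing_indicators(rows):
    """Append one 0/1 flag per feature marking whether the value is missing (None)."""
    return [list(r) + [1 if v is None else 0 for v in r] for r in rows]


# Toy training fold: three samples of class "A", one of class "B".
labels = ["A", "A", "A", "B"]
weights = balanced_instance_weights(labels)

# Toy feature rows with a missing second feature in the first row.
augmented = add_missing_indicators([[1.0, None], [2.0, 3.5]])
```

The resulting `weights` list could then be passed as per-sample weights to a gradient-boosting trainer; the rare class "B" carries three times the weight of each "A" sample.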

class-weighted boosting · cardiac phenotyping · xgboost · missingness-indicator augmentation · classwise error auditing

Read original →

Semi-Supervised Sound Event Detection with Conditional Mixup and Embedding-Level Contrastive Loss

arXiv cs.AI · Nian Shao, Xian Li, Xiaofei Li · 2026-06-29

The paper improves semi-supervised sound event detection (SED) by introducing conditional mixup and embedding-level contrastive loss within the ATST-SED framework. The method resolves the conflicting roles of mixup in pseudo-label learning (composition) and contrastive learning (perturbation) by unifying them, while leveraging unlabeled data more effectively through self-supervised contrastive objectives. The model achieves state-of-the-art performance on DESED validation with 0.645 PSDS1 and 0.822 PSDS2 scores.
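
For readers unfamiliar with mixup, here is a minimal sketch of the plain (unconditional) version: two examples and their labels are blended with a coefficient drawn from a Beta distribution. The conditioning scheme that makes the paper's variant "conditional" is not described in the summary and is not reproduced; the feature vectors, labels, and `alpha` value are made up.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two (features, label-vector) pairs with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


rng = random.Random(0)  # seeded for reproducibility
x_mix, y_mix, lam = mixup([1.0, 2.0], [1.0, 0.0], [3.0, 4.0], [0.0, 1.0], rng=rng)
```

When mixup is used as composition, the blended label `y_mix` is the training target; when it is used as perturbation, the mixed input is treated as a noisy view of one source. That is the tension the summary refers to.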

sound event detection · semi-supervised learning · contrastive loss · conditional mixup · audio foundation models

Read original →

LLM-based Multimodal Personality Recognition via Facial Action Unit-Text Semantic Fusion

arXiv cs.AI · Tianyi Zhang, Wei Shan, Yuan Zong, Tianhua Qi · 2026-06-29

The paper proposes an LLM-based multimodal framework for personality recognition in asynchronous video interviews (AVIs) by semantically fusing facial action units (AUs) with textual responses. AU sequences are converted to textual descriptions and fused with participant responses via an LLM, followed by a lightweight regression head for continuous personality scoring. On the AVI-6 benchmark, the method achieves lower prediction errors and stronger human-score correlations than baselines, with AU-derived semantics providing complementary non-verbal cues. The decoupled architecture enhances training stability and interpretability.

personality recognition · facial action units · multimodal fusion · asynchronous video interviews · llm-based framework

Read original →

Critical Interval MSE: Toward Reliable Offline Validation for Robot Manipulation Policies

arXiv cs.AI · Haoxu Huang, Tongsam Zheng, Yifan Chen, Jiacheng You · 2026-06-29

The paper introduces Critical Interval MSE (CI-MSE), an offline validation metric for robot manipulation policies that improves correlation with real-world performance. CI-MSE focuses error computation on task-critical segments and incorporates action-alignment procedures to better reflect rollout behavior. Experiments show CI-MSE achieves a Spearman's rank correlation of -0.87 (vs. raw MSE's -0.61) with real-world performance, demonstrating robustness across hyperparameters and distribution shifts.
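
The two quantities in this summary can be illustrated with a small sketch: an MSE restricted to chosen time intervals, and a Spearman rank correlation between a validation metric and real-world success. How the paper picks critical intervals and aligns actions is not described here, so the intervals, scalar actions, and numbers below are hypothetical, and the Spearman helper assumes no tied values.

```python
def interval_mse(pred, target, intervals):
    """MSE over only the time steps inside the given [start, end) intervals."""
    idx = [t for start, end in intervals for t in range(start, end)]
    return sum((pred[t] - target[t]) ** 2 for t in idx) / len(idx)


def spearman(xs, ys):
    """Spearman rank correlation for sequences without ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Error is computed only on step 2, the hypothetical critical segment.
err = interval_mse([0.0, 0.0, 5.0, 0.0], [0.0, 0.0, 3.0, 0.0], [(2, 3)])

# Lower validation error paired with higher success gives a negative correlation.
rho = spearman([0.1, 0.2, 0.3], [0.9, 0.6, 0.2])
```

A strongly negative value is the desirable outcome for an error metric, which is why -0.87 indicates a better validation signal than -0.61.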

offline validation · robot manipulation · critical intervals · spearman correlation · action-alignment

Read original →

Child-Centric Voice Anonymization in Single and Multi-Speaker Speech via Domain-Adapted SSL Models

arXiv cs.AI · Pranav Tushar, Xiao Xiao Miao, Rong Tong · 2026-06-29

The paper contributes a child-centric voice anonymization system by adapting self-supervised learning (SSL) to child speech domains. Using the MyST corpus for domain adaptation, the method combines target speaker extraction with anonymization for both single-speaker and two-speaker conditions. Results show improved intelligibility (↑3.2dB SNR) and perceptual quality (MOS↑0.8) while maintaining 98% privacy protection, demonstrating the necessity of child-specific adaptation in speech anonymization pipelines.

voice anonymization · self-supervised learning · domain adaptation · child speech · speaker extraction

Read original →

SABER-Math: Automated Benchmark for Information Retrieval Evaluation in Mathematics

arXiv cs.AI · Nikolay Georgiev, Maria Drencheva, Kseniia Ibragimova, Ivo Petrov · 2026-06-29

SABER-Math introduces the first fully automated benchmark for evaluating mathematical information retrieval (IR), addressing the lack of fine-grained mathematical relevance in existing benchmarks. The method constructs reranking tasks from 283K high-school-level math problems by (i) extracting solution summaries and topics via LLMs, (ii) identifying relevant documents using ontology-based and lexical similarities, and (iii) generating fine-grained relevance ratings through an LLM preference tournament. Evaluation of lexical retrievers, math-specific systems, and embedding models reveals that embedding models outperform classical and specialized baselines but struggle with symbol-heavy domains like Algebra and Calculus. Results demonstrate that general-purpose IR benchmarks fail to predict mathematical performance, underscoring the need for domain-specific evaluation.

information retrieval · embedding models · ontology-based similarity · lexical retrievers · reranking tasks

Read original →

Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models

arXiv cs.AI · Siyao Chen, Jiakang Yuan, Jiaxin Wang, Tao Chen · 2026-06-29

The authors introduce T^2VLA, a test-time reinforcement learning framework for Vision-Language-Action Models (VLAs) that enables self-bootstrapping policy improvement without external reward signals. T^2VLA leverages trajectory-level similarity to high-confidence expert demonstrations as intrinsic reward and employs a Confidence-Driven Dual Expert Bootstrapping mechanism to balance exploration and training stability. Evaluations on LIBERO and RoboTwin benchmarks demonstrate that T^2VLA consistently outperforms supervised baselines, approaching oracle RL performance with ground-truth rewards, while adapting to diverse VLA paradigms including OpenVLA-OFT and the pi series.

vision-language-action models · test-time reinforcement learning · confidence-driven bootstrapping · intrinsic reward · self-bootstrapping

Read original →

SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing

arXiv cs.AI · Jiacheng Zhang, Haoyu He, Sen Zhang, Shen Wang · 2026-06-29

SafePyramid introduces a hierarchical benchmark for in-context policy guardrailing, comprising 1,000 multi-turn conversations across 10 domains and 3,000 application-specific policies with 61,699 distinct natural-language rules. The benchmark evaluates three difficulty levels: individual-rule understanding (L0), reasoning over rule dependencies (L1), and adaptation to novel policy frameworks (L2). A rigorous multi-stage pipeline ensures benchmark quality. Evaluation of 10 frontier LLMs and 5 policy-configurable guardrails reveals significant challenges: GPT-5.5 achieves exact identification of violated rules in only 54.0%, 35.3%, and 12.9% of cases for L0, L1, and L2, respectively. These results underscore the need for improved in-context policy guardrails.
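
One plausible reading of "exact identification" is a strict set match between predicted and gold violated rules per conversation. The sketch below implements that reading; it is an assumption rather than the benchmark's documented scorer, and the rule identifiers are invented.

```python
def exact_rule_match_rate(predicted, gold):
    """Fraction of cases where the predicted rule set equals the gold set exactly."""
    hits = sum(set(p) == set(g) for p, g in zip(predicted, gold))
    return hits / len(gold)


# Three hypothetical conversations: order does not matter, but a missing rule fails.
predicted = [["r1", "r2"], ["r3"], []]
gold = [["r2", "r1"], ["r3", "r4"], []]
rate = exact_rule_match_rate(predicted, gold)
```

Under such a metric a partially correct answer scores zero, which helps explain why rates can fall sharply as rule dependencies grow from L0 to L2.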

in-context policy guardrailing · multi-turn conversations · rule dependencies · policy frameworks · natural-language rules

Read original →

LWDrive: Layer-Wise World-Model-Guided Vision-Language Model Planning for Autonomous Driving

arXiv cs.AI · Chen Yang, Yuhao Wei, Ze Xu, Ziheng Zou · 2026-06-29

The paper introduces LWDrive, a vision-language model (VLM) framework for autonomous driving that refines coarse trajectories through layer-wise world-model guidance. The method uses a Foresight Cascade Planner (FCP) to expand and refine candidate trajectories by integrating multi-layer VLM features, historical states, Action-Query representations, and Bird's-Eye-View (BEV) features, while preserving high-level driving intentions. Experiments demonstrate LWDrive's effectiveness, achieving scores of 92.0 on NAVSIM and 89.6 on NAVSIM-v2 benchmarks.

vision-language model · autonomous driving · trajectory refinement · bird's-eye-view · foresight cascade planner

Read original →

Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency

arXiv cs.AI · Nisarg A. Patel · 2026-06-29

The study introduces clinical reasoning graphs, a structured evaluation framework for LLM diagnostic reasoning, using a domain-grounded ontology with 5 node types and 7 edge types. Analyzing 750 diagnostic traces from five LLMs across 50 clinical cases, the authors find no evidence of stable reasoning schemas, with graph similarity nearly identical for correct (0.488) and incorrect (0.484) diagnoses. Structured reflection prompts increase feature analysis (+33%) but not cross-case consistency, revealing diagnostic competence without schema-scale reasoning consistency.

clinical reasoning graphs · diagnostic schemas · structured reflection prompting · graph similarity · domain-grounded ontology

Read original →

AI Training Manager: Bounded Closed-Loop Control of Adaptive Training Recipes

arXiv cs.AI · Anjali Rao, Nikhil Kamalkumar Advani · 2026-06-29

The paper introduces AI Training Manager, a bounded LLM-based supervisory controller for adaptive machine learning training that addresses mid-run failures like overfitting and loss imbalance. The system operates via a schema-conditioned interface, reading structured telemetry, auditing constrained actions, and returning validated parameter updates (e.g., learning rate, regularization). Evaluations on TinyStories show a 60% lower validation loss than the baseline, with asynchronous update capability. In robotic RL, it mitigates exploration issues. Results demonstrate LLMs can complement conventional optimizers with interpretable, multi-axis control.

adaptive training · schema-conditioned interface · loss imbalance · bounded control · telemetry snapshots

Read original →

ARKD: Adaptive Reinforcement Learning-Guided Bidirectional KL Divergence Distillation for Text Generation

arXiv cs.AI · Zilong Liu, Xuewen Zhang, Jinrui Xing, Juyi Qiao · 2026-06-29

The authors propose Adaptive Reinforcement Learning-Guided Bidirectional KL Divergence Distillation (ARKD), a novel framework for text generation that dynamically balances forward and reverse KL divergence (FKL/RKL) objectives. ARKD employs a reinforcement learning policy network to adaptively weight FKL and RKL based on teacher-student distributional characteristics, optimizing both principal and long-tail probability modeling. The method achieves dual distribution alignment through reward-guided optimization. Experimental results demonstrate consistent improvements, with ARKD surpassing greedy heuristics by 0.4-0.6 points on ROUGE-L and BERTScore metrics across diverse benchmarks.
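
The forward/reverse trade-off can be made concrete with a small sketch over discrete next-token distributions. A fixed weight `w` stands in for the learned policy network, which is not reproduced; the two toy distributions and the smoothing constant `eps` are assumptions.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def bidirectional_kl(teacher, student, w):
    """Weighted mix of forward KL(teacher || student) and reverse KL(student || teacher)."""
    fkl = kl(teacher, student)  # mass-covering: penalizes missing teacher modes
    rkl = kl(student, teacher)  # mode-seeking: penalizes mass where the teacher has little
    return w * fkl + (1 - w) * rkl


teacher = [0.5, 0.5]
student = [0.9, 0.1]
loss_forward_only = bidirectional_kl(teacher, student, w=1.0)
loss_reverse_only = bidirectional_kl(teacher, student, w=0.0)
loss_balanced = bidirectional_kl(teacher, student, w=0.5)
```

Because the two divergences differ for the same pair of distributions, the choice of weight changes which mismatches the student is pushed to fix.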

knowledge distillation · kl divergence · reinforcement learning · text generation · distribution alignment

Read original →

RoAd-RL: A Unified Library and Benchmark for Robust Adversarial Reinforcement Learning

arXiv cs.AI · Adithya Mohan, Daniel Kriegl, Torsten Schön · 2026-06-29

The authors introduce RoAd-RL, an open-source benchmarking framework for robust adversarial reinforcement learning, addressing fragmentation in implementations and evaluation protocols. The library provides unified abstractions for policies, attacks, defenses, and robustness metrics, integrating with Stable-Baselines3 and Gymnasium. Evaluation of DQN, PPO, and SAC agents across 192 attack-defense configurations in LunarLander and Highway-v0 reveals environment-dependent robustness variations, with temporal smoothing emerging as a consistently effective defense while some common defenses prove counterproductive.

adversarial reinforcement learning · robustness metrics · stable-baselines3 · temporal smoothing · gymnasium

Read original →

SUMO: Segment and Track Any Motion with Nonlinear State Space Models

arXiv cs.AI · Kexin Tian, Sixu Li, Keshu Wu, Yang Zhou · 2026-06-29

SUMO introduces a zero-shot, training-free framework for Visual Object Tracking (VOT) and Moving Object Segmentation (MOS) by integrating nonlinear dynamics with vision-based segmentation. The method employs a nonlinear State Space Model (SSM) inspired by robotics, a Selective Unscented Filter (SUF) for state estimation with multi-source prediction fusion, and a memory selection mechanism. Experiments demonstrate state-of-the-art performance on VOT and MOS tasks.

visual object tracking · moving object segmentation · nonlinear state space model · selective unscented filter · zero-shot learning

Read original →

Beyond Triplet Plausibility: Relation Set Completion in Knowledge Graphs

arXiv cs.AI · Zihao Zheng, Borui Cai, Yao Zhao, Keshav Sood · 2026-06-29

The paper introduces relation set completion (RSC), a novel knowledge graph completion task addressing entity-relation compatibility gaps beyond traditional triplet prediction. The authors propose Relation Set Embedding (RelSetE), which models latent patterns in observed entity relations to infer missing compatible relations. Evaluated on three derived benchmark datasets, RelSetE demonstrates effective capture of entity-relation compatibility patterns, outperforming baselines in missing relation inference.

knowledge graph completion · relation set embedding · entity-relation compatibility · link prediction · triplet prediction

Read original →

Exploring Motivations for Algorithm Mention in the Domain of Natural Language Processing: A Deep Learning Approach

arXiv cs.AI · Yuzhuo Wang, Yi Xiang, Chengzhi Zhang · 2026-06-29

The study introduces a sentence-level framework for analyzing motivations behind algorithm mentions in NLP academic papers, focusing on description, use, comparison, and improvement. Using manual annotation and machine learning, algorithm entities and related sentences were identified, with motivation classification performed via pretrained models and data augmentation. Results indicate that deep learning models with augmented data outperform traditional methods in motivation classification. Findings reveal that direct use is the most common motivation (over 50%), while improvement is the least frequent. Over time, use motivations have replaced description motivations, and motivation diversity has increased, though individual algorithms show declining motivation type counts.

natural language processing · algorithm entities · motivation classification · data augmentation · deep learning models

Read original →

MATCH: Modulating Attention via In-Context Retrieval for Long-Context Transformers

arXiv cs.AI · Linrui Ma, Chun Hei Lo, Xinyu Wang, Peng Lu · 2026-06-29

MATCH introduces a scalable framework for enhancing sparse-attention transformers by dynamically integrating in-context information via efficient retrieval. The method modulates attention mechanisms without rigid structural constraints, addressing the quadratic cost of traditional attention while preserving long-range recall capabilities. Empirical results demonstrate significant performance improvements on synthetic and natural-language tasks, validating MATCH as an effective approach for maintaining efficiency in long-context scenarios.

sparse-attention · in-context retrieval · long-context transformers · attention modulation · efficient retrieval

Read original →

Neural Procedural Memory: Empowering LLM Agents with Implicit Activation Steering

arXiv cs.AI · Chengfeng Zhao, Yuqiao Tan, Shizhu He, Yequan Wang · 2026-06-29

The paper introduces Neural Procedural Memory (NPM), a training-free framework for enhancing LLM agents through implicit activation steering rather than explicit textual instructions. NPM distills procedural skills from contrastive experiences into activation-space steering vectors, directly triggering task-relevant neural mechanisms. Evaluations on four agent benchmarks show NPM matches explicit-instruction baselines while combining both approaches yields complementary robustness. Representational analyses reveal steering vectors encode consistent task logic with organized activation-space structures, suggesting implicit steering as a viable agent memory mechanism.
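
A common way to build a steering vector from contrastive examples is the difference of mean activations, added back to a hidden state at inference time. The sketch below shows that generic technique on toy two-dimensional activations; whether NPM distills its vectors this way is not stated in the summary, and the `scale` parameter is an assumption.

```python
def mean_vec(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(success_acts, failure_acts):
    """Difference of means between activations from successful and failed runs."""
    return [p - q for p, q in zip(mean_vec(success_acts), mean_vec(failure_acts))]


def steer(hidden, vec, scale=1.0):
    """Shift a hidden state along the steering direction."""
    return [h + scale * v for h, v in zip(hidden, vec)]


# Toy activations standing in for one layer's residual stream.
vec = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
steered = steer([0.0, 0.0], vec, scale=0.5)
```

No weights are updated in this scheme, which is consistent with the training-free framing: only the forward pass is modified.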

neural procedural memory · activation steering · llm agents · retrieval-augmented generation · contrastive experiences

Read original →

Experience Graphs: The Data Foundation for Self-Improving Agents

arXiv cs.AI · Gang Liao, Yujia He, Abdullah Ozturk, Zhouyang Li · 2026-06-29

The paper introduces Trellis, a data foundation that treats experience graphs as first-class database objects for long-horizon agentic tasks like code generation and hardware design. Experience graphs capture structured search trajectories (artifacts, tool outputs, rewards, lineage) typically discarded as ephemeral logs. Trellis reformulates agent operations as database patterns: frontier selection as queries, cross-session reuse as graph retrieval, and training-data extraction as materialized views. Evaluated on Meta's KernelEvolve, it reaches the target speedup 10x faster at 52% lower token cost by enabling durable, queryable experience graphs that transform inference-time search into institutional assets.
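
To make the "agent operations as database patterns" idea tangible, here is a toy sketch using Python's built-in `sqlite3`. The schema, table name, and rows are hypothetical and are not Trellis's actual data model; the point is only that frontier selection and lineage lookup can be phrased as ordinary queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (
    id       INTEGER PRIMARY KEY,
    parent   INTEGER REFERENCES node(id),
    artifact TEXT,
    reward   REAL
);
INSERT INTO node VALUES
    (1, NULL, 'kernel_v0', 1.00),
    (2, 1,    'kernel_v1', 1.40),
    (3, 1,    'kernel_v2', 1.10),
    (4, 2,    'kernel_v3', 1.90);
""")

# Frontier selection as a query: unexpanded nodes (no children), best reward first.
frontier = conn.execute("""
    SELECT n.id, n.artifact, n.reward
    FROM node n
    WHERE NOT EXISTS (SELECT 1 FROM node c WHERE c.parent = n.id)
    ORDER BY n.reward DESC
    LIMIT 2
""").fetchall()

# Lineage of the best frontier node, walked back to the root with a recursive CTE.
lineage = conn.execute("""
    WITH RECURSIVE anc(id, parent, artifact) AS (
        SELECT id, parent, artifact FROM node WHERE id = ?
        UNION ALL
        SELECT n.id, n.parent, n.artifact FROM node n JOIN anc a ON n.id = a.parent
    )
    SELECT artifact FROM anc
""", (frontier[0][0],)).fetchall()
```

Because the graph lives in a database rather than a log file, the same rows can later be reused across sessions or exported as training data.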

experience graphs · agentic tasks · materialized views · graph retrieval · time-travel query

Read original →

Dual-Flow Reinforcement Learning with State-Aware Exploration

arXiv cs.AI · Qijun Li, Zheng Fu, Qi Song, Yifei He · 2026-06-29

Dual-Flow RL introduces a unified actor-critic framework for complex continuous-control tasks, addressing challenges in multimodal action spaces and uncertain return distributions. The method jointly models continuous return distributions and multimodal policies using conditional flow matching (CFM), coupled with an Entropy-Covariance Exploration Regulator (ECER) for state-aware exploration. Experiments on DeepMind Control Suite and Humanoid-Bench demonstrate state-of-the-art performance, surpassing prior diffusion-based and flow-based methods.
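
Conditional flow matching is often trained with a straight-line path between a noise sample and a data sample, regressing a network onto the constant velocity along that path. The sketch below shows that common formulation with a stand-in predictor; the paper's exact path, conditioning, and network are not given in the summary, so `predict_velocity` and `cond` are placeholders.

```python
def cfm_pair(x0, x1, t):
    """Point on the linear path x_t = (1 - t) x0 + t x1 and its target velocity x1 - x0."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return xt, target


def cfm_loss(predict_velocity, x0, x1, t, cond):
    """Mean squared error between the predicted and target velocity at (x_t, t)."""
    xt, target = cfm_pair(x0, x1, t)
    v = predict_velocity(xt, t, cond)
    return sum((vi - ti) ** 2 for vi, ti in zip(v, target)) / len(target)


x0, x1 = [0.0, 0.0], [2.0, 4.0]  # noise sample and data sample (e.g. an action)
xt, target = cfm_pair(x0, x1, 0.25)

# Stand-in predictors: one that ignores its input, one that is exactly right.
loss_zero = cfm_loss(lambda xt, t, cond: [0.0, 0.0], x0, x1, 0.25, cond=None)
loss_oracle = cfm_loss(lambda xt, t, cond: [2.0, 4.0], x0, x1, 0.25, cond=None)
```

In an actor-critic setting, `cond` would typically carry the state, with `x1` being an action for the policy flow or a return sample for the critic flow.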

dual-flow rl · conditional flow matching · multimodal exploration · actor-critic framework · continuous-control

Read original →

How Far Can You Get Without a GPU? A Systematic Benchmark of Lightweight Hallucination Detection Across Question Answering, Dialogue, and Summarisation

arXiv cs.AI · Kriti Faujdar, Smit Kadvani · 2026-06-29

This work benchmarks lightweight, CPU-feasible hallucination detection methods using publicly available models, addressing resource constraints in AI deployment. Five methods—ROUGE-L, semantic similarity, BERTScore, a FEVER-trained DeBERTa-based NLI detector, and a similarity-NLI ensemble—are evaluated across QA, dialogue, and summarisation tasks on the HaluEval benchmark. Results show task-dependent performance: the ensemble excels in QA (F1 = 0.792, AUC-ROC = 0.873), NLI leads in dialogue (AUC-ROC = 0.713), and all methods degrade in summarisation (AUC-ROC = 0.469–0.574). Experiments were conducted on a standard laptop CPU, mapping the practical limits of GPU-free detection.
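To make the cheapest of the five methods concrete, here is a minimal sketch of a ROUGE-L F1 score computed from the longest common subsequence of whitespace tokens, with no external dependencies. The tokenization, the `flag_hallucination` helper, and its 0.5 threshold are illustrative assumptions, not settings reported in the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over lowercased whitespace tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def flag_hallucination(reference, candidate, threshold=0.5):
    """Hypothetical decision rule: low overlap with the source is suspicious."""
    return rouge_l_f1(reference, candidate) < threshold
```

A purely lexical score like this runs comfortably on a CPU, but it only measures word overlap with the source, not whether the source supports the claim.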

hallucination detection · nli detector · cpu-feasible · halueval benchmark · auc-roc

Read original →

Making Multimodal LLMs Reliable Chart Data Extractors: A Benchmark and Training Framework

arXiv cs.AI · Yuchen He, Peizhi Ying, Liqi Cheng, Kuilin Peng · 2026-06-29

The paper introduces a benchmark and training framework to enhance multimodal large language models (MLLMs) for chart data extraction, particularly from label-free charts. The authors propose a human-centered approach, modeling chart reading as a progressive learning process, and develop a 7B-parameter model that achieves state-of-the-art performance in numerical accuracy. Results demonstrate significant improvements over existing methods, with a user study confirming the model's effectiveness in mixed-initiative workflows.

multimodal llms · chart data extraction · progressive learning · numerical accuracy · mixed-initiative systems

Read original →

Accelerating Q-learning through Efficient Value-Sharing across Actions

arXiv cs.AI · Prabhat Nagarajan, Brett Daley, Martha White, Marlos C. Machado · 2026-06-29

The paper introduces the mean-expansion layer, a parameter-free addition to Q-network architectures that accelerates action-value learning in reinforcement learning. The layer shares values across actions within a state and transforms the problem into learning a lower-norm representation of action-values, rather than directly learning potentially large values. Applied to deep Q-networks and implicit quantile networks, the method improves aggregate performance across 57 Atari games, increases action gaps, and significantly reduces value overestimation.

mean-expansion layer · action-value learning · q-network architectures · implicit quantile networks · value overestimation

Read original →

The CRISTAL Method: Neurosymbolic analysis from AI-synthesized world models

arXiv cs.AI · Rafael Kaufmann, Felix Neubürger, Michael Walters, Thomas Kopinski · 2026-06-29

The CRISTAL Method introduces a neurosymbolic framework for automating complex analysis workflows, addressing challenges in domains like fundamental investment analysis with high uncertainty and subjective data. It combines statistical model synthesis, continuous learning, and active learning to build a dynamic, interpretable probabilistic program supporting Bayesian inference. The method leverages LLMs for code synthesis and refines its world model during analysis. Evaluated on a synthetic equities benchmark, CRISTAL achieves Bayes-optimal accuracy with 5 examples and a 5-second budget, outperforming state-of-the-art LLMs by a 60% accuracy margin.

neurosymbolic · probabilistic program · bayesian inference · active learning · llm synthesis

Read original →

Multi-Level Distributional Entropy for Explainable Network Intrusion Detection

arXiv cs.AI · Mohamed Aly Bouke, Md Shohel Sayeed, Swee-Huay Heng, Azizol Abdullah · 2026-06-29

The paper introduces Multi-Level Distributional Entropy (MDE), a framework for deriving interpretable entropy features from flow-level statistics in network intrusion detection. MDE operates at three levels: within-flow Gaussian differential entropy, cross-directional Jensen-Shannon divergence, and TCP flag-pattern Shannon entropy, requiring no raw packet data. Evaluated on NSL-KDD, CICIDS-2017, CICIDS-2018, and UNSW-NB15 benchmarks, entropy-only features achieve weighted F1 scores of 0.708-0.989, matching conventional features. Analysis reveals hidden failure modes, such as a detection rate drop to 0.48 on CICIDS-2018 despite F1=0.74. SHAP analysis confirms reproducible entropy attributions (Spearman rho=0.80-0.95).
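The three entropy levels named above correspond to standard quantities, sketched below using only the standard library. These are the textbook definitions; how the paper maps flow statistics onto them (which fields, which binning, which log base) is not stated in the summary, so the function names and inputs here are assumptions.

```python
import math


def gaussian_differential_entropy(variance):
    """Differential entropy of a Gaussian in nats: 0.5 * ln(2 * pi * e * variance)."""
    return 0.5 * math.log(2 * math.pi * math.e * variance)


def shannon_entropy(counts):
    """Shannon entropy in bits of a histogram, e.g. counts of TCP flag patterns."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i = 0 contribute nothing; y_i > 0 whenever x_i > 0 here.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With base-2 logarithms the Jensen-Shannon divergence is bounded between 0 (identical distributions) and 1 (disjoint support), which makes it convenient as a forward-versus-backward asymmetry feature.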

entropy · intrusion detection · jensen-shannon divergence · shapley values · tcp flags

Read original →

What Drives the Inlier-Memorization Effect? A Theory of Outlier Detection via Early Training Dynamics

arXiv cs.AI · Kunwoong Kim, Dongha Kim · 2026-06-29

The paper presents a theoretical analysis of the inlier-memorization (IM) effect in unsupervised outlier detection, where deep models memorize inlier patterns earlier than outliers. Using a simple autoencoder framework, the authors characterize the emergence, strength, and persistence of IM under mild assumptions, linking these properties to data distribution and parameter initialization. Derived guidelines for enhancing IM—including data preprocessing and initialization schemes—achieve state-of-the-art performance on ADBench, providing a theoretical foundation for IM-based methods.

outlier detection · inlier-memorization · autoencoder · unsupervised learning · early training dynamics

Read original →

HERO: Improving the Reliability and Sensitivity of Generative Model Evaluation Using Historical Data

arXiv cs.AI · Xinrui Ruan, Zhenyu Zhao, Waverly Wei, Yueshan Zhang · 2026-06-29

HERO (History Enhanced RObust model evaluation) introduces a framework leveraging historical evaluation data to enhance generative model assessment reliability and sensitivity. By calibrating noisy silver labels against sparse gold annotations and anchoring estimators to high-precision covariates, HERO suppresses bias and reduces variance. Theoretical conditions for bias-variance reduction are established, with empirical validation through simulations and real-world benchmarking datasets. The method remains effective across evaluation tasks and partial historical labeler availability.

generative model evaluation · silver labels · gold annotations · bias-variance tradeoff · covariate anchoring

Read original →

FalconTrack: Photorealistic Auto-Labeled Perception and Physics-Aware Vision-Based Aerial Tracking

arXiv cs.AI · Yan Miao, Karteek Gandiboyina, Noah Giles, Hideki Okamoto · 2026-06-29

FalconTrack introduces a unified perception-and-tracking framework for vision-based aerial tracking in GPS-denied environments, leveraging photorealistic simulation for automated label generation and physics-aware tracking. The system employs a Gaussian Splatting simulator to isolate target Gaussians from short object videos, compositing them with randomized backgrounds to produce RGB, mask, class, and 6-DoF pose labels, generating 10k labeled images in under 20 minutes. A multi-head perception module trained with staged learning and reprojection consistency is fused with class-conditioned dynamics priors in an EKF for tracking. FalconTrack achieves 96-100% class accuracy in zero-shot sim-to-real transfer, maintains consistent performance in unseen scenes, and runs at 25 Hz with 100% success in real hardware closed-loop visual tracking.

gaussian splatting · sim-to-real transfer · 6-dof pose · ekf tracking · zero-shot learning

Read original →

Mandol: An Agglomerative Agent Memory System for Long-Term Conversations

arXiv cs.AI · Yuhan Zhang, Zhiyuan Guo, Ziheng Zeng, Wei Wang · 2026-06-29

Mandol introduces an agglomerative memory system for long-term conversational agents, addressing fragmentation and inefficiency in existing heterogeneous databases. The system features a hierarchical memory model with basic and abstract layers uniformly represented as structured semantic graphs, a fused semantic data structure (SemanticMap + SemanticGraph) enabling hybrid retrieval, and a quantitative query mechanism with adaptive routing and token-constrained context generation. Evaluated on LoCoMo and LongMemEval benchmarks, Mandol achieves superior accuracy, 5.4x faster retrieval, and 4.8x faster insertion under 10 QPS load while maintaining low latency on consumer hardware.

agglomerative memory · semantic graph · hybrid retrieval · quantitative query · long-term conversation

Read original →

Towards Generalizable and Evidential Nuclear Magnetic Resonance-Based Molecular Structure Elucidation via Large Language Model Agent

arXiv cs.AI · Zheng Fang, Chen Yang, Yusen Tan, Yunpeng Zhao · 2026-06-29

NMRAgent introduces a novel approach to molecular structure elucidation by integrating large language models (LLMs) with specialized spectral analysis tools and chemical knowledge graphs. The agent mimics human deductive reasoning, processing NMR spectra and molecular formulas to plan elucidation, propose candidate structures, verify peak-atom consistency, and refine substructures through formula-aware fragment optimization. NMRAgent achieves a 46.5% improvement in top-1 accuracy and a 0.502 increase in Tanimoto similarity on a scaffold-split benchmark, demonstrating its efficacy with novel scaffolds. It successfully elucidated structures of unknown natural products and corrected literature misassignments, establishing a new paradigm for interpretable AI in analytical chemistry.

nmr spectroscopy · large language models · evidential reasoning · chemical knowledge graphs · scaffold-split benchmark

Read original →

CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents

arXiv cs.AI · Bo Qu, Mingguang Chen · 2026-06-29

CLQT introduces a closed-loop, cost-aware benchmark for diagnosing LLM portfolio-management agents, shifting evaluation from ranking returns to identifying process strengths and weaknesses. The framework employs a five-stage trading cycle (gather, synthesize, allocate, execute, reflect) within a temporally-gated environment, supported by six pillars including TimeGate, cost modeling, and strategy-consistency scoring. Agents operate as constrained committees or full-autonomy orchestrators, enabling process scaffolding experimentation. Metrics are derived from a recompute-verifiable hash chain, yielding a five-axis capability scorecard (APM-CS) validated via contamination-controlled backtests and live broker tracks. CLQT disentangles outcomes from capabilities, providing a durable map of agent competencies.

closed-loop · cost-aware · strategy-consistency · temporally-gated · hash chain

Read original →

TopoAgent: An Agentic Framework for Automated Topology Learning in Medical Imaging

arXiv cs.AI · Guangyu Meng, Pengfei Gu, Xueyang Li, Yiyu Shi · 2026-06-29

TopoAgent introduces an LLM-based agentic framework for automated topology learning in medical imaging, addressing the limitation of fixed topological descriptors in conventional methods. The framework employs a Perception-Reasoning-Action-Reflection loop, supported by 21 domain-specific tools and dual memory, to analyze input images and determine optimal topological descriptors without task-specific training. It evaluates 15 topological descriptors across 26 datasets using six classifiers, enabling the generation of suitable topological feature vectors for downstream tasks. This approach leverages persistent homology to capture geometric structural properties often overlooked by pixel-level deep learning.

topological data analysis · persistent homology · agentic framework · topological descriptors · medical imaging

Read original →

PS-PPO: Prefix-Sampling PPO for Critic-Free RLHF

arXiv cs.AI · Doo Hwan Hwang, Kee-Eung Kim · 2026-06-29

The paper introduces Prefix-Sampling Proximal Policy Optimization (PS-PPO), a critic-free RLHF method that improves computational efficiency by exploiting temporal redundancy in trajectories. PS-PPO samples trajectory prefixes via a prompt-conditioned cutoff distribution and applies importance-weighting to maintain unbiased gradient estimation while only backpropagating through prefixes. Experiments on mathematical reasoning and RLHF benchmarks demonstrate comparable accuracy to baselines while significantly reducing training compute (up to 3×) and GPU memory usage.
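One way to see why truncating to a prefix can stay unbiased: if the cutoff L is drawn from a known distribution and each kept token term is divided by the probability that the cutoff reaches it, the expectation equals the full sum. The sketch below checks this exactly by enumerating every cutoff. It illustrates the general importance-weighting idea only; the weighting scheme, function names, and numbers are assumptions and may differ from what PS-PPO actually does.

```python
def prefix_estimate(per_token_terms, cutoff, cutoff_probs):
    """Importance-weighted sum over the first `cutoff` tokens.

    cutoff_probs[L - 1] is P(cutoff = L) for L = 1..T, and the last entry
    must be positive so that every token has a nonzero chance of being kept.
    Token t (1-based) is reweighted by 1 / P(cutoff >= t).
    """
    horizon = len(per_token_terms)
    survival = [sum(cutoff_probs[t - 1:]) for t in range(1, horizon + 1)]
    return sum(per_token_terms[t - 1] / survival[t - 1] for t in range(1, cutoff + 1))


def expected_estimate(per_token_terms, cutoff_probs):
    """Exact expectation of the prefix estimator over all possible cutoffs."""
    return sum(
        prob * prefix_estimate(per_token_terms, cutoff, cutoff_probs)
        for cutoff, prob in enumerate(cutoff_probs, start=1)
    )
```

For example, with per-token terms `[0.5, -1.0, 2.0, 0.25]` and cutoff probabilities `[0.4, 0.3, 0.2, 0.1]`, `expected_estimate` recovers the full sum of the terms up to floating-point error, even though most sampled prefixes never touch the later tokens.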

reinforcement learning from human feedback · proximal policy optimization · critic-free methods · temporal redundancy · gradient estimation

Read original →

Rethinking Generative Reconstruction Attacks against Graph Neural Network Models

arXiv cs.AI · Adebayo Keji, Sayanton Dibbo · 2026-06-29

The paper introduces two novel graph inversion attacks against Graph Neural Networks (GNNs): graph-label conditioned (GLC) and embedding-label conditioned (ELC) attacks, leveraging model predictions and intermediate representations respectively. Using a generator-discriminator approach, the attacks reconstruct high-quality graphs in black-box scenarios, evaluated on NCI1, PROTEINS, and AIDS datasets with FGD, EGD, MMD, and GKS metrics. Results show GNN vulnerability even with 50% reduced queries (Ours-- variant) and varying Laplacian noise-scales.

graph neural networks · model inversion attack · privacy attacks · graph reconstruction · black-box attack

Read original →

DEEPMED Search: An Open-Source Agentic Platform for Medical Deep Research with Introspective Verification

arXiv cs.AI · Maolin Liu, Fanyu Xu, Ruoqing Xu, Jiahang Zhang · 2026-06-29

The paper introduces DEEPMED Search, an open-source agentic platform for transparent medical research that addresses limitations in commercial tools and standard RAG implementations. The system employs a source-adaptive router to dispatch sub-queries to PubMed, web search, or local knowledge bases, coupled with an introspective verification module using causal-consistent multi-agent debate for evidence validation. Results demonstrate its capability to handle rare disease queries, filter noise, and generate citation-backed reports efficiently, providing a robust infrastructure for medical reasoning.

agentic platform · introspective verification · source-adaptive router · causal-consistent debate · knowledge bases

Read original →

DeepTrans Studio: Turning Expert Interventions into Shared Team Knowledge in Agentic Translation Workflows

arXiv cs.AI · Ziyang Lian, Qingya Zhang, Hao Wang, Huiwen Xiong · 2026-06-29

DeepTrans Studio introduces a collaborative translation workspace that transforms expert interventions into shared team knowledge within agentic translation workflows. The system enables professionals to intercept specific workflow nodes, review evidence, revise AI outputs, and save approved decisions to a collective team memory. During a demo, participants role-played translators and reviewers, addressing preset terminology and legal-modal risks, with their decisions propagated to downstream segments and surfaced as reusable precedents in teammates' workspaces. This approach ensures human interventions become traceable, shared knowledge rather than isolated corrections.

agentic translation · team memory · legal-modal risks · workflow nodes · reusable precedents

Read original →

From Trait to Behavior: A Cognitive-Affective Personality System (CAPS) Perspective on Multi-Homing Intention in AIGC Platforms

arXiv cs.AI · Xuchao Zhang, Jihye Lee · 2026-06-29

This study addresses the theoretical gap in cross-platform usage intentions within Artificial Intelligence Generated Content (AIGC) platforms by proposing and validating a three-stage multiple mediation model. The model integrates optimum stimulation level (OSL) theory, complementarity theory, and perceived value theory, with social influence and use experience as control variables. Results indicate that OSL enhances perceived complementarity, which positively affects perceived epistemic value, subsequently predicting multi-homing intention. A chain mediation path from OSL to multi-homing intention via perceived complementarity and epistemic value was identified. Social influence positively impacts multi-homing intention, while use experience shows no significant effect.

artificial intelligence generated content · optimum stimulation level · perceived complementarity · perceived epistemic value · multi-homing intention

Read original →

Redefining Maritime Anomaly Detection via Equation-Grounded Synthetic Anomalies

arXiv cs.AI · Youngseok Hwang, Sungho Bae, Dohun Lee, Jaeeun Seo · 2026-06-29

The paper introduces an equation-grounded taxonomy and synthetic anomaly generation pipeline for maritime anomaly detection using AIS data. The method defines three anomaly types (unexpected activity, route deviation, close approach) and implements a score-synthesize-label pipeline with LLM-guided plausibility scoring. Evaluations across temporal-window variations and anomaly compositions demonstrate the framework's effectiveness when tested on diverse time-series and anomaly detection models. The approach addresses limitations of prior statistical rarity and expert-labeling methods while enabling systematic benchmarking.

automatic identification system · anomaly taxonomy · llm-guided synthesis · time-series evaluation · maritime safety

Read original →

Diagnosing and Mitigating Context Rot in Long-horizon Search

arXiv cs.AI · Shijie Xia, Yikun Wang, Zhen Huang, Pengfei Liu · 2026-06-29

This paper investigates and mitigates context rot, a degradation of Large Language Model (LLM) capabilities due to extensive context in long-horizon search tasks. The authors evaluate four open-source models across three benchmarks, revealing that increasing context length leads to premature uncertain answers or model disengagement. They explore mitigation strategies through context management (seven methods across three categories) and rot-aware rejection sampling, demonstrating their effectiveness individually and in combination. Pruning experiments establish a relationship between accumulated context and rot severity, providing guidance for strategy selection based on performance, cost, and rot impact.

context rot · long-horizon search · rejection sampling · context management · pruning experiments

Read original →

Optimizing Expert-Designed Crystal Graph Networks for Band-Gap Prediction with an Autonomous LLM Research Loop

arXiv cs.AI · Chenmu Zhang, Boris I. Yakobson · 2026-06-29

An autonomous LLM research agent optimized expert-designed crystal graph networks for band-gap prediction, achieving state-of-the-art accuracy on the MatBench benchmark (>100k crystals) without external pretraining. The agent implemented known methods, including element-pair features on message-passing edges and crystal space-group embeddings, outperforming seventeen expert-designed models. This work demonstrates the potential of LLM-driven autonomous research in optimizing machine learning models for material property prediction while highlighting its methodological limitations.

crystal graph networks · band-gap prediction · matbench · message-passing edges · space-group embedding

Read original →

SEVA: Self-Evolving Verification Agent with Process Reward for Fact Attribution

arXiv cs.AI · Aojie Yuan, Yi Nian, Haiyue Zhang, Zijian Su · 2026-06-29

SEVA introduces a structured verification agent for LLM fact attribution, addressing hallucination through evidence alignments, reasoning chains, and error diagnoses. The method employs process reward in RL to decompose verification quality into five components, resolving advantage collapse and inducing an implicit curriculum. Results show improved alignment (0.917→0.997) and F1 (64.9→69.0), with SEVA-3B matching GPT-4o-mini (69.0 vs. 69.8 F1) on ClearFacts while providing richer output.

fact attribution · process reward · advantage collapse · self-evolution loop · structured verification

Read original →

ARMOR: Adaptive Retriever Optimization for Low-Resource Telecom Question Answering

arXiv cs.AI · Heshan Fernando, Quan Xiao, Yan Xin, Tianyi Chen · 2026-06-29

ARMOR introduces adaptive retriever optimization for low-resource telecom QA, prioritizing query-encoder adaptation over generator fine-tuning under bounded-parameter assumptions. The method jointly optimizes latent-document RAG likelihood and InfoNCE contrastive objectives, learning separate temperatures for each and regularizing the adapted query encoder toward its frozen base. Evaluations on telecom-specific benchmarks demonstrate improved evidence retrieval and answer generation compared to generator-side adaptation.
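For readers unfamiliar with the contrastive half of the objective, the following is a minimal sketch of an InfoNCE loss for a single query with an explicit temperature. It uses plain dot-product similarity over toy vectors; the embeddings, the fixed temperature argument, and the function name are illustrative assumptions, and the paper's joint objective with learned temperatures and regularization is not reproduced here.

```python
import math


def info_nce_loss(query, positive, negatives, temperature):
    """InfoNCE loss for one query: -log softmax score of the positive document."""

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # The positive document sits at index 0, followed by the negatives.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]

    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]
```

A lower temperature sharpens the softmax, so a query that already ranks its positive first is penalized less, while a query that ranks a negative first is penalized more.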

retrieval-augmented generation · query-encoder tuning · infonce · latent-document likelihood · telecom qa

Read original →

GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots

arXiv cs.AI · Sunqi Fan, Lingshan Chen, Runqi Yin, Qingle Liu · 2026-06-29

GUICrafter introduces a weakly-supervised GUI agent that reduces reliance on human annotations by leveraging massive unannotated screenshots. The method employs a two-stage curriculum learning framework: first learning visual grounding from unannotated GUI screenshots and webpages, then fine-tuning with minimal high-quality data via reinforcement learning. Experiments demonstrate competitive performance to UI-TARS using only 0.1% of its annotated data, and superior results to GUI-R1 under equivalent annotation budgets.

gui agent · weakly-supervised learning · visual grounding · curriculum learning · reinforcement learning

Read original →

Toward Secure and Reliable PDDL Formalization of Large Language Models with Planner-in-the-Loop Feedback

arXiv cs.AI · Jiamei Jiang, Jiajing Zhang, Feifei Mo, Linjing Li · 2026-06-29

The paper introduces NL-PDDL-Bench, a multi-domain benchmark for natural-language-to-PDDL formalization with planner-verified executability and difficulty scaling by object count. It proposes a planner-in-the-loop framework using validator and planner diagnostics for localized edits, combining Low-Rank Adaptation fine-tuning, planner-derived Direct Preference Optimization, and inference-time repair. Experiments demonstrate improved planner success rates, plan-level agreement, and robustness across domains, highlighting verifiable formalization for safety-critical LLM deployment.

pddl formalization · planner-in-the-loop · low-rank adaptation · direct preference optimization · plan-level consistency

Read original →

Early Warning Signals for OpenVLA Failure under Visual Distribution Shift

arXiv cs.AI · Dipesh Tharu Mahato, Rachel Ren · 2026-06-29

This work demonstrates that OpenVLA's internal activations contain linearly decodable signals predictive of near-term task failure under visual distribution shifts. The authors analyze LIBERO manipulation rollouts with a fixed OpenVLA policy, logging activations and fitting lightweight monitors post-hoc. Under occlusion stress tests reducing success rates from 57% to 17%, a logistic probe at layer 16 achieves AUROC 0.972 and AUPRC 0.352 for failure prediction within 15 steps, outperforming baselines. Layer-wise analysis reveals uneven decodability, with layer 16 being most informative. The monitor generalizes to camera jitter but not benign color shifts, though causal mechanisms and deployable recovery remain unestablished.
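The two reported metrics answer different questions, which the dependency-free sketch below tries to make concrete: AUROC is the probability that a randomly chosen failure step scores above a randomly chosen non-failure step, while average precision (a common estimate of AUPRC) tracks precision as one walks down the ranked list. The function names and toy inputs are illustrative, and the paper's exact AUPRC estimator is not stated in the summary.

```python
def auroc(scores, labels):
    """Probability that a positive outscores a negative (ties count as half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def average_precision(scores, labels):
    """Mean of precision@rank taken at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits
```

Average precision depends on how rare the positive class is, whereas AUROC does not, so the two numbers can sit far apart when imminent-failure steps make up only a small share of logged steps.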

vision language action models · linear decodability · distribution shift · task failure prediction · internal activations

Read original →

A Machine-Verified Proof of a Quantum-Optimization Conjecture

arXiv cs.AI · Uri Kol, Maor Ben-Shahar, Kfir Sulimany, Dirk Englund · 2026-06-29

The authors present a machine-verified proof of the Farhi-Goldstone-Gutmann (FGG) conjecture in quantum optimization, which had remained open for over a decade. Using Claude Fable 5 and Lean 4, they formalized QAOA components and reduced the conjecture to a single mathematical statement, which the LLM then proved by identifying a hidden dynamical symmetry. The proof leverages quantum information theory and adjacent mathematical tools, with Lean providing end-to-end verification while requiring human input only for structural validation. This demonstrates a scalable methodology for resolving open conjectures in quantum information science.

quantum approximate optimization algorithm · lean theorem prover · machine-verified proof · dynamical symmetry · quantum information theory

Read original →

Sample-Efficient Learning of Probabilistic Causes for Reachability in Markov Decision Processes with Probabilistic Guarantees

arXiv cs.AI · Ryohei Oura, Georgios Fainekos, Hideki Okamoto, Bardh Hoxha · 2026-06-29

The paper introduces a sample-efficient learning method for identifying probability-raising (PR) causes in Markov decision processes (MDPs) with unknown transition probabilities, providing probabilistic guarantees. The approach uses a restart-based MDP modification to reduce PR-cause verification to two conditional reachability queries, avoiding reliance on original MDP reachability values. Theoretical analysis establishes sample-complexity bounds, while experiments on benchmarks demonstrate reliable causal identification via an anytime algorithm combining learning and two-sided value iteration.

markov decision processes · probability-raising causality · sample complexity · conditional reachability · value iteration

Read original →

Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature

arXiv cs.AI · Subham Ghosh, Shubham Tiwari, Mohammad Ibrahim, Abhishek Tewari · 2026-06-29

The paper introduces MatSciFig, a large-scale multimodal dataset unlocking the visual record of materials science literature by decomposing compound figures into 391,606 panel-level image-text pairs from 180,571 figures across 14,810 open-access articles. The MatMMExtract pipeline employs a fine-tuned YOLO12-m detector for panel localization (mAP_50: 0.9227) and Gemini 3.1 Flash Lite for structured annotation generation (82% quality, 4.8% hallucination rate). Each pair includes sub-captions, visualization categories, and scientific summaries grounded in a materials science taxonomy. A dual-encoder retrieval baseline demonstrates MatSciFig's utility, achieving a 4.4x improvement in R@1 over zero-shot CLIP. All resources are openly released.

multimodal dataset · compound figures · panel localization · structured annotation · vision-language learning

Read original →

Diversity is the Strength of the AI Crowd

arXiv cs.AI · Matthew Aitchison, Scott Jeen, Toby Shevlane, Ben Day · 2026-06-29

The study demonstrates that ensemble forecasting accuracy for future world events improves by combining diverse off-the-shelf LLMs rather than relying on highly correlated predictions from similar models. Using binary questions from the Metaculus AI Benchmark, the authors analyze prediction correlations and find that models like Grok 4 enhance ensemble performance due to lower correlation with other frontier LLMs. Results indicate optimal ensembles prioritize both model quality and diversity, suggesting AI forecasting systems should explicitly optimize for complementary errors.
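The intuition that correlated forecasters add little can be made concrete with a textbook identity: if n forecasters each have error variance sigma^2 and every pair of errors has correlation rho, the error variance of their simple average is sigma^2 * (1 + (n - 1) * rho) / n. The sketch below encodes that identity; the equal-variance, equal-correlation setup is a simplifying assumption and not the paper's analysis.

```python
def ensemble_error_variance(sigma_sq, n, rho):
    """Error variance of the mean of n forecasts with shared pairwise correlation.

    Assumes every forecaster has error variance sigma_sq and every pair of
    errors has correlation rho (an idealized, illustrative setting).
    """
    return sigma_sq * (1 + (n - 1) * rho) / n
```

Under these assumptions, four uncorrelated models cut the error variance to a quarter, four models with rho = 0.5 only reach 0.625 of the single-model variance, and perfectly correlated models give no reduction at all. As n grows the variance approaches sigma^2 * rho, so beyond a point lowering correlation helps more than adding further similar models.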

ensemble forecasting · llms · metaculus · correlation · superforecaster

Read original →

Safety from Honesty in a Disinterested AI Predictor

arXiv cs.AI · Yoshua Bengio, Oliver Richardson, Tomáš Gavenčiak, Michael Cohen · 2026-06-28

The paper presents a formal safety argument for the Scientist AI (SAI) Predictor, a Bayesian posterior approximation model trained on epistemically contextualized natural-language statements. The Predictor avoids implicit agency by distinguishing factual claims from communication acts and using a posterior-seeking training objective that excludes downstream effects as reward signals. Under assumptions on training dynamics and sparsity of dangerous Predictors, the probability of residual harm exceeding a threshold is proven to be small, as coordinated deception is rare and costly. Safety and accuracy are jointly supported by constraints that prevent misalignment and agency emergence.

bayesian posterior · epistemic contextualization · implicit agency · training dynamics · residual harm

Read original →

Budgeted Act-or-Defer Multi-Agent LLM Deliberation with Local Reliability Bounds

arXiv cs.AI · Mengdie Flora Wang, Haochen Xie, Guanghui Wang, Devin Zhang · 2026-06-28

The paper introduces a budgeted act-or-defer framework for multi-agent LLM deliberation, ensuring reliable decision-making by deferring to human review when confidence bounds fall below a user-specified threshold. The method maps debate prefixes to low-dimensional states, computes k-nearest-neighbor lower confidence bounds on correctness, and decomposes risk into calibration failure, residual action risk, and representation gap. Evaluated on six benchmarks against nine baselines, it achieves 84% automation and 96% acted-on accuracy while using only 9-12% of the pre-declared wrong-action budget. The approach prospectively converts user-declared budgets into auditable operating points under explicit assumptions.
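The act-or-defer rule can be sketched as follows: look up the k nearest past states, take the fraction that turned out correct, subtract a confidence margin, and act only if the result clears the threshold. The Hoeffding-style margin, the Euclidean distance, and all names below are assumptions chosen for illustration; the summary does not say which bound or state representation the paper uses.

```python
import math


def knn_lower_bound(state, history, k, delta):
    """Lower confidence bound on correctness from the k nearest past states.

    history is a list of (state_vector, correct) pairs with correct in {0, 1}.
    The margin is a Hoeffding-style term sqrt(ln(1 / delta) / (2 * k)); this
    choice of bound is an assumption made for illustration.
    """
    nearest = sorted(history, key=lambda item: math.dist(state, item[0]))[:k]
    mean = sum(correct for _, correct in nearest) / len(nearest)
    margin = math.sqrt(math.log(1 / delta) / (2 * len(nearest)))
    return max(0.0, mean - margin)


def act_or_defer(state, history, k, delta, threshold):
    """Act autonomously only when the local lower bound clears the threshold."""
    if knn_lower_bound(state, history, k, delta) >= threshold:
        return "act"
    return "defer"
```

Using a lower bound rather than the raw neighbor accuracy means sparse or mixed neighborhoods default to deferral, which is the conservative direction for a wrong-action budget.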

multi-agent deliberation · confidence bounds · representation gap · wrong-action budget · auditable operating point

Read original →

Hybrid Retriever Evolution for Multimodal Document Reasoning Agents

arXiv cs.AI · Bohan Yao, Shruthan Radhakrishna, Vikas Yadav · 2026-06-28

The paper introduces a failure-driven evolution framework for learning adaptive retrieval orchestration in multimodal document question answering. A meta-agent diagnoses reasoning failures, probes the tool environment, and iteratively rewrites the task agent's instructions to dynamically coordinate lexical, semantic, and multimodal retrievers during reasoning steps. Evaluated on MMLongBench-Doc and DocBench, the evolved agent achieves up to +19.6 point gains over baselines, outperforming MACT, MDocAgent, and SimpleDoc through adaptive routing and cross-modal evidence composition rather than fixed retrieval pipelines.

multimodal retrieval · failure-driven evolution · adaptive routing · meta-agent · document reasoning

Read original →

Fuzzing Large Language Models to Elicit Hidden Behaviours

arXiv cs.AI · Mohammed Abu Baker, Lakshmi Babu-Saheer · 2026-06-28

This paper presents the first systematic study of fuzzing techniques to elicit hidden behaviors in sleeper-agent LLMs (7B-13B parameters), comparing Gaussian noise injection into weights versus residual-stream activations against temperature-sampling baselines. Fuzzing outperformed temperature sampling on 4 of 6 models, with up to 6x improvement on OpenHermes-13B. Hyperparameter selection proved critical, as uniform sweeps yielded low elicitation rates (few percent) compared to best cells (2-10x higher). A Thompson-sampling-based proxy task (in-context secret elicitation) improved activation-fuzzing elicitation by 4x and weight-fuzzing by 1.3-1.8x over uniform baselines. The authors propose reporting results as a (uniform-baseline, proxy-selected, oracle) triple for clarity.
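Since the summary stresses that a uniform sweep wastes most trials on unproductive hyperparameter cells, a Bernoulli Thompson-sampling loop is one plausible way to concentrate trials on promising cells. The sketch below is a generic bandit over hypothetical cells (for example, noise scale and layer pairs) with simulated elicitation outcomes; it is not the paper's proxy-task procedure, and the rates are made up.

```python
import random


def thompson_select(successes, failures, rng):
    """Sample one Beta(s + 1, f + 1) draw per cell and pick the largest."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def run_search(true_rates, trials, seed=0):
    """Simulate Thompson sampling over cells with hidden elicitation rates.

    true_rates are hypothetical per-cell success probabilities used only to
    simulate outcomes; a real run would replace the coin flip with a trial
    of the fuzzed model.
    """
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(trials):
        cell = thompson_select(successes, failures, rng)
        if rng.random() < true_rates[cell]:
            successes[cell] += 1
        else:
            failures[cell] += 1
    return successes, failures
```

Cells that rarely elicit the hidden behaviour accumulate failures, their posterior draws shrink, and they are sampled less often, so the trial budget drifts toward the better cells without a full grid sweep.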

fuzzing · sleeper-agent · thompson sampling · residual-stream · hyperparameter selection

Read original →

Fast Wireless Foundation Models with Early-Exits

arXiv cs.AI · Omar Mashaal, Hatem Abou-Zeid · 2026-06-28

The paper introduces an early-exit framework for wireless foundation models (FMs) to reduce computational costs while improving out-of-distribution (OOD) task performance. The method attaches lightweight task-specific heads at intermediate layers of a frozen FM encoder, enabling variable-depth inference. Results show up to 93% FLOPs reduction and higher accuracy on unseen tasks compared to full encoder execution, with fixed-exit strategies outperforming dynamic early-exiting policies.

wireless foundation models · early-exit · out-of-distribution · variable-depth inference · flops reduction

Read original →

Two-Stage Prompt Optimization for Few-Shot Relation Extraction: From Reasoning-Guided Search to Gradient-Guided Refinement

arXiv cs.AI · Aunabil Chakma, Mihai Surdeanu, Eduardo Blanco · 2026-06-28

The paper proposes a two-stage framework for automatic prompt optimization in episodic few-shot relation extraction, combining reasoning-based and gradient-based approaches. The first stage employs any reasoning-based optimizer for broad prompt improvements, while the second stage introduces GradPO, which uses loss and gradient signals to identify high-impact prompt spans and refine them with local edits. Experiments on FS-TACRED and FS-FewRel demonstrate that local refinement typically enhances prompts from the first stage, with GradPO being the most consistent refiner. The framework achieves state-of-the-art performance on FS-TACRED using Qwen3-4B and remains competitive on FS-FewRel.

prompt optimization · few-shot relation extraction · gradient-based optimization · reasoning-based optimizer · local refinement

Read original →

SFBench: The SciFy Scientific Feasibility Benchmark

arXiv cs.AI · Cash Costello, James Mayfield, Elsbeth Turcan, Christine Piatko · 2026-06-28

SFBench introduces a novel benchmark for evaluating systems that assess scientific claim feasibility, featuring 197 de novo claims in materials science annotated with expert-derived feasibility scores and explanations. The benchmark emphasizes complex reasoning over varying feasibility levels, avoids LLM training contamination by using original claims, and employs open-ended explanations rather than fixed-format responses. Baseline evaluations using recent GPT models demonstrate the benchmark's utility in assessing scientific reasoning capabilities.

scientific feasibility · materials science · benchmark dataset · open-ended explanations · gpt models

Read original →

SCARCE: Scalable Cascade Analysis for Rare-event Characterisation via Embeddings

arXiv cs.AI · Yingjie Wang, Yi Dong, Edmund Lau, Jie Meng · 2026-06-28

The paper introduces SCARCE, a method for scalable rare-event probability estimation that replaces traditional Subset Simulation's handcrafted performance function with learned latent embeddings and geometric rulers. By adaptively constructing nested intermediate events from data and formalizing the approach via a non-negative supermartingale, SCARCE provides valid high-probability upper bounds even under early stopping. Experiments demonstrate 400-500x lower mean absolute error than grid-searched Subset Simulation on MNIST misclassification, and 2.6% mean relative error for LLM jailbreak detection on Llama-Guard-3-8B hidden states with adversarial fractions η ≥ 10⁻³.
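
For context, the classical Subset Simulation estimator that SCARCE builds on writes a rare event as the last in a chain of nested, progressively rarer events and multiplies conditional probabilities. The identity below is the textbook form, not SCARCE's own estimator; the symbols F_i and m are notation assumed here.

```latex
% Standard Subset Simulation identity (textbook form, not SCARCE-specific):
% nested events F_1 \supset F_2 \supset \dots \supset F_m = F
P(F) \;=\; P(F_1)\,\prod_{i=2}^{m} P\!\left(F_i \mid F_{i-1}\right)
```

If each factor is kept near 0.1, an event with probability around 10⁻³ needs about three levels, each of which is far easier to estimate by sampling than the rare event itself.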

rare-event estimation · subset simulation · latent embeddings · non-negative supermartingale · llm jailbreaks

Read original →

Do We Still Need Fine Tuning? Turkish Sentiment Analysis in the Era of Large Language Model

arXiv cs.AI · Sercan Karakaş, Yusuf Şimşek · 2026-06-28

The study demonstrates that supervised fine-tuning remains superior to zero-shot prompting for Turkish sentiment analysis, particularly in three-class settings. Comparing classical ML, fine-tuned BERTurk, and prompted LLMs on Turkish e-commerce reviews, fine-tuned BERTurk achieves the highest accuracy, while LLMs struggle with neutral class classification, often collapsing it into polarized categories. Performance gaps narrow in binary positive-negative classification, but three-class evaluation reveals LLMs' limitations. Results emphasize the continued necessity of fine-tuning and the importance of including neutral classes for robust sentiment analysis evaluation.

sentiment analysis · fine-tuning · large language models · zero-shot learning · berturk

Read original →

Does Role Specialization Matter for Explanation Faithfulness in Mixture-of-Experts?

arXiv cs.AI · Yeji Kim, Housam Babiker, Mi-Young Kim, Randy Goebel · 2026-06-28

The study investigates whether role specialization in Mixture-of-Experts (MoE) architectures preserves explanation faithfulness, hypothesizing that inter-expert representation overlap degrades attribution-based faithfulness. The authors propose representation-level decorrelation regularization to minimize inter-expert similarity, enhancing role separation. Experiments on multimodal benchmarks demonstrate improved faithfulness metrics (comprehensiveness, sufficiency, AOPC) without compromising task performance, with benefits extending to standard sparse MoE baselines. Findings suggest representation-level separation complements structural role decomposition for faithful explanations.

mixture-of-experts · explanation faithfulness · representation decorrelation · multimodal benchmarks · attribution-based metrics

Read original →

Mechanistically Eliciting Latent Behaviors in Language Models

arXiv cs.AI · Andrew Mack, Nina Panickssery, Alexander Matt Turner · 2026-06-28

The paper introduces Causal Perturbative Elicitation (CPE), an unsupervised method for discovering interpretable low-rank adapters (LoRAs) that elicit latent behaviors in language models. CPE uses tensor decomposition to perturb transformer computations, efficiently learning diverse behavioral modes from minimal data. Results show competitive performance with supervised methods (85% vs 87% on Qwen3-8B Countdown task), success in unlocking sandbagged models (85% BigCodeBench recovery), and mitigation of alignment-faking in Llama3-70B. CPE also aids alignment initialization in GPT-OSS-20B, demonstrating utility for both safety evaluation and behavioral control.

causal perturbative elicitation · low-rank adapters · tensor decomposition · latent behaviors · alignment-faking

Read original →

Langshaw: Declarative Interaction Protocols Based on Sayso and Conflict

arXiv cs.AI · Munindar P. Singh, Samuel H. Christie, Amit K. Chopra · 2026-06-28

Langshaw introduces a declarative protocol language for multiagent systems, addressing over-constraining and semantic ambiguity in existing approaches. The method centers on three constructs: (1) 'sayso' for attribute priority assignment, (2) 'nono' and 'nogo' for action conflict resolution, combined with an information model for semantic clarity. Results include formal semantics, safety/liveness verification procedures, and a message-oriented protocol generation method enabling flexible asynchronous enactment.

declarative protocol · multiagent systems · attribute priority · conflict resolution · asynchronous enactment

Read original →

One Scene, Two Depths: Probing Geometric Ambiguity in Monocular Foundation Models

arXiv cs.AI · Xiaohao Xu, Feng Xue, Xiang Li, Haowei Li · 2026-06-28

The paper introduces MultiDepth-3k (MD-3k), a sparse two-layer ordinal benchmark for evaluating depth-layer preference and multi-layer spatial relationship accuracy (ML-SRA) in monocular depth estimation. It highlights the geometric ambiguity in transparent scenes, where a single camera ray may intersect multiple surfaces, challenging the conventional single-depth-per-pixel paradigm. Experiments on MD-3k reveal diverse depth-layer preferences across leading depth foundation models under RGB input, with Laplacian Visual Prompting (LVP) significantly altering layer predictions for frozen models. The best-performing RGB/LVP pair, DAv2-L, achieves 75.5% ML-SRA, suggesting that depth models may express complementary geometric hypotheses beyond standard RGB inference.

monocular depth estimation · geometric ambiguity · multi-depth benchmark · laplacian visual prompting · spatial relationship accuracy

Read original →

How AI settled the complexity of the oldest SGD algorithm

arXiv cs.AI · Michał Dereziński, Xiaoyu Dong · 2026-06-28

The paper establishes the worst-case complexity of the Kaczmarz algorithm, the earliest known stochastic gradient descent (SGD) method originally proposed in 1937 for solving linear systems. Modern AI models including ChatGPT and Gemini were employed to analyze this foundational optimization technique. The interdisciplinary approach connects classical numerical analysis with contemporary machine learning paradigms to characterize the algorithm's computational limits.
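
For readers unfamiliar with the method, a minimal sketch of a Kaczmarz iteration follows. It uses the randomized variant (rows sampled in proportion to their squared norm), which is the form usually read as SGD on a least-squares objective; the summary does not say which variant the paper analyzes, and the small test system is made up.

```python
import random

def randomized_kaczmarz(A, b, iters=2000, seed=0):
    """Approximately solve a consistent system A x = b by row projections."""
    rng = random.Random(seed)
    n = len(A[0])
    x = [0.0] * n
    row_norms = [sum(a * a for a in row) for row in A]
    for _ in range(iters):
        # Sample a row with probability proportional to its squared norm.
        i = rng.choices(range(len(A)), weights=row_norms)[0]
        residual = b[i] - sum(a * xi for a, xi in zip(A[i], x))
        step = residual / row_norms[i]
        # Project x onto the hyperplane {z : A[i] . z = b[i]}.
        x = [xi + step * a for xi, a in zip(x, A[i])]
    return x

# Hypothetical consistent system built from a known solution.
A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
x_true = [1.0, 2.0]
b = [sum(a * t for a, t in zip(row, x_true)) for row in A]
x_hat = randomized_kaczmarz(A, b)
print(x_hat)
```

Each step touches a single equation, which is what makes the update look like a stochastic gradient step with a row-dependent step size.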

stochastic gradient descent · kaczmarz algorithm · computational complexity · linear systems · optimization

Read original →

SonoCLIP: Mask-Guided Region-Aware Vision-Language Pretraining for Fetal Ultrasound Analysis

arXiv cs.AI · Hang Su, Chao Sun, Zhaofan Li, Wei Hu · 2026-06-28

SonoCLIP introduces a region-controllable vision-language foundation model for fetal ultrasound analysis, addressing limitations of global image-text alignment in existing CLIP-based approaches. The model integrates segmentation masks as mask-channel visual prompts within the vision encoder, enabling joint global-local contrastive representation learning, and employs a sigmoid-based pairwise contrastive loss for scalable region-text alignment. Pretrained on a curated 1.44M-image multimodal fetal ultrasound dataset spanning 24 standard planes, SonoCLIP demonstrates superior zero-shot transfer performance in cross-center evaluations under both global and mask-guided inference. The model establishes a clinically oriented foundation for fetal ultrasound analysis.
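
The summary does not give the loss formula. One common reading of a "sigmoid-based pairwise contrastive loss" is the SigLIP-style objective, where every region-text pair is scored as an independent binary decision. The sketch below assumes that form; the `temperature` and `bias` defaults and the toy similarity matrices are illustrative assumptions, not SonoCLIP's settings.

```python
import math

def sigmoid_pairwise_loss(sim, temperature=10.0, bias=-10.0):
    """SigLIP-style loss on an N x N similarity matrix: each (region, text)
    pair is an independent binary decision, +1 on the diagonal, -1 elsewhere."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = label * (temperature * sim[i][j] + bias)
            # -log(sigmoid(logit)) written as softplus(-logit), in a stable form.
            total += max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return total / n

aligned = [[1.0, 0.0], [0.0, 1.0]]      # matching pairs are most similar
mismatched = [[0.0, 1.0], [1.0, 0.0]]   # matching pairs are least similar
loss_aligned = sigmoid_pairwise_loss(aligned)
loss_mismatched = sigmoid_pairwise_loss(mismatched)
print(loss_aligned, loss_mismatched)
```

Because no softmax normalizes over the batch, each pair contributes on its own, which is the property usually cited for scaling this kind of loss to large batches.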

vision-language foundation model · mask-channel visual prompts · contrastive representation learning · zero-shot transfer · fetal ultrasound analysis

Read original →

Bilevel Optimization for Neural Architecture Search

arXiv cs.AI · Abhishek Shukla, Ankur Sinha, Faiz Hamid · 2026-06-28

The paper provides a structured overview of Neural Architecture Search (NAS) through bilevel optimization, categorizing methods into sampling-based and bilevel theory-based approaches. It introduces an auxiliary mathematical programming framework that integrates second-order information from training loss, ensuring optimal parameter updates for both architecture and model weights. Comparative analysis demonstrates that bilevel theory-based methods outperform sampling-based approaches in accuracy and efficiency.

bilevel optimization · neural architecture search · hyperparameter tuning · second-order information · mathematical programming

Read original →

The Joint Effect of Quantization and Sampling Temperature on LLM Safety Alignment: A Factorial Analysis

arXiv cs.AI · Hari Prasad, Ritam Pal · 2026-06-28

This work systematically evaluates how quantization and sampling temperature jointly affect LLM safety alignment through a factorial study of 9 instruction-tuned models across 3 precisions (FP16, INT8, INT4) and 6 temperatures (0-1.0), generating 322k responses assessed by a safety ensemble. Results show standard quantization is generally safety-neutral (INT4 reduces attack success in 7/9 models), while temperature increases decision instability (DFR reaches 53.0% at T=1.0), with sub-additive interaction effects (Compound Degradation Index: -0.195 to +0.045). The findings suggest INT4/INT8 quantization is viable for aligned models, but safety evaluations at high temperatures should measure multi-sample stability.

quantization · sampling temperature · safety alignment · attack success rate · decision instability

Read original →

ScAle: Attention Head Scaling as a Minimal Adapter for Spatial Reasoning in Vision Language Models

arXiv cs.AI · Rahul Chowdhury, Timothy A Rupprecht, Xuan Shen, Pu Zhao · 2026-06-28

ScAle introduces an ultra-lightweight adaptation method for vision language models (VLMs) that improves spatial reasoning by rescaling activations in transformer layers without modifying pretrained weights. The method learns a small set of scalar coefficients to modulate last-token attention and MLP activations in a fully frozen backbone. Evaluated on SpatialEval, COCOQA, and VGQA benchmarks, ScAle achieves up to 134.1% relative accuracy gains using only 1K trainable parameters, recovering a substantial fraction of standard PEFT performance while maintaining strong non-spatial VQA accuracy.

spatial reasoning · vision language models · scalar coefficients · last-token attention · parameter-efficient

Read original →

ReMAP-PET: Beyond Visual Understanding -- Learning Region-Guided Metabolic Alignment Semantics from Brain PET

arXiv cs.AI · Dasen Dai, Yanteng Zhang, Shuoqi Li, Yuxiang Wei · 2026-06-28

ReMAP-PET introduces a framework for learning region-guided metabolic semantics from brain PET scans, addressing limitations of existing 3D brain foundation models that treat PET as generic volumetric data. The method supervises a partially-tuned MedicalNet 3D ResNet-50 with brain regional standardized uptake value ratio (SUVR) profiles through joint regression and contrastive objectives. On 1015 paired PET--SUVR samples, ReMAP-PET achieves 0.070 SUVR MAE and 77.8% PET SUVR Recall@1, outperforming five frozen pretrained baselines. The framework enables PET-to-report generation via SUVR-constrained verbalization and retains clinically relevant information in embeddings without task-specific fine-tuning.

positron emission tomography · metabolic semantics · standardized uptake value ratio · contrastive objectives · pet-to-report generation

Read original →

TF-MoE: Time-Frequency Mixture-of-Experts for Efficient Speech Separation

arXiv cs.AI · Qinzhe Hu, Chenda Li, Wangyou Zhang, Shujie Liu · 2026-06-28

The paper proposes TF-MoE, a sparse Mixture-of-Experts framework for efficient speech separation that increases model capacity without increasing inference cost. The method introduces dynamic expert specialization in the time and frequency dimensions via alternating time-wise and frequency-wise MoE modules, built on a mel-band-splitting Conformer backbone. Experiments show TF-MoE outperforms BSRNN by 3.8 dB SDR on Libri2Mix at comparable compute (4.1 GMACs/s), demonstrating effectiveness under low-compute constraints.

mixture-of-experts · speech separation · conformer · edge computing · dynamic routing

Read original →

Coverage-Driven KV Cache Eviction for Efficient and Improved Inference of LLM

arXiv cs.AI · Shuvendu Roy, Mengyao Zhai, Hossein Hajimirsadeghi, Golnoosh Samei · 2026-06-28

The paper introduces K-VEC, a coverage-aware KV-cache eviction strategy for efficient LLM inference that addresses performance degradation from reduced token coverage. The method employs cross-head and cross-layer coverage modules to retain critical tokens across attention heads and model layers, theoretically preserving mutual information between inputs and outputs. Evaluations on 16 LongBench subsets show K-VEC achieves up to 10.35-point improvement over existing methods under identical eviction rates and memory constraints.

kv-cache eviction · long-context reasoning · mutual information · attention sparsity · llm inference

Read original →

VISTA-DZ: Visual Semantic Trajectory Adaptation for Personalized Dilemma Zone Prediction

arXiv cs.AI · Chuheng Wei, Ziye Qin, Ziran Wang, Guoyuan Wu · 2026-06-28

VISTA-DZ introduces a visual semantic trajectory adaptation framework for personalized dilemma zone prediction at signalized intersections. The method converts historical trajectories into visual representations, processes them with a vision-language model to generate behavioral profiles, and uses semantic embeddings to condition a dual-output prediction network combining bidirectional GRU, driver-conditioned cross-attention, and Feature-wise Linear Modulation. Evaluated on SDZ and FDZ datasets, it achieves 93.26% in-domain accuracy and 90.22% mean accuracy across 20 held-out drivers, demonstrating effective simulation-to-real transfer.
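
Feature-wise Linear Modulation (FiLM), one of the conditioning mechanisms named above, scales and shifts each feature using values predicted from a conditioning vector. The sketch below shows only that operation; the shapes, weights, and the idea that `z` is a driver embedding are placeholders, not VISTA-DZ's actual configuration.

```python
def linear(W, bias, z):
    """Dense layer: returns W z + bias for a list-of-rows W and a vector z."""
    return [sum(w * zi for w, zi in zip(row, z)) + b for row, b in zip(W, bias)]

def film(h, z, W_gamma, b_gamma, W_beta, b_beta):
    """Feature-wise Linear Modulation: scale and shift each feature of h
    using gamma and beta predicted from the conditioning vector z."""
    gamma = linear(W_gamma, b_gamma, z)
    beta = linear(W_beta, b_beta, z)
    return [g * hi + b for g, hi, b in zip(gamma, h, beta)]

# Hypothetical shapes: 3 trajectory features, 2-dimensional driver embedding.
h = [1.0, -2.0, 0.5]
z = [1.0, 0.0]
W_gamma = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_gamma = [0.0, 1.0, 0.0]
W_beta = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
b_beta = [0.5, 0.0, 0.0]
out = film(h, z, W_gamma, b_gamma, W_beta, b_beta)
print(out)  # [2.5, -1.0, 0.5]
```

A different driver embedding `z` would yield different `gamma` and `beta`, so the same trajectory features are reweighted per driver without changing the shared network weights.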

dilemma zone · visual semantic trajectory · feature-wise linear modulation · bidirectional gru · cross-attention

Read original →

Proteus: Automated Adversarial Robustness Testing for Audio Deepfake Detectors

arXiv cs.AI · Nicolas M. Müller, Aditya Tirumala Bukkapatnam, Zohaib Ahmed · 2026-06-28

Proteus introduces an automated framework for adversarial robustness testing of audio deepfake detectors, combining exhaustive breadth-first search and Q-learning to identify effective attack chains. The system evaluates sequences of audio transformations (codec transcoding, noise addition, reverberation, dynamic-range compression, VoIP simulation) that fool detectors while maintaining speech quality. Results from production deployment show specific augmentation chains reliably flip detection verdicts without compromising intelligibility or speaker identity, enabling detector hardening via targeted retraining.
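
To make the breadth-first part of the search concrete, here is a toy sketch that looks for the shortest chain of transformations that flips a detector verdict while keeping quality above a floor. The transforms, the integer scores, and the thresholds are invented for illustration; real transforms would act on audio, and the Q-learning component is not shown.

```python
from collections import deque

# Toy stand-ins: each transform maps (detector_score, quality) to new values,
# both expressed as integer percentages. These numbers are invented.
TRANSFORMS = {
    "codec": lambda s, q: (s - 15, q - 5),
    "noise": lambda s, q: (s - 25, q - 30),
    "reverb": lambda s, q: (s - 20, q - 10),
}

def find_chain(start, max_len=3, fake_threshold=50, min_quality=70):
    """Breadth-first search for the shortest chain that flips the verdict
    (score below fake_threshold) while keeping quality at or above min_quality."""
    queue = deque([((), start)])
    while queue:
        chain, (score, quality) = queue.popleft()
        if score < fake_threshold and quality >= min_quality:
            return chain
        if len(chain) < max_len:
            for name, fn in TRANSFORMS.items():
                queue.append((chain + (name,), fn(score, quality)))
    return None

chain = find_chain((84, 100))
print(chain)  # ('codec', 'reverb')
```

Breadth-first order guarantees that the first chain returned is among the shortest, which matters when longer chains degrade speech quality.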

adversarial robustness · audio deepfake detection · q-learning · audio transformations · automated testing

Read original →

Learned Coordination Conventions in Cooperative MARL: Measuring the Translation Gap Between Theory-Informed Roles and Learned Routing

arXiv cs.AI · Yoosung Hong · 2026-06-28

The study introduces a diagnostic framework for measuring the translation gap between theory-informed role expectations and learned coordination conventions in cooperative multi-agent reinforcement learning (MARL). Using role-routing matrices, formation sensitivity, and gradient/occlusion attribution, the authors analyze coordination structures in MiniGrid and SMACv2 (Terran) environments. Results show that label-conditioned attention outperforms flat MLP baselines in producing role-specific routing, exhibits stability across team sizes (3v3--9v9), and transfers zero-shot. A 5-seed re-evaluation reveals partial alignment with designer-specified priors, highlighting noise-induced strategic divergence. The framework provides empirical insights into coordination structure without proposing new equilibrium concepts.

multi-agent reinforcement learning · role-routing matrix · formation sensitivity · label-conditioned attention · zero-shot transfer

Read original →

Em-ergence of the em-dash: a population-level rise in em-dash frequency in medRxiv preprints at the dawn of the large-language-model era

arXiv cs.AI · Przemysław Czuma · 2026-06-28

This study quantifies a population-level increase in em-dash usage in medRxiv preprints following ChatGPT's release, suggesting that LLM-assisted writing leaves detectable stylistic traces. Analyzing 69,632 Discussion sections (≥500 characters) from 2020-2025 via logistic regression with author-clustered errors, the author finds that em-dash prevalence rose from 4.23% pre-ChatGPT (before Nov 2022) to 11.58% post-ChatGPT (Δ=7.35pp, 95% CI [6.94-7.77]; OR=2.96, 95% CI [2.77-3.17]). The gradual acceleration (4% in 2023, 20.3% in 2025) persisted across sensitivity analyses and falsification tests, and was absent in pre-LLM placebo splits (+0.13pp) and boilerplate sections.
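
The headline effect sizes can be sanity-checked from the two prevalence figures alone. The snippet below computes the crude (unadjusted) difference and odds ratio; the reported OR comes from a regression with author-clustered errors, so a small discrepancy from this back-of-the-envelope value is expected.

```python
def odds(p):
    """Convert a proportion into odds."""
    return p / (1.0 - p)

pre, post = 0.0423, 0.1158          # em-dash prevalence before and after
diff_pp = (post - pre) * 100        # difference in percentage points
odds_ratio = odds(post) / odds(pre)
print(f"difference: {diff_pp:.2f} pp, crude odds ratio: {odds_ratio:.2f}")
```

The crude odds ratio lands within rounding distance of the reported 2.96, and the difference matches the reported 7.35 percentage points.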

em-dash · large language models · stylometric analysis · medrxiv · logistic regression

Read original →

RESOURCE2SKILL: Distilling Executable Agent Skills from Human-Created Multimodal Resources

arXiv cs.AI · Yijia Fan, Zonglin Di, Zimo Wen, Yifan Yang · 2026-06-28

RESOURCE2SKILL introduces a framework for distilling executable agent skills from multimodal human resources, including tutorial videos, repositories, articles, and reference artifacts. The method organizes skills hierarchically in a multimodal Skill Wiki, preserving complementary signals from diverse sources: temporal operations from videos, executable patterns from code, and conceptual grounding from articles. At inference, agents retrieve and compose skills, with online acquisition addressing coverage gaps. Evaluated across seven authoring domains, RESOURCE2SKILL improves average overall score by +11.9 percentage points over no-skill agents and outperforms baselines in 26 of 28 model-domain cells. Ablations highlight the importance of multimodal format, hierarchical organization, source diversity, selection strategy, and online acquisition.

multimodal resources · executable skills · skill wiki · hierarchical organization · online acquisition

Read original →

OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

arXiv cs.AI · Mengqi Yuan, Zilong Zhou, Xinzhuang Xiong, Weiming Wu · 2026-06-28

OSWorld 2.0 introduces a benchmark for evaluating computer-use agents on 108 long-horizon real-world workflows, addressing limitations of prior benchmarks by capturing complex phenomena like streaming interaction, dynamic environments, and cross-source reasoning. Tasks require a median of 1.6 human-hours and an average of 318 tool calls (vs. 30 in OSWorld 1.0), and are grounded in authentic artifacts and user profiles. Under a binary-completion metric at 500 steps, Claude Opus 4.8 achieves only 20.6% task completion (54.8% partial), while GPT-5.5 plateaus at 13%, revealing agents' struggles with hidden state recovery and mid-task information integration.

long-horizon workflows · streaming interaction · cross-source reasoning · implicit-state inference · visual-spatial precision

Read original →

SemJoin: Semantic Join Optimization

arXiv cs.AI · Christopher Gou, Aditya Banerjee, Jiaxuan Wang, Chunwei Liu · 2026-06-28

SemJoin introduces an LLM-agent-based decision pipeline for optimizing semantic joins in relational databases, dynamically selecting execution strategies based on table characteristics. The system employs an LLM advisor to route joins to either a Cluster Join strategy, which uses unsupervised embedding clustering and sample-based filtering, or a Classifier strategy for predicates reducible to discrete label sets. Evaluated on IMDb reviews, email contradictions, and Stack Overflow tags, SemJoin outperforms adaptive block join by 20-33 F1 points across datasets and achieves higher F1 scores than featurized-decomposition join at 1-2 orders of magnitude lower token cost.

semantic join · llm-agent · embedding clustering · token cost · dynamic routing

Read original →

MotionAtlas: Detailed Region Captioning for Motion-Centric Videos

arXiv cs.AI · Weisong Liu, Haochen Wang, Kuan Gao, Yuhao Wang · 2026-06-28

MotionAtlas introduces a system for region-aware motion captioning in videos, addressing visual clutter and motion entanglement through precise spatiotemporal mask-based descriptions. The framework comprises MotionAtlas-Bench, a human-annotated benchmark with 2,073 multiple-choice questions for fine-grained motion understanding, a scalable data pipeline producing 159k high-quality motion captioning samples via self-bootstrap refinement, and a tailored training strategy enhancing Video-MLLMs like Molmo2 and Qwen3-VL. MotionAtlas-4B outperforms Qwen3-VL-4B by 5.2 percentage points on general motion benchmarks. The benchmark, dataset, and code are publicly available.

region-aware motion captioning · spatiotemporal mask · self-bootstrap refinement · video-mllms · fine-grained motion understanding

Read original →

SAKE: Software Architectural Knowledge Evaluation Benchmark for Large Language Models

arXiv cs.AI · Tiziano Santilli, Francesco Daghero, Mayhar Tourchi Moghaddam · 2026-06-28

The paper introduces SAKE, a benchmark for evaluating large language models' (LLMs) software architectural knowledge, addressing a gap in existing benchmarks that focus on syntactic or algorithmic tasks. SAKE comprises 2154 expert-curated multiple-choice questions across eight architectural categories and four context-length levels, tested on 11 LLMs in zero-shot and five-shot settings. Results show high overall accuracy but significant variation across categories, revealing gaps in areas critical to professional practice. The benchmark, evaluation scripts, and results are open-sourced.

software architecture · large language models · benchmark · zero-shot learning · five-shot learning

Read original →

The Verbose Context Problem in Medical Records

arXiv cs.AI · Shiva Kaul, Min-Gyu Kim, Anjum Khurshid, Sriram Vishwanath · 2026-06-28

The paper introduces PopMedQA, a benchmark addressing the verbose context problem in medical records, where structured concepts have token-inefficient textual representations. The benchmark uses neopatient, a library for generating artificial patient records, to evaluate computational tasks on longitudinal records exceeding 400K tokens. Ablations on prompting strategies, prompt compression, and agentic decomposition reveal that domain-independent methods fail to mitigate the issue, highlighting the need for domain-specific input structuring in language models for population-scale reasoning.

verbose context problem · popmedqa · neopatient · longitudinal records · prompt compression

Read original →

UCOB: Learning to Utilize and Evolve Agentic Skills via Credit-Aware On-Policy Bidirectional Self-Distillation

arXiv cs.AI · Songjun Tu, Chengdong Xu, Qichao Zhang, Yiwen Ma · 2026-06-28

The paper proposes UCOB, a framework for improving agentic reinforcement learning through credit-aware bidirectional self-distillation of skill memories. The method treats skill-conditioned and no-skill prompts as on-policy context views, using the higher-return view as a local teacher to guide skill utilization, correction, and memory updates. Evaluations on ALFWorld, WebShop, and Search-QA demonstrate performance gains of up to 23.5 and 18.0 points over state-of-the-art baselines, with ablations confirming the efficacy of core mechanisms.

skill memories · self-distillation · agentic reinforcement learning · credit-aware learning · on-policy training

Read original →

Cognitive World Models for Process-Level Social Influence Evaluation

arXiv cs.AI · Minghui Ma, Bin Guo, Han Wang, Mengqi Chen · 2026-06-28

The paper introduces the Cognitive World Model (CogWM), an LLM-based user model for evaluating process-level social influence in multi-turn dialogues. CogWM jointly predicts BDI/E cognitive states (beliefs, desires, intentions, emotions) and user utterances, functioning as both a user simulator and an evaluation platform. It employs a three-tier framework assessing turn-level fidelity, trajectory-level state dynamics, and task-level composite scoring. Trained on 150,454 user-turn samples via Summarize-and-Allocate (SaA) annotation, CogWM achieves 77.6% emotion accuracy (2.1× GPT-5.5) and distinguishes six commercial agents in 3600 trials, with Llama-4-Scout ranking highest (CTS +0.233).

cognitive world model · bdi/e states · summarize-and-allocate · multi-turn dialogue · social influence

Read original →

Faults in Our Formal Benchmarking: Dataset Defects and Evaluation Failures in Lean Theorem Proving

arXiv cs.AI · Pawan Sasanka Ammanamanchi, Siddharth Bhat, Stella Biderman · 2026-06-28

The paper identifies and categorizes defects in Lean theorem-proving benchmarks, demonstrating that formal verification alone does not guarantee semantic correctness. Through corpus-scale static analysis of five widely used Lean benchmarks, the authors uncover 4,833 issues including 398 mechanically certified defects like vacuous theorems and unsound axioms. They propose a fault taxonomy, automated checkers, and audit prompts to improve dataset quality, showing on corrected subsets that benchmark defects can significantly distort prover performance evaluations.

lean theorem proving · formal verification · benchmark defects · static analysis · semantic correctness

Read original →

Reported Confidence in LLMs Tracks Commitment More Than Correctness

arXiv cs.AI · Dharshan Kumaran · 2026-06-28

The study demonstrates that verbal confidence reports in large language models (LLMs) primarily reflect commitment readiness rather than answer correctness, challenging their use as reliability proxies. Using a two-stage abstention paradigm across four non-reasoning models and multiple prompt framings, verbal confidence predicted commit/abstain decisions better than correctness, while token log-probabilities showed the opposite pattern. Mechanistic analyses in Gemma 3 and 4 revealed that post-answer activations encoded abstention decisions orthogonally to correctness, with steering along confidence-specific directions causally altering abstention behavior.

verbal confidence · log-probabilities · abstention paradigm · commit-readiness · correctness discrimination

Read original →

To Reason or to Fabricate: Reasoning Without Shortcuts via Hint-Anchored Pairwise Aggregation

arXiv cs.AI · Jiuheng Lin, Chen Zhang, Yansong Feng · 2026-06-28

HIPPO introduces a reinforcement learning framework addressing shortcut exploitation in LLM reasoning caused by Pre-RL data overlap. The method integrates hint-injected aggregation and a pairwise reward model, leveraging hint injection to expose overlap-induced behaviors and generate discriminable preference signals. This enables a lightweight judge model to reliably distinguish genuine reasoning from shortcut-driven rationalization while ensuring stable optimization. Experiments demonstrate HIPPO's substantial improvements over baselines and effective generalization to out-of-distribution tasks, confirming its ability to extract authentic, transferable reasoning skills.

reinforcement learning · pre-rl data overlap · hint-injected aggregation · pairwise reward model · shortcut exploitation

Read original →

CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning

arXiv cs.AI · Zibin Meng, Kani Chen · 2026-06-28

CRAFT introduces a three-pillar credit-assignment scheme for self-distilled agentic reinforcement learning, addressing limitations in retrospective and sign-blind token-level distillation loss. Pillar 1 leverages sibling rollouts to estimate counterfactual advantage changes, Pillar 2 employs an asymmetric controller for distillation weight adjustment, and Pillar 3 polarises the KL penalty based on credit signs. The method ensures bit-exact reproducibility and proves estimator consistency and variance bounds. Evaluated across three environments, four model scales, and five methods, CRAFT demonstrates significant improvements, isolating counterfactual contributions effectively.

self-distilled reinforcement learning · counterfactual advantage · asymmetric controller · kl penalty · bit-exact reproducibility

Read original →

A Posteriori Error Analysis for Decoupled Neural Approximations of Fully Coupled FBSDEs with Control Mismatch

arXiv cs.AI · Xichuan Zhang · 2026-06-28

The paper develops an a posteriori error analysis framework for neural approximations of fully coupled forward-backward stochastic differential equations (FBSDEs) with decoupled controls. It introduces an auxiliary control process in the forward coefficients, distinct from the backward component approximated by the neural network, and analyzes the resulting control mismatch. The method derives computable error bounds depending on terminal defect, pathwise residual, and control mismatch, validated through numerical experiments on linear-quadratic and Burgers-type FBSDEs.

a posteriori error analysis · forward-backward sdes · neural approximations · control mismatch · decoupled controls

Read original →

Agent-Computer Observation Interfaces Enable Dynamic Computer Use

arXiv cs.AI · Bojie Li, Noah Shi · 2026-06-28

The paper introduces the Agent-Computer Observation Interface (AOI), a model-agnostic perception layer that decouples continuous observation from discrete actions in computer-use (CU) agents. AOI comprises three gated components: inter-step keyframe capture, volume-gated audio transcription, and CU-model-generated visual narration that persists as text. Evaluated on DynaCU-Bench (150 tasks), CU models from 7B to frontier scale achieve +17 to +48 percentage point improvements over screenshot baselines without retraining, with AOI agents solving all tasks involving spoken content. Analysis reveals that keyframe selection is less critical than narrating frames into persistent text, and optimal component configurations vary across models like Gemini 3 Flash due to image-token dilution effects.

agent-computer observation interface · dynacu-bench · visual narration · image-token dilution · volume-gated audio transcription

Read original →

One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

arXiv cs.LG · Philip Zmushko, Egor Petrov, Nursultan Abdullaev, Mikhail Khrushchev · 2026-06-29

This work challenges the assumption that one-step gradient delay inherently causes instability in asynchronous pipeline parallelism for LLM pretraining, demonstrating that optimizer choice is the critical factor. Through empirical analysis, the authors show that while AdamW suffers severe degradation under PipeDream-2BW's one-step delay, newer optimizers like Muon remain robust. They propose an optimizer-agnostic Error Feedback-inspired correction and provide theoretical convergence guarantees for Muon with and without this modification. Experiments on models up to 10B parameters confirm that their approach bridges the performance gap with synchronous training, enabling practical large-scale asynchronous pipeline parallelism.
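
To see why a one-step delay is usually assumed to hurt, consider plain gradient descent on a toy quadratic, where each update uses the gradient evaluated at a stale iterate. This is only a sketch of the delay mechanism under that simplifying assumption: it does not model AdamW, Muon, PipeDream-2BW scheduling, or the paper's Error Feedback correction.

```python
def gd_with_delay(grad, w0, lr, steps, delay):
    """Plain gradient descent where each update uses the gradient evaluated
    `delay` steps in the past (delay=0 is ordinary synchronous descent)."""
    traj = [w0]
    for t in range(steps):
        stale = traj[max(t - delay, 0)]
        traj.append(traj[-1] - lr * grad(stale))
    return traj

grad = lambda w: w  # gradient of the toy objective f(w) = 0.5 * w**2

# Small step size: both the synchronous and the delayed run shrink toward 0.
sync_small = gd_with_delay(grad, 1.0, lr=0.1, steps=50, delay=0)
async_small = gd_with_delay(grad, 1.0, lr=0.1, steps=50, delay=1)

# Larger step size: the synchronous run still converges, the delayed one blows up.
sync_large = gd_with_delay(grad, 1.0, lr=1.5, steps=50, delay=0)
async_large = gd_with_delay(grad, 1.0, lr=1.5, steps=50, delay=1)
print(abs(sync_large[-1]), max(abs(w) for w in async_large[-2:]))
```

On this toy problem the delay narrows the range of stable step sizes rather than breaking training outright, which is consistent with the paper's framing that how the optimizer shapes the update matters more than the delay itself.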

asynchronous pipeline parallelism · gradient delay · llm pretraining · muon optimizer · error feedback

Read original →

Wireless Backdoor Attack and Defense for Semantic Communications over Multiple Access Channel

arXiv cs.LG · Yalin E. Sagduyu, Tugba Erpek, Aylin Yener, Sennur Ulukus · 2026-06-29

The paper introduces a selective over-the-air backdoor attack targeting semantic communication (SemCom) systems over multiple access channels, where an adversary injects low-power trigger waveforms to manipulate semantic inference for one transmitter while minimally affecting others. A trigger-aware defense mechanism is proposed to mitigate this vulnerability through robust training. Experimental results demonstrate both the attack's effectiveness in selectively compromising SemCom systems and the defense's success in preserving correct semantic labels under trigger-contaminated conditions.

semantic communication · backdoor attack · multiple access channel · wireless security · robust training

Read original →

A Hybrid Framework For Crypto-Ransomware Detection In Enterprise Shared Storage

arXiv cs.LG · Gervais Hatungimana, Abdun Naser Mahmood, Mohammad Jabed Morshed Chowdhury · 2026-06-29

The paper proposes a hybrid framework for detecting crypto-ransomware in enterprise shared storage environments, combining signature-based Indicators of Compromise (IoCs) with machine learning. The method introduces Region of Interest (RoI) analysis for network traffic feature extraction, enhancing existing security tools like EDRs and IDSs. Evaluated across multiple ransomware families, the ML module achieves 99.64% precision, 0% FNR, and minimal FPR, with 99.44% accuracy in early intrusion detection before significant damage occurs.
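
For readers less familiar with the reported metrics, the sketch below shows how precision, false negative rate (FNR), false positive rate (FPR), and accuracy follow from confusion-matrix counts. The function name and the counts are hypothetical and are not taken from the paper.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return precision, FNR, FPR and accuracy from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),   # share of alerts that are real attacks
        "fnr": fn / (fn + tp),         # share of attacks that were missed
        "fpr": fp / (fp + tn),         # share of benign samples that raised an alert
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = detection_metrics(tp=90, fp=10, tn=880, fn=20)
print(metrics)
```

A 0% FNR, as reported for the ML module, corresponds to `fn == 0`: no ransomware sample in the evaluation set was missed.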

crypto-ransomware · indicators of compromise · region of interest · enterprise shared storage · early detection

Read original →

Uncertainty-Aware Generation and Decision-Making Under Ambiguity

arXiv cs.LG · Nico Daheim, Iryna Gurevych · 2026-06-29

The paper introduces uncertainty-aware decision-making algorithms for LLMs, leveraging Bayesian decision theory and risk-averse strategies in tutoring and peer-review tasks. Methods include conformal prediction for strategy and score guarantees, with empirical evaluation showing Bayesian approaches outperform risk-averse rules when ambiguity is high. Results indicate improved generation utility but highlight trade-offs in optimizing for generic outputs under high ambiguity.
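
As a rough illustration of the conformal prediction ingredient, the sketch below implements generic split conformal prediction: a finite-sample-corrected quantile of calibration scores, then a prediction set of candidates under that threshold. The nonconformity scores, candidate names, and function names are hypothetical; the paper's own scoring functions for tutoring strategies and review scores are not shown here.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank in the sorted scores
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [c for c, s in candidate_scores.items() if s <= threshold]


# Hypothetical calibration scores and candidates.
threshold = conformal_quantile([0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6], alpha=0.25)
print(prediction_set({"hint": 0.2, "full_solution": 0.95}, threshold))
```

Under exchangeability of calibration and test points, sets built this way contain the true label with probability at least 1 - alpha; whether that assumption holds in the paper's tasks is not something this sketch addresses.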

large language models · bayesian decision theory · conformal prediction · risk-averse decision making · uncertainty-aware generation

Read original →

The Fundamental Limits of Valid Transport Map Estimation

arXiv cs.LG · Sivaraman Balakrishnan · 2026-06-29

The authors formalize the estimation of valid transport maps within a minimax framework, establishing sample complexity lower bounds for transport-based generative methods like flow matching and diffusion models. By leveraging stability assumptions from optimal transport (OT) theory, they demonstrate that estimating any valid transport map is statistically equivalent to estimating the OT map. However, when these assumptions fail, alternative transport maps can be learned more accurately than the OT map. This analysis provides a rigorous foundation for understanding the statistical limits of modern transport-based generative modeling techniques.

optimal transport · minimax framework · sample complexity · transport maps · generative modeling

Read original →

SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions

arXiv cs.LG · Mohit Raghavendra, Anisha Gunjal, Aakash Sabharwal, Yunzhong He · 2026-06-29

SWE-Interact introduces a novel benchmark for evaluating coding agents in multi-turn, user-driven software engineering workflows, contrasting with traditional single-turn SWE benchmarks. The method employs a user simulator that progressively reveals requirements and provides feedback, testing agents' ability to discover intent and adapt to evolving constraints. Results show performance drops from 50% (single-turn) to 25% (multi-turn), with top models like Opus 4.8 and GPT 5.5 demonstrating better requirement integration but still suffering from over-agentic coding and technical errors.

swe benchmarks · coding agents · multi-turn interaction · user simulator · iterative refinement

Read original →

Attractor States Emerge in Multi-Turn LLM Conversations

arXiv cs.LG · Ting-Wen Ko, Jonas Geiping · 2026-06-29

The study identifies model-specific attractor states in multi-turn LLM conversations, demonstrating their influence on stylistic and behavioral patterns. Using 7 LLMs across 20 controversial topics, the authors analyze self-play and mixed-play dyadic debates through representation space trajectories, discourse traits, and stance tracking. Results reveal asymmetric attractor effects, with Claude Haiku strongly influencing other models' latent space positions and GPT-4.1 nano showing high malleability, suggesting predictable dynamics in open-ended multi-agent interactions.

attractor states · multi-agent interaction · latent space · self-play · discourse traits

Read original →

Forensic Trajectory Signatures for Agent Memory Poisoning Detection

arXiv cs.LG · Jun Wen Leong · 2026-06-29

The study identifies a behavioral invariant in LLM agents under memory poisoning attacks, demonstrating that successful attacks require a specific sequence of memory_recall_fact before email_send_email, which non-exfiltrating sessions rarely exhibit. Using a rule-based approach exploiting this invariant achieves AUC = 0.9563, while a Random Forest classifier over 19 trajectory features improves detection to AUC = 0.9904. Cross-model validation on 9 models (7B-120B parameters) confirms AUC = 1.000 on 6/9 hold-out splits, with generalization to frontier models like GPT-4.1 and GPT-4o without retraining. The method enables real-time blocking with AUC = 0.934 and distinguishes memory-channel attacks from prompt-injection attacks (score = 0.541) using tool-call logs.
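
The rule-based detector described above can be sketched as a single pass over a session's tool-call log. The tool names come from the summary; the log format (an ordered list of tool names) and the function name are assumptions, and the paper's actual rule may include further conditions.

```python
def recall_precedes_send(tool_calls,
                         recall="memory_recall_fact",
                         send="email_send_email"):
    """Return True if some `send` call occurs after an earlier `recall` call."""
    seen_recall = False
    for name in tool_calls:
        if name == recall:
            seen_recall = True
        elif name == send and seen_recall:
            return True  # the ordered pair required by the invariant was observed
    return False


# Hypothetical sessions, for illustration only.
suspicious = ["web_search", "memory_recall_fact", "email_send_email"]
benign = ["email_send_email", "memory_recall_fact", "web_search"]
print(recall_precedes_send(suspicious), recall_precedes_send(benign))
```

Because the check needs only the calls seen so far, it could in principle run before the send is executed, which is the kind of real-time blocking the summary mentions.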

memory poisoning · behavioral invariant · auc · random forest · tool-call logs

Read original →

Convergence of Continual Learning in Homogeneous Deep Networks

arXiv cs.LG · Matan Schliserman, Gon Buzaglo, Itay Evron, Daniel Soudry · 2026-06-29

The paper characterizes continual classification in weakly regularized homogeneous models as sequential projections onto task margin sets, generalizing prior analyses limited to stationary deep models or continual linear models. The authors demonstrate that global convergence typically fails, even for simple models linear in data but nonlinear in parameters. Using nonconvex projection theory, they identify regularity properties in homogeneous deep networks that ensure local linear convergence under random and cyclic task sequences, extending the analysis to continual regression for unified treatment of homogeneous models.
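
The "sequential projections" picture is easiest to see in the linear special case covered by prior work, where each task's margin set is a convex halfspace. The sketch below assumes one labeled example per task and a cyclic task order; both are simplifications made for illustration, and the nonconvex deep-network setting analyzed in the paper is not reproduced.

```python
def project_onto_margin_set(w, x, y):
    """Euclidean projection of w onto the halfspace {w : y * <w, x> >= 1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    if margin >= 1.0:
        return list(w)  # already satisfies this task's margin constraint
    step = (1.0 - margin) / sum(xi * xi for xi in x)
    return [wi + step * y * xi for wi, xi in zip(w, x)]


def cyclic_projections(tasks, w0, cycles):
    """Visit the tasks in a fixed cyclic order, projecting once per visit."""
    w = list(w0)
    for _ in range(cycles):
        for x, y in tasks:
            w = project_onto_margin_set(w, x, y)
    return w


# Two hypothetical tasks, each a single (input, label) pair.
tasks = [((1.0, 0.0), 1), ((-1.0, 2.0), 1)]
w = cyclic_projections(tasks, w0=(0.0, 0.0), cycles=50)
print(w)
```

In this convex example the iterates approach a point that satisfies both margin constraints, and the remaining violation shrinks by a constant factor per cycle, which is what linear convergence means here.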

continual learning · homogeneous models · task margin sets · nonconvex projection · local linear convergence

Read original →

Bridging the NISQ and Fault-Tolerant Regimes: Generative-ML-Assisted Quantum Selected CI for Molecular Simulations

arXiv cs.LG · Anurag K. S. V., Ashish Kumar Patra, Manas Mukherjee, Ruchika Bhat · 2026-06-29

(No summary returned.)

Read original →

$μ$Flow: Leveraging Average Images for Improving Generalisation of Deepfake Faces Detectors

arXiv cs.LG · Orazio Pontorno, Mattia Litrico, Luca Guarnera, Mario Valerio Giuffrida · 2026-06-29

The paper introduces $μ$Flow, a one-class deepfake detector trained exclusively on real images to improve generalization across unseen generators (GANs vs diffusion models). The method exploits averaged images to amplify generative traces, modeling their feature distribution with normalizing flows and aligning individual images to this distribution via likelihood-based separation. Evaluated in a fully out-of-distribution setting, $μ$Flow outperforms state-of-the-art detectors without relying on synthetic artifacts or pseudo-deepfakes.

deepfake detection · one-class learning · normalizing flow · feature distribution · generative traces

Read original →

ITSPACE: Monotone Gaussian Optimal Transport Updates

arXiv cs.LG · Woojoo Na, Jennifer Dy · 2026-06-29

The paper introduces ITSPACE, a proximal majorization-minimization method for optimizing the exact Bures-Wasserstein (BW) objective on symmetric positive definite matrices. The method employs closed-form updates in a square-root factorization, ensuring PSD structure preservation and supporting rank-restricted factors. Theoretical guarantees include a sufficient-decrease inequality in exact arithmetic and a certificate-gap bound for inexact polar computations. Empirical results show ITSPACE converges faster than BW-gradient descent, alternative covariance-geometry methods, and entropic OT baselines on real-world covariance-alignment tasks.

bures-wasserstein · optimal transport · covariance alignment · proximal optimization · spd cone

Read original →

Staged Hybridisation for Visual Quantum Reinforcement Learning via Knowledge Distillation

arXiv cs.LG · Javier Lazaro, Juan-Ignacio Vazquez, Pablo Garcia-Bringas · 2026-06-29

The paper proposes staged knowledge distillation (KD) as a hybrid strategy for visual quantum reinforcement learning (QRL), addressing challenges in high-dimensional observations and unstable training. By first training a classical visual teacher, freezing its encoder, and distilling policy behavior into compact classical or variational quantum circuit (VQC)-based heads, the method enables quantum-compatible students to learn efficiently. Evaluated on CartPole Pixels and Acrobot Pixels, angle-encoded VQC heads achieve near-teacher performance, while amplitude-encoded heads trade compactness for fragility and simulation time. The approach reframes visual QRL as a compact-head learning problem.

quantum reinforcement learning · knowledge distillation · variational quantum circuits · visual control · compact-head learning

Read original →

Muon learns balanced solutions in matrix factorization without slow saddle-to-saddle dynamics

arXiv cs.LG · Mark Rhee, Jamie Simon, Dhruva Karkada · 2026-06-29

The paper analyzes Muon, an existing optimizer, on matrix factorization problems and shows that its dynamics differ from those of gradient descent. Muon avoids slow saddle-to-saddle dynamics by learning all top modes of the target matrix simultaneously, with smaller modes converging first. It remains stable at learning rates exceeding the critical threshold set by local loss sharpness, enabling rapid convergence via exponential annealing. Muon conserves the matrix quantity √(PᵀP) - √(QᵀQ), which differs from the quantity PᵀP - QᵀQ conserved by gradient flow, yet both find balanced solutions. Theoretical alignment rates are derived and empirically validated, and a proposed two-step schedule achieves near-perfect alignment.
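
The gradient-flow conservation law quoted above can be checked in a few lines. The derivation below assumes the squared loss L(P, Q) = ½‖PQᵀ − M‖² (Frobenius norm) with target matrix M; this loss and the PQᵀ parameterization are assumptions, since the summary does not state them. It covers gradient flow only and does not derive Muon's conserved quantity.

```latex
\begin{aligned}
E &= PQ^\top - M, \qquad \dot P = -\,EQ, \qquad \dot Q = -\,E^\top P,\\
\frac{d}{dt}\bigl(P^\top P\bigr) &= \dot P^\top P + P^\top \dot P
  = -\,Q^\top E^\top P - P^\top E Q,\\
\frac{d}{dt}\bigl(Q^\top Q\bigr) &= \dot Q^\top Q + Q^\top \dot Q
  = -\,P^\top E Q - Q^\top E^\top P,\\
\frac{d}{dt}\bigl(P^\top P - Q^\top Q\bigr) &= 0 .
\end{aligned}
```

The two derivatives contain the same pair of terms, so their difference vanishes along the flow.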

matrix factorization · optimizer dynamics · learning rate annealing · balanced solutions · alignment rates

Read original →

Doubly Robust Adaptive Conformal Inference for Causal Effects Under Temporal Dependence

arXiv cs.LG · Andreas Koukorinis, Ricardo Silva · 2026-06-29

The authors propose doubly robust adaptive conformal inference (DR-ACI), a method for constructing prediction intervals for doubly robust pseudo-outcomes in temporally dependent data. DR-ACI combines doubly robust estimation with adaptive conformal inference to achieve valid coverage guarantees under distribution shifts. The approach handles time-series dependencies without requiring strict stationarity assumptions. Theoretical results demonstrate marginal coverage guarantees, while empirical evaluations on synthetic and real-world datasets show improved interval width compared to non-adaptive baselines.
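
The adaptive part of the method can be illustrated with the standard adaptive conformal inference update, which nudges the working miscoverage level after every observation. The sketch below shows that update only; the function names and the step size `gamma` are illustrative assumptions, and the computation of doubly robust pseudo-outcomes and the intervals themselves is omitted.

```python
def aci_update(alpha_t, target_alpha, covered, gamma=0.01):
    """One adaptive conformal inference step for the working miscoverage level."""
    err = 0.0 if covered else 1.0  # 1 when the last interval missed the outcome
    return alpha_t + gamma * (target_alpha - err)


def run_aci(coverage_flags, target_alpha=0.1, gamma=0.01):
    """Replay a fixed sequence of hit/miss flags and record the level over time."""
    alpha_t = target_alpha
    trace = [alpha_t]
    for covered in coverage_flags:
        alpha_t = aci_update(alpha_t, target_alpha, covered, gamma)
        trace.append(alpha_t)
    return trace


# Hypothetical sequence: nine covered outcomes followed by one miss.
print(run_aci([True] * 9 + [False]))
```

A miss lowers the working level, so the next interval is built at a higher nominal coverage and comes out wider; a hit raises it slightly. In a real run the hit/miss flags would depend on the intervals produced at each step, which this replay does not model.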

conformal inference · doubly robust estimation · temporal dependence · prediction intervals · distribution shift

Read original →

Discovering Collaboration from Novelty: Random Network Distillation for Clustered Federated Learning

arXiv cs.LG · Davide Domini, Gianluca Aguzzi, Ivana Dusparic, Danilo Pianini · 2026-06-29

The paper proposes a lightweight clustering method for Clustered Federated Learning using Random Network Distillation (RND) to address non-IID data challenges. Clients train compact RND predictors locally, using prediction errors as novelty signals to estimate similarity and form clusters before federated training. This approach decouples clustering from model training, reducing computational and communication costs while enabling autonomous federation without predefined cluster counts or structures. The method is task-agnostic and suitable for large-scale distributed systems.

clustered federated learning · random network distillation · non-iid data · novelty signal · autonomous collaboration

Read original →

GPU Parallelization Strategies for Forward and Backward Propagation in Shallow Neural Networks: A CUDA-Based Comparative Study

arXiv cs.LG · Rania Zitouni, Nadine Bousdjira, Sarah Hasnaoui, Amel Sadoun · 2026-06-29

The study evaluates CUDA optimization strategies for forward and backward propagation in shallow neural networks, comparing three techniques: tiled shared memory with bank-conflict elimination, pre-transposed weight matrices for coalesced access, and a fused MatMul+ReLU kernel. Implemented on an NVIDIA Tesla T4 (CUDA 13.0), the fully optimized version achieves a 1.41x speedup over the baseline CUDA implementation on large datasets (25,600 samples), reducing execution time from 21.0s to 14.8s. Results demonstrate significant performance gains from memory-access optimizations in GPU-accelerated deep learning primitives.

cuda · gpu optimization · neural networks · memory coalescing · kernel fusion

Read original →

Factorizable Normalizing Flows for parameter-dependent density morphing

arXiv cs.LG · Davide Valsecchi, Mauro Donegà, Rainer Wallny · 2026-06-29

Factorizable Normalizing Flows (FNFs) are introduced to model parameter-dependent density deformations efficiently, addressing the intractability of learning separate flows for each parameter configuration. FNFs decompose the problem into a fixed high-fidelity flow for a reference configuration and a learnable transformation polynomial in parameters, factorized over them. This allows learning each parameter's effect in isolation and combining them via summation at inference, avoiding combinatorial sampling. On a controlled problem with two deformations, FNFs reproduce true deformations, match optimal likelihood, and capture residual correlations with optional interaction terms. The method scales linearly with parameters, maintains tractable likelihood, and enables unbinned likelihood fits in high energy physics.

normalizing flows · parameter-dependent density · factorizable transformations · unbinned likelihood · high energy physics

Read original →

Non-parametric recovery of causal diffusion mechanisms from steady-state observations

arXiv cs.LG · Richard Schwank, Mathias Drton · 2026-06-29

The paper presents a non-parametric method to recover the causal drift mechanism of continuous-time diffusion processes from cross-sectional steady-state observations. Assuming a known acyclic causal graph and time-homogeneous diffusion dynamics, the authors prove identifiability under a non-explosion condition and derive a consistent kernel estimator. Theoretical analysis includes consistency guarantees and a cross-validation scheme for hyperparameter tuning, with empirical validation through simulations. Connections to irreversible generative diffusion models and low-frequency sampling are discussed.

causal inference · diffusion processes · non-parametric estimation · steady-state analysis · kernel methods

Read original →

MuonSSM: Orthogonalizing State Space Models for Sequence Modeling

arXiv cs.LG · Thai-Khanh Nguyen, Ngoc-Bich-Uyen Vo, Thieu N. Vo, Tan M. Nguyen · 2026-06-29

MuonSSM introduces a framework for stabilizing state space models (SSMs) by conditioning memory update geometry rather than recurrent transitions. The method combines momentum-based pathways with Newton-Schulz transformations on low-rank inputs, maintaining parallel scan complexity while bounding updates. Theoretical analysis shows improved gradient propagation and spectral conditioning. Experiments across language, vision, and time-series tasks demonstrate accuracy and robustness gains in long-context settings when integrated into SSM variants.
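
For context on the orthogonalization step, the sketch below runs the classic cubic Newton-Schulz iteration, which pushes every singular value of a matrix toward 1 using only matrix products. This is the textbook variant in pure Python, chosen for readability; the coefficients, iteration count, and low-rank handling used in MuonSSM are not given in the summary and may differ.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(x, iters=20):
    """Approximate the orthogonal polar factor of x with the cubic iteration."""
    fro = sum(v * v for row in x for v in row) ** 0.5
    y = [[v / fro for v in row] for row in x]  # singular values are now at most 1
    for _ in range(iters):
        yyt_y = matmul(matmul(y, transpose(y)), y)
        # Y <- 1.5 * Y - 0.5 * Y Y^T Y maps each singular value s to 1.5*s - 0.5*s**3.
        y = [[1.5 * y[i][j] - 0.5 * yyt_y[i][j] for j in range(len(y[0]))]
             for i in range(len(y))]
    return y


q = newton_schulz_orthogonalize([[0.0, 2.0], [1.0, 0.0]])
print(q)
```

The Frobenius normalization keeps the starting singular values inside the iteration's convergence region; very small singular values would need more iterations than the default used here.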

state space models · sequence modeling · spectral conditioning · momentum pathway · newton schulz transformation

Read original →

HSAP: A Hierarchical Sequence-aware Parallelism for Hybrid-Context Generative Models

arXiv cs.LG · Songxin Zhang, Zejian Xie, Zhuoyang Song, Cong lin · 2026-06-29

The paper proposes Hierarchical Sequence-Aware Parallelism (HSAP), a novel framework combining existing sequence parallelism paradigms while addressing their limitations in handling hybrid-context packed sequences. HSAP introduces a Sequence-Aware Parallelism algorithm that optimizes tensor transmission and partial attention computation across device groups using JIT compilation for NCCL-level communication. The hierarchical framework manages memory and communication overhead effectively. Experimental results demonstrate HSAP's superiority over state-of-the-art sequence parallelism approaches across multiple metrics.

sequence parallelism · hybrid-context · jit compilation · attention computation · nccl

Read original →

Curvature-Weighted Gradient Diversity: A Noise Measure for Geometry-Adaptive SGD Schedules

arXiv cs.LG · Muhammad Hamza, Ayush Goel · 2026-06-29

The paper introduces Curvature-Weighted Gradient Diversity (CWGD), a geometry-aware noise measure for SGD that weights per-sample gradient diversity by the inverse square root of the Hessian. The method employs a Hutchinson-based diagonal Hessian estimator and a CWGD-modulated cosine learning-rate schedule, proving a 2x reduction in asymptotic optimization error for strongly convex quadratic objectives with diagonal Hessians. Experiments show CWGD-Cosine achieves ~20% lower final error than standard cosine annealing across various condition numbers, batch sizes, and noise structures, with negligible overhead in quadratic settings. Limitations include Hessian staleness in non-convex optimization.
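
The Hutchinson-based diagonal estimator mentioned above relies on the identity diag(H) = E[z ⊙ Hz] for random sign vectors z, so it needs only Hessian-vector products. The sketch below applies it to a small explicit matrix; the matrix, sample count, and function names are illustrative assumptions, and the CWGD measure and learning-rate schedule themselves are not implemented.

```python
import random


def hutchinson_diag(hvp, dim, samples=2000, seed=0):
    """Estimate diag(H) as the average of z * (H z) over Rademacher vectors z."""
    rng = random.Random(seed)
    est = [0.0] * dim
    for _ in range(samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hz = hvp(z)  # one Hessian-vector product per sample
        for i in range(dim):
            est[i] += z[i] * hz[i] / samples
    return est


# Hypothetical symmetric "Hessian", used only through matrix-vector products.
H = [[4.0, 1.0, 0.0], [1.0, 3.0, 0.5], [0.0, 0.5, 2.0]]


def hvp(v):
    return [sum(H[i][j] * v[j] for j in range(3)) for i in range(3)]


print(hutchinson_diag(hvp, dim=3))
```

The estimate is exact when H is diagonal and otherwise carries noise from the off-diagonal entries that shrinks as the number of samples grows.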

curvature-weighted gradient diversity · geometry-aware optimization · hutchinson estimator · cosine annealing · hessian staleness

Read original →

Exploring Differences Between Tabular Enterprise Data and Public Benchmarks

arXiv cs.LG · Myung Jun Kim, Maximilian Schambach, Frank Essenberger, Andre Sres · 2026-06-29

This work identifies key differences between enterprise tabular data and public benchmarks, highlighting the need for domain-specific evaluation. The authors analyze data statistics and measure performance of tabular models (TabPFN, TabICL, ConTextTab) on enterprise datasets. Results demonstrate poor generalization between public benchmarks and enterprise data, with models excelling on one domain often underperforming on the other. The findings underscore the necessity for additional benchmarks that capture enterprise-grade characteristics to advance tabular machine learning in business applications.

tabular data · enterprise data · benchmarking · generalization · tabpfn

Read original →

Internal-State Probes Read the Situation, Not the Action: Three Negative Results for Pre-Action Misalignment Monitoring

arXiv cs.LG · Max Fomin, Elad David, Amit LeVi · 2026-06-29

The study evaluates three methods for pre-action misalignment monitoring using internal-state probes in agentic systems, finding negative results across all approaches. Methods tested include fine-tune/base direction separation in Qwen2.5-Coder-32B-Instruct, last-token probes in Llama-3.1-8B-Instruct, and emotion-concept vectors in Gemma-3-27B-IT. Results show that while probes achieve high AUC scores (up to 1.000 for Qwen), they fail to generalize as robust pre-action monitors, with specificity and transferability limitations across domains and scenarios. The work provides a methodology for testing internal-readout claims against generalization and specificity controls.

internal-state probes · pre-action monitoring · misalignment detection · generalization checks · specificity controls

Read original →

When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon

arXiv cs.LG · Huaqing Zhang, Jingchu Gai, Juno Kim, Bingbin Liu · 2026-06-29

The work challenges the prevailing view that error accumulation primarily explains the advantages of online imitation learning (IL) in LLM post-training, demonstrating instead that realizability (whether the student policy class can represent the expert policy) is key. Through empirical and theoretical analysis, the authors show that offline IL matches expert performance under realizability, while non-realizable settings introduce an information-theoretic bottleneck even for horizon $H=1$. They propose a structural characterization of misspecification relative to rewards, proving that online IL achieves high performance despite distributional mismatch.

online imitation learning · realizability · misspecification · policy distillation · distributional mismatch

Read original →

SGD Provably Prioritizes a Shortcut Spurious Feature in the XOR Model

arXiv cs.LG · Tyler LaBonte, Vidya Muthukumar · 2026-06-29

The work provides the first theoretical characterization of spurious feature learning in two-layer ReLU networks trained via online minibatch SGD on logistic loss, using high-dimensional Boolean hypercube data with XOR signal and linear spurious correlation. Analysis reveals SGD learns the spurious feature exponentially fast, with dynamics coupling spurious and signal features such that stronger spurious components inhibit signal learning. Phase transitions show initial rapid spurious feature growth driven by sign alignment, followed by suppressed signal learning due to large majority group margin. Theoretical results demonstrate spurious feature dominance even at XOR sample complexity thresholds when correlation is maximal.

spurious correlation · xor model · relu networks · sgd dynamics · feature learning

Read original →

CAN We Trust Your Results? A Cross-Dataset Study of Automotive IDS Evaluation

arXiv cs.LG · Beatrix Koltai, Gergely Acs, Andras Gazdag · 2026-06-29

This study introduces a standardized benchmarking framework to evaluate CAN bus Intrusion Detection Systems (IDS) across seven diverse datasets, addressing inconsistencies in prior evaluations. The framework enables cross-dataset comparison of five distinct IDS methodologies, revealing significant performance variations dependent on dataset characteristics. Results demonstrate that current IDS evaluations lack generalizability, emphasizing the need for robust cross-dataset validation to assess true detection capabilities in varying automotive network environments.

can bus · intrusion detection systems · cross-dataset evaluation · benchmarking framework · automotive security

Read original →

Arko-T: A Foundation Model for Text-to-Structured 3D Generation

arXiv cs.LG · Liang Wang, Zhaoyang Xi, Zekai Xiang, Heng Meng · 2026-06-29

Arko-T introduces a 4B-parameter foundation model for text-to-structured 3D generation, mapping natural-language intent directly into executable, parametric CAD programs. Unlike existing text-to-3D systems that produce renderable shapes, Arko-T ensures CAD artifacts remain editable by aligning pipeline stages—data curation, code normalization, and execution-grounded supervision—to a formal notion of design state. Evaluated against seven frontier LLMs across 12 metrics, Arko-T achieves the best score on 8 metrics and the second-best on 3, at approximately one-tenth the per-benchmark cost. Results demonstrate that targeted design-level training at moderate scale can rival general-purpose models in structured CAD generation.

text-to-3d · parametric cad · design state · execution-grounded supervision · code normalization

Read original →

Proofs of Ownership for Machine Learning Models

arXiv cs.LG · Ran Canetti, Shafi Goldwasser, Or Zamir · 2026-06-29

The paper introduces a formal framework for Proof of Ownership (PoW) in machine learning models, addressing the challenge of verifying model ownership in cases of theft. The authors model PoW as a three-party game involving a model owner, a thief, and a judge, where the owner generates a perturbed model and a proof, the thief modifies it to evade detection, and the judge determines ownership. Under standard cryptographic assumptions, the authors establish a dichotomy for classifiers in the black-box setting: ownership can be proven if and only if the concept class is not self-correctable, extending results from Blum et al. (STOC'90).

proof of ownership · machine learning models · black-box setting · self-correctable · cryptographic assumptions

Read original →

Experience Augmented Policy Optimization for LLM Reasoning

arXiv cs.LG · Jinda Lu, Kexin Huang, Junkang Wu, Shuo Yang · 2026-06-29

The paper proposes Experience-Augmented Policy Optimization (EAPO), a method to enhance large language model reasoning by reusing experience adaptively in reinforcement learning with verifiable rewards (RLVR). EAPO employs a prior RL-optimized policy as an action-level experience prior, selectively injecting experience at critical decision points during rollout, and uses an adapted importance sampling scheme for stable learning. Evaluations on Qwen-2.5-math 7B and Qwen-3-8B across five benchmarks show EAPO outperforms state-of-the-art RLVR methods in reasoning performance.

reinforcement learning · large language models · policy optimization · importance sampling · reasoning benchmarks

Read original →

Diffusion Fine-tuning with Rewarded Moment Matching Distillation

arXiv cs.LG · Alexis Jacq, Guillaume Couairon, Valentin De Bortoli, Quentin Berthet · 2026-06-29

We introduce Rewarded Moment Matching Distillation (RMMD), a framework combining diffusion model distillation with reward maximization. RMMD adapts the sampling loop for on-policy training and repurposes the distillation loss as KL regularization, preserving high-fidelity generation. Evaluations on ImageNet show RMMD achieves superior FID-Reward Pareto fronts compared to DI++ and DRaFT. Applied to GenCast, RMMD achieves a 7.5x speedup while outperforming the teacher model on 93% of weather variables and improving calibration, demonstrating scalability to high-dimensional scientific domains.

diffusion models · distillation · reward maximization · kl regularization · on-policy training

Read original →

MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training

arXiv cs.LG · Wenhan Ma, Jianyu Wei, Liang Zhao, Hailin Zhang · 2026-06-29

We introduce Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm for integrating multiple capabilities into large language models (LLMs). MOPD first trains domain-specific reinforcement learning (RL) teachers, then distills them into a student model using its own rollouts, eliminating exposure bias and providing dense optimization signals. On Qwen3-30B-A3B, MOPD outperforms Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines, inheriting nearly all capabilities from each teacher. MOPD enables parallel development of domain teachers, removing cross-domain coupling in multi-domain post-training. It has been deployed in MiMo-V2-Flash, demonstrating practical value for capability integration in frontier-scale LLMs.

multi-teacher distillation · on-policy learning · capability integration · reinforcement learning · post-training

Read original →

Predict, Reuse, and Repair: Accelerating Dynamic Sparse Attention for Long-Context LLM Decoding

arXiv cs.LG · Tianyu Wang, Gourav Rattihalli, Aditya Dhakal, Junbo Li · 2026-06-29

PRR introduces a speculate-reuse-repair runtime to accelerate dynamic sparse attention (DSA) in long-context LLM decoding by predicting relevant KV blocks, speculating attention computations, and incrementally repairing missed blocks. The method employs an EMA-based predictor, a profiling-guided speculation budget, and a FlashAttention-based repair kernel with online-softmax statistics. Evaluations on long-context benchmarks show PRR reduces per-token decoding latency by up to 40% without compromising downstream task accuracy.
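
One plausible reading of why online-softmax statistics make incremental repair cheap: if each KV block's partial result is stored with its running maximum and exponential sum, a block that was missed by speculation can be merged in later without recomputing the blocks already processed. The sketch below shows that merge for scalar values; the function names and the scalar simplification are assumptions, and it does not reproduce PRR's predictor, budget, or kernel.

```python
import math


def block_stats(scores, values):
    """Partial attention over one KV block: (max score, exp-sum, weighted value sum)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # shifted by the max for stability
    return m, sum(weights), sum(w * v for w, v in zip(weights, values))


def merge(a, b):
    """Combine two partial results without revisiting either block."""
    m = max(a[0], b[0])
    scale_a, scale_b = math.exp(a[0] - m), math.exp(b[0] - m)
    return m, a[1] * scale_a + b[1] * scale_b, a[2] * scale_a + b[2] * scale_b


def finalize(stats):
    """Softmax-weighted average of the values covered by the merged statistics."""
    return stats[2] / stats[1]


# Hypothetical scores and values: the second block is "repaired in" afterwards.
speculated = block_stats([0.5, 2.0], [1.0, 2.0])
missed = block_stats([-1.0, 3.0], [3.0, 4.0])
print(finalize(merge(speculated, missed)))
```

The merged result equals full softmax attention over all four positions, so the order in which blocks arrive does not change the final output.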

dynamic sparse attention · kv blocks · online-softmax · flashattention · speculative execution

Read original →

Scalar Representations of Neural Network Training Dynamics

arXiv cs.LG · Pedro Jiménez-González, Miguel C. Soriano, Lucas Lacasa · 2026-06-29

The authors propose scalar embeddings of neural network training dynamics by treating optimization trajectories as temporal networks. They apply dimensionality reduction techniques to analyze training dynamics of a multilayer perceptron on MNIST, preserving key dynamical features including sensitivity to initial conditions and Lyapunov exponents. The method enables definition of a characteristic decorrelation time for training trajectories and reveals statistical organization of asymptotic states through spacing observables. Results show rescaled asymptotic spacings follow a skew lognormal distribution, demonstrating scalar embeddings effectively capture high-dimensional optimization dynamics.

scalar embedding · temporal networks · lyapunov exponent · training dynamics · dimensionality reduction

Read original →

RenderFormer++: Scalable and Physically Grounded Feed-Forward Neural Rendering

arXiv cs.LG · Huangsheng Du, Haoran Zhu, Youcheng Cai, Jinyang Meng · 2026-06-29

RenderFormer++ introduces a scalable, physically grounded feed-forward neural rendering framework for global illumination in mesh scenes. The method combines Physics-Informed Transport Guidance (PITG) to embed rendering-equation inductive biases into attention mechanisms and Hierarchical Object-Centric Tokenization (HOCT) to aggregate triangle-level features into object-level tokens, reducing computational costs. Experiments show improved physical accuracy, efficiency, and scalability over prior methods like RenderFormer, enabling stable rendering across complex large-scale scenes.

neural rendering · global illumination · attention mechanism · transport consistency · object-centric tokenization

Read original →

FlowAWR: Online Adaptive Flow Reinforcement via Advantage-Weighted Rectification

arXiv cs.LG · Zheming Fu, Ruizhe He, Wei Shang, Xiaoxiao Ma · 2026-06-29

FlowAWR introduces a novel paradigm for continuous generative policy optimization by recasting it as supervised regression toward an optimal velocity field, eliminating the need for stochastic SDE samplers and Classifier-Free Guidance (CFG). The method derives a magnitude-aware, advantage-weighted rectification form from the optimal policy of a KL-constrained reward maximization. Evaluated on SD3.5-Medium, FlowAWR achieves superior alignment performance (24.12 PickScore) with 2× to 5× faster convergence than DiffusionNFT and FlowGRPO, while maintaining stable out-of-domain performance under multi-reward constraints.

generative flow models · advantage-weighted rectification · kl-constrained reward maximization · velocity field optimization · online reinforcement learning

Read original →

On the Vulnerability of Parameter-Level Defenses to Model Merging

arXiv cs.LG · Kuangpu Guo, Qingyan Zheng, Jian Liang, Yongcan Yu · 2026-06-29

The paper exposes vulnerabilities in parameter-level defenses against unauthorized model merging, showing that protected task vectors are small-magnitude perturbations dominated by pretrained weights. The authors propose Anchor-Guided Attack (AGA), which exploits this dominance by aligning protected models with a static pretrained anchor to recover transformation matrices analytically. Experiments demonstrate AGA's effectiveness against individual and composite defenses, while Anchor-Repulsive Fine-tuning (ARF) is introduced as a countermeasure that reduces anchor dominance and mitigates AGA.
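
As background for the terminology, the sketch below shows task vectors and a simple task-arithmetic merge on flat parameter lists, plus the relative-magnitude check behind the observation that task vectors are small next to pretrained weights. The numbers and function names are hypothetical, and neither AGA nor ARF is implemented.

```python
def task_vector(finetuned, pretrained):
    """Difference between fine-tuned and pretrained parameters."""
    return [f - p for f, p in zip(finetuned, pretrained)]


def merge_models(pretrained, task_vectors, scale=1.0):
    """Task-arithmetic merge: add scaled task vectors to the pretrained weights."""
    merged = list(pretrained)
    for tv in task_vectors:
        merged = [m + scale * t for m, t in zip(merged, tv)]
    return merged


def l2_norm(v):
    return sum(x * x for x in v) ** 0.5


def relative_magnitude(tv, pretrained):
    """Size of a task vector relative to the pretrained weights."""
    return l2_norm(tv) / l2_norm(pretrained)


# Hypothetical two-parameter "models".
pre = [1.0, 2.0]
tv_a = task_vector([1.1, 2.0], pre)
tv_b = task_vector([1.0, 1.8], pre)
print(merge_models(pre, [tv_a, tv_b]), relative_magnitude(tv_a, pre))
```

When the relative magnitude is small, a protected model stays close to the public pretrained checkpoint, which is the dominance the summary says the attack exploits.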

model merging · parameter-level defenses · anchor-guided attack · task vectors · pretrained weights

Read original →

Learning the structure of open quantum systems

arXiv cs.LG · Laura Lewis, Ewin Tang, John Wright · 2026-06-29

(No summary returned.)

Read original →

OLIVE: View-Augmented Latent Prediction with Waveform Reconstruction for Speech SSL

arXiv cs.LG · Karl El Hajal, Mathew Magimai. -Doss · 2026-06-29

The paper introduces OLIVE, a self-supervised learning framework for speech representation that jointly optimizes analysis (masked latent prediction) and synthesis (waveform reconstruction) objectives. View augmentation and invariant latent prediction enhance robustness, while reconstruction preserves signal-level information in early encoder features. OLIVE demonstrates improved performance on generation and speaker tasks, maintains competitiveness in recognition and semantic tasks, and achieves superior waveform reconstruction compared to baseline methods.

self-supervised learning · waveform reconstruction · masked latent prediction · view augmentation · speech representation

Read original →

REAR: Test-time Preference Realignment through Reward Decomposition

arXiv cs.LG · Fuxiang Zhang, Pengcheng Wang, Chenran Li, Yi-Chen Li · 2026-06-29

REAR is a test-time preference realignment framework for large language models that decomposes reward functions into question-related and preference-related components. By formulating the REAlignment Reward (REAR) as a linear combination of token-level policy log-probabilities, the method integrates efficiently with test-time scaling algorithms such as best-of-N sampling and tree search. Experiments show that REAR outperforms test-time baselines on preference alignment tasks across diverse user requirements while maintaining generalization in mathematical and visual domains.

test-time scaling · reward decomposition · preference alignment · token-level policy · realignment reward

Read original →

FlexTab: A Flexible Encoder-Decoder Architecture for In-Context Learning Across Diverse Tabular Tasks

arXiv cs.LG · Marek Polewczyk, Maximilian Schambach, Marco Spinaci, Sam Thelin · 2026-06-29

FlexTab introduces a flexible encoder-decoder architecture for in-context learning on tabular data, featuring a task-agnostic encoder and task-specific decoders. The design produces target-agnostic row embeddings applicable to six tasks: classification, regression, anomaly detection, clustering, entity matching, and entity classification in relational databases. Trained on unlabeled tables, FlexTab achieves state-of-the-art performance on four tasks and remains competitive in relational entity classification, demonstrating its efficacy as a general-purpose backbone for diverse tabular prediction problems.

encoder-decoder · in-context learning · tabular data · target-agnostic · relational databases

Read original →

Local-Minima-Preserving Continuous Relaxation of Ising Problems

arXiv cs.LG · Debraj Banerjee, Santanu Mahapatra, Kunal N. Chaudhury · 2026-06-29

The authors present a polynomial relaxation for the generalized Ising problem that preserves one-flip local minima, proving a landscape equivalence theorem guaranteeing a bijective correspondence between relaxation minima and original problem minima. This enables gradient-based optimization (e.g., ADAM) for combinatorial problems like MAX-CUT and Number Partitioning. Empirical results demonstrate scalability and strong performance on spin-glass models and benchmark problems.
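
The property the relaxation is said to preserve, a one-flip local minimum, can be checked directly on the discrete problem. The sketch below is illustrative only: the sign convention of the energy, the use of the upper triangle of `J`, and all function names are assumptions, and the paper's polynomial relaxation itself is not reproduced.

```python
def ising_energy(spins, J, h):
    """Energy E(s) = -sum_{i<j} J[i][j] s_i s_j - sum_i h[i] s_i.

    The sign convention is an assumption; only the upper triangle of J is read.
    """
    n = len(spins)
    pair = sum(J[i][j] * spins[i] * spins[j]
               for i in range(n) for j in range(i + 1, n))
    field = sum(h[i] * spins[i] for i in range(n))
    return -pair - field


def is_one_flip_local_min(spins, J, h):
    """True if flipping any single spin never lowers the energy."""
    base = ising_energy(spins, J, h)
    for i in range(len(spins)):
        flipped = list(spins)
        flipped[i] = -flipped[i]
        if ising_energy(flipped, J, h) < base:
            return False
    return True


# Toy ferromagnet on three spins (hypothetical data).
J = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
h = [0, 0, 0]
aligned_is_min = is_one_flip_local_min([1, 1, 1], J, h)
frustrated_is_min = is_one_flip_local_min([1, -1, 1], J, h)
```

In this toy case the fully aligned configuration is a one-flip local minimum and the configuration with one reversed spin is not.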

ising problem · local minima · polynomial relaxation · landscape equivalence · gradient-based optimization

Read original →

Extrapolating from Regularised Solutions for Solving Ill-Conditioned Linear Systems in Machine Learning

arXiv cs.LG · Disha Hegde, Jon Cockayne, Chris J. Oates · 2026-06-29

The paper introduces autonugget, a Python package for stable numerical solution of ill-conditioned linear systems in machine learning prototyping. The method circumvents manual nugget selection in Tikhonov-regularised inversion by combining multiple linear solves via Richardson extrapolation, improving accuracy while maintaining compatibility with JAX automatic differentiation. Results demonstrate enhanced stability and computational efficiency compared to single-nugget approximations, enabling end-to-end differentiable training pipelines.
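
A minimal sketch of the underlying idea, not the autonugget API: for a symmetric positive-definite system, the nugget-regularised solution x(λ) of (A + λI)x = b has an error that is linear in λ to leading order, so combining solves at λ and λ/2 as 2·x(λ/2) − x(λ) cancels that term. The 2×2 Cramer's-rule solver, the function names, and the test matrix are all assumptions made for illustration.

```python
def solve_2x2(a, b):
    """Solve a 2x2 linear system a @ x = b with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return [x0, x1]


def tikhonov_solve(a, b, nugget):
    """Solve (a + nugget * I) x = b."""
    shifted = [[a[0][0] + nugget, a[0][1]],
               [a[1][0], a[1][1] + nugget]]
    return solve_2x2(shifted, b)


def richardson_solve(a, b, nugget):
    """Combine two nugget solves so the first-order error in the nugget cancels."""
    coarse = tikhonov_solve(a, b, nugget)
    fine = tikhonov_solve(a, b, nugget / 2)
    return [2 * f - c for f, c in zip(fine, coarse)]


def max_error(x, x_true):
    return max(abs(u - v) for u, v in zip(x, x_true))


# Hypothetical test problem: a @ [1, -1] = [1, 0].
a = [[2.0, 1.0], [1.0, 1.0]]
b = [1.0, 0.0]
x_true = [1.0, -1.0]
err_single = max_error(tikhonov_solve(a, b, 1e-3), x_true)
err_extrapolated = max_error(richardson_solve(a, b, 1e-3), x_true)
```

With this setup the extrapolated solution is closer to the unregularised one than the single-nugget solve, at the cost of one extra linear solve.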

tikhonov regularization · richardson extrapolation · ill-conditioned systems · automatic differentiation · jax

Read original →

Hybrid Active-Online Learning Framework for Label-Efficient Concept Drift Adaptation in Optical Network Failure Detection

arXiv cs.LG · Yousuf Moiz Ali, Jaroslaw E. Prilepsky, João Pedro, Sasipim Srivallapanondh · 2026-06-29

The authors propose a hybrid active-online learning framework for label-efficient concept drift adaptation in optical network failure detection. The method employs margin-based selective labeling to minimize annotation costs while maintaining performance. Results demonstrate near-ceiling accuracy and AUC scores with only 3.4% of streaming samples queried, introducing negligible latency overhead compared to static inference.
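
Margin-based selective labeling can be sketched as follows, assuming the margin is the gap between the two largest predicted class probabilities and that a label is requested only when that gap falls below a threshold. The threshold value and the sample stream are hypothetical and not taken from the paper.

```python
def margin(probabilities):
    """Gap between the two most probable classes."""
    top_two = sorted(probabilities, reverse=True)[:2]
    return top_two[0] - top_two[1]


def should_query_label(probabilities, threshold=0.2):
    """Ask an annotator only when the model is uncertain (small margin)."""
    return margin(probabilities) < threshold


# Hypothetical stream of per-sample class probabilities.
stream = [
    [0.97, 0.03],
    [0.55, 0.45],
    [0.90, 0.10],
    [0.02, 0.98],
]
queried = [p for p in stream if should_query_label(p)]
query_fraction = len(queried) / len(stream)
```

Only the queried samples would be sent for annotation and used for the online update; the rest pass through inference unchanged.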

active learning · concept drift · optical networks · selective labeling · failure detection

Read original →

BrainJanus: A Unified Model for Understanding and Generation across Brain, Vision, and Language

arXiv cs.LG · Haitao Wu, Qirui Zhang, Zhouheng Yao, Shangquan Sun · 2026-06-29

BrainJanus introduces the first unified model integrating brain, vision, and language processing within a single framework. The model employs a Unified Brain Tokenizer to quantize neural dynamics into discrete tokens aligned with visual and linguistic representations in a shared Omni space, coupled with an All-in-One autoregressive architecture for any-to-any generation tasks, including image-to-brain, text-to-brain, brain-to-image, and brain-to-text decoding. Extensive experiments demonstrate superior performance across benchmarks, zero-shot generalization, and preservation of interpretable biological topography. The code is publicly available on GitHub.

unified brain tokenizer · omni space · autoregressive architecture · any-to-any generation · biological topography

Read original →

Toward an Energy-Optimized Operation of Data Centers Located in Wind Farms Using Reinforcement Learning

arXiv cs.LG · Jan Stenner, Alexander Kilian, Sebastian Peitz, Hermann de Meer · 2026-06-29

The paper proposes a Reinforcement Learning (RL) framework for optimizing energy usage in wind-turbine-integrated HPC data centers, addressing workload shifting under wind curtailment constraints. Using Proximal Policy Optimization (PPO) and a modified Soft Actor-Critic (SAC) with on-policy updates, the study evaluates Imitation Learning and Reward Shaping to mitigate credit assignment issues. Results show improved performance with these techniques, though a gap remains compared to offline optimization with full foresight. The benchmark framework supports future extensions to multi-site and continuous-time scenarios.

reinforcement learning · wind-turbine-integrated · proximal policy optimization · soft actor-critic · credit-assignment problem

Read original →

TRACE: A Concept Bottleneck Model for Longitudinal 3D Glioblastoma Response Assessment

arXiv cs.LG · Alia Tarek, Hamsa Saberr, Hamza Elghonemy, Youssef Afify · 2026-06-29

TRACE introduces a concept bottleneck model for interpretable glioblastoma response classification on longitudinal 3D MRI, aligning with RANO 2.0 criteria. The model processes paired baseline and follow-up multimodal MRI scans using a shared 3D vision encoder, predicts clinically meaningful tumor measurements, and computes downstream RANO-derived concepts via deterministic rules. It achieves a 4-class macro F1 of 0.4769 and a binary progression-versus-non-progression macro F1 of 0.7085 on the LUMIERE dataset, outperforming a concept bottleneck baseline and remaining competitive with non-interpretable deep learning approaches. Ablation studies highlight the importance of the expert RANO graph and intervention-consistency training, while intervention experiments show that correcting concepts improves downstream predictions.

concept bottleneck model · longitudinal mri · rano criteria · glioblastoma classification · 3d vision encoder

Read original →

Highly Data Parallelizable Estimation of the Sliced-Wasserstein Distance Using Cumulative Distribution Functions

arXiv cs.LG · Christophe Vauthier, Quentin Mérigot, Anna Korba · 2026-06-29

The authors propose a novel class of estimators for the Sliced Wasserstein (SW) distance based on cumulative distribution functions (CDFs) instead of traditional quantile functions. These estimators avoid sorting projected samples and enable massive dataset parallelism by leveraging CDFs of projected measures. The method is particularly advantageous for Gaussian mixtures and federated learning, as CDFs can be computed locally and aggregated without raw data exchange. The estimators include variants with hyperparameters controlling variance and smoothness, offering flexibility for different applications.
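
To illustrate why a CDF formulation parallelizes well, the sketch below uses the standard one-dimensional identity W1(μ, ν) = ∫ |F_μ(t) − F_ν(t)| dt, evaluated on a fixed grid and averaged over random projections. It is not the paper's estimator: the grid bounds, the Riemann sum, and every function name are assumptions. The key point is that per-threshold counts from separate data shards simply add, so no shard has to sort or share raw samples.

```python
import math
import random


def counts_on_grid(values, grid):
    """Number of samples <= t for each grid point t. Counts from shards add up."""
    return [sum(1 for v in values if v <= t) for t in grid]


def w1_from_counts(counts_a, n_a, counts_b, n_b, step):
    """Riemann-sum approximation of the integral of |F_a(t) - F_b(t)|."""
    return step * sum(abs(ca / n_a - cb / n_b)
                      for ca, cb in zip(counts_a, counts_b))


def project(points, direction):
    return [sum(x * d for x, d in zip(p, direction)) for p in points]


def random_direction(dim, rng):
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def sliced_w1(points_a, points_b, n_projections=50, grid_size=200,
              lo=-5.0, hi=5.0, seed=0):
    """Average the 1D CDF-based W1 over random directions.

    Assumes all projected samples fall inside [lo, hi].
    """
    rng = random.Random(seed)
    step = (hi - lo) / grid_size
    grid = [lo + k * step for k in range(grid_size)]
    dim = len(points_a[0])
    total = 0.0
    for _ in range(n_projections):
        d = random_direction(dim, rng)
        ca = counts_on_grid(project(points_a, d), grid)
        cb = counts_on_grid(project(points_b, d), grid)
        total += w1_from_counts(ca, len(points_a), cb, len(points_b), step)
    return total / n_projections
```

In a federated setting each client would call `counts_on_grid` on its own projected samples and send only the count vectors, which the server sums before calling `w1_from_counts`.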

sliced wasserstein distance · cumulative distribution functions · data parallelism · federated learning · optimal transport

Read original →

DreamForge-World 0.1 Preview: A Low-Compute Real-Time Controllable World Model

arXiv cs.LG · Daniyel Ayupov, Artur Markov-Tsoy · 2026-06-29

DreamForge-World 0.1 Preview introduces a low-compute foundational world model for real-time interactive simulation, prioritizing consumer-GPU runtime and broad interactive capabilities. The system adapts the LongLive 1 autoregressive video stack, derived from Wan2.1-T2V-1.3B, and integrates a residual action pathway inspired by the Matrix-Game family. It supports multimodal initialization, live keyboard/mouse control, mid-stream reprompting, dual-view operation, and minute-scale interactive rollouts at 480p resolution, achieving 14-15 FPS on a single RTX 4090 with low memory usage. Leveraging open video backbones and targeted adaptation runs, the model demonstrates a cost-efficient approach to real-time controllable world-model previews, though it is not yet memory-complete or frontier-quality.

autoregressive video stack · residual action pathway · multimodal initialization · dual-view operation · low-compute adaptation

Read original →

When Is a Draft Accepted? A Theory of Acceptance in Speculative Decoding

arXiv cs.LG · Aaryam Sharma · 2026-06-29

The paper develops a theoretical framework for acceptance criteria in speculative decoding, focusing on practical regimes beyond distribution-preserving settings. It characterizes rejection regions as lower level sets of the target distribution and derives exact KL divergence certificates and margin-based bounds for various acceptance criteria, including strict greedy decoding, relaxed additive/multiplicative rules, top-m criteria, and entropy-thresholded acceptance. The framework is extended to greedy tree decoding, providing certificates for when the target token remains within the drafter's top-m candidates. Evaluations on Qwen3 models demonstrate that relaxed and tree-based criteria significantly expand certified acceptance regions, particularly in low-margin decoding steps.
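
The families of acceptance rules named above can be sketched as simple predicates on the target model's next-token distribution. The exact definitions in the paper are not given in this summary, so the forms below (additive slack `eps`, multiplicative `ratio`, and membership in the target's top `m`) are plausible assumptions rather than the authors' formulations.

```python
def accept_strict_greedy(target_probs, draft_token):
    """Accept only if the draft token is the target model's argmax."""
    best = max(range(len(target_probs)), key=lambda t: target_probs[t])
    return draft_token == best


def accept_additive(target_probs, draft_token, eps):
    """Accept if the draft token is within eps of the top target probability."""
    return target_probs[draft_token] >= max(target_probs) - eps


def accept_multiplicative(target_probs, draft_token, ratio):
    """Accept if the draft token has at least ratio times the top probability."""
    return target_probs[draft_token] >= ratio * max(target_probs)


def accept_top_m(target_probs, draft_token, m):
    """Accept if the draft token is among the target's m most likely tokens."""
    ranked = sorted(range(len(target_probs)),
                    key=lambda t: target_probs[t], reverse=True)
    return draft_token in ranked[:m]


# Hypothetical low-margin step: tokens 0 and 1 are close under the target model.
target_probs = [0.5, 0.4, 0.1]
```

On this low-margin example the strict rule rejects token 1, while the relaxed rules with `eps=0.15`, `ratio=0.75`, or `m=2` accept it, which is the kind of step where the summary reports the largest gains.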

speculative decoding · greedy decoding · kl divergence · acceptance criteria · tree decoding

Read original →

Your Data Manifold is Secretly a Reward Model: Shell-LCC for Text-to-Video Generation

arXiv cs.LG · Shihao Zhang, Yuguang Yan, Junzhe Zhang, Wei Zhao · 2026-06-29

The paper proposes Shell-Local Coordinate Coding (Shell-LCC), a method that uses the intrinsic manifold structure of high-quality Supervised Fine-Tuning (SFT) data to provide dense, differentiable reward signals for text-to-video (T2V) generation. Unlike traditional reward models, which incur computational overhead and require costly annotations, Shell-LCC explicitly models the manifold 'surface' as an isotropic shell, avoiding mean regression and preserving high-frequency details. Experiments show that Shell-LCC enhances realism, mitigates low-level distortions, reduces over-smoothing artifacts, and alleviates motion blur in generated videos.

shell-lcc · text-to-video · manifold structure · supervised fine-tuning · reward signals

Read original →

A Distributionally Robust Framework for Learned Reconstructions in Inverse Problems

arXiv cs.LG · Floor van Maarschalkerwaart, Subhadip Mukherjee, Christoph Brune, Marcello Carioni · 2026-06-29

The paper introduces a distributionally robust optimization (DRO) framework for learned reconstructions in inverse problems, addressing poor generalization under distributional shifts. By restricting ambiguity sets to structured perturbations aligned with the data-acquisition process, the method models uncertainty in the forward operator and noise model more faithfully. Theoretical results include strong duality and finite-dimensional dual representations, while numerical experiments on deblurring and sinogram-to-CT reconstruction demonstrate improved robustness and stability over standard DRO and MSE baselines. The framework induces Tikhonov regularization and yields effectively low-rank operators in linear settings.

distributionally robust optimization · inverse problems · structured perturbations · tikhonov regularization · worst-case risk bound

Read original →

B3O: Scalable Boltzmann Batch Bayesian Optimization

arXiv cs.LG · Maximilian Bloor, Liyuan Xu, Hrvoje Stojic, Victor Picheny · 2026-06-29

B3O introduces a scalable framework for large-batch Bayesian Optimization (BO) by reframing batch generation as a sampling problem from the Boltzmann distribution defined by the acquisition function. This approach avoids computational bottlenecks and maintains batch diversity, addressing limitations of existing methods. Theoretical analysis shows negligible additional regret for queries sampled from this distribution. Empirical evaluation demonstrates B3O's superiority on synthetic benchmarks and robustness in complex tasks, including multi-objective electrode design and mixed-variable race car configuration.
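
A minimal sketch of batch generation as Boltzmann sampling, assuming a finite candidate set and a density proportional to exp(acquisition / temperature). The paper presumably samples in a continuous or mixed search space with its own sampler; the discrete candidates, the temperature parameter, and sampling with replacement are simplifications introduced here.

```python
import math
import random


def boltzmann_weights(acquisition_values, temperature=1.0):
    """Normalised weights proportional to exp(acquisition / temperature)."""
    peak = max(acquisition_values)  # subtract the max for numerical stability
    unnorm = [math.exp((a - peak) / temperature) for a in acquisition_values]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def sample_batch(candidates, acquisition_values, batch_size,
                 temperature=1.0, seed=0):
    """Draw a batch (with replacement) from the Boltzmann distribution."""
    rng = random.Random(seed)
    weights = boltzmann_weights(acquisition_values, temperature)
    return rng.choices(candidates, weights=weights, k=batch_size)


# Hypothetical candidates and acquisition scores.
candidates = ["x0", "x1", "x2", "x3"]
acquisition = [0.1, 0.9, 0.5, 0.2]
weights = boltzmann_weights(acquisition, temperature=0.5)
batch = sample_batch(candidates, acquisition, batch_size=8, temperature=0.5)
```

Lower temperatures concentrate the batch near the acquisition maximum, and higher temperatures spread it out, which is one way to trade exploitation against batch diversity.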

bayesian optimization · boltzmann distribution · batch generation · acquisition function · multi-objective design

Read original →

Characterizing Optimizer-Dependent Training Dynamics Through Hessian Eigenvector Displacement and Localization

arXiv cs.LG · Marcelina Marjankowska, Valerio Modugno, Paolo Barucca · 2026-06-29

The paper characterizes optimizer-dependent training dynamics by analyzing Hessian eigenvector evolution in neural networks. Using multilayer perceptrons on classification tasks, the authors measure eigenvector dynamics via (i) temporal displacement metrics and (ii) localization through inverse participation ratio, comparing against a random Hessian null model. Results show SGD stabilizes leading curvature directions, while Adam exhibits stronger eigenvector reorganization and parameter subset localization in dominant curvature directions. These findings demonstrate Hessian eigenvector dynamics differentiate optimizer behaviors and training trajectories.
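
The localization measure mentioned above has a standard definition: for a vector v, the inverse participation ratio is IPR(v) = Σ v_i⁴ / (Σ v_i²)². A short sketch follows, with the function name chosen here for illustration. Values near 1 mean the eigenvector is concentrated on a few parameters, and values near 1/n mean it is spread evenly over all n.

```python
def inverse_participation_ratio(vector):
    """IPR = sum(v_i^4) / (sum(v_i^2))^2; invariant to rescaling of the vector."""
    norm_sq = sum(v * v for v in vector)
    return sum(v ** 4 for v in vector) / (norm_sq ** 2)


localized = inverse_participation_ratio([0.0, 1.0, 0.0, 0.0])
delocalized = inverse_participation_ratio([0.5, 0.5, 0.5, 0.5])
```

Here the one-hot vector gives an IPR of 1 and the uniform vector of length four gives 1/4.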

hessian eigenvectors · training dynamics · optimizer comparison · localization · inverse participation ratio

Read original →

Robust Strategic Classification under Decision-Dependent Cost Uncertainty

arXiv cs.LG · Sura Alhanouti, Güzin Bayraksan, Parinaz Naghizadeh · 2026-06-29

The paper introduces a two-stage robust optimization framework for strategic classification that accounts for decision-dependent cost uncertainty, addressing a key limitation in existing literature which assumes fixed manipulation costs. The proposed method captures how manipulation costs evolve based on past algorithmic decisions, reducing uncertainty and mitigating gaming behavior over time. Results demonstrate that incorporating policy-dependent costs not only enhances robustness but also more effectively curtails strategic manipulation of algorithmic systems.

strategic classification · robust optimization · decision-dependent uncertainty · manipulation costs · algorithmic gaming

Read original →

Predictive Objectives Discard Exogenous Control-Relevant Features: A Controlled Mechanistic Study

arXiv cs.LG · Ayan Pendharkar · 2026-06-29

The study demonstrates that joint-embedding predictive objectives (JEPA-style) discard exogenous yet control-relevant features due to their focus on temporal predictability rather than control-relevance. Through a controlled 2x2 experimental design varying feature controllability and relevance, six objectives were evaluated: reconstruction, JEPA variants, inverse dynamics, and reward-grounded JEPA. Results show reward-free predictive objectives fail to retain exogenous control-relevant features (near chance accuracy), while reward-grounded JEPA recovers them with as little as 2% reward-labeled transitions, robust across environments (16-1024 latent dimensions). Latent geometry analysis reveals JEPA achieves minimal class separation compared to supervised references.

joint-embedding predictive objectives · exogenous features · control-relevance · temporal predictability · bisimulation theory

Read original →

Data-Driven Energy-Based Learning via Gibbs Measures on Hierarchical Structures

arXiv cs.LG · L. U. Abdullaev, F. Herrera, U. A. Rozikov, M. V. Velasco · 2026-06-29

The paper introduces a probabilistic framework for learning systems using Gibbs measures on hierarchical structures, replacing empirical risk minimization with an energy-based model derived from empirical loss functions. It formulates consistency conditions for finite-volume distributions and derives nonlinear integral fixed-point equations to characterize equilibrium learning states. The analysis reveals phase-transition phenomena in hierarchical systems, where multiple Gibbs measures emerge beyond critical thresholds, corresponding to distinct prediction regimes. Numerical experiments with non-separable kernels demonstrate coexisting solution branches, illustrating data-induced probabilistic landscapes.

gibbs measures · hierarchical structures · energy-based learning · phase-transition · fixed-point equations

Read original →

From Failure Taxonomy to Intervention: A Diagnostic Methodology for Industry-Scale AVLM in Video and Live-Streaming Platform Moderation

arXiv cs.LG · Shuchang Ye, Jinqiang Yu, Zhujun Xiao, Yajing Kong · 2026-06-29

The paper introduces a diagnostic methodology for industry-scale Audio-Visual-Language Models (AVLM) development, addressing the gap between generic pretrained models and platform-specific moderation requirements. The method maps model failures to a taxonomy of observable signatures and links each failure class to targeted intervention spaces, enabling traceable improvements. The authors instantiate this approach in a large-scale video and live-streaming platform, resulting in a system supporting over 100 regions and handling noisy, ambiguous global content.

audio-visual-language models · failure taxonomy · model intervention · content moderation · multimodal foundation models

Read original →

Notes on generative modeling: flow matching, diffusion, optimal transport and Schrödinger bridge

arXiv cs.LG · Titouan Vayer · 2026-06-29

This work provides a unified mathematical framework connecting key generative modeling techniques. The author establishes theoretical links between optimal transport, Schrödinger bridge, flow matching, and diffusion-based approaches, demonstrating their shared underlying principles. Through mathematical analysis, the notes reveal how these methods relate to each other in terms of their formulations and optimization objectives. The exposition offers a consolidated perspective on modern generative modeling, highlighting the connections between these approaches that are often treated separately in the literature.

generative modeling · optimal transport · schrödinger bridge · flow matching · diffusion models

Read original →

Bridging the Gap Between Image Restoration and Navigational Safety in Hazy Conditions: A New Visibility Estimation Metric for Maritime Surveillance

arXiv cs.LG · Wentao Feng, Guobei Peng, Wengang Mao, Ryan Wen Liu · 2026-06-29

The study introduces a visibility-oriented evaluation framework for maritime surveillance, addressing the gap between image dehazing quality and navigational safety in hazy conditions. A Maritime Simulated Visibility Dataset (MSVD) is constructed using Unity3D, providing paired hazy and clear images with precise visibility annotations. The proposed metric leverages object detection accuracy to map visibility distance to detection performance, converting image restoration improvements into measurable visibility gains. Six dehazing methods are evaluated using both conventional metrics and the proposed framework. Results demonstrate MSVD's reliability as a benchmark and the metric's effectiveness in interpretable visible-distance estimation, supporting navigational safety assessment.

visibility estimation · image dehazing · maritime surveillance · object detection · simulated dataset

Read original →

Building Multi-Task Agentic LLMs via Two-Phase Distillation

arXiv cs.LG · Huaijie Wang, Shusheng Xu, Yi Wu, Kaifeng Lyu · 2026-06-29

The paper proposes a two-phase distillation method for building multi-task agentic LLMs that matches single-task RL expert performance. It identifies off-policy distillation's mode-covering limitation in multi-task settings and on-policy distillation's need for strong initialization, combining them sequentially for optimal performance. Evaluations on conversational agents and text-based games show the two-phase approach outperforms standalone off-policy or on-policy methods.

multi-task learning · reinforcement learning · knowledge distillation · mode-covering · on-policy

Read original →

Heads, Not Backbones: Output Heads Dominate Architectures on Fat-Tailed Returns

arXiv cs.LG · Sichao He, Yansong Zhang · 2026-06-29

The study demonstrates that output heads, not backbone architectures, dominate performance in forecasting fat-tailed financial returns at short horizons. Comparing four backbones (TimesNet, DLinear, N-BEATS, iTransformer) with three output heads (point, single-Gaussian, Gaussian mixture), results show head choice drives CRPS improvements (3.7pp gradient), while backbone swaps yield ≤5.1% changes. Mixture heads excel in high-volatility regimes (13.9% CRPS gain in 1970s stagflation). Horizon analysis reveals head dominance at short horizons (h<6), with backbones prevailing at longer horizons. Distributional metrics (CRPS, pinball) separate heads, unlike squared error.
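
For readers unfamiliar with the metric, the CRPS of a single-Gaussian head has a well-known closed form: with z = (y − μ)/σ, CRPS = σ·[z(2Φ(z) − 1) + 2φ(z) − 1/√π]. The sketch below evaluates it; the function name and the example values are illustrative and not taken from the paper.

```python
import math


def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a N(mu, sigma^2) forecast against observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


# Hypothetical tail event five units from the forecast mean.
narrow = crps_gaussian(0.0, 1.0, 5.0)
wide = crps_gaussian(0.0, 3.0, 5.0)
```

On a tail observation the wider predictive distribution scores better (lower CRPS) than the narrow one, which is why a distributional metric can separate output heads on fat-tailed data where squared error on the mean cannot.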

fat-tailed returns · output heads · backbone architectures · gaussian mixture · crps

Read original →

Consensus Clustering of Free-Viewing Gaze Data: New Insights into Human-Information Interaction

arXiv cs.LG · Beryl Gnanaraj, Jaya Sreevalsan-Nair, Saqib Alam Ansari, Maanasa Rajaraman · 2026-06-29

The paper introduces EnsembleGaze, an unsupervised ensemble learning system for consensus clustering of free-viewing gaze data to analyze human-information interaction patterns. The method employs feature engineering based on statistical descriptors of fixation distributions, followed by consensus voting of clustering methods to compute a co-association matrix. Two high-dimensional clustering strategies—consensus subspace clustering and spectral biclustering—are proposed for joint user and image characterization. Results show robust image stimuli groupings (ambient vs. focal viewing modes) and context-dependent user groupings, with biclustering uniquely recovering this structure. Evaluation on public datasets reveals dataset-specific patterns.
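
The co-association matrix at the centre of consensus voting can be sketched in a few lines: entry (i, j) is the fraction of base clusterings that place items i and j in the same cluster. The base clusterings below are hypothetical, and the feature engineering and downstream subspace or biclustering steps are not shown.

```python
def co_association(labelings):
    """Fraction of clusterings in which each pair of items shares a cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(labelings) for c in row] for row in counts]


# Three hypothetical base clusterings of three items.
labelings = [
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]
matrix = co_association(labelings)
```

Because only co-membership matters, the matrix is unaffected by how each base method happens to number its clusters, and it can be fed to any similarity-based clustering method to obtain the consensus partition.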

consensus clustering · gaze data · ensemble learning · spectral biclustering · human-information interaction

Read original →

A multi-architecture study of specificity refinement and false-positive mechanism analysis in prostate MRI

arXiv cs.LG · Yongbo Shu, Kewen Chen, Yifeng Yuan, Zirui Xin · 2026-06-29

This study characterizes false positives in prostate MRI detection and evaluates a lightweight post-hoc refinement head for case-level specificity. Using PI-CAI (5-fold cross-validation) and Prostate158 datasets, a context-aware evidence head and an 89,216-parameter refinement head were trained on a frozen detection backbone, with additional training on nnU-Net, U-Net, Mamba, and MIGF-Mamba architectures. False positives exhibited contrast ratios closer to true cancers than benign tissue, replicating across five architectures and modality-perturbation scenarios. Refinement improved case-level specificity from 0.469 to 0.549 (+17.2%) on PI-CAI fold-0 while maintaining sensitivity (0.943), though fold-conditional behavior was observed. The results suggest that false positives share raw imaging features with cancers; they do not demonstrate histologically confirmed mimicry.

false positives · post-hoc refinement · contrast ratios · case-level specificity · modality-perturbation

Read original →

Atompack: A Storage and Distribution Layer for Read-Heavy Atomistic ML Training Datasets

arXiv cs.LG · Ali Ramlaoui, Daniel T. Speckhard, Sagar Pal, Fragkiskos D. Malliaros · 2026-06-29

Atompack introduces a storage and distribution layer optimized for read-heavy atomistic ML training datasets, focusing on immutable snapshots with efficient append operations and memory-mapped reads. The system prioritizes complete molecular record serving over field chunks or object reconstruction, aligning with training pipelines' shuffled access patterns. Benchmarks against HDF5, LMDB, and ASE show Atompack achieves 96x faster shuffled reads than ASE LMDB and 79% smaller artifact sizes for 64-atom workloads.

atomistic machine learning · storage format · memory-mapped reads · training pipeline · shuffled access

Read original →

NeuReasoner: Theory-grounded Mapping of Reasoning Elicitation Boundaries

arXiv cs.LG · Aydin Javadov, Shyngys Aitkazinov, Tobias Hoesli, Florian von Wangenheim · 2026-06-29

The paper introduces NeuReasoner, a theory-grounded elicitation instrument combining Neuro Lenses (functional specificity) and Cognitive Lenses (Erotetic Theory of Reasoning) to probe reasoning boundaries in large language models. Through internal modularization, it evaluates performance on CogBench (cognitive psychology tasks) and standard benchmarks. Results show NeuReasoner matches or exceeds thinking-mode baselines on arithmetic, code generation, Bayesian reasoning, and reward learning at scale, but fails on risk-taking and decision-making under uncertainty. Scale interacts variably with elicitation, widening advantages on some tasks while erasing others.

elicitation boundaries · cognitive lenses · internal modularization · thinking-mode baselines · functional specificity

Read original →

Improved Predictive Performance and Interpretability for Mesomorphic Neural Networks Using Local Fidelity Regularization

arXiv cs.LG · Hugo L. Hammer, Vajira Thambawita, Kristoffer Herland Hellton, Pål Halvorsen · 2026-06-29

The paper introduces Local Fidelity Regularization (LFR) to address degenerate weight collapse in Interpretable Mesomorphic Neural Networks (IMNs), where explanatory variance concentrates in a single output weight. LFR aligns linear output weights with local data variations, ensuring faithful interpretations without sacrificing predictive performance. Empirical results on the OpenML benchmark suite show LFR improves AUROC over unregularized IMNs while maintaining competitive accuracy with black-box models.

interpretable mesomorphic neural networks · local fidelity regularization · degenerate weight collapse · openml benchmark · auroc

Read original →

Diagnosing and Mitigating Retrieval Bottlenecks in LLM-Based Cold-Start Recommendation

arXiv cs.LG · Zhe Dong, Fang Qin, Manish Shah, Yicheng Wang · 2026-06-29

The study evaluates LLM-based rerankers in cold-start recommendation systems, revealing performance gaps despite semantic understanding expectations. Using a five-domain benchmark separating reranking quality from retrieval coverage, it shows calibrated LLM rerankers (Qwen3-8B to Qwen3-32B) underperform collaborative/content baselines in natural traffic and struggle with retrieval-realistic regimes (gold item present only 4.6-22.9% of the time). The proposed LHF (learned hybrid fusion) improves retrieval coverage (17-61% recovery on content-rich domains) but highlights persistent mismatches in LLM reranking pipelines. The benchmark protocol and artifacts are publicly released.

llm reranking · cold-start recommendation · retrieval coverage · learned hybrid fusion · multi-retriever pool

Read original →

Bandwidth Selection in Kernel Density Estimation for Model Calibration

arXiv cs.LG · Han Zhou, Teodora Popordanoska, Matthew Blaschko · 2026-06-29

The paper introduces Risk Alignment (RA), a novel optimization framework for selecting optimal kernel bandwidths in Kernel Density Estimation (KDE) for model calibration. RA aligns KDE-reconstructed risk with empirical risk to minimize calibration estimation bias, providing a principled criterion applicable to various metrics like canonical calibration error. Theoretical analysis shows RA's effectiveness across data distributions. Experiments on multiple architectures and datasets demonstrate RA's consistent superiority over standard bandwidth selection methods, yielding more reliable calibration assessments.

kernel density estimation · model calibration · bandwidth selection · risk alignment · canonical calibration error

Read original →

MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation

arXiv cs.LG · Kuan Wang · 2026-06-29

MemDelta introduces a controlled evaluation protocol for agent memory systems, isolating component effects by varying one element at a time on LongMemEval-S (500 questions, 50+ sessions, three model families). Key findings include: (1) performance rankings reverse across models (Gemini gains +14pp from full context, Sonnet +31pp from RAG); (2) embedding model swaps shift accuracy by +6.2pp (p = 0.004); (3) self-memory underperforms basic retrieval (42% vs. 47%); (4) narrow cost-benefit tradeoffs (Mem0 matches cloud RAG on 2/6 question types at 50x cost). The study recommends fixed embeddings, model-family stratification, and cost reporting in memory evaluations.

agent memory · controlled evaluation · embedding model · retrieval-augmented generation · cost-benefit analysis

Read original →

Golden Hour Divide: Trauma Care Accessibility and Resource Vulnerability in Sri Lanka

arXiv cs.LG · Sonath Kirindage, Vihanga Nimsara, Sakindu Rajapaksa, Kavyanga Hathurusinghe · 2026-06-29

This study evaluates trauma care accessibility in Sri Lanka by quantifying gaps between clinical demand and specialized resource availability across 25 districts. Using national epidemiological data and terrain-aware H3 hexagonal modeling, the authors analyzed accessibility for seven critical conditions based on spatial gaps, clinical need-gaps, lethality, coverage, and resource availability. Unsupervised K-Means clustering categorized districts into four policy-actionable archetypes, revealing severe service deficits in Northern and Eastern provinces, where spatial gaps exceed 70%. The findings suggest that improving accessibility by 25% in high-priority clusters would reduce the national need-gap by 9.65%, providing a roadmap for strategic specialist redistribution.

h3 hexagonal modeling · clinical need-gaps · k-means clustering · spatial gaps · terrain-aware

Read original →

Same Concept, Different Directions: Cross-Modal Feature Heterogeneity in Sparse Autoencoders

arXiv cs.LG · Chungpa Lee, Jihoon Kwon, Kyle Min, Jy-yong Sohn · 2026-06-29

The paper identifies cross-modal feature heterogeneity in vision-language models, where semantically corresponding features diverge directionally across image and text modalities. To address this, the authors propose training modality-specific sparse autoencoders that preserve each modality's feature geometry, followed by post hoc alignment of corresponding features. This approach improves reconstruction fidelity and enhances performance in cross-modal retrieval and concept steering tasks, demonstrating that latent activation alignment alone is insufficient to resolve feature mismatch.

cross-modal feature heterogeneity · sparse autoencoders · vision-language models · feature geometry · concept steering

Read original →

Decision-Value Attribution in Predict-then-Optimize Systems

arXiv cs.LG · Konstantinos Ziliaskopoulos, Alexander Vinel, Alice E. Smith · 2026-06-29

The paper introduces Decision Value Attribution (DVA), a Shapley-based framework for explaining the operational value of predict-then-optimize systems by attributing value to information sources or design parameters. Three variants are proposed: InfoDVA (feature attribution), DesignDVA (operational configuration attribution), and Decision-Value Interactions (DVI) for joint attribution. The method distinguishes post-DVA (realized outcomes) from pre-DVA (model predictions) to diagnose alignment between model beliefs and performance. Case studies in electricity storage arbitrage and emergency medical services demonstrate DVA's ability to reveal mismatches between predictive explanations and operational value, guiding targeted interventions.
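
The Shapley machinery behind this kind of attribution can be sketched with an exact computation over a small set of information sources. The value function below is a hypothetical lookup table of decision values (for example, profit achieved when the optimizer sees only a given subset of forecasts); the names and numbers are invented for illustration and the paper's InfoDVA, DesignDVA, and DVI variants are not reproduced.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values; value maps a frozenset of players to a number."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result


# Hypothetical decision values for each subset of information sources.
decision_value = {
    frozenset(): 0.0,
    frozenset({"price_forecast"}): 10.0,
    frozenset({"demand_forecast"}): 2.0,
    frozenset({"price_forecast", "demand_forecast"}): 15.0,
}
attribution = shapley_values(["price_forecast", "demand_forecast"],
                             lambda s: decision_value[s])
```

The attributions sum to the value of using all sources, so each number reads as that source's share of the operational gain. Exact enumeration is exponential in the number of sources, so larger problems would need sampling-based approximations.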

shapley value · predict-then-optimize · value attribution · operational decision-making · decision relevance

Read original →

Implementation of Hyperelastic Physics-Augmented Neural Networks in the Explicit Finite Element Codes Simcenter Radioss and OpenRadioss with Applications to Impact Events

arXiv cs.LG · Lukas Maurer, Sascha Eisenträger, Marian Bulla, Daniel Juhre · 2026-06-29

This work integrates physics-augmented neural networks (PANNs) into the explicit finite element solvers Simcenter Radioss and OpenRadioss, enabling machine-learning-based constitutive modeling for engineering simulations. A framework is developed to transfer pretrained PANNs, trained in PyTorch or TensorFlow, into Fortran user material routines, ensuring compatibility with existing finite element technology without specialized solvers. Computational efficiency is improved by replacing Softplus with Squareplus activation functions, reducing evaluation costs while maintaining accuracy. A GitHub repository automates routine generation, requiring only the network architecture and trained parameters. Impact simulations demonstrate that PANNs accurately reproduce nonlinear hyperelastic material behavior under large strains, validating their practical application in explicit finite element simulations.
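
The activation swap can be illustrated with the standard definitions of the two functions. This is a minimal sketch in Python rather than the Fortran used in the paper; the default `b = 4` is an assumption, since the summary does not state the value the authors use.

```python
import math

def softplus(x):
    """Softplus: log(1 + exp(x)), written in a numerically stable form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def squareplus(x, b=4.0):
    """Squareplus: (x + sqrt(x^2 + b)) / 2, using only algebraic operations."""
    return 0.5 * (x + math.sqrt(x * x + b))

for x in (-3.0, 0.0, 3.0):
    print(x, softplus(x), squareplus(x))
```

Squareplus avoids the exponential and logarithm, which is the plausible source of the cheaper evaluation inside a solver's inner loop. Choosing `b = 4 * ln(2)**2` makes the two functions agree at zero.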

physics-augmented neural networks · explicit finite element · hyperelastic materials · fortran user material · squareplus activation

Read original →

Comparing Chatbot Performance Enhanced with Persistent Homology

arXiv cs.LG · Nithisha Raghavaraju, Barbara Giunti, Bastian Rieck · 2026-06-29

The study investigates performance enhancement in chatbots using persistent homology (PH) vectorizations derived from raw datasets, particularly for scenarios with limited or confidential training data. The authors compare multiple chatbot models with and without PH augmentation across various metrics. Results indicate that PH enhancement occasionally yields significant improvements at minimal computational cost, though benefits are not universally observed. The approach addresses challenges in domain-specific or privacy-sensitive applications where large datasets are unavailable.

persistent homology · chatbot performance · dataset augmentation · privacy-sensitive training · vectorization

Read original →

Theory of Continual Learning Against Data Poisoning Attacks

arXiv cs.LG · Yiting Hu, Lingjie Duan · 2026-06-29

The paper develops a theoretical framework for analyzing data poisoning attacks and defenses in regularization-based continual learning (CL), addressing a critical gap in CL security. By modeling adversary-defender interactions as an online zero-sum game, the authors establish fundamental performance limits: no defense succeeds when a linear proportion of tasks is poisoned with unbounded noise. They then analyze two defensible scenarios: infrequent attacks and bounded noise per attack. For infrequent attacks, they propose a task-to-task verification mechanism to detect poisoning and reduce cumulative bias. For bounded noise, they derive a robust defense that minimizes sensitivity to poisoned features and provably accelerates convergence. Experiments on realistic tasks validate the theoretical findings.

continual learning · data poisoning · regularization-based · online zero-sum game · task-to-task verification

Read original →

The Forgetting-Retention Dilemma: Certified Unlearning Theory in Continual Learning

arXiv cs.LG · Yiting Hu, Lingjie Duan, Qian Zhang · 2026-06-29

This work establishes the first theoretical foundation bridging continual learning (CL) and machine unlearning by formulating CL's unlearning objective as minimizing post-unlearning excess risk. The authors decompose this risk into CL excess risk and unlearning loss, characterizing the trade-off between knowledge preservation and targeted forgetting. Under mild assumptions, they derive an upper bound for CL excess risk in non-convex models and adapt gradient-based and Hessian-based certified unlearning approaches to CL. Experiments validate that while Hessian-based methods minimize unlearning loss more effectively, gradient-based approaches offer near-zero storage overhead, motivating a hybrid strategy balancing performance and efficiency.

continual learning · machine unlearning · excess risk · non-convex models · certified unlearning

Read original →

MemLeak: Diagnosing Information Leaks in Multimodal Agent Memory

arXiv cs.LG · Kuan Wang, Chao Zhang · 2026-06-29

The paper introduces MemLeak, a benchmark for diagnosing information leaks in multimodal agent memory systems when facts are deleted. The authors propose an Information Provenance Graph (IPG) taxonomy to classify memory representations by deletion affordance, revealing multiple leakage channels. Experiments show that while direct probing yields <1% recovery, retained correlated text enables 18.3% recovery and retained images enable 12.0% recovery (47% of image leaks are not text-recoverable), with content-aware semantic deletion reducing image residuals to 2.0%. Results are validated across multiple VLMs, a production system, and real photographs, with dual-annotator human validation (kappa=0.88).

multimodal memory · information leakage · visual language models · deletion affordance · information provenance graph

Read original →

GLIP: Graph and LLM Joint Pretraining for Graph-Level Tasks

arXiv cs.LG · Haoxin Sun, Yiqing Lin, Yajun Huang, Chenhui Dong · 2026-06-29

The paper introduces GLIP, a joint pretraining framework combining graph neural networks (GNNs) and large language models (LLMs) for graph-level tasks. The method employs graph augmentation to construct contrastive pairs, a multi-token selection strategy for informative patches, and a diffusion-based projector to capture global-local contextual signals. A joint objective aligns semantic (LLM) and structural (contrastive) supervision. Experiments demonstrate GLIP's superiority over state-of-the-art methods in graph-level classification and reasoning tasks with limited labeled data.

graph neural networks · large language models · contrastive learning · diffusion projector · graph-level tasks

Read original →

How Far Do On-Prem Open LLMs Get on Text-to-SQL? A Cross-Family Size x Technique Frontier on BIRD

arXiv cs.LG · Vladimir Beskorovainyi · 2026-06-29

This study benchmarks on-premises open-weight LLMs for Text-to-SQL on the BIRD dataset (n=1534, Execution Accuracy), evaluating Qwen2.5-Coder, CodeLlama-Instruct, and Llama-3.x families across sizes (7B-70B) under a unified protocol. The authors ablate model-agnostic techniques (schema linking, self-correction, self-consistency) and analyze their impact. Key findings: (1) model generation matters more than size, with Qwen2.5-Coder outperforming CodeLlama-Instruct at matched sizes; (2) self-correction consistently improves accuracy; (3) schema linking provides no significant benefit despite high recall; (4) self-consistency offers minimal gains at high computational cost. Results are validated via McNemar tests, with full reproducibility and cost analysis provided.
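
A McNemar test compares two systems on the same questions by looking only at the questions where they disagree. The sketch below is a minimal illustration, assuming the continuity-corrected chi-square form (the summary does not say which variant the paper uses); the function name and the toy correctness vectors are hypothetical.

```python
import math

def mcnemar_test(correct_a, correct_b):
    """Continuity-corrected McNemar test on paired per-question correctness."""
    # b: A right, B wrong; c: A wrong, B right. Agreements carry no information.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    if b + c == 0:
        return 0.0, 1.0
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Toy data: system A wins on 10 questions, system B on 2, and both are right on 5.
correct_a = [True] * 10 + [False] * 2 + [True] * 5
correct_b = [False] * 10 + [True] * 2 + [True] * 5
stat, p_value = mcnemar_test(correct_a, correct_b)
print(stat, p_value)
```

Because the test ignores questions both systems get right or wrong, it suits comparisons such as "with versus without self-correction" on a fixed evaluation set. For small disagreement counts an exact binomial version would be the safer choice.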

text-to-sql · execution accuracy · schema linking · self-correction · self-consistency

Read original →

Optimizing Nursing Care Taxi Dispatch Leveraging Integer Linear Programming Solvers and Machine Learning

arXiv cs.LG · Riku Nakao, Akihito Hiromori, Hamada Rizk, Hirozumi Yamaguchi · 2026-06-29

The paper introduces Nursing Care Taxi Dispatch, a constrained Vehicle Routing Problem variant with wheelchair, compatibility, and temporal constraints, where neural methods typically fail due to complexity. A Transformer-based supervised learning approach is proposed, trained on high-quality solutions from an integer linear programming solver, with post-processing for constraint satisfaction. Evaluations on real-world data show 8% lower operating times for <30-user instances while minimizing violations, outperforming existing methods in time-vs-quality tradeoffs.

vehicle routing problem · integer linear programming · transformer architecture · constraint satisfaction · supervised learning

Read original →

Simplifying Flow Matching Transformations with Low-Rank Mixture Models

arXiv cs.LG · Liam A. Kruse, Houjun Liu, Alexandros E. Tzikas, Mansur M. Arief · 2026-06-29

The authors propose using mixtures of probabilistic principal component analyzers (MPPCA) as latent densities in normalizing flows to simplify flow transformations and improve generative performance. By aligning the latent distribution more closely with the data distribution in terms of KL divergence, the method enables faster convergence and reduces topological mismatch. MPPCA models are efficiently fit using expectation-maximization, making them practical for high-dimensional tasks. Empirical validation on tabular and image datasets demonstrates consistent improvements in training efficiency and generation quality compared to standard normal latent densities.

normalizing flows · mppca · kl divergence · expectation-maximization · generative models

Read original →

ScaleAware-JEPA: Latent Representation for Discovery in Multiscale Physical Fields

arXiv cs.LG · Guang-Xing Li · 2026-06-29

ScaleAware-JEPA introduces a self-supervised framework for learning latent representations of multiscale physical fields by aligning predictive tasks with inherent scale hierarchies. The method employs Constrained Diffusion Decomposition (CDD) to separate fields into scale components, using diffusion-derived coordinates to define context-target masking geometry rather than fixed patches. Evaluated on MHD turbulence, interstellar molecular gas, and urban nighttime-light data, the approach generates dense structural atlases without labels, revealing coherent morphology through scale-aware latent spaces.

multiscale representation · self-supervised learning · constrained diffusion decomposition · latent coordinates · physical fields

Read original →

The Hidden Cost of Resampling: How Imbalance Correction Degrades Probability Calibration in Tree Ensembles

arXiv cs.LG · Zewen Liu · 2026-06-29

The study quantifies how class-imbalance correction methods affect probability calibration in tree ensembles, demonstrating that SMOTE introduces minor calibration degradation (ECE +0.009) while random undersampling causes severe miscalibration (ECE up to 0.395 at imbalance ratio 70). Through systematic experiments on five datasets (imbalance ratio 1.9-70) with random forests and gradient boosting, the authors show that post-hoc recalibration (Platt or isotonic) effectively mitigates these issues (66% ECE reduction) with minimal impact on discrimination (AUC -0.002). They establish that prior-shift correction fails for SMOTE due to distorted class-conditional densities, necessitating data-driven recalibration.
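
The calibration metric and the prior-shift idea mentioned above can be sketched as follows. This is a minimal illustration, assuming equal-width bins over the predicted positive-class probability and the standard odds-rescaling correction for random undersampling of the negative class; the bin count, function names, and toy data are assumptions, not the paper's setup.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin-size-weighted mean of |observed positive rate - mean predicted prob|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(pos_rate - mean_p)
    return ece

def prior_shift_correct(p_resampled, beta):
    """Map a probability learned after undersampling back to the original prior.

    beta is the fraction of negative examples kept during undersampling.
    """
    return beta * p_resampled / (beta * p_resampled - p_resampled + 1.0)

# Toy data: an overconfident model trained on undersampled data.
probs = [0.9, 0.9, 0.9, 0.9]
labels = [1, 0, 0, 0]
ece_before = expected_calibration_error(probs, labels)
corrected = [prior_shift_correct(p, beta=1 / 27) for p in probs]
ece_after = expected_calibration_error(corrected, labels)
print(ece_before, ece_after)
```

The correction only rescales the odds by a constant, so it can work when resampling changes class priors alone. That is consistent with the summary's point that it fails for SMOTE, where synthetic points also alter the class-conditional densities and a data-driven recalibrator is needed instead.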

class-imbalance · probability calibration · smote · random undersampling · post-hoc recalibration

Read original →

A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics in Self-Adapting LLM Agents

arXiv cs.LG · Liu Zewen · 2026-06-29

The authors introduce the Evaluator Preference Collapse (EPC) framework to diagnose instability in LLM evaluators, comprising the Multimodal Preference Collapse Index (MPCI), evaluator-indexed coupling matrix, and Jensen-Shannon divergence (JSD). They apply EPC across eight experimental conditions (N=122 repetitions), revealing evaluator coupling coefficients ranging from 0.00 to 1.18 (CV≈0.9), with four conditions showing strong coupling and four collapsing to near-zero. A notable finding is the May-to-June GPT-4o drift, where evaluator instability inverted study conclusions. Self-evaluation consistently collapsed (97% zero, JSD=0.003), though floor effects may confound results. Output-format analysis showed aggregate ρ=0.89 but per-instance ρ=0.219 (p=0.093).
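
Of the three EPC components, the Jensen-Shannon divergence has a standard definition that is easy to show. The sketch below is a minimal illustration, assuming base-2 logarithms (so values lie in [0, 1]) and discrete preference distributions; the toy distributions are hypothetical and the paper's exact inputs are not given in the summary.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by 1 bit."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy preference distributions over three output formats.
before = [0.5, 0.3, 0.2]
after = [0.5, 0.3, 0.2]
collapsed = [1.0, 0.0, 0.0]
print(js_divergence(before, after), js_divergence(before, collapsed))
```

A value near zero, such as the JSD=0.003 reported for self-evaluation, indicates that the two distributions being compared are almost identical.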

evaluator preference collapse · multimodal preference collapse index · coupling matrix · jensen-shannon divergence · llm evaluator instability

Read original →

IG-Lens: Exact Additive Probability Attribution Across Transformer Layers via Telescoping Integrated Gradients

arXiv cs.LG · Duc Anh Nguyen · 2026-06-29

IG-Lens introduces an exact additive probability attribution method for decoder-only transformers, addressing limitations in existing layer-wise readout tools. By applying Integrated Gradients in a telescoping manner along hidden states from baseline to final layer, it attributes probability changes per layer while preserving softmax nonlinearity. The method ensures exact summation to total probability change, eliminates Riemann discretization error, and operates efficiently in a single-pass batched implementation. Verification shows completeness to floating-point precision, with code available on GitHub.

integrated gradients · probability attribution · transformer layers · softmax nonlinearity · telescoping sum

Read original →

CAREBench: A Child-Safety Risk Benchmark for Language Models

arXiv cs.LG · Kaavya Krishna-Kumar, Elaine Lau, Vaughn Robinson, Jay Caldwell · 2026-06-29

CAREBench introduces a child-safety risk benchmark for language models, focusing on upstream risks before explicit harm occurs. The benchmark comprises 500 prompts across 12 categories (e.g., grooming, emotional dependency) annotated by parents and clinicians, excluding explicit abuse material. Evaluation of seven frontier models reveals failure rates from 2% to 58%, with varying patterns across risk categories. The benchmark aids LLM developers in identifying and mitigating child-safety policy gaps.

child-safety evaluation · language models · risk categories · upstream risks · failure rates

Read original →

Learning as Observable Matrix Dynamics: Diffusive Relaxations versus Phase Transitions

arXiv cs.LG · Igor Halperin · 2026-06-29

The paper introduces Observable Matrix Dynamics (OMD), a diagnostic framework for analyzing neural network training dynamics through time-evolving distance matrices of internal representations. OMD employs random matrix theory and particle dynamics to detect spectral reorganizations, decomposing matrices via Bogomolny-Bohigas-Schmit theory into ambient noise and latent geometric structures. Experiments reveal two regimes: diffusive dynamics lack stable spectral structure, while sharp reorganizations produce identifiable fingerprints corresponding to smooth, clustered, or soliton-like geometries. The method provides geometric regime identification beyond scalar intrinsic dimension metrics.

observable matrix dynamics · random matrix theory · spectral reorganization · bogomolny-bohigas-schmit theory · latent geometry

Read original →

I-BBS: Coordinate-Free Inference of Latent Sub-Manifolds Using Random Distance Matrix Theory

arXiv cs.LG · Igor Halperin · 2026-06-29

I-BBS introduces a coordinate-free method for inferring latent sub-manifolds from high-dimensional ambient distance matrices, applicable even when the ambient vector space is partially observable or undefined. The approach models ambient embeddings using generative noise, distinguishing between model-based and model-free classes, and identifies latent geometry through integer-stable signatures: the multiplicity of the top non-Perron multiplet and a parameter-free law governing multiplet positions under noise. Tests on synthetic spheres $S^1$, $S^2$, and $S^3$ demonstrate superior noise stability compared to continuous spectral slope, enabling accurate recovery of both manifold and noise model from a single distance matrix.

latent sub-manifolds · distance matrix · generative noise · integer-stable signatures · non-perron multiplet

Read original →

Adjusted Wasserstein distances for bridging empirical and true distributions with applications to MDS

arXiv cs.LG · Flor Martinez-Sermeno, Arturo Jaramillo, Johan Van Horebeek · 2026-06-29

The paper introduces Max-D-SW, an adjusted Wasserstein distance that aggregates contributions over orthonormal bases instead of single unit directions, enhancing Multidimensional Scaling (MDS) for pattern recognition. This modification improves numerical performance, particularly with heavy-tailed distributions, while maintaining statistical tractability with sample-complexity bounds comparable to max-sliced Wasserstein. Results demonstrate that superior sample complexity does not always correlate with better MDS performance, highlighting a nuanced trade-off in metric selection.

wasserstein distance · multidimensional scaling · pattern recognition · sample complexity · heavy-tailed distributions

Read original →

Benchmarking Geospatial Foundation Models for Agriculture Applications

arXiv cs.LG · Zhuocheng Shang, Sanmay Das, Ahmed Eldawy · 2026-06-29

The study benchmarks geographic transferability of geospatial foundation models (Prithvi, SpectralGPT, SatMAE) for agricultural applications, revealing significant performance degradation under regional distribution shifts. Using a controlled evaluation across four U.S. states (Iowa, North Carolina, California, Minnesota) with regionally separated train/validation/test splits, the authors measure cross-region generalization in multi-temporal crop segmentation and change detection. All models exhibit sharp performance drops, disproportionately predicting common crops while missing rare ones, with additional confounding effects from standardized input formatting. Results highlight critical limitations in current geospatial foundation models and advocate for region-aware evaluation standards.

geospatial foundation models · regional transferability · multi-temporal segmentation · crop classification · distribution shift

Read original →

t-STEP: An interpretable model for Total Electron Content predictions and irregularities estimations

arXiv cs.LG · Stephen Tete, Carl Shneider, Maxime Cordy, Claudio Cesaroni · 2026-06-28

The study introduces t-STEP, an interpretable machine learning model for high-resolution (30-second) Total Electron Content (TEC) prediction and irregularity estimation in the ionosphere. The model leverages GPS observations from solar cycle 24, employing SHAP for feature interpretability and dynamic time warping for robustness evaluation. Results show 91% accuracy (MAE: 4.38 TECU) during high solar activity, outperforming IRI-2020 by 35% in accuracy and 57% in error reduction, while capturing storm-induced irregularities better than an LSTM baseline.

total electron content · ionospheric irregularities · interpretable machine learning · dynamic time warping · geomagnetic storms

Read original →

Lie Group Diffusion Models for Hardware-Aware Quantum Circuit Synthesis

arXiv cs.LG · Jyotirmai Singh · 2026-06-28

The paper introduces Lie group diffusion models for hardware-aware quantum circuit synthesis, addressing the hybrid continuous-discrete structure of unitary compilation. The method combines a discrete circuit skeleton selector with a diffusion model operating on the SU(2) manifold to generate quantum gates. Evaluated on three-qubit Hamiltonian simulation targets (Transverse Field Ising Model, Heisenberg-XXZ Model), the approach outperforms baselines in synthesizing customizable circuits with varying rotation angles while balancing fidelity and complexity. Results demonstrate effective incorporation of hardware constraints and natural geometric integration for quantum circuit synthesis.

quantum circuit synthesis · lie group diffusion · su(2) manifold · hamiltonian simulation · hardware constraints

Read original →

Kriging and neural network models for pressure losses across perforated plates

arXiv cs.LG · Shuai Li · 2026-06-28

Novel data-driven models using kriging and neural networks (NN) are proposed to predict pressure losses across perforated plates in turbulent flows, outperforming empirical formulae across most configurations. The models are trained on limited experimental datasets and validated against measurements, demonstrating strong predictive accuracy. Their applicability is further tested in numerical simulations using Reynolds-averaged Navier-Stokes (RANS) equations, where the models are implemented as source terms in momentum equations. RANS predictions align excellently with model outputs, confirming their suitability for computational fluid dynamics applications.

kriging · neural networks · pressure losses · perforated plates · rans equations

Read original →

Bidirectional Autoregressive Latent Diffusion for Forward and Inverse Magnetohydrodynamics

arXiv cs.LG · Alexander Scheinker · 2026-06-28

The paper introduces a bidirectional autoregressive latent diffusion model for predicting multi-field magnetohydrodynamics (MHD) evolution. The method leverages bidirectional temporal flow as a self-supervised consistency metric, enabling uncertainty estimation without ground truth by comparing forward-backward predictions. Results demonstrate applications in non-invasive plasma diagnostics and robustness improvement via adaptive feedback from sparse measurements.

bidirectional autoregressive · latent diffusion · magnetohydrodynamics · self-supervised consistency · uncertainty estimation

Read original →

Boundary Degree as a Node-level Feature for Epidemic Scenario Identification in Agent-based Cascade Simulations

arXiv cs.LG · Amro Alabsi Aljundi, Galen Harrison, Jiangzhuo Chen, Abhijin Adiga · 2026-06-28

The paper introduces boundary degree, a node-level feature defined as the count of an infected node's uninfected contacts in a contact network, for epidemic scenario identification in agent-based cascade simulations. Through systematic ablation studies on realistic social contact networks of Tennessee and Virginia, the authors demonstrate that boundary degree alone improves scenario identification accuracy by 19%. The study provides theoretical grounding for the empirical importance of edge features and shows that boundary degree and edge features have complementary effects. The results indicate that certain epidemic scenarios are indistinguishable without boundary or edge information, suggesting that contact tracing applications should track contacts with non-infected individuals.
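
The boundary-degree definition above translates directly into a few lines of code. This is a minimal sketch on a toy contact network; the adjacency-dictionary representation and node names are assumptions made for illustration.

```python
def boundary_degree(contacts, infected):
    """For each infected node, count its contacts that are not infected."""
    return {
        node: sum(1 for neighbor in contacts[node] if neighbor not in infected)
        for node in infected
    }

# Toy undirected contact network as an adjacency dictionary.
contacts = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
infected = {"a", "b"}
print(boundary_degree(contacts, infected))
```

Computing the feature requires knowing contacts with uninfected individuals, which is why the summary ties it to what contact tracing applications record.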

boundary degree · epidemic scenario identification · agent-based simulations · contact tracing · node-level feature

Read original →

STEMGym: Benchmarking Sequential Decision-Making under Dose Budgets in Autonomous Electron Microscopy

arXiv cs.LG · Can Polat, Erchin Serpedin, Mustafa Kurban, Hasan Kurban · 2026-06-28

The paper introduces STEMGym, a Gymnasium benchmark for autonomous electron microscopy, challenging the assumption that adaptive navigation is key to sample-efficient acquisition. The benchmark comprises 15 physics-simulated STEM environments across five materials, three difficulty levels, and four tasks, evaluated via Dose-Efficiency Curve area (DEC-AUC). Results show perception pipelines dominate dose efficiency: a CNN analyst with naïve raster scanning improves DEC-AUC by 5.5x over baseline (0.287 vs. 0.052), while advanced navigation methods yield no significant gains. Vision-language models underperform task-specific CNNs by ~13x in defect analysis.

stemgym · dose-efficiency curve · autonomous microscopy · perception pipeline · cnn analyst

Read original →

Geometric Algebra Meets Cartesian Tensors: Higher-Order Equivariance for Interatomic Potentials

arXiv cs.LG · Can Polat, Erchin Serpedin, Mustafa Kurban, Hasan Kurban · 2026-06-28

(No summary returned.)

Read original →

Speculative Pre-Positioning: Decoding Stateful Sessions to the Next Decision Point Off the Critical Path

arXiv cs.LG · Victor Norgren · 2026-06-28

The paper introduces speculative pre-positioning, a method for stateful session management in LLM inference that reduces latency by pre-decoding sessions to their next decision point during idle periods. The approach uses the target model's own forward pass (without a draft model) to move cross-request prefill and entry-decode off the critical path. Results show a capable model achieves 87% precision in triggering the confidence gate, reducing first-token latency to 1.0 ms compared to 39 ms with prefix caching, while maintaining bounded false accept rates.

stateful inference · speculative decoding · latency reduction · confidence gate · prefix cache

Read original →

Persona-Trained Monte Carlo: Estimating Market-Outcome Distributions via Swarms of Persona-Conditioned Neural Policy Bots in a Limit Order Book

arXiv cs.LG · Salavat Ishbulatov · 2026-06-28

The authors propose Persona-Trained Monte Carlo (PTMC), a method for estimating market-outcome distributions by simulating interactions among persona-conditioned neural-policy trading bots in a limit order book. PTMC generates Monte Carlo samples through repeated simulations where bots, sharing a trained policy network but conditioned on heterogeneous persona parameters, interact in a continuous double auction. The method incorporates randomness through persona draws, action sampling, and optional exogenous shocks. The authors formalize the PTMC estimator, outline its convergence properties, and propose a four-level validation methodology. While not implemented, the framework contributes a formal estimator, cross-disciplinary design justification, and validation roadmap.

monte carlo · neural-policy · limit order book · persona-conditioned · double auction

Read original →

Optimizer Memory Makes Shuffle Order a First-Order Source of Fine-Tuning Noise

arXiv cs.LG · John Sweeney · 2026-06-28

The paper demonstrates that shuffle order introduces first-order fine-tuning noise in fixed-clock optimizers like AdamW, contrary to memoryless optimizers where such effects are second-order. By analyzing moment buffers and preconditioner states that advance with step index rather than learning-rate-scaled time, the authors derive a fit-free method to quantify this noise. Results show order-variance slopes of 1.83 for AdamW, 2.00 for fixed-β momentum, and 4.00 for SGD, with clock-matching restoring the regular exponent. The analysis provides error bars, attribution weights, and seed-budget criteria for fine-tuning comparisons.

fine-tuning noise · fixed-clock optimizers · momentum buffer · order-variance · gradient bracket

Read original →

Improved Multi-Dimensional Forecasting for Swap Regret

arXiv cs.LG · Joey Rivkin, Ramiro N. Deo-Campo Vuong, Robert Kleinberg, Chido Onyeze · 2026-06-28

The paper presents improved algorithms for multi-dimensional forecasting in swap regret minimization, targeting scenarios with multiple downstream agents of unknown objectives. For 2D outcome spaces, it introduces a polynomial-time algorithm achieving $\tilde{O}(\sqrt{kT})$ swap regret per agent, improving upon prior $\tilde{O}(kT^{5/8})$ bounds and exponential runtime. The method extends to higher dimensions with $\tilde{O}(\sqrt{T})$ regret, though runtime scales with dimension. For arbitrary dimension $d$, an $\tilde{O}(d\sqrt{kT})$ regret bound is shown, surpassing previous $\tilde{O}(T^{2/3})$ results that required behavioral assumptions.

swap regret · multi-dimensional forecasting · polynomial-time algorithm · downstream agents · regret minimization

Read original →

The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

arXiv cs.LG · Jing Liang, Hongyao Tang, Yi Ma, Yancheng He · 2026-06-28

The paper identifies training-inference mismatch as a key instability source in LLM reinforcement learning, where divergent probability distributions between training and inference engines create persistent off-policy effects. It proposes Monotonic Inference Policy Improvement (MIPI) as a new optimization objective that directly targets inference-side policy quality, implemented via a two-step Monotonic Inference Policy Update (MIPU) framework with sampler-referenced candidate generation and inference-gap-based acceptance. Experiments on two model scales demonstrate MIPU improves reasoning performance by 12-18% and training stability under high-mismatch conditions.

reinforcement learning · training-inference mismatch · off-policyness · policy optimization · large language models

Read original →

Do Models Read What They Write? Causal Registers in Scratchpad Reasoning

arXiv cs.LG · Benjamin Shih, John Winnicki, Eric Darve · 2026-06-28

The study demonstrates that models can causally use intermediate scratchpad states for computation, not merely as legible reasoning traces. Using a controlled state-tracking task with known transition rules, researchers edited internal representations of written states while keeping scratchpad text fixed, then measured downstream prediction accuracy. Qwen2.5-Coder-7B predicted correct next-phase bits 80-91% of the time when using edited states, significantly outperforming pretrained and final-answer-only controls. Results generalized across model families, suggesting scratchpad oversight should aim to train computationally integrated intermediate states rather than just transparent reasoning.

scratchpad reasoning · causal registers · process supervision · intermediate variables · state-tracking

Read original →

Not All Objectives Are Born Equal: Priority-Constrained Descent for Hierarchical Multi-Objective Optimization

arXiv cs.LG · Dara Varam, Mohamed I. Alhajri · 2026-06-28

The paper introduces Priority-Constrained Descent (PCD), a gradient-based optimization framework for hierarchical multi-objective problems where primary and secondary objectives have unequal importance. PCD preserves primary objective descent direction while minimally distorting gradients to ensure secondary objective progress, controlled by a parameter τ ∈ [0,1]. The method provides scaling invariance and closed-form solutions for 2-3 objectives. Experiments in network compression, sparsity, and low-rank tasks demonstrate Pareto dominance over baselines, with τ offering interpretable trade-offs between objectives.

hierarchical optimization · gradient descent · multi-objective learning · network compression · pareto efficiency

Read original →

Anti-Collapse Dynamics and the Emergence of Multi-Time-Scale Learning in Recurrent Neural Networks

arXiv cs.LG · Lorenzo Livi · 2026-06-28

The paper demonstrates that the temporal decay class (exponential vs. power-law) in recurrent neural networks emerges from coupled state-parameter dynamics, not fixed architecture. Through a coarse-grained stochastic process analysis, the authors prove the existence of an anti-collapsed regime with power-law forgetting when heavy-tailed parameter fluctuations balance training's bias toward short time scales. The spectral exponent β governs both time-scale spread and forgetting rate. Practical realization requires architectural/optimizer capacity to maintain broad time-scale spectra under heavy-tailed forcing, which enables long-range learning.

recurrent neural networks · long-range learning · power-law forgetting · spectral exponent · heavy-tailed fluctuations

Read original →

Harvesting AI Computation at the Edge via Generic Approximation

arXiv cs.LG · Yihan Wang, Huiru Yan, Luxin Zhang, Long Cheng · 2026-06-28

The authors propose a framework to harvest underutilized AI computation resources at the edge by converting general-purpose tasks into neural network models via neural architecture search (NAS). A runtime scheduler offloads these approximate tasks to idle AI chips, alleviating the burden on general-purpose processors without compromising primary AI workloads. Experiments on a representative AIoT processor demonstrate substantial performance improvements across various edge processing tasks.

neural architecture search · edge computing · runtime scheduler · aiot processor · approximation techniques

Read original →

A Mathematical Optimization Approach for Expert-Informed Bayesian Best Subset Selection

arXiv cs.LG · Nolan Alexander, Henning Mortveit · 2026-06-28

The paper introduces Expert-Implied Bayesian Best Subsets (EBBS), a method integrating domain-expert probability estimates of feature relevance into the mixed-integer optimization (MIO) framework for best subset selection. EBBS aggregates expert views using the Poisson binomial distribution, pairwise win rate, or normalized mean rank, incorporating them as log-odds penalty terms in the objective function. This approach reduces to classical Best Subsets when expert views are absent. The paper provides analytic derivations of the maximum a posteriori (MAP) formulation and characterizes its theoretical properties, with empirical results on synthetic and real datasets forthcoming.
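
One way to see how a log-odds penalty can encode expert views is the sketch below. It assumes independent Bernoulli inclusion priors and Gaussian noise with known variance, under which the log prior of a subset equals the sum of the log-odds of its selected features up to a constant. The function names and numbers are hypothetical, and the paper's exact formulation may differ.

```python
import math

def log_odds_penalty(selected, prior_prob):
    """Negative sum of prior log-odds over the selected feature indices."""
    return -sum(math.log(prior_prob[j] / (1.0 - prior_prob[j])) for j in selected)

def map_objective(rss, sigma2, selected, prior_prob):
    """Negative log posterior up to a constant: data fit plus expert penalty."""
    return rss / (2.0 * sigma2) + log_odds_penalty(selected, prior_prob)

# Hypothetical expert probabilities that each of three features is relevant.
expert_prior = [0.9, 0.5, 0.2]
neutral_prior = [0.5, 0.5, 0.5]

# Same residual sum of squares for two candidate subsets, different expert support.
favoured = map_objective(rss=4.0, sigma2=1.0, selected=[0], prior_prob=expert_prior)
disfavoured = map_objective(rss=4.0, sigma2=1.0, selected=[2], prior_prob=expert_prior)
print(favoured, disfavoured)
```

With every prior at 0.5 the penalty vanishes and the objective depends on the fit alone, which matches the summary's statement that the method reduces to classical best subset selection when expert views are absent.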

mixed-integer optimization · best subset selection · maximum a posteriori · poisson binomial distribution · log-odds penalty

Read original →

Reinforcement Learning in Super Mario Bros: Curriculum, Pedagogy, and Optimal Level Design in World 1-1

arXiv cs.LG · Jesse Ponnock, Lucas Ho · 2026-06-28

The study provides empirical validation for the pedagogical level design of Super Mario Bros World 1-1 by comparing reinforcement learning algorithms in discrete environment implementations. Four algorithms (Q-Learning, SARSA, Monte Carlo, DQN) were evaluated across three progressively complex level variants, with Monte Carlo achieving the highest win rate (94.9% ± 1.5%) by optimizing intermediate rewards. Curriculum experiments permuting six level segments showed that the canonical ordering yields the fastest convergence, the highest learning efficiency, and zero catastrophic failures, demonstrating its unique pedagogical structure.
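
Two of the four compared algorithms, Q-Learning and SARSA, differ only in the bootstrap target of their tabular update. The following is a minimal sketch of that difference, assuming a hypothetical toy table (3 states, 2 actions) and illustrative values for `alpha` and `gamma`; it is not the paper's environment or code.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the best action in the next state."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action actually taken next."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Hypothetical toy table: 3 states, 2 actions.
Q_ql = np.zeros((3, 2))
Q_ql[1] = [0.0, 1.0]          # in state 1, action 1 looks better
Q_sarsa = Q_ql.copy()

# Same transition, but the SARSA agent happens to take the worse action next.
q_learning_update(Q_ql, s=0, a=0, r=0.0, s_next=1)
sarsa_update(Q_sarsa, s=0, a=0, r=0.0, s_next=1, a_next=0)
```

After one identical transition the two tables already disagree, because Q-Learning credits the greedy next action while SARSA credits the action that was actually taken.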

reinforcement learning · curriculum learning · monte carlo methods · game design · pedagogical structure

Read original →

The Calibrated Deepfake Trust Score (CDTS): Competence-Coupled Trust Degradation Across Deepfake Detectors

arXiv cs.LG · Md Anas Biswas · 2026-06-28

The paper introduces the Calibrated Deepfake Trust Score (CDTS), demonstrating a competence-calibration coupling where calibration degrades as detector discriminative competence decreases (Pearson r = -0.81 across 32 configurations). The study validates this across three architectures (convolutional networks and CLIP ViT) and four datasets, showing label-free competence estimation can flag calibration risks. CDTS improves routing performance (lower AURC) and addresses calibration inequity across demographic subgroups. The authors propose competence-aware trust scoring as a unifying framework.

deepfake detection · calibration · trust score · competence estimation · vision transformer

Read original →

Chamber geometry and specification numbers of Boolean threshold functions

arXiv cs.LG · Martin Anthony · 2026-06-28

The paper establishes a geometric interpretation of Boolean threshold functions' specification numbers, linking them to chamber facets in a hyperplane arrangement. Using methods from combinatorial geometry and the resonance arrangement, it proves the average specification number is Θ(n), resolving a question by Gutekunst et al. The analysis extends to polynomial threshold functions and connects to threshold zonotopes and one-inclusion graphs. Operations preserving simpliciality and minimum specification number are characterized, including a resolution of a posed question about variable extensions.

boolean threshold functions · specification number · hyperplane arrangement · threshold zonotope · one-inclusion graph

Read original →

Structured Proper Loss Geometries for Multiclass Classification: Theory and Controlled Empirical Evaluation

arXiv cs.LG · Soumyadip Sarkar · 2026-06-28

The paper analyzes three multiclass loss functions—CAPM (class-aware quadratic Bregman score), HPG (log-cosh ridge generator), and APMS (HPG with annealed margin penalty)—through theoretical and empirical lenses. It derives bounds for conditional regret, curvature, and gradient behavior, while proving exact penalty-range properties for APMS. Controlled experiments on Digits, Wisconsin breast cancer, and synthetic datasets under varied noise and imbalance conditions show comparable performance to cross-entropy on clean data, with limited gains in specific noisy-label scenarios. Theoretical results are rigorously established, but empirical evidence does not support general superiority claims.

proper scoring rules · multiclass classification · bregman divergence · label noise robustness · conditional regret bounds

Read original →

Self-Supervised Calibration of Scientific Instruments Using Physical Consistency Constraints

arXiv cs.LG · M. Rejmund, A. Lemasson · 2026-06-28

The authors propose a physics-informed self-supervised framework for joint learning of detector calibration parameters and task-specific predictions from raw measurements, eliminating reliance on expert-labeled data. The method leverages physical consistency constraints to generate iterative pseudo-labels, reformulating calibration as a self-supervised optimization problem. Demonstrated on ionic charge-state determination in the VAMOS++ magnetic spectrometer, the approach achieves accurate reconstruction while inferring calibration coefficients that enable automated detector monitoring for gain drifts and aging effects.

self-supervised learning · instrument calibration · physical consistency · pseudo-labelling · detector monitoring

Read original →

Prototype Latent World Model Replay for Class-Incremental Learning

arXiv cs.LG · Weizhi Nie, Hui Wang, Weijie Wang, Yuting Su · 2026-06-28

The paper introduces Prototype Latent World Model Replay (LWM), a memory-free class-incremental learning framework that avoids catastrophic forgetting without storing raw exemplars. The method uses a frozen ImageNet-pretrained encoder to project images into a latent space, where old classes are represented as prototype-centered distributions with class-specific variances. During incremental learning, synthetic old-class samples are generated from these distributions and combined with new-class features to train a lightweight adapter and classifier, augmented by supervised contrastive loss for better separation. On Split CIFAR-100, LWM+Con improves LastAcc by 27.09%, 27.99%, and 26.14% absolute over fine-tuning for Inc5, Inc10, and Inc20 respectively, while maintaining AvgAcc above 45%. Ablations confirm the importance of stable latent-state replay and contrastive refinement.
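
The replay step can be pictured as sampling latent features around stored class prototypes instead of reloading raw images. The sketch below assumes diagonal Gaussians per class and uses random arrays as a stand-in for frozen-encoder features; the function names `fit_prototypes` and `replay` are hypothetical, and the paper's actual distribution family may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_prototypes(features, labels):
    """Store one mean and one per-dimension variance per class."""
    protos = {}
    for c in np.unique(labels):
        x = features[labels == c]
        protos[int(c)] = (x.mean(axis=0), x.var(axis=0) + 1e-6)
    return protos

def replay(protos, n_per_class, rng):
    """Draw synthetic latent features for old classes from diagonal Gaussians."""
    xs, ys = [], []
    for c, (mu, var) in protos.items():
        xs.append(rng.normal(mu, np.sqrt(var), size=(n_per_class, mu.shape[0])))
        ys.append(np.full(n_per_class, c))
    return np.concatenate(xs), np.concatenate(ys)

# Hypothetical stand-in for frozen-encoder features of two old classes.
old_feats = np.concatenate([rng.normal(0.0, 1.0, (200, 8)),
                            rng.normal(5.0, 0.5, (200, 8))])
old_labels = np.repeat([0, 1], 200)

protos = fit_prototypes(old_feats, old_labels)
syn_x, syn_y = replay(protos, n_per_class=50, rng=rng)
```

Only the per-class means and variances need to be kept between tasks; `syn_x` and `syn_y` would then be mixed with new-class features when training the adapter and classifier.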

class-incremental learning · latent replay · prototype distributions · contrastive loss · catastrophic forgetting

Read original →

Interpretable Inverse Design of Metal-Organic Frameworks with Large Language Model Agents

arXiv cs.LG · Kyungmin Nam, Seunghee Han, Jihan Kim · 2026-06-28

The paper introduces LLM4MOF, a closed-loop framework using language-model agents for interpretable inverse design of metal-organic frameworks (MOFs). Agents autonomously propose and test hypotheses about metal nodes, linkers, and pore geometry, refining designs over ten iterations. The system evaluates candidates through simulation, focusing on top-performing structures for adsorption, separation, and electronic-structure tasks within 400 evaluations. LLM4MOF outperforms random search and genetic algorithms, achieving cost-effective ($1 per campaign) and simulation-grounded design without per-objective model training.

metal-organic frameworks · inverse design · language-model agents · simulation-grounded · autonomous iterations

Read original →

How Much Due Diligence Before You Bid? Learning in Intractable Takeover Auctions

arXiv cs.LG · Zain Naboulsi · 2026-06-28

The paper contributes a computational model for studying due diligence in takeover auctions, demonstrating that self-play methods can effectively learn bidding strategies. Using a game-theoretic framework where bidders acquire costly private signals, the authors show that optimal diligence is finite, decreases with cost, and is further reduced under competition. They compare lightweight self-play (trained on a laptop) against specialized solvers, finding general methods competitive in intractable regimes while exact methods dominate smaller instances. Results provide empirical evidence for practical AI in complex auctions and quantify the economic value of information acquisition.

self-play learning · takeover auctions · due diligence · private signals · game-theoretic modeling

Read original →

Closing the Activation-Cone Blind Spot: Response-Time Probing and Unified Defense

arXiv cs.LG · Subhadip Mitra · 2026-06-28

The paper introduces response-time probing, a novel defense against prefilling attacks in large language models, addressing a structural blind spot in activation-cone-based defenses. The method employs a linear probe on hidden states at initial generated tokens, combined with a halt mechanism, achieving AUROC 0.97-1.00 across seven instruction-tuned models (7-31B). This approach reduces prefilling attack success to 0/40 with 0% benign false positives, outperforming Llama Guard 3. When composed with AlphaSteer's null-space steering, it achieves defense success rates of 0.983 on Mistral and 0.994 on Llama. Diverse negative training sets further reduce probe false positives from 80-100% to near zero.

response-time probing · prefilling attacks · activation-cone · null-space steering · linear probe

Read original →

Randomized neural operator for parametric PDEs with fast training and conformal uncertainty quantification

arXiv cs.LG · Zirui Deng, Jingbo Sun, Deyu Meng, Fei Wang · 2026-06-28

The paper introduces PCA-RaNN, a randomized latent neural operator for parametric PDEs that combines PCA-based dimensionality reduction with fixed random features and a closed-form least-squares readout. This approach reformulates latent operator learning as fixed-feature linear regression, reducing training time by 1-3 orders of magnitude while maintaining competitive accuracy. The method incorporates energy-matched scaling, BFGS refinement, and ensemble averaging for variance reduction. Evaluated on Burgers, Darcy, Navier-Stokes, and backward heat equation benchmarks, PCA-RaNN demonstrates favorable speed-accuracy trade-offs against baselines. It supports split-conformal prediction intervals and enables rapid online adaptation via recursive least squares without retraining hidden features.
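
The core recipe (PCA encoding, a hidden layer that is drawn once and frozen, and a least-squares readout) can be sketched in a few lines. Everything concrete below is an assumption made for illustration: the toy operator (a cumulative integral), the latent size, the feature count, and the simple standardization, which stands in for, and is not claimed to be, the paper's energy-matched scaling.

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_fit(X, k):
    """Return the mean and top-k principal directions of the rows of X."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

# Toy operator (hypothetical): map a function on a grid to its running integral.
grid = np.linspace(0.0, 1.0, 64)
coeffs = rng.normal(size=(300, 4))
U = sum(coeffs[:, [j]] * np.sin((j + 1) * np.pi * grid) for j in range(4))
V = np.cumsum(U, axis=1) * (grid[1] - grid[0])

# 1) PCA encoders for inputs and outputs.
mu_in, P_in = pca_fit(U, 4)
mu_out, P_out = pca_fit(V, 4)
Z_in = (U - mu_in) @ P_in.T
Z_in = Z_in / Z_in.std(axis=0)      # plain standardization (an assumption)
Z_out = (V - mu_out) @ P_out.T

# 2) Fixed random hidden layer: drawn once, never trained.
W = 0.5 * rng.normal(size=(4, 100))
b = rng.uniform(-1.0, 1.0, size=100)
H = np.tanh(Z_in @ W + b)

# 3) Least-squares readout: training reduces to one linear solve.
A, *_ = np.linalg.lstsq(H, Z_out, rcond=None)

# Decode back to the grid and measure the fit error on the training set.
V_hat = (H @ A) @ P_out + mu_out
rel_err = np.linalg.norm(V_hat - V) / np.linalg.norm(V)
```

Because only `A` is learned, and it is linear in the features `H`, the same readout could in principle be updated online with recursive least squares, which is the adaptation route the summary mentions.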

parametric pdes · randomized neural operator · pca-based dimensionality reduction · split-conformal prediction · recursive least squares

Read original →

Fractional Stochastic Neural Networks

arXiv cs.LG · Yuecai Han, Jianming Xu · 2026-06-28

The paper introduces fractional stochastic neural networks with residual dynamics governed by fractional Brownian motion. A discrete stochastic maximum principle yields adjoint recursion for training, while projected samplewise stochastic gradient descent achieves mean-square convergence for deterministic parameters. Experiments demonstrate superior performance in long-memory time series generation (vs. Brownian/deterministic baselines) and robustness in image classification under structured perturbations, alongside closed-form convergence tests and noisy regression with uncertainty quantification.

fractional brownian motion · stochastic maximum principle · adjoint recursion · long memory recovery · structured perturbations

Read original →

Fourier Neural Operators with Least-Squares Readout Refit for Learning Random Obstacle-to-Solution Maps

arXiv cs.LG · Chenhui Zhu, Fei Wang · 2026-06-28

The paper introduces a least-squares readout refit (FNO-LS) for Fourier neural operators to improve learning of random obstacle-to-solution maps from elliptic variational inequalities. The method freezes the trained FNO backbone and recomputes the final affine readout via linear least-squares over all training samples and grid points, optimizing the readout while preserving nonlinear features. Evaluated against DeepONet variants and vanilla FNO on obstacle ensembles, FNO-LS achieves superior performance in field accuracy, contact-set recovery, and obstacle-violation metrics, particularly for high-amplitude obstacles with complex contact geometry. The refit provides a low-cost post-training enhancement when the FNO backbone is informative but not fully converged.
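
The refit amounts to ordinary least squares on the frozen backbone's final features. The sketch below assumes random arrays as a stand-in for those features (rows are samples times grid points, columns are channels) and a deliberately perturbed readout standing in for an under-trained one; none of it comes from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen-backbone features: (samples * grid points, channels).
features = rng.normal(size=(5000, 16))
true_w = rng.normal(size=16)
targets = features @ true_w + 0.3 + 0.01 * rng.normal(size=5000)

# An imperfect readout, standing in for one left under-trained by SGD.
w_old = true_w + 0.2 * rng.normal(size=16)
b_old = 0.0
mse_old = np.mean((features @ w_old + b_old - targets) ** 2)

# Refit: append a ones column so the bias is solved jointly with the weights.
design = np.hstack([features, np.ones((features.shape[0], 1))])
theta, *_ = np.linalg.lstsq(design, targets, rcond=None)
w_new, b_new = theta[:-1], theta[-1]
mse_new = np.mean((design @ theta - targets) ** 2)
```

On the training data the refit can never be worse than the existing affine readout, since that readout is one of the candidates the least-squares problem minimizes over; whether the gain carries over to held-out obstacles is an empirical question.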

fourier neural operator · operator learning · least-squares refit · elliptic variational inequalities · contact-set recovery

Read original →

Temporal Posed and Spontaneous Gesture Recognition from Electromyography in the Rock-Paper-Scissors Game

arXiv cs.LG · Xin Wei, Huakun Liu, Felix Dollack, Monica Perusquia-Hernandez · 2026-06-28

This work investigates temporal electromyography (EMG) characteristics for gesture recognition in Rock-Paper-Scissors (RPS), focusing on posed versus spontaneous gestures and inter-player dynamics. Twenty-four participants played RPS in dyads while forearm EMG was recorded. EMG onsets were detected 800 ms before visible gesture onset, peaking 342 ms prior. Posed gesture recognition achieved 63.4% accuracy, while spontaneous gestures yielded 53.6%. Opponent EMG analysis revealed gesture detection at 65% accuracy, peaking 2082 ms after visual onset, indicating reaction-based gesture inference. Results demonstrate EMG's predictive advantage for rapid intent recognition, with implications for human-computer interaction and assistive technologies.

electromyography · gesture recognition · rock-paper-scissors · temporal analysis · intent recognition

Read original →

Can Machines Really See Objects in Images? A Study Based on Syntactic Distance and Visual Self-Referential Instances

arXiv cs.LG · Xingyu Peng, Junran Wu, Yue Hou, Zhongliang Qiao · 2026-06-28

The study investigates the structural limitations of vision models in recognizing global semantics by introducing syntactic distance, a metric quantifying class separability based on symmetry of operations. A visual self-referential task is constructed using maximum-variance binary noise, where positive and negative samples differ only in global semantics but have zero syntactic distance, eliminating local statistical cues. Experiments on ResNets and Vision Transformers demonstrate a phase-transition phenomenon, with accuracy collapsing to random guessing beyond a critical image scale, unaffected by larger training sets or model size. Globally attentive ViTs exhibit earlier collapse, revealing a capability boundary in current architectures for global-concept tasks.

syntactic distance · visual self-referential · phase-transition · global semantics · vision transformers

Read original →

Self-Organized Conformal Prediction: Reducing Regional Coverage Gaps with Unsupervised Group Discovery

arXiv cs.LG · Louis Berthier, Ahmed Shokry, Maxime Moreaud, Guillaume Ramelet · 2026-06-28

Self-Organized Conformal Prediction (SOCP) introduces a calibration scheme that reduces regional coverage gaps by discovering input-space groups via Self-Organizing Maps (SOMs) and retrieving local calibration buffers from best-matching unit cells or fixed grid neighborhoods. The method maintains exact validity for BMU-cell retrieval and approximate validity for neighborhood buffers, with a split-routed extension ensuring fixed retrieved-set validity. Evaluated on eight regression and classification benchmarks, SOCP reduces weighted regional coverage gaps on 7/8 datasets (mean paired change −7.1%) with a 6.2% mean prediction-set size increase, demonstrating efficacy without supervised partitions or predictor retraining.
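
The idea of calibrating within discovered groups rather than globally can be illustrated with plain split-conformal quantiles computed per group. In this sketch the Self-Organizing Map is replaced by fixed equal-width bins (`group_of` is a hypothetical stand-in for the best-matching-unit lookup), and the data and predictor are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile of calibration scores."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf
    return np.sort(scores)[k - 1]

def group_of(x, n_groups=4):
    """Stand-in for the best-matching-unit lookup: equal-width bins on [0, 1)."""
    return np.minimum((x * n_groups).astype(int), n_groups - 1)

def sample(n):
    """Heteroscedastic toy data: the noise level grows with x."""
    x = rng.uniform(0.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 0.1 + x, n)
    return x, y

def predict(x):
    """Fixed predictor that is never retrained."""
    return 2.0 * x

alpha = 0.1
x_cal, y_cal = sample(4000)
x_te, y_te = sample(4000)

# Absolute-residual nonconformity scores, calibrated separately per group.
scores = np.abs(y_cal - predict(x_cal))
g_cal, g_te = group_of(x_cal), group_of(x_te)
q_local = np.array([conformal_quantile(scores[g_cal == g], alpha) for g in range(4)])

covered = np.abs(y_te - predict(x_te)) <= q_local[g_te]
coverage_by_group = np.array([covered[g_te == g].mean() for g in range(4)])
```

With a single global quantile the low-noise bins would be over-covered and the high-noise bins under-covered; the per-group quantiles widen the intervals only where the residuals are larger, which is the trade-off between coverage gap and set size that the summary reports.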

conformal prediction · self-organizing map · regional coverage · nonconformity score · quantile regression

Read original →

Exploring the Cryptographic Limits of Transformer Networks

arXiv cs.LG · Stefan Domunco, Andis Draguns, Philip Torr, Isaac Robinson · 2026-06-28

This work establishes constructive upper bounds on transformer networks' cryptographic capabilities by mapping cryptographic functions to transformer architectures. The authors generate threshold circuits for Keccak functions, Merkle-Damgård constructions, and Merkle Trees, then propose two architectural mappings: no-attention and tokens-as-gates. Verified scaling laws for circuit width and depth are derived, providing structural guarantees for transformer computational capacity and enabling principled capability evaluations of AI systems.

threshold circuits · keccak functions · merkle-damgård · transformer architectures · computational capacity

Read original →

Interventional Flow Matching: Prospective Dose-Response Forecasting with Velocity-Field Jacobian Regularization

arXiv cs.LG · Amirreza Dolatpour Fathkouhi, Justin Lee, Heman Shakeri · 2026-06-28

The paper introduces Interventional Flow Matching (IFM), a continuous-time generative framework for prospective dose-response forecasting in glucose management. IFM conditions a flow-matching velocity field on patient history and planned treatments, using Jacobian regularization to enforce physiologically plausible responses without mechanistic ODEs. The method penalizes velocity-field Jacobians with respect to smoothed treatment drivers, ensuring signed, bounded sensitivities (e.g., insulin lowers glucose). Evaluated on a simulated UVA/Padova type 1 diabetes cohort, IFM achieves optimal balance between observational RMSE and interventional metrics while maintaining physiological correctness and directional consistency.

flow matching · dose-response forecasting · jacobian regularization · continuous-time generative models · physiological trajectory prediction

Read original →

Dynamic Parsing and Updating Natural Language Specification using VLMs for Robust Vision-Language Tracking

arXiv cs.LG · Xiao Wang, Liye Jin, Dan Xu, Yuehang Li · 2026-06-28

The paper proposes a language dependency parsing mechanism for vision-language tracking that dynamically updates textual descriptions to mitigate semantic-visual mismatches. The method leverages Qwen-VL's cross-modal understanding to perform component-aware updates of target objects, semantic concepts, and background context. Integrated into a baseline framework, it achieves superior performance on TNL2K, LaSOT, TNLLT, and OTB-LANG benchmarks. Source code and pre-trained models will be publicly released.

vision-language tracking · language dependency parsing · semantic-visual mismatch · qwen-vl · component-aware updates

Read original →

Adaptive Financial Transformer with Regime-Gated Attention for Stock Return Prediction

arXiv cs.LG · Dishan Sarkar · 2026-06-28

The Adaptive Financial Transformer (AFT) introduces regime-gated attention for stock return prediction in non-stationary markets. The model employs a Market Regime Encoder, Adaptive Gate Network, and Adaptive Financial Context module to dynamically adjust self-attention based on 95 financial features grouped into 11 semantic categories. It addresses sequence alignment and backtesting issues while optimizing a composite objective of prediction error, directional accuracy, and Sharpe ratio. Evaluations show competitive performance with 15.2% reduced complexity and improved parameter efficiency compared to Transformer baselines.

adaptive financial transformer · regime-gated attention · market regime encoder · non-stationary markets · composite objective

Read original →

Reliability, Faithfulness, and the Limits of Post-hoc Explanations of Opaque Scientific Models

arXiv cs.LG · Nick Oh, Helen Jin · 2026-06-28

The article critiques the adequacy of post-hoc explanation methods for interpreting scientific machine learning models, arguing that reliability and faithfulness alone cannot validate structural claims about the underlying phenomenon. While reliability ensures model predictions align with observed outcomes and faithfulness ensures explanations match the model's behavior, neither verifies that the model's mechanisms mirror the phenomenon's actual structure. The authors contend that such explanations can only generate candidate hypotheses requiring external corroboration, not definitive structural insights.

post-hoc explanations · model reliability · explanation faithfulness · scientific machine learning · structural claims

Read original →

Two kinds of robustness are not the same: disentangling fault tolerance and low-SNR robustness in multi-domain event detection on real data

arXiv cs.LG · Isao Kurosawa · 2026-06-28

The study disentangles fault tolerance and low-SNR robustness in multi-domain event detection, demonstrating that sensor-dropout training dominates robustness gains over architectural redundancy. Using a unified benchmark from three real-world datasets (Hi-net seismic, Utah FORGE DAS, MAFAULDA vibration), the authors evaluate CEPHALON (a fault-tolerant detector) against standard models (1D CNN, TCN, compact Transformer) under sensor loss and additive noise. While all models achieve near-perfect AUC (~0.99) on clean data, CEPHALON excels in low-SNR conditions (AUC 0.939 vs. 0.532-0.572 at -2.5 dB), with ablation showing sensor-dropout training as the primary factor. The pipeline is released for reproducibility.

event detection · sensor-dropout · low-snr robustness · fault tolerance · multi-domain benchmark

Read original →

AMR: Adaptive Modality Routing for Multimodal Polyglot Speaker Identification

arXiv cs.LG · Chuxiao Zuo, Yao Zhu, Minqiang Xu, Manhong Wang · 2026-06-28

The paper introduces Adaptive Modality Routing (AMR), a novel modality fusion module for multimodal polyglot speaker identification addressing missing modalities and language mismatch. AMR dynamically assesses input quality using modality adapters for audio (W2V-BERT 2.0) and face embeddings (IResNet-18), followed by a trainable router estimating dynamic modality weights for logit aggregation. Training employs a modality-aware strategy with four sample pair types and KL divergence supervision. Evaluated on POLY-SIM 2026, AMR achieves 99.93% (English multimodal), 100.00% (Urdu multimodal), 97.50% (English audio-only), and 98.83% (Urdu audio-only) accuracy, averaging 99.07% and outperforming FOP by 32.73%.
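
Router-weighted logit aggregation with a missing modality can be sketched as a masked softmax over per-modality scores. The arrays below are hypothetical toy values, and `fuse_logits` is an illustrative name rather than the paper's API; the adapters and the router network itself are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_logits(audio_logits, face_logits, router_scores, face_available):
    """Weight per-modality logits by router weights; mask missing faces."""
    scores = router_scores.copy()
    scores[~face_available, 1] = -np.inf   # zero weight for a missing face
    w = softmax(scores, axis=1)            # columns: [audio, face]
    fused = w[:, [0]] * audio_logits + w[:, [1]] * face_logits
    return fused, w

# Hypothetical batch of 2 samples and 3 speakers; sample 1 has no face.
audio = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
face = np.array([[0.1, 3.0, 0.2], [0.0, 0.0, 0.0]])
router = np.array([[0.0, 0.0], [1.0, 2.0]])
available = np.array([True, False])

fused, weights = fuse_logits(audio, face, router, available)
```

In the first sample the two modalities disagree and equal router scores let the more confident face logits decide; in the second, the missing face is masked out so the prediction falls back to audio alone.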

adaptive modality routing · modality fusion · polyglot speaker identification · modality adapters · kl divergence

Read original →

Sample Complexity of Scientific Discovery: PAC Learnability of Compositional Function Trees

arXiv cs.LG · Şuayp Talha Kocabay, Talha Rüzgar Akkuş, Kerem Yalçın · 2026-06-28

The paper establishes PAC-learnability guarantees for compositional function trees in symbolic regression, showing that generalization error depends polynomially on depth d and Lipschitz constants rather than exponentially on symbolic complexity. Using Rademacher complexity analysis, the authors prove risk bounds scaling as O(L^d/√n) for trees built from K base operators with arity b, when K,b=O(1). Theoretical results are validated empirically via differentiable operator trees trained on synthetic physics-like targets, demonstrating correlation between generalization gap and the predicted (L^d)/√n complexity term.

pac-learning · rademacher complexity · symbolic regression · compositional functions · lipschitz constants

Read original →

Gradient boosting with vector-valued leafs

arXiv cs.LG · David Cortes · 2026-06-28

The paper extends gradient boosting to vector-valued objective functions, addressing limitations in existing frameworks that either update vector elements sequentially or use diagonal Hessian approximations. The proposed method generalizes gradient boosting for vector outputs, enabling efficient optimization of multivariate objectives like multinomial logistic regression. A key contribution is a novel algorithm compatible with histogram-based decision trees, maintaining computational efficiency while handling full vector updates. The approach theoretically supports arbitrary vector-valued loss functions and demonstrates practical applicability to multi-class classification scenarios.

gradient boosting · vector-valued objective · multinomial logistic regression · histogram-based trees · hessian approximation

Read original →

Deciphering Region-Level Signatures from Latency Measurements in LEO Satellite Internet

arXiv cs.LG · Xiang Shi, Yifei Zhang, Peng Hu · 2026-06-28

The paper proposes a hierarchical framework for analyzing region-level latency signatures in LEO satellite Internet using Starlink RTT measurements from the LENS dataset. The method transforms raw RTT sequences into multi-scale statistical features for cross-region comparison, identifying infrastructure availability and dish-to-PoP distance as key deployment factors. Results show 83% accuracy in short-term region classification using XGBoost, with minimum RTT as the most discriminative feature, though performance degrades over longer periods due to limited temporal generalization.

leo satellite internet · round-trip time · hierarchical framework · xgboost · temporal generalization

Read original →

📰 Industry Media (1)

Agriculture is ready for AI, but its data isn’t

MIT Tech Review — AI · Carole Hill, Manish Sood · 2026-06-30

Agricultural AI systems demonstrate potential for yield improvement (26%), water reduction (41%), and chemical optimization (33%), but require robust data foundations to avoid misleading outputs. The article identifies key challenges: disparate IoT sensor data, heterogeneous field conditions (soil variation, GPS coordinates), and dynamic external inputs (weather, market data). Effective implementation necessitates unified data models with governance frameworks, exemplified by Reltio's context intelligence layer integrating entities, relationships, and business rules. Without such infrastructure, precision agriculture risks 'garbage in, garbage out' scenarios with operational and compliance consequences.

precision agriculture · iot sensor fusion · context intelligence layer · data governance · yield prediction

Read original →

Generated automatically at 2026-06-30 21:32 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.