Daily Digest — 2026-06-19

Thursday, June 18, 2026 · 276 items · model: deepseek/deepseek-chat

276 items · 6 research labs, 263 arXiv papers, 7 industry media

🏛️ Research Labs (6)

New usage analytics and updated spend controls for enterprises

OpenAI News · 2026-06-18

OpenAI introduces enhanced credit usage analytics and spend controls for ChatGPT Enterprise, enabling organizations to monitor and manage AI deployment costs more effectively. The Global Admin Console provides granular breakdowns of credit consumption across users, products, and models, while updated spend controls allow admins to set default limits, group-specific thresholds, and individual overrides. These features facilitate cost management without restricting high-impact work, with data accessible via a unified Cost API for deeper analysis. The tools are available immediately for ChatGPT Enterprise workspaces.

chatgpt enterprise · credit usage analytics · global admin console · spend controls · cost api

Improving health intelligence in ChatGPT

OpenAI News · 2026-06-18

OpenAI introduces GPT-5.5 Instant, enhancing health intelligence in ChatGPT through physician-led evaluations and model advancements. The model improves at recognizing urgent care needs, handling uncertainty, and simplifying complex health information. Evaluations include HealthBench and HealthBench Professional, which assess accuracy, safety, and context awareness. Physicians rated GPT-5.5 Instant responses above those of older models and of human physicians, with fewer failure modes. Production traffic analysis shows a 71% reduction in factuality issues over two months. Over 260 physicians across 60 countries reviewed 700,000 responses, informing rubric development and continuous improvement.

gpt-5.5 instant · healthbench · physician-led evaluation · factuality issues · context awareness

Using AI to help physicians diagnose rare genetic diseases affecting children

OpenAI News · 2026-06-18

A study demonstrates AI-assisted retrospective genomic reanalysis for diagnosing rare childhood diseases, yielding an additional 4.8% diagnostic resolution in previously unsolved cases. Researchers employed OpenAI's o3 Deep Research model to analyze 376 de-identified cases, integrating clinical phenotypes, inheritance patterns, variant annotations, and scientific literature into reviewable hypotheses. Expert clinicians confirmed 18 diagnoses, including rediscoveries and novel mechanistic insights, through established clinical workflows. The model achieved a mean confidence score of 85.6 for correct diagnoses in validation sets. This workflow highlights the scalability of periodic reanalysis as genomic knowledge evolves, though all diagnoses required human adjudication and clinical confirmation.

genomic reanalysis · phenotype · variant annotations · clinical confirmation · retrospective analysis

MosaicLeaks: Can your research agent keep a secret?

Hugging Face Blog · 2026-06-18

MosaicLeaks introduces a benchmark for evaluating privacy leakage in deep-research agents that interleave private documents with web queries, demonstrating that standard training increases both task performance (strict chain success from 48.7% to 59.3%) and leakage (from 34.0% to 51.7%). The proposed Privacy-Aware Deep Research (PA-DR) method combines situational task rewards with a learned privacy reward, reducing answer/full-information leakage to 9.9% while maintaining 58.7% strict chain success. Results show PA-DR achieves 5-6x better sample efficiency than outcome-only RL by precisely assigning credit to individual agent calls.

privacy leakage · multi-hop reasoning · situational rewards · query construction · sample efficiency

Beyond LoRA: Can you beat the most popular fine-tuning technique?

Hugging Face Blog · 2026-06-18

The Hugging Face PEFT library benchmarks parameter-efficient fine-tuning (PEFT) techniques, challenging LoRA's dominance by evaluating 40+ methods on LLM math reasoning and image generation tasks. Using consistent experimental setups, the study identifies Pareto-optimal techniques like BEFT and OFT, which outperform LoRA in memory efficiency (20.2GB vs 22.6GB) or accuracy (54.9% vs 53.2%). Results show LoRA is not universally optimal, with alternatives achieving higher DINO similarity (0.708 vs 0.697) at lower VRAM usage (9.01GB vs 9.97GB). The benchmarks enable objective comparisons across metrics including VRAM usage, runtime, and checkpoint size.

parameter-efficient fine-tuning · lora · pareto frontier · dino similarity · vram usage

Is it agentic enough? Benchmarking open models on your own tooling

Hugging Face Blog · 2026-06-18

The article introduces a novel benchmarking framework for evaluating how effectively AI coding agents interact with software libraries, using Hugging Face Transformers as a case study. The method measures agent performance across three tiers (bare installation, cloned repository, and skill-enhanced usage) while tracking metrics like token efficiency, latency, and success rates. Results show that CLI optimizations reduce execution time by 1.3-1.8× for large models but increase token consumption due to documentation parsing, while smaller models exhibit varied performance across tiers.

agentic benchmarking · in-context learning · token efficiency · cli optimization · transformers library

📜 arXiv Papers (263)

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

arXiv cs.AI · Mohamed Nabail, Leo Cheng, Jingmin Wang, Nicholas Rhinehart · 2026-06-17

UBP2 introduces an uncertainty-balanced preference planning method for efficient preference-based reinforcement learning, addressing poor sample efficiency in existing approaches. The method actively directs exploration by jointly reasoning over uncertainties in reward, dynamics, and value functions using ensembles of models. It evaluates candidate trajectories via a unified score combining expected reward, terminal value, and epistemic uncertainty, enabling explicit tradeoffs between exploitation and information acquisition. Theoretical analysis provides sublinear regret guarantees for finite-horizon and infinite-horizon settings. Empirical evaluation on the Meta-World benchmark demonstrates UBP2's superior sample efficiency compared to model-free preference-based methods and non-optimistic model-based baselines.

preference-based rl · epistemic uncertainty · model ensembles · sample efficiency · sublinear regret
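
The unified trajectory score described for UBP2 can be pictured with a minimal sketch. The digest does not give the paper's actual formula, so the additive form, the use of ensemble disagreement as the epistemic term, and the `beta` weight below are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def trajectory_score(reward_preds, value_preds, beta=1.0):
    """Score one candidate trajectory from ensemble predictions.

    reward_preds: each ensemble member's predicted return along the trajectory
    value_preds:  each ensemble member's predicted terminal value
    beta:         hypothetical weight on the uncertainty bonus (an assumption)
    """
    totals = [r + v for r, v in zip(reward_preds, value_preds)]
    expected = mean(totals)        # exploitation: expected reward plus terminal value
    uncertainty = pstdev(totals)   # ensemble disagreement as an epistemic proxy
    return expected + beta * uncertainty

# Two made-up candidates with the same expected total but different disagreement.
candidates = {
    "familiar": ([1.0, 1.0, 1.0], [0.5, 0.5, 0.5]),
    "novel": ([0.0, 2.0, 1.0], [0.0, 1.0, 0.5]),
}
best = max(candidates, key=lambda name: trajectory_score(*candidates[name]))
```

With a positive `beta`, the candidate the ensemble disagrees on wins the tie, which is one simple way to trade exploitation against information acquisition.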

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

arXiv cs.AI · Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan · 2026-06-17

The paper introduces Rubric-Conditioned Self-Distillation, a framework that leverages structured rubrics for fine-grained feedback in on-policy self-distillation of reasoning language models. Unlike traditional supervised distillation or scalar reward reinforcement learning, this method conditions the teacher model on criterion-level rubrics to provide token-level guidance on student-generated trajectories, avoiding reliance on potentially noisy chain-of-thought annotations. The framework employs a two-stage pipeline: rubric generation followed by rubric-guided reasoning. Evaluation across diverse science reasoning benchmarks demonstrates its effectiveness, outperforming GRPO by 1.0 points and OPSD by 0.9 points on average.

self-distillation · rubrics · token-level guidance · reasoning language models · on-policy learning

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

arXiv cs.AI · Michael Finkelson, Daniel Segal, Eitan Richardson, Shahar Armon · 2026-06-17

ScenA introduces a reference-driven multi-speaker audio scene generation method that conditions a text-to-audio flow-matching foundation model on multiple reference voices and free-form natural language prompts. The approach leverages pretrained in-the-wild priors to produce natural conversational audio with ambient noise, overlapping dialogue, and paralinguistic events, avoiding structured supervision. A key innovation addresses the Reference Shortcut problem through high-noise-biased timestep distribution, forcing text prompt reliance for speaker assignment. Evaluated on CoVoMix2-Dialogue, ScenA outperforms existing systems in speaker-binding metrics while generating rich audio scenes with emotional vocalizations and ambient sound.

flow-matching · reference shortcut · paralinguistic · timestep distribution · speaker-binding

Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents

arXiv cs.AI · Anoushka Vyas, Aarushi Dhanuka, Sina Khoshfetrat Pakazad, Henrik Ohlsson · 2026-06-17

The paper introduces Data Intelligence Agents (DIA), a system of three autonomous coding agents (Data Interpreter, Schema Creator, Query Generator) that streamline enterprise data integration by generating, executing, validating, and repairing concrete artifacts. DIA leverages a shared memory for experience reuse and surfaces artifacts for domain expert review. The Query Generator is evaluated in fully autonomous mode across seven SQL benchmarks spanning four task categories and four dialects, matching or surpassing the best published results on all benchmarks. This demonstrates the generalization capability of an execution-grounded architecture based on ACAs and shared memory.

autonomous coding agents · data integration · query generator · shared memory · sql benchmarks

Explaining Attention with Program Synthesis

arXiv cs.AI · Amiri Hayes, Belinda Li, Jacob Andreas · 2026-06-17

The paper introduces a program synthesis method for approximating transformer attention heads with human-readable Python programs. Using GPT-2, TinyLlama-1.1B, and Llama-3B, the approach generates programs by prompting a language model with attention matrix summaries, then ranks them by predictive accuracy on held-out data. Results show that fewer than 1,000 programs achieve more than 75% IoU similarity on TinyStories, and that replacing 25% of heads incurs only a 16% perplexity increase while maintaining QA performance. This advances symbolic interpretability for neural attention mechanisms.

program synthesis · attention heads · transformer interpretability · intersection-over-union · perplexity
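
To make the IoU comparison concrete, here is a small sketch of scoring one readable program against an observed attention pattern. The "previous token" program, the 0.5 binarization threshold, and the toy attention matrix are all hypothetical; the paper's exact scoring procedure is not described in this digest.

```python
def previous_token_program(n_tokens):
    """Hypothetical readable program: each token attends to its predecessor."""
    return {(q, q - 1) for q in range(1, n_tokens)}

def binarize(attention, threshold=0.5):
    """Keep the (query, key) pairs whose attention weight reaches the threshold."""
    return {
        (q, k)
        for q, row in enumerate(attention)
        for k, weight in enumerate(row)
        if weight >= threshold
    }

def iou(predicted, observed):
    """Intersection-over-union of two sets of (query, key) pairs."""
    union = predicted | observed
    return len(predicted & observed) / len(union) if union else 1.0

# Toy causal attention matrix (rows are queries, columns are keys).
attention = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.1, 0.0],
    [0.0, 0.6, 0.3, 0.1],
]
score = iou(previous_token_program(4), binarize(attention))
```

Ranking candidate programs by this kind of score on held-out inputs is the selection step the summary refers to.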

Correct Yourself, Keep My Trust: How Self-Correction and Social Connection Shape Credibility in Social Chatbots

arXiv cs.AI · Biswadeep Sen, Yi-Chieh Lee · 2026-06-17

This study investigates how social chatbots can maintain credibility after making errors by comparing three correction strategies: webpage retraction, self-correction, and expert correction. Through a between-subjects experiment (N=120), researchers found that while all strategies effectively corrected errors, only self-correction preserved the chatbot's credibility, yielding significantly higher trustworthiness and perceived expertise ratings. Additionally, the strength of the user's social connection (measured via social attraction and self-disclosure) predicted belief change magnitude exclusively under self-correction, as external corrections severed this link. The findings advocate for self-correction mechanisms and emphasize the functional role of social connection in enhancing correction effectiveness.

social chatbots · self-correction · credibility · social connection · trustworthiness

NeSyCat Torch: A Differentiable Tensor Implementation of Categorical Semantics for Neurosymbolic Learning

arXiv cs.AI · Daniel Romero Schellhorn, Till Mossakowski, Björn Gehrke · 2026-06-17

NeSyCat Torch introduces a differentiable tensor implementation of categorical semantics for neurosymbolic learning, unifying classical, fuzzy, probabilistic, and neural systems under a single inductive truth definition. The framework leverages neural networks to interpret computational symbols, employing probabilistic programming and tensor-based backends. It utilizes the distribution monad for reference semantics and metric evaluation, complemented by the lazy log-tensor monad for numerically stable, differentiable training. Batch monads enable efficient training. Evaluations on MNIST addition demonstrate superior speed and accuracy over LTN and DeepProbLog, nearing DeepStochLog's accuracy, while maintaining a uniform framework applicable to first-order neurosymbolic approaches.

neurosymbolic learning · categorical semantics · differentiable tensor · monad · probabilistic programming

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

arXiv cs.AI · Ikram Belmadani, Oumaima El Khettari, Carlos Ramisch, Frederic Bechet · 2026-06-17

This empirical study evaluates medical domain adaptation strategies for French question-answering (QA) using large language models (LLMs). Comparing continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across model families and sizes, the research finds CPT+SFT marginally outperforms SFT in multiple-choice QA (MCQA), though gains are often statistically insignificant. For open-ended QA (OEQA), CPT improves overlap metrics while SFT degrades generation quality, with instruction tuning and CPT+SFT preferred by LLM-based evaluation. Cross-lingual transfer from French to English benchmarks is also demonstrated.

continual pretraining · supervised fine-tuning · medical qa · llm adaptation · cross-lingual transfer

A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

arXiv cs.AI · Yijin Wang, Shuyi Wang, Wenhan Zhang, Yuqi Ouyang · 2026-06-17

The paper introduces a multi-domain benchmark for detecting AI-generated text-rich images from GPT-Image-2, addressing a gap in existing benchmarks focused on object-centric images. The benchmark comprises 8,602 images across six categories (commercial posters, infographics, academic posters, receipts, tables, UI screenshots) and evaluates five AI-generated image detectors in zero-shot settings. Results reveal domain-dependent performance, with conventional detectors showing sensitivity to JPEG compression, and highlight the limitations of multimodal vision-language models on structured formats. The findings underscore the need for text- and layout-aware detection methods.

ai-generated text-rich images · multimodal image generation · zero-shot detection · jpeg compression robustness · vision-language models

X+Slides: Benchmarking Audience-Conditioned Slide Generation

arXiv cs.AI · Haodong Chen, Xuanhe Zhou, Wei Zhou, Xinyue Shao · 2026-06-17

The paper introduces X+Slides, a benchmark for audience-conditioned slide generation that evaluates systems based on target audience needs. The benchmark uses 8,133 deduplicated, source-grounded probes across 113 topics and seven presentation scenes, measuring four metrics: Audience Coverage, Domain-wise Coverage, Efficiency, and Correctness. Experiments on DeepPresenter, SlideTailor, and NotebookLM show varying performance, with DeepPresenter achieving 0.714 Audience Coverage at τ_A=0.7, highlighting the importance of source-grounded evaluation beyond visual quality and topic coverage.

slide generation · audience-conditioned · source-grounded probes · large language models · dynamic evaluation framework

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

arXiv cs.AI · Bartłomiej Baranowski, Dave Zhenyu Chen, Matthias Nießner · 2026-06-17

OneCanvas introduces a panoramic reprojection method for 3D scene understanding in Vision-Language Models (VLMs) without complex geometry encoders or extensive training. It aggregates patch features from multiple views onto an equirectangular canvas by unprojecting them to 3D world coordinates, adding 3D position embeddings, and maintaining a shared spatial coordinate system. This approach supports situated reasoning and enables a spatial pretraining curriculum with procedurally generated supervision. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, generalizes to SPBench, and requires significantly less training compute than competing methods.

panoramic reprojection · 3d scene understanding · vision-language models · equirectangular canvas · spatial pretraining
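
The geometric core of placing an unprojected 3D point on an equirectangular canvas can be sketched as follows. This is a generic spherical projection, not OneCanvas's code: the axis convention (+z up, longitude 0 along +x), the canvas size, and the function name are assumptions.

```python
import math

def to_equirectangular(point, origin, width, height):
    """Map a 3D world point to a (column, row) cell on an equirectangular canvas.

    Assumed convention: +z is up and longitude 0 looks along +x from `origin`.
    """
    dx, dy, dz = (p - o for p, o in zip(point, origin))
    radius = math.sqrt(dx * dx + dy * dy + dz * dz)
    if radius == 0.0:
        raise ValueError("point coincides with the canvas origin")
    lon = math.atan2(dy, dx)       # longitude in [-pi, pi]
    lat = math.asin(dz / radius)   # latitude in [-pi/2, pi/2]
    col = (lon + math.pi) / (2.0 * math.pi) * width
    row = (math.pi / 2.0 - lat) / math.pi * height
    # Clamp so the seam (lon = pi) and the south pole stay inside the grid.
    return min(int(col), width - 1), min(int(row), height - 1)

ahead = to_equirectangular((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), 360, 180)
behind = to_equirectangular((-1.0, 0.0, 0.0), (0.0, 0.0, 0.0), 360, 180)
```

Patch features from different views that land in the same cell would then share one spatial coordinate system, which is the property the summary emphasizes.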

A Taxonomy of Mental Health and Technology Needs for Alzheimer's and Dementia Caregivers

arXiv cs.AI · Keran Wang, Drishti Goel, Jiayue Melissa Shi, Violeta J. Rodriguez · 2026-06-17

The study introduces a Caregiver Mental Health and Technology Taxonomy to address unmet needs of Alzheimer's and dementia caregivers, who provide 18 billion hours of unpaid care annually. Through interdisciplinary literature review and qualitative studies, the taxonomy links caregiver needs with technology interventions, identifying gaps like relational strain and compassion fatigue. The framework provides a shared vocabulary for clinicians and designers to develop person-centered, AI-enabled solutions in dementia care.

caregiver burden · alzheimer's disease · technology taxonomy · mental health · ai-enabled interventions

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

arXiv cs.AI · Hannah Le, Ramesh Ramasamy, Alex Urrutia, Mahsa Yazdani · 2026-06-17

The authors introduce TxBench-PP, a verifiable benchmark for evaluating AI agents on small-molecule preclinical pharmacology tasks, focusing on real-world assay data interpretation rather than memorized facts. The benchmark comprises 100 evaluations across program stages, assay types, and task structures, including mechanism-of-action reasoning and compound-target engagement. Testing 16 model-harness configurations (11 models, 4,800 trajectories), results show no system reliably recovers decisions, with Claude Opus 4.8/Pi achieving 59.3% accuracy and GPT-5.5/Pi at 55.3%.

preclinical pharmacology · ai agents · small-molecule · mechanism-of-action · benchmark

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

arXiv cs.AI · Haipeng Luo, Qingfeng Sun, Songli Wu, Can Xu · 2026-06-17

The paper introduces STARE, a method for stabilizing policy entropy in reinforcement learning with verifiable rewards (RLVR) algorithms like GRPO. Through gradient analysis, the authors identify a token-level credit assignment mismatch leading to entropy collapse, characterized by an advantage-surprisal four-quadrant structure. STARE addresses this by reweighting advantages for entropy-critical tokens identified via surprisal quantiles and incorporating target-entropy regulation. Evaluated on models from 1.5B to 32B parameters across three task families, STARE maintains stable entropy and outperforms baselines by 4%-8% in accuracy on AIME24 and AIME25 benchmarks.

reinforcement learning · policy entropy · token-level advantage · surprisal quantiles · credit assignment

Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning

arXiv cs.AI · Chenyu Zhou, Qiliang Jiang, Shuning Wu, Xu Zhou · 2026-06-17

The paper introduces MAST (Mechanism-Aligned Selective Targeting), a selective unlearning method for RLVR-induced reasoning that minimizes collateral damage compared to full-parameter updates. MAST identifies critical attention-projection tensors via off-principal energy, update magnitude, and forget-gradient coupling, then updates only this subset. Evaluated on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, MAST significantly forgets target MATH problems (45/150 to 37/150; p=0.0078) while preserving GSM8K performance (+0.8 pp) and MATH retain (-0.5 pp). Results generalize across seeds, NPO/SimNPO objectives, and model variants, where full-parameter unlearning fails.

selective unlearning · rlvr-induced reasoning · attention-projection tensors · off-principal energy · collateral damage

Machine Unlearning for the XGBoost Model with Network Intrusion Datasets

arXiv cs.AI · Diana Magalhães, Eva Maia, João Vitorino, Isabel Praça · 2026-06-17

The paper introduces XGBoost-Forget, a machine unlearning (MU) method for XGBoost models, addressing the gap in MU research for tabular network intrusion detection. The approach removes specific data points without full retraining, evaluated on IoT-23 and GeNIS datasets using performance, efficiency, and forgetting metrics. Results show XGBoost-Forget maintains predictive accuracy comparable to the original model while achieving significantly faster unlearning, demonstrating efficacy for tabular NI applications.

machine unlearning · xgboost · network intrusion detection · tabular data · forgetting quality

Forecasting what Matters: Decision-Focused RL for Controlled EV Charging with Unknown Departure Times

arXiv cs.AI · Giuseppe Gabriele, Fabio Pavirani, Seyed Soroush Karimi Madahi, Chris Develder · 2026-06-17

The paper proposes a decision-focused reinforcement learning (DF-RL) framework for electric vehicle (EV) charging control that jointly trains a forecaster and RL agent end-to-end to mitigate uncertainty in departure times. The method integrates forecasting feedback directly into the RL agent's policy optimization, prioritizing decision quality over standalone forecasting accuracy. Experimental results demonstrate that DF-RL outperforms baseline RL methods, achieving a 14% improvement in total reward and a 55% reduction in unsupplied energy compared to RL without departure time forecasting.

reinforcement learning · electric vehicle · forecasting · decision-focused · charging control

The More the Merrier: Combining Properties for ABox Abduction under Repair Semantics for ELbot

arXiv cs.AI · Anselm Haak, Patrick Koopmann, Yasir Mahmood, Anni-Yasmin Turhan · 2026-06-17

The paper investigates ABox abduction in EL_bot under brave and AR semantics, focusing on hypotheses that simultaneously satisfy multiple desirable properties or optimality criteria. The authors analyze combinations of properties like signature-restrictions, minimality in size, and conflict minimization, which were previously studied only in isolation. Key findings indicate that imposing additional constraints on hypotheses often does not increase computational complexity, maintaining tractability for practical applications in knowledge base repair.

abduction · repair semantics · knowledge base · optimality criteria · el_bot

Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis

arXiv cs.AI · Soheyl Bateni, Maryam Abdolali · 2026-06-17

The study introduces ClaMPAPP, a hybrid LLM-ML system for pediatric appendicitis diagnosis that uses large language models as structured feature extractors rather than direct predictors. The system employs an LLM to parse clinical narratives into schema-constrained features, applies plausibility checks, then feeds validated features to an XGBoost classifier trained on clinical/lab/ultrasound data. Evaluated on two German pediatric cohorts using synthetically generated narratives, ClaMPAPP outperformed end-to-end LLM baselines (including proprietary models) in diagnostic accuracy while maintaining robustness to narrative reordering. The hybrid approach demonstrated more stable sensitivity-specificity trade-offs and reduced missed diagnoses compared to direct LLM inference.

clinical decision support · feature extraction · xgbclassifier · narrative robustness · hybrid architecture
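
The "LLM as feature extractor, then plausibility checks" step can be illustrated with a short sketch. The feature names, plausible ranges, and the sample LLM output are hypothetical, and the downstream classifier is left out; this is not ClaMPAPP's actual schema.

```python
# Hypothetical schema: feature name -> (lowest plausible, highest plausible).
SCHEMA = {
    "age_years": (0.0, 18.0),
    "body_temperature_c": (30.0, 43.0),
    "wbc_count_1e9_per_l": (0.0, 100.0),
}

def validate_features(extracted):
    """Split LLM-extracted values into plausible features and rejected ones."""
    clean, rejected = {}, {}
    for name, value in extracted.items():
        if name not in SCHEMA:
            rejected[name] = "not in schema"
            continue
        low, high = SCHEMA[name]
        try:
            number = float(value)
        except (TypeError, ValueError):
            rejected[name] = "not numeric"
            continue
        if low <= number <= high:
            clean[name] = number
        else:
            rejected[name] = "out of plausible range"
    # Features that were missing or rejected stay None for the classifier.
    for name in SCHEMA:
        clean.setdefault(name, None)
    return clean, rejected

# Made-up extraction with one unit slip (388 instead of 38.8) and one stray key.
llm_output = {"age_years": "9", "body_temperature_c": 388, "favourite_colour": "red"}
clean, rejected = validate_features(llm_output)
```

Only `clean` would be passed on to the tabular classifier, so an extraction error becomes a missing value rather than a wrong one.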

Compute Efficiency and Serial Runtime Tradeoffs for Stochastic Momentum Methods

arXiv cs.AI · Depen Morwani, Alexandru Meterez, Pranav Nair, Sham Kakade · 2026-06-17

The paper analyzes compute efficiency (CE) and serial runtime tradeoffs for stochastic momentum methods in consistent linear regression with Gaussian covariates. Focusing on Heavy Ball (HB) and Accelerated SGD (ASGD), it establishes finite-dimensional, discrete-time lower bounds on batch-size tradeoffs. Results show HB preserves SGD-level CE over a larger batch-size window (up to √κ larger than SGD's critical batch size) without improving the CE frontier, while ASGD exhibits spectrum-dependent behavior: it improves small-batch CE for rapidly decaying power-law spectra but trades CE for serial runtime as batch size grows. Synthetic experiments confirm these regimes, including ASGD-HB overlap for slowly decaying spectra and CE-serial tradeoffs for rapidly decaying spectra.

stochastic momentum methods · compute efficiency · serial runtime · batch-size tradeoffs · power-law spectra
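
For readers unfamiliar with the setting, the sketch below runs mini-batch Heavy Ball on a noiseless ("consistent") linear regression with standard Gaussian covariates. The step size, momentum, batch size, and target vector are arbitrary illustrative values, and the code shows only the update rule, not the paper's analysis or experiments.

```python
import random

def heavy_ball_regression(w_star, lr=0.1, momentum=0.5, batch_size=8, steps=200, seed=0):
    """Mini-batch heavy-ball SGD on noiseless linear regression with Gaussian inputs."""
    rng = random.Random(seed)
    dim = len(w_star)
    w = [0.0] * dim
    w_prev = list(w)
    for _ in range(steps):
        grad = [0.0] * dim
        for _ in range(batch_size):
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            # Consistent labels y = <w_star, x>, so the residual is <w - w_star, x>.
            residual = sum((wi - si) * xi for wi, si, xi in zip(w, w_star, x))
            for i in range(dim):
                grad[i] += residual * x[i] / batch_size
        # Heavy-ball step: gradient step plus momentum on the previous displacement.
        w_next = [wi - lr * gi + momentum * (wi - pi)
                  for wi, gi, pi in zip(w, grad, w_prev)]
        w_prev, w = w, w_next
    return w

target = [1.0, -2.0]
w_hat = heavy_ball_regression(target)
error = max(abs(a - b) for a, b in zip(w_hat, target))
```

Varying `batch_size` while counting total samples (`batch_size * steps`) versus serial steps (`steps`) is the kind of tradeoff the summary calls compute efficiency versus serial runtime.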

Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight

arXiv cs.AI · Maneesha Wickramasuriya, Beomyeol Yu, Jaden Shin, Mason Huslig · 2026-06-17

The paper presents a hardware-in-the-loop framework for validating deep monocular pose estimation in maritime UAV autonomy, addressing the challenges of costly at-sea testing. The method integrates a transformer-based pose estimator processing photorealistic rendered maritime views, fused with IMU data via a delayed Kalman filter to handle perception latency and asynchronous updates. Experiments demonstrate stable closed-loop flight for autonomous takeoff, trajectory tracking, and landing, validating the system as a realistic intermediate step before shipboard deployment.

monocular pose estimation · delayed kalman filter · vision-in-the-loop · uav autonomy · transformer-based

A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies

arXiv cs.AI · Fangyijie Wang, Jianjun Yu, Wentao Shi, Haixia Huang · 2026-06-17

The authors present a clinician-centered pipeline for remote annotation and evaluation in ultrasound AI studies, addressing gaps in existing platforms that lack blinded model comparison and reproducible workflows. The system employs a centralized server with browser interfaces to enable clinician annotation, blinded ranking, and multi-rater review without local data downloads, plus automated statistical analysis. Validation in a fetal ultrasound segmentation study with six raters showed moderate-to-strong inter-rater agreement (Spearman, Kendall's τ) and preference for later active learning models, demonstrating utility for reproducible human-AI evaluation.

ultrasound ai · clinician-centered evaluation · blinded ranking · multi-rater agreement · active learning

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

arXiv cs.AI · Bojie Li · 2026-06-17

The paper introduces User as Engram, a method for internalizing per-user memory in language models through local parametric edits. Unlike traditional approaches that use global weight deltas (e.g., LoRA adapters), this method separates user-specific content from shared reasoning skills by storing user facts as surgical edits to a hash-keyed memory table. This design ensures that unrelated text remains unaffected, achieving a 33,000x smaller memory footprint. The approach matches LoRA's direct recall accuracy while improving indirect reasoning accuracy by 5.6x on average, without degrading reasoning performance. The edits are composable, allowing multiple users to coexist in a shared table losslessly, and outperform retrieval pipelines beyond ~100 facts.

local parametric edits · hash-keyed memory table · lora adapters · reasoning accuracy · memory footprint

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

arXiv cs.AI · Jinhan Li, Kexian Tang, Yihan Xu, Zhuorui Ye · 2026-06-17

The paper proposes Safety Reflection Pretraining, a pretraining-stage alignment method that integrates self-monitoring into language models by regularly inserting safety reflections into training corpora. This approach contrasts with existing methods focused solely on data filtering or rewriting, aiming to prevent models from composing benign knowledge into unsafe behaviors. Experiments with 1.7B parameter models on FineWeb-Edu show improved safety classification accuracy and reduced attack success rates. Controlled tests in MedSafetyWorld demonstrate superior performance over data filtering in preventing unsafe behavior generalization from safe data.

pretraining-stage alignment · safety reflection · self-monitoring · large language models · behavior generalization

Essential Subspace Merging for Multi-Task Learning

arXiv cs.AI · Longhua Li, Lei Qi, Xin Geng, Qi Tian · 2026-06-17

The paper introduces Essential Subspace Merging (ESM) and its dynamic variant ESM++ for multi-task model merging by addressing inter-task interference. Analyzing output shifts, the authors identify an 'essential subspace' where task-specific updates concentrate energy, while other directions cause interference. ESM statically merges models via orthogonalization of essential components, while ESM++ dynamically routes through low-rank experts during inference. Experiments across diverse tasks and model scales show both methods effectively preserve task knowledge while minimizing interference.

model merging · multi-task learning · essential subspace · low-rank experts · inter-task interference

AdsMind: A Physics-Grounded Multi-Agent System for Self-Correcting Discovery of Adsorption Configurations on Heterogeneous Catalyst Surfaces

arXiv cs.AI · Zongmin Zhang, Yuyang Lou, Bowen Zhang, Junwu Chen · 2026-06-17

AdsMind introduces a physics-grounded multi-agent system for autonomous discovery of low-energy adsorption configurations on catalyst surfaces, addressing limitations of MLFFs and open-loop LLM agents. The framework combines machine-learning force fields with closed-loop relaxation feedback, enabling error correction during structural optimization. Evaluated on AA20 and OCD-GMAE62 benchmarks, AdsMind achieves 100% and 98.8% success rates respectively, reduces energy dispersion across LLM backends, and cuts MLFF relaxations by ~14× versus enumeration baselines. DFT validation confirms AdsMind preserves correct adsorption-energy signs where open-loop approaches fail, demonstrating reliability and interpretability for autonomous chemistry workflows.

adsorption configuration · machine-learning force fields · multi-agent system · density functional theory · heterogeneous catalysis

OrthoReg: Orthogonal Regularization for Hybrid Symbolic-Neural Dynamical Systems

arXiv cs.AI · Till Richter, Niki Kilbertus · 2026-06-17

The paper introduces OrthoReg, an orthogonal regularization method for hybrid symbolic-neural dynamical systems that prevents neural components from relearning mechanistic parts already captured by symbolic components. The approach directly penalizes overlap between symbolic and neural representations, ensuring complementary decomposition where symbolic parts capture expressible library structures and neural parts handle residuals. Evaluated on dynamical systems with partial library mismatch, OrthoReg demonstrates improved symbolic recovery and out-of-distribution generalization compared to standard $L^2$ regularization.

orthogonal regularization · hybrid modeling · dynamical systems · symbolic recovery · neural augmentation
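
One simple way to picture a penalty on overlap between the symbolic and neural parts is a squared cosine similarity between their outputs on a batch. This is an assumed stand-in chosen for illustration; the digest does not state OrthoReg's actual penalty, and the function name and `eps` constant are hypothetical.

```python
def overlap_penalty(symbolic_out, neural_out, eps=1e-12):
    """Squared cosine similarity between the two components' batch outputs.

    Returns roughly 0 when the outputs are orthogonal over the batch and
    roughly 1 when the neural part merely rescales the symbolic part.
    """
    dot = sum(s * n for s, n in zip(symbolic_out, neural_out))
    s_norm = sum(s * s for s in symbolic_out)
    n_norm = sum(n * n for n in neural_out)
    return dot * dot / (s_norm * n_norm + eps)

orthogonal = overlap_penalty([1.0, 0.0], [0.0, 1.0])
redundant = overlap_penalty([1.0, 2.0], [2.0, 4.0])
```

Adding such a term to the training loss would discourage the neural residual from relearning what the symbolic library already expresses.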

Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction

arXiv cs.AI · Jingyi Zhou, Senlin Luo, Haofan Chen · 2026-06-17

The Human-AI Coevolution Dynamics Framework (HACD-H) proposes a unified formal model for the emergence of social intelligence in long-term human-AI interaction. HACD-H integrates emotional adaptation, relational organization, social memory, and personality consistency into a dynamical framework, introducing concepts like multi-timescale social cognition, relational attractors, and social cognitive energy dynamics. A dataset of 14,700 interaction turns was constructed for empirical evaluation. Results indicate hierarchical temporal persistence in social cognition, stable relational attractors, phase-transition-like developmental patterns, and a structured social cognitive energy landscape. Social intelligence negatively correlates with social cognitive energy (r = -0.391, p < 0.001), with interaction trajectories showing progressive energy reduction over time.

social cognition · relational attractors · social cognitive energy · emotional adaptation · phase transitions

A Technical Taxonomy of LLM Agent Communication Protocols

arXiv cs.AI · Linus Sander, Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll · 2026-06-17

The study presents a technical taxonomy for classifying LLM agent communication protocols, addressing interoperability challenges in distributed agent networks. Through an iterative method involving five iterations (three empirical-to-conceptual and two conceptual-to-empirical), the authors analyzed nine open-source protocols. The taxonomy identifies five dimensions: counterparty, payload, interaction state, discovery mechanism, and schema flexibility. Key findings include prevalent hybrid payloads with session-state persistence in agent-to-agent protocols, emerging schema flexibility trends, and rare decentralized discovery. The analysis predicts short-term convergence toward unified protocols but anticipates long-term evolution into a federated, layered stack.

llm agents · communication protocols · taxonomy · interoperability · schema flexibility

Pareto Q-Learning with Reward Machines

arXiv cs.AI · Arnaud Lequen, Clément Legrand-Lixon, Léo Saulières · 2026-06-17

The authors present Pareto Q-Learning with Reward Machines (PQLRM), a multi-objective reinforcement learning algorithm for non-Markovian tasks specified by reward machines. PQLRM combines the vector-valued Q-estimates of Pareto Q-Learning with the automaton-based reward factorization of Q-Learning with Reward Machines (QRM), enabling efficient multi-policy learning. Experiments show that PQLRM converges faster than naive Pareto Q-Learning on cross-product MDPs and can synthesize Pareto-optimal policies that QRM cannot attain.

multi-objective reinforcement learning · reward machines · pareto q-learning · non-markovian rewards · sample efficiency
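
To make the "Pareto-optimal" notion in the PQLRM summary concrete, here is a minimal sketch of the dominance filter that multi-objective methods rely on when they keep sets of vector-valued returns. It is an illustration only, not the authors' algorithm; the function names and the example vectors are hypothetical, and it assumes every objective is to be maximized.

```python
def dominates(a, b):
    """True if vector a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(vectors):
    """Keep only the vectors that no other vector dominates."""
    return [v for v in vectors if not any(dominates(u, v) for u in vectors)]


# Hypothetical two-objective returns for four candidate policies.
returns = [(1, 5), (3, 3), (2, 2), (5, 1)]
front = pareto_front(returns)
print(front)
```

In this toy input, (2, 2) is dropped because (3, 3) is better on both objectives, while the other three vectors represent different trade-offs and all remain.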

Equivariant Graph Neural Networks Improve Optical Spectra Prediction for Materials Screening

arXiv cs.AI · Kasper Helverskov Petersen, François R J Cornet, Martin Ovesen, Mikkel Jordahn · 2026-06-17

The authors propose an equivariant graph neural network (GotenNet) for improved optical spectra prediction in materials screening, addressing limitations of rotation-invariant scalar features in existing models. The method leverages geometric expressiveness through equivariance, trained on 10,533 structures with random phase approximation (RPA)-computed spectra. Results show state-of-the-art performance, particularly in the 0-8 eV range and for static real permittivity prediction, critical for thin-film optics applications.

equivariant graph neural networks · optical spectra prediction · materials screening · random phase approximation · thin-film optics

Analysing drivers and interdependencies in European electricity markets using XAI

arXiv cs.AI · Antoine Pesenti, Aidan O'Sullivan · 2026-06-17

This paper integrates deep neural networks (DNNs) with explainable AI (XAI) techniques to analyze electricity price determinants across 39 European bidding zones. SHAP (SHapley Additive exPlanations) and SSHAP frameworks are employed to quantify feature contributions and enhance interpretability in high-dimensional settings. Results reveal that renewable energy sources, particularly solar, disproportionately influence price formation despite their lower generation share. Gas prices consistently drive electricity markets, while interconnections significantly shape price dynamics, underscoring European electricity system interdependence. A synthetic EU-wide market is constructed to explore counterfactual scenarios of full market integration.

deep neural networks · shapley additive explanations · electricity markets · feature contributions · counterfactual scenario

Towards an Agent-First Web: Redesigning the Web for AI Agents

arXiv cs.AI · Eranga Bandara, Ross Gore, Ravi Mukkamala, Asanga Gunaratna · 2026-06-17

The paper proposes a three-layer redesign of the web to accommodate AI agents as first-class citizens, addressing access, economics, and content challenges. At the access layer, it advocates for agent-equivalent rights via HTTP metadata and dual-layer content delivery. The economic layer introduces intent-based tiers and token-based subscriptions aligned with human-proxy principles. For content, it counters epistemic recursion through Agent Text Markup Language (ATML) and cryptographic provenance chains. The work presents ten design principles for an agent-first web, restructuring foundational assumptions about access rights, economic models, and content integrity.

agent-first web · epistemic recursion · intent-based tier · agent text markup language · cryptographic provenance

Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams

arXiv cs.AI · Haewoon Kwak · 2026-06-17

The study investigates when process-level coordination control benefits multi-agent LLM teams, aligning with team science's contingency theory. Using behavioral signatures (majority lock-in, exploration, recovery) and per-action ablations, it operationalizes three leadership styles (transactional, transformational, situational) as controllers over shared actions (explore, revise, accept, synthesize). Results across four tasks and three model families (e.g., LLaMA) show no universal accuracy advantage: transactional control matches round-0 voting within 1.3pp, with gains only when the initial majority is unreliable (situational +8pp). A recovery-advantage boundary confirms controllers help only when round-0 consensus is unreliable, tasks are recoverable, and undirected interaction fails, mirroring contingency theory.

multi-agent llm teams · behavioral signatures · contingency theory · process-level coordination · recovery-advantage boundary

ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL

arXiv cs.AI · Mukund Khanna, Raj Singh Yadav, Kunal Singh · 2026-06-17

The paper introduces ProductConsistency, a dataset and framework for improving product identity preservation in instruction-based image editing. The method combines supervised fine-tuning (87k samples) and reinforcement learning (869 unique products) with a novel Cyclic Consistency reward that enforces semantic preservation through caption similarity metrics. Evaluations on Qwen-Image-Edit-2511 and Flux.1-Kontext-dev show 5x reduction in character error rate and improvements in OCR, perceptual metrics, and MLLM-based assessments compared to baselines.

instruction-based image editing · product consistency · cyclic consistency reward · supervised fine-tuning · reinforcement learning

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

arXiv cs.AI · Enrico Cassano, Michał Brzozowski, Zuzanna Dubanowska, Paolo Mandica · 2026-06-17

ARIADNE introduces a training-free, adapter-agnostic routing framework for dynamic selection of task-specialized adapters during inference. The method represents each adapter via centroids derived from its training set embeddings, enabling selection by measuring input proximity to these centroids in latent space. Evaluated on Llama 3.2 1B Instruct across 23 NLP tasks, ARIADNE achieves 97.44% of upper-bound performance, scaling to 44 tasks with 89.7% selection accuracy without adapter modifications or additional training.

parameter-efficient fine-tuning · adapter routing · inference-time selection · latent space proximity · task adaptation
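
The centroid-based routing described in the ARIADNE summary can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: it uses one centroid per adapter, cosine similarity as the proximity measure, and made-up two-dimensional embeddings. The summary does not specify any of these details, and the adapter names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def route(embedding, centroids):
    """Return the adapter name whose centroid is most similar to the input embedding."""
    return max(centroids, key=lambda name: cosine(embedding, centroids[name]))


# Hypothetical adapters, each summarized by the centroid of its training embeddings.
centroids = {
    "sentiment": centroid([[0.9, 0.1], [0.8, 0.2]]),
    "translation": centroid([[0.1, 0.9], [0.2, 0.8]]),
}
chosen = route([0.7, 0.3], centroids)
print(chosen)
```

Because routing only needs the stored centroids and an embedding of the incoming input, no adapter has to be modified or retrained, which is consistent with the training-free claim in the summary.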

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

arXiv cs.AI · Ruishan Fang, Siyuan Lu, Chenyi Zhuang, Tao Lin · 2026-06-17

RODS (Reward-driven Online Data Synthesis) addresses the sample depletion problem in multi-turn tool-use RL by dynamically generating informative training data. The method leverages reward variance as a boundary detector to identify samples near the agent's capability boundary, synthesizes structurally complex multi-turn variants via skill-aligned resampling, and maintains a dynamic replay buffer. Starting from 400 human seeds and maintaining an active pool of ~800 samples, RODS achieves performance comparable to a 17K-sample offline pipeline while requiring 20x fewer trajectories, outperforming fixed-data RL and environment augmentation in controlled settings.

multi-turn tool-use · reward variance · dynamic replay buffer · skill-aligned resampling · sample depletion
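
The idea of using reward variance as a boundary detector, as described in the RODS summary, can be illustrated with a small sketch. It assumes binary success rewards over several rollouts per task and a hand-picked variance threshold; the task names, the threshold, and the function name are hypothetical and do not come from the paper.

```python
from statistics import pvariance


def boundary_samples(rollout_rewards, min_variance=0.1):
    """Select tasks whose rollout rewards vary, i.e. the agent sometimes succeeds and sometimes fails."""
    return [
        task
        for task, rewards in rollout_rewards.items()
        if pvariance(rewards) >= min_variance
    ]


# Hypothetical rewards from four rollouts per task (1 = success, 0 = failure).
rollout_rewards = {
    "always_solved": [1, 1, 1, 1],
    "never_solved": [0, 0, 0, 0],
    "sometimes_solved": [1, 0, 1, 0],
}
selected = boundary_samples(rollout_rewards)
print(selected)
```

Tasks that are always or never solved have zero variance and carry little learning signal, so only the mixed-outcome task is kept as a seed for further synthesis in this toy setup.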

Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration

arXiv cs.AI · Xhevahire Tërnava · 2026-06-17

The paper introduces Variability by Regeneration (VbR), a novel product-line approach where an LLM serves as the derivation engine, generating purpose-built binaries from declarative specifications to address near-zero in-artifact variability in vibe-coded projects. Through exploratory analysis of 10 C/C++ projects, the authors demonstrate that variability decisions are resolved solely at generation time, contrasting with classical SPL derivation. The VbR pipeline is formalized and implemented on a wc product family, showing that variability should reside in specifications rather than code for AI-generated software.

variability by regeneration · vibe coding · product-line · llm · derivation engine

A Hybrid LSTM-Vision Transformer Architecture for Predicting HRRR Forecast Errors

arXiv cs.AI · David Aaron Evans, Jay C. Rothenberger, Kara J. Sulia, Nick P. Bassill · 2026-06-17

The study introduces a hybrid LSTM-Vision Transformer (LSTM-ViT) architecture to improve High-Resolution Rapid Refresh (HRRR) forecast error prediction by integrating temporal surface observations with vertical atmospheric profiles. The method combines LSTM for sequence learning and Vision Transformer (ViT) for processing atmospheric structure from New York State Mesonet profiler data. Results show a twofold skill increase in precipitation error prediction, with notable improvements at shorter lead times and during planetary boundary layer activity, demonstrating enhanced capture of convectively driven errors.

lstm-vit · hrrr · forecast error prediction · planetary boundary layer · vision transformer

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

arXiv cs.AI · Lorenzo Sani, Zeyu Cao, Meghdad Kurmanji, Alex Iacob · 2026-06-17

FoMoE introduces a federated MoE training system that eliminates full-model replication across distributed sites, addressing memory and communication bottlenecks in large-scale LLM pre-training. The method partitions expert layers across workers, employing partial expert replication and a skip-token mechanism to optimize throughput. Results show 1.42x-45.44x communication cost reduction over baselines, 1.4x throughput speedup, and stable routing in proxy regimes, with scalability projected to 100B parameters.

mixture-of-experts · federated learning · large language models · distributed training · communication efficiency

Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training

arXiv cs.AI · Ruiqi Lai, Dakai An, Wei Gao, Ju Huang · 2026-06-17

Spotlight introduces a system for cost-efficient RL post-training of Diffusion Transformers (DiTs) by synergizing seed exploration and spot GPUs. Key innovations include: (1) stale-weight-tolerant exploration that preserves seed rankings, enabling exploration during idle GPU periods; (2) elastic sequence parallelism with sub-second recovery via on-node state reuse; and (3) a preemption-aware scheduler. Implemented on ROLL, Spotlight achieves 4× faster convergence and 1.4-6.4× cost reduction for Qwen-Image post-training, while maintaining superior image quality on DeepSeek-OCR and Geneval benchmarks at 512×512 and 1280×1280 resolutions.

diffusion transformers · reinforcement learning · spot gpus · sequence parallelism · seed exploration

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

arXiv cs.AI · Moon Ye-Bin, Nam Hyeon-Woo, Baek Seong-Eun, Yejin Yeo · 2026-06-17

The paper introduces TRAP, a benchmark evaluating the trade-off between task completion accuracy and resistance to active privacy extraction in document-intensive workflows. TRAP scenarios include private documents, task queries requiring tool invocation with private fields, and attack queries attempting to elicit private information. Evaluating 22 models, including frontier proprietary and open-source variants, reveals non-trivial privacy leakage across all model families, with instruction-following ability correlating with leakage rate. Prompt-based defenses reduce leakage but significantly impair task accuracy, with softmax-based models inherently unable to achieve both high task success and zero leakage probability. The authors propose structural private field isolation using hash keys, which largely prevents leakage while maintaining task accuracy.

privacy leakage · task completion · softmax-based models · prompt optimization · hash keys
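
The structural isolation idea in the TRAP summary, where private fields are replaced by hash keys, might look something like the sketch below. This is a guess at the general pattern rather than the authors' design: the summary does not say how keys are derived or resolved, so the salted SHA-256 key, the `priv_` prefix, and both function names are assumptions made for illustration.

```python
import hashlib


def isolate(record, private_fields, salt):
    """Split a record into a model-visible view with opaque keys and a vault holding the real values."""
    public, vault = {}, {}
    for field, value in record.items():
        if field in private_fields:
            digest = hashlib.sha256(f"{salt}:{field}:{value}".encode()).hexdigest()
            key = "priv_" + digest[:12]
            vault[key] = value    # real value stays outside the model's context
            public[field] = key   # the model only ever sees the opaque key
        else:
            public[field] = value
    return public, vault


def resolve(tool_args, vault):
    """Swap opaque keys back to real values just before a tool call is executed."""
    return {name: vault.get(arg, arg) for name, arg in tool_args.items()}


record = {"name": "Ada", "ssn": "123-45-6789"}
public, vault = isolate(record, {"ssn"}, salt="demo-salt")
tool_call = resolve({"ssn": public["ssn"]}, vault)
print(public)
```

Under this pattern the model can still pass the private field to a tool by key, but it has no plaintext value to leak in response to an extraction prompt.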

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

arXiv cs.AI · Fengying Ye, Yanming Sun, Runzhe Zhan, Zheqi Zhang · 2026-06-17

G-IdiomAlign introduces a gloss-pivoted benchmark for cross-lingual idiom alignment, anchored by English glosses from Wiktionary. The benchmark supports two evaluation protocols: Multiple-Choice Idiom Equivalence with typed distractors and Gloss-Contrastive Generation to isolate semantic pivot effects. Results show LLMs exhibit a bias toward literal translation, particularly for low-resource languages, with glosses improving performance modestly. Analysis on Qwen3-8B indicates attention heads play a more significant role than layers in cross-condition differences, and better generations correlate with stronger gloss anchoring.

idiom alignment · gloss-pivoted · cross-lingual · non-compositionality · embedding-based semantic proxy

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

arXiv cs.AI · Jinhao Song, Shan Liang, Yiqun Yue, Zhuhuayang Zhang · 2026-06-17

The paper introduces ThinkDeception, an interpretable multimodal deception detection framework using Multimodal Large Language Models (MLLMs) to transform the task into a cognitive reasoning process. The method leverages a novel Visual-Audio Consistency Group Relative Policy Optimization (VAC-GRPO) with a progressive training strategy across four difficulty tiers, enhanced by a dynamic curriculum scheduler and multi-dimensional reward mechanism. Experiments show state-of-the-art performance in detection accuracy and rationale quality on mainstream benchmarks, highlighting the importance of modal inconsistency in deception detection.

multimodal deception detection · chain of thought · mllms · relative policy optimization · progressive training

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

arXiv cs.AI · Yafeng Wu, Huu Hiep Nguyen, Thin Nguyen, Hung Le · 2026-06-17

The paper introduces CADE, a framework for time-series question answering (TSQA) that bypasses tokenization bottlenecks in LLMs through direct timestep embedding and contrastive alignment. CADE employs a point-wise linear encoder and MLP projector to map raw numerical series directly into LLM embedding space, preserving exact index-level access. A one-directional supervised contrastive loss aligns time-series embeddings with frozen class-name text anchors. Evaluated on Time-MQA, CADE outperforms open-source and proprietary LLM baselines across six TSQA tasks.

time-series question answering · direct timestep embedding · contrastive alignment · tokenization bottleneck · llm embedding space

CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System

arXiv cs.AI · Marco Becattini, Niccolò Caselli, Matteo Minin, Roberto Verdecchia · 2026-06-17

CAPRA introduces a multi-agent LLM system for automated assessment of software architecture deliverables, addressing structural completeness and requirements traceability. The system employs specialized agents, a Python-based microservice for multi-modal document extraction (using PyMuPDF and gpt-4o), and deterministic Evidence Anchoring with fuzzy matching to mitigate hallucinations. Evaluation on 10 student reports shows 88.8% criteria satisfaction under strict two-rater aggregation, moderate inter-rater agreement (kappa = 0.582), and 4-minute processing time per report, though human oversight remains necessary for subjective dimensions.

multi-agent llm · requirements traceability · evidence anchoring · levenshtein distance · uml diagram parsing

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

arXiv cs.AI · Syed Mujtaba Haider, Silvia Figini · 2026-06-17

The paper presents a controlled benchmark isolating the contribution of quantum generators to brain-MRI augmentation, addressing methodological gaps in prior quantum generative augmentation studies. Images are encoded into a KL-regularized latent space, where a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples augment a pretrained classifier across labeled data fractions (5%-100%), evaluated over eight random seeds with paired significance testing and latent-distribution analyses. Results show no significant improvement over real-data-only training, with quantum and classical generators statistically indistinguishable. Synthetic samples exhibit mode collapse and off-distribution behavior in low-data regimes, with no diversity advantage for the quantum generator.

quantum generator · brain-mri · wasserstein gan · latent space · mode collapse

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

arXiv cs.AI · San Kim, Daechul Ahn, Reokyoung Kim, Hyeonbeom Choi · 2026-06-17

The paper introduces RTSGameBench, a benchmark for evaluating strategic reasoning in Vision-Language Models (VLMs) using the real-time strategy game Beyond All Reason. The benchmark features diverse gameplay scenarios, diagnostic mini-games targeting specific competencies, and a self-evolving framework for extensible evaluation. It also proposes RTSGameAgent, a finite state machine with agentic memory for unit management. Empirical results show state-of-the-art VLMs struggle with coordination and scalability in complex matchups.

vision-language models · strategic reasoning · real-time strategy · multiagent coordination · benchmark evaluation

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

arXiv cs.AI · Emmanuel Aboah Boateng, Kyle MacDonald, Amardeep Kumar, Siddharth Kodwani · 2026-06-17

Decoupled Search Grounding (DSG) introduces a vendor-agnostic architecture that separates real-time search grounding from LLM reasoning, addressing issues like Search-Induced Verbosity and provider coupling. DSG employs an MCP-compatible gateway to expose controls such as provider routing, source-aware context rendering, and semantic caching. Evaluated on SimpleQA, FreshQA, and HotpotQA, DSG achieves near-native accuracy (86.1% vs. 87.7%) on SimpleQA with 91% lower search cost, 99.4% warm-cache hit rate, and 68% lower latency. In e-commerce query-understanding, DSG matches native-search accuracy while reducing search costs by over 98%, demonstrating its efficacy as an optimizable interface boundary.

decoupled search grounding · mcp-compatible gateway · search-induced verbosity · semantic caching · vendor-agnostic

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

arXiv cs.AI · Linghao Feng, Yinqian Sun, Dongqi Liang, Sicheng Shen · 2026-06-17

SciRisk-Bench introduces a novel benchmark for evaluating AI4Science safety across explicit risk dimensions and scientific disciplines. The benchmark spans 7 disciplines, 31 subdisciplines, and 10 risk dimensions, addressing the gap in existing AI4Science safety datasets that lack specification of underlying risk factors. Evaluation of both mainstream and science-oriented large language models (LLMs) across these dimensions enables fine-grained diagnosis of safety shortcomings in scientific contexts. Results provide insights into where scientific models remain unsafe, offering a comprehensive framework for assessing risk-awareness in AI4Science applications.

ai4science · risk dimensions · large language models · safety benchmark · scientific disciplines

TransitNet: A Compact Attention-Augmented Deep Learning Framework for Low-SNR Transit Blind Searches

arXiv cs.AI · Xingchen Yan, Jian Ge, Qingtian Liu, Kevin Willis · 2026-06-17

The authors propose TransitNet, a compact attention-augmented deep learning framework for detecting low-SNR exoplanet transits in blind searches. The method integrates a unified dataset construction pipeline with attention mechanisms for both detection and transit parameter estimation (window and midpoint). Evaluated on Kepler data, TransitNet achieves 95.2% accuracy at SNR 6-8, outperforms TLS (63.1%) and BLS (60.0%) in recovery rates, and attains 0.974 ROC-AUC with 1.24-hour mean midpoint error. The 1.5MB model provides 12-25× speedup over CPU-TLS while maintaining 97.4% window coverage on injected transits.

transit detection · attention mechanisms · low-snr · exoplanet search · kepler data

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

arXiv cs.AI · Jasmine Owers, Edwin Simpson, Martha Lewis · 2026-06-17

The study evaluates large language models' (LLMs) capability to interpret negation within figurative language, addressing a known challenge in NLP. Researchers annotated an existing figurative language dataset and tested multiple LLMs under varying prompt styles. Results indicate that combined negation and figurativeness poses significant difficulty, with performance varying substantially across negation types and being highly prompt-dependent.

large language models · figurative language · negation interpretation · prompt engineering · nlp evaluation

SAERec: Constructing Fine-grained Interpretable Intents Priors via Sparse Autoencoders for Recommendation

arXiv cs.AI · Jiangnan Xia, Xuansheng Wu, Yu Yang, Xin Wang · 2026-06-17

SAERec introduces a sparse autoencoder-based recommender system that constructs fine-grained interpretable intents from textual corpora to enhance recommendation accuracy and explainability. The method extracts intents from large language model embeddings using sparse autoencoders, isolating intent-related semantics from noise, and retrieves both personal and public intents to guide recommendations. A multi-branch attention mechanism captures temporal dependencies and integrates intent signals, followed by adaptive fusion for final user representation. Experiments on public datasets show SAERec outperforms state-of-the-art baselines while providing human-understandable explanations.

sparse autoencoder · intent-based recommendation · multi-branch attention · adaptive fusion · textual corpus

Skill-Guided Continuation Distillation for GUI Agents

arXiv cs.AI · Zhimin Fan, Hongwei Yu, Yeqing Shen, Haolong Yan · 2026-06-17

Skill-Guided Continuation Distillation (SGCD) is introduced as an iterative self-improvement framework to address policy-induced off-trajectory states in GUI agents. SGCD combines skill-guided policy completions with expert trajectories to provide supervision for unseen states, leveraging Continuation Plans, Critical Targets, Failure Traps, and Success Criteria extracted from rollouts. Evaluated on OSWorld-Verified, SGCD increases the success rate of three base models from the low-30% range to over 50%, demonstrating significant improvement in handling off-trajectory states.

skill-guided continuation distillation · gui agents · off-trajectory states · continuation plans · failure traps

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

arXiv cs.AI · Thomas Quilter, Yifan Zhu, Guorui Quan, Mingfei Sun · 2026-06-17

The paper introduces BeliefDiffusion, a novel framework combining generative modeling and planning for navigation in partially observable environments. It uses diffusion models to characterize multimodal belief distributions and Model Predictive Control (MPC) for planning. The method involves imagining environment configurations from observation history and planning navigation strategies across aggregated configurations. Experiments in synthetic map environments show BeliefDiffusion outperforms model-free RL baselines and other generative approaches in navigation success rate and path efficiency, validating its robustness in partially observable settings.

beliefdiffusion · diffusion models · model predictive control · partially observable environments · multimodal belief distributions

Domain-Shift Aware Neural Networks for Unbalance Characterization in Rotating Systems

arXiv cs.AI · Bernardo Feijó Junqueira, Claudio Kiyoshi Umezu, Bruno Bilhar Karaziack, Tomaz Junior · 2026-06-17

The paper proposes a domain-shift aware neural network for estimating unbalance masses in rotating systems under varying operational conditions. The method employs maximum mean discrepancy to align feature representations between source and target domains, addressing domain discrepancy introduced by secondary shaft activation. Experimental results on a test rig with triaxial accelerometer data show improved prediction accuracy when explicitly modeling domain shift, particularly for out-of-training conditions. The approach demonstrates potential for Structural Health Monitoring applications where physical behavior and domain discrepancies are not fully known.

domain adaptation · maximum mean discrepancy · structural health monitoring · rotating systems · inverse problem
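
For readers unfamiliar with maximum mean discrepancy (MMD), the alignment term mentioned in the summary above can be sketched as a kernel two-sample statistic. The snippet below computes the standard biased estimate of squared MMD with an RBF kernel on toy feature vectors; the kernel choice, the `gamma` value, and the data are assumptions for illustration, since the summary does not give the paper's exact formulation.

```python
import math


def rbf(a, b, gamma):
    """RBF kernel exp(-gamma * ||a - b||^2)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    return (
        mean_kernel(source, source, gamma)
        + mean_kernel(target, target, gamma)
        - 2.0 * mean_kernel(source, target, gamma)
    )


# Toy features: the shifted target mimics a change in operating conditions.
source = [[0.0, 0.0], [1.0, 0.0]]
shifted_target = [[5.0, 5.0], [6.0, 5.0]]
print(mmd_squared(source, source), mmd_squared(source, shifted_target))
```

In a training loop, a term like this would typically be added to the task loss so that the feature extractor is pushed to produce similar feature distributions for source and target conditions.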

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

arXiv cs.AI · Zijian Wang, Hanqi Li, Ziyue Yang, Zijian Hu · 2026-06-17

The paper introduces Xcientist, a research harness that externalizes AI-driven scientific workflows into inspectable processes governed by contracts. It organizes literature evidence, idea states, implementation plans, and validation traces as persistent artifacts, enabling grounded execution and revision while maintaining evidential basis. The method addresses claim drift, where runnable artifacts diverge from original claims, across domains like training-free memory systems and physics-informed neural networks. Results demonstrate traceable trajectories from problem formulation to validation, advocating for evaluating AI scientists by attributable synthesis and validation processes.

research harness · claim drift · evidential basis · training-free memory · physics-informed neural networks

Scaling Learning-based AEB with Massive Unlabeled Data

arXiv cs.AI · Xiangyu Wang, Yang Zhan, Mengxiang Hao, Chuanchuan Zhong · 2026-06-17

The paper proposes a stabilized meta-feedback semi-supervised learning (MF-SSL) framework for scaling learning-based automatic emergency braking (AEB) using massive unlabeled fleet data. The method introduces Noise-Aware Decoupling to exclude ambiguity-prone anchors from teacher updates and kinematics-gated pseudo-labeling with a teacher conflict penalty to mitigate mismatch-induced errors. Evaluations show consistent improvements as unlabeled data scales from 1M to 1B windows, with the deployed model achieving a 100:1 positive-to-false activation ratio and 35% accident-free mileage improvement over a rule-based baseline across 10^9 km of real-world driving.

automatic emergency braking · meta-feedback ssl · pseudo-labeling · noise-aware decoupling · kinematics-gating

URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

arXiv cs.AI · Xinze Zhang · 2026-06-17

KinemaForge introduces a constraint-driven pipeline for reconstructing simulation-ready URDFs from RGB-D sequences, jointly inferring part-level shape, joint topology, and parameters while enforcing energy consistency. The method combines a kinematic constraint graph, differentiable screw-axis solver, and energy residual loss to backpropagate through articulated-body dynamics. Results show 37.4% lower joint-axis error than PARIS, 46.6% improvement over Ditto, 64% reduced simulation drift, and 14.6 percentage point higher manipulation success rates across PartNet-Mobility categories.

urdf synthesis · differentiable inference · energy-consistent verification · articulated-body dynamics · kinematic constraint graph

Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining

arXiv cs.AI · Wicaksono Leksono Muhamad, Yunita Sari · 2026-06-17

The authors propose ImpSH, a triplet-based framework for implicit hate speech detection that aligns posts with implied statements and employs context-bounded semi-hard negative mining to focus on near-confusable instances. The method also introduces AugSH, which generates positives via data augmentation. Evaluations on IHC, SBIC, and DynaHate datasets using BERT and HateBERT show that ImpSH improves cross-domain performance under matched preprocessing and tuning budgets, outperforming standard supervised contrastive baselines. Representation analysis reveals tighter positive pairs with balanced global spread, demonstrating enhanced generalization across domains.

implicit hate speech · triplet-based framework · context-bounded mining · supervised contrastive learning · cross-domain performance

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

arXiv cs.AI · Yehang Zhang, Jianchong Su, Haojian Huang, Yifan Chang · 2026-06-17

The paper introduces WorldLines, a benchmark for evaluating long-horizon stateful embodied agents in household assistance scenarios, addressing gaps in existing benchmarks that focus on short-horizon tasks or language-centric retrieval. WorldLines constructs temporally extended household traces incorporating dialogues, actions, execution feedback, and state changes, converting them into evidence-linked samples for Memory QA and Embodied Task Planning. The authors propose ObsMem, an observer-grounded memory framework that maintains visibility-aware memories and action-native state trails for state-aware decision-making. Experiments highlight persistent challenges in partial observability, state overwriting, and memory-to-plan translation, with ObsMem demonstrating superior performance as a reference architecture.

embodied agents · long-horizon tasks · partial observability · memory framework · task planning

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

arXiv cs.AI · Hehai Lin, Qi Yang, Chengwei Qin · 2026-06-17

Skill-MAS introduces a novel approach to automatic Multi-Agent Systems (MAS) generation by evolving Meta-Skills, decoupling experience retention from parametric updates. The method employs a closed optimization loop comprising Multi-Trajectory Rollout for behavioral sampling and Selective Reflection for hierarchical contrastive analysis to distill systemic experience. Evaluated across four complex benchmarks and four distinct LLMs, Skill-MAS achieves significant performance gains while maintaining cost-efficiency. The evolved Meta-Skills demonstrate robustness and transferability across unseen tasks and different LLMs, addressing the limitations of inference-time and training-time MAS approaches.

meta-skill · multi-agent systems · hierarchical contrastive analysis · multi-trajectory rollout · selective reflection

Improving Human-Robot Teamwork in Urban Search and Rescue Through Episodic Memory of Prior Collaboration

arXiv cs.AI · Taewoon Kim, Emma van Zoelen, Mark Neerincx · 2026-06-17

The study demonstrates that robots can improve human-robot teamwork in Urban Search and Rescue (USAR) by leveraging episodic memory of prior collaboration patterns (CPs). Using the MATRX USAR environment, historical CPs are encoded as knowledge-graph episodic memories, with graph representation learning applied for node classification to select optimal memories for reuse. Initializing robots with a single prior CP increased rescue success from 25.7% to 41.3% and reduced task time by 283 seconds across 20 participants and 160 observations, with notable early-interaction improvements.

human-robot teamwork · episodic memory · knowledge-graph · urban search and rescue · graph representation learning

Target-confidence Recourse Using tSeTlin machines: TRUST

arXiv cs.AI · K. Darshana Abeyrathna, Sara El Mekkaoui, Nils Enric Canut Taugbøl, Anuja Vats · 2026-06-17

The paper introduces TRUST, a framework for generating counterfactual explanations with user-specified confidence targets, addressing fragility in conventional boundary-based recourse methods. TRUST employs Probabilistic Tsetlin Machines (PTMs) and Bayesian optimization to directly optimize for minimal changes that meet confidence thresholds, linking confidence to rule stability. Experiments on synthetic and real-world datasets, including Haberman (L2 distance 0.10 at 0.92 confidence), demonstrate TRUST's superior robustness and interpretability compared to existing approaches.

counterfactual explanations · probabilistic tsetlin machines · bayesian optimization · recourse robustness · confidence thresholds

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

arXiv cs.AI · Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han · 2026-06-17

The paper proposes a data-centric approach to improve long-context reasoning in LLMs through reinforcement learning, avoiding complex reward engineering. The method constructs a curated dataset (~14K examples) targeting three task families (retrieval, multi-evidence synthesis, reasoning) and applies minimal outcome-based GRPO training. Evaluations on Qwen3-4B/8B/30B-A3B show average gains of +7.2/+3.2/+6.4 points across seven benchmarks, with additional improvements in agentic tasks (+4.8 GAIA, +7.0 BrowseComp).

long-context reasoning · reinforcement learning · data-centric · outcome-based grpo · agentic tasks

Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation

arXiv cs.AI · Chenghao Xu · 2026-06-17

The paper proposes encoding intelligence directly in spatial geometry by inducing a Riemannian metric on configuration manifolds, replacing traditional agent-centric planning. The method employs an Encoder-Router network with three parameter groups (frame, modulation, and basic coefficients) combined via semigroup-superposition to generate metric fields. Trained on a single two-obstacle scene, the model achieves zero-shot generalization to novel configurations, exhibiting orders-of-magnitude cost separation between collision-free and penetrating paths.

riemannian metric · configuration manifold · semigroup-superposition · zero-shot generalization · encoder-router network

Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

arXiv cs.AI · Jiaxi Liu, Aiping Yang, Yuhang Yang, Shuqi Zhang · 2026-06-17

The authors introduce Maturing Markov Decision Processes (MMDPs), a novel formulation addressing asymmetric evolution of information and decision flexibility in sequential decision problems. MMDPs explicitly model nested information-action asymmetry through an expiring-action priority principle, identifying actions requiring immediate resolution. They propose a structure-aware reinforcement learning framework incorporating stage-aware policy design, expiring-action abstraction, and search-augmented learning with distillation. Experiments across multi-supplier replenishment, cash-management environments, and production-scale simulators demonstrate improved learning efficiency, with benefits scaling with problem complexity. The approach outperforms standard MDP formulations by preserving the inherent asymmetry between information gain and action expiration.

markov decision processes · information-action asymmetry · expiring-action priority · structure-aware reinforcement learning · search-augmented learning

SwitchBraidNet: Quantisation-Aware Lightweight Architecture for Hybrid Brain-Computer Interface

arXiv cs.AI · Gourav Siddhad, Yogesh Kumar Meena · 2026-06-17

SwitchBraidNet introduces a quantisation-aware lightweight architecture for hybrid brain-computer interfaces (BCIs), addressing computational constraints in embedded hardware. The model combines a dual-path temporal braid for multiscale oscillatory feature extraction, an adaptive squeeze-and-excitation spatial switch for electrode gating, and a log-variance readout layer for band-power encoding. Evaluated on the OpenBMI dataset, it achieves 69.49% MI accuracy (FP16), 93.48% SSVEP accuracy (FP32), and 64.82 bits/min hybrid information transfer rate (FP16), with a 3.03 KB INT8 footprint, demonstrating robust performance across precisions.

hybrid brain-computer interface · quantisation-aware training · motor imagery · steady-state visual evoked potentials · embedded hardware
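
The log-variance readout in the summary above corresponds to a standard band-power feature: for a zero-mean, band-passed signal the variance is its average power, and the logarithm compresses the range. A minimal sketch, assuming the per-electrode samples are already band-passed; the function name and the `eps` guard are illustrative and not taken from the paper.

```python
import math
from statistics import pvariance

def log_variance_features(channels, eps=1e-12):
    """Return log(variance) per channel; eps avoids log(0) on flat signals."""
    return [math.log(pvariance(samples) + eps) for samples in channels]

# Two toy "electrodes": the second has twice the amplitude, so four times the power.
features = log_variance_features([
    [1.0, -1.0, 1.0, -1.0],
    [2.0, -2.0, 2.0, -2.0],
])
```

Doubling the amplitude shifts the feature by log(4), so amplitude ratios become additive offsets.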

Reinforcement Learning Foundation Models Should Already Be A Thing

arXiv cs.AI · Abdelrahman Zighem, Jill-Jênn Vie · 2026-06-17

The paper argues for developing reinforcement learning (RL) foundation models using synthetic MDPs, analogous to existing tabular foundation models like TabPFN. It identifies RL as an overlooked structured domain where synthetic priors could enable in-context learning, noting that MDPs' fixed-size sufficient statistics are naturally compatible with transformer architectures. As a proof of concept, the authors train a model entirely on synthetic MDPs and report zero-shot performance on tabular benchmarks: online, it outperforms UCB-VI and tabular Q-learning in sample efficiency; offline, it matches VI-LCB.

reinforcement learning · foundation models · synthetic mdp · in-context learning · tabular benchmarks

Rescaling MLM-Head for Neural Sparse Retrieval

arXiv cs.AI · Youngjoon Jang, Seongtae Hong, Jonah Turner, Heuiseok Lim · 2026-06-17

The study identifies a scale mismatch in MLM-head outputs when using strong pretrained encoders for learned sparse retrieval (LSR) models like SPLADE, leading to performance degradation. A zero-cost initialization-time correction rescales the MLM-head projection, improving training stability without architectural changes. Empirical results show this adjustment enhances performance for large-norm backbones (ModernBERT, Ettin) across in-domain and out-of-domain retrieval benchmarks, often matching or surpassing BERT-SPLADE baselines.

mlm-head · sparse retrieval · splade · pretrained encoders · contrastive training

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

arXiv cs.AI · Yingyu Shan, Yuhang Guo, Zihao Cheng, Zeming Liu · 2026-06-17

We introduce SC-GRPO, a self-conditioned credit assignment method for reinforcement learning with verifiable rewards (RLVR) that addresses uniform credit assignment in existing approaches like GRPO. SC-GRPO leverages per-token KL divergence between original and self-conditioned distributions, computed from verified trajectories, to weight GRPO gradients multiplicatively. Evaluated across five benchmarks in math, code, and agentic tasks, SC-GRPO outperforms GRPO by 8.1% and DAPO by 5.9%, demonstrating stronger out-of-distribution performance and surpassing On-Policy Distillation.

credit assignment · kl divergence · verifiable rewards · self-conditioning · gradient weighting
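
The multiplicative per-token weighting described above can be illustrated as follows. This is a rough sketch under stated assumptions, not the paper's implementation: the KL direction (original relative to self-conditioned), the rescaling of weights to average 1, and all names are hypothetical choices.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def token_weights(original, conditioned):
    """One weight per token: its KL divergence, rescaled so the weights average to 1."""
    kls = [kl_divergence(p, q) for p, q in zip(original, conditioned)]
    mean_kl = sum(kls) / len(kls)
    if mean_kl == 0:
        return [1.0] * len(kls)  # distributions agree everywhere: uniform credit
    return [kl / mean_kl for kl in kls]

# Toy two-token sequence over a two-word vocabulary (hypothetical numbers).
original = [[0.5, 0.5], [0.9, 0.1]]
conditioned = [[0.5, 0.5], [0.5, 0.5]]
weights = token_weights(original, conditioned)

sequence_advantage = 1.5  # hypothetical group-relative advantage for this rollout
per_token_advantage = [sequence_advantage * w for w in weights]
```

In this toy case the first token gets no credit because conditioning leaves its distribution unchanged, while the second token absorbs all of it.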

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

arXiv cs.AI · Tengfei Lyu, Zirui Yuan, Xu Liu, Kai Wan · 2026-06-17

ProfiLLM introduces a utility-aligned agentic LLM pipeline for industrial ride-hailing dispatch, addressing three key constraints: platform-scale data exceeding LLM context windows, sparse long-tail user interactions, and misalignment between fluent profiles and downstream utility. The method combines (1) Tool-Augmented Global Knowledge Mining, equipping an LLM agent with 27 analytical tools to extract reusable knowledge, and (2) Utility-Aligned Profile Exploration, generating and refining candidate profiles via a lightweight utility proxy and DPO fine-tuning. Deployment on DiDi's production dispatcher yielded +6.14% AUC improvement, +4.35% GMV gain in simulation, and consistent online A/B test improvements including +0.47% GMV and -0.82% Cancel-Before-Accept rate.

llm data pipeline · utility-aligned profiling · tool-augmented mining · dpo fine-tuning · ride-hailing dispatch

SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval

arXiv cs.AI · Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim · 2026-06-17

The paper introduces SHIFT, a training-free method for mitigating language bias in Multilingual Information Retrieval (MLIR). The approach estimates relative language vectors from parallel translations and applies them as index-side corrections to document embeddings, reducing language-specific offsets without model retraining. Evaluations across four MLIR benchmarks demonstrate SHIFT's effectiveness in improving retrieval performance by reducing language preference bias in diverse dense retrieval models.

multilingual information retrieval · dense retrieval · language bias · embedding correction · parallel translation
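
An index-side correction of the kind summarized above might look like the sketch below. It assumes the language vector is the mean embedding offset between translations and their pivot-language originals, and that the correction is a plain subtraction; both are illustrative assumptions, and the paper's estimator may differ.

```python
def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def language_vector(pivot_embeddings, translated_embeddings):
    """Average offset between each translation and its pivot-language original."""
    offsets = [[t - p for p, t in zip(pivot, trans)]
               for pivot, trans in zip(pivot_embeddings, translated_embeddings)]
    return mean_vector(offsets)

def correct_document(embedding, offset):
    """Remove the estimated language offset from a document embedding at index time."""
    return [e - o for e, o in zip(embedding, offset)]

# Hypothetical 2-d embeddings of two parallel sentence pairs.
pivot = [[1.0, 0.0], [0.0, 1.0]]
translated = [[1.5, 0.2], [0.5, 1.2]]
offset = language_vector(pivot, translated)
corrected = correct_document([2.5, 1.2], offset)
```

Because only stored document vectors change, the query encoder and model weights stay untouched, which is what makes such a correction training-free.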

Closing the Loop: PID Feedback Control for Interpretable Activation Steering in Symbolic Music Generation

arXiv cs.AI · Ioannis Prokopiou, Pantelis Vikatos, Maximos Kaliakatsos-Papakostas, Theodoros Giannakopoulos · 2026-06-17

The paper introduces a PID feedback control framework for interpretable activation steering in symbolic music generation using the Multitrack Music Transformer (MMT). By applying Difference-in-Means (DiffMean) to isolate latent directions for Pitch and Duration attributes, the authors validate the Linear Representation Hypothesis and propose Dual Steering with Gram-Schmidt Orthogonalization to mitigate feature entanglement. Experiments show this geometric decoupling reduces conceptual interference by 37% compared to naive vector addition, enabling independent deterministic control of musical attributes during inference.

activation steering · multitrack music transformer · difference-in-means · gram-schmidt orthogonalization · linear representation hypothesis
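
The two geometric steps named above, Difference-in-Means and Gram-Schmidt orthogonalization, can be sketched on toy vectors. The activations and names here are hypothetical; the sketch shows only the vector arithmetic, not the PID loop or the MMT model.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def diff_mean(positive, negative):
    """Difference between the mean activation of two contrasting example sets."""
    dim = len(positive[0])
    mean_pos = [sum(v[i] for v in positive) / len(positive) for i in range(dim)]
    mean_neg = [sum(v[i] for v in negative) / len(negative) for i in range(dim)]
    return [a - b for a, b in zip(mean_pos, mean_neg)]

def orthogonalize(v, u):
    """One Gram-Schmidt step: remove from v its component along u."""
    scale = dot(v, u) / dot(u, u)
    return [a - scale * b for a, b in zip(v, u)]

# Hypothetical activations for high-pitch vs low-pitch examples.
pitch_dir = diff_mean([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
raw_duration_dir = [1.0, 2.0]  # entangled: has a component along pitch_dir
duration_dir = orthogonalize(raw_duration_dir, pitch_dir)
```

After the projection step, adding `duration_dir` to an activation leaves its component along `pitch_dir` unchanged, which is the decoupling the summary refers to.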

R2D-RL: A RoboCup 2D Soccer Environment for Multi-Agent Reinforcement Learning

arXiv cs.AI · Haobin Qin, Baofeng Zhang, Hidehisa Akiyama, Keisuke Fujii · 2026-06-17

R2D-RL introduces a reinforcement learning environment that bridges RoboCup 2D Soccer Simulation (RCSS2D) with modern Python-based multi-agent RL workflows through shared-memory communication and cycle-level synchronization. The system supports full-field and scenario-based training, offering configurable opponents, discrete/hybrid action spaces, action masks, EPV-based reward shaping, and parallel execution. Benchmarking includes front-goal scenarios and 11-vs-11 full-field matches with baseline results. This integration addresses the challenge of adapting RCSS2D's competition-oriented architecture for MARL research in robot soccer, which combines partial observability, cooperative-adversarial interactions, sparse rewards, and tactical planning.

robocup 2d · multi-agent reinforcement learning · shared-memory communication · reward shaping · action masks

Bayesian Anytime Pareto Set Identification for Multi-Objective Multi-Armed Bandits

arXiv cs.AI · Lennert Saerens, Bram Silue, Eleni Litsa, Peter Vrancx · 2026-06-17

The authors introduce Top-Two Pareto Front Thompson Sampling (TTPFTS), the first Bayesian anytime algorithm for Pareto Set Identification in Multi-Objective Multi-Armed Bandits. The method employs Thompson sampling to sequentially identify Pareto optimal solutions while quantifying uncertainty via a novel confidence metric. Empirical evaluations on synthetic benchmarks and a molecular discovery task demonstrate TTPFTS outperforms fixed-budget baselines, with theoretical guarantees of asymptotic correctness provided.

multi-objective optimization · thompson sampling · pareto front · bayesian bandits · uncertainty quantification
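
As background for the summary above, the sketch below shows Pareto dominance and a single Thompson-style draw over arms. It assumes independent Gaussian posteriors per objective and maximization; it does not reproduce the top-two selection rule or the confidence metric of TTPFTS, and all names are illustrative.

```python
import random

def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_set(means):
    """Indices of arms that no other arm dominates."""
    return [i for i, m in enumerate(means)
            if not any(dominates(other, m) for j, other in enumerate(means) if j != i)]

def thompson_pareto_sample(posteriors, rng):
    """Draw one mean vector per arm from its posterior; return the sampled Pareto set."""
    sampled = [[rng.gauss(mu, sd) for mu, sd in arm] for arm in posteriors]
    return pareto_set(sampled)

arms = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.2, 0.2)]
front = pareto_set(arms)  # → [0, 1, 2]

# Hypothetical posteriors: (mean, standard deviation) per objective.
posteriors = [[(mu, 0.1) for mu in arm] for arm in arms]
sampled_front = thompson_pareto_sample(posteriors, random.Random(0))
```

Repeating the draw and counting how often each arm lands in the sampled set is one simple way to gauge how settled the Pareto set is.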

RedactionBench

arXiv cs.AI · Sean Brynjólfsson, Shashvat Jayakrishnan, Esha Sali, Diptanshu Purwar · 2026-06-17

We introduce RedactionBench, a manually annotated benchmark of 200 documents across 11 domains for evaluating contextual redaction of personally identifiable information (PII), grounded in contextual integrity. The benchmark includes R-Score, a novel character-level metric that accounts for semantic similarity and nullifies formatting variations. Evaluations of 35 models, including Named Entity Recognition systems, Small Language Models, and frontier models with agentic tools, demonstrate that contextual redaction remains unsolved. Human evaluations with 80+ users reveal high consensus on mandatory redactions (89.4%) and safe text preservations (94.1%), but low agreement on contextual redactions (47.7%), highlighting the subjective nature of contextual privacy.

contextual integrity · personally identifiable information · named entity recognition · semantic similarity · agentic tools

Private Learning with Public Feature Conditioning

arXiv cs.AI · Shuli Jiang, Walid Krichene, Nicolas Mayoraz · 2026-06-17

We introduce Cond-DP, a differentially private regression method leveraging public feature conditioning to accelerate optimization under privacy constraints. Cond-DP incorporates a data-driven conditioning matrix derived from public features, exploiting their rapidly decaying spectra to reshape the optimization landscape. Theoretical convergence guarantees are provided for convex, strongly convex, and non-convex settings, with Cond-DP recovering standard DPSGD as a special case. Empirical evaluations demonstrate that Cond-DP consistently outperforms state-of-the-art baselines across diverse datasets and model architectures under label DP, achieving faster convergence without additional privacy cost.

differentially private regression · public feature conditioning · optimization landscape · label dp · convergence guarantees

Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs

arXiv cs.AI · Chris Lee, Flora Salim, Benjamin Tag, Francisco Cruz · 2026-06-17

The paper presents an RLHF-based system for improving co-speech gesture generation in humanoid robots, addressing the rigidity of expert-authored animations and the unnatural outputs of pure LLM approaches. The method integrates ChatGPT with the Pepper robot for baseline gesture synthesis, then applies iterative reinforcement learning with human feedback to refine motion quality. Evaluations demonstrate that RLHF-enhanced gestures achieve higher perceived naturalness, expressiveness, and fluidity than the initial LLM-generated outputs.

co-speech gestures · reinforcement learning with human feedback · humanoid robotics · large language models · human-robot interaction

What Must Generalist Agents Remember?

arXiv cs.AI · Khurram Yamin, Namrata Deka, Maitreyi Swaroop, Albert Ting · 2026-06-17

The paper formalizes memory requirements for generalist agents operating across multiple environments and goals, proving they must maintain distinct memory distributions at observational bottlenecks when domains require incompatible optimal actions. It establishes a separation theorem showing near-optimal policies cannot rely solely on current observations but require domain-relevant memory retention. Additionally, the work demonstrates that memory containing sufficient goal-value estimation information enables approximate reconstruction of local transition dynamics, characterizing memory's role in domain disambiguation, transition modeling, and planning for generalist agents.

generalist agents · observational bottleneck · separation theorem · transition dynamics · domain disambiguation

SWE-Future: Forecast-Conditioned Data Synthesis for Future-Oriented Software Engineering Agents

arXiv cs.AI · Qiao Zhao, JianYing Qu, Jun Zhang, Yehua Yang · 2026-06-17

The paper introduces SWE-Future, a forecast-conditioned data synthesis method for generating future-oriented software engineering tasks without direct historical replay. The approach uses pre-$T_0$ repository evidence to forecast task families (feature implementation, bugfix, refactor) and validates these forecasts retrospectively against later pull requests. In an 80-repository study, the method achieves 58.1% relevance under semantic matching. The validated forecasts then condition synthesis of a 200-task dataset across 61 repositories, demonstrating reduced dependence on historical replay while maintaining realism.

forecast-conditioned synthesis · software engineering agents · task-family prediction · repository evolution · synthetic benchmarks

Two-Phase Bilevel Search for the Moving-Target Traveling Salesman Problem with Moving Obstacles

arXiv cs.AI · Allen George Philip, Anoop Bhat, Sivakumar Rathinam, Howie Choset · 2026-06-17

We propose a Mixed-Integer Conic Programming (MICP) formulation and a Two-Phase Bilevel Search (TPBS) algorithm for the Moving-Target Traveling Salesman Problem with Moving Obstacles (MT-TSP-MO), extending MT-TSP to account for dynamic obstacles. The MICP leverages off-the-shelf solvers, while TPBS provides scalable, high-quality solutions. Both methods were evaluated against a baseline on instances with up to 40 targets and 40 obstacles, demonstrating superior performance in success rates, solution costs, and computation time.

moving-target traveling salesman problem · mixed-integer conic programming · two-phase bilevel search · moving obstacles · time windows

Graph Grounded Cross Attention Transformer Neural Network for Structurally Constrained Full Event Sequence Generation in Predictive Process Monitoring

arXiv cs.AI · Fang Wang, Ernesto Damiani · 2026-06-17

The Graph Grounded Cross Attention Transformer Neural Network (GGATN) addresses structurally constrained full event sequence generation in predictive process monitoring by unifying activity, timestamp, and attribute prediction. GGATN integrates a global process graph as structural memory with Transformer self-attention for sequence context, employing graph-grounded cross-attention to inject process topology and Viterbi-style constrained decoding for feasible paths. Evaluations on six benchmark logs demonstrate superior sequence similarity (Damerau-Levenshtein, bigram control flow), duration distribution fidelity, and zero hallucinated activities versus LLM baselines, with ablations confirming the graph encoder's stability as a structural prior.

predictive process monitoring · graph-grounded cross attention · viterbi decoding · structural prior · event sequence generation
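
Viterbi-style constrained decoding, as mentioned above, can be illustrated with a small dynamic program that only follows edges present in a process graph. The graph, scores, and function name are hypothetical; this is a generic sketch, not GGATN's decoder.

```python
def constrained_viterbi(step_scores, allowed, start):
    """Best-scoring activity path that only uses transitions in `allowed`.

    step_scores: one dict per step mapping activity -> log score.
    allowed: dict mapping activity -> set of permitted successor activities.
    """
    best = {start: (0.0, [start])}  # activity -> (total score, path ending there)
    for scores in step_scores:
        nxt = {}
        for prev, (total, path) in best.items():
            for act in allowed.get(prev, ()):
                if act not in scores:
                    continue
                cand = total + scores[act]
                if act not in nxt or cand > nxt[act][0]:
                    nxt[act] = (cand, path + [act])
        best = nxt
    if not best:
        return None  # no feasible path of the requested length
    return max(best.values(), key=lambda entry: entry[0])[1]

# Toy process graph: A may only be followed by C, B only by D.
allowed = {"start": {"A", "B"}, "A": {"C"}, "B": {"D"}}
step_scores = [{"A": -0.1, "B": -2.0}, {"C": -3.0, "D": -0.1}]
path = constrained_viterbi(step_scores, allowed, "start")
```

Picking the top activity at each step independently would give A followed by D, a transition the graph does not contain; the constrained search returns a feasible path instead.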

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

arXiv cs.AI · Tolga Şakar · 2026-06-17

Morpheus introduces a neural morphology-aware tokenizer and word embedder for Turkish that preserves semantic morphemes while ensuring reversibility ($\mathrm{decode}(\mathrm{encode}(w)) = w$). The method employs a differentiable Poisson-binomial dynamic program to predict morpheme boundaries from character-level probabilities, enabling exact segmentation at inference. Results show superior performance in bits-per-character (1.425), morphological alignment (MorphScore macro-F1 0.61), and lexical retrieval (root-family MAP 0.85), though contextual tasks favor heavier encoders.

morphology-aware · reversible tokenizer · poisson-binomial · dynamic program · word embedding
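
For context on the Poisson-binomial dynamic program mentioned above: given independent per-position boundary probabilities, the textbook recurrence below yields the distribution over the number of boundaries in a word. This is the standard, non-differentiable form with hypothetical probabilities, offered as background and not as Morpheus's actual layer.

```python
def poisson_binomial(probs):
    """Distribution of the number of successes among independent Bernoulli trials."""
    dist = [1.0]  # before any position is seen: zero boundaries with probability 1
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1 - p)   # no boundary at this position
            nxt[k + 1] += mass * p     # boundary at this position
        dist = nxt
    return dist

# Hypothetical boundary probabilities for the six gaps in "evlerde" (ev + ler + de).
boundary_probs = [0.05, 0.9, 0.05, 0.05, 0.9, 0.05]
count_dist = poisson_binomial(boundary_probs)
most_likely_count = max(range(len(count_dist)), key=count_dist.__getitem__)

simple = poisson_binomial([0.5, 0.5])  # → [0.25, 0.5, 0.25]
```

Each position costs one pass over the current distribution, so the whole table takes quadratic time in word length.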

TW-LegalBench: Measuring Taiwanese Legal Understanding

arXiv cs.AI · Fei-Yueh Chen, Chun Huang Lin, Chan Wei Hsu, Kuan Hsuan Yeh · 2026-06-17

TW-LegalBench introduces a benchmark for evaluating large language models (LLMs) on Taiwanese legal reasoning, addressing the gap in jurisdiction-specific legal understanding. The benchmark comprises 16,000+ multiple-choice questions, 117 open-ended essay questions, and 14,000+ legal judgment prediction instances, sourced from official Taiwanese legal examinations and judgments. Evaluation of 13 LLMs reveals that top models surpass the passing threshold for qualified lawyers (11%) but fall short for judges and prosecutors (1-2%). While models show reasonable verdict type accuracy and sentence prediction capability, they struggle with exact legal article citation, highlighting challenges in reliable legal text generation.

tw-legalbench · legal judgment prediction · jurisdiction-specific · llm-as-judge · sentencing accuracy

Leveraging Energy Features for Surface Classification with Deep Learning: A Comparative Analysis Across Three Independent Datasets

arXiv cs.AI · Alexander Belyaev, Oleg Kushnarev · 2026-06-17

This study demonstrates the efficacy of energy-derived features for surface classification in mobile robotics, either as standalone inputs or combined with inertial data. The authors evaluated recurrent neural networks, convolutional neural networks, encoder-only transformers, and Mamba state-space models across three datasets, using automated hyperparameter tuning and sequence length optimization. Results show 85-90% accuracy with energy features alone, 96-99% when combined with inertial data (1-2% improvement over inertial-only), with convolutional neural networks performing best overall.

surface classification · energy-based features · inertial data · deep learning architectures · hyperparameter tuning

Dual-Channel Grounded World Modeling (DCGWM): Structural Prevention of Objective Interference Collapse via Heterogeneous External Grounding with Inward-Only Gradient Flow

arXiv cs.AI · Akshay Hazare · 2026-06-17

The paper proposes Dual-Channel Grounded World Modeling (DCGWM), a method to prevent Objective Interference Collapse (OIC) in Joint Embedding Predictive Architectures (JEPAs) when grounded against heterogeneous external signals. DCGWM partitions the latent space into physical (Z_p) and behavioral (Z_b) subspaces with inward-only gradient flow, using separate grounding channels: a Physical Grounding Channel with VICReg-style alignment and a Social-Behavioral Grounding Channel with multi-agent trajectory alignment. Theoretical results show the partition eliminates gradient interference, inherits anti-collapse guarantees, and requires generative isolation under specific geometric assumptions. Experimental validation is pending.

objective interference collapse · joint embedding predictive architectures · dual-channel grounding · latent space partitioning · gradient flow

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

arXiv cs.AI · Jaeho Lee, Nick Merrill, Ezra Karger · 2026-06-17

ForecastBench-Sim introduces a simulated-world forecasting benchmark leveraging game rollouts from Freeciv, a turn-based strategy game, to address limitations of real-world forecasting tasks. The benchmark provides structured snapshots of game states, enabling forecasting questions at arbitrary time horizons, paired intervention worlds for conditional queries, and resolved examples of rare outcomes. It includes a pipeline, question families, scoring protocol, and validation slices from model evaluations and a human pilot. ForecastBench-Sim complements real-world benchmarks by offering controlled, immediately resolvable tasks for probabilistic reasoning under dynamic conditions.

forecasting benchmark · simulated-world · game rollouts · probabilistic reasoning · dynamic world states

Bounded Context Management for Tabular Foundation Models on Stream Learning

arXiv cs.AI · Jinmo Lee, Doyun Choi, Moongi Choi, Jaemin Yoo · 2026-06-17

The paper introduces CURE, a context management policy for tabular foundation models (TFMs) in stream learning scenarios. CURE addresses three requirements: preserving recent examples, retaining uncertain examples, and removing redundant examples, via entropy-gated admission and redundancy-aware eviction. Evaluated on seven streams, CURE achieves up to 27.0% relative improvement over classical stream learners, demonstrates robustness across TFM backbones, and outperforms other policy variants.

tabular foundation models · stream learning · context management · entropy-gated admission · redundancy-aware eviction
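
The three requirements listed above (keep recent, keep uncertain, drop redundant) can be sketched as a bounded buffer. Everything below is an illustrative assumption rather than CURE itself: the entropy threshold, the Euclidean nearest-neighbour notion of redundancy, the rule that the newest item is never evicted, and the class name. It also assumes a capacity of at least 1.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

class BoundedContext:
    def __init__(self, capacity, min_entropy):
        self.capacity = capacity
        self.min_entropy = min_entropy
        self.items = []  # (features, label) pairs, oldest first

    def offer(self, features, label, predicted_probs):
        """Admit an example only if the model was uncertain about it."""
        if entropy(predicted_probs) < self.min_entropy:
            return False
        self.items.append((features, label))
        if len(self.items) > self.capacity:
            self._evict_most_redundant()
        return True

    def _evict_most_redundant(self):
        """Drop the older item lying closest to another stored item; keep the newest."""
        def nearest(i):
            return min(math.dist(self.items[i][0], self.items[j][0])
                       for j in range(len(self.items)) if j != i)
        victim = min(range(len(self.items) - 1), key=nearest)
        del self.items[victim]

context = BoundedContext(capacity=2, min_entropy=0.5)
admitted = [
    context.offer([0.0], 0, [0.5, 0.5]),
    context.offer([0.1], 0, [0.5, 0.5]),
    context.offer([5.0], 1, [0.99, 0.01]),  # confident prediction: rejected
    context.offer([5.0], 1, [0.5, 0.5]),    # uncertain: admitted, triggers eviction
]
```

In the toy run the two near-duplicate points compete for one slot, while the distant new point is kept.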

scGTN: Deep Siamese Graph Transformer Network for Single-cell RNA Sequencing Clustering

arXiv cs.AI · Jinke Wu, Yifan Wang, Siyu Yi, Caiyang Yu · 2026-06-17

We propose scGTN, a deep Siamese Graph Transformer Network for single-cell RNA sequencing (scRNA-seq) clustering that addresses data sparsity, noise, and intercellular structural information. The method formulates scRNA-seq data as a graph, constructs dual augmented graph views, and employs a Siamese graph transformer network to capture structural relationships via shortest-path information and node-wise distances. Cell clustering is guided by an optimal transport strategy in a self-supervised manner. Experiments on multiple benchmark scRNA-seq datasets demonstrate scGTN's consistent superiority over existing methods. Code is publicly available.

single-cell rna sequencing · graph transformer network · siamese network · optimal transport · self-supervised learning

NeuralMUSIC: A Hybrid Neural-Subspace Framework for Robot Sound Source Localization

arXiv cs.AI · Yizhuo Yang, Junqiao Fan, Shenghai Yuan, Lihua Xie · 2026-06-17

NeuralMUSIC introduces a hybrid neural-subspace framework for robot sound source localization, combining neural network-based covariance matrix estimation with classical MUSIC pipeline processing. The method employs eigenvalue decomposition (EVD) and pseudo-spectrum computation, enhanced by a Frequency Attention Fusion (FAF) module for final direction-of-arrival (DOA) estimation. A Self-supervised Spatial Correlation Learning (SSCL) strategy improves data efficiency using unlabeled acoustic data. Experiments demonstrate competitive localization accuracy, robustness, and cross-domain generalization in robotic tasks.

neural-subspace · music pipeline · eigenvalue decomposition · frequency attention fusion · self-supervised learning

LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

arXiv cs.AI · Chengfu Liu, Dongyang Hou, Junwu Xiang, Cheng Yang · 2026-06-17

The authors propose LandslideAgent, an instruction-driven agentic framework for autonomous landslide identification, comprising three components: (1) LandslideBench, a multimodal dataset with seven subtype labels constructed via multi-VLM cross-validation; (2) LandslideVLM, a LoRA-fine-tuned vision-language model for geological semantics; (3) a domain-rule-enhanced agent with dual-rule controllers for tool invocation. LandslideVLM achieves accuracy gains of 10.96-32.87% on discrimination and classification tasks, while LandslideAgent enables full-process landslide analysis through multi-source spatial data inference.

landslide identification · vision-language model · multimodal dataset · domain-rule controller · geological semantics

Augmenting Dysarthric Speech Severity Assessment with MOS Supervision

arXiv cs.AI · Kaimeng Jia, Minzhu Tu, Zengrui Jin, Siyin Wang · 2026-06-17

The study proposes augmenting dysarthric speech severity assessment by leveraging Mean Opinion Score (MOS)-annotated speech synthesis data from the QualiSpeech corpus. Fine-tuning on synthesis assessment data improves intelligibility and naturalness prediction, with joint training showing particular gains in naturalness evaluation. Results indicate perceptual commonalities between synthesis artifacts and dysarthric speech, demonstrating that synthesis evaluation corpora can effectively reduce dependency on scarce clinical annotations for dysarthria assessment.

dysarthria · mean opinion score · speech synthesis · intelligibility prediction · naturalness assessment

PEC-Home: Interpretation of Progressively Elliptical Commands in Smart Homes

arXiv cs.AI · Yingyu Shan, Zeming Liu, Silin Li, Boao Qian · 2026-06-17

PEC-Home introduces a simulated dataset to address elliptical command interpretation challenges in smart homes, focusing on referential and intention ambiguities. The study evaluates LLMs like GPT-4o, revealing persistent execution accuracy deficits even with dialogue history tools. Results indicate current assistants fail to match performance achieved with complete commands, highlighting limitations in handling progressive omission.

elliptical commands · referential ambiguity · intention ambiguity · llms · gpt-4o

EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation

arXiv cs.AI · Zecheng Yin, Benedict Jun Ma · 2026-06-17

EffiNav introduces a framework for efficient Object Goal Navigation (ObjNav) that fuses depth and vision-language modalities. The method targets exploration efficiency, avoiding redundant motion and revisits to already explored areas. Evaluated on Habitat Matterport 3D (HM3D) and Open-Vocabulary Object Navigation (OVON) benchmarks, EffiNav demonstrates superior performance in Success Rate (SR) and Success weighted by Path Length (SPL). It also adapts to memory-augmented ObjNav tasks on GOAT-BENCH, showing robustness in both simulation and real-world robot deployments.

object goal navigation · depth-vision fusion · habitat matterport 3d · open-vocabulary navigation · path efficiency metrics

BCL: Bayesian In-Context Learning Framework for Information Extraction

arXiv cs.AI · Haoliang Liu, Chengkun Cai, Xu Zhao, Han Zhu · 2026-06-17

The paper introduces BCL, a Bayesian in-context learning framework for information extraction (IE) that systematically optimizes label representations via particle filtering with Bayesian updates. The method operates through four stages: initialization, observation, weight update, and resampling, achieving generalization across both sequence labeling and relation classification tasks. Experiments show consistent performance improvements over existing approaches, addressing limitations in current ICL-based IE methods.

bayesian inference · in-context learning · information extraction · particle filtering · sequence labeling

Code-Augur: Agentic Vulnerability Detection via Specification Inference

arXiv cs.AI · Zhengxiong Luo, Mehtab Zafar, Dylan Wolff, Abhik Roychoudhury · 2026-06-17

Code-Augur introduces a security-specification-first paradigm for agentic vulnerability detection, exposing LLM agents' tacit assumptions as explicit security specifications and refining them via runtime falsification. The system analyzes codebases, commits local invariants as in-source assertions when deeming components secure, and uses guided fuzzing to falsify assumptions, grounding the agent's understanding. Evaluations show Code-Augur detects more vulnerabilities than state-of-the-art agents, uncovering 22 new vulnerabilities in open-source projects, while leveraging widely available LLMs like Sonnet and DeepSeek.

agentic vulnerability detection · security specifications · runtime falsification · guided fuzzing · in-source assertions

AI-Driven Assessment of Human Tutors: Linking Training Performance to Real-Life Practice

arXiv cs.AI · Danielle R. Thomas, Marie Cynthia Abijuru Kamikazi, Clara Brandt, Conrad Borchers · 2026-06-17

The study introduces an AI-driven system using Gemini-2.5-pro to assess human tutor performance across training and real-life tutoring, bridging a gap in existing platforms. It analyzes 405 session-to-lesson pairs from 86 math tutors, employing mixed-effects models and interrupted time series analysis. Results show training performance predicts real-life application (effect size 0.25 SD), with open responses being more predictive than multiple choice. Tutors demonstrated significant improvements in pedagogical opportunity utilization (61.1% to 68.9%) and execution quality (65.5% to 68.1%), following a gradual trend. The work contributes open datasets, AI prompts, and scoring rubrics for reproducibility.

generative ai · mixed-effects models · pedagogical opportunity · interrupted time series · effect size

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

arXiv cs.AI · Tianming Du, Peijie Yu, Sihan Shang, Danli Shi · 2026-06-17

The paper introduces PhysAssistBench, a benchmark for evaluating LLMs in interactive doctor-patient-EHR assistance scenarios, addressing the gap in current evaluations that test isolated capabilities. The benchmark uses a scalable pipeline to create agentic patients from MIMIC-IV cases, generating multi-turn clinical interactions while preserving clinical factuality. Experiments on 1,296 physician-validated turns reveal that leading LLMs remain unreliable in coordinating knowledge, communication, and EHR system interaction, highlighting a critical bottleneck for clinical applications.

medical llms · interactive benchmark · agentic patients · clinical factuality · ehr system interaction

QC-GAN: A Parameter-Efficient Quaternion Conformer GAN for High-Fidelity Speech Enhancement

arXiv cs.AI · Shogo Yamauchi, Hideaki Tamori, Makoto Sakai, Yosuke Yamano · 2026-06-17

The paper introduces QC-GAN, a parameter-efficient speech enhancement framework combining a Quaternion Conformer generator with MetricGAN-based training. The method leverages Hamilton products for structured weight sharing, reducing parameters while preserving magnitude-phase interdependencies, and employs a metric-learning discriminator to optimize perceptual quality. On VoiceBank+DEMAND, QC-GAN achieves a PESQ score of 3.48 with 0.89M parameters, matching larger SOTA models, while a 35K-parameter variant scores 3.23. Generalization is demonstrated on DNS-Challenge 3 under real-world conditions.

quaternion conformer · metricgan · hamilton product · speech enhancement · pesq
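
For readers unfamiliar with the Hamilton product mentioned in this entry, here is a minimal sketch. It is not taken from the QC-GAN paper: the function names `hamilton` and `as_real_matrix` and the (real, i, j, k) tuple layout are illustrative assumptions. It shows why quaternion layers share weights: a single quaternion weight has four free parameters, yet it acts on a four-component input like a structured 4x4 real matrix.

```python
def hamilton(p, q):
    """Hamilton product of two quaternions given as (real, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def as_real_matrix(w):
    """4x4 real matrix M such that M applied to x equals hamilton(w, x).

    All sixteen entries are built from the same four parameters, which is
    the structured weight sharing that reduces the parameter count.
    """
    a, b, c, d = w
    return [
        [a, -b, -c, -d],
        [b, a, -d, c],
        [c, d, a, -b],
        [d, -c, b, a],
    ]
```

The product is not commutative: `hamilton(i, j)` gives `k`, while `hamilton(j, i)` gives `-k`. How QC-GAN arranges these products inside its Conformer blocks is not described in this digest.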

Steerable Cultural Preference Optimization of Reward Models

arXiv cs.AI · Minsik Oh, Advit Deepak, Sophie Wu, Douwe Kiela · 2026-06-17

The paper introduces Steerable Cultural Preference Optimization (SCPO), a reward model training algorithm designed to align large language models (LLMs) with diverse cultural sub-communities while mitigating bias. SCPO incorporates balanced cultural preferences through a weighting method, achieving up to 7-point performance gains for minority reward models on the PRISM and GlobalOpinionQA datasets across 7 countries. The method demonstrates 280% greater training data efficiency than full-data finetuning, and a bias analysis confirms reduced bias toward specific sub-communities.

reward model · cultural alignment · preference optimization · bias mitigation · data efficiency

MIDS: Detecting Stealthy Masquerade and Tampering Attacks on CAN Bus via Bidirectional Mamba

arXiv cs.AI · Qiqi Liu, Runhan Song, Lei Cui, Heng Zhang · 2026-06-17

The Mamba Intrusion Detection System (MIDS) introduces a dual-stream framework for detecting stealthy masquerade attacks on CAN bus, addressing the limitation of existing systems tuned for fabrication-style attacks. MIDS processes CAN identifiers and payloads in parallel, reconstructing their joint temporal semantics through bidirectional selective state-space modelling. Evaluated on a dataset of over 100 million CAN frames from a Tesla Model 3 and 54 masquerade attack variants, MIDS achieves an F1 score of 96.94%, outperforming baselines by over 8 percentage points with a 1.147 ms inference latency. Generalization tests on four public benchmarks show F1 scores ranging from 93.70% to 99.61%, surpassing baselines by up to 13.94 percentage points.

can bus · masquerade attacks · intrusion detection · state-space modelling · temporal semantics

Optimizing Lithium Production Decisions under Geological, Demand, and Pricing Uncertainties: A POMDP Framework for Multi-Objective Decision Making

arXiv cs.AI · Anna C. Edmonds, Mansur M. Arief, Robert J. Moss, Mykel J. Kochenderfer · 2026-06-17

The study proposes a POMDP framework for optimizing lithium production decisions under geological, demand, and pricing uncertainties, incorporating extraction technology choices. Using belief state planning methods, the approach dynamically adapts to various lithium price regimes (static, linear, exponential, stochastic) and outperforms human-inspired heuristics. Results demonstrate improved demand fulfillment and balanced economic-environmental outcomes across different pricing and deposit scenarios through optimal sequencing of exploration, production, and technology selection.

pomdp · belief state planning · lithium extraction · multi-objective optimization · uncertainty management
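
Belief state planning rests on a Bayes-filter update of the probability distribution over hidden states. The sketch below shows one such update for a discrete POMDP. It is an illustration under stated assumptions, not the paper's model: the function name `belief_update`, the two-state "rich/poor deposit" example, and the 0.8 sensor accuracy are all hypothetical.

```python
def belief_update(belief, transition, obs_likelihood):
    """One Bayes-filter step for a discrete POMDP.

    belief[s]           prior probability of hidden state s
    transition[s][s2]   P(s2 | s, action) for the action that was taken
    obs_likelihood[s2]  P(observation | s2, action) for the observation received
    """
    n = len(belief)
    # Prediction step: push the prior through the transition model.
    predicted = [
        sum(belief[s] * transition[s][s2] for s in range(n)) for s2 in range(n)
    ]
    # Correction step: weight by how well each state explains the observation.
    unnormalized = [predicted[s2] * obs_likelihood[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / total for u in unnormalized]


# Hypothetical example: the deposit is "rich" (index 0) or "poor" (index 1)
# and does not change; an exploration action reports "high grade" with
# probability 0.8 if the deposit is rich and 0.2 if it is poor.
static = [[1.0, 0.0], [0.0, 1.0]]
posterior = belief_update([0.5, 0.5], static, [0.8, 0.2])
```

A planner would choose exploration, production, or technology actions based on this posterior rather than on the unobserved true state.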

Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep

arXiv cs.AI · Amama Mahmood, Bokyung Kim, Honghao Zhao, Molly E. Atwood · 2026-06-17

This work introduces an LLM-powered conversational voice diary for sleep tracking, demonstrating improved adherence and richer contextual data compared to traditional text-based methods. The system employs proactive smart-speaker prompts and adaptive dialogue to administer clinically validated sleep diary questions, and was evaluated in a 4-week between-subjects study with 30 participants. The voice diary produced higher adherence rates and more detailed self-reports about sleep-related factors, but lower completeness for structured fields, revealing a trade-off between expressive richness and data precision in voice-based health reporting.

llm-powered · conversational voice diary · adaptive dialogue · sleep tracking · adherence rates

Benchmarking Action Spaces in Reinforcement Learning for Vision-based Robotic Manipulation

arXiv cs.AI · Seyed Alireza Azimi, Homayoon Farrahi, Abhishek Naik, Colin Bellinger · 2026-06-17

This study benchmarks action space representations in reinforcement learning for vision-based robotic manipulation, demonstrating their impact on motion smoothness, safety, and task performance. The authors evaluate pose increment, pose velocity, joint position increment, and joint velocity across object picking and pushing tasks, training policies in simulation and deploying them via sim-to-real transfer. Results indicate that joint velocity action space outperforms others in terms of smoothness and task success. Practical guidance is provided for RL practitioners on selecting action spaces for simulation and real-world applications.

reinforcement learning · action space · sim-to-real transfer · vision-based manipulation · joint velocity

Dual Dimensionality for Local and Global Attention

arXiv cs.AI · Zhiyuan Wang, Xuan Luo, Sirui Zeng, Xifeng Yan · 2026-06-17

The paper introduces Distance-Adaptive Representation (DAR), a method for decoder-only Transformers that adapts key and value dimensionality based on token distance from the prediction target. DAR maintains full-dimensional representations for local tokens within a context window while reducing dimensionality (e.g., 1/4) for distant tokens, hypothesizing that local tokens require richer representations for immediate predictions. Evaluated across pretraining scales (70M to 410M parameters) and fine-tuning on a 1B-scale model, DAR matches full-dimensional baseline performance, whereas uniform dimensionality reduction degrades results. This challenges the assumption of uniform key-value dimensionality and suggests adaptive allocation of representational capacity for KV cache efficiency.

distance-adaptive representation · kv cache · decoder-only transformers · dimensionality reduction · context window

APT: Atomic Physical Transitions for Causal Video-Language Understanding

arXiv cs.AI · Shang Wu, Haoran Lu, Songling Liu, Chenwei Xu · 2026-06-17

The paper introduces Atomic Physical Transitions (APTs), minimal state changes that explain physical events through causal sequences rather than clip-level labels. APTs are constructed from mixed human and simulator data covering 14 transition types (27,303 instances). Current VLMs show poor APT recall (≤14%), and direct fine-tuning causes event-level forgetting. The proposed APT-Tune method combines image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned decoding, improving APT recall and event-level transfer with only 11M LoRA parameters on Qwen3-VL-2B.

atomic physical transitions · video-language models · causal supervision · parameter-efficient fine-tuning · physical state changes

Multi-Modal Hyper-Graph Fusion for Low-Light Crowd Counting

arXiv cs.AI · Hao-Yuan Ma, Li Zhang, Yushi Qiu, Jie Gao · 2026-06-17

The paper introduces LCNet, a unified framework for low-light crowd counting, addressing the underexplored challenge of unreliable RGB representations in dark environments. The method leverages multi-modal hyper-graph fusion to integrate RGB appearance, depth geometry, and Canny edge structure cues via dynamic hyperedge construction and message passing, alongside a deformable rectangular sparse attention (DRSA) module for adaptive computation allocation. Evaluated on three new benchmarks (SHA_Dark, SHB_Dark, LC-Crowd), LCNet outperforms state-of-the-art methods, with datasets to be released upon acceptance.

low-light crowd counting · multi-modal hyper-graph fusion · deformable rectangular sparse attention · retinex-based modeling · depth geometry priors

Correcting Sensor-Induced Distribution Drift with Wasserstein Adversarial Learning

arXiv cs.AI · Saraa Ali, Vladimir Bocharnikov, Fedor Ratnikov, Mikhail Hushchyn · 2026-06-17

The authors propose an unsupervised Wasserstein-GAN-based method for correcting sensor-induced distribution drift in data acquisition systems. The approach employs a generator as a learnable calibration transformation, with trainable weights representing physically interpretable transformation parameters, while a critic provides distributional distance signals via the Wasserstein objective. Validation on a tracking-detector toy model and Geant4-simulated calorimeter data demonstrates accurate recovery of aging coefficients for individual cells, with high correlation to ground truth, and improved agreement between calibrated and reference energy-sum distributions. The method shows expected performance degradation under increasing channel-to-channel noise levels, indicating its potential for data-driven calibration in label-scarce settings.

wasserstein-gan · distribution drift · calibration transformation · geant4 simulation · unsupervised learning

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

arXiv cs.AI · Patrick Cooper, Alvaro Velasquez · 2026-06-17

DeFAb introduces a verifiable benchmark for defeasible abduction in foundation models, converting decades of knowledge bases into 372,648+ instances with polynomial-time verifiable gold standards. The method pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to generate instances requiring valid derivation, conservativity, and minimality checks. Results show frontier models struggle with defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%, and chain-of-thought variance exceeds inter-model gaps. DeFAb-Hard and CONJURE variants further test model capabilities, with symbolic methods achieving 100% accuracy versus best model performance of 53.3%.

defeasible abduction · knowledge bases · polynomial-time verification · taxonomic hierarchies · behavioral property graphs

Engagement Intensity as a Learner-Modeling Signal for Adaptive AI Ethics Instruction

arXiv cs.AI · Yongkyung Oh, Lynn Talton, Alex Bui · 2026-06-16

The study demonstrates that self-reported LLM usage frequency outperforms prior AI education and self-rated familiarity as a predictor of baseline AI perceptions in adaptive ethics instruction. Analyzing 93 bioscience trainees through intake surveys, researchers compared three features against five perception outcomes. Usage frequency showed Holm-corrected associations with all outcomes, familiarity with three, and prior education with none. Results reveal threshold effects at lower usage levels, particularly for training interest and accuracy trust. The findings support using behavioral signals for lightweight learner profiling in AI ethics curricula.

adaptive instruction · learner modeling · llm familiarity · intake profiling · holm correction
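
The Holm correction referenced in this entry is a step-down procedure for controlling the family-wise error rate across several tests. A minimal sketch follows; the function name `holm_reject` and the example p-values are illustrative and do not come from the study.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure.

    Returns a list of booleans, aligned with p_values, marking which
    hypotheses are rejected at family-wise error rate alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with
        # alpha/(m-1), and so on; stop at the first failure.
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


# Made-up p-values for three outcomes.
decisions = holm_reject([0.01, 0.04, 0.03])
```

Here only the first hypothesis is rejected: 0.01 passes the 0.05/3 threshold, but 0.03 fails the 0.05/2 threshold, so the procedure stops and 0.04 is never tested against 0.05.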

CEO-Bench: Can Agents Play the Long Game?

arXiv cs.AI · Haozhe Chen, Karthik Narasimhan, Zhuang Liu · 2026-06-16

The paper introduces CEO-Bench, a novel benchmark evaluating language model agents' ability to perform long-horizon, multi-faceted real-world tasks by simulating startup management over 500 days. Agents interact via a Python interface to handle pricing, marketing, budgeting, and strategic decision-making while processing noisy business data and adapting to dynamic conditions. Results show limited success: only Claude Opus 4.8 and GPT-5.5 maintained balances above the $1M starting capital, with neither achieving consistent profitability, highlighting current limitations in sustained strategic reasoning.

language model agents · long-horizon tasks · strategic adaptation · noisy environment · benchmark evaluation

AI Sandboxes: A Threat Model, Taxonomy, and Measurement Framework

arXiv cs.AI · Inderjeet Singh, Haitham Mahmoud, Andrés Murillo · 2026-06-16

The paper proposes a formal framework for AI sandboxes, addressing assurance challenges in digital, embodied, and cyber-physical systems. It develops a threat model, taxonomy, and measurement framework, formalizing sandbox boundaries via weakest-link composition of evidence and distinguishing major archetypes. Results include a cyber-physical threat model that covers attacks on the assurance apparatus itself, plus metrics for fidelity, controllability, observability, containment, and reproducibility, validated through three case studies. The framework clarifies sandbox testing validity, risk containment, and evidence generation for safety and regulatory assurance.

ai sandboxes · cyber-physical systems · threat model · assurance apparatus · containment metrics

Sparsity Curse: Understanding RLVR Model Parameter Space from Model Merging

arXiv cs.AI · Chenrui Wu, Zexi Li, Jiajun Bu, Jiangchuan Liu · 2026-06-16

The study identifies a 'sparsity curse' in Reinforcement Learning with Verifiable Reward (RLVR) models, where sparse parameter updates form near-orthogonal shortcuts, making model merging fragile compared to Supervised Fine-Tuning (SFT). The authors propose Sensitivity-aware Resolving Merging (SAR-Merging), which resolves conflicts via Fisher Information-based arbitration, sparsification, and rescaling. Experiments on mathematical and coding benchmarks show SAR-Merging outperforms existing methods, enabling single-task enhancement and multi-capability fusion in RLVR models.

rlvr · model merging · sparsity curse · fisher information · parameter space

As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture

arXiv cs.AI · Marcos Abel Zuzuárregui, Stefano Carpin · 2026-06-16

The paper extends an LLM-based mission planner for precision agriculture by introducing formal verification via linear temporal logic (LTL) to resolve natural language ambiguities. The architecture employs two distinct commercial LLMs for specification generation and verification, creating multiple feedback loops to ensure plan correctness. Experimental results demonstrate the system's effectiveness in generating valid LTL formulas and highlight challenges in autonomous verification pipelines.

linear temporal logic · mission planning · precision agriculture · formal verification · llms

PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization

arXiv cs.AI · Arshia Ilaty, Hossein Shirazi, Manasi Chitale, Kedar Hegde · 2026-06-16

PSyGenTAB introduces a privacy-preserving framework for synthetic clinical tabular data generation via constrained optimization, addressing the privacy-utility trade-off in healthcare AI. The method formulates data generation as a constrained optimization problem solved using the Augmented Lagrangian Method, embedding configurable privacy constraints during training. Evaluations on clinically motivated benchmarks show preserved inter-feature relationships and minority-class patterns, with downstream models achieving comparable performance to those trained on real data. Privacy audits confirm reduced record reproduction and resilience to membership inference attacks.

synthetic data generation · constrained optimization · augmented lagrangian method · privacy-utility trade-off · membership inference attacks
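
To illustrate the Augmented Lagrangian Method named in this entry, the sketch below solves a deliberately tiny equality-constrained problem: minimize x² + y² subject to x + y = 1. This toy objective, the penalty weight, and the step sizes are assumptions chosen for clarity. PSyGenTAB's actual objective and privacy constraints are not specified in this digest.

```python
def augmented_lagrangian(rho=10.0, outer_iters=10, inner_iters=200, lr=0.04):
    """Minimize x^2 + y^2 subject to x + y = 1 (toy problem).

    Inner loop: gradient descent on the augmented Lagrangian
        L = x^2 + y^2 + lam * c + (rho / 2) * c^2,  with c = x + y - 1.
    Outer loop: multiplier update lam <- lam + rho * c.
    """
    x, y, lam = 0.0, 0.0, 0.0
    for _ in range(outer_iters):
        for _ in range(inner_iters):
            c = x + y - 1.0
            grad_x = 2.0 * x + lam + rho * c
            grad_y = 2.0 * y + lam + rho * c
            x -= lr * grad_x
            y -= lr * grad_y
        # Raise or lower the multiplier according to the remaining violation.
        lam += rho * (x + y - 1.0)
    return x, y, lam


x_opt, y_opt, multiplier = augmented_lagrangian()
```

The iterates approach x = y = 0.5 with multiplier -1. The multiplier update is what lets the constraint be met without driving the penalty weight to infinity, which is the usual reason for preferring this method over a pure penalty term.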

Neural Phase Correlation

arXiv cs.AI · Cole Reynolds · 2026-06-16

The paper introduces a learned generalization of phase correlation that extends beyond rigid transformations by learning the decomposition basis. Unlike dominant learning-based methods that encode images independently, this approach directly models inter-image relationships in the Fourier domain. The method handles dense non-rigid deformations and unitary dynamics, demonstrating competitive performance on the ACDC cardiac-MRI benchmark and matching state-of-the-art on CAMUS echocardiography without auxiliary mechanisms. Additionally, it successfully recovers Hermite-function eigenstates and quantized energy levels from quantum harmonic oscillator observations.

phase correlation · fourier domain · non-rigid deformations · unitary dynamics · hermite-function eigenstates
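
As background, classical phase correlation (the technique this paper generalizes) recovers a rigid shift from the phase of the cross-power spectrum. The one-dimensional sketch below uses a naive DFT so that it needs only the standard library. It shows the classical method on a circularly shifted signal, not the learned variant from the paper, and the function names are illustrative.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def phase_correlation_shift(a, b, eps=1e-12):
    """Estimate s such that b[t] == a[(t - s) % n] for a circular shift."""
    spec_a, spec_b = dft(a), dft(b)
    # Cross-power spectrum, normalized so that only phase remains.
    cross = [bk * ak.conjugate() for ak, bk in zip(spec_a, spec_b)]
    normalized = [c / (abs(c) + eps) for c in cross]
    # The inverse transform of a pure phase ramp is a peak at the shift.
    corr = dft(normalized, inverse=True)
    return max(range(len(a)), key=lambda t: corr[t].real)


signal = [1.0, 2.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0]
shifted = [signal[(t - 3) % len(signal)] for t in range(len(signal))]
estimated = phase_correlation_shift(signal, shifted)
```

Because the fixed Fourier basis only diagonalizes translations, this classical form is limited to rigid shifts, which is the limitation the summary says a learned decomposition basis is meant to lift.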

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

arXiv cs.AI · Siddharth Aphale, Kelly Liu · 2026-06-16

The study identifies a failure mode in supervised fine-tuning (SFT) where checkpoint selection based on pass@1 leads to rank inversion in group relative policy optimization (GRPO) due to entropy collapse. Analyzing SFT depth ladders for Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B, the authors demonstrate that pre-RL entropy positively correlates with GRPO outcomes (ρ=+0.69) and propose a two-stage diagnostic combining pre-RL entropy triage with early GRPO entropy monitoring to flag high-risk checkpoints. Results show GRPO pass@10 drops from 0.806 to 0.481 for Qwen2.5-Coder-3B, while DeepSeek-Coder-6.7B maintains pass@1 above p*(8)=0.083 without rank inversion.

supervised fine-tuning · entropy collapse · group relative policy optimization · rank inversion · checkpoint selection
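
Two quantities in this entry, pass@k and entropy, are easy to pin down in code. The sketch below uses the widely used unbiased pass@k estimator and plain Shannon entropy over a probability vector. Whether the paper computes either quantity exactly this way is an assumption, and the function names are illustrative.

```python
from math import comb, log


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Equals 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the samples contains no correct one.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * log(p) for p in probs if p > 0.0)


# A peaked distribution has lower entropy than a flat one over the same support.
peaked = entropy([0.97, 0.01, 0.01, 0.01])
flat = entropy([0.25, 0.25, 0.25, 0.25])
```

A checkpoint whose output distribution has collapsed toward the peaked case can still score well on pass@1 while leaving RL little diversity to explore, which is consistent with the rank inversion the summary describes.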

MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data

arXiv cs.AI · Subhankar Ghosh, Jason Li, Paarth Neekhara, Shehzeen Hussain · 2026-06-16

MagpieTTS-LF enables long-form speech generation without long-form training data through three inference-time innovations: (1) soft attention priors preserving bidirectional context while guiding monotonic alignment, (2) stateful inference maintaining cross-chunk prosodic continuity, and (3) history-aware text encoding for discourse-level prosody planning. Evaluations demonstrate significant improvements over baselines in long-range intelligibility (17.8% relative gain), prosodic coherence (23.4% MOS increase), speaker consistency, and boundary naturalness, while using the original MagpieTTS model parameters.

text-to-speech · long-form generation · attention priors · prosodic continuity · stateful inference

Structured Representation Learning with Locally Linear Embeddings and Adaptive Feature Fusion

arXiv cs.AI · Somjit Nath, Jackson J Cone, Derek Nowrouzezahrai, Samira Ebrahimi Kahou · 2026-06-16

The paper introduces a reinforcement learning framework that disentangles dynamics-specific and reward-specific features, inspired by neural manifold representations. The method combines locally linear embeddings (LLEs) to capture locally linear environmental structure with an attention-based gating mechanism for adaptive feature fusion. Evaluations on benchmark tasks show improved learning efficiency and performance over conventional RL approaches, demonstrating benefits of biologically inspired local structure modeling and dynamic feature integration.

reinforcement learning · locally linear embeddings · neural manifolds · adaptive gating · feature disentanglement

What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy

arXiv cs.AI · Truong Xuan Khanh · 2026-06-16

The study identifies the logit scale as the proximal variable mediating grokking delays under cross-entropy loss, with weight norms acting as an upstream control mechanism. Through weight norm clamping and output temperature variation across a grid of configurations, the authors demonstrate that grokking delays collapse onto logit scale alone (R² = 0.97), with norms contributing only 1-2% additional variance. Experiments under mean-squared error reveal a distinct mechanism, while controls including a float64 softmax-collapse audit and a no-LayerNorm transformer confirm the findings. Results are reproducible from released code and data.

grokking · logit scale · weight norm · cross-entropy · softmax saturation
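
One way to see why weight norm and output temperature can both act through the logit scale is shown below. For a bias-free linear output layer, multiplying the weights by a factor multiplies the logits by the same factor, and that has the same effect on the softmax as dividing the temperature by it. This is a simplified illustration, not the paper's experimental setup, and the numbers are made up.

```python
from math import exp


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an output temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]
scale = 4.0  # hypothetical factor by which the output weights were scaled

# Scaling the logits by `scale` matches using temperature 1 / scale.
scaled_up = softmax([scale * z for z in logits])
low_temperature = softmax(logits, temperature=1.0 / scale)
baseline = softmax(logits)
```

A larger logit scale also pushes the softmax toward saturation, so the top probability in `scaled_up` is higher than in `baseline`.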

Veriphi: Attack-Guided Neural Network Verification with Dataset-Dependent Training Methods

arXiv cs.AI · Pratik Deshmukh, Kartik Arya, Vasili Savin · 2026-06-16

Veriphi introduces a GPU-accelerated neural network verification system combining adversarial attacks and formal bound certification via α,β-CROWN methods. It evaluates three training methodologies (standard, adversarial, certified) on MNIST and CIFAR-10, demonstrating dataset-dependent effectiveness. Interval Bound Propagation achieves 78% certified accuracy on MNIST but negligible performance on CIFAR-10, where PGD adversarial training dominates with 94% certification at small perturbations. Veriphi achieves a 5x verification speedup through attack-guided falsification and scales to production-size models (105.8M parameters) for aerospace logistics optimization. The results challenge the assumption that certified training is universally preferable to adversarial training and indicate that the verification strategy should be chosen per dataset and context.

neural network verification · interval bound propagation · adversarial training · alpha,beta-crown · dataset-dependent
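
Interval Bound Propagation, one of the certified-training methods compared in this entry, pushes an input box through the network layer by layer. The sketch below handles one affine layer followed by ReLU in plain Python. It is a generic illustration of the technique, not Veriphi's code, and the function names and example weights are assumptions.

```python
def affine_bounds(lower, upper, weights, bias):
    """Propagate elementwise bounds [lower, upper] through y = W x + b."""
    out_lower, out_upper = [], []
    for row, b in zip(weights, bias):
        # Center and radius form: the center passes through W, while the
        # radius passes through |W|.
        center = sum(w * (lo + hi) / 2.0 for w, lo, hi in zip(row, lower, upper)) + b
        radius = sum(abs(w) * (hi - lo) / 2.0 for w, lo, hi in zip(row, lower, upper))
        out_lower.append(center - radius)
        out_upper.append(center + radius)
    return out_lower, out_upper


def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to each bound directly."""
    return [max(0.0, lo) for lo in lower], [max(0.0, hi) for hi in upper]


# Inputs in the box [0, 1] x [0, 1]; a single unit computing x0 - x1.
pre_lower, pre_upper = affine_bounds([0.0, 0.0], [1.0, 1.0], [[1.0, -1.0]], [0.0])
post_lower, post_upper = relu_bounds(pre_lower, pre_upper)
```

The bounds are sound but loosen as depth grows. That looseness is one commonly cited reason IBP-based certification is harder on larger networks and datasets, though the summary does not state why it underperforms on CIFAR-10.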

TMR-GGNN: Credit Card Fraud Detection based on Time-Aware Multi-Relational Guided Graph Neural Network

arXiv cs.AI · Rohit Tewari, Shubhankar Shilpi, Navin Chhibber, Devendra Singh Parmar · 2026-06-16

The paper introduces TMR-GGNN, a Time-aware Multi-Relational Guided Graph Neural Network for credit card fraud detection. The framework extends encoder-decoder GNNs by modeling heterogeneous interactions (customers, merchants, devices, IPs) with temporal windows and a relational attention mechanism. It employs contrastive learning to distinguish real vs. synthetic transactions and combines InfoNCE-based contrastive loss with Focal Loss to address class imbalance. Results show improved fraud identification and reduced false negatives through dynamic graph construction and temporal proximity weighting.

graph neural network · contrastive learning · focal loss · infonce · fraud detection
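
The two loss terms named in this entry can be written compactly. The sketch below gives a scalar InfoNCE loss for one positive pair against a set of negatives, a binary focal loss, and a weighted sum of the two. The temperature, the focusing parameter `gamma`, the mixing `weight`, and the function names are illustrative assumptions. The digest does not say how TMR-GGNN combines the terms.

```python
from math import exp, log


def info_nce(pos_score, neg_scores, temperature=0.1):
    """InfoNCE for one anchor: -log softmax probability of the positive."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)
    log_denominator = peak + log(sum(exp(z - peak) for z in logits))
    return log_denominator - logits[0]


def focal_loss(p_fraud, label, gamma=2.0):
    """Binary focal loss; requires 0 < p_fraud < 1.

    The (1 - p_t) ** gamma factor shrinks the loss on easy, well-classified
    examples so that the rare fraud class is not swamped.
    """
    p_t = p_fraud if label == 1 else 1.0 - p_fraud
    return -((1.0 - p_t) ** gamma) * log(p_t)


def combined_loss(pos_score, neg_scores, p_fraud, label, weight=1.0):
    """Hypothetical weighted sum of the contrastive and classification terms."""
    return info_nce(pos_score, neg_scores) + weight * focal_loss(p_fraud, label)
```

With `gamma=0` the focal loss reduces to ordinary cross-entropy, which makes the effect of the focusing term easy to check in isolation.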

CAOA -- Completion-Assisted Object-CAD Alignment

arXiv cs.AI · Hiranya Garbha Kumar, Minhas Kamal, Balakrishnan Prabhakaran · 2026-06-16

CAOA introduces a novel method for precise 9-DoF CAD-to-object alignment in indoor RGB-D scans by integrating a semantically-aware point cloud completion module with a symmetry-aware pose estimation algorithm. The approach addresses challenges of noisy scans and segmentation errors through a synthetic data generation strategy that reduces the synthetic-to-real domain gap. A new dataset, S2C-Completion, comprising 8,500 expert-annotated object-CAD pairs, is released as a benchmark. CAOA demonstrates a 17% accuracy improvement over state-of-the-art methods on the Scan2CAD benchmark.

point cloud completion · symmetry-aware loss · synthetic-to-real domain gap · 9-dof alignment · scan2cad benchmark

From Specification to Execution: AI Assisted Scientific Workflow Management

arXiv cs.AI · Komal Thareja, Hamza Safri, Rajiv Mayani, Anirban Mandal · 2026-06-16

The paper introduces an AI-assisted scientific workflow management system that integrates specification-driven generation, automated debugging, and distributed execution. The method employs a structured specification phase separating intent, design, and implementation, validated before code generation, alongside an LLM-based debugging agent for multi-layer failure resolution. Integration with Pegasus WMS via a Model Context Protocol enables unified workflow control. In an evaluation on federated learning for medical imaging, the system generated and executed workflows with thousands of jobs, reduced debugging effort, and let non-experts apply expert-level patterns.

scientific workflow management · llm-based debugging · pegasus wms · model context protocol · federated learning

A Variational Framework for LLM Generator-Regulator Games

arXiv cs.AI · Quanyan Zhu · 2026-06-16

The paper introduces a variational framework for regulated language generation, modeling generator-regulator interactions as a saddle-point problem. It derives the induced distribution over complete messages from autoregressive token sampling, relating it to an entropy-regularized Gibbs law, and formulates regulation via an optimal discriminator whose convex-dual value is an f-divergence. The framework addresses applications including moderation, censorship, and phishing defense, focusing on distributions over messages rather than single outputs. Theoretical analysis clarifies tradeoffs among utility, entropy, regulatory alignment, and detectability. Case studies on censorship filtering and phishing defense demonstrate evaluation through utility, entropy, divergence, receiver-side scores, and detection probability.

variational framework · autoregressive sampling · entropy-regularized gibbs · f-divergence · saddle-point problem

Searching for Synergy in Shared Workspace Human-AI Collaboration

arXiv cs.AI · Nachiket Kotalwar, Rohini Das, Carolyn Rose · 2026-06-16

The paper investigates coordination dynamics in shared-workspace human-AI teams, demonstrating that unstructured collaboration can degrade performance despite increased capability. Using the Collaborative Gym environment with DiscoveryBench tasks across 1,482 sessions, the authors analyze process loss and introduce scaffolding combining shared group memory with human-in-the-loop gates for action approval. Results show structured coordination (particularly in three-agent teams) improves mean performance through clearer responsibility signaling and expertise routing, highlighting the critical role of integration mechanisms alongside raw capability.

human-ai collaboration · shared workspace · process loss · human-in-the-loop · coordination scaffolding

Deep-Learning-Based Pixelated Microwave Filter Design and Characterization using Electro-Optical Electric-Field Measurements

arXiv cs.AI · Han Zhou, Richard Bannister, Caspar Pierce, Haojie Chang · 2026-06-16

A deep learning approach combining convolutional neural networks and genetic algorithms automates pixelated microwave filter synthesis, overcoming limitations of traditional iterative design methods. The method was experimentally validated using S-parameter and spatial electric-field measurements. The synthesized low-pass filter achieved a 7 GHz passband with over 20 dB suppression beyond 9.5 GHz, demonstrating excellent agreement between simulated and measured performance. Electro-optical measurements revealed electric field patterns resembling coupled transmission-lines or stub structures, providing novel insights into AI-generated designs.

convolutional neural networks · genetic algorithms · microwave filter · electro-optical measurements · s-parameter

Deep Learning-Driven Inverse Design of Doherty Power Amplifiers Using Pixelated Combiners and Dual-State Impedance Synthesis

arXiv cs.AI · Han Zhou, Haojie Chang, David Widen, Christian Fager · 2026-06-16

The paper introduces a deep learning-driven inverse design methodology for Doherty power amplifier combiners, integrating convolutional neural networks, pixelated layout representations, and genetic algorithms with dual-state impedance synthesis. The approach simultaneously addresses peak and back-off power conditions through a three-port combiner design. Two GaN HEMT prototypes demonstrate measured performance with >44.2 dBm saturated output power, >71.2% peak drain efficiency (2.6-2.8 GHz), and 64% efficiency at 6-dB back-off. Post-digital predistortion yields adjacent channel leakage ratio below -51.3 dBc.

doherty power amplifier · pixelated combiner · dual-state impedance synthesis · convolutional neural networks · genetic algorithms

Learning-Based Decision Making for Combustion Phasing Control in Multi-Fuel CI Engines with Latent Fuel Reactivity Estimation

arXiv cs.AI · Rajasree Sarkar, Aditya Satish Patil, Arunava Banerjee, Ihsan Berk Altiner · 2026-06-16

The paper proposes a GRU-guided RL framework for combustion-phasing control in multi-fuel compression-ignition engines with latent fuel-reactivity variation. The method formulates CA50 regulation as a partially observable sequential decision problem, comparing LinUCB, DDPG variants, and a novel approach that jointly learns fuel-reactivity estimation (via GRU) and control policy. Evaluated on a Gaussian-process surrogate trained from experimental data, the framework achieves 0.25° CA mean absolute tracking error by conditioning actor-critic networks on inferred reactivity rather than oracle cetane number, outperforming myopic bandits and generic recurrent RL.

combustion-phasing control · partially observable mdp · gru-guided rl · cetane number estimation · multi-fuel engine

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

arXiv cs.AI · Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang · 2026-06-16

LLMZero introduces an LLM-agent-based system for discovering adaptive RL post-training strategies through tree search, diagnosing training pathologies and proposing coordinated parameter transitions. The method reveals a structural principle: capacity parameters accumulate monotonically while regularization parameters oscillate, enabling non-stationary exploration-exploitation tradeoffs. Evaluated on 4 GRPO tasks, LLMZero outperforms base models by 9%-140%, grid search by 6%-15%, and random search, with discovered strategies exhibiting transferable parameter dynamics across tasks.

rl post-training · llm agents · tree search · parameter dynamics · exploration-exploitation

CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

arXiv cs.AI · Sneha Rao, Shaina Raza, Dhanesh Ramachandram · 2026-06-16

CaVe-VLM-CoT introduces a modular reflection-based agentic-RAG framework to address hallucinations in Vision-Language Models (VLMs) through a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier. The framework enforces evidence-grounded reasoning by routing verification failures back to retrieval for correction and proposes CaVeScore, a composite metric evaluating retrieval quality, step-wise citation faithfulness, and cross-modal grounding. Without architectural or prompt modifications, CaVe-VLM-CoT achieves 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU (30 subjects).

vision-language models · evidence-grounded reasoning · agentic-rag · cavescore · cross-modal grounding

RankGraph-2: Lifecycle Co-Design for Billion-Node Graph Learning in Recommendation

arXiv cs.AI · Renzhi Wu, Zikun Cui, Junjie Yang, Tai Guo · 2026-06-16

RankGraph-2 introduces a co-designed framework for billion-node graph learning in recommendation systems, jointly optimizing graph construction, representation learning, and real-time serving. The method employs popularity-bias-corrected edge subsampling (reducing edges from trillions to billions), pre-computes multi-hop neighborhoods via personalized PageRank, and co-learns a residual-quantization cluster index to reduce serving costs by 83%. Evaluations show 3.8× higher recall than GAT + Deep Graph Infomax on bipartite graphs and 2.1× improvement over PyTorch-BigGraph on item retrieval, with real-world gains of +0.96% CTR and +2.75% CVR across 20+ Meta deployments.

graph-based retrieval · personalized pagerank · residual-quantization · similarity-based retrieval · bipartite graph
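
Personalized PageRank, which this entry says is used to pre-compute multi-hop neighborhoods, can be sketched with a short power iteration. The version below is a generic textbook form on a tiny hypothetical user-item graph. The restart probability, iteration count, and node names are assumptions, and the production-scale approximations RankGraph-2 would need are not shown.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=100):
    """Power iteration for personalized PageRank.

    adj[u] lists the out-neighbors of u; every node is assumed to have at
    least one. With probability alpha the walk restarts at `seed`.
    """
    nodes = list(adj)
    scores = {u: 1.0 if u == seed else 0.0 for u in nodes}
    for _ in range(iters):
        nxt = {u: (alpha if u == seed else 0.0) for u in nodes}
        for u in nodes:
            share = (1.0 - alpha) * scores[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        scores = nxt
    return scores


# Hypothetical bipartite graph: two users and two items.
graph = {
    "u1": ["i1", "i2"],
    "u2": ["i2"],
    "i1": ["u1"],
    "i2": ["u1", "u2"],
}
ppr = personalized_pagerank(graph, seed="u1")
neighborhood = sorted((n for n in ppr if n != "u1"), key=ppr.get, reverse=True)
```

Ranking nodes by score from a seed yields a multi-hop neighborhood ordered by random-walk proximity, so the top entries can be cached ahead of serving time.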

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

arXiv cs.AI · Haocheng Zhang, Zhuqian Zhou, Kirk Vanacore, Bakhtawar Ahtisham · 2026-06-16

The paper introduces a fully local AI cascade framework for de-identifying educational dialogue transcripts, addressing the tradeoff between governance and accuracy in handling personally identifiable information (PII). The method combines a recall-first union proposer with lightweight encoders and deterministic rules to generate candidate spans, followed by a context-aware reviewer making binary Redact/Keep decisions. Evaluated on math tutoring transcripts, the framework achieves 0.958 macro F1, outperforming same-family LLM-only baselines (0.767) and a commercial API (0.706), while operating entirely on a single laptop. The approach demonstrates robustness in handling curricular-personal name ambiguity, with minimal performance degradation.

de-identification · educational dialogue · privacy triage · lightweight encoders · context-aware reviewer

Guava: An Effective and Universal Harness for Embodied Manipulation

arXiv cs.AI · Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi · 2026-06-16

Guava introduces a harness framework for embodied manipulation, systematically exploring design spaces of agent workflows, action spaces, and observation spaces. Key design principles include iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. The framework distills embodied capabilities into a 4B open-source model using <2K simulated trajectories, achieving performance comparable to proprietary models with strong generalization to novel objects, instructions, and long-horizon tasks in both simulation and real-world environments.

embodied manipulation · vision-language models · action abstraction · multimodal observation · sim-to-real transfer

SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents

arXiv cs.AI · Yuchuan Tian, Mengyu Zheng, Haocheng Mei, Ye Yuan · 2026-06-16

SafeClawBench introduces a staged benchmark for evaluating security failures in tool-using LLM agents, separating semantic attack acceptance (9.0%-44.2% failure rates), audit-visible harm evidence, and sandbox-observed tool/state harm. The benchmark comprises 600 adversarial tasks across six attack families, evaluated on five agent endpoints under four prompt-level policies. Results show divergent failure modes: 291 of 347 sandbox harms occurred in tasks passing semantic checks, with prompt policies exhibiting model- and protocol-dependent effects.

tool-using agents · prompt injection · audit evidence · sandbox harm · adversarial evaluation

Self-CTRL: Self-Consistency Training with Reinforcement Learning

arXiv cs.AI · Itamar Pres, Laura Ruis, Melat Ghebreselassie, Belinda Z. Li · 2026-06-16

Self-CTRL introduces a reinforcement learning method for improving consistency between language models' self-explanations and behavior, either by updating explanations to predict behavior or vice versa. The approach is evaluated on two tasks: probabilistic reasoning, where it increases the correlation between self-reported and measured biases from R²=0.24 to 0.64, matching ground-truth supervision; and constitutional AI, where it improves auditor prediction accuracy from 36% to 92% and reduces HarmBench failure rates from 15.0% to 0.5%. The method enhances model transparency and safety without excessive refusal on harmless prompts.

self-consistency training · reinforcement learning · language models · behavioral alignment · transparency

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

arXiv cs.AI · Nick Bettencourt, Xiaowei Ding, Kay Giesecke · 2026-06-16

The Stanford EDGAR Filings Dataset (SEFD) addresses the scarcity of clean long-context documents for LLM pretraining by reconstructing SEC filings into layout-faithful MultiMarkdown format. The dataset includes audited financial statements, risk disclosures, and market-moving event filings, optimized for token efficiency and minimal overlap with Common Crawl-derived corpora. SEFD-v1 comprises 152B tokens, with a larger archive estimated at 550B tokens across 18.5M filings. Two benchmarks are introduced: EDGAR-Forecast evaluates numerical forecasting grounded in filings post-knowledge cutoff, while EDGAR-OCR assesses transcription of complex financial tables.

long-context · multimarkdown · token-efficient · common crawl · knowledge cutoff

Agentra: A Supervisable Multi-Agent Framework for Enterprise Intrusion Response

arXiv cs.AI · Raj Patel, Shaswata Mitra, Michele Guida, Stefano Iannucci · 2026-06-16

Agentra introduces a supervisable multi-agent framework for enterprise intrusion response, addressing delays in static playbook approaches. The system decomposes response reasoning across role-scoped agents, validates plans via a Planner-Validator loop, screens threat intelligence through a Moderator gateway, and gates actions via risk scoring. Evaluated on a 120-event corpus against OASIS CACAO v2.0, Agentra improves FP-aware F1 from 0.61 to 0.84 while maintaining 0.0% harmful-action rates, demonstrating enhanced coverage without compromising auditability.

intrusion response system · multi-agent framework · mitre att&ck · risk scoring · auditability

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

arXiv cs.AI · Jasmine Brazilek, Joel Christoph, Miles Tidmarsh, Carol Kline · 2026-06-16

The paper introduces TAC (Travel Agent Compassion), the first agentic benchmark evaluating AI models' avoidance of animal exploitation in travel booking scenarios. The benchmark tests twelve scenarios across six exploitation categories, expanded to forty-eight samples controlling for confounds. Evaluation of seven frontier models shows all scoring below chance (64%), with Claude Opus 4.7 top at 53%. Adding welfare-aware system prompts improved performance by 12-63 percentage points depending on model. Auxiliary audits suggest low scores are not due to evaluation awareness.

agentic benchmark · animal welfare · frontier models · system prompt · evaluation awareness

Why SWAVE May Not Be All You Need: A Concept-Evolution Retrospective on Complex-Valued Recurrent Language Models

arXiv cs.AI · Ramprasath Ganesaraja, Swathika N, Sahil Dilip Panse · 2026-06-16

The paper presents SWAVE, a complex-valued recurrent language model (169.26M parameters) trained on FineWeb-Edu, designed to encode language as complex waves with a Cayley-parameterised unitary transition for stable long-context processing. Key architectural evolutions include replacing the Resonance Head with Phase-Associative Memory to avoid cos-domination collapse, retaining ComplexNorm and Wave Propagation Scan, and simplifying the ComplexGatedUnit. The model achieved a best-step perplexity of 22.0 at step 89,861, with stable training over 200,000 steps. The study formalizes cos-domination collapse, introduces a log-space backward pass for numerical stability, and provides six engineering principles for complex-valued recurrent training.

complex-valued recurrent · cos-domination collapse · phase-associative memory · wave propagation scan · cayley-parameterised unitary

Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

arXiv cs.AI · Jingyuan Huang, Zuming Huang, Yucheng Shi, Tianze Yang · 2026-06-16

The paper proposes quality-aware self-distillation (QASD) for vision-language models in GUI grounding tasks, addressing unreliable teacher signals in on-policy self-distillation (OPSD). The method combines soft correctness-aware gating—which down-weights teacher signals when student-generated prefixes deviate from ground-truth coordinates—with teacher-probability scaling to calibrate remaining signal strength. Experiments on six benchmarks demonstrate that both components are necessary for performance gains, with QASD consistently outperforming base models and baselines in coordinate prediction accuracy.

gui grounding · self-distillation · vision-language models · coordinate prediction · teacher signals

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

arXiv cs.AI · Mingyue Cui, Linghui Shen, Xingyi Yang · 2026-06-16

The study demonstrates that Sparse Autoencoder (SAE) feature interventions, while effective at suppressing unwanted behaviors in language models, are unreliable due to post-intervention recovery. Through constrained residual-space optimization, the authors show that models can recover pre-intervention behaviors while maintaining clamped SAE feature values, using encoder-orthogonal updates and feature-map Jacobians. Experiments on TPP, unlearning, IOI, and refusal steering reveal a 95.8% recovery rate in refusal-steering tasks with minimal feature drift (0.131), attributing recovery to SAE reconstruction residuals. This exposes a critical gap between feature-level control and behavioral completeness in SAE-based safety interventions.

sparse autoencoders · residual-space optimization · feature intervention · post-intervention recovery · refusal steering

ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots

arXiv cs.AI · Ethan Chew, Enjia Wu, Iruss Eng Wei Yeow, Ian Weiqin Lim · 2026-06-16

The paper presents ASTRA, a scalable ATCO training simulator that automates simpilot roles through an end-to-end pipeline for speech transcription, instruction interpretation, and response generation. The system employs locally adapted voice models and fine-tuned ASR to address Western-centric speech model limitations, reducing WER from 107.80% to 23.45% on Singaporean-accented aviation speech. ASTRA also features an AI-assisted evaluation framework for trainee radiotelephony, achieving 91.7% accuracy, 88.2% brevity, and 86.9% completeness scores, while leveraging open-source tools like DSPy and Unsloth.

automatic speech recognition · air traffic control · voice adaptation · performance evaluation · simulator pipeline
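
Illustrative sketch (not from the paper): the baseline WER of 107.80% quoted above may look like a typo, but word error rate is the word-level edit distance divided by the reference length, so inserted words can push it past 100%. The phrases below are made-up examples, not data from ASTRA.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# One substituted word out of four.
mild = word_error_rate("climb flight level three", "climb flight level tree")

# A two-word reference transcribed as six unrelated words.
severe = word_error_rate("descend now", "the send down to a level")
```

Here `mild` is 0.25 and `severe` is 3.0, i.e. a WER of 300%.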

Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States

arXiv cs.LG · Denis Peskoff, Joe Barrow, Christopher Vu, Diag Davenport · 2026-06-17

The paper introduces LOCUS (Local Ordinance Corpus for the United States), a comprehensive machine-readable corpus of U.S. municipal and county ordinance codes, addressing a critical gap in legal AI resources. The corpus aggregates codes from 9,239 jurisdictions, with a harmonized access layer covering 2,309 counties, processed using OCR to handle diverse document formats. The authors train ModernBERT-based classifiers to analyze dimensions like legal opacity and paternalism, enabling novel large-scale studies of local law. The dataset and models are publicly released on Hugging Face.

legal ai · local ordinances · ocr · modernbert · corpus harmonization

The Chandra-Gaia Catalog of Counterparts: Resolving ambiguous Gaia matches to X-ray sources in the Chandra Source Catalog using Machine Learning

arXiv cs.LG · V. Samuel Pérez-Díaz, Vinay L. Kashyap, Joshua D. Ingram, David Fouhey · 2026-06-17

The authors present a machine learning framework to resolve ambiguous cross-matches between the Chandra Source Catalog (CSC v2.1) and Gaia Data Release 3, leveraging source properties beyond spatial proximity. A gradient-boosted classifier (LightGBM) is trained on a high-confidence training set generated using NWAY, incorporating features such as magnitudes, colors, and distances. The method identifies counterparts for ~113k of ~254k X-ray sources, resolves ~7k cases with multiple plausible counterparts, and detects ~20k chance coincidences. Validation on the Chandra Orion Ultradeep Project achieves 95% agreement with NWAY matches without positional information. The released catalog supports future population studies of Chandra-Gaia detectable sources.

cross-matching · gradient-boosted classifier · chandra source catalog · gaia data release · nway

Diffusion-Proof: Recipe for Formal Theorem Proving Beyond Auto-Regressive Generation

arXiv cs.LG · Ruida Wang, Rui Pan, Pengcheng Wang, Shizhe Diao · 2026-06-17

The paper introduces Diffusion-Proof, the first framework applying diffusion-based Large Language Models (dLLMs) to formal theorem proving, addressing limitations of auto-regressive models in long-range coherence. The framework trains two 7B-parameter models: dLLM-Prover-7B for whole-proof generation and dLLM-Corrector-7B for local proof correction via bi-directional infilling. Experiments show Diffusion-Proof outperforms auto-regressive baselines, achieving absolute improvements of 1.61% on ProofNet-Test and 6.14% on MiniF2F-Test, while solving an IMO problem unsolved by DeepSeek-Prover-V2-7B.

diffusion llms · formal theorem proving · auto-regressive generation · proof correction · long-range coherence

P-K-GCN: Physics-augmented Koopman-enhanced Graph Convolutional Network for Deep Spatiotemporal Super-resolution

arXiv cs.LG · Xizhuo Zhang, Zekai Wang, Fei Liu · 2026-06-17

The authors propose P-K-GCN, a Physics-augmented Koopman-enhanced Graph Convolutional Network for spatiotemporal super-resolution on irregular geometries. The method combines a continuous spline-based GCN for spatial dependency extraction with Koopman operator theory for linearizing temporal dynamics in a latent space, augmented by physics-based loss constraints. Theoretical analysis demonstrates reduced super-resolution error through diminished Rademacher complexity and tighter generalization bounds. Evaluated on 3D cardiac electrodynamics reconstruction, P-K-GCN outperforms baseline models in accuracy.

graph convolutional network · koopman operator · spatiotemporal super-resolution · rademacher complexity · generalization bounds

Optimal scenario design for climate emulation

arXiv cs.LG · Christopher B. Womack, Shahine Bouabid, Andrei Sokolov, Popat Salunke · 2026-06-17

The study introduces a method to optimize training datasets for climate emulators by maximizing predictive skill through scenario design. Using a differentiable Simple Climate Model (SCM), the authors calculate emulator loss sensitivity to training data perturbations, iteratively updating scenarios to improve generalization. Results show that a single optimized scenario outperforms training on six standard ScenarioMIP pathways, achieving higher skill while isolating distinct physical behaviors of climate forcing agents. The method also improves emulator performance when applied to intermediate-complexity climate models, suggesting dynamically rich scenarios offer greater value than traditional emissions pathways for climate emulation.

climate emulation · scenario design · differentiable scm · predictive skill · scenariomip

Confidence is Not Reliability: Rethinking MC Dropout in Brain Tumour Segmentation

arXiv cs.LG · Xin Ci Wong, Duygu Sarikaya, Kieran Zucker, Marc De Kamps · 2026-06-17

The study demonstrates that Monte Carlo (MC) Dropout uncertainty estimation, while effective in ranking erroneous voxels (AUROC ≈0.97 for entropy), fails to ensure clinical safety in glioma segmentation due to region-specific calibration issues. Evaluating two models (SegResNet and UNet-Res) on 126 BraTS21 patients, MC Dropout preserved segmentation accuracy (|ΔDice|<0.01) but revealed severe miscalibration in clinically critical sub-regions, particularly for UNet-Res (enhancing tumour entropy=0.054, ECE=0.915). These findings highlight the necessity of supplementing AUROC with sub-region-specific calibration assessments for clinical model deployment.

monte carlo dropout · glioma segmentation · uncertainty estimation · calibration error · voxel-level analysis
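
Illustrative sketch (not from the paper): the summary above contrasts low predictive entropy with high expected calibration error (ECE). The two quantities can be computed as follows for a binary foreground/background setting. The bin count and the toy numbers are assumptions; the paper's exact binning and sub-region handling are not given in the summary.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli prediction with foreground probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def mc_dropout_summary(samples):
    """Mean prediction and its entropy over T stochastic forward passes."""
    mean_p = sum(samples) / len(samples)
    return mean_p, binary_entropy(mean_p)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += len(b) / total * abs(avg_conf - accuracy)
    return ece


# Four voxels predicted with 95% confidence, only one of them correct.
ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 0, 0, 0])
```

In this toy case the model is confident (low entropy) and mostly wrong, so ECE comes out at about 0.70. That is the failure pattern a ranking metric such as AUROC would not expose.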

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

arXiv cs.LG · Nikita Kachaev, Andrey Moskalenko, Matvey Skripkin, Nikita Kurlaev · 2026-06-17

The study introduces Act2Answer, a protocol for evaluating commonsense and world knowledge retention in Vision-Language-Action (VLA) models by adapting Vision-Language Model (VLM) benchmarks to require agents to answer through object-placement actions. This method reduces control confounds by measuring action-grounded success rates across diverse knowledge categories. Layerwise intent probing localizes answer-relevant information across model layers. Evaluating 7 VLA models and 9 VLM baselines reveals that VLAs perform well on simple concepts but show gaps in richer semantic categories compared to source VLMs, with VQA co-training improving knowledge retention and answer-relevant signals peaking in middle layers.

vision-language-action models · commonsense knowledge · layerwise intent probing · action-grounded evaluation · vqa co-training

Risk Stratification for ICU Delirium using Pervasive Ambient Sensing Information

arXiv cs.LG · Jiaqing Zhang, Sabyasachi Bandyopadhyay, Miguel Contreras, Jessica Sena · 2026-06-17

This study introduces a novel approach for ICU delirium risk stratification using ambient sensing data, demonstrating that sound and light features can independently predict delirium onset across multiple time horizons. The authors evaluated four sequential neural network models on data from 309 patients across 9 ICUs, focusing on 10 prediction-window sizes. Using Shapley Additive Explanations for interpretability, they found that a convolutional model achieved the highest discrimination (AUC = 0.80) on sound and combined data. Sound features emerged as dominant predictors, with combined sound-light data improving short-term (<1 week) prediction accuracy. These results indicate that passive ambient sensing provides clinically meaningful signals for delirium risk estimation.

ambient sensing · risk stratification · shapley additive explanations · convolutional model · intensive care unit

Beyond Algorithms: Conceptual Innovation in Medical Imaging AI

arXiv cs.LG · Mark A. Anastasio · 2026-06-17

The paper critiques the overemphasis on algorithmic innovation in medical imaging AI, advocating for greater recognition of conceptual innovation that reframes problem definitions, evaluation metrics, and clinical relevance. Through case studies, it demonstrates how inadequate conceptual grounding leads to misaligned objectives and limited real-world impact. The author proposes structural changes in research incentives, training, and publishing to better integrate conceptual contributions alongside technical advances.

medical imaging · algorithmic innovation · conceptual innovation · clinical relevance · evaluation metrics

Structured Inference with Large Language Gibbs

arXiv cs.LG · Sanghyeok Choi, Henry Gouk, Esmeralda S. Whitammer · 2026-06-17

The paper introduces Large Language Gibbs, a method for structured probabilistic inference using LLMs' conditional distributions as transition operators. Unlike single-pass autoregressive generation, it iteratively resamples variables via LLM conditionals, avoiding order-dependent biases and achieving a stationary distribution balanced across local conditionals. Applications include sampling synthetic distributions, consistent reasoning tasks, and Bayesian structure learning. Results indicate LLM-conditioned MCMC is a viable alternative to one-pass generation for structured inference under noisy LLM-derived priors.

large language gibbs · structured probabilistic inference · mcmc · llm conditionals · bayesian structure learning
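
Illustrative sketch (not from the paper): the iterative resampling idea above is ordinary Gibbs sampling with the conditional supplied by a model. In the toy version below, `stub_conditional` is a hypothetical hard-coded table standing in for an LLM query, and the two binary variables are invented for illustration.

```python
import random


def stub_conditional(var, other_value):
    """P(var = 1 | other variable). Hypothetical stand-in for an LLM query."""
    return 0.8 if other_value == 1 else 0.2


def gibbs(n_sweeps, conditional=stub_conditional, seed=0):
    """Resample x given y, then y given x, and record the state after each sweep."""
    rng = random.Random(seed)
    state = {"x": 0, "y": 0}
    samples = []
    for _ in range(n_sweeps):
        state["x"] = int(rng.random() < conditional("x", state["y"]))
        state["y"] = int(rng.random() < conditional("y", state["x"]))
        samples.append((state["x"], state["y"]))
    return samples


samples = gibbs(20000)
agreement = sum(x == y for x, y in samples) / len(samples)
marginal_x = sum(x for x, _ in samples) / len(samples)
```

These two conditionals are consistent with a joint distribution in which x and y agree 80% of the time and each is 1 half of the time, so `agreement` and `marginal_x` should land near 0.8 and 0.5. Neither variable is privileged by generation order, which is the contrast with single-pass autoregressive sampling.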

Detecting Hidden ML Training With Zero-Overhead Telemetry

arXiv cs.LG · Robi Rahman, Sabiha Tajdari · 2026-06-17

The study presents a robust method for detecting hidden ML training workloads using zero-overhead telemetry from NVIDIA Management Library (NVML), addressing adversarial evasion attempts. The approach leverages content-agnostic signals to monitor physical GPU effects without accessing sensitive model data. Evaluated across 9 GPU models and 20 evasion strategies, the classifier achieves 98.2% binary accuracy for training workload detection and maintains 43-87% accuracy against adversarially disguised workloads.

gpu telemetry · adversarial robustness · workload classification · nvml · zero-overhead monitoring

SCAN: Enhance Time Series Anomaly Detection via Multi-Scale Neighborhood-Centered Clustering

arXiv cs.LG · Xingze Zheng, Hanyin Cheng, Siyuan Wang, Yiting Hao · 2026-06-17

The paper proposes SCAN, a method enhancing time series anomaly detection through multi-scale neighborhood-centered clustering. It addresses over/under-generalization in reconstruction-based approaches by integrating cluster center representations of normal patterns at the representation level and combining cluster membership probability with reconstruction error at the anomaly criterion level. Multi-view clustering improves performance via neighborhood-centered representations. Experiments across diverse real-world datasets demonstrate state-of-the-art results.

time series anomaly detection · reconstruction-based methods · multi-scale clustering · cluster membership probability · neighborhood-centered representations

Acceleration of an algebraic multigrid pressure solver using graph neural networks

arXiv cs.LG · Eric Chillón, Artur K. Lidtke, Nguyen Anh Khoa Doan, Bernat Font · 2026-06-17

The paper introduces a data-driven algebraic multigrid (AMG) smoother using a modified graph convolutional isomorphism network (GCIN) to accelerate pressure-Poisson equation solving in unstructured flow simulations. The method predicts optimal polynomial coefficients for a sparse pseudo-inverse operator, adapting to local grid anisotropies while preserving solver linearity. Benchmarks show wall-clock speedups of 4-37%, reduced V-cycle counts, and robust generalization to meshes 128× larger than training data, including the AirfRANS dataset.

algebraic multigrid · graph neural networks · pressure-poisson equation · unstructured grids · sparse pseudo-inverse

Transformer Geometry Observatory TGO-I: Spectral Geometry Observatory

arXiv cs.LG · Kaustubh Kapil, Kishor P. Upla · 2026-06-17

The Transformer Geometry Observatory (TGO-I) introduces a systematic framework to analyze the spectral geometry of Vision Transformers (ViTs), focusing on ViT-Small/16 trained on ImageNet-100. The study examines metrics like Effective Rank, Spectral Entropy, and anisotropy, revealing increased dimensional utilization and decreasing anisotropy during training. Contrary to expectations, variance redistributes across dimensions rather than concentrating in dominant directions, with the CLS token showing the highest effective dimensionality and lowest anisotropy.

vision transformers · spectral geometry · effective rank · anisotropy · cls token
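
Illustrative sketch (not from the paper): spectral entropy and effective rank, two of the metrics named above, are commonly defined from the normalized singular values of a representation matrix, with effective rank taken as the exponential of the entropy. Whether TGO-I uses exactly these definitions is an assumption; the singular values below are made up.

```python
import math


def spectral_entropy(singular_values):
    """Shannon entropy (nats) of the singular values normalized to sum to 1."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    return -sum(p * math.log(p) for p in probs)


def effective_rank(singular_values):
    """exp(spectral entropy): ranges from 1 (one direction) to the full dimension."""
    return math.exp(spectral_entropy(singular_values))


# Variance spread evenly over four directions versus concentrated in one.
spread = effective_rank([1.0, 1.0, 1.0, 1.0])
concentrated = effective_rank([1.0, 0.0, 0.0, 0.0])
```

An evenly spread spectrum gives an effective rank close to 4 and a fully concentrated one gives 1, so a rising value during training corresponds to the "increased dimensional utilization" reported in the summary.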

A Human-in-the-Loop Bayesian Optimization Framework for Constraint-Aware Bioprocess Development

arXiv cs.LG · Samuel Stricker, Claus Wirnsperger, Alessandro Butté, Laura Helleckes · 2026-06-17

The paper extends Pareto Front Guided Sampling (PFGS), a Human-in-the-Loop Bayesian Optimization framework, to address constrained and robust optimization in bioprocess development. It incorporates the posterior probability of satisfying output constraints as an explicit Pareto objective, computed analytically from the Gaussian process posterior, and employs Monte Carlo sampling to estimate expected lower-confidence performance under input perturbations. The framework visualizes trade-offs between performance, uncertainty, constraint satisfaction, and robustness via pairwise projections on an interactive dashboard. Demonstrated on an eight-dimensional CHO cell culture simulator, it systematically identifies high-performing, feasible, and resilient operating conditions, leveraging expert-defined criteria for resource allocation.

bayesian optimization · gaussian process · pareto front · constrained optimization · robust optimization
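
Illustrative sketch (not from the paper): the two quantities described above can be written down for a one-dimensional input. With a Gaussian posterior, the probability of meeting an upper limit has a closed form, and the expected lower-confidence bound under input noise can be estimated by Monte Carlo. `toy_posterior`, `beta`, and the noise model are hypothetical; the paper's eight-dimensional simulator and GP are not reproduced.

```python
import math
import random


def prob_feasible(mean, std, upper_limit):
    """P(g <= upper_limit) for a Gaussian posterior g ~ N(mean, std^2), std > 0."""
    z = (upper_limit - mean) / std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def robust_lcb(posterior, x, input_std, beta=2.0, n_samples=1000, seed=0):
    """Monte Carlo estimate of E[mean - beta * std] under Gaussian input noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        perturbed = x + rng.gauss(0.0, input_std)
        mean, std = posterior(perturbed)
        total += mean - beta * std
    return total / n_samples


def toy_posterior(x):
    """Hypothetical surrogate: peak performance at x = 1 with constant uncertainty."""
    return -(x - 1.0) ** 2, 0.1


p_ok = prob_feasible(mean=0.0, std=1.0, upper_limit=1.96)
nominal = robust_lcb(toy_posterior, 1.0, input_std=0.0)
robust = robust_lcb(toy_posterior, 1.0, input_std=0.5)
```

In this toy setup `robust` is lower than `nominal` because a sharp optimum loses performance once the input is perturbed. That gap is the trade-off a dashboard of pairwise projections would make visible.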

Generalised Eigenvalue Geometry of Semantic Adversarial Attacks

arXiv cs.LG · Martin Anthony, Kaveh Salehzadeh Nobari · 2026-06-17

The paper develops a continuous local model for semantic adversarial attacks, where paraphrases preserve meaning under a reference embedding but alter classifier predictions. The authors derive an attackability index λ*(x) based on the largest generalized eigenvalue of a matrix pencil (A,B) constructed from Jacobians of embedding maps, providing closed-form prediction-flip conditions for affine readouts. Theoretical contributions include distribution-free VC bounds for binary attackability indicators and margin bounds adjusted by local geometric penalties. The framework connects continuous theory to discrete paraphrase search and proposes empirical verification using soft-token relaxations and generated paraphrase sets.

semantic adversarial attacks · generalized eigenvalue · matrix pencil · attackability index · vc bound
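
Illustrative note (not from the paper): for a symmetric matrix $A$ and a symmetric positive-definite matrix $B$, the largest generalized eigenvalue of the pencil $(A, B)$ has the standard Rayleigh-quotient form below. How the paper builds $A$ and $B$ from the Jacobians is not stated in the summary. The reading in the second line, with $J_f$ the classifier-side Jacobian and $J_e$ the reference-embedding Jacobian, is an assumption offered only to show why such a ratio could measure attackability.

```latex
% Generalized eigenproblem and its variational characterisation
% (standard result for symmetric A and symmetric positive-definite B).
A v = \lambda B v,
\qquad
\lambda_{\max}(A, B) \;=\; \max_{v \neq 0} \frac{v^{\top} A v}{v^{\top} B v}

% Assumed construction (hypothetical, not confirmed by the summary):
% A = J_f^{\top} J_f, \; B = J_e^{\top} J_e, so that
\lambda_{\max}(A, B) \;=\; \max_{v \neq 0}
\frac{\lVert J_f v \rVert^{2}}{\lVert J_e v \rVert^{2}}
```

Under that assumed reading, a large value means some input direction changes the classifier output a lot while barely moving the reference embedding, i.e. a meaning-preserving perturbation that is likely to flip the prediction.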

Learning to Annotate Delayed and False AEB Events: A Practical System for Extreme Class Imbalance and Asymmetric Label Noise

arXiv cs.LG · Mengxiang Hao, Xin Jiang, Xinghao Huang, Wenliang Su · 2026-06-17

The authors present an automated annotation framework for rare delayed/false Autonomous Emergency Braking (AEB) triggers, addressing extreme class imbalance (<5% minority samples) and asymmetric label noise. Key innovations include targeted data augmentation through focal attribute manipulation and ego-vehicle dynamics transplantation, coupled with noise suppression via stable hardness estimation and probe-guided adaptive thresholding. Deployed as a full-stack system, it achieves 80% recall improvement for delayed/false triggers while reducing manual workload by 50%, enabling continuous AEB optimization through accumulated high-quality annotations.

autonomous emergency braking · class imbalance · asymmetric label noise · data augmentation · hardness estimation

AGDN: Learning to Solve Traveling Salesman Problem with Anisotropic Graph Diffusion Network

arXiv cs.LG · Bolin Shen, Ziwei Huang, Zhiguang Cao, Yushun Dong · 2026-06-17

The paper introduces AGDN (Anisotropic Graph Diffusion Network), a Graph Neural Network framework for solving the Traveling Salesman Problem (TSP). The method addresses two key challenges: uninformative topological priors in fully connected TSP graphs and loss of optimal solution nodes during graph sparsification. AGDN employs a MixScore transition matrix combining node similarity with pairwise distance, and an anisotropic graph diffusion strategy for multi-hop information exchange. Experiments demonstrate AGDN's superior performance across various instance sizes and node distributions, with strong generalization to unseen problem scales. Computational efficiency remains competitive with existing methods.

graph neural network · combinatorial optimization · anisotropic diffusion · traveling salesman problem · graph sparsification

When AUC Misleads: Polarization-Aware Evaluation of Deepfake Detectors under Domain Shift

arXiv cs.LG · Dat Nguyen, Cosmin Radoi, Romain Hermary, Marcella Astrid · 2026-06-17

The paper introduces Cross-AUC, a novel metric for evaluating deepfake detectors under domain shift, addressing limitations of traditional AUC measurements. Cross-AUC combines per-domain AUCs with prediction polarization quantified via Wasserstein Distance between class score distributions, offering improved interpretability and robustness assessment. Experiments across seven benchmark datasets validate its effectiveness in capturing generalization performance amid diverse data sources and manipulation types.

cross-auc · domain shift · wasserstein distance · deepfake detection · generalization
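
Illustrative sketch (not from the paper): the two ingredients named above, per-domain AUC and the Wasserstein distance between class score distributions, can be computed as follows. How Cross-AUC combines them is not given in the summary, so no combined score is attempted here. The scores are made-up numbers.

```python
import bisect


def auc(real_scores, fake_scores):
    """Probability that a fake sample scores above a real one (ties count half)."""
    wins = sum((f > r) + 0.5 * (f == r) for f in fake_scores for r in real_scores)
    return wins / (len(fake_scores) * len(real_scores))


def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance: integral of |F_a - F_b| over the score axis."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect.bisect_right(a_sorted, left) / len(a)
        cdf_b = bisect.bisect_right(b_sorted, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


# Domain A: classes well separated. Domain B: same ranking, scores bunched together.
real_a, fake_a = [0.1, 0.2], [0.8, 0.9]
real_b, fake_b = [0.48, 0.49], [0.51, 0.52]

auc_a, auc_b = auc(real_a, fake_a), auc(real_b, fake_b)
gap_a, gap_b = wasserstein_1d(real_a, fake_a), wasserstein_1d(real_b, fake_b)
```

Both domains have an AUC of 1.0, yet the class distributions are about 0.7 apart in domain A and about 0.03 apart in domain B. A fixed decision threshold would be far more fragile in B, which is the kind of difference AUC alone hides.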

Complementary Attention Head Pruning for Efficient Transformers

arXiv cs.LG · Yaniv Livertovsky, Shahar Somin, Gonen Singer · 2026-06-17

The paper introduces CAHP (Complementary Attention Head Pruning), a post-hoc framework for Transformer compression that formulates attention head selection as a global graph-theoretical problem. Instead of isolated evaluations, CAHP employs graph-based clustering and information-theoretic distances to preserve topologically diverse, complementary heads, automatically determining pruning thresholds via diminishing marginal performance curves. Evaluations on SST-5 and MNLI show CAHP outperforms baselines, especially in high-compression regimes, while avoiding gradient-based methods' proximity bias by retaining critical intermediate-layer heads.

attention head pruning · transformer compression · graph-based clustering · information-theoretic distance · post-hoc pruning

OpenAnt: LLM-Powered Vulnerability Discovery Through Code Decomposition, Adversarial Verification, and Dynamic Testing

arXiv cs.LG · Nahum Korda, Gadi Evron · 2026-06-17

OpenAnt introduces a multi-stage pipeline for automated vulnerability discovery in large codebases by integrating static analysis with LLM-based reasoning. The system employs three key techniques: codebase decomposition into self-contained analysis units, reducing the analysis surface by up to 97%; adversarial verification through constrained attacker simulation; and dynamic verification via automatically generated exploit environments executed in sandboxed containers. Evaluated on projects including OpenSSL, WordPress, and Flowise, OpenAnt identifies previously unknown vulnerabilities while maintaining manageable analysis cost and reducing false positives. The system is released as open source under the Apache 2.0 license.

vulnerability discovery · static analysis · adversarial verification · dynamic verification · llm-based reasoning

ChronoSurv: A Clinical Pathway-Guided Graph Framework for Multimodal Survival Analysis

arXiv cs.LG · Hugo Miccinilli, Theo Di Piazza · 2026-06-17

ChronoSurv introduces a clinical pathway-guided graph framework for multimodal survival analysis in head and neck cancer, addressing limitations of static fusion and temporally agnostic modeling. The method constructs heterogeneous hierarchical directed graphs representing patient trajectories, incorporating fine-grained to global representations via progression-aware clinical steps and heterogeneous message passing. Evaluated on two public datasets, ChronoSurv achieves state-of-the-art discriminative performance (C-index: 0.72-0.75) with reliable calibration, validated through ablation studies.

survival analysis · heterogeneous graphs · multimodal fusion · clinical pathways · message passing

INDEQS: Informed Neural controlled Differential EQuationS

arXiv cs.LG · Michael Detzel, Gabriel Nobis, Kristiyan Blagov, Juri Schubert · 2026-06-17

The paper introduces INDEQS (Informed Neural controlled Differential EQuationS), a graph-based NCDE method that incorporates prior knowledge of directed graph structure into time series forecasting. The architecture separates inner (node-wise hidden state mixing) and outer (vector field-control interaction) graph-informed operations, offering both constrained and adaptive variants. Evaluation on synthetic advection simulations and real-world tasks (hydrological discharge and PeMS08 traffic) shows outer informedness reduces MAE versus uninformed NCDEs, particularly on large graphs, while inner informedness provides parameter efficiency. Continuous-time decoders outperform discrete counterparts in accuracy and temporal flexibility. Code is publicly available.

neural controlled differential equations · graph-informed learning · spatio-temporal forecasting · adaptive graph convolutions · continuous-time decoders

Giskard: Byzantine Robust and Confidential Aggregation for Large-Scale Decentralized Learning

arXiv cs.LG · Ousmane Touat, César Sabater, Mohamed Maouche, Sonia Ben Mokhtar · 2026-06-17

Giskard introduces a protocol for confidential and Byzantine-robust decentralized aggregation in large-scale learning. The method organizes $n$ parties into a tree of $O(\log n)$-sized committees, employing BGW-style MPC for coordinate-wise approximate median computation via distributed binary search. Theoretical analysis confirms security and confidentiality, while experiments with up to one million participants demonstrate reduced per-party communication complexity and comparable model utility under $n/4$ Byzantine parties.

decentralized learning · byzantine robustness · secure multi-party computation · confidential aggregation · distributed binary search
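
Illustrative sketch (not from the paper): the coordinate-wise approximate median via binary search described above can be shown in the clear, without any cryptography. The search only ever needs the count of values at or below a threshold, which is the kind of aggregate a secure computation could reveal instead of the raw values. The value range, tolerance, and updates below are assumptions, and the committee tree and BGW-style MPC are not modelled.

```python
def approx_median(values, lo, hi, tol=1e-3):
    """Binary search on the value range using only threshold counts.

    Converges to the lower median, assuming it lies within [lo, hi].
    """
    target = (len(values) + 1) // 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # In a confidential protocol this count would be computed securely.
        count = sum(v <= mid for v in values)
        if count >= target:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def coordinate_wise_median(updates, lo=-1.0, hi=1.0):
    """Apply the approximate median independently to each model coordinate."""
    return [approx_median(column, lo, hi) for column in zip(*updates)]


# Three honest parties and one Byzantine party (a quarter of participants).
updates = [
    [0.1, -0.2],
    [0.2, 0.0],
    [0.3, 0.1],
    [0.9, -0.9],  # adversarial update
]
aggregate = coordinate_wise_median(updates)
```

The aggregate stays near `[0.2, -0.2]`, inside the range of honest values for each coordinate, whereas a plain mean would be dragged toward the adversarial update.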

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

arXiv cs.LG · Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han · 2026-06-17

ViGOS introduces a visually grounded on-policy self-distillation (OPSD) framework to address shortcut learning in multimodal large language models (MLLMs). The method decouples perception and reasoning: a student first generates a visual description supervised by an image-only perception teacher, then reasons toward the final answer guided by a privileged reasoning teacher. A reference teacher handles invalid rollouts to maintain output format. Evaluated across vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS preserves OPSD benefits while enhancing image-grounded behavior in shortcut-prone scenarios.

on-policy self-distillation · multimodal large language models · visual description · shortcut learning · perception-reasoning decoupling

JourneyFormer: Encoding Airbnb Guest Journey with Sequence Modeling

arXiv cs.LG · Daochen Zha, Chun How Tan, Xin Liu, Bin Xu · 2026-06-17

JourneyFormer introduces a sequence modeling solution for Airbnb's search ranking, addressing production challenges with long, exploratory guest sequences and sparse booking labels. The method involves careful design choices in guest event selection, ID embeddings, model architecture, and label attribution, alongside optimizations for training and inference efficiency. Deployment results show improved offline ranking metrics and significant business metric gains in online A/B tests across two production surfaces.

sequence modeling · search ranking · id embeddings · label attribution · a/b testing

Smoothness-Based Derandomization of PAC-Bayes Bounds

arXiv cs.LG · Alexandre Lemire Paquin, Brahim Chaib-Draa, Philippe Giguère · 2026-06-17

The paper presents a PAC-Bayes derandomization framework for smooth loss functions, yielding generalization bounds for deterministic predictors. By analyzing the cost of transitioning from Gibbs to deterministic predictors via the Jensen gap class, the method controls this class using Rademacher complexity, resulting in bounds involving parameter Jacobians and Hessians. The approach applies to both bounded and unbounded losses, with specialization to linear predictors and neural networks. A practical regularizer motivated by these quantities is proposed, with experiments on CIFAR-10 demonstrating its behavior across batch sizes for BatchNorm networks.

pac-bayes · derandomization · smooth loss · rademacher complexity · batchnorm

Structure Over Nonlinearity: Explicit Interaction Architectures for Dynamical Learning

arXiv cs.LG · Augusto Sarti · 2026-06-17

The paper proposes structured dynamical units as an alternative to generic nonlinear function approximation for learning dynamical systems. Inspired by wave-based computation, the units employ explicit interaction architectures with internal state, eliminating algebraic loops and enabling fully explicit evaluation without implicit solvers. Experiments on nonlinear system identification show that depth improves representation quality and generalization, even with limited parameter optimization, and that useful dynamical structure emerges prior to substantial parameter tuning. The results demonstrate that structure-first design can effectively replace conventional black-box approaches.

dynamical systems · explicit architectures · wave-based computation · system identification · interaction structure

Context-Aware Optimization of Follow-Up Intervals for Type 2 Diabetes Care Using Markov Decision Processes

arXiv cs.LG · Parisa Lotfibagha, Kristen Miller, William J. Gallagher, Elizabeth B. Selden · 2026-06-17

This study introduces a Contextual Markov Decision Process (CMDP) model to optimize follow-up intervals for Type 2 Diabetes (T2D) care, addressing heterogeneity in patient trajectories. Using EHR data from 22,154 patients, the method combines dimensionality reduction via Principal Component Analysis and clustering to identify two risk-based subpopulations. CMDP-derived policies recommend context-specific intervals (1-12 months), reducing expected cumulative costs by 34.8% for high-risk and 6.4% for low-risk patients compared to fixed-interval benchmarks, demonstrating the efficacy of adaptive, data-driven chronic care management.

contextual markov decision process · type 2 diabetes · electronic health record · principal component analysis · follow-up intervals
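
To make the idea of deriving an interval policy from an MDP concrete, here is a minimal value-iteration sketch on a toy two-state model. Every number in it (states, costs, transition probabilities, discount factor) is invented for illustration and is not taken from the study. A contextual MDP would solve one such problem per patient subpopulation, each with its own costs and transitions.

```python
def value_iteration(states, actions, cost, transition, gamma=0.95, tol=1e-9):
    """Return (V, policy) minimizing expected discounted cumulative cost."""

    def q(s, a, V):
        # Immediate cost plus discounted expected cost-to-go.
        return cost[s][a] + gamma * sum(p * V[s2] for s2, p in transition[s][a].items())

    V = {s: 0.0 for s in states}
    while True:
        new_V = {s: min(q(s, a, V) for a in actions) for s in states}
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            break
    policy = {s: min(actions, key=lambda a: q(s, a, V)) for s in states}
    return V, policy


# Hypothetical toy model: two health states, two follow-up intervals.
states = ["controlled", "uncontrolled"]
actions = ["3mo", "12mo"]
cost = {
    "controlled": {"3mo": 1.0, "12mo": 0.25},    # frequent visits cost more
    "uncontrolled": {"3mo": 2.0, "12mo": 5.0},   # long gaps risk complications
}
transition = {
    "controlled": {
        "3mo": {"controlled": 1.0},
        "12mo": {"controlled": 0.9, "uncontrolled": 0.1},
    },
    "uncontrolled": {
        "3mo": {"controlled": 0.6, "uncontrolled": 0.4},
        "12mo": {"controlled": 0.1, "uncontrolled": 0.9},
    },
}
V, policy = value_iteration(states, actions, cost, transition)
```

In this toy model the resulting policy spaces visits out for the controlled state and shortens them for the uncontrolled one, which is the qualitative behavior an adaptive interval policy aims for.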

Model-Free Reinforcement Learning Control for Resilient Cyber-Physical Systems

arXiv cs.LG · Hugo O. Garcés, Alejandro J. Rojas, Bernardo A. Hernández, Andrés Escalona · 2026-06-17

The paper evaluates model-free reinforcement learning controllers for resilient cyber-physical systems under cyberattacks, including false data injection and denial-of-service attacks. Four reward types—Lyapunov, exponential, progressive, and linear—are analyzed for accuracy, cost, and resilience. Results indicate that Lyapunov reward provides the highest resilience with low tracking error, while exponential reward offers a balanced trade-off under moderate training conditions. Proximal Policy Optimization outperforms Deep Deterministic Policy Gradient with reduced KPI variance. RL-MPCs demonstrate strong steady-state resilience but require longer training, whereas RL-PID controllers achieve faster convergence with less training time.

model-free reinforcement learning · cyber-physical systems · false data injection · proximal policy optimization · lyapunov reward

Quantifying and Auditing LLM Evaluation via Positive--Unlabeled Learning

arXiv cs.LG · Zilong Zhang, Yi-Ting Hung, Lei Ding, Chi-Kuang Yeh · 2026-06-17

The authors propose a geometric auditing framework for LLM evaluation under selective human supervision, formulated as a positive-unlabeled learning problem. The method leverages Partial Optimal Transport to align human-verified positives with a reliable subset of unlabeled outputs in a fixed embedding space, identifying human-consistent preferences and correcting biased judges without retraining. Experiments demonstrate improved alignment with human preferences, increased robustness to presentation biases, and interpretable confidence estimates, offering a scalable alternative to LLM-as-a-judge pipelines.

positive-unlabeled learning · partial optimal transport · llm evaluation · geometric auditing · presentation biases

Adaptive Speech-to-Spike Encoding for Spiking Neural Networks

arXiv cs.LG · Taharim Rahman Anon, Jakaria Islam Emon · 2026-06-17

The paper introduces an adaptive speech-to-spike encoder for Spiking Neural Networks (SNNs) to address the mismatch between continuous acoustic signals and event-driven processing. The method employs a learnable residual encoder jointly trained end-to-end with a Recurrent Leaky Integrate-and-Fire (R-LIF) backbone. On Google Speech Commands v2, it achieves 94.97% accuracy (89.8% with a 35k-parameter variant), outperforming larger baselines. Analysis reveals the encoder learns task-aligned spike representations rather than signal reconstruction. Direct Feedback Alignment (DFA) reaches 91.5% accuracy, demonstrating bio-inspired learning trade-offs.

spiking neural networks · speech-to-spike encoding · recurrent leaky integrate-and-fire · direct feedback alignment · neuromorphic audio
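
For readers unfamiliar with leaky integrate-and-fire dynamics, the sketch below shows how a continuous input is turned into a spike train by a single discrete-time LIF neuron. It is a textbook simplification, not the paper's encoder: there is no learnable residual path and no recurrence, and the decay factor `beta`, the threshold, and the reset-by-subtraction rule are assumptions chosen for illustration.

```python
def lif_encode(inputs, beta=0.9, threshold=1.0):
    """Convert a sequence of input currents into a binary spike train.

    The membrane potential leaks by a factor `beta` each step, integrates the
    input, and emits a spike (then subtracts the threshold) once it reaches
    `threshold`.
    """
    v = 0.0
    spikes = []
    for x in inputs:
        v = beta * v + x          # leak, then integrate
        if v >= threshold:
            spikes.append(1)
            v -= threshold        # soft reset keeps the residual potential
        else:
            spikes.append(0)
    return spikes


spike_train = lif_encode([0.5] * 5)
```

A stronger input crosses the threshold more often, so signal amplitude is mapped to firing rate.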

Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts

arXiv cs.LG · Tho Tran Huu, Huu-Tuan Nguyen, Thien-Hai Nguyen, Nhat-Tri Ho · 2026-06-17

This work provides a geometric and stochastic analysis of discontinuities in Sparse Mixture-of-Experts (SMoE) architectures, which arise from Top-k expert selection. Using measure-theoretic slicing and diffusion process modeling, the authors classify discontinuities by order, establish volume estimates, and derive finite-time probability bounds for discontinuity encounters. Results show lower-order discontinuities dominate in volume and likelihood of encounter. A smoothing mechanism is proposed to softly incorporate experts near discontinuities, with theoretical guarantees of minimal computational overhead and empirical improvements in language and vision tasks.

sparse mixture-of-experts · discontinuity surfaces · measure-theoretic slicing · diffusion process · smoothing mechanism
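
The discontinuity caused by Top-k selection is easy to reproduce with scalar "experts". In the sketch below, an arbitrarily small change in the router logits swaps which expert enters the top 2, and the layer output jumps by a finite amount. The function name and the toy numbers are assumptions for illustration, and the paper's smoothing mechanism is not implemented here.

```python
import math


def topk_moe(logits, expert_outputs, k):
    """Top-k gated mixture: softmax over the k largest logits only."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)                  # for numerical stability
    weights = [math.exp(logits[i] - m) for i in top]
    z = sum(weights)
    return sum(w / z * expert_outputs[i] for w, i in zip(weights, top))


eps = 1e-9
experts = [0.0, 1.0, -1.0]  # scalar expert outputs for one fixed input
# Experts 1 and 2 are almost tied; a tiny perturbation flips which one is kept.
out_a = topk_moe([2.0, 1.0 + eps, 1.0 - eps], experts, k=2)
out_b = topk_moe([2.0, 1.0 - eps, 1.0 + eps], experts, k=2)
jump = abs(out_a - out_b)
```

Here `jump` stays above 0.5 however small `eps` is made, because the selected expert set changes abruptly at the tie.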

Lifecycle-Aware Dynamic Analysis for Secure ML Model Execution

arXiv cs.LG · Gabriele Digregorio, Marco Di Gennaro, Francesco Pastore, Stefano Zanero · 2026-06-17

We propose Moat, a dynamic lifecycle-aware approach for securing ML model execution, addressing limitations of static model-scanning solutions. Moat focuses on host system interactions during well-defined ML lifecycle phases, leveraging structured and predictable execution patterns. We implement Re-Moat and evaluate it on 77,974 Hugging Face models, 31 CVEs, and 334 models from a state-of-the-art dataset, comparing against existing model-scanning solutions. Results demonstrate detection of all evaluated attack classes with near-zero false positives, validating the efficacy of dynamic analysis for ML security.

dynamic analysis · ml lifecycle · model-scanning · host system · false-positive rate

Sumi: Open Uniform Diffusion Language Model from Scratch

arXiv cs.LG · Mengyu Ye, Keito Kudo, Wataru Ikeda, Ryosuke Matsuda · 2026-06-17

We introduce Sumi, a 7B-parameter uniform diffusion language model pretrained from scratch on 1.5T tokens, addressing the lack of large-scale native implementations in this paradigm. Sumi employs uniform diffusion, enabling any token to be updated at any generation step, contrasting with autoregressive and masked diffusion approaches. Evaluation shows competitive performance with autoregressive models on knowledge, reasoning, and coding benchmarks, though Sumi underperforms on commonsense tasks, likely due to its education-heavy data mixture. We release full model weights, checkpoints, and training recipes to facilitate research on scaling behavior, generation dynamics, and controllability of uniform diffusion models.

uniform diffusion · language model · autoregressive · masked diffusion · pretraining

DIPHINE: Diffusion-based $Φ$-ID Neural Estimator

arXiv cs.LG · Simon Pedro Galeano Munoz, Mustapha Bounoua, Giulio Franzese, Pietro Michiardi · 2026-06-17

The authors introduce DIPHINE, a neural estimator for Integrated Information Decomposition ($Φ$ID) that handles continuous non-Gaussian systems via score-based diffusion models. The method jointly estimates all required mutual information terms through a single amortized network, recovering the sixteen $Φ$ID atoms via Möbius inversion. Theoretical analysis shows the synergy-to-synergy atom is hardest to estimate due to integer-valued Jacobians. Experiments demonstrate accurate atom recovery on synthetic benchmarks, outperforming existing mutual information estimators, and successful application to real-world physiological data without distributional assumptions.

integrated information decomposition · diffusion models · mutual information estimation · möbius inversion · neural estimator

Sequential Kernel-based Conditional Independence Testing via Adaptive Betting

arXiv cs.LG · Zheng He, Danica J. Sutherland · 2026-06-17

We propose a robust sequential kernel-based conditional independence test that substantially reduces Type I error inflation while maintaining high power, addressing the fragility of existing sequential Model-X approaches to estimation errors in the conditional distribution. The method combines testing-by-betting with an adaptively optimized Kernel Conditional Independence statistic, employing a normalization scheme and truncate-and-shift calibration strategy. Evaluations on high-dimensional synthetic benchmarks and real-world fairness tasks demonstrate superior performance over existing sequential Model-X methods. Code is publicly available.

conditional independence · kernel conditional independence · model-x · testing-by-betting · type i error
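
The testing-by-betting principle behind such sequential tests can be sketched in a few lines. The version below is a deliberate simplification and not the paper's method: it uses a fixed bet size rather than an adaptively optimized one, and it assumes the payoffs are already bounded in [-1, 1] with non-positive conditional mean under the null (in the paper this role is played by a normalized kernel statistic). The names `betting_test`, `bet`, and `alpha` are illustrative.

```python
def betting_test(payoffs, bet=0.5, alpha=0.05):
    """Sequential test by betting.

    Starts with wealth 1 and multiplies it by (1 + bet * payoff) each round.
    Under the null the wealth is a non-negative supermartingale, so by
    Ville's inequality it reaches 1/alpha with probability at most alpha.
    Returns the (1-indexed) round at which the null is rejected, or None.
    """
    if not 0.0 <= bet <= 1.0:
        raise ValueError("bet must lie in [0, 1] to keep wealth non-negative")
    wealth = 1.0
    for t, g in enumerate(payoffs, start=1):
        if not -1.0 <= g <= 1.0:
            raise ValueError("payoffs must lie in [-1, 1]")
        wealth *= 1.0 + bet * g
        if wealth >= 1.0 / alpha:
            return t
    return None


# Consistently positive payoffs (evidence against the null) lead to rejection.
stop_time = betting_test([0.5] * 30)
# Payoffs that cancel out never accumulate enough wealth.
no_reject = betting_test([0.5, -0.5] * 50)
```

Because the guarantee holds at every stopping time, the test can be monitored continuously without inflating the Type I error.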

FOSC-X: An Extended Framework for Optimal Local Cuts and Non-Horizontal Cluster Selection from Clustering Hierarchies

arXiv cs.LG · Connor Simpson, Ricardo J. G. B. Campello · 2026-06-17

FOSC-X introduces a framework for extracting the top-M globally optimal flat clusterings from hierarchical trees via non-horizontal cuts, enabling discovery of alternative clustering structures. The method employs dynamic programming to efficiently combine locally optimal partial candidates, with polynomial-time complexity for unconstrained cases and linear-time for cluster-count constraints using feasibility bounds. Experiments demonstrate its effectiveness in identifying multiple high-quality solutions overlooked by single-solution approaches.

hierarchical clustering · dynamic programming · optimal cuts · cluster extraction · feasibility bounds
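
The dynamic-programming idea behind non-horizontal cuts can be shown with the single-best case: at each node, either keep the node as one cluster or keep the best selections from its subtrees, whichever scores higher. This sketch covers only that basic recursion, not the top-M extension or the cluster-count constraints described above. The tree, the quality scores, and the function name are made up for illustration.

```python
def best_cut(tree, scores, node):
    """Return (total_score, selected_clusters) for the subtree rooted at `node`.

    `tree` maps a node to its list of children (leaves may be absent) and
    `scores` maps every node to a cluster-quality value.
    """
    children = tree.get(node, [])
    if not children:
        return scores[node], [node]
    child_total, child_selection = 0.0, []
    for child in children:
        total, selection = best_cut(tree, scores, child)
        child_total += total
        child_selection += selection
    if scores[node] >= child_total:
        return scores[node], [node]        # keep this node as a single cluster
    return child_total, child_selection    # descend: the subtrees do better


tree = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1", "B2"]}
scores = {"root": 1.0, "A": 5.0, "A1": 1.0, "A2": 2.0,
          "B": 2.0, "B1": 3.0, "B2": 1.5}
total, clusters = best_cut(tree, scores, "root")
```

The selected clusters here sit at different depths of the hierarchy (`A` at depth 1, `B1` and `B2` at depth 2), which is what distinguishes a local, non-horizontal cut from slicing the tree at a single level.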

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

arXiv cs.LG · Minseo Kim, Minjae Lee, Seunghyuk Oh, Kevin Galim · 2026-06-17

EfficientRollout introduces a system-aware self-speculative decoding framework to accelerate RL rollouts in LLMs. It addresses two key challenges: (i) policy-drafter mismatch due to evolving target policies, and (ii) shifting compute/memory-bound regimes during decoding. The method employs a quantized self-speculative drafter coupled to the policy without separate training, alongside system-aware toggle policies for adaptive speculation. Results show 19.6% and 12.7% reductions in rollout and end-to-end latency, respectively, while maintaining output quality.

speculative decoding · rl rollouts · latency reduction · self-speculative drafter · system-aware adaptation
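
The draft-then-verify loop at the core of speculative decoding can be sketched with toy deterministic "models". This is the generic greedy variant, not EfficientRollout itself: there is no quantized drafter, no sampling-based acceptance rule, and no system-aware toggling, and both model functions are invented stand-ins. What it demonstrates is that the output matches plain greedy decoding with the target model, while the target only has to check the drafter's guesses.

```python
def target_model(tokens):
    """Hypothetical stand-in for the expensive policy model (greedy next token)."""
    return (sum(tokens) * 31 + len(tokens)) % 7


def draft_model(tokens):
    """Hypothetical cheap drafter: agrees with the target most of the time."""
    t = target_model(tokens)
    return t if len(tokens) % 3 else (t + 1) % 7


def greedy_decode(model, prefix, n_tokens):
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(model(out))
    return out


def speculative_decode(target, draft, prefix, n_tokens, k=4):
    out = list(prefix)
    goal = len(prefix) + n_tokens
    while len(out) < goal:
        # 1) The drafter proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) The target verifies them; a real system would score all k
        #    positions in one batched forward pass rather than one by one.
        accepted = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            accepted.append(target(out + accepted))  # all accepted: bonus token
        out.extend(accepted)
    return out[:goal]


spec_out = speculative_decode(target_model, draft_model, [1], 20)
ref_out = greedy_decode(target_model, [1], 20)
```

The speedup in practice depends on how often the drafter agrees with the target, which is why a drafter that drifts away from an evolving policy is a problem for RL rollouts.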

Online Reward-Punishment Learning from Fixed-Channel Perceptual Event Streams without Environment Rewards

arXiv cs.LG · Zirong Li · 2026-06-17

The paper introduces OHIRL, an online reward-punishment learning framework for environments without scalar rewards or evaluative labels. The method decomposes learning into four modules: M_psi (next-packet prediction), D_omega (residual dynamics), C_eta (trajectory evaluation), and B_xi (policy updates). C_eta uses a recovery-positive and persistence-negative orientation, while B_xi learns to infer reward valence from perceptual dimensions like pain or energy. Experiments on a 2x2-XOR packet task show B_xi achieves 0.952 balanced reward-sign accuracy, with policy reaching 0.979 optimal-action accuracy. Controls confirm the necessity of each module and information boundaries.

online learning · reward-punishment learning · perceptual event streams · residual dynamics · trajectory evaluation

Be Your Own Teacher: Steering Protein Language Models via Unsupervised Reward Optimization

arXiv cs.LG · Lanqing Li, Shentong Mo, Yang Yu, Pheng-Ann Heng · 2026-06-17

The paper introduces unsupervised reward optimization for steering protein language models (PLMs) without labeled data, addressing the supervision bottleneck in biomolecular design. The method leverages task-agnostic rewards combining intrinsic model uncertainty and extrinsic semantic consistency from protein representation models. Two offline algorithms, Soft Reward Optimization (SRO) and Binarized Reward Optimization (BRO), maximize these proxy rewards. Experiments show both methods outperform baselines (DPO, KTO) on compositional out-of-distribution prompts, approaching oracle performance across temperatures, model scales, and protein families, while improving coverage in pass@k evaluations.

protein language models · unsupervised reward optimization · biomolecular design · task-agnostic rewards · offline algorithms
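
For context on the pass@k metric mentioned above, the following sketch implements the commonly used unbiased estimator, which computes the probability that at least one of k samples drawn without replacement from n generations is correct, given that c of the n are correct. Whether the paper uses exactly this estimator is an assumption; the summary only states that pass@k evaluations were run.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n generated samples of which c are correct.

    Equals 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n samples contains no correct sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


single_try = pass_at_k(n=10, c=3, k=1)
five_tries = pass_at_k(n=10, c=3, k=5)
```

Larger k rewards coverage: a model that is right only occasionally can still score well if its correct outputs are spread across many prompts.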

Zero-Shot Active Feature Acquisition via LLM-Elicitation

arXiv cs.LG · Binyamin Perets, Natalie Mendelson, Shiran Vainberg, Yehuda Chowers · 2026-06-17

The paper introduces a zero-shot active feature acquisition (AFA) framework that leverages large language models (LLMs) for eliciting domain knowledge without labeled data. The method separates LLM capabilities by eliciting only unary deviations and pairwise co-variations—sufficient statistics for a Markov random field (MRF)—and applies a maximum-entropy closure to address discriminative statistic limitations. Evaluated on an Inflammatory Bowel Disease (IBD) cohort, the framework outperforms LLMs on real labels and extracted beliefs, particularly excelling in top-$k$ identification for challenging cases.

active feature acquisition · markov random field · large language models · zero-shot learning · maximum-entropy

GrapNet: A Programmable Dynamic-Architecture Neural Graph Substrate

arXiv cs.LG · Zirong Li · 2026-06-17

GrapNet introduces a programmable neural graph substrate where the graph structure serves as both architecture and executable program, enabling dynamic editing of relations, freezing subgraphs, and modular composition with conventional neural modules. The framework features child-owned graph nodes with trainable allocation vectors, decoupled structural rules, and execution policies, supporting operations like topology-aware routing and dense snapshot lowering. Evaluated on Split Fashion-MNIST and Split CIFAR-10, GrapNet outperforms dense MLP baselines by 12.08 and 3.81 accuracy points respectively, demonstrating its utility as an editable neural substrate with structural programmability.

neural graph substrate · dynamic architecture · programmable networks · child-owned graph · allocation vector

Some Complexity Results for Robustness Verification for Binarized Neural Networks

arXiv cs.LG · Harshit Goyal, Sudakshina Dutta · 2026-06-17

The paper establishes computational complexity results for verification tasks in Binarized Neural Networks (BNNs). It analyzes two problems: satisfiability and robustness under uniform image occlusion. Through a reduction from the Boolean satisfiability problem (SAT), the authors prove that BNN satisfiability is NP-complete. Additionally, they demonstrate that uniform occlusion induces a piecewise-constant structure in the network output, enabling a polynomial-time algorithm for robustness verification. These results provide fundamental insights into the computational tractability of BNN verification tasks.

binarized neural networks · computational complexity · satisfiability · robustness verification · polynomial-time algorithm

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

arXiv cs.LG · Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf · 2026-06-17

REVES introduces a two-stage iterative framework for test-time scaling in Large Language Models, addressing the misalignment between single-shot optimization and multi-step inference dynamics. The method alternates between online data/prompt augmentation and policy optimization, leveraging intermediate 'near-miss' answers to generate decoupled revision and verification prompts. This approach reduces computational overhead compared to standard multi-turn reinforcement learning. Results show improvements of +6.5 points over RL baselines on LiveCodeBench and state-of-the-art performance on circle packing with a 4B parameter model. The method also generalizes to constraint-satisfaction puzzles like n_queens and mini_sudoku.

test-time scaling · multi-step inference · policy optimization · near-miss answers · constraint-satisfaction puzzles

Anomaly Detection for Sparse and Irregular Multivariate Time Series with Latent SDEs

arXiv cs.LG · Martin Uray, Dominik Geng, Florian Graf, Stefan Huber · 2026-06-17

A generative approach for multivariate time series anomaly detection (MTSAD) using latent stochastic differential equations (SDEs) is proposed, addressing challenges of sparse, irregularly sampled, and partially observed data. The method projects observed time series onto a continuous-time stochastic dynamical system, naturally handling missing observations and capturing cyclic behavior. Evaluated on six anomaly benchmark datasets, the approach outperforms state-of-the-art baselines, demonstrating superior robustness under severe data sparsity conditions. Results indicate latent SDEs as an effective inductive bias for anomaly detection in real-world irregular multivariate time series.

multivariate time series · anomaly detection · stochastic differential equations · irregular sampling · data sparsity

Test-Time Adaptation in Optical Coherence Tomography Using Trajectory-Aligned Time-Independent Flow

arXiv cs.LG · Veit Hucke, Thomas Pinetz, Gregor Reiter, Ursula Schmidt-Erfurth · 2026-06-17

The paper introduces a flow-matching-based test-time adaptation method for optical coherence tomography (OCT) to address domain gaps between test and training data. By generating high-quality surrogate images from noisy inputs through histogram matching to synthetic reference trajectories and removing time conditioning in the network, the method aligns input distributions with expected ones. This approach achieves state-of-the-art performance in segmenting biomarkers for Age-related Macular Degeneration (AMD), demonstrating effectiveness in low-cost OCT devices with inconsistent image quality.

optical coherence tomography · test-time adaptation · flow-matching · histogram matching · age-related macular degeneration

Strategic Feature Selection

arXiv cs.LG · Jivat Neet Kaur, Pratik Patil, Divya Shanmugam, Emma Pierson · 2026-06-17

The paper introduces a formal framework for strategic classification via feature selection, analyzing its interaction with ridge regularization. The authors demonstrate that excluding features solely based on manipulability is suboptimal, providing a fine-grained characterization of feature subset performance under optimal regularization. They develop a practical algorithm for joint feature selection and regularization tuning, validated through a healthcare payments case study. Results show the method offers principled guidance for designing policy levers in strategic decision-making systems.

strategic classification · feature selection · ridge regularization · algorithmic decision-making · healthcare payments

Quantification of Uncertainty with Adversarial Models in Medical Image Segmentation

arXiv cs.LG · Hana Jebril, Thomas Pinetz, Günter Klambauer, Hrvoje Bogunović · 2026-06-17

QUAM-SM introduces a post-hoc framework for pixel-level uncertainty quantification in medical image segmentation, using adversarial search to identify prediction vulnerabilities. The method disentangles epistemic and aleatoric uncertainty by targeting perturbations that expose unstable regions, particularly at pathological boundaries. Evaluated on two public datasets with multi-expert annotations, QUAM-SM outperforms standard and recent uncertainty estimation methods in reliability and boundary sensitivity.

uncertainty quantification · adversarial search · medical image segmentation · epistemic uncertainty · aleatoric uncertainty

Investigating Inductive Biases for Machine Learning Emulation of Sudden Stratospheric Warmings in Idealised Isca Simulations

arXiv cs.LG · Oskar Bohn Lassen, Simon Driscoll, Stephen I. Thomson, Sebastian Schemm · 2026-06-17

The study evaluates how architectural inductive biases affect machine-learning emulation of sudden stratospheric warmings (SSWs) using paired idealised Isca simulations with wave-2 heating perturbations. Testing convolutional, transformer, and graph-based architectures for one-step prediction, results show modest differences during quiet stratospheric conditions but significant divergence during SSW-like variability. Explicit three-dimensional vertical coupling emerges as a critical inductive bias, though Eliassen-Palm flux diagnostics reveal persistent errors in stratospheric wave-driving structure despite low forecast error.

inductive bias · sudden stratospheric warmings · machine-learning emulation · eliassen-palm flux · isca simulations

Approximate Structured Diffusion for Sequence Labelling

arXiv cs.LG · Nicolas Floquet, Joseph Le Roux, Nadi Tomeh · 2026-06-17

The paper introduces Approximate Structured Diffusion, a method that enhances sequence labelling by training a Linear-Chain Conditional Random Field (CRF) conditioned on entire noisy label sequences via diffusion. This approach addresses the limitation of finite decision spans in traditional CRFs, enabling better handling of long-range dependencies. Experimental results demonstrate a 16.5% reduction in POS-tagging error compared to conventional CRF methods.

sequence labelling · conditional random field · diffusion models · long-range dependencies · pos-tagging

Kernel of Partition Paths: A Unified Representation for Tree Ensembles

arXiv cs.LG · Nicolas Mahler · 2026-06-17

The paper introduces Kernel of Partition Paths (KPP), a unified geometric representation for tree ensembles that indexes feature maps by nodes rather than splits, weighted by a path metric for squared-Euclidean embedding. KPP integrates prediction, additive attribution, deterministic Lipschitz robust radius, and uniform Rademacher risk bounds under fixed, honest, or cross-fit conditioning regimes. Theoretical guarantees are conditional on the representation, with deterministic robustness in the KPP metric. Open problems include conjectured fast-rate refinements for regression and classification.

tree ensembles · path metric · squared-euclidean embedding · rademacher risk bounds · lipschitz robustness

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

arXiv cs.LG · Zhilin Huang, Hang Gao, Ziqiang Dong, Yuan Chen · 2026-06-17

The paper introduces Trajectory-Augmented Policy Optimization (TAPO), a self-distillation method that replaces implicit KL-based alignment with explicit micro-reflective trajectories for error correction. TAPO constructs contrastive training trajectories by preserving the model's erroneous reasoning up to failure points, then inserting natural-language diagnoses and corrected reasoning from correct references. The method incorporates difficulty-aware candidate selection and decoupled advantage estimation to maintain on-policy distribution and prevent gradient contamination. Experiments on AIME 2024, AIME 2025, and HMMT 2025 demonstrate consistent improvements over GRPO in reasoning and error-correction effectiveness under equivalent training steps.

self-distillation · trajectory construction · policy optimization · error-correction · contrastive learning

Semantic Robustness Certification for Vision-Language Models

arXiv cs.LG · Peiyu Yang, Paul Montague, Feng Liu, Andrew C. Cullen · 2026-06-17

The paper introduces a novel robustness certification framework for vision-language models (VLMs) under semantic-level transformations. By leveraging VLMs' open-vocabulary capability, the method uses text prompts as semantic proxies to construct parameterized transformations, then analytically characterizes decision boundaries to certify robustness intervals. The approach requires no additional data per variation, enabling practical application. Experiments demonstrate certification capability across diverse semantic variations on synthetic and real-world data.

vision-language models · robustness certification · semantic transformations · decision boundary · open-vocabulary

Identifying Structural Biases from Causal Mechanism Shifts

arXiv cs.LG · Praharsh Nanavati, Jilles Vreeken, David Kaltenpoth · 2026-06-17

The paper introduces a method to identify hidden confounding and selection biases by analyzing causal mechanism shifts across environments. The authors prove that structural biases induce dependent mechanism shifts, formalizing this as a testable mutual information criterion, and propose the StruBI algorithm to classify variables as unbiased, confounded, or selection-biased. Experiments demonstrate StruBI's superior performance over state-of-the-art methods in accurately recovering bias types and affected variable sets on synthetic and real-world data.

causal discovery · hidden confounding · selection bias · mechanism shift · mutual information

Seed-Guided Semi-Supervised Clustering by A-Contrario Anomaly Detection

arXiv cs.LG · Nassir Mohammad · 2026-06-17

The paper proposes a seed-guided semi-supervised clustering framework that reformulates clustering as the dual of anomaly detection using a-contrario statistical reasoning. The Perception algorithm identifies clusters as maximal anomaly-free subsets relative to a uniform randomness null hypothesis, employing expectation-based thresholds (E < 1) for parameter-free outlier rejection. Initialized with minimal user seeds (10–30 per cluster), it iteratively expands groups while isolating noise and unknown clusters. Evaluations on synthetic and real-world datasets (image/text, raw/reduced embeddings) show competitive performance with linear scalability in observations and dimensionality under low-tuning conditions.

semi-supervised clustering · a-contrario detection · seed-guided learning · anomaly rejection · gestalt proximity

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

arXiv cs.LG · Zhe Ren, Yibo Yang, Yimeng Chen, Zijun Zhao · 2026-06-17

GateMem introduces a benchmark for evaluating memory governance in multi-principal LLM agents, addressing shared deployments in medical, office, education, and household settings. The benchmark assesses utility for long-horizon requests, access control across authorization boundaries, and active forgetting after deletion requests, using structured judging and leak-target annotations. Experiments show no current method achieves strong performance across all metrics: long-context prompting yields best governance at high token cost, while retrieval-based methods leak unauthorized or deleted information, highlighting reliability gaps for institutional deployment.

memory governance · multi-principal agents · access control · active forgetting · long-horizon requests

Where Will They Go? Modelling Multimodal Pedestrian Manoeuvres from Ego-centric Videos

arXiv cs.LG · Yuxuan Xie, Nicolas Pugeault, Chongfeng Wei, Hubert P. H. Shum · 2026-06-17

The paper introduces MMPM, a mode-aware framework for multimodal pedestrian trajectory prediction from ego-centric videos, addressing the limitations of unimodal stochastic predictors that yield implausible mixed-mode trajectories. MMPM comprises two modules: a behavior-aware Pedestrian Interaction Module (PIM) that captures pedestrian-vehicle and pedestrian-environment interactions using gaze, head, and hand gestures, and a CVAE-based Mode-aware Trajectory Predictor (MTP) that separately models future trajectory distributions for crossing and non-crossing behaviors. A query-based decoder ensures mode consistency during decoding. Evaluations on PIE and JAAD datasets demonstrate state-of-the-art performance, with improved frame-wise displacement errors validated through a data-driven protocol. MMPM is model-agnostic and enhances existing frameworks like BiTrap-NP and SGNet-ED.

multimodal trajectory prediction · ego-centric videos · pedestrian interaction module · mode-aware trajectory predictor · frame-wise displacement errors

Learning Augmented Exact Exponential Algorithms

arXiv cs.LG · Tatiana Belova, Yuriy Dementiev, Danil Sagunov · 2026-06-17

The paper introduces learning-augmented exact exponential algorithms for NP-hard subset selection problems, extending prior work focused on polynomial-time algorithms. The method augments state-of-the-art exact algorithms with noisy predictions and remains applicable when the predictions are only pairwise independent or when the predictor's accuracy is unknown. Results show that predictors marginally better than random guessing provably reduce the search space, with runtime speedups scaling smoothly with prediction quality.

learning-augmented algorithms · exponential-time algorithms · subset selection · pairwise independence · runtime speedup

Online Distributional Prediction via Latent Cluster Geometry Under Drift and Corruption

arXiv cs.LG · Navyansh Mahla, Prateek Chanda, Ganesh Ramakrishnan · 2026-06-17

The paper introduces an online distributional prediction method for non-stationary data streams with drift and adversarial corruption, using a latent cluster geometry to represent candidate laws. The approach employs a Gibbs quasi-posterior over variable-size configurations of centers, updated via reversible-jump MCMC, enabling structured uncertainty and regularization without parametric assumptions. Performance is evaluated through cumulative Wasserstein-1 regret, with analysis separating corruption-induced loss perturbations from drift-induced stale posterior memory. A restarted variant achieves sublinear cumulative Wasserstein regret under conditions including bounded support, stable latent geometry, and sublinear transport action, without requiring parametric models for the stream, drift, or corruption.

online distributional prediction · latent cluster geometry · wasserstein regret · gibbs quasi-posterior · reversible-jump mcmc
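
As a reminder of what a Wasserstein-1 loss measures, the sketch below computes it for one-dimensional empirical distributions with equal sample counts, where it reduces to the mean absolute difference between sorted samples. The restriction to 1-D equal-size samples and the helper names are assumptions for illustration. The paper's regret definition (which compares against a reference predictor) and its general setting are not reproduced here.

```python
def wasserstein1_1d(xs, ys):
    """W1 distance between two equal-size 1-D empirical distributions."""
    if not xs or len(xs) != len(ys):
        raise ValueError("samples must be non-empty and of equal size")
    # In 1-D the optimal transport plan matches sorted samples in order.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def cumulative_w1_loss(predictions, observations):
    """Sum of per-round W1 losses between predicted and observed samples."""
    return sum(wasserstein1_1d(p, o) for p, o in zip(predictions, observations))


shifted = wasserstein1_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0])
stream_loss = cumulative_w1_loss(
    [[0.0, 1.0], [0.0, 1.0]],   # predicted samples per round
    [[0.0, 1.0], [1.0, 2.0]],   # observed samples per round (drift in round 2)
)
```

A predictor that fails to track drift pays a loss proportional to how far the mass has moved, which is why stale posterior memory shows up directly in this metric.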

RouteJudge: An Open Platform for Reproducible and Preference-Aware LLM Routing

arXiv cs.LG · Guannan Lai, Haoran Hu, Han-Jia Ye · 2026-06-17

RouteJudge introduces an open platform for evaluating LLM routing strategies through pairwise preference comparisons, complemented by ORBIT, a modular toolbox for standardized router development. The system anonymizes model responses from different routers, collects user preferences, and attributes performance metrics (cost, latency, task metadata) to routing decisions. ORBIT provides unified interfaces for benchmark integration, router implementation, and budget-aware evaluation, enabling reproducible research. The platform and toolbox are publicly available, supporting continuous expansion of routing methods under consistent evaluation protocols.

llm routing · preference evaluation · budget-aware inference · modular toolbox · reproducible research

A Neural Network Framework for Geodesic-Like Curve Computation on Parametric Surfaces

arXiv cs.LG · Sheng-Gwo Chen, Chen-Chang Peng · 2026-06-17

The paper introduces a neural network framework for computing geodesic-like curves on parametric surfaces, addressing a gap in efficient numerical methods since Chen's 2010 theoretical work. The approach leverages Physics-Informed Neural Networks (PINNs) to handle both single surfaces and complex multi-surface systems with $C^0$ or higher continuity, including surfaces of revolution. Results demonstrate robust performance across this broad class of parametric geometries.

geodesic-like curves · parametric surfaces · physics-informed neural networks · multi-surface systems · surfaces of revolution

Ensuring Trustworthy Online A/B Testing: Addressing Five Key Questions on CUPED

arXiv cs.LG · Yu Zhang, Bokui Wan, Yongli Qin, Jinyong Ma · 2026-06-17

The paper systematically addresses five underexplored methodological and practical nuances of Controlled-experiment Using Pre-Experiment Data (CUPED) in online A/B testing. It compares post-CUPED estimators for optimal adjustment, evaluates regression-based adjustments with robust variance estimation, and extends analysis to multi-arm experiments and two-stage sampling designs. Results show standard variance estimators can yield misleading inferences in complex scenarios. The methodologies, validated theoretically and experimentally, have been deployed in ByteDance's experimentation platform.

cuped · variance reduction · treatment effect · multi-arm experiments · two-stage sampling
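
For readers unfamiliar with CUPED, the basic adjustment can be sketched as follows. This is the textbook single-covariate form (theta = Cov(Y, X) / Var(X)), not the specific estimators compared in the paper; the function name `cuped_adjust` is made up for this example.

```python
from statistics import mean


def cuped_adjust(y, x):
    """Return (adjusted outcomes, theta) for the basic CUPED adjustment.

    y: in-experiment metric per unit; x: pre-experiment covariate per unit.
    Adjusted outcome: y_i - theta * (x_i - mean(x)), with theta = Cov(y, x) / Var(x).
    """
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    # With a constant covariate there is nothing to adjust for.
    theta = cov_xy / var_x if var_x > 0 else 0.0
    adjusted = [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]
    return adjusted, theta
```

The adjustment leaves the sample mean unchanged and reduces variance in proportion to the squared correlation between `x` and `y`. In an A/B test, theta is typically estimated on the pooled data from all arms and then applied to each arm.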

Point-Cloud-Assistant Localized Statistical Channel Prediction by Tangent Gaussian Splatting

arXiv cs.LG · Ye Xue, Yiheng Wang, Xinhua Shao, Qi Yan · 2026-06-17

The authors introduce Point-Cloud-Assisted Tangent Gaussian Splatting (PC-TGS), a novel framework for extrapolating angular power spectrum (APS) to unmeasured outdoor grids by integrating sparse radio measurements with dense LiDAR-based geometry. PC-TGS represents environmental scatterers as anisotropic 3D Gaussians, initialized and refined through a relaxed-mean reparameterization of raw point clouds, and employs tangent-plane projection and depth-aware electromagnetic splatting for accurate APS mapping. Evaluations on a city-scale dataset (5M points, 6,310 RSRP samples) show PC-TGS outperforms state-of-the-art baselines in APS and RSRP prediction while offering faster inference, enabling geometry-aware channel prediction in large-scale wireless digital twins.

angular power spectrum · lidar · gaussian splatting · wireless networks · electromagnetic scattering

Low-Cost Neuromorphic Fall Detection Using Synthetic Event Data and Hybrid SNNs

arXiv cs.LG · Guillermo Rojas, Gonzalo Soto, Daniel Yunge · 2026-06-17

The paper introduces hybrid models combining spiking neural networks (SNNs) and convolutional neural networks (CNNs) for fall detection using synthetic event-based camera data. The method converts smartphone video frames into event-based data via Dynamic Vision Sensor (DVS) simulation, leveraging SNNs' energy efficiency and spatio-temporal processing. Evaluated on multiple datasets, the approach achieves comparable accuracy to traditional models while significantly improving computational efficiency, demonstrating the viability of SNN-DVS integration for real-world applications.

spiking neural networks · dynamic vision sensor · event-based data · hybrid models · fall detection

TimeLAVA: Learning-Agnostic Data Valuation for Time Series

arXiv cs.LG · Wenqin Liu, Weizhi Quan, Aoqi Zuo, Erdun Gao · 2026-06-17

TimeLAVA introduces a learning-agnostic framework for valuing time series data by quantifying segment contributions to minimizing distributional discrepancy. The method combines selective wavelet-based Wasserstein discrepancy for temporal localization with unbalanced optimal transport, enabling robust valuation without model training. Theoretical guarantees link valuation to generalization and outlier robustness. Experiments on anomaly detection, data pruning, and label noise tasks demonstrate superior performance over existing methods on real-world datasets.

data valuation · wasserstein discrepancy · optimal transport · wavelet transform · time series

Clinically Aligned Geometry Constraints for Robust IVUS Vessel Boundary Segmentation

arXiv cs.LG · Yunshu Chen, Litao Yang, Giuseppe Di Giovanni, Jordan Tan · 2026-06-17

GeoCat introduces geometry-consistent constraints for robust IVUS vessel boundary segmentation, addressing boundary drift and topology errors in lumen and external elastic membrane delineation. The method employs dual Cartesian-polar encoders with cross-domain attention and temporal fusion to process 5-frame IVUS clips, supervised by a differentiable geometry consistency loss targeting clinically relevant descriptors. Trained on 12,242 annotated frames from 146 patients across two IVUS systems, GeoCat achieves a Dice score of 0.93, reduces 95HD to 0.14 mm, and lowers topology violations to 1.0%. It significantly improves geometric fidelity, yielding diameter errors of 0.13-0.16 mm and angular errors of ~8 degrees, enhancing plaque burden quantification.

ivus segmentation · geometry consistency · cartesian-polar encoder · temporal fusion · plaque burden

Trainable Photonic Measurement for Physics-Informed PDE Learning

arXiv cs.LG · Jiale Linghu, Hao Dong, Yangshuai Wang · 2026-06-17

The paper introduces a photonic quantum neural field for physics-informed PDE learning, where coordinates are encoded as trainable optical phases and decoded via photon-number measurements. The photonic circuit serves as the neural-field representation, optimized directly rather than as a fixed feature map. Benchmarked across seven PDE tasks, the method exhibits a phase-complexity transition: classical networks suffice for smooth regimes, while photonic fields excel when residual derivatives amplify phase mismatch, achieving up to 10x lower errors with 75% fewer parameters than baselines. Noise tests confirm robustness via learned interference and stable Fock-probability readout.

photonic quantum neural field · physics-informed pde · fock-space interference · phase-complexity transition · trainable optical phases

Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment

arXiv cs.LG · Yanjun Shao, Yundi Chen, Yashvi Patel, Aurelien Pelissier · 2026-06-17

The paper introduces LOGICA (Logit-space Contrastive Alignment), a framework for conditioning biological language models on task-specific contexts while preserving their native likelihood interface. The method employs gated cross-modal adapters to perform contrastive learning directly in output-logit space, enabling context-sensitive token probability matching without shared tokenizers or embedding spaces. Evaluated on protein-ligand binding, TCR-peptide activity, and drug-conditioned resistance prediction, LOGICA outperforms latent-space contrastive and conditional MLM baselines, improving AUC from ~0.55 to ~0.65 on drug-resistance mutation prediction while maintaining interpretable token-level outputs.

biological language models · contrastive learning · logit-space alignment · context-conditioned prediction · cross-modal adapters

Stealthy World Model Manipulation via Data Poisoning

arXiv cs.LG · Yibin Hu, Xiaolin Sun, Zizhan Zheng · 2026-06-17

SWAAP introduces a two-stage data poisoning framework for learned world models in model-based reinforcement learning. First, it identifies a harmful target world model using bilevel optimization and a transition-gradient theorem, ensuring proximity to clean dynamics while inducing low-return behavior. Second, it achieves this target via stealth-constrained gradient matching, modifying a limited fraction of fine-tuning transitions to steer the victim model toward adversarial dynamics while minimizing prediction error. Evaluated across continuous-control tasks, SWAAP causes significant performance degradation while evading non-adaptive defenses like residual/CUSUM/TRIM, highlighting vulnerabilities in world-model adaptation pipelines.

data poisoning · world models · bilevel optimization · gradient matching · continuous-control

Attention as Frustrated Synchronization

arXiv cs.LG · Joshua Nunley · 2026-06-17

The paper introduces the Frustrated Synchronization Network (FSN), an attention architecture where token states are phases on a torus and computation arises from structured departures from perfect synchronization. The FSN employs a learned complex coupling kernel with static Kuramoto-Sakaguchi frustration angles, repulsive Daido harmonics, and a delay term coupling tokens to successors. At 1M parameters, FSN outperforms RoPE-SwiGLU transformers on character-level text and code (enwik8 validation loss: 1.5953 vs. 1.611), with advantages persisting up to 4M parameters. A variant replacing feed-forward blocks with mean-field coupling matches transformer performance.

frustrated synchronization · kuramoto-sakaguchi coupling · daido harmonics · rope-swiglu · mean-field coupling

Robust and Interpretable Adaptation of Equivariant Materials Foundation Models via Sparsity-promoting Fine-tuning

arXiv cs.LG · Youngwoo Cho, Seunghoon Yi, Wooil Yang, Sungmo Kang · 2026-06-17

The authors propose a sparsity-promoting fine-tuning method for E(3)-equivariant materials foundation models, enabling domain-specific calibration while preserving pre-trained physicochemical knowledge. The method selectively updates parameters by exploiting structural properties of equivariant networks, achieving comparable or superior performance to full fine-tuning and equivariant low-rank adaptation with only ∼3% (sometimes ∼0.5%) parameter updates. Evaluated on energy/force prediction across molecular/crystalline benchmarks, plus magnetic moment prediction tasks, the approach demonstrates task generalizability. Sparsity patterns reveal physically interpretable features like enhanced d-orbital contributions in transition metals.

equivariant models · sparsity-promoting fine-tuning · materials foundation models · interatomic potentials · parameter efficiency

Fair Online Resource Allocation

arXiv cs.LG · Christopher En, Yuri Faenza, Andrea Lodi, Gonzalo Muñoz · 2026-06-17

The paper introduces a fair online resource allocation model maximizing welfare under capacity constraints and Lipschitz fairness, ensuring similar outcomes for comparable agents within batches. It analyzes the offline problem, proving the optimal fair allocation achieves at least Ω(1/γ) of the unfair optimum, bounding fairness's price. An online algorithm using dual mirror descent enforces fairness constraints while estimating dual variables, achieving sublinear regret against the offline benchmark. Validation on Refugee Economies Programme data demonstrates welfare-fairness trade-offs.

fair allocation · lipschitz fairness · dual mirror descent · sublinear regret · welfare maximization

InTrain: Intrinsic Trainability for Zero-Cost Neural Architecture Search

arXiv cs.LG · Qinqin Zhou, Fuhai Chen, Jipeng Wu, Zhiwei Chen · 2026-06-17

The paper introduces Intrinsic Trainability (InTrain), a theoretical proxy for zero-cost neural architecture search that formalizes trainability through geometric capacity and optimization resilience. Geometric capacity is quantified via activation covariance eigenspectrum participation ratio, while optimization resilience uses cumulative gradient health. InTrain combines these dimensions multiplicatively, hypothesizing their synergy is key to trainability. Experiments on NAS benchmarks show InTrain matches ensemble-based proxies in ranking correlation and outperforms single-metric methods.

intrinsic trainability · zero-cost nas · geometric capacity · optimization resilience · participation ratio
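
The participation ratio used for geometric capacity has a standard closed form, (sum of eigenvalues)^2 / (sum of squared eigenvalues). A minimal sketch, assuming a symmetric covariance matrix given as nested lists, avoids an eigendecomposition by using the identities sum(lambda) = trace(C) and sum(lambda^2) = squared Frobenius norm of C. How InTrain normalizes or combines this quantity with gradient health is not shown here.

```python
def participation_ratio(cov):
    """Participation ratio of the eigenspectrum of a symmetric covariance matrix.

    PR = (sum_i lambda_i)^2 / sum_i lambda_i^2
       = trace(C)^2 / ||C||_F^2   (valid because C is symmetric).
    Ranges from 1 (one dominant direction) to d (isotropic, d dimensions).
    """
    trace = sum(cov[i][i] for i in range(len(cov)))
    frob_sq = sum(v * v for row in cov for v in row)
    if frob_sq == 0:
        raise ValueError("covariance matrix is all zeros")
    return trace * trace / frob_sq
```

For example, a 3x3 identity covariance gives a ratio of 3, while a rank-one covariance gives 1.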

BLADE: Scalable Bi-level Adaptive Data Selection for LLM Training

arXiv cs.LG · Jiaxing Wang, Deping Xiang, Jin Xu, Zirui Liu · 2026-06-17

The paper introduces BLADE, a Hessian-free framework for bi-level adaptive data selection in LLM training, addressing limitations of influence-based and excess-loss methods. BLADE reformulates the bi-level optimization problem via Lagrange multipliers, avoiding inverse-Hessian computations while maintaining dynamic model alignment. Theoretical guarantees for first-order convergence are provided, with an efficient online batch selection implementation using a randomized block-coordinate Frank-Wolfe algorithm. Experiments demonstrate BLADE's superiority over state-of-the-art baselines in LLM training scenarios.

blade · bi-level optimization · lagrange multipliers · frank-wolfe · data selection

MetaboNet-Bench: A Multi-modal Benchmark for Glucose Forecasting in Type 1 Diabetes

arXiv cs.LG · Nathaniel Jeffries, Miriam Wolff, Sam Royston, Elizabeth Healey · 2026-06-17

The paper introduces MetaboNet-Bench, a multi-modal benchmark for glucose forecasting in type 1 diabetes that standardizes performance evaluation across algorithms leveraging glucose, insulin, and carbohydrate data. The benchmark provides an extensible open-source framework and demonstrates its utility by evaluating several recent glucose forecasting models, including a custom multi-modal time-series model. Results indicate that the benefits of incorporating additional data modalities are model-dependent, with more complex architectures showing greater improvements. The benchmark also highlights gaps in current approaches through comprehensive clinical metrics, facilitating targeted future research in glycemic control management.

glucose forecasting · type 1 diabetes · multi-modal · time-series · benchmark

PACT: Preserving Anchored Cores in Task-vectors for Model Merging

arXiv cs.LG · Ningyuan Shi, Zhipeng Zhou, Hao Wang, Chunyan Miao · 2026-06-17

PACT introduces a method to preserve Load-Bearing Wall (LBW) dimensions in model merging, addressing the limitation of task-vector-based approaches that assume task-specific knowledge resides solely in task vectors. By characterizing LBW dimensions from scalar-weight and subspace perspectives, PACT aligns orthogonal complements with pre-trained weights and removes them from task vectors before merging. An efficient randomized SVD variant improves scalability. Experiments show PACT enhances existing merging methods, achieving state-of-the-art performance across multiple benchmarks.

model merging · task arithmetic · load-bearing wall dimensions · randomized svd · multi-task learning

Towards Anomaly Detection on Relational Data

arXiv cs.LG · Shiyuan Li, Yunfeng Zhao, Yue Tan, Qingfeng Chen · 2026-06-17

The paper proposes RelAD, a reconstruction-based framework for anomaly detection in relational databases that addresses challenges of high-dimensional heterogeneous attributes and cross-table connection patterns. The method combines conditional sparse-gated attribute reconstruction to filter redundant features with dual-view multi-relational edge reconstruction to detect abnormal connections, integrated via a lightweight fusion module. Experiments on 6 benchmark datasets demonstrate RelAD's consistent superiority over baselines while maintaining competitive computational efficiency.

anomaly detection · relational databases · reconstruction-based learning · sparse-gated networks · multi-relational edges

Fair Cognitive Impairment Detection Through Unlearning

arXiv cs.LG · William Nguyen, Jiali Cheng, Hadi Amiri · 2026-06-17

The study introduces a multimodal framework for fair Mild Cognitive Impairment (MCI) detection from spontaneous speech, addressing performance gaps across demographic subgroups. The method combines cross-modal fusion of speech, text, and image modalities with gradient reversal-based unlearning to remove task-irrelevant demographic attributes from shared embeddings. Evaluated on TAUKADIAL and PREPARE benchmarks, the approach outperforms state-of-the-art multilingual and multimodal baselines in MCI classification while reducing subgroup performance disparities, particularly across sex and language. Transfer learning analysis confirms that demographic unlearning yields more robust MCI representations.

mild cognitive impairment · multimodal fusion · gradient reversal · demographic unlearning · spontaneous speech

Bridging Data Gaps in Structural Fragility Modeling through Transfer Learning: Methodology and Case Studies

arXiv cs.LG · Narges Saeednejad, Jamie Ellen Padgett · 2026-06-17

The paper introduces a transfer learning framework for structural fragility modeling that addresses domain shift, class imbalance, and label scarcity while maintaining interpretability. Four strategies (instance-based, parameter-based, hierarchical Bayesian, and multi-source transfer learning) are validated through case studies on coastal bridges (Hurricane Katrina), residential buildings (Hurricane Ian), and seismic bridges (Nisqually earthquake). Results show that direct transfer fails under domain shifts, while adapted models improve failure detection by 15-30% and predictive stability in low-data regimes, with hierarchical Bayesian methods providing robust uncertainty quantification.

transfer learning · fragility modeling · domain adaptation · hierarchical bayesian · multi-source fusion

TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults

arXiv cs.LG · Yuyang Zhao, Lian Xu, Hao Miao, Chenxi Liu · 2026-06-16

The paper introduces TS-Fault, a benchmark for evaluating time series forecasting (TSF) models against structured faults rather than generic noise. It organizes faults into four modes along observation vs. mechanism-level and univariate vs. multivariate axes, injecting them via importance scores. Testing 21 models across 6 datasets reveals: (i) clean-data accuracy anti-correlates with robustness; (ii) rankings hold under observation-level faults but shuffle under mechanism-level faults; (iii) catastrophic failures occur only under mechanism-level faults, with foundation models being most fragile despite high clean accuracy.

time series forecasting · structural faults · robustness benchmark · mechanism-level faults · importance score

Effects of sparsity and superposition on loss in simple autoencoders

arXiv cs.LG · Mriganka Basu Roy Chowdhury, Eric McLaughlin Weiner · 2026-06-16

This work mathematically analyzes superposition in autoencoders with sparse inputs, building on Elhage et al. (2022)'s findings on polysemanticity. The study derives tight upper and lower bounds for L2 reconstruction loss in the sparse regime, focusing on power activation functions. Results rigorously validate superposition as an optimal compression strategy for sparse feature representation, while identifying open problems for future research.

superposition · sparsity · autoencoders · l2 reconstruction · polysemanticity

Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents

arXiv cs.LG · Caleb Chang, Davin Win Kyi, Natasha Jaques, Karen Leung · 2026-06-16

The paper introduces General Reward Inference and Disentanglement (GRID), a social learning method that extracts universally beneficial behaviors from heterogeneous demonstrators. GRID decomposes agent rewards into general (shared) and specific (individual) components, enabling generalist pretraining focused on universal competencies like safety and task proficiency. Experiments in synthetic decomposition, multi-agent Craftax, and Highway-Env demonstrate GRID's superior reward disentanglement, outperforming standard imitation learning baselines and enabling more efficient downstream specialization.

social learning · reward decomposition · generalist pretraining · heterogeneous agents · behavioral disentanglement

Shrinkage priors for Bayesian Substitute Confounders

arXiv cs.LG · Yordan P. Raykov, Hengrui Luo, Justin D. Strait, Wasiur R. KhudaBukhsh · 2026-06-16

The paper proposes a Bayesian factor assignment framework with shrinkage priors for learning sparse substitute confounders in multi-cause observational studies. The method addresses limitations of existing approaches by preserving coarse multi-cause dependence while avoiding over-encoding or single-cause variation. Theoretical results demonstrate posterior concentration, factor score contraction, and overlap-preserving geometry, with consistent regression-adjusted estimators under latent variable identification assumptions. Experiments on synthetic data and the Alzheimer's Disease Neuroimaging Initiative show effective adjustment comparable to direct biomarker conditioning, with collapse diagnostics identifying problematic factors.

substitute confounders · shrinkage priors · bayesian factor analysis · causal adjustment · overlap preservation

When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?

arXiv cs.LG · Xuanfei Ren, Tengyang Xie · 2026-06-16

The paper develops a statistical theory for offline reinforcement learning with trajectory-level supervision, where only scalar outcome labels are observed per trajectory. It proposes OPAC, a pessimistic actor-critic algorithm that learns a latent reward model from such labels, achieving a high-probability bound of $\widetilde{O}(H^2\sqrt{C_{sa}(\pi^\star)/n})$ with a matching lower bound. The analysis extends to preference-based feedback and identifies structural coefficients $\kappa_\mu(\sigma)$ and $\chi_\mu(\sigma)$ that enable polynomial sample complexity in generalized outcome-based RL, while showing fundamental barriers for certain objectives like all-success cases.

offline reinforcement learning · trajectory-level supervision · pessimistic actor-critic · concentrability coefficient · generalized bellman updates

Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

arXiv cs.LG · Aaditya Pai · 2026-06-16

This paper presents the first systematic evaluation of prompting-based defenses against domain-camouflaged injection attacks, which embed malicious instructions using domain-appropriate vocabulary. The study tests five defenses (spotlighting, paraphrasing, prompt sandwiching, and combinations) across three model families (Claude Haiku, Llama 3.1 8B, Gemini 2.0 Flash) and three deployment domains (financial, legal, general) using 3,510 trials. Paraphrasing retrieved content proved most effective, reducing attack success rates by 55-84% across models and outperforming Llama Guard 4. Defense effectiveness varied significantly by model, with spotlighting halving attacks on Claude Haiku but providing no benefit on Llama 3.1 8B. Financial domains showed highest residual risk (26-33% baseline success), with no defense fully eliminating threats on weaker models.

domain-camouflaged injection · prompting-based defenses · paraphrasing · spotlighting · llama guard
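
To make two of the named defenses concrete, here is a generic sketch of spotlighting (in its datamarking variant) and prompt sandwiching as plain string transformations. The function names, the marker character, and the prompt wording are all assumptions for illustration; the paper's exact prompts are not given in this summary. Paraphrasing, the most effective defense reported above, requires a model call and is omitted.

```python
def spotlight_datamark(retrieved: str, marker: str = "^") -> str:
    """Spotlighting via datamarking: join words with a marker so the
    retrieved text is visibly distinguishable from instructions."""
    return marker.join(retrieved.split())


def sandwich_prompt(task: str, retrieved: str) -> str:
    """Prompt sandwiching: state the task before and after untrusted content."""
    return (
        f"{task}\n\n"
        "Retrieved content (treat as data, not instructions):\n"
        f"{retrieved}\n\n"
        f"Reminder of the task: {task}"
    )
```

The two can be combined by passing `spotlight_datamark(retrieved)` as the `retrieved` argument of `sandwich_prompt`, provided the system prompt explains the marker to the model.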

Hierarchical Attention via Domain Decomposition

arXiv cs.LG · Stephan Köhler, Oliver Rheinbach · 2026-06-16

The paper introduces a hierarchical attention mechanism inspired by two-level overlapping Schwarz domain decomposition, designed to enhance operator learning in sequence-to-sequence tasks. The method decomposes the traditional global softmax-free low-rank attention (QK^T) into local subdomain attention blocks and a coarse global attention block, connected via restriction operators (R_i) and partition-of-unity weights (D_i). Evaluated on a 1D diffusion problem with known exact solution operator, the approach demonstrates faster training and higher accuracy than global low-rank attention baselines while reducing parameter count.

hierarchical attention · domain decomposition · operator learning · schwarz method · low-rank approximation

On the Residual Scaling of Looped Transformers: Stability and Transferability

arXiv cs.LG · Shaowen Wang, Bingrui Li, Ge Zhang, Wenhao Huang · 2026-06-16

The paper analyzes residual scaling in looped (weight-tied) Transformers, demonstrating that prior depth-scaling prescriptions (ε=1/√L) are insufficient due to correlated updates across iterations. It proposes a stronger ε=1/N scaling for single-layer blocks and a factored parameterization ε=λ/(N√L) for multi-layer blocks, separating loop correlation (1/N) and layer variance (1/√L) effects. This enables hyperparameter transfer across loop counts N without retuning. Experiments confirm that 1/N scaling improves trainability and achieves better loss than 1/√N scaling across varying N.

looped transformers · residual scaling · weight-tied · hyperparameter transfer · trainability
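
The intuition behind the 1/N prescription can be seen in a toy calculation: when the same block is reused N times, its updates are correlated, so they add up linearly rather than as a random walk. The sketch below takes the extreme case of identical scalar updates, which is an illustrative assumption and not the paper's analysis; `residual_scale` simply encodes the factored formula ε = λ/(N√L) from the summary.

```python
import math


def residual_scale(num_loops, num_layers, lam=1.0):
    """Factored residual scale: epsilon = lam / (N * sqrt(L))."""
    return lam / (num_loops * math.sqrt(num_layers))


def looped_residual_norm(num_loops, eps, update=1.0):
    """Toy model: apply the same (fully correlated) update N times,
    h <- h + eps * update, starting from h = 0, and return |h|."""
    h = 0.0
    for _ in range(num_loops):
        h += eps * update
    return abs(h)
```

In this toy setting, `looped_residual_norm(N, 1 / N)` stays at 1 for any N, whereas `looped_residual_norm(N, 1 / math.sqrt(N))` grows like √N, which is why a 1/√N-style scale fails to keep the residual stream bounded as the loop count increases.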

Compact Geometric Representations of Hierarchies

arXiv cs.LG · Prashant Gokhale, Piotr Indyk, Yuhao Liu, Sandeep Silwal · 2026-06-16

The paper presents compact geometric embeddings for hierarchical data structures, proving that directed trees admit reachability embeddings in constant dimension 3, independent of size or depth. For graphs with treewidth t, it constructs embeddings of dimension O(t log n), where n is the number of nodes. The work establishes matching lower bounds: Ω(n) for general DAGs and Ω(t/log(n/t)) for treewidth t graphs. Experimental validation shows practical applicability, with embeddings outperforming prior work in high-recall regimes by achieving smaller dimensions while preserving theoretical guarantees.

reachability embeddings · directed acyclic graphs · treewidth · dimension bounds · hierarchical retrieval

Exponentially many initializations to avoid barren plateaus

arXiv cs.LG · Ankit Kulshrestha, Ricard Puig, Diego García-Martín, Lukasz Cincio · 2026-06-16

The paper introduces a first-moment framework to diagnose initialization strategies that avoid barren plateaus in quantum ansätze, showing that barren-plateau avoidance is non-unique and can be achieved through exponentially many inequivalent parameter distributions. The method analyzes operator-level biases induced by different initializations, recovering known schemes like identity and Gaussian initialization while revealing diverse alternatives. Numerical results suggest distinct initializations lead to different local minima, transforming the exponential concentration problem into a selection challenge among trainable options.

barren plateaus · quantum ansätze · first-moment framework · initialization strategies · parameter distributions

N(CO)$^2$: Neural Combinatorial Optimization with Chance Constraints to Solve Stochastic Orienteering

arXiv cs.LG · Anas Saeed, Marcos Abel Zuzuárregui, Stefano Carpin · 2026-06-16

The paper introduces N(CO)$^2$, a neural combinatorial optimization method with chance constraints for solving the Stochastic Orienteering Problem (SOP). Leveraging reinforcement learning, the approach learns adaptive heuristics without hand-crafted designs, optimizing path selection under uncertainty. Empirical results show competitive performance against state-of-the-art MILP methods, demonstrating generalization across diverse SOP instances. The method reduces human effort in heuristic design while enabling efficient decision-making in stochastic environments.

neural combinatorial optimization · chance constraints · stochastic orienteering problem · reinforcement learning · mixed-integer linear program

Concept Modulation Models: A Unified Framework for Identifiability and Extrapolation

arXiv cs.LG · Soheun Yi, Yizhou Lu, Chandler Squires, Pradeep Ravikumar · 2026-06-16

The paper introduces concept modulation models (CMMs), a unified framework for analyzing identifiability and extrapolation in conditional latent variable models. CMMs employ a structured generative process $A\to \Lambda\to C\to X$, where attributes select modulators that induce latent concept laws, which then generate observed features. The framework demonstrates that feature agreement on observed attributes induces constrained latent concept transitions, formalized via attribute potentials (log-density ratios). These potentials also govern extrapolation, yielding algebraic criteria for generalization to unseen attributes. The approach generalizes prior model-specific identifiability and extrapolation results from nonlinear ICA, causal representation learning, and perturbation modeling.

concept modulation models · conditional latent variable models · attribute potentials · identifiability · extrapolation

Beyond AHI: An Interpretable Causal-Discovery-Guided Framework for Sleep Recovery in Connected Health

arXiv cs.LG · Saba A. Farahani, Elahe Khatibi, Manoj Vishwanath, Amir M. Rahmani · 2026-06-16

The authors propose an interpretable Sleep Recovery Score (SRS) framework that outperforms traditional Apnea-Hypopnea Index (AHI) by 2.5× in aligning with patient-reported outcomes. The method combines causal discovery via directed acyclic graphs (DAGs) on polysomnography data from two cohorts (MESA: n=1540; MrOS: n=825) with physiology-based constraints and LLM-assisted auditing to identify five key physiological domains: respiratory burden, hypoxic burden, sleep fragmentation, sleep architecture, and autonomic regulation. The resulting SRS demonstrates stronger correlation with perceived recovery while maintaining mechanistic plausibility for connected health applications.

sleep recovery score · directed acyclic graph · polysomnography · physiological domains · connected health

Quantum Annealing Enhanced Reinforcement Learning for Accurate Remaining Useful Lifetime Prediction

arXiv cs.LG · Manoranjan Gandhudi, Arunkumar V., G. R. Anil, Gangadharan G. R · 2026-06-16

The paper proposes Quantum Annealing enhanced Q-Learning (QAQL), a framework integrating quantum annealing with Q-learning for remaining useful life (RUL) prediction. QAQL encodes Q-value updates as quadratic unconstrained binary optimization (QUBO) problems solved on a D-Wave Advantage system, using the annealer's stochastic sampling for exploration. Evaluated on NASA C-MAPSS turbofan and device-fleet datasets, QAQL outperforms classical and quantum baselines across six error metrics, demonstrating practical quantum-enhanced reinforcement learning for industrial predictive maintenance.

quantum annealing · q-learning · remaining useful life · predictive maintenance · qubo

The Illusion of Improvement: Reject Inference Strategies in Credit Scoring

arXiv cs.LG · Bruno Scarone, Ricardo Baeza-Yates · 2026-06-16

The paper demonstrates that reject inference methods in credit scoring create an illusion of model improvement by increasing accuracy while degrading rejection quality, leading practitioners to falsely believe the system is improving. After a systematic evaluation across three real-world datasets and two machine learning methods, the authors propose a controlled exploration strategy in which lenders approve a fraction of rejected applicants to observe true outcomes. Experiments show that minimal exploration rates (2-5%) effectively diagnose feedback loops at near-zero cost, revealing that accuracy and rejection quality provide opposing recommendations on exploration necessity under selection bias.

reject inference · credit scoring · survival bias · feedback loop · selection bias
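
The controlled exploration idea can be sketched as a small decision rule: applicants below the approval threshold are normally rejected, but a small random fraction is approved so that their true outcomes can be observed. The function name, the labels, and the threshold-based scoring setup are assumptions for illustration and are not taken from the paper.

```python
import random


def lending_decision(score, threshold, explore_rate, rng):
    """Approve above the threshold; otherwise approve a random fraction
    (explore_rate) of would-be rejects to observe their true outcomes."""
    if score >= threshold:
        return "approve"
    if rng.random() < explore_rate:
        return "approve_for_exploration"
    return "reject"


# Usage: with a 5% exploration rate, about 1 in 20 would-be rejects is approved.
rng = random.Random(0)
decisions = [lending_decision(0.2, 0.5, 0.05, rng) for _ in range(10_000)]
explored_fraction = decisions.count("approve_for_exploration") / len(decisions)
```

Outcomes from the explored group give an unbiased sample of the rejected population, which is what allows rejection quality to be measured under selection bias.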

ToolChain-CRC: Conformal Risk Control for Agentic AI Under Retrieval and Tool-Use Drift

arXiv cs.LG · Jeffery Opoku, David Banahene · 2026-06-16

ToolChain-CRC introduces a conformal risk-control method for retrieval-augmented and tool-using AI agents, addressing trajectory-level risks under drift. The method constructs step-level risk scores, combines them into a trajectory score, and calibrates an accept-or-intervene rule with an anytime alarm for early intervention. Theoretical guarantees include trajectory-level risk control under exchangeability and a drift-aware extension with auditable constants. Experiments on synthetic drift, RAG/tool-use stress tests, and live benchmarks demonstrate that trajectory-level calibration maintains risk below target, unlike final-answer-only approaches.

conformal risk control · retrieval-augmented agents · tool-use drift · trajectory-level calibration · supermartingale construction
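
The accept-or-intervene calibration can be illustrated with the generic conformal risk control recipe under exchangeability. This is a sketch under stated assumptions, not the ToolChain-CRC algorithm: step scores are aggregated with a maximum, the loss is 1 when a failed trajectory would have been accepted, and the threshold is picked from a finite grid. The anytime alarm and the drift-aware extension are not covered.

```python
def trajectory_score(step_scores):
    """Aggregate step-level risk scores into one trajectory score.
    Using the maximum is an assumption made for this sketch."""
    return max(step_scores)


def calibrate_threshold(cal_scores, cal_failed, alpha, grid):
    """Largest threshold lam in grid such that accepting trajectories with
    score <= lam keeps the conformal risk bound at or below alpha.

    Loss per calibration trajectory: 1 if it failed and would be accepted.
    The loss is bounded by B = 1 and non-decreasing in lam, so the bound
    (n * empirical_risk + B) / (n + 1) <= alpha is checked in increasing lam.
    Returns None when no threshold qualifies (always intervene).
    """
    n = len(cal_scores)
    best = None
    for lam in sorted(grid):
        losses = sum(1 for s, f in zip(cal_scores, cal_failed) if f and s <= lam)
        if (losses + 1) / (n + 1) <= alpha:
            best = lam
        else:
            break  # risk only grows with lam, so later values also fail
    return best
```

At deployment, a new trajectory would be accepted when `trajectory_score(steps) <= lam` and escalated for intervention otherwise.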

Modeling Doppler Shifts in Radial-Velocity Data with Deep Learning toward Earth-mass Exoplanet Detection

arXiv cs.LG · Isidro Gómez-Vargas, Xavier Dumusque, Yinan Zhao, Khaled Al Moulla · 2026-06-16

The authors present a deep-learning framework for detecting Earth-mass exoplanets via Doppler shifts in radial-velocity data, addressing stellar activity interference. They train artificial neural networks on HARPS-N solar spectra with injected planetary signals, using physics-motivated spectral representations (flux, line-formation temperature, velocity gradients) and employ genetic-algorithm hyperparameter optimization with Monte Carlo dropout for uncertainty quantification. Their best model reliably retrieves planetary signals ≥25 cm/s with periods 10–550 days, with temperature-based representations outperforming flux-based ones. The work includes the release of doppleriann, a Python package implementing the framework.

radial-velocity · doppler shifts · deep learning · stellar activity · uncertainty quantification

Mixed-Precision Communication-Avoiding SGD for Generalized Linear Models on GPUs

arXiv cs.LG · Aditya Devarakonda, Irene Simó Muñoz, Giulia Guidi · 2026-06-16

The authors propose mixed-precision communication-avoiding SGD (CA-SGD) for generalized linear models on GPUs, optimizing precision choices to balance computation and communication. CA-SGD amortizes communication over s iterations by replacing s AllReduces with a single AllReduce of an sb×sb Gram matrix, leveraging GPU matrix hardware and reduced-precision formats. A finite-precision analysis decomposes local rounding error into nine precision choices, yielding a transferable recipe across GPU generations. Experiments on NERSC Perlmutter A100 GPUs show mixed-precision CA-SGD matches FP32 SGD loss within 0.5% on logistic, linear, and Poisson problems, achieving 5.1–6.8× speedup over FP32 SGD on benchmark datasets.

communication-avoiding sgdmixed-precisiongeneralized linear modelsgram matrixallreduce

Task-Restricted Symmetries in Recurrent Weight Space

arXiv cs.LG · Simon Dräger · 2026-06-16

The study identifies task-restricted functional redundancies in recurrent neural networks (RNNs) by analyzing weight space symmetries using ordered real Schur coordinates. One-layer tanh RNNs were examined through structured ablations that fixed input and readout maps while varying spectral blocks and directed nonnormal couplings. Results show that certain nonnormal Schur couplings can be removed with minimal loss in trained solutions for tasks like fixed-length copy, while others are essential for accurate autonomous replay. Task-specific ablation profiles varied across flip-flop, sine generation, and context-dependent integration tasks, revealing approximate functional invariances rather than universal symmetries. Schur-coordinate ablations serve as a diagnostic tool for identifying perturbation effects on trained RNN solutions.

recurrent neural networksschur coordinatesfunctional redundancystructured ablationsnonnormal couplings

A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)

arXiv cs.LG · Ali Asaria, Tony Salomone, Deep Gandhi · 2026-06-16

The paper introduces a vision-language model (VLM)-judge protocol for evaluating single-image 3D mesh quality, addressing the lack of reliable automatic metrics. The protocol employs a fixed 24-view render rig, two independent VLM judge families, and position-bias correction, achieving substantial inter-judge agreement (Cohen's kappa = 0.66). Results demonstrate that common proxies (render-space CLIP similarity and geometry-validity statistics) fail to correlate with perceived quality, exhibiting bimodal behavior and chance-level performance. The study recommends the VLM-judge protocol for reproducible evaluation and cautions against using geometry/CLIP proxies as optimization targets.

vision-language model3d mesh qualitycohen's kapparender-space clipgeometry-validity
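
For readers unfamiliar with the agreement statistic quoted above, Cohen's kappa for two judges can be computed as in the sketch below. The verdict labels are hypothetical; the paper's render rig and position-bias correction are not reproduced here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected).

    Undefined (division by zero) when both judges always give the same
    single label, since expected agreement is then 1.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each judge's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two judges on four mesh pairs.
judge_1 = ["A", "A", "B", "B"]
judge_2 = ["A", "A", "B", "A"]
kappa = cohens_kappa(judge_1, judge_2)
```

In this toy case the judges agree on 3 of 4 items, but chance agreement is 0.5, so kappa is 0.5 rather than 0.75.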

Sequential Hiring of Contingent Workers Through Learning-Based Optimization

arXiv cs.LG · Chris Lee, Xiuli Chao, Izak Duenyas · 2026-06-16

The paper proposes DR-UCB (DelayedReplacement-UCB), a learning-based hiring policy for sequential workforce management under uncertainty in worker productivity and labor supply. The problem is formulated as a stochastic multi-play bandit with costly switching and delayed actions, addressing operational frictions like replacement costs and hiring delays. DR-UCB dynamically adjusts workforce composition using real-time production data, achieving leading-order regret matching its lower bound in time horizon dependence. Numerical experiments demonstrate superior performance over benchmarks.

stochastic banditworkforce optimizationdelayed actionscostly switchingregret analysis

Pointwise is Pointless? A Multimodal Ablation Study for Precipitation Nowcasting with Graph Neural Networks

arXiv cs.LG · Ophélia Miralles, Máté Mile, Christoffer Artturi, Thomas Nipen · 2026-06-16

This study evaluates the impact of sparse point observations on precipitation nowcasting using a multimodal graph neural network. The model integrates radar history, MEPS numerical weather predictions, Netatmo surface observations, MSG satellite channels, stochastic noise, and CRPS-based ensemble losses. Ablation experiments reveal that MEPS stabilizes radar extrapolation, Netatmo improves local station and onset diagnostics, and satellite data reduces biases but may trigger premature rain activation. CRPS-based configurations yield consistent radar-grid improvements, while combined satellite and CRPS setups achieve optimal oracle/DAS scores. Results indicate that sparse observations enhance local constraints but their utility for radar-field accuracy depends on loss functions and encoding methods.

precipitation nowcastinggraph neural networksmultimodal ablationcrps-based ensemblesparse observations

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

arXiv cs.LG · Yueying Li, Yuanfan Chen, Jiayang Chen, Esha Choukse · 2026-06-16

The paper introduces a distribution-aware, prediction-free scheduling framework for LLM inference that replaces explicit length prediction with soft priority boosting using lightweight statistical signals. The method co-optimizes scheduling and cache-aware preemption to handle memory-coupled decode dynamics across diverse workloads. Evaluated on production and open-source traces, it reduces P99 time-to-last-token (TTLT) latency by 35-50% compared to SRPT with perfect length knowledge and decreases time-to-first-token (TTFT) by 34-47% across reasoning-heavy and chat-heavy tasks. The results position it as a robust alternative for optimizing tail latency in online LLM serving without relying on fragile length predictions.

llm inferencetail latencycache-aware preemptionsoft priority boostingmemory-coupled decode

Signature filtering: a lightweight enhancement for statistical watermark detection in large language models

arXiv cs.LG · Chih-Duo Hong, Yen-Pang Chen, Fang Yu · 2026-06-16

The paper introduces signature filtering, a lightweight detection-time module that enhances statistical watermark detection in LLM outputs without modifying watermark embedding or generation. The method learns signature tokens via mixed-integer linear programming on a training set, removing unreliable tokens before detection, with theoretical bounds derived for various attacker models. Evaluations across four watermark families (Kgw, Sweet, Unigram, Exp), four corpora (C4, MBPP, HumanEval, Code-Search-Net), and six LLMs (Opt-1.3b to Phi-3-medium-14b) show 2- and 3-gram signatures improve detection rates from 8–31% to 78–99% in weak-signal/low-entropy settings, while maintaining low false positives and robustness to text perturbations.

statistical watermarkingsignature filteringmixed-integer linear programllm provenancedetection-time enhancement
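
To make the idea of filtering before detection concrete, the sketch below applies a green-list z-test of the kind used by KGW-style watermarks, after dropping tokens marked unreliable. It is a simplification: the `is_green` predicate and the `unreliable` set are hypothetical stand-ins, and the paper learns 2- and 3-gram signatures with a mixed-integer linear program rather than using a fixed per-token set.

```python
import math
from typing import Callable, Iterable, Sequence


def green_z_score(tokens: Sequence[int],
                  is_green: Callable[[int], bool],
                  gamma: float,
                  unreliable: Iterable[int] = ()) -> float:
    """z = (green - gamma * T) / sqrt(T * gamma * (1 - gamma)),
    computed only over tokens that survive the filter."""
    drop = set(unreliable)
    kept = [t for t in tokens if t not in drop]
    total = len(kept)
    green = sum(1 for t in kept if is_green(t))
    return (green - gamma * total) / math.sqrt(total * gamma * (1 - gamma))


def even_is_green(token: int) -> bool:
    """Toy green list (assumption): even token ids are green."""
    return token % 2 == 0


tokens = [0, 2, 4, 6, 1, 3]
z_raw = green_z_score(tokens, even_is_green, gamma=0.5)
z_filtered = green_z_score(tokens, even_is_green, gamma=0.5, unreliable={1, 3})
```

In this toy example, removing the two uninformative tokens raises the z-score, which is the effect the summary attributes to filtering in weak-signal settings.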

Measurement noise limits the advantage of nonlinear models over linear models in biomedical prediction

arXiv cs.LG · Marc-Andre Schulz, Kerstin Ritter · 2026-06-16

The study demonstrates that measurement noise fundamentally limits the advantage of nonlinear models over linear models in biomedical prediction tasks. By analyzing how additive noise attenuates nonlinear interactions (degree-k terms decay as reliability^k) while affecting linear terms only linearly, the authors show that typical biomedical measurement reliabilities erase nonlinear signal. They formalize this through an excess-risk identity combining psychometrics and Gaussian analysis. Empirical validation across 140 UK Biobank tasks reveals the predicted noise signature, with flexible models outperforming linear models only when measurement reliability, sample size, and feature representation jointly permit it.

measurement noisenonlinear interactionsexcess-risk identitybiomedical predictionmodel comparison
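
The attenuation claim above (degree-k terms decay as reliability^k) is easy to tabulate. The sketch below takes that statement at face value; the reliability of 0.7 is an illustrative assumption, not a figure from the paper.

```python
def attenuation(reliability: float, degree: int) -> float:
    """Attenuation factor for a degree-k term, per the reliability^k rule."""
    return reliability ** degree


# Illustrative reliability (assumption): 0.7.
factors = {k: attenuation(0.7, k) for k in (1, 2, 3)}
for k, factor in factors.items():
    print(f"degree {k}: factor {factor:.3f}")
```

At this reliability a linear term keeps a factor of 0.7 while a cubic interaction keeps roughly 0.34, which illustrates why the advantage of flexible models can shrink quickly as measurements get noisier.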

P$^2$CE: Model-Agnostic Plausible Pareto-Optimal Counterfactual Explanations

arXiv cs.LG · Arthur Hendricks Mendes de Oliveira, Giovani Valdrighi, Marcos Medeiros Raimundo · 2026-06-16

The paper introduces P$^2$CE, a model-agnostic algorithm for generating plausible Pareto-optimal counterfactual explanations that balance feasibility, plausibility, and computational efficiency. The method employs an isolation forest outlier detector to ensure explanations align with the data distribution and leverages SHAP values for efficient computation across arbitrary models. Empirical evaluation on three datasets demonstrates P$^2$CE's superior performance in solution quality and computational efficiency compared to existing techniques.

counterfactual explanationspareto-optimalisolation forestshap valuesmodel-agnostic
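
Pareto optimality, the selection criterion named above, can be shown with a generic non-dominated filter. The two objectives used here (distance to the original instance and an outlier score, both minimized) are assumptions for illustration; the summary does not specify P$^2$CE's objectives or search procedure.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q if it is no worse on every objective and better on one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Keep the points that no other point dominates (all minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidates: (distance to original, outlier score).
candidates = [(1, 5), (2, 2), (3, 3), (5, 1), (2, 6)]
front = pareto_front(candidates)
```

Here (3, 3) is dropped because (2, 2) is better on both objectives, and (2, 6) is dropped because (1, 5) is; the three survivors represent different trade-offs rather than a single best answer.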

MOLAR: Learning Multimodal Molecular Representations from Noisy Labels

arXiv cs.LG · Yingxu Wang, Kunyu Zhang, Nan Yin, Yu Li · 2026-06-16

MOLAR introduces a noise-aware framework for multimodal molecular representation learning that decouples latent clean-property inference from observed noisy labels. The method models graph and text views as contributing residual evidence to a clean-property distribution, while a categorical label-observation channel maps this to recorded labels, enabling posterior reliability estimation. Evaluations on molecular benchmarks with natural and synthetic label noise demonstrate consistent performance gains over baselines, with interpretable reliability and modality-evidence diagnostics provided through visualization.

multimodal learningmolecular representationnoisy labelsresidual evidencelabel reliability

SCOPE-FL: A Strategy-proof Chain-based Optimal pareto efficient Federated Learning System

arXiv cs.LG · Seyed Salar Ghazi, Kaiwen Zhang, Mehdi Feizi, Hans-Arno Jacobsen · 2026-06-16

SCOPE-FL introduces a strategy-proof, Pareto-efficient federated learning system by reformulating client selection as a two-sided school choice problem solved via the Top Trading Cycle algorithm. The framework employs blockchain smart contracts for tamper-proof execution and approximates Shapley values through One-Round Reconstruction for fair reward distribution. Evaluations on MNIST, Fashion-MNIST, and CIFAR-10 show superior accuracy, convergence, and reward efficiency compared to DA, IAS, and other baselines, with competitive communication latency and reduced blockchain overhead.

hierarchical federated learningpareto efficiencystrategy-proofnesstop trading cycleshapley value

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

arXiv cs.LG · Dibyanayan Bandyopadhyay, Asif Ekbal · 2026-06-16

The paper introduces a certification framework for evaluating the faithfulness of sparse autoencoder (SAE)-based explanations in language models (LMs). By replacing native hidden activations with SAE reconstructions, the framework derives an upper bound on the base model's expected risk using proxy risk, SAE reconstruction gap, concept-pool mismatch, and sparse complexity. Empirical validation on GPT-2 Small, Gemma-2B, and Llama-3-8B demonstrates non-vacuous bounds at practical sample sizes, with layerwise analysis revealing depth-dependent certification ease in Llama-3-8B. Feature-shuffling ablations confirm the framework's ability to distinguish semantic alignment from statistical sparsity.

sparse autoencoderlanguage modelsproxy riskreconstruction gapsemantic alignment

Do Time Series Foundation Model Benchmarks Hide Regime-Dependent Failures? Evidence from Traffic Speed Forecasting

arXiv cs.LG · Yingshuo Wang, Xian Sun, Lingdong Kong, Wei Gao · 2026-06-16

The paper reveals that standard aggregate metrics in time series foundation model (TSFM) benchmarks obscure regime-dependent failures, particularly during traffic state transitions. Through regime-stratified evaluation of three TSFMs on traffic speed forecasting, the authors show MAE increases to 11 mph (vs. 3 mph overall) and 90% prediction interval coverage drops to 55% during transitions. A historical conditional baseline outperforms TSFMs in transition coverage but lags in overall accuracy. The proposed bimodal mixture augmentation (BMA) combines TSFM forecasts with historical distributions, improving transition coverage while maintaining accuracy. Results advocate for regime-aware evaluation in TSFM benchmarks.

time series foundation modelsregime-stratified evaluationbimodal mixture augmentationprediction-interval coveragetraffic speed forecasting
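
Regime-stratified evaluation, as described above, amounts to computing MAE and interval coverage per regime instead of over the whole test set. The sketch below uses made-up toy records and hypothetical regime labels; it does not reproduce the paper's data or its regime definition.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, float, float, float, float]  # regime, truth, pred, lo, hi


def stratified_metrics(records: Iterable[Record]) -> Dict[str, Dict[str, float]]:
    """Per-regime MAE and prediction-interval coverage."""
    groups = defaultdict(list)
    for regime, y, pred, lo, hi in records:
        groups[regime].append((abs(y - pred), lo <= y <= hi))
    return {
        regime: {
            "mae": sum(err for err, _ in rows) / len(rows),
            "coverage": sum(hit for _, hit in rows) / len(rows),
        }
        for regime, rows in groups.items()
    }


# Toy speeds in mph (fabricated for illustration only).
toy = [
    ("steady", 60, 58, 55, 65),
    ("steady", 62, 63, 58, 68),
    ("transition", 30, 45, 40, 50),
    ("transition", 50, 44, 40, 52),
]
metrics = stratified_metrics(toy)
```

An aggregate over all four toy records would report an MAE of 6.0 and 75% coverage, hiding the fact that the transition rows carry almost all of the error.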

Structural MRI Synthesis for Alzheimer's Disease via Conditional Diffusion on Anatomical Masks

arXiv cs.LG · Muge Zhang, Muhammad Ali Khaliq, Jamal Alsakran, Byeong Kil Lee · 2026-06-16

The study extends Med-DDPM, a conditional diffusion model, to synthesize 3D structural MRIs for Alzheimer's Disease (AD) by conditioning on anatomical segmentation masks from ADNI. This approach captures subtle AD-related anatomical changes, leveraging Med-DDPM's structural fidelity. Evaluation shows segmentation models trained on synthetic data achieve comparable Dice scores (0.6532) to real data (0.6513), with hybrid datasets (real + synthetic) outperforming both (Dice 0.7244), demonstrating the utility of synthetic data for enhancing neuroimaging studies.

conditional diffusionstructural mrialzheimer's diseaseanatomical masksdice score
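
The Dice scores quoted above measure overlap between a predicted and a reference segmentation mask. A minimal sketch for flattened binary masks follows; returning 1.0 when both masks are empty is a convention chosen here, not something the summary specifies.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for flat 0/1 masks."""
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2 * intersection / total


# Toy masks: one overlapping voxel out of two positives each.
score = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```

For multi-class anatomical masks, a common approach is to compute this per label and average, though the summary does not say which averaging the study used.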

ThousandWorlds: A benchmark for climate emulation of potentially habitable exoplanets

arXiv cs.LG · Edward T. Stevenson, Mei Ting Mak, Eric Wolf, Denis E. Sergeev · 2026-06-16

The authors introduce ThousandWorlds, a machine-learning benchmark for emulating exoplanet climates using global climate models (GCMs). The dataset comprises ~1800 simulations from five GCMs, mapping eight planetary parameters to 3D atmospheric fields (temperature, humidity, winds, etc.), with three nested subsets for progressively challenging tasks. Evaluation protocols compare methods against both each other and inter-GCM disagreement. Baseline tests show Gaussian process methods outperform deep learning, highlighting current limitations in low-data, multi-simulator regression. The resource includes open data and code for reproducibility.

exoplanet climate emulationglobal climate modelsparameter-to-field regressiongaussian processesmulti-simulator benchmark

Neural Network Implementation of the Renormalization Group for Fault Diagnosis with Class Imbalance

arXiv cs.LG · Evgeny Nikulchev, Dmitry Ilin · 2026-06-16

RGNet, a novel neural network architecture inspired by the renormalization group (RG), addresses class imbalance and multidimensional noise in fault diagnosis tasks. The model hierarchically coarse-grains the feature space by sequentially compressing input dimensionality and concatenating all scales before classification, enabling simultaneous capture of local details and global patterns. RG-flows, interpretable low-dimensional representations, are introduced and visualized via t-SNE, revealing discrete curvilinear structures that validate the coarse-graining effectiveness. Experiments on the imbalanced AI4I dataset demonstrate RGNet's universality, interpretability, and competitiveness in fault prediction.

renormalization groupcoarse-grainingclass imbalancerg-flowst-sne

OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization

arXiv cs.LG · Longlong Zhu, Jiashuo Yu, Zedi Chen, Yuhan Wu · 2026-06-16

OmniPlan introduces an adaptive framework for network planning optimization that balances timeliness and near-optimality across diverse domains. The method employs an LLM-based interpreter to unify natural-language intents into quantifiable preferences, a mixture-of-experts architecture integrating MIP solvers, heuristics, and DRL models, and a DRL-based expert configuration module for preference alignment. Evaluated on distributed ML inference tasks, OmniPlan reduces latency by up to 97.8% and resource consumption by 11.5% compared to existing solutions.

network planning optimizationmixture-of-expertslarge language modeldeep reinforcement learningmixed integer programming

Reliable Neural-Codec Text-to-Speech by ASR Self-Verification and Distillation: Near-Zero Catastrophic Failures Across Models and Codecs

arXiv cs.LG · Ali Asaria, Tony Salomone, Deep Gandhi · 2026-06-16

The paper addresses catastrophic failures in autoregressive neural-codec text-to-speech (TTS) models, such as silence or repetitive outputs, by proposing ASR self-verification and distillation. Best-of-N ASR verification reduces failure rates to near-zero (N=2 on LibriSpeech, N=4 on hard prompts) across four TTS systems and three codecs (XCodec2, SNAC, Mimi). Distillation transfers robustness to single-shot decoding, mitigating 52-58% of failures on hard inputs without test-time cost. Direct preference optimization (DPO/IPO) underperforms supervised distillation, and scale benefits are inconsistent (e.g., larger Llasa model). Rare-word challenges remain unresolved.

asrneural-codecdistillationcatastrophic failurestext-to-speech
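
Best-of-N ASR verification can be sketched as: sample up to N candidates, transcribe each, and keep the one whose transcript has the lowest word error rate against the input text. In the sketch below, `synthesize` and `transcribe` are hypothetical callables standing in for a TTS model and an ASR model, and the 0.1 early-stop threshold is an assumption.

```python
from typing import Any, Callable, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)


def best_of_n(text: str,
              synthesize: Callable[[str, int], Any],
              transcribe: Callable[[Any], str],
              n: int = 2,
              max_wer: float = 0.1) -> Tuple[Any, float]:
    """Return the candidate with the lowest WER, stopping early on a pass."""
    best_audio, best_wer = None, float("inf")
    for seed in range(n):
        audio = synthesize(text, seed)
        wer = word_error_rate(text, transcribe(audio))
        if wer < best_wer:
            best_audio, best_wer = audio, wer
        if wer <= max_wer:
            break
    return best_audio, best_wer
```

A silent or looping output produces a transcript far from the input text, so it scores a high WER and is replaced by a later sample. Distillation, as summarized above, would then train on the verified outputs so that a single sample is usually enough.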

📰 Industry Media (7)

Perplexity Launches Brain, a Self-Improving Memory System That Builds a Context Graph of an Agent’s Work and Learns Overnight

MarkTechPost · Asif Razzaq · 2026-06-18

Perplexity introduces Brain, a self-improving memory system for AI agents that constructs a context graph of completed work to enhance performance. The system operates by logging tasks, results, and corrections, then synthesizing these overnight into reusable lessons via a traceable LLM wiki structure. Early internal testing reports +25% answer correctness, +16% recall, and -13% cost on historically contextual tasks compared to non-memory baselines. The architecture enables recursive self-improvement through incremental updates while maintaining provenance links for debugging.

context graphrecursive self-improvementllm wikitraceable memoryovernight synthesis

The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache

MarkTechPost · Arnav Rai · 2026-06-18

Recent KV cache compression techniques address the memory bottleneck in long-context LLMs by quantizing key-value vectors. TurboQuant (ICLR 2026) employs data-oblivious random rotation and optimal scalar quantization, achieving near-lossless compression at 3–4 bits. OSCAR uses attention-aware calibration for INT2 quantization, enabling 7.83× throughput and ~8× memory reduction at 100K context. EpiCache manages multi-turn conversations via episodic clustering, yielding 40% higher accuracy and 3.5× lower peak memory. These methods are complementary, with TurboQuant excelling in generality, OSCAR in deployable INT2, and EpiCache in conversational contexts.

kv cache compressionquantizationattention-aware calibrationepisodic clusteringmemory bottleneck

OpenAI Releases LifeSciBench, a 750-Task Benchmark Grading AI Models on Real Life-Science Research With Expert-Written Rubric

MarkTechPost · Michal Sutter · 2026-06-18

OpenAI introduced LifeSciBench, a 750-task benchmark evaluating AI models on real-world life science workflows through expert-authored rubrics. The benchmark spans seven biological domains (e.g., genomics, medicinal chemistry) and seven workflows (e.g., scientific reasoning, translation), with 79% of tasks requiring multi-step reasoning. Expert cohorts (173 authors, 453 reviewers) constructed tasks with 1,062 artifacts (53% of tasks) and 19,020 atomic criteria. In single-turn evaluations, GPT-Rosalind achieved the highest normalized score (0.576) and task pass rate (36.1%), though performance dropped significantly on artifact-heavy tasks (28.1% vs. 45.1% text-only).

lifescibenchrubric-based gradingmulti-step reasoningartifact-heavy tasksgpt-rosalind

NVIDIA SkillSpector Guide: Scanning AI Skills for Security Risks with Static Analysis and SARIF Reports

MarkTechPost · Sana Hassan · 2026-06-18

The NVIDIA SkillSpector framework enables static analysis of AI skills for security risks through programmatic LangGraph workflows. It evaluates a corpus containing both benign and vulnerable skills, generating risk scores, severity classifications, and SARIF-formatted reports. Key capabilities include customizable analyzers, visualization of risk distributions, and optional LLM-based semantic validation. Results demonstrate effective detection of environment variable harvesting, code execution, and prompt injection vulnerabilities across the tested skill corpus.

static analysissariflanggraphrisk scoringvulnerability detection

Computer vision deployments drive retail productivity gains

AI News · Ryan Daws · 2026-06-18

Computer vision deployments in retail demonstrate significant productivity gains by automating shelf tracking and inventory management, addressing operational inefficiencies costing 6.4% of gross sales. A Coresight Research study, in collaboration with Simbe and RELEX Solutions, reveals that full-scale deployments now cover 60% of enterprise footprints, with top-tier retailers ($5B+ revenue) achieving 73% adoption. Key results include a 40% improvement in picking efficiency (BJ’s Wholesale Club), 14% reduction in manual task hours, and 11% increase in customer lifetime value. However, flawed deployment sequencing—prioritizing pricing software over foundational sensor infrastructure—leads to 13% mispricing rates in 2026.

computer visionshelf digitisationplanogram verificationdigital twinspricing automation

HSBC expands AI banking partnership with Google Cloud

AI News · Muhammad Zulhusni · 2026-06-18

HSBC has expanded its AI banking capabilities through a multi-year partnership with Google Cloud, leveraging Gemini models and the Gemini Enterprise Agent Platform. The collaboration targets over 200 AI use cases across wealth management, financial crime risk management, and internal decision support, with selected initiatives projected to yield over $100 million in revenue or efficiency gains. HSBC will deploy generative AI and agentic AI to enhance financial crime detection, aiming to double intervention speed across nearly one billion monthly transactions. The bank also reported a 15% efficiency gain in software development using AI coding assistants. This partnership builds on HSBC’s existing AI deployments, including over 600 active use cases.

generative aiagentic aigemini modelsfinancial crime detectioncoding assistants

Microsoft sells OpenAI models in China. OpenAI and Anthropic won’t.

AI News · Dashveenjit Kaur · 2026-06-18

Microsoft has established a unique intermediary role in China's AI market by exclusively distributing OpenAI's GPT models to major Chinese tech firms (ByteDance, Ant Group, Meituan, Tencent) through Azure, while OpenAI and Anthropic stay out of the market directly, citing IP and misuse concerns. This arrangement leverages Microsoft's contractual rights with OpenAI, offering Chinese customers cloud-based access to GPT models hosted outside China (e.g., Singapore), while employing automated monitoring to mitigate model distillation risks. Azure's AI revenue in China grew 400% in FY2024 and is projected to triple by June 2025, with ByteDance alone spending >$1B annually on Microsoft's AI/cloud services. Concurrently, Microsoft integrates Chinese models (DeepSeek-V4) into Western enterprise offerings, creating bidirectional AI trade flows.

gpt modelsmodel distillationazure aicloud computingsynthetic data


Generated automatically at 2026-06-18 21:44 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.