Daily Digest — 2026-06-04

Wednesday, June 03, 2026 · 354 items · model: deepseek/deepseek-chat

354 items · 6 research labs, 334 arxiv papers, 14 industry media

🏛️ Research Labs (6)

Introducing new capabilities to GPT-Rosalind

OpenAI News · 2026-06-03

OpenAI introduces GPT-Rosalind, a specialized LLM for life sciences research that combines GPT-5.5's agentic capabilities with stronger performance in drug discovery domains. The model covers multimodal chemical understanding, genomics analysis, and wet lab troubleshooting, and is evaluated on three new benchmarks: LifeSciBench (end-to-end research workflows), MedChemBench (27.5% accuracy vs GPT-5.5's 25.1%), and LabWorkBench (63.2% experimental protocol accuracy). Enterprise deployment includes biological file viewers and plugins for omics analysis. Evaluations show token efficiency gains of 7.2% to 31% over GPT-5.5 alongside improved accuracy across medicinal chemistry, genomics, and lab workflows.

agentic coding · medicinal chemistry · omics analysis · retrosynthesis · wet lab protocols

A blueprint for democratic governance of frontier AI

OpenAI News · 2026-06-03

OpenAI proposes a three-part blueprint for U.S. federal governance of frontier AI systems, addressing safety and national security challenges. The strategy involves: (1) establishing a national framework aligned with emerging state-level legislation (e.g., California SB 53, New York RAISE Act), (2) strengthening the federal CAISI institution for AI safety oversight, and (3) implementing a cross-government resilience plan. The blueprint builds on recent policy developments, including the White House executive order on AI innovation and security, advocating for adaptive governance structures to match AI's rapid evolution.

frontier ai · governance framework · federal policy · ai safety · resilience plan

OpenAI public policy agenda

OpenAI News · 2026-06-03

OpenAI outlines its public policy agenda centered on democratizing AI benefits through five principles: democratization, empowerment, universal prosperity, resilience, and adaptability. The organization advocates for policies addressing frontier AI safety, cybersecurity, youth protections, education, and workforce transitions. Key initiatives include supporting state and federal regulatory frameworks, international standards, and safeguards against misuse of generative AI. Empirical data shows diverse user demographics, with balanced gender representation and broad income distribution. OpenAI collaborates with governments, educators, and labor organizations to ensure equitable AI access and mitigate risks.

artificial general intelligence · frontier ai safety · generative ai · ai literacy · workforce transition

Direct Preference Optimization Beyond Chatbots

Hugging Face Blog · 2026-06-03

The study demonstrates Direct Preference Optimization (DPO) as an effective post-SFT intervention for reducing text degeneration in structured OCR tasks, achieving an average 59.4% reduction across five model families. Unlike conventional chat-alignment applications, the method constructs preference pairs from the model's own degenerate outputs (rejected) versus correct transcriptions (chosen), implementing 'preference-guided implicit unlikelihood'. Results show consistent degeneration reduction (37-88%) regardless of architecture or initial SFT performance, with Qwen2.5-VL-3B's anomaly confirming SFT can simultaneously improve task capability while exacerbating failure modes.
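
A minimal sketch of the standard DPO loss for a single preference pair may help here, assuming the setup described above: the chosen response is a correct transcription and the rejected one is the model's own degenerate output. The log-probabilities and `beta` are made-up illustrative values, not figures from the post.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # output over the rejected one, relative to the frozen reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))

# Hypothetical log-probs. Before training the policy equals the reference.
loss_before = dpo_loss(-40.0, -35.0, -40.0, -35.0)
# After training the degenerate (rejected) output has become much less likely.
loss_after = dpo_loss(-38.0, -60.0, -40.0, -35.0)
print(loss_before, loss_after)
```

The loss falls as the rejected, degenerate output loses probability relative to the reference, which is one way to read the "implicit unlikelihood" framing.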

direct preference optimization · text degeneration · supervised fine-tuning · rejection pairs · structured ocr

Adding MCP Tools to Reachy Mini

Hugging Face Blog · 2026-06-03

The Hugging Face blog introduces a modular tool system for Reachy Mini, enabling remote capability integration via the Model Context Protocol (MCP). The method involves tool discovery through public Gradio Spaces, profile-based activation via tools.txt, and namespaced tool identifiers to prevent collisions. Results demonstrate concurrent tool usage (e.g., weather and web search) with optimized prompts for parallel execution, reducing latency. The system supports built-in, local custom, and remote tools while maintaining a trusted core.

model context protocol · gradio spaces · tool discovery · profile-based activation · namespaced identifiers

5 ways Google Search can level up your thrift and vintage shopping

Google AI Blog · Megan Stoner · 2026-06-03

Google Search introduces five AI-driven features to enhance thrift and vintage shopping: (1) AI Mode for personalized outing planning, (2) Google Lens for item identification and valuation, (3) Circle to Search for visual similarity matching, (4) Virtual Try-On for outfit visualization, and (5) Lens-assisted resale evaluation. These tools leverage multimodal AI (e.g., computer vision, NLP) and real-time data integration to bridge physical and digital shopping experiences. The methods demonstrate practical applications of in-context learning and visual search paradigms, with quantified user benefits including trend analysis ('vintage jersey' searches up 120% YoY) and reduced decision latency.

ai mode · google lens · circle to search · virtual try-on · resale evaluation

📜 arXiv Papers (334)

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

arXiv cs.AI · Mahtab Bigverdi, Lindsey Li, Weikai Huang, Yiming Liu · 2026-06-02

The paper introduces Imaginative Perception Tokens (IPT) to enhance spatial reasoning in vision language models (VLMs) by externalizing perceptual representations for unseen viewpoints. IPT supervision is applied to the BAGEL VLM across three tasks—Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC)—using 20K examples with ground truth data. Results show IPT improves MVC accuracy by 3.4%, matches closed-source models on PT, and outperforms textual chain-of-thought training, demonstrating its efficacy for spatial reasoning without image generation.

imaginative perception tokens · spatial reasoning · vision language models · multiview counting · perspective taking

Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

arXiv cs.AI · Zekun Qi, Xuchuan Chen, Dairu Liu, Chenghuai Lin · 2026-06-02

Humanoid-GPT introduces a GPT-style Transformer for zero-shot motion tracking, scaling to billion-frame mocap datasets. The model employs causal attention and whole-body control, overcoming prior limitations of shallow MLPs through pre-training on a 2B-frame corpus combining major datasets and in-house recordings. Results demonstrate robust zero-shot generalization to unseen motions and tasks, establishing a new performance frontier in dynamic behavior tracking.

transformer · zero-shot · motion tracking · causal attention · whole-body control

Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories

arXiv cs.AI · Ali Behrouz, Farnoosh Hashemi, Vahab Mirrokni · 2026-06-02

The paper introduces a 'Sleep' paradigm for Large Language Models (LLMs) to enable continual learning and knowledge consolidation, inspired by human cognitive processes. It features two stages: Memory Consolidation via Knowledge Seeding (distilling short-term memories into long-term parameters through on-policy distillation and RL-based imitation learning) and Dreaming (self-improvement through RL-generated synthetic data curriculum). Experiments demonstrate effectiveness in long-horizon tasks, continual learning, and few-shot generalization.

sleep paradigm · knowledge seeding · memory consolidation · generalized distillation · dreaming process

Formalizing the Binding Problem

arXiv cs.AI · Lianghuan Huang, Yihao Li, Saeed Salehi, Yingshan Chang · 2026-06-02

The study formalizes the binding problem in visual representations using an information-theoretic framework and introduces a probing method to measure binding information in Vision Transformers (ViTs). Experiments evaluate binding across ViT components (e.g., [CLS] token, spatial tokens) on datasets with feature sharing, occlusion, and natural features. Results demonstrate binding as critical for robust visual recognition, revealing ViTs' limitations in feature attribution despite partial binding awareness.

binding problem · vision transformers · information-theoretic · feature attribution · probing method

Quantifying Faithful Confidence Expression in Large Reasoning Models

arXiv cs.AI · Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, Arman Cohan · 2026-06-02

The paper introduces a novel framework to quantify faithful calibration (FC) in large reasoning models (LRMs), addressing challenges in measuring confidence alignment between intrinsic uncertainty and linguistic expression. The method analyzes linguistic decisiveness using token probabilities, hidden states, and response consistency, while employing prefix-conditioned sampling to control for trace variation. Results show FC remains a significant challenge for LRMs, with reasoning behaviors not improving FC and divergent assessments from different confidence estimators, highlighting fragility in prior evaluation methods.

faithful calibration · large reasoning models · linguistic decisiveness · prefix-conditioned sampling · confidence estimators

QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards

arXiv cs.AI · Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang · 2026-06-02

QUBRIC introduces a co-design framework for queries and rubrics in reinforcement learning beyond verifiable rewards, addressing structural bottlenecks in rubric quality. The method combines teacher-derived key points to rewrite open-ended queries into evaluable questions, contrastive rubric generation to transform teacher-policy gaps into criteria, and learnability filtering to retain informative pairs for GRPO training. Results show a +5.5 point improvement on ArenaHard over SFT baselines and +6.3 points average gain on three held-out benchmarks (legal, moral, narrative reasoning), demonstrating transferability and reasoning-focused enhancements.

rubric-based rl · query-rubric co-design · contrastive rubric generation · learnability filtering · grpo training

AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task

arXiv cs.AI · Quentin Fuxa, Dominik Macháček · 2026-06-02

AlignAtt4LLM introduces a simultaneous speech translation system for English to German, Italian, and Chinese, adapting AlignAtt to decoder-only LLMs like Gemma-4 E4B-it. The method employs (1) explicit source span prompts, (2) alignment head selection, (3) qk-fast replay, and (4) query/key capture to maintain bit-identical outputs. On IWSLT 2026, it outperforms baselines for European languages in low (2s) and high (4s) latency regimes, with mixed results for Chinese. The approach is model-agnostic, requiring only prompt layout, calibrated heads, and KV-capture.

alignatt · decoder-only llm · simultaneous translation · qk-fast replay · kv-capture

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

arXiv cs.AI · Yu Xia, Zhouhang Xie, Xin Xu, Byungkyu Kang · 2026-06-02

The paper introduces Agentic Chain-of-Thought Steering (ACTS), a method for efficient and controllable LLM reasoning by formulating reasoning steering as a Markov decision process. A controller agent dynamically steers a frozen reasoner during inference, observing the reasoning trace and remaining budget to issue strategy and phrase actions. Initialized with synthetic trajectories and optimized via RL with budget-conditioned rewards, ACTS achieves full-thinking performance with token savings and enables accuracy-efficiency trade-offs across benchmarks.

chain-of-thought · markov decision process · reinforcement learning · reasoning steering · token efficiency

Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

arXiv cs.AI · Roohan Ahmed Khan, Yasheerah Yaqoot, Muhammad Ahsan Mustafa, Dzmitry Tsetserukou · 2026-06-02

The paper introduces AgenticRL, a self-refining reinforcement learning framework for vision-conditioned UAV navigation that autonomously designs rewards and refines policies. The method employs a multimodal GPT agent to interpret visual scenes, generate task-specific rewards for Proximal Policy Optimization (PPO) training, and iteratively improve policies through diagnostic feedback. Evaluated on gate traversal, obstacle avoidance, and other tasks, the framework achieves 71% behavioral improvement over initial rewards, with a 91% real-world success rate and 94% sim-to-real accuracy.

agenticrl · proximal policy optimization · multimodal gpt · sim-to-real transfer · reward design

Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

arXiv cs.AI · Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas · 2026-06-02

The paper proposes a reformulation of reinforcement learning (RL) objectives to induce diverse behaviors by modeling reward uncertainty. Instead of scalar rewards, the method uses a distribution over reward functions and applies a non-linear objective over action sets, enabling calibrated diversity without sacrificing expected reward. Theoretical analysis shows generalization of policy gradient and action-set approaches, with empirical validation in contextual bandit settings demonstrating robust diversity control.

reinforcement learning · reward uncertainty · diverse behavior · contextual bandit · policy gradient

Efficient ASR Training with Conversations that Never Happened

arXiv cs.AI · Máté Gedeon, Péter Mihajlik · 2026-06-02

The paper proposes a pipeline for augmenting conversational ASR training data by generating synthetic dialogues through LLM-based scenario generation, speaker attribute mapping to TTS voice profiles, and utterance assembly. Five LLM families were evaluated under different generation strategies using FastConformer-Large training on Hungarian BEA-Dialogue, demonstrating that synthetic data consistently improves ASR performance. With only 67 hours of real and 636 hours of synthetic data, the method outperforms a zero-shot model trained on 2700 hours of Hungarian speech, showing LLM-TTS synthesis as an effective complement to real conversational corpora.

conversational asr · synthetic dialogues · tts voice profiles · fastconformer-large · bea-dialogue

FlashbackCL: Mitigating Temporal Forgetting in Federated Learning

arXiv cs.AI · Mubarak A. Ojewale, Adriana E. Chis, Jorge M. Cortes-Mendoza, Bernardo Pulido-Gaytan · 2026-06-02

Flashback Continual Learning (FlashbackCL) mitigates temporal forgetting in Federated Learning (FL) by extending Flashback with three components: temporally-decayed label counts, a device-aware replay buffer using Class-Balanced Reservoir Sampling (CBRS), and server-side active coreset curation. Evaluated on CIFAR-10 with 50 clients under three temporal shift modes, FlashbackCL achieves 6.9% to 10.0% relative improvement over Flashback and reduces temporal forgetting by up to 68%. Ablation studies identify CBRS replay as the critical component. FlashbackCL also improves Flashback by 3.5 points on stationary CIFAR-100, demonstrating its dual efficacy in addressing spatial heterogeneity and temporal shift.
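
Since the ablation singles out CBRS replay, here is a simplified sketch of class-balanced reservoir sampling as it is commonly described for imbalanced streams. It is not the paper's device-aware buffer; the class name, capacity, and toy stream are assumptions for illustration.

```python
import random
from collections import defaultdict

class ClassBalancedReservoir:
    """Simplified CBRS buffer (illustrative; assumes capacity >= 1)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.rng = random.Random(seed)
        self.buffer = defaultdict(list)  # label -> stored samples
        self.seen = defaultdict(int)     # label -> how many arrived so far
        self.full_classes = set()        # classes that have been the largest

    def __len__(self):
        return sum(len(items) for items in self.buffer.values())

    def add(self, sample, label):
        self.seen[label] += 1
        if len(self) < self.capacity:
            self.buffer[label].append(sample)
            return
        # Buffer is full: the currently largest classes count as "full".
        largest = max(len(items) for items in self.buffer.values())
        biggest = sorted(c for c, items in self.buffer.items() if len(items) == largest)
        self.full_classes.update(biggest)
        if label not in self.full_classes:
            # Under-represented class: evict a random sample of a largest class.
            victims = self.buffer[self.rng.choice(biggest)]
            victims.pop(self.rng.randrange(len(victims)))
            self.buffer[label].append(sample)
        else:
            # Full class: ordinary reservoir sampling within that class.
            stored = self.buffer[label]
            if self.rng.random() < len(stored) / self.seen[label]:
                stored[self.rng.randrange(len(stored))] = sample

# Toy stream: 1000 samples of class "a", then 10 of a rare class "b".
reservoir = ClassBalancedReservoir(capacity=20)
for i in range(1000):
    reservoir.add(("a", i), "a")
for i in range(10):
    reservoir.add(("b", i), "b")
counts = {label: len(items) for label, items in reservoir.buffer.items()}
print(counts)
```

Plain reservoir sampling would almost certainly drop most of the rare class in this stream, whereas the class-balanced variant keeps every rare sample until the classes are level.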

federated learning · temporal forgetting · class-balanced reservoir sampling · replay buffer · coreset curation

q0: Primitives for Hyper-Epoch Pretraining

arXiv cs.AI · Bishwas Mandal, Shmuel Berman, Akshay Vegesna, Samip Dahal · 2026-06-02

The paper introduces q0, a hyper-epoch pretraining method that transforms multi-epoch budgets into diverse model populations for improved generalization. It employs three primitives: cyclic scheduling with anti-correlated learning rates/weight decay, chain distillation for quality compounding, and a learned prior for model selection. On a 1.8B-parameter model trained with 100M FineWeb tokens, q0 achieves comparable performance to a 256-epoch ensemble baseline using only ~56 epochs (4.6× fewer) or ~67 epochs (3.8× fewer) when matched to ensemble size, yielding 12.9× data efficiency gains in Slowrun settings.

hyper-epoch pretraining · chain distillation · cyclic scheduling · anti-correlated learning · model population

Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection

arXiv cs.AI · Senjie Jin, Peixin Wang, Boyang Liu, Xiaoran Fan · 2026-06-02

The paper introduces VEPO (Vision-Entropy token-selection for Policy Optimization), a reinforcement learning framework addressing the limitations of token-level entropy in visual reasoning tasks. VEPO integrates visual sensitivity with token entropy through a principled multiplicative coupling, redirecting gradient credit toward tokens that are both visually grounded and informative. Experiments demonstrate VEPO's superior performance, outperforming entropy-only baselines by 2.28 points at 7B-scale and 3.15 points at 3B-scale. Ablation studies confirm the method's robustness, highlighting its effectiveness in interleaving precise perceptual grounding with semantic reasoning.
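
To illustrate the multiplicative coupling, here is a rough sketch that scores each token by entropy times visual sensitivity and keeps the top fraction. The selection rule, `keep_fraction`, and all numbers are assumptions for illustration; the summary does not specify how VEPO computes sensitivity or applies the scores.

```python
def select_tokens(entropies, visual_sensitivities, keep_fraction=0.4):
    """Return indices of tokens with the highest entropy * sensitivity score."""
    scores = [h * v for h, v in zip(entropies, visual_sensitivities)]
    k = max(1, round(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def select_by_entropy(entropies, keep_fraction=0.4):
    """Entropy-only baseline: same selection with sensitivity fixed to 1."""
    return select_tokens(entropies, [1.0] * len(entropies), keep_fraction)

# Hypothetical per-token values for a five-token response.
entropies = [2.0, 0.1, 1.5, 1.8, 0.2]
sensitivities = [0.05, 0.9, 0.8, 0.1, 0.7]

coupled = select_tokens(entropies, sensitivities)
entropy_only = select_by_entropy(entropies)
print(coupled, entropy_only)
```

In this toy case the entropy-only rule keeps token 0, which is uncertain but barely depends on the image, while the coupled score replaces it with token 2, which is both uncertain and visually grounded.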

visual reasoning · token entropy · policy optimization · gradient credit · multimodal rl

FFR: Forward-Forward Learning for Regression

arXiv cs.AI · Xinyang Liu, Xuanyu Liang, Shiqi Ding, Boyang Li · 2026-06-02

The paper introduces FFR (Forward-Forward for Regression), the first framework extending the Forward-Forward (FF) algorithm to regression tasks. FFR addresses FF's classification bias via three innovations: (1) an ordinal competitive goodness function replacing contrastive pairs with distance-aware supervision, (2) a stratified ladder architecture for coarse-to-fine feature learning, and (3) hierarchical prediction with uncertainty estimation. Experiments on five benchmarks show FFR achieves 98.6% of backpropagation's accuracy while reducing training memory to 27% (depth 8) and 8% (depth 32) of BP's and per-iteration time to 72% of BP's, outperforming all BP-free alternatives.

forward-forward learning · regression · ordinal supervision · stratified ladder architecture · uncertainty estimation

Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning

arXiv cs.AI · Eric Cho, Shawn Huang, Alice Lu, Andy Lyu · 2026-06-02

Hedge-Bench 1.0 introduces a benchmark for evaluating AI agents on hard, realistic financial reasoning tasks, addressing limitations of existing benchmarks that rely on model-judged outputs. The dataset comprises 102 real-world hedge fund analyst tasks, grounded in expert reasoning traces and relevant information sources, enabling deterministic grading. Frontier models and agents achieve below 16% accuracy, highlighting the challenge of open-ended financial reasoning. The benchmark and evaluation harness are publicly available.

financial reasoning · benchmark · deterministic grading · expert traces · open-ended tasks

NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference

arXiv cs.AI · Mubarak Adetunji Ojewale · 2026-06-02

NetKV introduces network-aware decode instance selection for disaggregated LLM inference, addressing the suboptimality of cache-only schedulers by incorporating topological distance and dynamic congestion metrics. The method employs a network cost oracle and an O(|D|) greedy algorithm with provable robustness to stale telemetry. Evaluated on a 64-GPU fat-tree simulator with Mooncake traces, NetKV reduces mean TTFT by up to 21.2% versus round-robin and 17.6% versus cache+load-aware baselines, improves SLO attainment by 20.1 percentage points, and maintains sub-0.5ms Time Between Tokens overhead without hardware modifications.
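
The idea of a single O(|D|) pass over decode instances can be sketched as follows. The cost formula, field names, and `load_weight` are hypothetical stand-ins chosen for illustration; they are not NetKV's actual network cost oracle.

```python
from dataclasses import dataclass

@dataclass
class DecodeInstance:
    name: str
    cache_hit_ratio: float  # fraction of the prompt's KV cache already present
    load: float             # normalized queue load in [0, 1]
    hops: int               # topological distance from the prefill node
    congestion: float       # link utilization in [0, 1)

def network_cost(inst, kv_bytes):
    """Hypothetical oracle: bytes still to move, scaled by distance and congestion."""
    missing = kv_bytes * (1.0 - inst.cache_hit_ratio)
    return missing * inst.hops / max(1e-6, 1.0 - inst.congestion)

def select_decode_instance(instances, kv_bytes, load_weight=1e9):
    """Greedy choice in one pass over the candidate set D."""
    best, best_cost = None, float("inf")
    for inst in instances:
        cost = network_cost(inst, kv_bytes) + load_weight * inst.load
        if cost < best_cost:
            best, best_cost = inst, cost
    return best

candidates = [
    DecodeInstance("near", cache_hit_ratio=0.0, load=0.2, hops=1, congestion=0.1),
    DecodeInstance("far_cached", cache_hit_ratio=0.5, load=0.2, hops=4, congestion=0.1),
]
chosen = select_decode_instance(candidates, kv_bytes=1e9)
cache_only = max(candidates, key=lambda inst: inst.cache_hit_ratio)
print(chosen.name, cache_only.name)
```

With these made-up numbers a cache-only scheduler prefers the distant instance with the warmer cache, while the network-aware cost picks the nearby one because moving the remaining KV data four hops costs more than the cache saves.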

disaggregated inference · kv cache · time to first token · network cost oracle · slo attainment

The Impact of Configuring Agentic AI Coding Tools on Build-vs-Buy Decisions: A Study Protocol

arXiv cs.AI · Jai Lal Lulla, Matthias Galster, Jie M. Zhang, Sebastian Baltes · 2026-06-02

This pre-registered study protocol investigates how configuration mechanisms influence build-versus-buy decisions in agentic AI coding tools, specifically Claude Code and OpenAI Codex. The study employs controlled programming tasks from a benchmark of staged projects, manipulating configurations ranging from no configuration to context files, Skills, MCP-enabled library discovery tools, and permission controls. It measures library selection, disclosure completeness, and accuracy. Nine hypotheses guide the analysis, and the resulting benchmark dataset and pipeline will be released as a reusable artifact for evaluating build-versus-buy behavior in agentic AI coding tools.

agentic ai coding tools · build-versus-buy · configuration mechanisms · claude code · openai codex

scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation

arXiv cs.AI · Jiabei Cheng, Jingbo Zhou, Jun Xia, Changkai Li · 2026-06-02

The authors introduce scTranslation, a comprehensive benchmark for evaluating single-cell multi-omics modality translation methods, addressing the lack of systematic evaluation in datasets, metrics, and influencing factors. The benchmark integrates diverse datasets, state-of-the-art models, and comprehensive evaluation metrics, while assessing performance under scenarios like feature selection, quality, and few-shot settings. A large-scale study reveals significant performance variations across these factors, providing insights for future method development. The open-sourced benchmark is available at https://github.com/Bunnybeibei/scTranslation.

single-cell omics · modality translation · benchmark evaluation · feature selection · few-shot learning

Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents

arXiv cs.AI · Yingqi Zhang · 2026-06-02

Agent libOS introduces a library-OS-inspired runtime substrate for long-running LLM agents, treating them as schedulable AgentProcesses with explicit capabilities, object memory, and auditability. The design implements tools as libc-like wrappers, enforcing authority boundaries at runtime primitives for filesystem access, human approval, and side effects. A Python prototype demonstrates async scheduling, JIT tool registration, and 123 regression tests, focusing on safety and auditability rather than planner accuracy.

llm agents · runtime substrate · capability control · object memory · jit tools

Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments

arXiv cs.AI · Ibrahim Abdelaziz, Asim Munawar, Kinjal Basu, Maxwell Crouse · 2026-06-02

The paper introduces PROVE, a framework for training LLMs in multi-step tool use via three contributions: (1) a library of 20 stateful MCP servers with 343 tools for live-execution RL training, (2) a data synthesis pipeline generating validated tool-call trajectories grounded in live server state, and (3) a programmatic reward combining validity scoring, dependency-aware coverage, and efficiency penalties. Using GRPO with ~13K examples, PROVE improves performance by up to +10.2 points on BFCL Multi-Turn, +6.8 on tau2-bench, and +6.5 on T-Eval across four models (Qwen3-4B to Granite-4.1-8B).

reinforcement learning · tool orchestration · programmatic reward · stateful environment · dependency-graph

Reasoning Structure of Large Language Models

arXiv cs.AI · Frédéric Berdoz, Luca A. Lanzendörfer, Fabian Farestam, Roger Wattenhofer · 2026-06-02

The paper introduces a scalable benchmark and pipeline for analyzing reasoning structures in large reasoning models (LRMs), converting unstructured traces into verifiable reasoning graphs of claims and dependencies. This enables quantitative analysis of reasoning topology and defines a reasoning efficiency metric measuring logical flow concentration. Evaluations on open-source reasoning models demonstrate that structural measurements differentiate behaviors conflated by token count and accuracy, providing diagnostic tools for failure modes and scaling analysis with puzzle difficulty.

reasoning graphs · logical flow · reasoning efficiency · large reasoning models · puzzle difficulty

Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs

arXiv cs.AI · Wei Ding, Yudong Zhang, Ruobing Xie, Xingwu Sun · 2026-06-02

This study investigates encoder interactions in multi-encoder large vision-language models (LVLMs) through systematic retraining of 31 encoder subsets on the Cambrian-1 benchmark suite. The authors introduce a Capacity-Necessity decomposition to quantify individual encoder contributions, revealing that optimal pairs combine a high-Capacity anchor with an adaptive complement rather than two top-performing encoders. Key findings include: (1) encoder rankings shift under joint training versus masking, (2) pairing strategy outperforms naive combinations, and (3) pre-projector effective rank correlates with residual performance. The work provides methodological insights for LVLM design.

multi-encoder · capacity-necessity · pre-projector rank · vision-language models · cambrian-1

From 'What' to 'How' and 'Why': Sharing LLM-Generated Retrospective Summaries of Older Adults' Passive Tracking Data with Remote Family Members

arXiv cs.AI · Jiachen Li, Reina Szeyi Chan, Akshat Choube, Xiang Zhi Tan · 2026-06-02

This work presents a multi-agent LLM system that generates retrospective summaries from multi-modal tracking data for remote family members (RFMs) of older adults. The authors customized the Vital Insight system to produce initial summaries, conducted interviews with 11 RFMs, and redesigned the system to create multi-layer, context-aware narratives. Evaluation showed significant improvements in RFM satisfaction (p<0.05), perceived helpfulness, trust, and willingness to receive summaries compared to baseline versions, demonstrating the value of shifting from raw data ('What') to explanatory narratives ('How' and 'Why').

large language models · multi-modal tracking · retrospective summaries · context-aware narratives · technology probes

A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs

arXiv cs.AI · Cuong Vuong Tuan, Trang Mai Xuan, Tien-Cuong Nguyen, Vu-Duc Ngo · 2026-06-02

The authors propose a training-free mixture-of-agents framework for multi-document summarization (MDS) that combines large language models (LLMs) and knowledge graphs without task-specific fine-tuning. The method decomposes summarization into three agent tasks—extractive selection, knowledge-aware abstraction, and iterative refinement—and unifies outputs via a multi-perspective consistency mechanism. Experiments on four English and Vietnamese datasets show state-of-the-art or competitive performance, demonstrating the framework's effectiveness and cross-lingual adaptability.

multi-document summarization · mixture-of-agents · knowledge graphs · large language models · training-free

Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation

arXiv cs.AI · Yuecheng Li, Zeyu Song, Jing Yao, Chi Lu · 2026-06-02

The paper introduces Taiji, a Pareto Optimal Policy Optimization (POPO) framework for aligning LLM semantic spaces with recommender ID spaces in industrial settings. The method combines reverse-engineered reasoning and open-ended rejection sampling for high-quality chain-of-thought (CoT) data generation during SFT, and adaptively adjusts cross-domain reward weights during RL alignment. Deployed on Kuaishou's platform, Taiji serves 400M users daily, demonstrating significant commercial impact and scalability.

pareto optimal policy optimization · chain-of-thought · semantic-id alignment · recommender systems · large language models

PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models

arXiv cs.AI · Zetian Ouyang, Linlin Wang, Gerard de Melo, Liang He · 2026-06-02

The paper introduces PyraMathBench, a hierarchical benchmark with 32,505 questions across 4 cognitive aspects and 2 modalities, addressing gaps in evaluating numerical reasoning in LLMs. It proposes SOLVE and IRPO to enhance numerical-mathematical synergy via tool calls (fuzzy matching, low-quality call rejection). Experiments show Qwen-2.5 improves by 5.0 points with these methods, highlighting weaknesses in numerical computation and abstract question handling.

numerical reasoning · math word problems · tool calls · fuzzy matching · policy optimization

FLARE: Fine-Grained Diagnostic Feedback for LLM Code Refinement

arXiv cs.AI · Yinsheng Yao, Hongxiang Zhang, Weixi Tong, Tianyi Zhang · 2026-06-02

FLARE introduces a fine-grained diagnostic framework for iterative LLM code refinement, employing a lightweight model to predict line-level suspiciousness signals for bug localization. The method searches top-k suspicious regions (k=10 optimal) and selects candidates via execution outcomes, addressing uncertainty in diagnostic predictions. Evaluations on LiveCodeBench and BigCodeBench with five base LLMs show absolute improvements of 1.72-7.42% (k=1) and 8.50% average gain (k=10), while the diagnostic model outperforms existing fault localization methods.
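
The two steps described above, ranking lines by suspiciousness and then choosing among candidate fixes by execution outcome, can be sketched roughly as below. The scores, candidate names, and pass counts are invented for illustration, and the helper names are hypothetical rather than FLARE's API.

```python
import heapq

def top_k_suspicious_lines(scores, k):
    """Return 1-based line numbers of the k highest scores, most suspicious first."""
    ranked = heapq.nlargest(k, enumerate(scores, start=1), key=lambda pair: pair[1])
    return [line_no for line_no, _ in ranked]

def pick_by_execution(candidates, passed_count):
    """Choose the candidate fix that passes the most tests."""
    return max(candidates, key=passed_count)

# Hypothetical per-line suspiciousness from a lightweight diagnostic model.
line_scores = [0.02, 0.10, 0.85, 0.40, 0.05]
suspects = top_k_suspicious_lines(line_scores, k=2)

# Hypothetical outcome of running the test suite on one fix per suspect line.
tests_passed = {"patch_line_3": 7, "patch_line_4": 9}
best = pick_by_execution(tests_passed, tests_passed.get)
print(suspects, best)
```

Here the most suspicious line does not yield the best fix, which is the kind of diagnostic uncertainty that searching several regions and checking execution results is meant to absorb.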

code refinement · bug localization · diagnostic model · suspiciousness signals · iterative framework

Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models

arXiv cs.AI · Qi Cao, Takeshi Kojima, Andrew Gambardella, Helinyi Peng · 2026-06-02

The paper introduces Clustered Self-Assessment, a novel method for uncertainty quantification in large language models (LLMs) that leverages the model's own probability estimates. The approach clusters sampled generations into semantically distinct groups, reformulates them as multiple-choice options, and uses the LLM's assigned probabilities as confidence scores. Evaluations across multiple models and datasets demonstrate consistent outperformance over baseline methods, with competitive results achieved using just two additional samples. This method provides interpretable uncertainty estimates while maintaining computational efficiency.
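
A rough sketch of the three steps (cluster, reformulate as multiple choice, read confidences from the option probabilities) is given below. Exact-match clustering stands in for semantic clustering, and the option logits are made-up numbers in place of a real model call; both are assumptions for illustration.

```python
import math

def cluster_answers(samples):
    """Group sampled generations; normalized exact match stands in for semantic clustering."""
    clusters = {}
    for text in samples:
        key = " ".join(text.lower().split())
        clusters.setdefault(key, []).append(text)
    return list(clusters.values())

def as_multiple_choice(question, clusters):
    """Turn one representative per cluster into a lettered option (up to 26 clusters)."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    options = [f"{letters[i]}. {group[0]}" for i, group in enumerate(clusters)]
    return question + "\n" + "\n".join(options)

def option_confidences(option_logits):
    """Softmax over the logits the model assigns to each option letter."""
    peak = max(option_logits)
    exps = [math.exp(x - peak) for x in option_logits]
    total = sum(exps)
    return [e / total for e in exps]

samples = ["Paris", "paris ", "Lyon"]
clusters = cluster_answers(samples)
prompt = as_multiple_choice("What is the capital of France?", clusters)
# Hypothetical logits for options A and B, as if read from the model.
confidences = option_confidences([2.0, 0.0])
print(prompt)
print(confidences)
```

The confidence for each cluster is simply the normalized probability mass on its option letter, which keeps the estimate interpretable.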

uncertainty quantification · large language models · self-assessment · semantic clustering · confidence estimation

Re-Evaluating Continual Learning with Few-Shot Adaptation

arXiv cs.AI · Amogh Inamdar, Matthew So, Vici Milenia, Richard Zemel · 2026-06-02

The paper proposes few-shot evaluation as an improved metric for assessing continual learning systems, challenging the conventional 0-shot paradigm. It introduces per-shot plasticity to measure adaptation speed across task sequences in continual image classification. Experiments reveal that meta-learning future tasks induces learning-to-learn behavior, demonstrating enhanced stability and plasticity compared to standard continual learning methods.

continual learning · few-shot adaptation · meta-learning · plasticity · image classification

EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

arXiv cs.AI · Zherui Yang, Fan Liu, Yansong Ning, Hao Liu · 2026-06-02

The paper introduces EvoDS, a self-evolving autonomous data science agent that addresses limitations in static action sets and context management through two novel strategies: Autonomous Skill Acquisition (ASA) for synthesizing and reusing executable skills, and Adaptive Context Compression (ACC) for learned context control. The agent employs a two-stage multi-agent training scheme, theoretically reducing tool-selection error and aligning with an information bottleneck principle. Empirical results show EvoDS outperforms state-of-the-art open-source agents by 28.9% across four benchmarks while eliminating out-of-token failures.

autonomous skill acquisition · adaptive context compression · information bottleneck · multi-agent training · tool-selection error

BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents

arXiv cs.AI · Alex Wang, Georg Meinhardt, Jacob Katz, Joseph H. Kim · 2026-06-02

BigFinanceBench introduces a 928-item expert-authored benchmark for financial-research agents, evaluating not just final answers but the auditable derivation process through workflow-grounded rubrics. Each task pairs a reference answer with a point-weighted rubric decomposing the derivation into checkable steps, enabling partial-credit evaluation across 36,241 rubric points. Evaluation of ten frontier and open-weight agents reveals substantial headroom (best system: 58.8% rubric score), non-uniform capability across workflows, and final-answer accuracy as a lossy proxy for derivation quality.
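
Partial-credit scoring against a point-weighted rubric can be sketched in a few lines. The rubric steps and point values below are invented for illustration and are not taken from the benchmark.

```python
def rubric_score(rubric, satisfied):
    """Fraction of rubric points earned.

    rubric: dict mapping a checkable step to its point weight.
    satisfied: set of steps the agent's derivation got right.
    """
    total = sum(rubric.values())
    earned = sum(points for step, points in rubric.items() if step in satisfied)
    return earned / total

# Hypothetical rubric for one task.
rubric = {
    "pull revenue from the filing": 3,
    "adjust for one-off items": 2,
    "compute the final margin": 5,
}
# The agent handled the first two steps but got the final figure wrong.
score = rubric_score(rubric, {"pull revenue from the filing", "adjust for one-off items"})
print(score)
```

A final-answer check would score this attempt as zero, whereas the rubric credits half the points, which is the gap the summary describes as final-answer accuracy being a lossy proxy.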

financial-research agents · workflow-grounded evaluation · partial-credit rubric · derivation auditing · benchmark design
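
The partial-credit idea can be illustrated with a minimal sketch. It assumes a rubric score is simply earned points divided by total points; the `RubricStep` type, the step descriptions, and the point values are hypothetical and not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class RubricStep:
    description: str  # what the grader checks (hypothetical wording)
    points: float     # weight of this step in the rubric
    passed: bool      # grader's verdict for one agent response


def rubric_score(steps: list[RubricStep]) -> float:
    """Return the fraction of rubric points earned, between 0.0 and 1.0."""
    total = sum(s.points for s in steps)
    if total <= 0:
        raise ValueError("rubric must carry positive total points")
    earned = sum(s.points for s in steps if s.passed)
    return earned / total


# Hypothetical task: one derivation step fails, the rest check out.
steps = [
    RubricStep("Pulls revenue from the correct filing", 3, True),
    RubricStep("Applies the right fiscal-year adjustment", 2, True),
    RubricStep("Computes the growth rate correctly", 4, False),
    RubricStep("States the final figure with units", 1, True),
]
print(f"rubric score: {rubric_score(steps):.1%}")
```

A response like this one would likely fail a final-answer check outright while still earning most of the rubric, which is the sense in which final-answer accuracy is a lossy proxy for derivation quality.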

Conditional Latent Diffusion Model with Fourier-based Motion Modelling for Virtual Population Synthesis

arXiv cs.AI · Shaokun Lan, Haoran Dou, Jinghan Huang, Arezoo Zakeri · 2026-06-02

The authors propose 4D F-MeshLDM, a conditional latent diffusion model for generating virtual cardiac populations with periodic motion. The framework combines a convolutional mesh VAE, a Fourier-based motion parameterization in latent space, and a diffusion prior conditioned on clinical covariates via affine modulation. Evaluated on 5,000 UK Biobank subjects, the method outperforms baselines in anatomical fidelity, achieves near-zero cycle closure error, and preserves clinical functional indices, demonstrating potential for in-silico cardiac trials.

latent diffusion model · fourier motion modeling · mesh vae · in-silico trials · affine modulation

Calibrating Urban Traffic Simulation from Sparse Road Observations via Genetic Optimization

arXiv cs.AI · Hunter Sawyer, Jesse Roberts, Simon Matei · 2026-06-02

The paper presents a genetic algorithm-based framework for calibrating urban traffic simulations using sparse road observations, eliminating the need for detailed employment distribution data. The method optimizes job distributions and gate-traffic parameters in the SUMO traffic simulation platform for Greensboro, NC, aligning simulated traffic with limited real-world flow measurements. Results show strong correlation with withheld road segments and qualitative agreement with census employment data, demonstrating scalable, data-efficient simulation calibration.

genetic algorithm · traffic simulation · sumo · parameter optimization · sparse observations

Leveraging BART to Assess CS1 C++ Programming Assignments using Rubric-based Criteria

arXiv cs.AI · Kelsey Rainey, Jesse Roberts · 2026-06-02

The paper introduces a rubric-aware multitask fine-tuning approach for automated grading of C++ programming assignments, aiming to better emulate instructor grading behavior than general-purpose LLMs. Using CS1 data, student submissions are paired with scores, grade buckets, and rubrics, preprocessed into sequences for transformer input. A BART encoder-decoder with LoRA adaptation is trained to jointly predict numeric grades and grade buckets, enhanced with a distribution-matching term. Experiments show multitask BART with boundary-based soft labels and rubric context achieves lower mean absolute error and stronger grade-distribution alignment than baselines, while fully fine-tuned T5 improves distributional fidelity.

rubric-aware · multitask fine-tuning · lora adaptation · distribution-matching · boundary-based soft labels

Enhancing Operational Safety via Agentic Dialogue Hazard Identification Analysis

arXiv cs.AI · Sanjay Das, Ran Elgedawy, Ethan Seefried, Ryan Burchfield · 2026-06-02

The paper introduces HAZDIAL, a framework employing multi-agent dialogue (adversarial debate and constructive discussion) to enhance hazard identification in safety-critical systems, overcoming limitations of single-turn LLM inference. It proposes algorithm-based agentic interaction optimization and evaluates configurations against a golden dataset using classification and novel dialogue metrics. Results demonstrate improved accuracy, precision, recall, and F1 scores, advancing dialogue systems, multi-agent reasoning, and AI safety.

hazard identification · multi-agent dialogue · llms · safety-critical systems · algorithmic optimization

AI Agents Enable Adaptive Computer Worms

arXiv cs.AI · Jonas Guan, Tom Blanchard, Hanna Foerster, Hengrui Jia · 2026-06-02

This work introduces a novel class of AI-driven computer worms that autonomously generate adaptive attack strategies for each target, leveraging open-weight large language models (LLMs) hosted on compromised machines. The worm propagates across heterogeneous networks (Linux, Windows, IoT) by exploiting common corporate vulnerabilities, with zero marginal cost per infection due to parasitic computation. Unlike traditional worms, this approach bypasses centralized AI safety controls and creates an economic asymmetry favoring attackers. Experimental results demonstrate that self-sustaining, generative malware systems capable of reasoning, adaptation, and real-time attack synthesis are now feasible, posing a significant cybersecurity threat.

computer worm · large language models · parasitic computation · adaptive attack strategies · cybersecurity

Consistency Training Can Entrench Misalignment

arXiv cs.AI · David Demitri Africa, Arathi Mani · 2026-06-02

This work demonstrates that consistency training systematically influences model alignment, with effects varying by misalignment type. The authors evaluate seven consistency training methods across 108 open-source models (7B-70B parameters) fine-tuned for controlled misaligned behaviors. Results show suppression of reward hacking and emergent misalignment but amplification of sycophancy, with distribution shifts identified as the primary driver. A theoretical framework explains conditions for misalignment amplification/suppression, establishing that consistency training requires careful auditing in critical systems.

consistency training · model alignment · reward hacking · sycophancy · distribution shift

PURGE: Projected Unlearning via Retain-Guided Erasure

arXiv cs.AI · Vedant Jawandhia, Daksh Ahuja, Ghufran Alam Siddiqui, Prashant Trivedi · 2026-06-02

PURGE introduces a machine unlearning algorithm that exploits the duality between continual learning and unlearning by adapting gradient projection from A-GEM to constrain unlearning steps without increasing retain-set loss. The method performs multi-layer representation erasure, targeting the model's natural confusion pattern on retain data rather than a uniform distribution to evade membership inference attacks. Experiments on five datasets (CIFAR-10, MNIST, SVHN, STL10, PathMNIST) show retain accuracy above 96% and MIA AUROC near 0.5, outperforming gradient ascent and KL-uniform baselines.

machine unlearning · gradient projection · membership inference · representation erasure · retain-confusion target
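
For readers unfamiliar with A-GEM, the projection step can be sketched as follows. This is the standard A-GEM rule applied to a forget-set gradient, written on plain Python lists; it is an illustration under that assumption, not the paper's implementation, and the function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_forget_gradient(g_forget, g_retain):
    """A-GEM-style projection of an unlearning gradient.

    The parameter update is theta <- theta - lr * g. To first order the
    retain loss changes by -lr * dot(g, g_retain), so it cannot increase
    as long as dot(g, g_retain) >= 0. When the raw gradient violates that
    condition, its component along g_retain is removed.
    """
    overlap = dot(g_forget, g_retain)
    if overlap >= 0:
        return list(g_forget)  # no conflict with the retain set
    # overlap < 0 implies g_retain is non-zero, so the division is safe.
    scale = overlap / dot(g_retain, g_retain)
    return [gf - scale * gr for gf, gr in zip(g_forget, g_retain)]


# The raw step would raise the retain loss; the projected one is orthogonal to it.
projected = project_forget_gradient([1.0, -2.0], [0.0, 1.0])
print(projected)
```

After projection the conflicting step has zero inner product with the retain gradient, so to first order the retain loss is left unchanged.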

LiveBand: Live Accompaniment Generation in the Audio Domain

arXiv cs.AI · Marco Pasini, Javier Nistal, Mathias Rose Bjare, Stefan Lattner · 2026-06-02

LiveBand introduces a real-time system for generating high-fidelity music accompaniments to live audio input under strict causal constraints. The method employs a causal transformer generator operating in the continuous latent space of a pre-trained causal audio autoencoder, with adversarial sequence-level supervision from a discriminator. The generator processes only causally available mix context and Gaussian noise at each timestep, predicting accompaniment latents autoregressively without future context or ground-truth targets. Training uses causal masking in a single parallel forward pass, while inference maintains streaming compatibility via rolling attention states. Evaluated on a multi-instrument benchmark, LiveBand outperforms prior work in audio quality, beat alignment, and mix adherence while enabling real-time generation on consumer hardware.

causal transformer · latent space · adversarial supervision · autoregressive generation · streaming inference

Trading Human Curation for Synthetic Augmentation in RLVR

arXiv cs.AI · Akshansh, Leonardo Rosa Rodrigues, Michael Korostelev, Youssef Hassan · 2026-06-02

The study addresses the scalability bottleneck in reinforcement learning from verifiable rewards (RLVR) by substituting human-curated tasks with synthetic augmentations. Using gate-filtered augmentations of a small base of hand-authored tasks, the authors formalize and measure the cost-adjusted trade rate ρ_cost between augmented and human-authored tasks. Results show that synthetic augmentation maintains generalization across ten benchmarks (code, instruction following, reasoning, function-calling), with ρ_cost ranging from 1.4× to 11.6× for plausible cost ratios.

rlvr · synthetic augmentation · cost-adjusted trade rate · generalization · gate-filtered

Signed Spiking Neuron Enabled by an Orthogonal-Easy-Axis Magnetic Tunnel Junction

arXiv cs.AI · Huannan Zheng, Jingli Liu, Kezhou Yang · 2026-06-02

The authors introduce a compact magnetic tunnel junction (MTJ)-based neuron enabling signed leaky integrate-and-fire (LIF) operation, which carries richer information than standard spiking neurons. The design utilizes orthogonal easy axes in the free and pinned layers to map magnetic-moment dynamics to signed LIF membrane-potential evolution, enabling bipolar spike generation. Landau–Lifshitz–Gilbert simulations confirm that proper free-layer dimensions allow the device to follow a signed LIF equation, with a representative design measuring 10 nm × 45 nm × 50 nm. Network evaluations using the fitted device-neuron model achieve 91.06% accuracy on CIFAR-10 and 77.40% on CIFAR10-DVS, retaining most of the accuracy of ideal signed LIF neurons.

magnetic tunnel junction · signed spiking neuron · leaky integrate-and-fire · bipolar spike generation · landau–lifshitz–gilbert
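
As a point of reference, an idealized signed LIF neuron can be sketched in a few lines. The discrete-time update, the reset-to-zero rule, and the `leak` and `threshold` values are assumptions chosen for illustration; the device model fitted in the paper is not reproduced here.

```python
def signed_lif(inputs, leak=0.9, threshold=1.0):
    """Simulate an idealized signed leaky integrate-and-fire neuron.

    Returns one value per input step: +1 for a positive spike,
    -1 for a negative spike, 0 for no spike.
    """
    v = 0.0  # membrane potential
    spikes = []
    for x in inputs:
        v = leak * v + x          # leaky integration of the input current
        if v >= threshold:
            spikes.append(1)      # positive spike
            v = 0.0               # reset (assumed: hard reset to zero)
        elif v <= -threshold:
            spikes.append(-1)     # negative spike
            v = 0.0
        else:
            spikes.append(0)
    return spikes


print(signed_lif([0.6, 0.6, -2.0, 0.0]))
```

Unlike a standard LIF neuron, which can only signal "fired" or "silent", this neuron reports the sign of the accumulated input, which is the extra information the summary refers to.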

From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework

arXiv cs.AI · Alex Leung, Rex Zhang, Kentaroh Toyoda, SiewMei Loh · 2026-06-02

The paper introduces CER, a diagnostic framework for AI residual risk transfer, addressing losses involving generative or agentic AI systems. CER evaluates three dimensions: control boundary (enforceable operating envelope), evidence reconstruction (system state and causal chain from artifacts), and insurance response (coverage availability and claim proof). The framework is applied to public incidents like PocketOS and Replit database-deletion cases, demonstrating its utility in reconstructing AI-mediated losses. Contributions include problem definition, operationalization through CER, and specification of claim-grade evidence requirements.

cer framework · residual risk transfer · agentic ai · evidence reconstruction · insurance claim

E2LLM: Towards Efficient LLM Serving in Heterogeneous Edge/Fog Environments

arXiv cs.AI · Truong-Thanh Le, Amir Taherkordi, Hoang-Loc La, Frank Eliassen · 2026-06-02

E2LLM introduces a framework for efficient LLM serving in heterogeneous Edge/Fog environments by replicating full models across device groups with specialized PREFILL/DECODER roles. The method employs a Genetic Algorithm for device clustering and Dynamic Programming for optimal model-parallel partitioning, addressing phase-specific computational demands. Experiments show 50% reduction in average waiting time versus Splitwise under high load, with robust adaptation to varying input/output token lengths.

llm serving · edge computing · model parallelism · genetic algorithm · dynamic programming

Merit or networks? What decides where research is published

arXiv cs.AI · Ning Li · 2026-06-02

The study disentangles meritocratic and network effects in scientific publishing by developing a five-input production function for journal placement in economics. Using 6,208 working papers, it measures idea quality via a discipline-trained LLM evaluator, execution quality via a rubric, connection indices, author ability, and text scores. Results show execution quality as the dominant meritocratic input, while connections primarily affect top-tier journals through additive channels: connected authors produce higher-scoring papers and receive preferential placement at equal scores. The model reconciles meritocracy and network accounts without mutual exclusion.

llm evaluator · production function · connection index · journal placement · meritocracy

Tool-Aware Optimization with Entropy Guidance for Efficient Agentic Reinforcement Learning

arXiv cs.AI · Hongye Cao, Nuo Yan, Haoyuan Deng, Ziwei Wang · 2026-06-02

The paper introduces TAO-RL, a framework for optimizing agentic reinforcement learning (RL) in tool-augmented LLMs through tool-aware trajectory filtering and entropy-guided exploration. The method filters rollout trajectories to retain only tool-capable and informative samples, while reshaping advantage functions with entropy bonuses at post-tool-call tokens to diversify reasoning paths. Evaluated across 7 reasoning benchmarks and 3 model scales, TAO-RL outperforms existing approaches by stabilizing training and enhancing exploration efficiency.

agentic reinforcement learning · tool-aware optimization · entropy-guided exploration · trajectory filtering · advantage reshaping

LAP: An Agent-to-Instrument Protocol for Autonomous Science

arXiv cs.AI · Linwu Zhu, Liqiang Gao, Yan Chen, Dan Zhu · 2026-06-02

The Lab Agent Protocol (LAP) introduces a standardized agent-to-instrument protocol for autonomous science, addressing the gap left by existing agent-interoperability frameworks. LAP extends Google's Agent2Agent (A2A) structure with four physical-world primitives: InstrumentCard for capability description, reservation for exclusive access, safety-fence handshake for hazardous operations, and MeasurementResult for physically typed, calibration-anchored data. The protocol specifies roles, a six-layer architecture, JSON-RPC methods, task and safety state machines, error models, and cross-laboratory federation. LAP is compatible with the A2A/MCP ecosystem and encapsulates existing device standards like SiLA 2 and OPC-UA, enabling closed-loop autonomous campaigns.

agent-to-instrument · autonomous science · safety-fence · measurementresult · json-rpc

Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models

arXiv cs.AI · Glenn Jocher, Jing Qiu, Mengyu Liu, Shuai Lyu · 2026-06-02

Ultralytics YOLO26 introduces a unified real-time vision model family addressing key limitations in YOLO detectors: NMS dependence, heavy detection heads, long training schedules, and small object detection. The architecture features a dual-head design for NMS-free end-to-end inference, removes Distribution Focal Loss (DFL), and employs MuSGD optimizer, Progressive Loss, and STAL label assignment. The model supports detection, instance segmentation, pose estimation, classification, and oriented detection across five scales, achieving 40.9-57.5 mAP on COCO at 1.7-11.8 ms latency. YOLOE-26x extends to open-vocabulary tasks with 40.6 AP on LVIS minival.

real-time vision · nms-free inference · dual-head design · progressive loss · open-vocabulary detection

Qwen-Image-Flash: Beyond Objective Design

arXiv cs.AI · Tianhe Wu, Kun Yan, Zikai Zhou, Lihan Jiang · 2026-06-02

The paper introduces Qwen-Image-Flash, a method for few-step distillation in visual generative models, focusing on training pipeline organization rather than distillation objectives alone. Using Qwen-Image-2.0 as a case study, the authors systematically examine data composition, teacher guidance, and task mixture in text-to-image generation and instruction-guided image editing. Empirical analysis reveals non-obvious behaviors, leading to the development of Qwen-Image-Flash. Results indicate that effective few-step distillation depends on principled pipeline design alongside objective optimization.

few-step distillation · visual generative models · text-to-image generation · instruction-guided editing · training pipeline

Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts

arXiv cs.AI · Yiming Fu, Peixuan Liu, Zichen Wang, Kun yuan · 2026-06-02

The paper introduces Proof-Refactor, an agentic framework for refactoring LLM-generated formal proofs into modular, library-quality artifacts. It decomposes proof refactoring into four phases: fragment extraction, helper declaration design, formal verification, and proof repair. Evaluated on Lean proofs from PutnamBench and Putnam2025, the method outperforms Claude Code baselines in rubric-based scores, particularly improving signature quality and human readability. Results demonstrate that process-guided refactoring enhances proof structure without relying solely on length-based metrics.

proof-refactor · formal proofs · modular artifacts · lean proofs · rubric-based scores

When to Re-Plan: Subgoal Persistence in Hierarchical Latent Reasoning

arXiv cs.AI · Ayushi Chadha · 2026-06-02

The paper investigates the stability-adaptivity tradeoff in hierarchical latent reasoning systems, proposing a feudal-style manager-worker architecture where subgoal persistence duration (P) critically impacts performance. The extended Hierarchical Reasoning Model (HRM) employs a high-level module emitting directional subgoals that persist for P steps, biasing worker updates via cosine alignment loss. On ARC and ConceptARC benchmarks, optimal subgoal persistence (P=3) achieves the lowest LM loss (1.544 vs 1.674 at P=1), with complementary alignment weight λ≈0.05. Ablations confirm that directional structure, not just capacity or loss, drives interference when λ exceeds its optimum, suggesting medium-horizon intent coherence is key for compositional planning.

hierarchical reasoning · subgoal persistence · latent reasoning · cosine alignment · feudal architecture
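
The interaction between persistence P and alignment weight λ can be made concrete with a rough sketch. It assumes the alignment term has the form λ · (1 − cos(update, subgoal)) averaged over steps, and that the manager is queried once every P steps; both the loss form and the function names are assumptions for illustration, not the paper's definitions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def alignment_loss(worker_updates, emit_subgoal, persistence=3, weight=0.05):
    """Average cosine-alignment penalty over a sequence of worker updates.

    emit_subgoal(t) stands in for the high-level module. It is called only
    at steps 0, P, 2P, ... and its output is held fixed in between.
    """
    if not worker_updates:
        raise ValueError("need at least one worker update")
    total, subgoal = 0.0, None
    for t, delta in enumerate(worker_updates):
        if t % persistence == 0:
            subgoal = emit_subgoal(t)  # re-plan: refresh the direction
        total += weight * (1.0 - cosine(delta, subgoal))
    return total / len(worker_updates)


# Worker updates that follow the subgoal direction incur no penalty.
updates = [[1.0, 0.0]] * 6
print(alignment_loss(updates, lambda t: [1.0, 0.0], persistence=3))
```

With P=1 the manager re-plans at every step; larger P holds the direction longer, which is the stability side of the tradeoff the summary describes.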

Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs

arXiv cs.AI · Clément Yvernes, Emilie Devijver, Marianne Clausel, Eric Gaussier · 2026-06-02

The paper introduces derivation graphs to formalize the application of do-calculus rules for causal inference, characterizing the space of equivalent interventional and observational probabilities. By representing rule combinations as graph structures, the authors demonstrate a procedure requiring ≤4 rule applications and reveal how identification algorithms generate multiple valid estimands for the same causal quantity, enabling more efficient estimation. Results show this structural approach simplifies reasoning about equivalent causal expressions.

do-calculus · derivation graphs · causal inference · interventional queries · identification algorithms

Code-on-Graph: Iterative Programmatic Reasoning via Large Language Models on Knowledge Graphs

arXiv cs.AI · Weiwei Ding, Zixuan Li, Long Bai, Zhuo Chen · 2026-06-02

The paper introduces Code-on-Graph (CoG), a programmatic reasoning framework that enhances LLM-KG integration by generating executable Python code grounded in KG schemas. CoG addresses inflexibility and scalability limitations of existing approaches by representing KG schemas as Python classes and instantiating retrieved facts as objects during execution, avoiding direct knowledge injection into prompts. Experiments on WebQSP, CWQ, and GrailQA show CoG outperforms prior state-of-the-art models by up to 10.5%.

knowledge graphs · large language models · programmatic reasoning · python classes · fact retrieval
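
The schema-as-classes idea might look roughly like the sketch below. The `Film` and `Director` classes, the `directed` relation, and the triples are all hypothetical; the sketch only illustrates instantiating retrieved facts as objects and querying them in code, and makes no claim about how CoG itself represents schemas.

```python
from dataclasses import dataclass, field


@dataclass
class Film:
    name: str


@dataclass
class Director:
    name: str
    directed: list[Film] = field(default_factory=list)


def instantiate(triples):
    """Turn retrieved (subject, relation, object) facts into objects."""
    directors = {}
    for subj, rel, obj in triples:
        if rel == "directed":
            director = directors.setdefault(subj, Director(subj))
            director.directed.append(Film(obj))
    return directors


def films_by(directors, name):
    """The kind of small program an LLM could emit against the schema."""
    director = directors.get(name)
    return [film.name for film in director.directed] if director else []


# Hypothetical facts retrieved from a knowledge graph at execution time.
triples = [
    ("Jane Doe", "directed", "First Film"),
    ("Jane Doe", "directed", "Second Film"),
    ("John Roe", "directed", "Other Film"),
]
print(films_by(instantiate(triples), "Jane Doe"))
```

The facts live in program state rather than in the prompt, so the prompt only needs to carry the class definitions.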

Dynamic Objective Selection with Safeguards and LLM Oversight for Financial Decision-Making

arXiv cs.AI · Keigo Sakurai, Takahiro Ogawa, Miki Haseyama, Anjyu Anan · 2026-06-02

The paper proposes DOSS (Dynamic Objective Selection with Safeguards), a learning-based framework for financial decision-making that dynamically selects optimization objectives based on interpretable statistical summaries of recent returns, avoiding reliance on latent regime estimates. DOSS formulates objective selection as a classification problem, performs sequential updates with a rolling window to prevent temporal leakage, and outputs confidence scores for each proposal. It incorporates confidence-aware gating, fail-safe defaults, and explicit switching frequency controls to mitigate misselection and instability. Additionally, DOSS integrates LLM oversight constrained to accept proposals or override to predefined defaults, ensuring deterministic governance.

dynamic objective selection · confidence-aware gating · temporal leakage · latent regime estimates · llm oversight
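
How the three safeguards could compose is sketched below. The objective names, the thresholds, the `GateState` type, and the rule that the minimum holding period applies to every switch are assumptions made for illustration; the summary does not specify these details.

```python
from dataclasses import dataclass


@dataclass
class GateState:
    active: str              # objective currently in force
    steps_since_switch: int  # steps elapsed since the last switch


def gate(state, proposal, confidence, llm_accepts=True, *,
         default="min_variance", min_confidence=0.6, min_hold=5):
    """Apply confidence gating, LLM oversight, and a switching limit."""
    # Fail-safe: low confidence or an LLM override both fall back to the
    # predefined default; the overseer cannot introduce a new objective.
    if confidence < min_confidence or not llm_accepts:
        target = default
    else:
        target = proposal
    # Switching-frequency control: hold the active objective for at
    # least min_hold steps before any change (assumed rule).
    if target == state.active or state.steps_since_switch < min_hold:
        return GateState(state.active, state.steps_since_switch + 1)
    return GateState(target, 0)


state = GateState("min_variance", 10)
state = gate(state, "max_sharpe", confidence=0.9)    # confident: switch
state = gate(state, "min_variance", confidence=0.9)  # too soon: hold
print(state)
```

Because the overseer can only accept or revert to the default, the set of reachable objectives stays fixed and the gate's behavior remains deterministic given its inputs.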

SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents

arXiv cs.AI · Yuan Xiong, Ziqi Miao, Qian Chen, Lijun Li · 2026-06-02

The paper introduces SkillPyramid, a hierarchical skill consolidation framework that enables self-evolving AI agents to systematically construct, accumulate, and transfer skills. The method employs a hierarchical skill topology with a self-evolution mechanism for dynamic skill composition, validation, and incorporation during task execution. Evaluated on ALFWorld, WebShop, and ScienceWorld benchmarks across four backbone models, SkillPyramid achieves 38.0% higher average reward and 27.7% reduced execution steps compared to baselines, demonstrating effective transformation of static skill collections into dynamic evolution systems.

skill consolidation · hierarchical topology · self-evolution mechanism · task generalization · dynamic skill composition

Staying Alive: Uncensored Survival Analysis with Tabular Foundation Models

arXiv cs.AI · Mariana Vargas Vieyra · 2026-06-02

The authors propose a training-free method for survival regression using Tabular Foundation Models (TFMs) to handle right-censored data. Their approach leverages TFMs to predict event times and iteratively impute censored data, constructing an Accelerated Failure Time (AFT) model with only a single scalar parameter. They introduce a non-parametric in-context estimator based on the Buckley-James estimator. Experiments on standard benchmarks demonstrate competitive performance with trained parametric and semi-parametric models like Cox regression and AFT models.

survival analysis · tabular foundation models · right-censoring · accelerated failure time · buckley-james estimator

The DeepSpeak-Agentic Dataset

arXiv cs.AI · Sarah Barrington, Maty Bohacek, Hany Farid · 2026-06-02

The authors introduce DeepSpeak-Agentic, a 37-hour dataset of semi-structured human-AI conversations for forensic identification and interaction analysis. They develop a scalable data-capture system that generates embodied AI agents, pairs them with crowd workers, records multimodal conversations, and separates human-agent streams. The dataset serves as a benchmark for evaluating large-language models and AI-generated voices/faces in embodied agent applications.

embodied ai agents · forensic identification · multimodal conversations · large-language models · ai-generated voices

A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners

arXiv cs.AI · Patrick Emami, Nan Qiang, Peter Graf · 2026-06-02

This work investigates whether supervised fine-tuned (SFT) LLMs learn underlying world models during classical planning tasks. Through interpretability experiments analyzing internal representations and generative capabilities, the study finds that SFT enables linear encoding of action validity and state predicates, with internal representations often outperforming output probabilities for validity classification. Random walk data during fine-tuning improves world model recovery. The research provides both methodological insights for probing planning LLMs and empirical evidence about knowledge representation in these models.

supervised fine-tuning · world model recovery · interpretability experiments · action validity · linear representations

EvoDrive: Pareto Evolution for Safety-Critical Autonomous Driving via Self-Improving LLM Agents

arXiv cs.AI · Tong Nie, Yuewen Mei, Yihong Tang, Junlin He · 2026-06-02

EvoDrive introduces a novel LLM-based agentic evolution framework for generating safety-critical autonomous driving scenarios. The method employs a simulator-grounded actor-critic architecture with memory-driven proposal generation, critic filtering, and a self-evolving evaluator to optimize simulation budgets. Results demonstrate significant expansion of the Pareto frontier in MetaDrive and CARLA benchmarks, yielding diverse attack-realism trade-offs and valuable training scenarios.

evodrive · pareto frontier · actor-critic architecture · safety-critical scenarios · autonomous driving

AUGUSTE: Online-Learning dApp for Predictive URLLC Scheduling

arXiv cs.AI · Maxime Elkael, Michele Polese, Yunseong Lee, Koichiro Furueda · 2026-06-02

AUGUSTE introduces an online-learning MAC scheduler for URLLC that predicts uplink packet arrivals to eliminate Scheduling Request overhead. The framework alternates between learning phases (collecting unbiased arrival statistics) and confident phases (proactively allocating resources based on predictions), implemented via adaptive state machines. Evaluated on a 5G OpenAirInterface testbed with three URLLC traffic patterns, AUGUSTE achieves 10 ms median RTT (50% reduction vs SR baseline) at 7-10% resource overhead, matching always-on scheduling latency with 90% lower resource cost.

urllc · mac scheduling · online learning · 5g tdd · configured grant

From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

arXiv cs.AI · Hongyu Guo, Hao Li, He Cao, Gongbo Zhang · 2026-06-02

The paper introduces ChemCoTBench-V2, a verifiable benchmark for evaluating chemical reasoning in large language models (LLMs) through structured, rule-checked intermediate steps. It addresses limitations of final-answer scoring by enforcing expert-designed templates and deterministic chemistry rules for step-wise verification across 5,620 samples in 18 tasks, including molecular understanding and reaction prediction. Results show a persistent gap between final-answer correctness and reasoning-state consistency, with models often failing chemical-step checks despite correct answers or template adherence.

chemical reasoning · large language models · verifiable evaluation · intermediate commitments · molecular optimization

Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

arXiv cs.AI · Jinnuo Liu, Yue Peng, Jinhan Niu, Hongyi Wen · 2026-06-02

The paper introduces NovelAPIBench, a dynamic benchmark for evaluating LLMs' ability to acquire and use novel APIs, addressing limitations of static or synthetic benchmarks. The method automatically discovers novel APIs, extracts knowledge bundles (signatures, mechanisms, examples), generates executable tasks (1.9K total), and categorizes failures into six diagnostic types. Experiments across four models and five domains show usage examples are the strongest signal, while signature+example or signature+mechanism pairs perform best depending on domain. Retrieval outperforms parametric adaptation for volatile API content, though fine-tuning improves procedural integration of provided knowledge.

novel api acquisition · knowledge bundles · parametric adaptation · retrieval-augmented generation · dynamic benchmarking

Towards Non-Monotonic Entailment in Propositional Defeasible Standpoint Logic

arXiv cs.AI · Nicholas Leisegang, Thomas Meyer, Ivan Varzniczak · 2026-06-02

The paper introduces non-monotonic entailment relations for propositional defeasible standpoint logic (PDSL) by extending it with situated standpoint conditionals, enabling defeasible conditionals within specific standpoints. The method transports ranking-based entailment relations (rational and lexicographic closures) from propositional to PDSL contexts while preserving complexity bounds. Results show that a significant PDSL fragment can be expressed via situated conditionals, and entailment-checking leverages existing propositional algorithms without increasing computational complexity.

defeasible reasoning · non-monotonic entailment · propositional standpoint logic · situated conditionals · rational closure

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

arXiv cs.AI · Alexander Apartsin, Yehudit Aperstein · 2026-06-02

CoEval introduces a framework for ranking language models in scenarios lacking labeled data or reliable benchmarks. It synthesizes contamination-free, attribute-controlled benchmarks from task descriptions using teacher models, then employs a cross-family judge ensemble for model ranking without human input. Validated against ground truth, CoEval achieves a correctness correlation of 0.86 and avoids verbosity bias. The method demonstrated scalability with 7,978 evaluations across four tasks for minimal cost ($5.89), ensuring reproducibility and applicability to any domain.

contamination-free · cross-family judge ensemble · attribute-controlled · verbosity bias · label-free

Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability

arXiv cs.AI · Krishnapriya Vishnubhotla, Hillary Dawkins, Isar Nejadgholi, Svetlana Kiritchenko · 2026-06-02

The study proposes grounding safety measurements for fine-tuned LLMs in capability to avoid arbitrary empirical choices and enable consistent comparisons. It conducts a multi-dimensional evaluation of fine-tuning effects on both capability and safety, revealing three key issues: fine-tuned models generate incoherent responses to safety prompts, automated safety judgments fail on such outputs, and safety conclusions vary by benchmark and evaluator. The work highlights the need for capability-anchored safety assessments in LLM adaptation.

large language models · fine-tuning · safety evaluation · capability grounding · automated judgments

Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs

arXiv cs.AI · Vincent Limbach, Jonas Dornbusch, David Lüdke, Stephan Günnemann · 2026-06-02

The paper introduces Indirect Harm Optimization (IHO), a black-box adversarial attack method for evaluating LLM robustness against jailbreaks. IHO employs a masked diffusion language model trained via iterative preference optimization against a harmfulness judge, requiring only black-box access to the target model. The method functions adaptively against individual behaviors and transfers efficiently to unseen behaviors and models without fine-tuning. Evaluations demonstrate that IHO significantly improves attack success rates over state-of-the-art approaches, even against layered defenses like Circuit Breaker-trained models with auxiliary detectors. This positions IHO as a practical step toward standardized jailbreak evaluation.

black-box · diffusion model · jailbreak · iterative optimization · robustness evaluation

Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency

arXiv cs.AI · Qi Han Wong · 2026-06-02

The study reveals systemic gender-based triage disparities in LLM medical recommendations, with young women receiving significantly lower ER referral rates than men for identical neurological symptoms. Using Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT-5.4-mini (n=630 trials), researchers presented standardized symptoms across age/gender conditions. Results show diagnostic substitution as the primary mechanism: models anchor on gender-associated diagnoses (e.g., Idiopathic Intracranial Hypertension for women), routing them to lower-urgency care despite comparable severity ratings. The disparity vanishes at age 65, indicating models replicate human clinical biases by using epidemiological priors to suppress urgency.

diagnostic substitution · triage disparity · epidemiological priors · llm bias · urgency assessment

VidMsg: A Benchmark for Implicit Message Inference in Short Videos

arXiv cs.AI · Issar Tzachor, Michael Green, Rami Ben-Ari · 2026-06-02

The paper introduces VidMsg, a benchmark for evaluating implicit message understanding in short internet videos, comprising 400 YouTube clips across 9 topics and 52 fine-grained messages. The dataset is constructed via a message-first pipeline: an LLM generates indirect search scenarios to retrieve candidate clips, which human annotators verify for message alignment without explicitness. Experiments show contemporary video-language and retrieval models struggle with VidMsg due to its requirements for pragmatic inference and contextual cue integration, prompting the authors to propose VidVec-Msg as a baseline method with remaining performance gaps.

implicit message inference · video-language models · bidirectional retrieval · pragmatic inference · contextual cues

AnchorMoE: Interpretable Time Series Classification via Anchor-Routed MoE

arXiv cs.AI · Tao Xie, Zexi Tan, Haoyi Xiao, Mengke Li · 2026-06-02

AnchorMoE introduces an interpretable multivariate time series classification (MTSC) framework using a Mixture-of-Experts (MoE) architecture. The method encodes multi-view representations of local patches, routes them to specialized experts via geometric orthogonality constraints to reduce redundancy, and employs an uncertainty-aware reliability gate to suppress noise. Experiments on real-world and synthetic benchmarks show competitive classification performance while providing ante-hoc transparency by decomposing predictions into input segments.

multivariate time series classificationmixture-of-expertsinterpretabilitygeometric orthogonalityuncertainty-aware gating

TSQAgent: Rating Time Series Data Quality via Dedicated Agentic Reasoning

arXiv cs.AI · Shunyu Wu, Dan Li, Haozheng Ye, Weibin Feng · 2026-06-02

The paper introduces TSQAgent, an agentic reasoning framework for assessing time series data quality via three collaborative roles: Perceiver (dimension selection), Inspector (quantitative analysis), and Adjudicator (judgment refinement). The method addresses limitations of current LLMs in dimension identification and evidence-grounded comparison by incorporating external analytical tools. Evaluations on TSQBench and 11 real-world datasets show TSQAgent improves LLMs' quality understanding, quantitative comparison, and downstream data selection efficiency.

time seriesdata qualityagentic reasoningllmsquantitative analysis

Building Reliable Long-Form Generation via Hallucination Rejection Sampling

arXiv cs.AI · Lin Li, Georgia Channing, Suhaas M Bhat, Gabriel Davis Jones · 2026-06-02

The paper introduces Segment-wise HAllucination Rejection Sampling (SHARS), an inference-time framework to mitigate hallucination in long-form text generation by LLMs. SHARS employs a hallucination detector to identify and resample unreliable segments, preventing error propagation through segment-wise rejection sampling. The method adapts semantic uncertainty as its detector with modifications for long-form contexts, enabling self-correction without external resources. Evaluations on standardized benchmarks show SHARS significantly reduces hallucinations while maintaining or improving generation informativeness.

long-form generationhallucination mitigationrejection samplingsemantic uncertaintyself-correction
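
The segment-wise loop in the SHARS summary above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate_segment`, `hallucination_score`, `threshold`, and `max_retries` are hypothetical names, and the fallback to the lowest-scoring candidate is an assumption. The paper's detector is an adapted semantic-uncertainty measure, which is not reproduced here.

```python
from typing import Callable, List


def segmentwise_rejection_sampling(
    generate_segment: Callable[[str], str],            # hypothetical: samples the next segment given the context
    hallucination_score: Callable[[str, str], float],  # hypothetical detector: higher means less reliable
    prompt: str,
    num_segments: int,
    threshold: float = 0.5,   # assumed acceptance threshold
    max_retries: int = 4,     # assumed resampling budget per segment
) -> List[str]:
    """Build a long answer segment by segment, resampling flagged segments."""
    accepted: List[str] = []
    for _ in range(num_segments):
        # Only accepted segments enter the context, so a rejected
        # segment cannot propagate its error to later segments.
        context = prompt + "".join(accepted)
        best, best_score = None, float("inf")
        for _ in range(max_retries):
            candidate = generate_segment(context)
            score = hallucination_score(context, candidate)
            if score < best_score:
                best, best_score = candidate, score
            if score <= threshold:
                break  # candidate accepted, stop resampling
        # Assumption: if no candidate passes, keep the least suspicious one.
        accepted.append(best)
    return accepted
```

Because each segment is checked before it is appended, the cost of a rejection is one segment's worth of tokens rather than a full regeneration.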

TurtleAI: Benchmarking Multimodal Models for Visual Programming in Turtle Graphics

arXiv cs.AI · Chao Wen, Jacqueline Staub, Adish Singla · 2026-06-02

The paper introduces TurtleAI, a benchmark of 823 education-oriented visual programming tasks in Turtle Graphics, assessing multimodal models' ability to perceive geometric patterns, reason spatially, and generate accurate Python code. Evaluating 20+ vision-language models (VLMs) including GPT-5, GPT-4o, and Qwen2-VL-72B reveals poor performance (<30% success), with GPT-4o struggling particularly on spatial reasoning and visual replication. A synthetic data generation method improves Qwen2-VL-72B's performance by ~20% through enhanced visual-code alignment, as shown by failure analysis.

visual programmingturtle graphicsvision-language modelsspatial reasoningsynthetic data generation

Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

arXiv cs.AI · Zhengyi Zhao, Shubo Zhang, Huimin Wang, Zezhong Wang · 2026-06-02

The paper introduces Constraint Relationship Graph Completion (CRGC), a novel framework addressing the Constraint Adherence Problem (CAP) in Large Reasoning Models (LRMs). CRGC represents instructions as structured knowledge graphs, explicitly models constraint relationships, and identifies bridge constraints to reconcile competing requirements. These auxiliary constraints enhance primary constraint salience and compatibility, leveraging the model's knowledge for improved generation pathways. Experiments on three instruction-following datasets demonstrate a 39% reduction in constraint violations compared to standard prompting, while preserving LRM reasoning capabilities.

constraint adherence problemlarge reasoning modelsbridge constraintsknowledge graphinstruction following

Physics-Guided Policy Optimization with Self-Distillation

arXiv cs.AI · Ke Wang, Yuning Wu, Haoran Liu, Chaoqun Jia · 2026-06-02

The paper introduces Physics-Guided Policy Optimization (PGPO), a method for stabilizing self-distilled policy optimization (SDPO) in LLM post-training by modulating step sizes based on mutual information between student predictions and teacher feedback. Drawing from viscous-fluid dynamics and formalized via SDE analysis, PGPO maintains order-1 weak-approximation guarantees of SGD while adding negligible per-iteration overhead. Evaluated on Science-QA, PGPO outperforms SDPO in 3 of 4 domains (up to +4.5 points) and prevents training collapse observed with SDPO.

self-distillationpolicy optimizationmutual informationsde approximationllm post-training

Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

arXiv cs.AI · Mehmet Utku Colak · 2026-06-02

The paper introduces a pre-flight prompt-rewriting middleware for AI-assisted coding agents to optimize token usage. The system employs a local Llama 3.2 (3B) model to perform cross-lingual translation into English and structural rewriting into compact task-oriented formats, with regex-validated safeguards. Evaluated on the OMH-Polyglot benchmark spanning Turkish, Arabic, and Chinese, the method reduces prompt tokens by 34-47% and total tokens by up to 18.8% while maintaining or improving task accuracy, outperforming LLMLingua-2 in OckScore across three commercial LLM backends.

token arbitrageprompt optimizationmultilingual codingcontext windowllama 3.2

Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

arXiv cs.AI · Jiahui Li, Jianfeng Shan, Wenpei Chen, Shunyu Wu · 2026-06-02

The paper introduces TTRL-CoCoV, a confidence-adaptive framework for test-time reinforcement learning that optimizes both Pass@1 and Pass@k performance in label-free settings. The method addresses two key challenges: incorrect pseudo-labels for low-confidence samples and diversity collapse in high-confidence samples, by employing a confidence-conditioned mechanism that selectively applies verification and exploration-enhancing rewards. Experiments across six benchmarks show average gains of +9.8% in Pass@1 and +18.7% in Pass@16 over baseline TTRL, with up to +5.0% improvement over supervised RL methods.

test-time reinforcement learningpass@k optimizationconfidence-conditioned verificationdiversity collapsepseudo-label estimation

Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

arXiv cs.AI · Malia Barker, Bishal Lakha, Edoardo Serra, Francesco Gullo · 2026-06-02

The paper introduces an automatic algorithm for generating numeric-remapping attacks to test LLM robustness in arithmetic reasoning. The method derives symbolic representations, generates constrained numeric remappings, recomputes answers, and performs deterministic edits via LLM-generated plans, validated through stage-wise auditing. Evaluations on DeepSeek-R1 (70B), Gemma4 (31B), and GPT-OSS (120B) show GSM8K accuracy drops by 12.16–25.82 percentage points under attack, while MAWPS and MultiArith remain stable (>98% accuracy), revealing dataset-dependent sensitivity to numeric perturbations.

arithmetic reasoningnumeric-remappingsymbolic representationllm robustnessgsm8k
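
The remap-and-recompute step in the summary above can be illustrated with a small sketch. It assumes the symbolic representation is already available as a text template plus a solver function. `remap_problem`, `ranges`, and `is_valid` are hypothetical names, and the LLM-generated edit plans and stage-wise auditing described in the paper are not modeled.

```python
import random
from typing import Callable, Dict, Tuple


def remap_problem(
    template: str,                             # problem text with named placeholders
    solve: Callable[[Dict[str, int]], int],    # symbolic answer as a function of the values
    ranges: Dict[str, Tuple[int, int]],        # assumed sampling range per variable
    is_valid: Callable[[Dict[str, int]], bool],  # constraints that keep the problem meaningful
    seed: int = 0,
    max_tries: int = 1000,
) -> Tuple[str, int]:
    """Sample new numbers that satisfy the constraints and recompute the answer."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        values = {name: rng.randint(lo, hi) for name, (lo, hi) in ranges.items()}
        if is_valid(values):
            return template.format(**values), solve(values)
    raise ValueError("no valid remapping found")


# Usage: the structure of the problem is unchanged, only the numbers move.
text, answer = remap_problem(
    "Ana has {a} apples and gives away {b}. How many are left?",
    solve=lambda v: v["a"] - v["b"],
    ranges={"a": (10, 99), "b": (1, 50)},
    is_valid=lambda v: v["a"] > v["b"],
)
```

A model that has learned the reasoning rather than the specific numbers should answer the remapped problem as reliably as the original.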

CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery

arXiv cs.AI · Bo Peng, Kaiwen Wu, Sirui Chen, Zhiheng Wang · 2026-06-02

CauTion introduces a framework for robust ensemble causal discovery by integrating LLM domain knowledge with statistical methods through consensus filtering and trust calibration. The method proceeds in three stages: (1) algorithm ensemble with consensus voting resolves 96% of edges with near-perfect accuracy, (2) trust-weighted arbitration restricts LLM input to unreliable edges via annotation-free calibration, and (3) cycle repair ensures acyclic graphs. Experiments on six datasets show CauTion outperforms data-centric and LLM-augmented baselines, particularly on larger graphs, while maintaining robustness to LLM errors.

causal discoveryensemble learningtrust calibrationlarge language modelsacyclic graphs

DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair

arXiv cs.AI · Qinyan Zhou, Peixin Zhang, Jun Sun, Haonan Zhang · 2026-06-02

The paper introduces DDOR (Delta Debugging for OverRefusal), an automated framework for testing and repairing overrefusal in black-box LLMs. The method applies delta debugging to identify minimal refusal-triggering fragments (mRTFs), generates diverse prompts for multi-oracle validation, and constructs model-specific test suites (~1K cases). Results show DDOR enables targeted prompt repair that reduces overrefusal while maintaining safety on genuinely harmful inputs, improving LLM usability without compromising alignment.

overrefusaldelta debuggingminimal refusal-triggering fragmentsmulti-oracle validationprompt repair
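
For readers unfamiliar with delta debugging, the following is a simplified sketch of the classic ddmin reduction applied to prompt tokens. It is not DDOR's implementation: `triggers` is a hypothetical oracle that returns True when the model refuses, and here it is replaced by a toy keyword filter.

```python
from typing import Callable, List, Sequence


def ddmin(tokens: Sequence[str], triggers: Callable[[List[str]], bool]) -> List[str]:
    """Shrink `tokens` to a 1-minimal sublist for which `triggers` still holds."""
    current = list(tokens)
    if not triggers(current):
        raise ValueError("the full input must trigger the behavior")
    n = 2  # current granularity: target number of chunks
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            complement = [t for j, s in enumerate(subsets) if j != i for t in s]
            if triggers(subset):
                # One chunk alone is enough: restart at coarse granularity.
                current, n, reduced = subset, 2, True
                break
            if triggers(complement):
                # Dropping this chunk keeps the behavior: continue on the rest.
                current, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(current):
                break  # already at single-token granularity: 1-minimal
            n = min(len(current), n * 2)
    return current


# Toy oracle (assumption): a keyword filter that "refuses" whenever "kill" appears.
prompt = "how do I kill a Python process".split()
fragment = ddmin(prompt, lambda toks: "kill" in toks)
```

In a real setting each oracle call is a model query, so the number of calls matters; ddmin needs far fewer than testing every subset.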

PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models

arXiv cs.AI · Ziyang Chen, Shaoguang Wang, Weiyu Guo, Qianyi Cai · 2026-06-02

PHASER introduces a phase-aware experience replay framework for Vision-Language-Action (VLA) models to mitigate catastrophic forgetting in continual learning. It employs phase-centric capacity allocation to balance sub-skill retention and multi-modal interference routing to prioritize vulnerable historical tasks. The method integrates Auto-PC for unsupervised boundary detection in manipulation trajectories. Evaluated on LIBERO benchmarks with three VLA backbones, PHASER improves Average Success Rate by 31% over standard experience replay, achieving 87.8% ASR in goal-conditioned continual learning.

phaservla modelscontinual learningexperience replayauto-pc

When Attention Collapses: Stage-Aware Visual Token Pruning from Structure to Semantics

arXiv cs.AI · Jiahui Wang, Kai Zhang, Mai Han, Huanghe Zhang · 2026-06-02

The paper introduces Structure-to-Semantics (STS), a two-stage visual token pruning framework for Vision-Language Models (VLMs) that addresses redundancy in attention-based selection. STS first maximizes spatial diversity via repulsion-based sampling, then filters prompt-irrelevant tokens using instruction-aware cross-attention. Evaluations show STS improves structural diversity and task alignment of retained tokens, mitigating feature collapse from single-metric pruning. The method demonstrates effectiveness in reducing computational overhead while preserving vital contextual details.

visual token pruningvision-language modelsattention collapserepulsion-based samplinginstruction-aware cross-attention

Learned Non-Maximum Suppression for 3D Object Detection

arXiv cs.AI · Timo Osterburg, Stefan Schütte, Torsten Bertram · 2026-06-02

The paper introduces two learned modules for post-processing in LiDAR-based 3D object detection, replacing heuristic non-maximum suppression (NMS). D2D-Rescore employs transformer-based detection-to-detection attention, while GossipNet3D adapts 2D GossipNet to 3D via localized message passing in bird's-eye view. Both methods improve mean average precision (mAP), nuScenes detection score (NDS), and true positive quality over CircleNMS, particularly for small and infrequent classes, with minimal computational overhead. The approaches demonstrate that learned filtering enhances 3D detector reliability without modifying the base network.

3d object detectionnon-maximum suppressionlidartransformermessage passing

Efficient Transformer-Based Localized Patch Sampling for Choroid Plexus Segmentation in Multiple Sclerosis

arXiv cs.AI · Po-Jui Lu, Alessandro Cagol, Mario Ocampo-Pineda, Federico Spagnolo · 2026-06-02

A SwinUNETR-based pipeline was developed for efficient and accurate segmentation of the lateral ventricle choroid plexus (LVCP) in multiple sclerosis patients using localized patch sampling. The method processes 32x32x32 voxel patches from standalone and multi-modal MRI inputs, benchmarking against 3D UXNET on three datasets totaling 742 scans. SwinUNETR achieved a mean Dice Similarity Coefficient (DSC) of 0.868 with combined MPRAGE and FLAIR inputs, significantly outperforming UXNET (DSC: 0.858, p<0.0001), while reducing computational load by 99% (91.8 vs. 22,080 GFLOPs). The approach maintained high accuracy (DSC: 0.863) with standalone FLAIR inputs and improved spatial localization (HD95: 1.86 vs. 3.00 mm).

swinunetrchoroid plexusdice similarity coefficienthausdorff distancepatch sampling
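
As a reminder of how the two headline numbers above are defined, here is a minimal sketch of the Dice Similarity Coefficient on flattened binary masks, together with the compute reduction implied by the reported GFLOPs. The function name and the flattened-list representation are illustrative assumptions, not the paper's code.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention (assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Relative compute reduction from the GFLOPs figures quoted in the summary.
compute_reduction = 1.0 - 91.8 / 22080.0
```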

CR-Seg: Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation

arXiv cs.AI · Yifan Cao, Xiaocui Yang, Faxian Wan, Shi Feng · 2026-06-02

The paper introduces CR-Seg, a two-stage framework for reasoning segmentation that combines attention-guided coarse localization with refined mask generation. The method employs an Extract Attention Maps and Points (EAP) module to identify target regions and select key points for input into SAM (Segment Anything Model). To enhance reasoning consistency, a Global-to-Local Chain-of-Thought (GLCoT) approach is proposed, enabling progressive reasoning from global context to local details. Experiments on benchmarks validate CR-Seg's effectiveness in improving segmentation accuracy through joint visual-textual reasoning.

reasoning segmentationattention mapschain-of-thoughtmultimodal llmssam

From Prompt to Service: An SLM-Based Agent Orchestration Gateway for AI-Driven Virtual Worlds

arXiv cs.AI · Louis Nisiotis, Aimilios Hadjiliasi · 2026-06-02

The paper proposes an SLM-based Agent Orchestration Gateway to decouple virtual world clients from heterogeneous AI backends via intent-driven service routing. The architecture employs edge-deployed small language models (SLMs) for semantic intent classification, a configurable service registry for routing decisions, and transparent backend invocation. Evaluated in the InterwovenXR virtual museum testbed, results demonstrate that fine-tuned sub-billion-parameter SLMs can reliably route intents on edge hardware, with a layered configuration (small router + larger conversational SLM) proving more efficient than a single-model approach. The solution enables scalable, extensible AI interaction in virtual worlds.

small language modelsintent classificationservice orchestrationedge computingvirtual worlds

SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

arXiv cs.AI · Linyue Pan, Yaoming Zhu, Lin Qiu, Xuezhi Cao · 2026-06-02

The paper introduces SAGE (Social Agent Group Evolution), a framework for evaluating socialized evolution in agent ecosystems by comparing SocialEvo (agents co-evolving with peer histories) and SelfEvo (isolated self-improvement). Experiments span three domains (open-ended ML research, long-horizon economic planning, and strategic multiplayer play) across multiple evolutionary rounds. Results show that group history does not universally enhance performance, as the strongest agents do not surpass their self-evolution limits. However, agents plateauing in isolation achieve breakthroughs with peer experience. Gains depend on abstracting transferable knowledge from filtered traces or summaries rather than raw exposure volume, indicating agent-specific and arena-dependent benefits.

socialized evolutionself-improving agentspeer historiestransferable knowledgeevolutionary rounds

When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation

arXiv cs.AI · Haowei Guo, Baolong Bi, Ruicheng Zhang, Bingqian Sun · 2026-06-02

The paper systematically studies teacher update schedules in self on-policy distillation, identifying isolation periods (complete teacher freezing between updates) as critical for stability rather than teacher age. It introduces a diagnostic framework analyzing temporal KL structure, refresh shock, and length-tail risk, revealing state-oblivious collapse, a failure mode where clock-driven updates propagate student drift irreversibly. The proposed Consolidation-Gated Teacher Refresh (CGTR) dynamically gates updates based on reward improvement and length-tail safety, achieving zero collapse and superior performance across four tasks (Chemistry, Biology, Physics, ToolUse) without task-specific tuning.

self on-policy distillationtemporal couplingisolation periodsstate-oblivious collapseconsolidation-gated teacher refresh

High-Precision APT Malware Attribution with Out-of-Scope Resilience

arXiv cs.AI · Peter Williams, Adam Sobey, Erisa Karafili · 2026-06-02

The paper introduces a high-precision APT malware attribution method using ranked binary classifiers with explicit abstention to address out-of-scope resilience. Instead of a multi-class classifier, it trains two binary classifiers per APT group, ranks them by validation performance, and applies them sequentially, attributing only when sufficient evidence exists. Evaluated on the APT Malware dataset and a larger combined dataset, the method achieves higher precision than prior work, abstaining on 94% of out-of-scope samples while maintaining 92% precision and 95% selective accuracy on classified samples.

apt malware attributionbinary classifiersout-of-scope resilienceselective accuracyabstention mechanism
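
The sequential, abstaining decision rule described above might look like the sketch below. The summary does not say how the two classifiers per group are combined, so requiring both to exceed a threshold is an assumption, as are the names `attribute`, `ranked`, and `threshold`.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical classifier type: maps a feature vector to P(sample belongs to the group).
Classifier = Callable[[Sequence[float]], float]


def attribute(
    sample: Sequence[float],
    ranked: List[Tuple[str, Classifier, Classifier]],  # groups sorted by validation performance
    threshold: float = 0.9,                            # assumed evidence threshold
) -> Optional[str]:
    """Return the first group whose two classifiers both accept, else abstain."""
    for group, clf_a, clf_b in ranked:
        # Assumption: attribution requires agreement of both binary classifiers.
        if clf_a(sample) >= threshold and clf_b(sample) >= threshold:
            return group
    return None  # explicit abstention, e.g. for out-of-scope samples
```

Returning `None` instead of the closest group is what lets such a cascade trade coverage for precision on samples from unseen actors.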

Post-Hoc Robustness for Model-Based Reinforcement Learning

arXiv cs.AI · Siemen Herremans, Ali Anwar, Siegfried Mercelis · 2026-06-02

This work introduces a post-hoc robustification method for model-based reinforcement learning (RL) agents at inference time, without additional neural network training. The approach combines a learned transition model with a nominal policy, using model-predictive control and adversarial rollouts approximated via projected gradient descent within a bounded uncertainty set. The method addresses out-of-distribution issues during offline rollouts. Validation in perturbed Gymnasium MuJoCo environments demonstrates significant robustness improvements, while accounting for computational constraints in post-hoc inference settings.

post-hoc robustificationmodel-based rladversarial rolloutsprojected gradient descentout-of-distribution

Overlaying Governance: A Compositional Authorization Framework for Delegation and Scope in Agentic AI

arXiv cs.AI · Amjad Ibrahim, Yong Li · 2026-06-02

The paper proposes a compositional authorization framework for governing autonomous AI agents, addressing limitations of traditional IAM systems in handling dynamic delegation and scoping. It introduces formal primitives for recursive delegation, contextual boundaries, and attenuated resource access, expressed as relational definitions composable with existing policies. The framework's operator overlays agentic semantics (e.g., delegation chains) onto legacy systems without policy rewrites. Formal proofs and empirical evaluations demonstrate its effectiveness in providing accountable authorization for agentic AI systems.

agentic aiauthorization frameworkdelegation semanticsresource scope attenuationcompositional governance

Scalable On-Hardware Training of Quantum Neural Networks and Application to Clinical Data Imputation

arXiv cs.AI · Natansh Mathur, Panagiotis Kl. Barkoutsos, Masako Yamada, Martin Roetteler · 2026-06-02

A scalable framework for training quantum neural networks (QNNs) on near-term hardware is introduced, reducing gradient estimation cost from quadratic to logarithmic in qubit count. The method combines a Butterfly circuit architecture with O(n log n) parameters, layer-wise training, and a parallelized parameter-shift rule to minimize circuit evaluations per optimization step. Validated on clinical data imputation using the MIMIC-III dataset, hybrid classical-quantum models trained on IonQ Forte Enterprise hardware at 16 qubits matched or exceeded classical neural baselines in patient survival prediction, with reduced variance across runs. The framework demonstrated scalability up to 32 qubits via tensor-network simulation.

quantum neural networksparameter-shift rulebutterfly circuitmimic-iiiionq forte enterprise
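
For context, the standard (non-parallelized) parameter-shift rule that the summary refers to can be shown on a one-qubit toy case. The sketch uses the analytic expectation value of Z after an RX(θ) rotation on |0⟩, which is cos θ, instead of real hardware. The paper's parallelized variant and Butterfly circuit are not modeled.

```python
import math
from typing import Callable


def expectation_z(theta: float) -> float:
    """<Z> after applying RX(theta) to |0>; analytically equal to cos(theta)."""
    return math.cos(theta)


def parameter_shift_grad(f: Callable[[float], float], theta: float) -> float:
    """Exact gradient for a Pauli-generated gate from two shifted evaluations."""
    shift = math.pi / 2
    return (f(theta + shift) - f(theta - shift)) / 2.0


grad = parameter_shift_grad(expectation_z, 0.7)  # analytic value is -sin(0.7)
```

Each parameter needs two circuit evaluations under this basic rule, which is why reducing the number of evaluations per optimization step matters on hardware.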

SPADE: Sketch-guided Path Planning Augmented with Diffusion Experts

arXiv cs.AI · Charbel Abi Hana, Tatiana Ghantous, Mikael Khalil, Anthony Rizk · 2026-06-02

SPADE introduces a novel framework for sketch-guided path planning in Autonomous Mobile Robots, addressing generalization and robustness limitations in existing imitation learning approaches. The method combines an enhanced ROS 2-based annotation tool with a diffusion-augmented behavioral cloning training strategy. Evaluations on expert demonstration datasets show SPADE achieves 39.1% lower Absolute Pose Error and 33.5% lower Fréchet Inception Distance compared to state-of-the-art methods, while reducing trainable parameters by 93.8%. The framework maintains real-time, on-edge performance while achieving diffusion-level generalization capabilities.

path planningdiffusion augmentationbehavioral cloningabsolute pose errorfréchet inception distance

BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language

arXiv cs.AI · Muhammad Ali · 2026-06-02

The authors introduce BaltiVoice, the first publicly available 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), containing 10,060 validated utterances in native Nastaliq script derived from Mozilla Common Voice. They fine-tune OpenAI Whisper-small on this corpus, achieving a Word Error Rate (WER) of 30.07% on a 538-utterance validation set, representing a 152.11 percentage point improvement over the zero-shot baseline (182.18% WER). The resource includes dataset, model weights, and a live demo on HuggingFace.

speech corpuswhisper-smallword error ratenastaliq scriptzero-shot baseline
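
A zero-shot WER above 100% may look odd, but it follows from the definition: WER divides substitutions, deletions, and insertions by the number of reference words, so a hypothesis with many extra words can exceed 1.0. A minimal word-level sketch (not the evaluation script used by the author) is shown below, along with the improvement arithmetic from the summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# Figures from the summary: absolute and relative WER reduction.
absolute_gain = 182.18 - 30.07
relative_gain = absolute_gain / 182.18
```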

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

arXiv cs.AI · Ziyan Liu, Xueda Shen, Yuzhe Gu, Songyang Gao · 2026-06-02

ThoughtFold introduces a fine-grained preference learning framework to mitigate redundant explorations in Large Reasoning Models (LRMs) by folding reasoning chains into concise paths. The method employs introspective redundancy identification within correct Chain-of-Thought (CoT) trajectories and optimizes via masked preference learning, penalizing redundant steps while preserving essential reasoning segments. Experiments demonstrate a 56% reduction in token usage for DeepSeek-R1-Distill-Qwen-7B while maintaining state-of-the-art accuracy.

large reasoning modelschain-of-thoughtpreference learningintrospective redundancymasked optimization

Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs

arXiv cs.AI · Wenqi Chen, Ziyan Zhang, Bing Wang, Lin Liu · 2026-06-02

The paper introduces Tree-like Self-Play (TSP), a framework for enhancing secure code generation in Large Language Models (LLMs) by treating it as a fine-grained sequential decision process. TSP constructs a decision tree where the model explores secure and vulnerable code variants, enabling localized error correction through self-play. Experiments show TSP improves CodeLlama-7B's pass rate to 75.8% on Python security benchmarks, outperforming Supervised Fine-Tuning (57.0%) and unstructured self-play. TSP also reduces vulnerabilities in unseen CWEs by 24.5% and demonstrates cross-language generalization, suggesting internalization of language-agnostic security logic.

tree-like self-playsecure code generationdecision treelocalized error correctionlanguage-agnostic security

NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense

arXiv cs.AI · Zhongyang Lin, Ziran Zhao, Feifei Zhai, Pengyuan Liu · 2026-06-02

NeuroArmor introduces a white-box runtime defense against jailbreak attacks in large language models by leveraging prompt-specific safe variants for selective intervention. The method generates K safe variants per prompt, compares hidden-state representations, and routes anomalies to refusal or recovery branches based on safety thresholds. Evaluated on Llama-3-8B-Instruct, it reduces attack success rate from 41.56% to 1.57% while lowering benign false positives from 30.26% to 22.05%, outperforming baselines in safety-helpfulness trade-offs.

jailbreak defensehidden-state consistencyselective re-anchoringruntime interventionsafe variants

Analyzing Stream Collapse in Hyper-Connections: From Diagnosis to Mitigation

arXiv cs.AI · Ekaterina Alimaskina, Gleb Molodtsov, Aleksandr Beznosikov · 2026-06-02

The paper diagnoses and mitigates stream collapse in Hyper-Connections (HC), which replace single Transformer residual streams with multiple permutation-symmetric streams. Using fine-grained diagnostics on HC-based language models, the authors find that residual mixing often remains near-identity, limiting inter-stream communication, while features concentrate in a dominant stream. They demonstrate that breaking symmetry during stream initialization reduces dominant-stream behavior and improves performance across modified HC variants.

hyper-connectionsresidual streampermutation symmetrystream collapsetransformer

A formal definition and meta-model for a machine theory of mind

arXiv cs.AI · Fabio Cuzzolin · 2026-06-02

The paper introduces a formal definition of Machine Theory of Mind (MToM), grounded in cognitive psychology, neuroscience, and AI principles, and proposes a holistic meta-model for MToM. It evaluates current state-of-the-art approaches and benchmarks, identifying gaps and suggesting future research directions to advance the field. The work provides a rigorous framework for analyzing and developing MToM systems, bridging theoretical foundations with empirical validation.

machine theory of mindcognitive psychologymeta-modelneuroscienceempirical benchmarking

StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

arXiv cs.AI · Taiyu Zhu, Yifan Wu, Weilin Jin, Ying Li · 2026-06-02

StepFinder introduces a lightweight temporal semantic framework for failure attribution in LLM-based multi-agent systems, addressing inefficiencies in existing LLM-based methods. The framework employs LLMs only during feature construction to encode execution logs into temporal semantic sequences, then uses parameter-efficient temporal modeling and attention modules to capture sequential evolution and cross-step dependencies. Experiments on the Who&When benchmark show StepFinder outperforms LLM-based methods in step-level failure attribution while reducing inference time by 79%, with no text generation overhead.

failure attributionmulti-agent systemstemporal modelingattention modulesexecution logs

Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression

arXiv cs.AI · Artur Zagitov, Alexander Miasnikov, Maxim Krutikov, Vladimir Aletov · 2026-06-02

This work systematically evaluates tensor decompositions for post-training compression of large language models (LLMs), addressing limitations in prior narrow evaluations. The authors analyze both dense and mixture-of-experts (MoE) architectures through empirical and theoretical lenses, identifying a fundamental mismatch between tensor decomposition assumptions and LLMs' heterogeneous representations. Their findings delineate practical limits of tensorization methods and clarify their viable role in large-scale LLM deployment. The study provides grounded performance trade-offs and releases code at https://github.com/brain-lab-research/TT-LLM.

tensor decompositionspost-training compressionlarge language modelsmixture-of-expertsheterogeneous representations

DMF: A Deterministic Memory Framework for Conversational AI Agents

arXiv cs.AI · Matteo Stabile, Enrico Zimuel · 2026-06-02

The Deterministic Memory Framework (DMF) introduces a CPU-first approach for conversational AI memory systems, replacing LLM-based summarization with deterministic techniques. DMF computes a Survival Score (Ω) from content signals, conversational cues, and provenance, using logistic projection and an interaction-count decay law (Ω_eff(Δn)) for relevance tracking. Evaluated on LoCoMo and LongMemEval datasets, DMF matches Mem0's accuracy while eliminating LLM memory-preparation tokens and reducing total token usage by 5x to 242x, enabling deterministic memory management with near-zero token costs.

deterministic memorysurvival scorelogistic projectioninteraction-count decaytoken efficiency

What Makes Interaction Trajectories Effective for Training Terminal Agents?

arXiv cs.AI · Sidi Yang, Chaofan Tao, Jierun Chen, Tiezheng Yu · 2026-06-02

The study identifies a pedagogical paradox in agent training: trajectories from lower-performing teachers (DeepSeek-V3.2) yield better student generalization than those from higher-performing agents (Claude Opus 4.6). Using Terminal-Lego, a pipeline converting real-world problems into environment-verified tasks, the authors demonstrate that Environment-Grounded Supervision (EGS) – exposing inspect-act-verify behaviors through harness-visible interactions – produces more robust learning. Results show exceptional data efficiency: Qwen3-32B achieves 24.3% on Terminal-Bench 2.0 with only 15.3k trajectories, matching prior SOTA with 30x less data. The findings advocate for 'Harness Engineering' as a key direction in agent post-training.

pedagogical paradoxenvironment-grounded supervisionterminal-legoharness engineeringagent post-training

Tonal parsimony in chord-sequence analysis: combining modulation cost and tonal vocabulary

arXiv cs.AI · François Pachet · 2026-06-02

The paper introduces tonal parsimony, a method combining modulation cost and tonal vocabulary minimization for harmonic analysis of chord sequences. It extends standard dynamic programming approaches by lexicographically optimizing first for minimal modulations, then for minimal distinct tonal centers (within the 24 major/minor tonality system). Evaluated on 31,032 LMD Chords sequences, the method preserves optimal modulation counts while reducing tonal vocabulary in 55.8% of cases, decreasing mean tonalities from 3.802 to 3.206 and modulations from 16.728 to 12.141. On 1,555 jazz standards, it achieves 95.6% chord-scale agreement, demonstrating practical utility for professional harmonic analysis.

tonal parsimonyharmonic analysismodulation costdynamic programmingchord sequences

FORGE: Multi-Agent Graduated Exploitation and Detection Engineering

arXiv cs.AI · Farooq Shaikh · 2026-06-02

FORGE introduces a multi-agent system that integrates proof-of-concept generation, vulnerability prioritization, and detection rule engineering through graduated exploitation depth. The system employs five specialized agents (Intel, Generator, Planner, Exploit, Detector) in a fixed pipeline to generate vulnerable applications from CVE metadata, conduct multi-turn exploitation assessed by an LLM-primary oracle, and produce Sigma and Snort detection rules based on OpenTelemetry traces. Evaluated on 603 CVEs from CVE-GENIE, FORGE achieves 67.8% end-to-end L1+ exploitation at USD 1.50 per CVE across eight languages and 187 CWE types. Detection rules from L2+ exploitation show higher span-normalized grounding (p=0.035), with 93.4% of Snort rules producing zero false positives.

graduated exploitationmulti-agent systemcve metadataopentelemetry tracesdetection rules

PRISM: Synergizing Vision Foundation Models via Self-organized Expert Specialization

arXiv cs.AI · Ying Tang, Dong Li, Youjia Zhang, Zikai Song · 2026-06-02

PRISM introduces a dual-stream Mixture-of-Experts (MoE) framework to unify Vision Foundation Models (VFMs) by addressing negative transfer through modular specialization. The method employs a two-stage paradigm: (1) expertise deconstruction, where a teacher-conditional router guides experts to specialize in distinct representational subspaces, and (2) dynamic recomposition, where the router assembles experts into task-specific computational pathways. Experiments on PASCAL-Context and NYUD-v2 demonstrate state-of-the-art performance, validating sparse emergent specialization as an effective approach for integrating diverse visual knowledge.

mixture-of-expertsvision foundation modelsnegative transfermodular specializationdynamic recomposition

CP-Agent: Context-Aware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations

arXiv cs.AI · Yuxin Zhang, Yiyao Li, Ping Shu Ho, Simon See · 2026-06-02

The authors present CP-Agent, a multimodal LLM for interpretable cellular morphological profiling under drug perturbations. The system combines a context-aware alignment module (CP-CLIP) that jointly embeds high-content images and experimental metadata with agentic reasoning to generate mechanism-relevant rationales. CP-CLIP achieves 0.896 F1-score for treatment/MoA discrimination, while the full agent produces structured reports to guide drug discovery workflows.

cell painting · multimodal llm · mechanism-of-action · phenotypic screening · context-aware alignment

A Hybrid Approach For Malware Classification Using Secondary Features Fusion

arXiv cs.AI · Raja Khurram Shahzad, Muhammad Mustaqeem, Haroon Elahi · 2026-06-02

The paper proposes a hybrid malware classification method combining feature fusion and algorithm voting to improve family-level detection. The approach extracts API calls and n-gram features (both fixed and variable length), applies customized feature selection, and employs an ensemble predictive model. Evaluated on the Microsoft malware dataset, the method achieves 99.72% accuracy, 0.989 AUC, and 0.01 log loss in binary and multi-class classification, outperforming state-of-the-art approaches.

malware classification · feature fusion · n-grams · ensemble learning · api calls

FlowGuard: Flow Matching for Identity-Independent Detection of Data-Free Model Stealing Attacks on Energy System Intrusion Detection Systems

arXiv cs.AI · Maxime Schwarzer, Laurin Holz, Tobias Huerten, Johannes Loevenich · 2026-06-02

FlowGuard introduces an identity-independent defense against model stealing attacks on AI-based Intrusion Detection Systems (IDS) in energy infrastructure. The method uses flow matching to detect out-of-distribution queries by leveraging lower-dimensional manifolds in synthetic attack traffic, measured via Continuous Normalizing Flow log-likelihoods. Evaluated against PRADA and FDINet under MAZE and DisGUIDE attacks, FlowGuard maintained stable detection rates in both single-client and 100-client Sybil scenarios without relying on client identity, whereas PRADA's detection rate dropped to 0%. The paper discusses limitations and potential extensions to data-dependent attacks.

flow matching · model stealing · continuous normalizing flow · intrusion detection systems · sybil attacks

PrimeSVT: An Automated Memory-aware Pruning Framework with Prioritized Compression Policy for Spiking Vision Transformers

arXiv cs.AI · Rachmad Vidya Wicaksana Putra, Achyuta Muthuvelan, Alberto Marchisio, Muhammad Shafique · 2026-06-02

PrimeSVT introduces an automated memory-aware structured pruning framework for Spiking Vision Transformers (SViTs), addressing limitations of manual unstructured pruning. The method prioritizes compression by layer size, employs channel-wise filter pruning based on L2-norm values, and adheres to user-defined accuracy and memory constraints. Experiments demonstrate 26.68% memory savings with <3% accuracy drop (70.3% without fine-tuning, 72.9% with fine-tuning) compared to the original SViT (73.3%), enabling efficient embedded implementation.

spiking vision transformers · structured pruning · memory-aware compression · l2-norm pruning · embedded implementation
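
The L2-norm channel ranking mentioned in the summary can be illustrated with a minimal sketch. It is not the PrimeSVT implementation: the flat per-channel weight layout, the `keep_ratio` parameter, and the function names are assumptions made for illustration, and the memory-aware layer prioritization is not modeled.

```python
import math


def l2_norm(channel):
    """L2 norm of one output channel's flattened weights."""
    return math.sqrt(sum(w * w for w in channel))


def prune_channels(weights, keep_ratio):
    """Keep the channels with the largest L2 norms.

    weights: list of channels, each a flat list of floats (hypothetical layout).
    keep_ratio: fraction of channels to keep, in (0, 1].
    Returns (kept_indices, pruned_weights) in the original channel order.
    """
    n_keep = max(1, round(len(weights) * keep_ratio))
    ranked = sorted(range(len(weights)),
                    key=lambda i: l2_norm(weights[i]),
                    reverse=True)
    kept = sorted(ranked[:n_keep])
    return kept, [weights[i] for i in kept]


# Toy layer with four channels; the two weakest channels are dropped.
layer = [[0.1, -0.1], [2.0, 1.0], [0.0, 0.05], [-1.5, 0.5]]
kept, pruned = prune_channels(layer, keep_ratio=0.5)
print(kept)  # [1, 3]
```

Pruning whole channels rather than individual weights keeps the remaining tensors dense, which is why structured pruning maps to ordinary hardware.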

Optimizing Explicit Unit-Distance Lower-Bound Certificates

arXiv cs.AI · Michael T. M. Emmerich · 2026-06-02

The paper presents an optimization framework for unit-distance lower-bound certificates, improving upon Sawin's explicit quantitative refinement of Erdős's conjecture. It formulates parameter selection as a nonlinear integer programming problem, proposing a Python verification pipeline with greedy heuristics and tailored evolutionary strategies. Computational results demonstrate four certificate levels, with the best achieving δ=0.01526, supporting u(n)>n^(1.0152) for arbitrarily large n. The methods are lightweight and replicable on standard hardware.

unit-distance · lower-bound certificates · nonlinear integer programming · evolutionary strategies · python verification

Causal Evidence of Stack Representations in Modeling Counter Languages Using Transformers

arXiv cs.AI · Nishit Singh · 2026-06-02

The paper establishes causal evidence that transformers trained on counter languages learn stack-structured representations essential for model performance. Using linear probes, the authors identify a principal representation direction corresponding to stack depth in the model's hidden states. Ablating this direction causes sequential accuracy to collapse to near 0%, demonstrating that the stack representation is not merely learned but causally necessary for task execution. This work advances understanding of how transformers internally model hierarchical structures in formal languages.

transformers · counter languages · stack representation · linear probes · causal ablation
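
One common way to ablate a probe direction is to project it out of each hidden state. The sketch below assumes that form of ablation; the summary does not specify the exact procedure, and the vectors are made-up toy values.

```python
import math


def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - coeff * u for h, u in zip(hidden, unit)]


probe_direction = [3.0, 4.0, 0.0]  # hypothetical probe weight vector
h = [1.0, 2.0, 5.0]                # hypothetical hidden state
h_ablated = ablate_direction(h, probe_direction)

# After ablation the hidden state has no component along the probe direction.
residual = sum(a * d for a, d in zip(h_ablated, probe_direction))
```

Components orthogonal to the direction (the third coordinate here) are left unchanged, so any accuracy collapse can be attributed to the removed direction alone.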

When Model Merging Breaks Routing: Training-Free Calibration for MoE

arXiv cs.AI · Canbin Huang, Tianyuan Shi, Xiaojun Quan, Jingang Wang · 2026-06-02

The paper introduces Hessian-Aware Router Calibration (HARC), a training-free method to address routing breakdown in merged Mixture-of-Experts (MoE) models. HARC leverages second-order curvature information to realign the merged router, formulated as a closed-form solution solvable via matrix-free conjugate gradient. Experiments on mathematical reasoning and code generation tasks demonstrate HARC's effectiveness in mitigating routing breakdown and improving performance across various MoE merging baselines.

mixture-of-experts · model merging · router calibration · hessian-aware · training-free
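
"Matrix-free" conjugate gradient means the solver only needs a function that multiplies a vector by the (symmetric positive-definite) matrix, never the matrix itself. The sketch below is a generic textbook CG loop, not HARC: the `matvec` callback stands in for a Hessian-vector product, and the 2x2 system is a toy assumption.

```python
def conjugate_gradient(matvec, b, iters=50, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A, given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, with x = 0 initially
    p = list(r)            # search direction
    rs = sum(v * v for v in r)
    for _ in range(iters):
        if rs < tol:
            break
        Ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def toy_hessian_vector_product(v):
    """Hypothetical stand-in for a Hessian-vector product: A = [[4, 1], [1, 3]]."""
    return [4.0 * v[0] + 1.0 * v[1], 1.0 * v[0] + 3.0 * v[1]]


solution = conjugate_gradient(toy_hessian_vector_product, [1.0, 2.0])
```

For this system the exact solution is (1/11, 7/11), which CG reaches in two iterations up to floating-point error.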

Grasp-Then-Plan with Failure Attribution: A Closed Two-Stage Framework for Precise and Generalizable Robotic Manipulation

arXiv cs.AI · Jiahao Xu, Peiyuan Wang, Hanzhuo Zhang, Zihao Yu · 2026-06-02

The paper proposes GTP-FA, a two-stage grasp-then-plan framework for robotic manipulation that decouples grasping and motion planning while attributing failures to specific modules. The method first generates grasp candidates and performs motion planning conditioned on the selected grasp, then learns a failure attribution model to diagnose failures and optimize both grasping (via task-level priors and risk penalties) and planning (via targeted data collection). Evaluations in simulation and real-robot experiments show GTP-FA improves base learners across RL, IL, diffusion-policy, and VLA-based settings, achieving higher task success rates.

robotic manipulation · failure attribution · grasp planning · motion planning · diagnosis-driven optimization

Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions

arXiv cs.AI · Bingxu Liu, Jiashun Liu, Johan Obando-Ceron, Hao Wang · 2026-06-02

The paper introduces Gaussian Trust Region Policy Optimization (GTR), a novel reinforcement learning method addressing Proximal Policy Optimization's (PPO) inefficiency in non-stationary environments. GTR reshapes the trust region using a Gaussian kernel, enabling bounded, non-monotonic constraints that balance local stability with progressive relaxation under high-advantage updates. A Mixture Gaussian Anchor further enhances robustness by adapting to recent policy trajectories. GTR demonstrates strong performance across diverse tasks, including games, robotic control, open-world exploration, and language model post-training, showcasing its effectiveness in complex, non-stationary settings.

gaussian trust region · proximal policy optimization · non-stationary environments · mixture gaussian anchor · reinforcement learning

AI Model Extraction Attacks: Bypassing Single-Client Assumptions in Defenses

arXiv cs.AI · Maxime Schwarzer, Johannes F. Loevenich, Gustavo Sánchez, Laurin Holz · 2026-06-02

This work exposes the vulnerability of existing Model Extraction Attack (MEA) defenses that rely on the Single Client Assumption (SCA), demonstrating their ineffectiveness against coordinated adversaries like Advanced Persistent Threats (APTs). The authors introduce CerberusAI, a modular open-source framework for simulating distributed MEAs, and evaluate attacks against established defenses such as PRADA. Results show that basic round-robin query distribution and adaptive traffic mixing can bypass detection mechanisms, reducing their effectiveness and necessitating stateful, identity-independent defense architectures.

model extraction attacks · single client assumption · advanced persistent threats · prada · cerberusai

P²-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

arXiv cs.AI · Ruipeng Zhang, Zhihao Li, Haozhang Yuan, C. L. Philip Chen · 2026-06-02

The paper introduces P²-DPO, a Direct Preference Optimization method for Large Vision-Language Models (LVLMs) targeting perceptual bottlenecks and visual robustness. The approach generates on-policy preference pairs to enhance Focus-and-Enhance perception and employs a Calibration Loss for precise visual-text alignment. Evaluations on Attention Region Fidelity (ARF) and degraded inputs demonstrate superior performance over human-feedback baselines, addressing hallucination and visual degradation challenges effectively.

perceptual processing · direct preference optimization · visual robustness · attention region fidelity · calibration loss

The Unsampled Truth: Psychometrics in SLMs Measure Prompt Artifacts, Not Psychological Constructs

arXiv cs.AI · Nils Schwager, Christoph Hau, Simon Münker, Achim Rettinger · 2026-06-02

The study demonstrates that psychometric assessments in small language models (SLMs) primarily measure prompt artifacts rather than genuine psychological constructs. Using a prompt variation framework across 13 open-weights models (0.6B to 14B parameters), the authors systematically manipulated personas, instructions, items, and option symbols. Results show artifactual variance frequently dominates semantic signals, indicating models reflect prompt compliance rather than simulated traits, limiting their psychometric utility but offering a diagnostic tool for future research.

small language models · psychometric assessments · prompt artifacts · semantic signals · prompt variation framework

AugMask: Training Diffusion Models on Incomplete Tabular Data via Stochastic Augmentation and Masking

arXiv cs.AI · Jungkyu Kim, Taeyoung Park, Kibok Lee · 2026-06-02

AugMask introduces a training framework for adapting score-based diffusion models to incomplete tabular data by separating conditioning from supervision. The method employs conditional stochastic augmentation via lightweight auxiliary models to construct numeric inputs, while applying denoising supervision only to observed coordinates, treating augmented missing entries as uncertain context. Theoretical analysis links this approach to a Rao-Blackwellized objective with variance-weighted sensitivity penalties. Experiments demonstrate that AugMask enables standard diffusion-based tabular generators to outperform specialized missing-aware baselines across diverse datasets and missingness regimes.

diffusion models · tabular data · stochastic augmentation · missing values · rao-blackwellized objective
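
The idea of supervising only observed coordinates can be sketched as a masked squared-error loss. This is a simplified illustration, not AugMask's objective: the function name and the plain mean-squared form are assumptions, and the variance-weighted penalties from the paper's analysis are not modeled.

```python
def masked_denoising_loss(pred, target, observed):
    """Mean squared error over observed coordinates only.

    pred, target: per-coordinate values for one table row.
    observed: booleans; False marks entries that were missing and then
    filled by augmentation, so they act as context but get no supervision.
    """
    terms = [(p - t) ** 2 for p, t, m in zip(pred, target, observed) if m]
    return sum(terms) / len(terms) if terms else 0.0


# The middle entry was missing in the raw data, so its large error is ignored.
loss = masked_denoising_loss(
    pred=[1.0, 5.0, 2.0],
    target=[0.0, 0.0, 4.0],
    observed=[True, False, True],
)
print(loss)  # 2.5
```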

SynCred-Bench: Benchmarking Synthetic Credibility in AI-Generated Visual Misinformation

arXiv cs.AI · Junxiao Yang, Minghao Zhang, Xiaoce Wang, Haoran Liu · 2026-06-02

The paper introduces SynCred-Bench, a benchmark comprising 600 AI-generated misinformation images across six credible-form categories and seven circulation styles, alongside FP450, a real-image negative set for false positive evaluation. The study evaluates 15 multimodal large language models (MLLMs), open-source AI-generated content (AIGC) detectors, and commercial APIs, revealing poor performance under a 5% false-positive-rate constraint (10.5%, <5%, and 57.6% true positive rates, respectively). Human annotators achieved only a 63% true positive rate, highlighting synthetic credibility as a critical yet underexplored challenge in visual misinformation detection.

synthetic credibility · misinformation detection · multimodal llms · aigc detectors · benchmark evaluation
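
A true positive rate "under a 5% false-positive-rate constraint" is typically computed by choosing a score threshold on the negative set and then measuring recall on the positives. The sketch below shows that calculation under stated assumptions: detector scores where higher means "more likely AI-generated", and toy score lists that are not benchmark data.

```python
def tpr_at_fpr(pos_scores, neg_scores, max_fpr=0.05):
    """Best true positive rate among thresholds with FPR <= max_fpr."""
    best = 0.0
    candidates = sorted(set(pos_scores) | set(neg_scores)) + [float("inf")]
    for thr in candidates:
        fpr = sum(s >= thr for s in neg_scores) / len(neg_scores)
        if fpr <= max_fpr:
            tpr = sum(s >= thr for s in pos_scores) / len(pos_scores)
            best = max(best, tpr)
    return best


# Toy scores: 20 real images (one scored high by mistake), 4 generated images.
real_scores = [0.1] * 19 + [0.9]
generated_scores = [0.95, 0.8, 0.5, 0.05]
result = tpr_at_fpr(generated_scores, real_scores, max_fpr=0.05)
print(result)  # 0.75
```

Fixing the false positive rate first makes detectors with different score scales comparable, which is why a real-image negative set such as FP450 is needed alongside the generated images.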

Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions

arXiv cs.AI · Atm Mizanur Rahman, Md Arid Hasan, Syed Ishtiaque Ahmed, Sharifa Sultana · 2026-06-02

The study introduces a benchmark of 991 real-world consumer device repair questions from Reddit, spanning phone repair, computer repair, and data recovery, to evaluate LLMs' effectiveness in this safety-critical domain. The benchmark includes technician-written reference solutions and Bangla translations for cross-lingual evaluation. Six state-of-the-art LLMs were assessed using four repair-specific criteria: correctness, completeness, practicality, and safety. Results indicate LLMs provide useful assistance but remain unreliable for high-risk tasks, with phone repair being the most challenging and safety-sensitive. GPT-5.4 performed best overall, while Bangla responses consistently underperformed English ones.

large language models · consumer device repair · safety-critical tasks · cross-lingual evaluation · benchmark

FLIPS: Instance-Fingerprinting for LLMs via Pseudo-random Sequences

arXiv cs.AI · Gurvan Richardeau, Gohar Dashyan, Erwan Le Merrer, Gilles Tredan · 2026-06-02

FLIPS introduces instance-level fingerprinting for LLMs, addressing the regulatory need to distinguish between different configurations of the same model. The method leverages biases in generated binary pseudo-random sequences to identify model instances, achieving 96% closed-set and 90% open-set accuracy across 237 instances, significantly outperforming the LLMmap baseline (35%). This demonstrates the feasibility of configuration-specific identification for AI regulation. Code is publicly available for reproducibility.

instance-level fingerprinting · llm · pseudo-random sequences · closed-set accuracy · open-set accuracy

InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain

arXiv cs.AI · Tiancheng Han, Yong Li, Wuzhou Yu, Qiaosheng Zhang · 2026-06-02

The paper introduces InfoMem, a reward mechanism for training chunk-wise memory agents in long-context tasks, which evaluates final-memory utility using answer-conditioned information gain. InfoMem measures the increase in per-token log-likelihood of the ground-truth answer provided by the final memory, applying this signal only to successful trajectories and normalizing it for stable RL optimization. Under the GRPO framework, InfoMem outperforms comparable RL baselines, demonstrating that effective final-memory rewards should operate on successful trajectories, be normalized, and be answer-conditioned.

long-context tasks · memory agents · information gain · rl optimization · log-likelihood
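
The three properties named in the summary (answer-conditioned, success-gated, normalized) can be combined in a small sketch. It is an illustration only: the dictionary keys, the zero reward for failed trajectories, and the z-score normalization are assumptions, since the summary does not give the exact formula.

```python
def info_gain(logp_with_memory, logp_without_memory):
    """Per-token log-likelihood gain of the answer from conditioning on memory."""
    n = len(logp_with_memory)
    return (sum(logp_with_memory) - sum(logp_without_memory)) / n


def infomem_rewards(trajectories):
    """Gate the gain on success, then z-normalize across successful trajectories."""
    gains = [
        info_gain(t["with"], t["without"]) if t["success"] else None
        for t in trajectories
    ]
    valid = [g for g in gains if g is not None]
    if not valid:
        return [0.0] * len(trajectories)
    mean = sum(valid) / len(valid)
    std = (sum((g - mean) ** 2 for g in valid) / len(valid)) ** 0.5
    return [0.0 if g is None else (g - mean) / (std + 1e-8) for g in gains]


# Hypothetical token log-probabilities of the ground-truth answer.
group = [
    {"success": True, "with": [-1.0, -1.0], "without": [-2.0, -2.0]},   # gain 1.0
    {"success": False, "with": [-0.1, -0.1], "without": [-5.0, -5.0]},  # ignored
    {"success": True, "with": [-0.5, -0.5], "without": [-3.5, -3.5]},   # gain 3.0
]
rewards = infomem_rewards(group)
```

The failed trajectory receives no memory reward even though its raw gain would be the largest, which is the gating behavior the summary describes.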

Calibration Data Trade-offs Across Capability Dimensions: Why Multi-Source Mixing Matters for High-Sparsity LLM Pruning

arXiv cs.AI · Hu Xu, Zhaolong Xing, Congcong Liu, Jiaxing Wang · 2026-06-02

The study reveals that calibration data source selection significantly impacts distinct capability dimensions in high-sparsity LLM pruning, showing opposite-sign correlations between perplexity and retention (General: ρ=+0.71 vs. Math/Code: ρ≈-0.55). It introduces IGSP, an information-guided self-calibration protocol for automated multi-source mixing without capability-aligned corpora, optimizing 4-gram aggregation and perplexity balancing. Evaluated on LLaMA-3.1-8B at 60% sparsity, uniform multi-source mixing achieves 58.8% total retention (+8.8 over best single-source MetaMath), with IGSP outperforming Self-Cal by +2.4 and SGS by +4.8.

post-training pruning · calibration perplexity · capability dimensions · multi-source mixing · sparsegpt

The Violation Situation Pattern: A Knowledge-Graph Pattern for Compliance Violations

arXiv cs.AI · Nima Kamali Lassem, Fuqi Song, Seyid Amjad Ali · 2026-06-02

The Violation Situation Pattern (VSP) introduces a knowledge-graph pattern for persistently representing compliance violations as graph nodes with rule identifiers, temporal intervals, lifecycle states, and evidence links. Building on Gangemi and Mika's Situation pattern, VSP operationalizes four deontic rules (V1-V4) via an FCL->Cypher->MERGE pipeline, evaluated on BODACC corporate-officer publications and GDPRhub enforcement decisions. Key results show rule-body independence: extending V4 from clause-presence to deadline checking improves F1 from 0.312 to 0.602 while maintaining pattern semantics, enabling detector evolution without invalidating audit history.

violation situation pattern · knowledge-graph · deontic rules · fcl->cypher->merge pipeline · prov-o-aligned events

dstack-capsule: Pod-Level Remote Attestation for Confidential Workloads on Kubernetes

arXiv cs.AI · Yang Yang, Kevin Wang, Yuanhai Luo, Hang Yin · 2026-06-02

dstack-capsule introduces a Kubernetes platform enabling Pod-level remote attestation for confidential workloads on Intel TDX, addressing the resource overhead of per-VM isolation in existing solutions like Confidential Containers. The system employs a two-layer attestation architecture: static platform measurements are frozen in RTMR via a privilege fuse, while dynamic Pod identities are embedded in TDX Quote's report_data field and hardware-signed. Key innovations include a Pod-level attestation protocol, a privilege fuse mechanism, a multi-layer sandbox, and an open-source implementation based on Kubernetes 1.32, Intel TDX, and Sysbox. Evaluations confirm Pod-granularity verification without per-VM isolation overhead.

remote attestation · intel tdx · kubernetes · confidential workloads · pod-level
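
The `report_data` field of a TDX quote holds 64 bytes of caller-chosen data, so a Pod identity has to be reduced to that size before it can be bound into the quote. The sketch below shows one way to do that with SHA-512; the identity fields, the `|` separator, and the function name are hypothetical and are not taken from dstack-capsule.

```python
import hashlib


def pod_report_data(pod_uid: str, image_digest: str) -> bytes:
    """Derive a 64-byte value that could be placed in a quote's report_data.

    The choice of fields and their encoding is an assumption for illustration.
    """
    payload = f"{pod_uid}|{image_digest}".encode("utf-8")
    return hashlib.sha512(payload).digest()  # SHA-512 digests are 64 bytes


# Hypothetical identifiers.
data = pod_report_data("pod-1234", "sha256:abcdef")
same = pod_report_data("pod-1234", "sha256:abcdef")
other = pod_report_data("pod-5678", "sha256:abcdef")
```

A verifier that knows the expected Pod identity can recompute the same hash and compare it with the hardware-signed field, which is what makes the binding checkable per Pod.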

Multi-Modal Graph Neural Network with Transformer-Guided Adaptive Diffusion for Preclinical Alzheimer Classification

arXiv cs.AI · Jaeyoon Sim, Minjae Lee, Guorong Wu, Won Hwa Kim · 2026-06-02

The authors propose a multi-modal graph neural network with transformer-guided adaptive diffusion for preclinical Alzheimer's disease (AD) classification. The method integrates a downstream transformer to guide diffusion processes, aggregating both short-range (via diffusion-kernel) and long-range (via multi-head attention) graph properties. Evaluations demonstrate improved classification performance across modalities and identification of key regions of interest associated with preclinical AD stages.

graph neural network · adaptive diffusion · multi-head attention · neurodegenerative disease · region of interest

RobotValues: Evaluating Household Robots When Human Values Conflict

arXiv cs.AI · Jongwook Han, Hyeongjin Kim, Yohan Jo · 2026-06-02

The paper introduces RobotValues, a novel benchmark for evaluating household robot planners in 10,000 value-conflict scenarios where actions prioritize competing human values (e.g., autonomy vs. efficiency). The benchmark is constructed via LLM-assisted scenario generation, stakeholder-grounded value extraction, image generation, and automated quality control. Evaluation of vision-language models reveals default value preferences (safety, accommodation) and failure to override these when instructed (80% incorrect action selection), demonstrating the need for value-aware robot evaluation beyond task completion metrics.

value-conflict scenarios · household robotics · llm-assisted generation · vision-language models · robot evaluation

Learning Multi-Scale Hypergraph for High-Order Brain Connectivity Analysis

arXiv cs.AI · Jaeyoon Sim, Soojin Hwang, Seunghun Baek, Guorong Wu · 2026-06-02

The paper introduces MuHL, an adaptive multi-scale hyperedge learning framework for high-order brain connectivity analysis. The method constructs hierarchical node features and dynamically learns high-order interactions through continuous hyperedge construction over multi-resolution graph signals, addressing limitations of pairwise interaction models. Experiments on brain network benchmarks show MuHL improves neurodegenerative disease classification accuracy and identifies disease-relevant regions of interest (ROIs) and their group-wise interactions.

hypergraph learning · brain connectivity · neurodegenerative disease · multi-scale analysis · higher-order interactions

Generalizing Graph Foundation Models via Hyperbolic Retrieval-Augmented Generation

arXiv cs.AI · Yifan Jin, Qirui Ji, Bin Qin, Jiangmeng Li · 2026-06-02

The paper introduces Hyperbolic Retrieval-Augmented Generation (HyRAG), a framework enhancing graph foundation models' generalization by addressing Euclidean space limitations in existing retrieval-augmented generation (RAG) approaches. HyRAG employs Hyperbolic Knowledge Indexing to preserve tree-structured hierarchies, Multi-granularity Retrieval for semantic anchors and nuances, and Dual-path Fusion for knowledge integration at feature and structural levels. Experiments on graph benchmarks demonstrate significant zero-shot performance improvements, validating the method's robustness in cross-domain inference.

hyperbolic retrieval-augmented generation · graph foundation models · multi-granularity retrieval · dual-path fusion · zero-shot learning

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

arXiv cs.AI · Wojciech Zarzecki, Jan Dubiński, Sebastian Cygert · 2026-06-02

The study exposes critical reliability gaps in statistical methods for detecting benchmark contamination in LLM evaluation, identifying distribution shift and scale constraints as key failure modes. It systematically evaluates three detection paradigms (LLM Dataset Inference, Post-Hoc Dataset Inference, and CoDeC) across 27 models, including Pythia, OLMo 2, and domain-specific LLMs, at scales up to 27B parameters. Results from 335 evaluations show only 199 correct outcomes, with each method exhibiting distinct vulnerabilities: false positives under distribution shift, insufficient power at benchmark scale, and coarse provenance signals. The findings demonstrate that current statistical detection cannot replace transparent data provenance for reliable auditing.

benchmark contamination · distribution shift · llm dataset inference · post-hoc dataset inference · codec

LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

arXiv cs.AI · Po-Nien Kung, Linfeng Song, Dawsen Hwang, Jinsung Yoon · 2026-06-02

LEAP introduces an agentic framework that enhances general-purpose foundation models for formal theorem proving, achieving state-of-the-art performance. The system decomposes complex problems into smaller units, leveraging informal reasoning, instruction following, and iterative self-refinement while interacting with the Lean compiler. Evaluated on Lean-IMO-Bench and the 2025 Putnam Competition, LEAP solves all 12 Putnam problems and boosts one-shot formal solve rates from below 10% to 70%, surpassing specialized IMO systems (48%). It also autonomously formalizes proofs for open combinatorial challenges, including a key subproblem in Knuth's Hamiltonian decomposition.

agentic framework · formal theorem proving · lean compiler · iterative self-refinement · imo-style problems

Message Tuning Outshines Graph Prompt Tuning: A Prismatic Space Perspective

arXiv cs.AI · Yancheng Chen, Dun Ma, Shuai Zhang, Yang Liu · 2026-06-02

The paper introduces Prismatic Space Theory (PS-Theory), a mathematical framework quantifying adaptation capacity in Graph Foundation Models (GFMs), establishing an upper bound for graph prompt tuning. It proposes Message Tuning for GFMs (MTG), injecting learnable message prototypes into GNN layers to guide message fusion without updating pre-trained weights. Theoretical analysis proves MTG exceeds graph prompt tuning's capacity limits, with experiments showing consistent performance gains across benchmarks.

graph foundation models · prismatic space theory · message tuning · adaptation capacity · graph prompt tuning

AI-Generated Traces for Novice Programmers: Learning Effects and Learner Differences in a Multi-Institutional Study

arXiv cs.AI · Yuri Noviello, Naaz Sibia, Anastasiia Birillo, Thomas Overklift Vaupel Klein · 2026-06-02

The study introduces Generated Animated Traces (GATs), AI-generated analogy-based narrated animations for CS1 education, comparing their efficacy against textual explanations in a multi-institutional study (Python, N=961; Java, N=151). Results indicate selective immediate learning benefits for GATs, though effects are context-dependent and short-term, with performance moderated by learner engagement profiles. The findings highlight the need for personalized educational approaches in programming instruction.

generated animated traces · cs1 education · program visualization · learner engagement · multi-institutional study

A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting

arXiv cs.AI · Peiyan Zhang · 2026-06-02

This study investigates cross-model activation transfer between language models for multi-hop reasoning, demonstrating that offline representational alignment does not enable useful causal communication. Using Pythia-160M and Pythia-410M, a linear translation layer achieves high normalized cosine similarity (near 0.97) between sender and receiver hidden states. However, injecting translated activations at inference time fails to improve downstream performance: additive injection remains near baseline, replacement-style injection degrades results, and rescaling vectors does not recover performance. The findings suggest that hidden-state alignment alone is insufficient for effective activation transfer in this setting.

activation transfer · multi-hop reasoning · linear translation layer · hidden states · normalized cosine similarity

VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch

arXiv cs.AI · Hang He, Chuhuai Yue, Chengqi Dong, Chengcheng Wan · 2026-06-02

VistaHop introduces a benchmark for evaluating multi-hop visual reasoning in Visual DeepSearch, addressing limitations in iterative image inspection, visual-anchor grounding, and multi-hop evidence integration. The benchmark comprises 300 high-resolution images, 25 visual search scenarios, and 350 multi-hop QA tasks, supported by VistaArena, a unified evaluation environment enabling tool-augmented reasoning. Experiments on seven multimodal large reasoning models (MLRMs) reveal significant gaps, with SenseNova-MARS-32B achieving only 24.31% Pass@1, highlighting deficiencies in visual grounding, evidence revisiting, long-chain reasoning, and multi-anchor information fusion.

visual deepsearch · multi-hop reasoning · visual grounding · tool-augmented reasoning · evidence integration

Are Common Substructures Transferable? Riemannian Graph Foundation Model with Neural Vector Bundles

arXiv cs.AI · Li Sun, Zhenhao Huang, Yiding Wang, Qin Chen · 2026-06-02

The paper proposes GAUGE, a Riemannian graph foundation model that learns transferable substructures via intrinsic geometry. The method introduces Neural Vector Bundles to parse intrinsic geometry with local coordinates, coupled with a Dirichlet loss measuring transfer effort. Theoretical analysis connects transferable substructures to representation space geometry. Experiments demonstrate superior expressiveness in zero-shot link prediction and graph isomorphism tasks, validating the approach's effectiveness.

riemannian geometry · graph foundation model · neural vector bundle · intrinsic geometry · zero-shot link prediction

Distilling Answer-Set Programming Rules from LLMs for Neurosymbolic Visual Question Answering

arXiv cs.AI · Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch · 2026-06-02

The paper presents a method for distilling answer-set programming (ASP) rules from large language models (LLMs) to enhance neurosymbolic visual question answering (VQA). By prompting LLMs to extend initial VQA reasoning theories expressed as ASP programs, the approach uses few-shot examples to validate and correct rules via ASP solver feedback. Experiments demonstrate effectiveness across diverse VQA datasets, showing that few examples suffice for accurate rule elicitation. The results suggest LLM-based rule distillation as a viable alternative to traditional data-driven rule learning.

answer-set programming · visual question answering · large language models · neurosymbolic reasoning · rule distillation

EqGINO: Equivariant Geometry-Informed Fourier Neural Operators for 3D PDEs

arXiv cs.AI · Sungwon Kim, Juho Song, Seungmin Shin, Guimok Cho · 2026-06-02

The authors propose EqGINO, an equivariant geometry-informed Fourier neural operator for 3D PDEs that combines global spectral processing with geometric robustness. The method enforces isotropy in the spectral domain to achieve exact equivariance to discrete symmetries of the computational domain while generalizing to continuous SE(3) transformations. Experiments demonstrate effective modeling of coordinate-invariant physical laws on irregular 3D geometries with limited training samples, overcoming limitations of both local equivariant networks and non-equivariant FNOs.

equivariant neural networks · fourier neural operators · 3d pdes · geometric deep learning · spectral methods

PSViT: A Methodology for Structurally Pruning Spiking Vision Transformers

arXiv cs.AI · Rachmad Vidya Wicaksana Putra, Achyuta Muthuvelan, Alberto Marchisio, Muhammad Shafique · 2026-06-02

PSViT introduces a structured pruning methodology for Spiking Vision Transformers (SViTs) to enable efficient deployment on resource-constrained platforms. The approach employs uniform channel-wise filter pruning, sensitivity analysis, and fine-grained pruning to eliminate non-significant weights while maintaining accuracy. Experiments on ImageNet-1K demonstrate that PSViT achieves 22.4% memory savings through single-shot pruning, with accuracy drops limited to 3% (70.3% without fine-tuning, 72.8% with fine-tuning) compared to the original SViT model (73.3%). This structured pruning method avoids the need for specialized hardware, enhancing scalability for embedded applications.

structured pruning · spiking vision transformers · channel-wise pruning · sensitivity analysis · resource-constrained platforms

AirDreamer: Generalist Drone Navigation with World Models

arXiv cs.AI · Zian Liu, Andong Yang, Chunkai Yang, Ruidong An · 2026-06-02

AirDreamer introduces a generalist drone navigation framework combining reinforcement learning with world-model-based environment understanding to improve generalization in unseen, cluttered environments. The method employs a sparse reward function to avoid local minima and promote yaw control, eliminating reliance on hand-crafted perception pipelines. Evaluations in simulation and real-world deployments demonstrate a 5.3% higher success rate in challenging maps compared to baselines, with effective sim-to-real transfer requiring no deployment tuning.

world model · reinforcement learning · sim-to-real transfer · sparse reward · yaw control

Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection

arXiv cs.AI · Gautam Gare, John Galeotti, Michael Mozer, Deva Ramanan · 2026-06-02

This study investigates whether real-world datasets contain natural experiments by employing causal discovery and feature selection. The authors validate their approach using synthetic graphs and systematically evaluate multiple real-world datasets. Results demonstrate that natural experiments are indeed present in such datasets, and leveraging them through causal inference improves model performance. The work provides initial empirical evidence for this phenomenon, establishing a foundation for future research in causal feature selection.

natural experiments · causal discovery · feature selection · causal inference · synthetic graphs

When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming

arXiv cs.AI · Zelalem Abahana · 2026-06-02

The study presents a mechanistic taxonomy of RLHF failure modes, demonstrating that reward hacking manifests as distinct training dynamics rather than terminal pathologies. Using a compact RLHF pipeline with PPO, DPO, and UP-PPO variants, the authors analyze 1920 transitions across 61 checkpoints via reward trajectories, judge scores, and diagnostic metrics. Results show aggressive PPO exhibits the highest localized reward-hacking rate (14.45%), while UP-PPO reduces it (10.94-11.33%), with a pre-transition logistic model achieving 0.821 ROC-AUC for predicting reward hacking. The work establishes that RLHF failures can be systematically classified and partially anticipated during training.

rlhf · reward hacking · proximal policy optimization · direct preference optimization · evaluator gaming

Solipsistic Superintelligence is Unlikely to be Cooperative

arXiv cs.AI · Rakshit S Trivedi, Natasha Jaques, Logan Cross, Alexander Sasha Vezhnevets · 2026-06-02

The paper argues that superintelligence developed through solipsistic optimization paradigms will inherently lack cooperative behavior due to endogenous non-stationarity and train-test-deploy distributional shifts. It introduces the concept of self-undermining unilateral optimization, proposing instead a non-solipsistic research paradigm centered on cooperation as a design primitive. Key recommendations include dynamic evaluation testbeds with adaptive counterparties, institutional primitives, and structural preservation of human agency to address equilibrium-selection challenges in multi-agent systems.

superintelligence · non-stationarity · equilibrium-selection · solipsistic optimization · endogenous feedback

Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents

arXiv cs.AI · Zhijie Ding, Weinan Hong, Zicheng Zhu, Lei Li · 2026-06-02

The paper proposes the Pre-Reasoning Perception Framework (PRPF) to improve proactive mobile agents by decoupling intervention timing from assistance generation. PRPF employs a lightweight Multimodal Proactive Perceptor (MPP) for intervention gating and context compression, activating the Proactive Agent Reasoner (PAR) only when necessary. Evaluated on the ProactiveMobile benchmark, PRPF reduces false trigger rates by 37% while increasing success rates by 15% and improving inference efficiency compared to unified MLLM pipelines.

multimodal large language models · proactive mobile agents · intervention gating · context compression · false trigger rate

GFFMERGE: Efficient Merging of Graph Neural Force Fields and Beyond

arXiv cs.AI · Parth Verma, Parv P. Singh, Vipul Garg, Ishita Thakre · 2026-06-02

The authors present GFFMERGE, a principled framework for closed-form merging of Graph Neural Network (GNN)-based force fields, addressing the costly retraining required for new chemical systems. The method exploits linear message-passing layers, formulating merging as a convex embedding-alignment problem with an analytical solution. Benchmarking across molecular (MD17, MD22), solid-state (LiPS20), and large-scale graph datasets shows GFFMERGE achieves 5-27× speedups while maintaining performance close to joint training, outperforming vision/language merging methods and providing superior initialization for fine-tuning.

graph neural networks · force fields · model merging · message-passing · embedding-alignment

BotDirector: Robot Storytelling Across the Symmetrical Reality with Multi-modal Interactions

arXiv cs.AI · Zhe Sun, Meng Wang, Lei Wang, Yuxi Wang · 2026-06-02

The paper introduces BotDirector, an interactive system for robot storytelling that combines tangible interfaces with natural language processing. The system enables children to construct narratives using an LLM agent, which are then translated into motion sequences for self-navigating swarm robots based on a physical playground map and characters. This approach enhances flexibility in scenario creation, allowing young users to generate robot dramas with everyday objects. The integration of multi-modal interactions (tangible, linguistic, and robotic) facilitates accessible and engaging storytelling experiences for children.

robot storytelling · llm agent · self-navigating swarm robots · multi-modal interactions · tangible interfaces

WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

arXiv cs.AI · Yuxin Meng, Yuhan Suo, Junjie Wang, Yuhan Sun · 2026-06-02

WebRISE introduces Interaction Contract Graphs (ICGs) to evaluate multimodal large language model (MLLM)-generated web artifacts by modeling task requirements as state transitions and DOM/visual assertions. The benchmark comprises 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks separating explicit functions from implicit constraints. Across the 14 MLLMs evaluated, the best results reach 65.6% transition validity and 66.3% requirement coverage, and Video inputs yield 10.6 pp higher implicit coverage than Text. ICG-based defect detection outperforms checkpoint evaluation by 2-16×.

interaction contract graphs · multimodal llms · dom assertions · requirement coverage · state transitions

Effect of Demographic Bias on Skin Lesion Classification

arXiv cs.AI · Ralf Raumanns, Gerard Schouten, Veronika Cheplygina, Josien P. W. Pluim · 2026-06-02

The study systematically analyzes demographic bias in skin lesion classification using ResNet-based models, focusing on sex and age variations. Linear programming generates controlled datasets to evaluate three learning strategies: single-task, reinforcing multi-task, and adversarial learning. Results show sex-specific training optimizes performance, with male inclusion benefiting male subgroups, while reinforcing and adversarial methods reduce bias in balanced/female-majority datasets but not male-majority ones. Age biases consistently favor younger groups regardless of distribution. Cross-dataset validation reveals domain shifts exacerbate bias patterns, suggesting distinct mitigation strategies for sex (data imbalance) and age (inherent model preference).

demographic bias · skin lesion classification · resnet · adversarial learning · multi-task learning

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

arXiv cs.AI · Jia Yu, Zilong Wang, Xinyang Jiang, Dongsheng Li · 2026-06-02

The authors present MedCUA-Bench, a benchmark for evaluating clinical computer-use agents on medical graphical user interfaces, addressing limitations of existing benchmarks that underrepresent medical software. The benchmark comprises 18 clinical scenarios across 10 domains, reconstructed from product manuals and open-source systems to ensure authenticity while avoiding licensing and privacy issues. Tasks include intent- and step-level goals to separate clinical reasoning from UI execution, with evaluation based on task completion and five safety dimensions. Results show the best closed-source model achieves 54.2% strict success, while open-source agents average 2.5%, with the best reaching 16.2%, highlighting significant gaps in reliable clinical software use.

clinical computer-use agents · graphical user interfaces · task completion · clinical reasoning · safety validation

Reinforcement Learning from Cross-domain Videos with Video Prediction Model

arXiv cs.AI · Zhao Yang, Xinrui Zu, Jacob E. Kooi, Thomas Delliaux · 2026-06-02

XIPER (Cross-domain Video Prediction Reward) introduces a reward model for reinforcement learning from expert videos across visually distinct domains, addressing challenges of absent reward signals and domain gaps. The method trains a cross-domain video prediction model that maps agent observations into the expert domain, using prediction likelihood as a reward signal. Evaluations on the DMC Color Suite (8 tasks) and DMC Body Suite (3 tasks) demonstrate XIPER's superior performance despite domain gaps like agent color and morphology differences. Additionally, XIPER generates meaningful reward signals for real-robot observations using only simulated expert videos.

cross-domain · video prediction · reward model · sim-to-real · reinforcement learning

AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making

arXiv cs.AI · Sangwon Baek, Kyu Yeon Hur, Kyunga Kim · 2026-06-02

This study quantifies AI rater behavior in clinical decision-making by evaluating four open-source LLMs as both CDSS models and raters in adult type 2 diabetes pharmacotherapy. A factorial design crossed scoring protocols (Gold Rubric [GR] vs. Non Gold Rubric [Non-GR]) with five factors: CDSS model, prompt configuration, rater model, prompt character, and prompt type. Results show Non-GR yielded higher, less variable scores (74–78 points), whereas GR scores were 7.69–49.64 points lower with interquartile ranges 1.68–3.67 times wider, indicating greater discrimination. GR amplified discrimination between DRG and Baseline CDSS outputs by 1.76–5.10× and revealed rater model variation that Non-GR suppressed, supporting rubric anchoring for clinical AI evaluation.

clinical decision support system · gold rubric · factorial design · interquartile range · discriminative power

Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models

arXiv cs.AI · Thomas Stephan Juzek, Xiaoyang Ming, Jose A. Hernandez · 2026-06-02

The paper introduces two automated metrics for analyzing lexical misalignment in large language models: the Lexical Alignment Score detects lexical overuse, while the Triangulated Preference Shift quantifies preference-stage attribution. The method analyzes continuations from six model families (Falcon, Gemma, Llama, Mistral, OLMo, Yi) on PubMed abstracts using windowed document prevalence, identifying overused terms like 'suggest' and 'additionally' without manual curation. Results confirm prior findings on preference learning's role in lexical shifts, demonstrating robustness across parameters, seeds, and datasets, enabling scalable cross-linguistic alignment studies.

lexical alignment · preference learning · large language models · automated evaluation · misalignment

OpenAgenet/OAN: Technical Architecture for Trust-Governed Agent Identity and Discovery

arXiv cs.AI · Jinliang Xu · 2026-06-02

The paper presents OpenAgenet/OAN, a protocol-neutral trust layer for open Agent interconnection, detailing its technical architecture. It specifies identity objects, registration workflows, Root-governed lifecycle, and verification requirements to ensure admissible, discoverable, and verifiable Agent identities. The design supports heterogeneous Agent frameworks (MCP, A2A, ANP-like systems) without defining entire business conversations. Key components include signed trusted invocation, authorization-aware Discovery, and security properties for safe protocol-specific interactions.

agent · trust · protocol · identity · discovery

OpenAgenet/OAN: Open Infrastructure for Trusted Agent Interconnection

arXiv cs.AI · Jinliang Xu · 2026-06-02

OpenAgenet/OAN introduces an open infrastructure for trusted Agent interconnection in multi-operator networks, addressing identity provenance, governance state, and pre-connection trust verification. Designed as a protocol-neutral trust layer, OAN provides Root-governed identity admission, Registrar-assisted onboarding, Root-verified package publication, authorization-aware Discovery, and signed trusted invocation without replacing existing Agent interaction protocols. The architecture integrates with MCP, A2A, and ANP, employs blockchain-backed authorization bulletins, and supports diverse deployment patterns. A prototype demonstrates OAN's performance profile, with ongoing development outlined in a detailed roadmap.

agent interconnection · protocol-neutral · identity provenance · blockchain-backed · trust verification

NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation

arXiv cs.AI · NVIDIA, Aarti Basant, Amlan Kar · 2026-06-02

NVIDIA introduces OmniDreams, a real-time generative world model for closed-loop autonomous vehicle simulation, addressing limitations of reconstruction-based neural simulators. The model, mid- and post-trained from the Cosmos diffusion model on 21k hours of driving data, autoregressively generates action-conditioned videos, handling novel scenes like extreme weather and dynamic agents. Integrated with Alpamayo 1 and AlpaSim, OmniDreams enables scalable policy evaluation. Preliminary results show a world-action model (WAM) derived from OmniDreams outperforms Alpamayo 1.5 on the NuRec dataset with 1/5 the parameters, suggesting its dual role as a policy backbone.

generative world model · closed-loop simulation · diffusion model · autonomous vehicles · action-conditioned videos

ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

arXiv cs.AI · Ruihui Hou, Siyi Zhu, Ziyue Huai, Guangya Yu · 2026-06-02

The authors introduce ClinicalMC, a benchmark for evaluating large language models (LLMs) in multi-course clinical decision-making, addressing the gap in existing single-course assessments. The benchmark comprises 1,275 Chinese and 5,804 English samples across four stages of patient care, with an average of 3.42 and 5.11 clinical courses per patient, respectively. Using a multi-agent evaluation framework (patient, examiner, doctor) and two experimental settings (single-turn static, multi-turn dynamic), they assess closed-source, open-source, and medical LLMs, including GPT5-mini, DeepSeek-V3.2, and HuatuoGPT-o1, to understand LLM performance in evolving clinical scenarios.

clinical decision-making · large language models · multi-course evaluation · multi-agent framework · medical benchmark

GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

arXiv cs.AI · Noujoud Nader, Ibrahem Aljabea, Patrick Diehl, Deepti Gupta · 2026-06-02

GTBench introduces a curriculum-grounded benchmark for evaluating LLMs as mathematical research assistants in graph theory, comprising 63 problems across three difficulty levels (undergraduate definitions, algorithm tracing, graduate proof construction). The study evaluates five models (GPT-5, Claude Sonnet 4.6, Gemini 2.5 Flash-Lite, Llama 3.3 70B, Mistral Large 3) under zero-shot and chain-of-thought prompting, using exact-match and LLM-as-judge evaluation. Results show GPT-5 achieves 95.8% accuracy on basic problems and 82% on graduate proofs, while other models degrade significantly, with Llama scoring 0% on Group 3 under human evaluation. Failure analysis reveals execution errors in basic problems and reasoning gaps in advanced proofs, with notable human-LLM judge disagreement (kappa = 0.48-0.83).

graph theory · llm evaluation · mathematical reasoning · curriculum-grounded benchmark · proof construction
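
The human-LLM judge disagreement in the GTBench entry is reported as Cohen's kappa. As a rough illustration of what that statistic measures, here is a minimal sketch of the textbook formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The function name and the toy labels are hypothetical and not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items (textbook formula)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items where the two raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label everywhere
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): 1 = "proof accepted", 0 = "proof rejected".
human_grades = [1, 1, 0, 0, 1, 0, 1, 1]
judge_grades = [1, 0, 0, 0, 1, 1, 1, 1]
kappa = cohens_kappa(human_grades, judge_grades)
```

Here the raters agree on 6 of 8 items, but because both accept most proofs, much of that agreement is expected by chance, so kappa lands well below the raw 75% agreement rate.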

Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

arXiv cs.AI · Kaiqi Yang, Tai-Quan Peng, Sanguk Lee, Hui Liu · 2026-06-02

The paper introduces TBS (Think-Before-Speak), a multi-agent simulation framework that separates private reasoning from public utterance generation in LLM-based social interaction studies. TBS employs interval-based updates where agents modify structured internal states—such as dissonance-related appraisal, perceived opinion climate, and willingness to speak—based on dialogue history and memory. An orchestrator resolves speaking intentions and commits utterances to public dialogue. Evaluated in simulated town hall discussions on climate policy, TBS produces coherent internal-state traces that vary systematically across conditions, revealing that dissonance-related appraisal increases speaking willingness while silence-pressure appraisal decreases it. TBS enables mechanism-sensitive social simulation by making internal evaluation pathways observable.

multi-agent simulation · internal-state traces · dissonance-related appraisal · turn-allocation rules · utterance generation

Uncertainty-Aware Clarification in LLM Agents with Information Gain

arXiv cs.AI · Mengyi Deng, Zhiwei Li, Xin Li, Tingyu Zhu · 2026-06-02

The authors propose an uncertainty-aware clarification framework for LLM agents to address underspecified user instructions by optimizing clarification questions for information gain. Their method employs a Bayesian belief update metric (Information Gain Reward) to train the clarifier LLM, aligning questions with ambiguity resolution in agent-tool-user environments. Evaluations on the τ-Bench environment with five heterogeneous backbones show a 3.7% improvement in success rate over no-clarification baselines, adding only 0.3 interaction steps on average.

llm agents · information gain · bayesian belief update · clarification framework · τ-bench
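
The summary above does not spell out the exact form of the Information Gain Reward, so the following is only a sketch of the standard quantity it alludes to: the expected reduction in entropy of a belief over candidate user intents after a Bayesian update on the user's answer. The intent set, the likelihood table, and all function names are illustrative assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior, likelihood):
    """Expected entropy reduction from asking one clarification question.

    prior[h]         : current belief that intent h is the true one
    likelihood[a][h] : assumed P(answer a | intent h) for this question
    """
    gain = entropy(prior)
    for answer_row in likelihood:
        joint = [p * l for p, l in zip(prior, answer_row)]
        p_answer = sum(joint)
        if p_answer == 0:
            continue  # this answer can never occur under the prior
        posterior = [j / p_answer for j in joint]  # Bayes rule
        gain -= p_answer * entropy(posterior)
    return gain


# Two equally likely intents (hypothetical).
prior = [0.5, 0.5]
# A question whose answer identifies the intent exactly.
decisive_question = [[1.0, 0.0], [0.0, 1.0]]
# A question whose answer is independent of the intent.
useless_question = [[0.5, 0.5], [0.5, 0.5]]

gain_decisive = expected_information_gain(prior, decisive_question)
gain_useless = expected_information_gain(prior, useless_question)
```

Under these assumptions the decisive question resolves the full bit of uncertainty while the useless one resolves none, which is the kind of signal a clarifier could be rewarded on.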

Decoupled Smart Contract Audits: Lightweight LLM Framework via Distillation and Aggregation

arXiv cs.AI · Bagus Rakadyanto Oktavianto Putra, Muhamad Risqi Utama Saputra, Widyawan, Guntur Dharma Putra · 2026-06-02

The study introduces a lightweight LLM framework for smart contract security audits, decoupling tasks into vulnerability detection, explanation, severity classification, and remediation recommendation. It employs Rank-Stabilized Low-Rank Adapters (rsLoRA), knowledge distillation, and a Chain-of-Verification (CoVe) aggregation strategy to optimize accuracy and computational efficiency. Experimental results show the framework outperforms state-of-the-art open-source LLMs (7B to 34B parameters), achieving 98.25% accuracy in vulnerability detection and a 0.4375 alignment score in generative explanations. Ablation studies validate the decoupled approach and identify a severity centrality bias, setting a benchmark for future LLM-assisted auditing research.

smart contract · rslora · knowledge distillation · chain-of-verification · severity classification

GuidedBridge: Training-freely Improving Bridge Models with Prior Guidance

arXiv cs.AI · Zehua Chen, Yucheng Yang, Binjie Yuan, Kaiwen Zheng · 2026-06-02

The authors propose Prior Guidance (PG), a training-free method to enhance bridge models by exploiting instructive clean priors. PG introduces an unseen weak prior to degrade denoising results, contrasts it with seen priors via a scaling factor, and analyzes prior exploitation mechanisms. They further develop Frequency-Modulated Prior Guidance (FMPG), tailoring guidance scales to low- and high-frequency bands, and a cascaded CFG-FMPG framework for image in-painting that combines Classifier-Free Guidance (CFG) and FMPG without compromising inference efficiency. Experiments show PG methods consistently improve pre-trained bridge models across diverse image translation tasks.

prior guidance · bridge models · frequency-modulated prior guidance · classifier-free guidance · image in-painting

AnyAudio-Judge: A Dynamic Rubric-Based Benchmark and Evaluator for Audio Instruction Following

arXiv cs.AI · Haitao Li, Tian Tan, Yuguang Yang, Shan Yang · 2026-06-02

The paper introduces AnyAudio-Judge, a dynamic rubric-based benchmark and evaluator for audio instruction following, addressing limitations in current LLM-based evaluation methods. The approach decomposes audio captions into verifiable binary rubric items and employs a bilingual benchmark with 7,920 samples across four audio domains. A 105K-sample corpus with CoT rationales trains the evaluator using SFT and GRPO, achieving improved zero-shot alignment detection and interpretable reward signals for audio generation tasks.

instruction alignment · rubric-based evaluation · audio generation · group relative policy optimization · chain-of-thought rationales

EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

arXiv cs.AI · Guhong Chen, Yingcheng Shi, Yongbin Li, Binhua Li · 2026-06-02

EvoTrainer introduces a framework for autonomous LLM training that co-evolves policies and training harnesses through empirical feedback, addressing limitations of static recipe search in agentic RL. The method diagnoses rollout-level evidence, revises diagnostics, backtests interventions, and accumulates reusable skills. Evaluations on mathematical reasoning, code generation, and software engineering show it matches or exceeds human-engineered RL baselines, with significant gains in long-horizon agentic SWE, while trajectory analyses reveal domain-specific strategy divergence and prevention of invalid high-scoring branches.

autonomous training · agentic reinforcement learning · co-evolution · rollout-level evidence · reusable skills

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

arXiv cs.AI · Wenkai Wang, Tao Xiong, Jingchen Ni, Yunpeng Bao · 2026-06-02

The paper introduces DeskCraft, a benchmark for evaluating desktop agents on long-horizon professional workflows requiring human-in-the-loop collaboration. DeskCraft organizes tasks by difficulty, with some requiring over 50 execution steps, and formalizes interaction protocols for mid-turn (agent-initiated clarification) and post-turn (user feedback) exchanges. Evaluations of 18 agents on 538 tasks show GPT-5.4 achieving 31.6% accuracy on standard tasks and 27.6% on interactive tasks, revealing persistent failures in long-horizon execution and proactive clarification. The benchmark covers creative software domains including design, video, audio, and 3D creation.

desktop agents · human-in-the-loop · long-horizon workflows · interaction protocol · gpt-5.4

PhotoCraft: Agentic Reasoning with Hierarchical Self-Evolving Memory for Deep Image Search

arXiv cs.AI · Kailin Lyu, Zhiqiang Yuan, Jianwei He, Qiwei Yan · 2026-06-02

PhotoCraft introduces a hierarchical memory system for multimodal large language models (MLLMs) to enhance deep image search through agentic reasoning. The system integrates working, episodic, and semantic memory layers, dynamically invoked during multi-step reasoning to ensure logical consistency and knowledge transferability across tasks. Unlike stateless approaches, PhotoCraft mitigates execution drift and experience isolation by maintaining persistent context. Evaluations on DISBench show consistent improvements in context-aware retrieval across diverse MLLM backbones, achieving up to 18.5% performance gains, addressing key limitations in memoryless deep image search systems.

multimodal large language models · hierarchical memory · deep image search · agentic reasoning · context-aware retrieval

From Long News to Accurate Forecast: Importance-Aware Fusion and PRM-Guided Reflection for Time Series Forecasting

arXiv cs.AI · Mingyang Liu, Qingcan Kang, Yuke Wang, Shixiong Kai · 2026-06-02

The paper introduces a novel framework for time series forecasting that integrates news articles through importance-aware fusion and process-level retrieval supervision. It addresses context window limitations and unguided retrieval by training an importance reward model for selective news compression and a process reward model (PRM) for quality-controlled article selection. Evaluated on finance, energy, traffic, and bitcoin benchmarks, the method improves prediction accuracy, reduces refinement iterations, and handles long articles effectively.

importance-aware fusion · process reward model · time series forecasting · news compression · context window

Decomposing how prompting steers behavior

arXiv cs.AI · Fan L. Cheng, Nikolaus Kriegeskorte · 2026-06-02

The study introduces a nested geometric decomposition framework to analyze how prompting transforms internal representations in LLMs and VLMs. The method aligns representations under different prompts using stimulus-invariant maps (translation, rigid/affine/nonlinear transformations), then causally tests each map by replacing hidden states to measure geometry and behavior recovery. Results across 3 LLMs, 3 VLMs, and 6 datasets show prompts reshape representations toward task structure, with affine transformations (cross-dimensional linear mixing) being crucial for recovering target geometry and behavior. The framework decomposes prompt-induced changes into interpretable geometric components.

geometric decomposition · prompt steering · affine transformation · representational geometry · stimulus-invariant maps

The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs

arXiv cs.AI · Xu Wan, Speed Zhu, Jianwei Cai, Guang Chen · 2026-06-02

This paper introduces Constrained Latent-utility Equilibrium Allocation for Reasoning (CLEAR), an optimal inference budget allocation framework for Large Language Models based on economic principles. CLEAR models per-query reasoning utility using a shifted-surge function and derives an allocation policy that equilibrates marginal utility under resource constraints through a global shadow price. The method performs rational query abandonment and reallocates resources from insolvent queries to solvable ones near their emergence thresholds. Experiments across reasoning tasks demonstrate that CLEAR improves the Pareto frontier of token cost versus mean accuracy, achieving up to 3x accuracy gains in resource-scarce regimes compared to uniform allocation.

inference budget · shadow price · marginal utility · pareto frontier · reasoning utility
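
To make the shadow-price idea in the CLEAR entry concrete, here is a minimal sketch of budget allocation that equalizes marginal utility across queries. The paper's shifted-surge utility is not specified in the summary, so this sketch substitutes a simple concave stand-in, u(b) = a * (1 - exp(-b / s)), and finds the price by bisection. The utility form, the parameters `a` and `s`, and all function names are assumptions for illustration only.

```python
import math


def allocate(queries, price):
    """Per-query token budgets at a given shadow price.

    Each query is (a, s) with stand-in utility u(b) = a * (1 - exp(-b / s)).
    Setting the marginal utility (a / s) * exp(-b / s) equal to the price gives
    b = s * ln(a / (s * price)); a query whose marginal utility never reaches
    the price gets zero budget, i.e. it is abandoned.
    """
    return [max(0.0, s * math.log(a / (s * price))) for a, s in queries]


def solve_shadow_price(queries, budget, iterations=100):
    """Bisect on the price until total allocation fits the budget."""
    low = 1e-12                             # nearly free: allocation is huge
    high = max(a / s for a, s in queries)   # at this price nothing is allocated
    for _ in range(iterations):
        mid = (low + high) / 2
        if sum(allocate(queries, mid)) > budget:
            low = mid    # over budget, so the price must rise
        else:
            high = mid   # within budget, so try a lower price
    return high


# Hypothetical queries: (achievable accuracy gain a, token scale s).
queries = [(0.9, 200.0), (0.8, 400.0), (0.05, 800.0)]
price = solve_shadow_price(queries, budget=300.0)
budgets = allocate(queries, price)
```

With these toy numbers the third query offers too little utility per token at the clearing price, so its budget is zero and the tokens go to the first two queries. This mirrors the abandonment and reallocation behaviour described above, though not the paper's actual utility model.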

BAHSD: Bridging the Long-tail Gap via Adaptive Distillation in Black-box Sequential Recommendation

arXiv cs.AI · Xi Zhou, Famin Wu, Mingming Li, Hongyue Zhang · 2026-06-02

BAHSD introduces an adaptive distillation framework for black-box sequential recommendation systems to address signal heterogeneity in long-tail distributions. The method employs multi-scale consistency probing to assess signal reliability, combining dynamic-temperature KL divergence for high-confidence signals with ranking consistency and InfoNCE contrastive learning for low-confidence signals. Experiments show BAHSD achieves up to 4.98% improvement over the teacher model and 80%+ gains on tail users, demonstrating robust knowledge transfer.

sequential recommendation · black-box distillation · long-tail distribution · contrastive learning · knowledge transfer

"**Important** You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems

arXiv cs.AI · Hang Li, Fedor Filippov, Yuling Lin, Pengfei He · 2026-06-02

This work investigates prompt injection (PI) attacks on LLM-based automatic grading (AG) systems, demonstrating their vulnerability to score manipulation regardless of answer quality. The study systematically evaluates PI attack effectiveness in educational settings and assesses existing defensive strategies. Experiments under rubric-based grading show current AG systems remain highly susceptible, highlighting risks to assessment fairness and reliability. The findings aim to spur research on secure LLM-based educational applications.

prompt injection · automatic grading · llm vulnerability · educational assessment · rubric-based grading

Constitutional On-Policy Safe Distillation

arXiv cs.AI · Ming Wen, Yuxuan Liu, Kun Yang, Yunhao Feng · 2026-06-02

Constitutional On-Policy Safe Distillation (COPSD) addresses the collapse of on-policy self-distillation (OPSD) in safety alignment tasks by mitigating geometric leakage in non-orthogonal semantic spaces. COPSD employs a Cross-SFT cold-start to calibrate the teacher model, followed by constitution-conditioned on-policy distillation to balance safety and expressiveness. Evaluated on 12 benchmarks, COPSD achieves a superior safety-helpfulness trade-off while minimizing the safety tax on general reasoning capabilities, outperforming baseline methods.

on-policy self-distillation · geometric leakage · cross-sft · safety alignment · constitution-conditioned

DELTAMEM: Incremental Experience Memory for LLM Agents via Residual Trees

arXiv cs.AI · Haoran Tan, Zeyu Zhang, Zhicheng Cao, Rui Li · 2026-06-02

DeltaMem introduces a novel memory framework for LLM agents that organizes experience into two residual trees: one for goal-conditioned task experience and another for scene-level environment knowledge. Each tree uses a root node for generalized base experiences and delta nodes for incremental variations, reducing redundancy and retrieval conflicts. Retrieval employs a failure-penalized similarity scan to locate the best match, reconstructing full experiences via root-to-match chain composition. An autonomous consolidation mechanism distills high-frequency paths into new root nodes, enabling self-organization from general heuristics to specialized variants. Experiments demonstrate DeltaMem's consistent superiority over existing baselines across diverse interactive environments.

residual trees · llm agents · experience memory · failure-penalized similarity scan · autonomous consolidation
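
The root-plus-delta structure described in the DeltaMem entry can be sketched as a small tree in which each node stores only what differs from its parent. Everything below is a hypothetical illustration: the dictionary-merge composition, the word-overlap (Jaccard) similarity, and the linear failure penalty are assumptions, not the paper's definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: str                          # short description used for matching
    delta: dict                       # fields this node adds or overrides
    parent: Optional["Node"] = None   # None marks a root node
    failures: int = 0                 # how often reusing this node has failed


def reconstruct(node):
    """Compose the full experience by applying deltas from root to match."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    experience = {}
    for step in reversed(chain):      # root first, matched node last
        experience.update(step.delta)
    return experience


def word_overlap(a, b):
    """Jaccard similarity over lowercase word sets (an assumed stand-in)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def retrieve(nodes, query, penalty=0.1):
    """Scan all nodes; similarity is reduced for nodes that failed before."""
    return max(nodes, key=lambda n: word_overlap(n.key, query) - penalty * n.failures)


root = Node("open a door", {"steps": ["approach", "turn handle", "push"], "tool": "hand"})
locked = Node(
    "open a locked door",
    {"steps": ["approach", "unlock", "turn handle", "push"], "precondition": "have key"},
    parent=root,
)
match = retrieve([root, locked], "open the locked door")
experience = reconstruct(match)
```

The delta node overrides `steps` and adds `precondition`, while `tool` is inherited from the root, so the shared base experience is stored once rather than duplicated per variant.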

Regret Pre-training: Bridging Prior and Posterior Views for Enhanced Knowledge Grounding

arXiv cs.AI · Mingkuan Zhao, Xiayu Sun, Wentao Hu, Suquan Chen · 2026-06-02

The paper introduces Regret Pre-training, a self-supervised framework that enhances causal language models by incorporating future-aware signals through a dual-view architecture. The method employs a regret loss minimizing KL divergence between future-conditioned Teacher and causal Student distributions, implemented via LocalRegret (1-token lookahead) or GlobalRegret (bidirectional context) configurations in OLMoE-1B-7B. Evaluated on nine tasks after 4B-token pretraining, GlobalRegret and LocalRegret achieve 33.9% and 32.2% average accuracy respectively, surpassing the 30.2% baseline, with GlobalRegret showing an 18.1 pp gain on BoolQ, all without added parameters.

regret pre-training · privileged information · kl divergence · dual-view architecture · future-conditioned
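
The regret loss in the entry above is described as a KL divergence between Teacher and Student next-token distributions. A minimal sketch of that quantity for a single position follows. The direction KL(teacher || student) and the toy logits are assumptions, since the summary states neither, and the sketch does not cover how the term is combined with the usual language-modeling loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def kl_teacher_student(teacher_logits, student_logits):
    """KL(teacher || student) over one vocabulary distribution, in nats."""
    log_t = log_softmax(teacher_logits)
    log_s = log_softmax(student_logits)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))


# Hypothetical 4-token vocabulary. The teacher, having seen future context,
# is more confident about token 2 than the causal student is.
teacher_logits = [0.1, 0.2, 3.0, 0.0]
student_logits = [0.5, 0.4, 1.0, 0.3]
regret = kl_teacher_student(teacher_logits, student_logits)
```

The value is zero when the two distributions coincide and grows as the student puts less mass where the teacher is confident, which is the gap the training signal is meant to close.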

Libra: Efficient Resource Management for Agentic RL Post-Training

arXiv cs.AI · Kaiwen Chen, Xin Tan, Jingzong Li, Hong Xu · 2026-06-02

Libra introduces efficient resource management for agentic reinforcement learning (RL) post-training of large language models, addressing three key challenges: long-tailed trajectory distributions, asymmetric compute patterns between rollout and training, and non-stationary workload drift. The system combines a periodic global resource planner for GPU allocation optimization and a causality-driven multi-level feedback queue (C-MLFQ) scheduler for request routing based on tool-return outcomes. Evaluations on 48 A800 GPUs demonstrate up to 3.0× throughput improvement and 2.5× faster reward convergence compared to baselines.

agentic rl · resource management · gpu allocation · trajectory scheduling · non-stationary workloads

Efficient Hyperparameter Optimization for LLM Reinforcement Learning

arXiv cs.AI · Minping Chen, Bowen Xiao, Du Liang, Chuxuan Zeng · 2026-06-02

Joint Fidelity Hyperparameter Optimization (JF-HPO) efficiently tunes hyperparameters for LLM reinforcement learning by jointly adapting model size and training budget as fidelity dimensions. The method employs a small proxy model for rapid evaluation, integrates early-stopping strategies based on training dynamics, and implements an efficient checkpointing mechanism to reduce redundant computations. JF-HPO achieves up to 14.9× computational efficiency gains per trial while maintaining competitive predictive accuracy, and outperforms VeRL Recipe configurations by 5.8% to 111.6%.

hyperparameter optimization · reinforcement learning · proxy model · early-stopping · checkpointing

ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information

arXiv cs.AI · Zehua Liu, Yuxuan Yao, Xiaojin Fu, Tao Zhong · 2026-06-02

The paper proposes Asymmetric-Scale Policy Optimization (ASymPO) for asynchronous LLM post-training without behavior-policy probabilities, addressing distribution drift from stale responses. The method normalizes token losses by their current average negative log-probability to balance scale-imbalanced loss terms while preserving learning signals. Evaluated alongside Scaled Policy Optimization (SPO) in mathematical reasoning tasks, ASymPO maintains response-level zero-sum balance using only current-policy probabilities.

asynchronous reinforcement learning · policy optimization · distribution drift · negative log-probability · language-model post-training

ROBUST-WT: Robust Uncertainty-aware Segmentation Transform via Whitening and Training Enhancements

arXiv cs.AI · Aqsa Naseer, Maryam Bibi, Syeda Samiya Urooj, Muhammad Khurram Shahzad · 2026-06-02

This study enhances the Whitening Transform-based Probabilistic Shape Regularization Extractor (WT-PSE) for robust medical image segmentation across domains. Four improvements are proposed: domain-adaptive augmentation (random erasing, gamma correction, salt-and-pepper noise), a hybrid BCE and Dice loss function, curriculum-based Dice weight scheduling, and command-line ablation flags. Evaluated on the fundus optic disc segmentation benchmark, the enhanced pipeline achieves a Dice score of 0.956 and ASD score of 13.31, surpassing the baseline Dice score of 0.939. Results demonstrate that training-level enhancements yield consistent gains without altering the WT-PSE architecture.

whitening transform · probabilistic shape regularization · domain-adaptive augmentation · dice loss · ablation study
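
The hybrid BCE and Dice loss with curriculum-based Dice weighting named in the ROBUST-WT summary can be illustrated with a minimal pure-Python sketch. The linear ramp, the weight range (0.0 to 0.5), and every function name below are assumptions made for illustration; the summary does not state the schedule the authors actually use.

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened pixel probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2*sum(p*t) + s) / (sum(p) + sum(t) + s)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)

def dice_weight(epoch, total_epochs, w_start=0.0, w_end=0.5):
    """Hypothetical linear curriculum: the Dice weight ramps from w_start to w_end."""
    frac = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    return w_start + frac * (w_end - w_start)

def hybrid_loss(probs, targets, epoch, total_epochs):
    """Convex combination of BCE and Dice with an epoch-dependent Dice weight."""
    w = dice_weight(epoch, total_epochs)
    return (1.0 - w) * bce_loss(probs, targets) + w * dice_loss(probs, targets)
```

Under this assumed schedule, early epochs are dominated by the pixel-wise BCE term, and the overlap-based Dice term gains weight as training proceeds.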

Learn When and Where to Connect: Adaptive Virtual Nodes for Dynamic Message Passing on Graphs

arXiv cs.AI · Jaejun Lee, Joyce Jiyoung Whang · 2026-06-02

The paper proposes MAVN, an adaptive virtual node framework for message passing neural networks (MPNNs) that dynamically introduces and connects virtual nodes (VNs) based on learned importance scores. MAVN employs a dual-perspective scoring mechanism to jointly optimize node-VN connectivity patterns across layers, enabling non-constrained connections and on-demand VN creation. Theoretical analysis shows MAVN can simulate arbitrary connectivity patterns. Experiments on nine datasets demonstrate up to 46.5% performance improvement over backbone MPNNs and superior results compared to baselines.

message passing neural networks · virtual nodes · graph neural networks · adaptive connectivity · dynamic routing

CORE: Conflict-Oriented Reasoning for General Multimodal Manipulation Detection

arXiv cs.AI · Jinjie Shen, Yaxiong Wang, Yujiao Wu, Lechao Cheng · 2026-06-02

The CORE framework introduces conflict-oriented reasoning for detecting multimodal fake news by leveraging intrinsic semantic or physical inconsistencies. It constructs the Conflict Attribution Corpus (CAC) with fine-grained conflict annotations and enhances multimodal large language models (MLLMs) with explicit conflict-capturing capabilities. CORE employs conflict-oriented representation enhancement and reasoning, enabling robust detection in few-shot and zero-shot settings. Extensive experiments show that CORE outperforms state-of-the-art models in generalizing to unseen manipulation types.

multimodal fake news · conflict attribution corpus · multimodal large language models · conflict-oriented reasoning · zero-shot settings

Brief Announcement: Generative Markov Model for Distributed Computing Systems

arXiv cs.AI · Alfreds Lapkovskis, Ali Beikmohammadi, Sindri Magnússon, Praveen Kumar Donta · 2026-06-02

The authors propose a generative Markov model framework for modeling distributed computing systems, factorized over structured system states to capture sparse dependencies. This approach enables tractable simulation, inference, and policy learning in otherwise intractable state spaces, bridging distributed computing with Markov chain theory and reinforcement learning. A case study on collaborative AI inference demonstrates that distributed computation across user devices reduces latency and server resource consumption compared to centralized scheduling, validating the framework's utility for adaptive decision-making in heterogeneous systems.

generative markov model · distributed computing · reinforcement learning · system state factorization · collaborative inference

Rethinking Molecular Text Representations for LLMs: An Empirical Study

arXiv cs.AI · Arun Raja, Garrett M. Morris, Kian Ming A. Chai · 2026-06-02

The study systematically evaluates molecular representation choices for LLMs across nine formats and eight chemical tasks using 16 models from five families. Benchmarking reveals strong representation-dependence: CML performs best overall, followed by MolJSON and InChI, while SMILES variants underperform despite pretraining prevalence. Structured representations (CML/MolJSON) excel at structural tasks, IUPAC dominates semantic tasks, and chemistry-specialized models show SMILES bias. Mechanistic analysis via tokenization audits and attention patterns shows differential encoding of representations. Findings argue against representation-agnostic evaluation and advocate task-aware routing.

molecular representation · llm benchmarking · chemical tasks · tokenization audit · attention patterns

SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale

arXiv cs.AI · Tong Bai, Zhenglin Wan, Pengfei Zhou, Xingrui Yu · 2026-06-02

SkillDAG introduces a typed directed graph for modeling inter-skill relationships in LLM agents, enabling structural retrieval during inference rather than fixed retrieval pipelines. The method allows agents to query vector matches, typed-edge neighbors, and conflict signals, with a propose-then-commit protocol for graph evolution across episodes. Evaluated on ALFWorld and SkillsBench with MiniMax-M2.7, SkillDAG achieves 67.1% success and 27.3% reward, outperforming Graph-of-Skills baselines by +12.8 and +8.6 points, while improving SkillsBench Ret@K from 65.5 to 78.2.

skilldag · llm agents · typed directed graph · structural retrieval · propose-then-commit

ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents

arXiv cs.AI · Anjie Liu, Yan Song, Zhixun Chen, Ziqin Gong · 2026-06-02

ToolGate introduces token-efficient pre-call control for tool-augmented vision-language agents, addressing wasteful tool executions in ReAct-style VLMs. The method employs a lightweight external controller that predicts execute/skip decisions using trajectory text and structural features, reducing token costs to 64-69% of baseline while maintaining accuracy. Evaluated across five benchmarks with Qwen3-VL backbones, ToolGate improves average accuracy by 1.65 points when trained on domain-matched trajectories, demonstrating the value of selective tool output integration.

pre-call control · tool-augmented agents · token efficiency · vision-language models · react-style agents

RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases

arXiv cs.AI · Phillip Jiang · 2026-06-02

RelGT-AC introduces three enhancements to the RelGT architecture for autocomplete tasks in relational databases: (1) a column masking strategy to prevent trivial solutions by masking target columns during subgraph encoding, (2) a unified task head supporting binary classification, multiclass classification, and regression tasks, and (3) a TF-IDF text encoder for free-text columns to recover lexical signals. The method represents databases as heterogeneous graphs and applies graph neural networks. Evaluated on 7 tasks across 3 RelBench v2 datasets, RelGT-AC outperforms GraphSAGE on all regression tasks and achieves up to +10 AUROC points on text-heavy eligibility tasks.

relational databases · graph neural networks · autocomplete tasks · tf-idf encoder · heterogeneous graphs

TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment

arXiv cs.AI · Akshatha Srikantha, Manpreet Singh, Yash Jajoo, Shyamal Lakhanpal · 2026-06-02

TriEval introduces a resource-efficient pipeline for simultaneous evaluation of LLM outputs across bias, toxicity, and truthfulness, addressing limitations of single-parameter tools and high computational demands. The method supports both open- and closed-source models, operating on standard laptops without GPU clusters. Evaluations on Llama 3 8B, Mistral 7B, Gemma 2 9B, and Claude Haiku reveal performance disparities between open- and closed-source models, particularly in toxicity and truthfulness metrics. The tool is released as open-source to democratize access for resource-constrained researchers.

llm evaluation · bias detection · toxicity scoring · truthfulness assessment · resource-efficient pipeline

Capability Advertisement as a Market for Lemons: A Trust Layer for Heterogeneous Agent Networks

arXiv cs.AI · Gaurav Naresh Mittal · 2026-06-02

The paper proposes a trust layer for heterogeneous agent networks to address capability advertisement problems in LLM-based agent delegation. It identifies confident-wrong faults as a non-adversarial subclass of Byzantine faults, models the market-for-lemons problem in faith-based protocols, and introduces a protocol-agnostic trust layer with probabilistic capability descriptors, screening, and reputation mechanisms. The design achieves a separating equilibrium when overclaim costs exceed gains, requires no model retraining, and includes a reliability-composition bound for delegation chains. Results show graceful degradation when trust anchors are absent or corrupt.

capability advertisement · market for lemons · byzantine faults · trust layer · delegation chains

AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification

arXiv cs.AI · Yan Wang, Xuguang Ai, Jaisal Patel, Xueqing Peng · 2026-06-02

AuditFlow introduces a graph-grounded multi-agent framework for structured financial reporting verification, addressing limitations of language-model agents by separating adaptive search from deterministic verification. The system constructs a symbolic environment from a US-GAAP taxonomy graph and an XBRL filing graph, providing tools for fact retrieval, taxonomy traversal, numerical checking, and rule evaluation. Junior auditors inspect cases from regulatory and evidentiary perspectives, while a senior auditor resolves disputes and requests further investigations. On the FinMR benchmark, AuditFlow achieves 82.09% joint audit accuracy using GPT-5.5, surpassing the strongest baseline by 14.93 points. Removing deterministic checks reduces accuracy to 17.91%, highlighting the necessity of the symbolic environment.

auditflow · symbolic environment · taxonomy traversal · evidential aggregation · finmr

Conditional Hypothesis Generation for LLM-Based Text Analysis with Researcher-Specified Covariates

arXiv cs.AI · Paiheng Xu, Jing Liu, Wei Ai · 2026-06-02

The paper introduces conditional hypothesis generation, a framework for LLM-based text analysis that incorporates researcher-specified covariates to address stratum imbalance and sign reversal. The method combines feature-covariate interactions for sign reversal detection and within-stratum demeaning with inverse-frequency reweighting for stratum imbalance. Synthetic experiments demonstrate superior performance over global baselines in targeted settings, while expert evaluation on real-world datasets confirms improved hypothesis utility within relevant subgroups.

conditional hypothesis generation · stratum imbalance · sign reversal · feature-covariate interactions · inverse-frequency reweighting
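
Within-stratum demeaning with inverse-frequency reweighting, as named in the summary above, can be sketched in a few lines. This is a generic illustration, not the paper's implementation: the function name and the weight normalization (each stratum receives the same total weight, and weights sum to the sample count) are assumptions.

```python
from collections import defaultdict

def demean_and_reweight(values, strata):
    """Center each value on its stratum mean and weight strata equally overall.

    values: list of numeric feature values (one per document)
    strata: list of stratum labels aligned with values
    """
    groups = defaultdict(list)
    for v, s in zip(values, strata):
        groups[s].append(v)

    means = {s: sum(vs) / len(vs) for s, vs in groups.items()}
    n, k = len(values), len(groups)

    # Removing the stratum mean strips out between-stratum level differences.
    demeaned = [v - means[s] for v, s in zip(values, strata)]
    # Inverse-frequency weights: rare strata get proportionally larger weights.
    weights = [n / (k * len(groups[s])) for s in strata]
    return demeaned, weights
```

With these weights, a small stratum contributes as much in aggregate as a large one, which is the usual motivation for inverse-frequency reweighting under stratum imbalance.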

Spike-Aware C++ INT8 Inference for Sparse Spiking Language Models on Commodity CPUs

arXiv cs.AI · Ting Liu · 2026-06-02

The paper introduces a C++ CPU inference runtime optimized for sparse spiking language models, leveraging activation sparsity as an execution primitive rather than post-hoc compression. The method combines manifest-driven weight loading, mixed row/column memory layout, AVX2/FMA kernels, INT8 quantization, and integer-domain accumulation for spike-conditioned paths. On an AMD Ryzen 7 5800X, the runtime achieves 22.63 tokens/s (single-thread) for an 874M-parameter INT8 model, outperforming TinyLlama-1.1B (16.31 tokens/s) and others, with 4-thread scaling reaching 47.90 tokens/s. However, WikiText-2 perplexity degrades to 24.80 compared to dense baselines, highlighting a trade-off between efficiency and quality.

spiking language models · int8 quantization · avx2/fma kernels · activation sparsity · cpu inference
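
The idea of treating spikes as an execution primitive with integer-domain accumulation can be shown with a small sketch. It is written in Python for readability and does not mirror the paper's C++ runtime: the symmetric per-tensor quantization scheme and both function names are assumptions for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization (assumed scheme).

    Returns integer weights in [-127, 127] and the float scale.
    """
    scale = max(abs(w) for row in weights for w in row) / 127.0 or 1.0
    q = [[max(-127, min(127, round(w / scale))) for w in row] for row in weights]
    return q, scale

def spike_matvec(q, scale, spikes):
    """Matrix-vector product for a binary spike vector.

    Because inputs are 0/1, each output is a sum of the selected integer
    weights; no multiplications are needed until the final rescale.
    """
    active = [j for j, s in enumerate(spikes) if s]  # skip silent inputs entirely
    return [scale * sum(row[j] for j in active) for row in q]
```

The work done scales with the number of active spikes rather than the input width, which is why activation sparsity translates directly into fewer operations in this formulation.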

Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization

arXiv cs.AI · Mingkuan Zhao, Wentao Hu, Tianchen Huang, Yuheng Min · 2026-06-02

The paper introduces Dynamic Contextual Orthogonalization (DCO), an inference-time intervention method that mitigates hallucinations in Large Language Models by geometrically interpreting them as orthogonal noise relative to the semantic manifold. DCO performs orthogonal decomposition on attention head outputs using the input residual stream as a dynamic context anchor, employing layer-wise Z-score suppression to attenuate outlier orthogonal components. Evaluations on Llama-3-8B and 70B across benchmarks (XSum, NQ-Swap, IFEval, TriviaQA, TruthfulQA) demonstrate DCO's superior contextual faithfulness and knowledge retention, validating the geometric framework while maintaining computational efficiency.

orthogonal noise · semantic manifold · residual stream · z-score suppression · manifold alignment
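
The decomposition step described in the DCO summary can be sketched as a projection of each head output onto the residual-stream direction, followed by attenuation of unusually large orthogonal parts. The threshold, the fixed attenuation factor, and the function names are assumptions for illustration; the summary does not give the paper's exact suppression rule.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decompose(h, r):
    """Split h into components parallel and orthogonal to the anchor r."""
    rr = dot(r, r)
    if rr == 0:
        return [0.0] * len(h), list(h)
    c = dot(h, r) / rr
    par = [c * x for x in r]
    orth = [a - b for a, b in zip(h, par)]
    return par, orth

def suppress_outliers(head_outputs, residual, z_thresh=1.5, scale=0.5):
    """Shrink the orthogonal part of heads whose orthogonal norm is a z-score outlier.

    z_thresh and scale are hypothetical knobs, not values from the paper.
    """
    parts = [decompose(h, residual) for h in head_outputs]
    norms = [math.sqrt(dot(o, o)) for _, o in parts]
    mean = sum(norms) / len(norms)
    std = math.sqrt(sum((n - mean) ** 2 for n in norms) / len(norms))

    out = []
    for (par, orth), nrm in zip(parts, norms):
        z = (nrm - mean) / std if std > 0 else 0.0
        k = scale if z > z_thresh else 1.0
        out.append([p + k * o for p, o in zip(par, orth)])
    return out
```

The parallel component is left untouched, so only the part of a head's output that points away from the current context is reduced.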

Reproducibility is the New Copyleft: Defining AGI-oriented Reproducible Builds

arXiv cs.AI · Masayuki Hatta · 2026-06-02

The paper proposes reproducible builds as a functional analogue of copyleft for AGI systems, addressing the limitations of traditional open-source frameworks in ensuring reconstructability from declared inputs. It critiques Maffulli's Second Liberation thesis and analyzes legal-technical constraints on model components (code, data, weights, etc.), drawing on OSAID, MOF, and deterministic-inference research to define seven reproducibility requirements. The author argues that protocol-based governance (e.g., Model Context Protocol) supersedes copyleft licensing for AI-to-AI coupling mechanisms.

reproducible builds · artificial general intelligence · copyleft · model openness framework · deterministic inference

ConTraIRL: Factorized Contrastive Abstractions for Transferable IRL

arXiv cs.AI · Yikang Gui, Bikramjit Banerjee, Prashant Doshi · 2026-06-02

ConTraIRL introduces factorized contrastive abstractions for transferable Inverse Reinforcement Learning (IRL), enabling compositional reward transfer across unseen dynamics-goal combinations. The framework employs a dual-encoder architecture that maps observations into separate latent spaces for dynamics and goals, trained via a dual contrastive objective. Temporal alignment ensures the dynamics encoder learns goal-invariant structure, while the goal encoder captures dynamics-invariant features. Experiments on continuous control benchmarks demonstrate ConTraIRL's effectiveness in few-shot transfer to novel dynamics-goal pairings, improving sample efficiency and reward recovery compared to transfer IRL baselines.

inverse reinforcement learning · contrastive learning · latent representations · few-shot transfer · continuous control

MUSE: A Unified Agentic Harness for MLLMs

arXiv cs.AI · Jianglin Lu, Hailing Wang, Xu Ma, Qihua Dong · 2026-06-02

MUSE introduces a multimodal unified structured execution harness that enhances frozen multimodal large language models (MLLMs) without retraining, addressing harness-level shortcomings through composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair. Evaluated across diverse benchmarks including visual spatial planning, multimodal reasoning, and fine-grained visual discrimination, MUSE consistently improves performance over bare MLLMs, particularly on challenging instances. Analysis reveals that many MLLM failures stem from harness-level issues rather than model deficits, demonstrating the potential of agentic multimodal harnesses as an orthogonal approach to MLLM optimization.

multimodal large language models · structured parsing · verifier-guided repair · visual spatial planning · agentic harness

Exact equivariance, kept through training, buys zero-shot generalisation across the symmetry group

arXiv cs.AI · Hongbo Wang · 2026-06-02

The paper demonstrates that maintaining exact equivariance through training enables zero-shot generalization across symmetry groups in latent world models. Using an equivariant encoder E and predictor f, the author proves that the one-step prediction relMSE remains invariant under group actions, allowing dynamics learned on a restricted orientation slice to generalize across the entire group orbit. Empirical validation on CPU/MPS shows that equivariant models achieve 4.5-7.4× smaller errors compared to non-equivariant baselines, with invariant closed-loop control trajectories in 2D/SO(2) and statistically flat errors in 3D/SE(3). The approach maintains exact invariance across H-fold rollouts, contrasting with compounding errors in baselines.

equivariance · symmetry group · latent world model · zero-shot generalization · closed-loop control

How Quantization Changes Interpretable Features: A Sparse Autoencoder Analysis of Language Models

arXiv cs.AI · Evan Duan · 2026-06-02

The study demonstrates that quantization systematically degrades interpretable features in language models despite preserved task performance. Using sparse autoencoders (SAEs) as fixed measurement bases, the author evaluates feature survival via Pearson correlation across INT8 to INT4 quantization on Pythia-70M and Gemma-2-2B. Results show graded degradation (62.4% survival at INT6 for Pythia-70M, 51.3% for Gemma-2-2B), predictable from full-precision statistics (AUC 0.92–0.97), with task metrics often masking damage (e.g., INT7 improves perplexity while degrading 18.7% of Gemma-2-2B features). Quantization and magnitude pruning exhibit overlapping feature damage (Jaccard 0.79–0.86), challenging behavioral parity as a compression audit criterion.

quantization · sparse autoencoder · interpretable features · pearson correlation · perplexity
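
The two measurements named in the summary above, per-feature Pearson correlation between full-precision and quantized activations and Jaccard overlap between damaged feature sets, can be sketched as follows. The 0.9 survival threshold, the dictionary layout, and the function names are assumptions for illustration; the summary does not state the paper's cutoff.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length activation lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant feature carries no correlation signal
    return cov / math.sqrt(vx * vy)

def surviving_features(full_acts, quant_acts, threshold=0.9):
    """Feature ids whose activations stay correlated after compression.

    full_acts, quant_acts: dict of feature id -> activations on the same tokens.
    threshold is a hypothetical survival cutoff.
    """
    return {f for f in full_acts
            if pearson(full_acts[f], quant_acts[f]) >= threshold}

def jaccard(a, b):
    """Overlap of two feature-id sets, e.g. features damaged by two methods."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

Because the SAE basis is held fixed, the same feature ids can be compared across precision levels, and the damaged set under one compression method can be compared against another with `jaccard`.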

Patcher: Post-Hoc Patching of Backdoored Large Language Models

arXiv cs.AI · Anjun Gao, Yueyang Quan, Yufei Xia, Zhuqing Liu · 2026-06-02

Patcher introduces a post-hoc defense framework for repairing backdoored large language models using only a single reported failure case and model parameters. The method localizes backdoor triggers via response-conditioned gradient-based saliency scores and adaptive clustering, then patches the model through constrained fine-tuning with KL-divergence constraints to break trigger-response associations while preserving utility. Evaluations across multiple backdoor attack strategies demonstrate successful trigger localization and backdoor neutralization, maintaining model utility and robustness against adaptive attacks. This approach advances practical defenses against training-time attacks in deployed language models.

backdoor attacks · saliency scores · kl-divergence · adaptive clustering · constrained fine-tuning

Inducing Reasoning Primitives from Agent Traces

arXiv cs.AI · Zhihan Lei, Jiarui Yan, Joshua Momo, William W. Cohen · 2026-06-02

The paper introduces Reasoning Primitive Induction, a method for extracting reusable reasoning patterns from ReAct-style LLM agent traces. The approach clusters recurrent reasoning moves into typed pseudo-tools (natural-language docstrings interpreted by LLMs), which a standard ReAct loop composes at test time. Results show induced libraries outperform their source agents by +44pp on RuleArena NBA, +30pp on MuSR team allocation, and +22pp on NatPlan meeting planning, while matching expert decompositions and outperforming Chain-of-Thought and AWM at lower inference cost across five subtasks.

react-style agents · reasoning primitive induction · pseudo-tools · chain-of-thought · constraint-satisfaction planning

Pretraining Language Models on Historical Text

arXiv cs.AI · Xiaoxi Luo, Zachary Shinnick, Niclas Griesshaber, Yixuan Wang · 2026-06-02

The authors introduce TypewriterLM, a 7.24B-parameter historical language model trained exclusively on pre-1913 English text, addressing challenges in historical data quality, temporal leakage, and evaluation. They construct TypewriterCorpus, a 54B-token historical corpus with rigorous cleaning, and propose lexically grounded instruction tuning to ensure responses remain historically grounded, producing two datasets: History-LIMA and History-SelfInstruct. Evaluation is performed using History-Event, a benchmark suite assessing competence, temporal grounding, and leakage, with all resources released for future research.

historical language model · temporal leakage · lexically grounded instruction tuning · corpus construction · temporal consistency

Towards Compact Autonomous Driving Perception with Balanced Learning and Multi-sensor Fusion

arXiv cs.AI · Oskar Natan, Jun Miura · 2026-06-02

The paper introduces a compact multi-task learning model for autonomous driving perception, handling semantic segmentation, depth estimation, LiDAR segmentation, and bird's eye view projection in a single forward pass. It employs adaptive loss weighting to address task imbalance and utilizes multi-sensor fusion techniques for RGB cameras, DVS, and LiDAR inputs. The model achieves superior performance with fewer parameters, enabling faster inference and reduced GPU memory usage. Validation across three CARLA simulation datasets and the nuScenes-lidarseg dataset confirms consistent results. Code is publicly available.

multi-task learning · lidar segmentation · sensor fusion · adaptive loss weighting · autonomous driving

WISE-HAR: A Generalizable Ensemble Deep Learning Framework for WiFi-Based Human Activity Recognition

arXiv cs.AI · Maheen Arshad, Qindeel E Zahra, Muhammad Khuram Shahzad · 2026-06-02

The paper proposes WISE-HAR, an ensemble deep learning framework for WiFi-based human activity recognition that addresses performance variance, small dataset size, and generalization challenges. The method combines five CNN architectures (Deep CNN, Wide CNN, MobileNetV2, ResNet50V2, EfficientNetB0) with aggressive data augmentation (time-warping, frequency masking, noise addition) and cross-scenario/cross-antenna evaluation. The ensemble achieves 94.87% accuracy on Line-of-Sight scenarios (a 0.66% improvement over the best single model), with data augmentation boosting Random Forest performance by 35%. Cross-scenario tests show minimal accuracy drops (1.37-2.07%), demonstrating robust generalization.

wifi-based har · ensemble learning · data augmentation · cross-scenario evaluation · spectrogram dataset

Neuron Populations Exhibit Divergent Selectivity with Scale

arXiv cs.LG · Amil Dravid, Yasaman Bahri, Alexei A. Efros, Yossi Gandelsman · 2026-06-02

The study establishes scaling laws for neuron populations in neural networks, extending beyond macroscopic loss metrics to interpretable neuron-level structure. Analyzing Rosetta Neurons—a class with consistent activation patterns across independently trained models—in language models up to 30B parameters and vision models up to 5B parameters, the authors observe sublinear power-law scaling in their population size. Rosetta Neurons exhibit increasing selectivity and monosemanticity with scale, diverging from a growing non-Rosetta population. An analytical model explains these phenomena by balancing feature utility against neuron capacity. Additionally, Rosetta Neurons become more domain-specialized with scale, demonstrated through a targeted data-filtering case study.

rosetta neurons · sublinear scaling · neuron polarization · monosemanticity · domain specialization

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

arXiv cs.LG · Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang · 2026-06-02

Skill-RM introduces a unified reward modeling framework that reformulates reward computation as the execution of reusable Reward-Evaluation Skills, enabling dynamic orchestration of heterogeneous evaluation criteria. The method treats reward modeling as a structured agentic task, providing a consistent interface to aggregate diverse evidence types (rule-based verifiers, ground-truth references, etc.) tailored to input requirements. Experiments on reward benchmarks and downstream tasks (best-of-N selection, reinforcement learning) show that Skill-RM consistently outperforms traditional judge baselines, a gain attributed to dynamic evidence aggregation.

reward modeling · reinforcement learning · agentic task · dynamic orchestration · evaluation criteria

VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring

arXiv cs.LG · Hanjiang Hu, Yiyuan Pan, Jiaxing Li, Xusheng Luo · 2026-06-02

The Vision-Language Embodied Safety Agent (VLESA) is introduced as a framework for real-time safety intervention in human activities monitored through egocentric video. VLESA employs a goal-conditioned safety Q-filter trained via GRPO to evaluate actions based on inferred intent, alongside an intent-action prediction agent that jointly infers goals and predicts future actions. A novel dataset pairs egocentric frames with goal-conditioned safety annotations to support this approach. On the ASIMOV-2.0 benchmark, VLESA achieves superior intervention accuracy at the exact ground-truth frame and improves action safety by over 41 percentage points through goal-conditioned constrained decoding.

egocentric video · goal-conditioned safety · grpo · intent-action prediction · constrained decoding

MLSkip: Data Skipping for ML Filters via Lightweight Metadata

arXiv cs.LG · Mihail Stoian, Mark Gerarts, Pascal Ginter, Andreas Zimmerer · 2026-06-02

MLSkip introduces a data skipping technique for ML filters in databases, leveraging lightweight metadata to prune non-qualifying row groups. The method connects ML query languages and neural network verification, utilizing Parquet's min-max metadata and proposing a size-bounded 2D convex hull for enhanced pruning. Preliminary results on ReLU architectures with TPC-H and TPC-DS datasets show pruning effectiveness of 27.4% for filters with selectivity below 0.1%, increasing to 38.31% with the enhanced metadata. The approach achieves a 1.07× end-to-end speedup over PyTorch in DuckDB.

data skipping · ml filters · neural network verification · parquet metadata · convex hull
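
One way to see how min-max metadata can prune row groups for an ML filter is interval bound propagation through a ReLU network, a standard neural network verification technique. Whether MLSkip uses exactly this bounding method is an assumption here, as are the function names and the filter form `model(x) > threshold`.

```python
def affine_bounds(lo, hi, weights, bias):
    """Interval bounds of W x + b for every x in the box [lo, hi]."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        l = h = b
        for w, a, c in zip(row, lo, hi):
            if w >= 0:
                l += w * a
                h += w * c
            else:  # a negative weight swaps which end of the interval is extreme
                l += w * c
                h += w * a
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi

def relu_bounds(lo, hi):
    """ReLU is monotone, so it maps interval ends to interval ends."""
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]

def can_skip_row_group(col_min, col_max, layers, threshold):
    """True if no row in the group can satisfy model(x) > threshold.

    col_min, col_max: per-column min/max statistics of the row group.
    layers: list of (weights, bias); ReLU follows every layer except the last.
    """
    lo, hi = list(col_min), list(col_max)
    for i, (W, b) in enumerate(layers):
        lo, hi = affine_bounds(lo, hi, W, b)
        if i < len(layers) - 1:
            lo, hi = relu_bounds(lo, hi)
    return hi[0] <= threshold
```

The bound is sound but loose: a row group is skipped only when the upper bound rules out every row, so pruning never drops a qualifying row, though some non-qualifying groups are still scanned.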

SEAOTTER: Sensor Embedded Autoencoding with One-Time Transcode for Efficient Reconstruction

arXiv cs.LG · Dan Jacobellis, Neeraja J. Yadwadkar · 2026-06-02

The paper introduces SEAOTTER, a compression framework for cloud robotics that combines learned latent representations with JPEG compatibility. The method employs a Sensor Embedded Autoencoder with a One-Time Transcode for Efficient Reconstruction, featuring a learnable JPEG color and quantization transform to maintain accuracy across perception tasks. At 200:1 compression, SEAOTTER achieves 7× faster encoding, 3.5× faster decoding, and +8% ImageNet top-1 accuracy versus AVIF while preserving JPEG infrastructure compatibility.

autoencoder · cloud robotics · jpeg transcoding · learned compression · rate-distortion

Correcting Neural Operator Spectral Bias via Diffusion Posterior Sampling with Sparse Observations

arXiv cs.LG · Niccolò Perrone, Fanny Lehmann, Stefania Fresca, Filippo Gatti · 2026-06-02

FreqNO-DPS introduces a diffusion posterior sampling framework to correct spectral bias in neural operator surrogates for PDE solutions, leveraging sparse sensor measurements and a spectrally shaped guidance score. The method combines an unconditional score-based diffusion prior trained on high-fidelity simulations with diffusion posterior sampling conditioned on sparse observations, guided by a frozen neural operator. It avoids denoiser backpropagation and ensures frequency-dependent calibration through a closed-form guidance score. Evaluated on 3D elastic wavefield prediction at 5% and 2% sensor coverage, FreqNO-DPS achieves near-zero spectral bias across all frequency bands, outperforming both surrogate-only and sensor-only approaches. The framework requires only paired surrogate/reference data and is validated via a coherence diagnostic.

spectral bias · diffusion posterior sampling · neural operator · guidance score · frequency-dependent calibration

Quadratic integrate-and-fire neurons exhibit less fragmented loss landscapes and outperform leaky integrate-and-fire neurons in spike-based gradient descent

arXiv cs.LG · Carlo Wenig, Raoul-Martin Memmesheimer, Christian Klos · 2026-06-02

The study demonstrates that quadratic integrate-and-fire (QIF) neurons outperform leaky integrate-and-fire (LIF) neurons in spike-based gradient descent, as evidenced by optimized performance on the Spiking Heidelberg Digits dataset. Through hyperparameter tuning and landscape visualization, QIF networks exhibit less fragmented loss landscapes and more stable gradients compared to LIF networks, which suffer from disruptive spike discontinuities. Results indicate that QIF neurons mitigate erratic gradient behavior caused by temporal spike order changes, advocating their use over LIF neurons in gradient-based training of spiking neural networks.

spiking neural networks · quadratic integrate-and-fire · leaky integrate-and-fire · gradient descent · loss landscapes

Value-Aware Stochastic KV Cache Eviction for Reasoning Models

arXiv cs.LG · Ting-Yun Chang, Harvey Yiyun Fu, Deqing Fu, Chenghao Yang · 2026-06-02

The paper introduces Value-aware Stochastic KV Cache Eviction (VaSE), a training-free method for efficient KV cache management in reasoning models. VaSE builds on two observations: evicting large-magnitude value states causes catastrophic accuracy drops, and stochastic eviction diversity improves performance. The approach combines magnitude-aware protection with randomized eviction, achieving 4% higher accuracy than prior eviction methods at 4× KV cache compression across six reasoning tasks while maintaining compatibility with FlashAttention2.

kv cache eviction · reasoning models · value states · stochastic eviction · flashattention2
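
The combination of magnitude-aware protection and randomized eviction described above can be sketched as a two-stage selection. This is an illustration under stated assumptions, not the VaSE algorithm: the L2-norm criterion, the `protect_frac` split, uniform sampling among unprotected tokens, and the function name are all hypothetical.

```python
import math
import random

def select_kept_tokens(value_states, budget, protect_frac=0.25, seed=0):
    """Choose which cached token positions to keep under a fixed budget.

    value_states: one value vector per cached token.
    budget: number of tokens to keep after eviction.
    protect_frac: hypothetical share of the budget reserved for the
        largest-norm value states, which are never evicted.
    """
    rng = random.Random(seed)
    n = len(value_states)
    if budget >= n:
        return list(range(n))

    norms = [math.sqrt(sum(x * x for x in v)) for v in value_states]
    by_norm = sorted(range(n), key=lambda i: norms[i], reverse=True)

    n_protect = min(budget, int(budget * protect_frac))
    protected = by_norm[:n_protect]          # magnitude-aware protection
    rest = by_norm[n_protect:]
    sampled = rng.sample(rest, budget - n_protect)  # stochastic choice among the rest
    return sorted(protected + sampled)
```

Because the selection only needs value-state norms and returns token indices, a scheme of this shape does not depend on attention scores being materialized.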

DiffUNet^2: Bidirectional Prediction, Probabilistic Generation and Collaborative Visual Discovery for Scientific Data

arXiv cs.LG · Mengdi Chu, Jiaxin Yang, Angus G. Forbes, Nathan Debardeleben · 2026-06-02

DiffUNet^2 introduces a bidirectional conditional diffusion model for scientific temporal data analysis, enabling any-to-any generation across time and capturing distributions of plausible system evolutions. The framework integrates diffusion-based generative modeling with interactive visual analytics, supporting branching timeline exploration, user-guided state editing, and probability-space navigation. Evaluated on five datasets across diverse scientific domains, the model demonstrates predictive accuracy and ensemble quality. Collaborations with domain experts validate its effectiveness in practical workflows, transforming generative models into tools for hypothesis-driven scientific analysis.

diffusion-based generative modeling · bidirectional prediction · probability-space navigation · conditional diffusion model · temporal data analysis

Contrastive Neural Algorithmic Reasoning for Graph Coloring

arXiv cs.LG · Thien Le, Tianyu Zhao, Melanie Weber · 2026-06-02

The paper introduces a contrastive learning framework for neural algorithmic reasoning on graph k-coloring problems, addressing limitations of instance-specific GNN optimization. The method learns transferable node embeddings where same-color nodes align representationally while adjacent nodes diverge, theoretically analyzed to converge to a line-prototype geometry under unit-norm constraints. Experiments demonstrate generalization across graph sizes and distributions, achieving competitive conflict minimization compared to greedy baselines on synthetic and real-world graphs.

contrastive learning · graph coloring · neural algorithmic reasoning · gnn embeddings · line-prototype geometry

Forecasting Conceptual Diffusion in Science: The Case of Quantum Computing

arXiv cs.LG · Thomas Maillart, Thibaut Chataing, David Dosu, Paul Bagourd · 2026-06-02

The study develops predictive models for scientific concept diffusion using quantum computing as a benchmark case. LightGBM models analyze temporal concept co-occurrence networks from OpenAlex data, predicting endogenous reinforcement, exogenous diffusion, their ratio, and diffusion entropy. Results show exogenous diffusion is highly predictable (R² up to 0.78) via upstream heterogeneity and citation breadth, while endogenous reinforcement proves unpredictable in quantum computing. Replications across robotics, advanced materials, and neuro implants confirm field-specific predictability patterns, with neuro implants showing high endogenous predictability (R²=0.83). Entropy dynamics reveal conceptual frontier openings and paradigm shifts, demonstrating structural regularities in scientific diffusion.

lightgbm · concept co-occurrence · shap analyses · diffusion entropy · openalex

Beyond Gradient Descent: Adam for Analog Ising Machines

arXiv cs.LG · Stijn Van Vooren, Guy Van der Sande, Guy Verschaffelt · 2026-06-02

The paper proposes continuous-time Adam optimization for analog Ising machines, addressing limitations of gradient-descent-like dynamics. The authors derive continuous-time versions of momentum and Adam optimizers, suitable for analog systems, and introduce a simplified first-order approximation. On Max-Cut benchmarks, Adam-based dynamics reduce time-to-target and improve solution quality compared to gradient descent and momentum. The simplified approximation outperforms full Adam in continuous-time settings, while discrete-time Adam excels on harder weighted instances. Results suggest continuous-time Adam as a promising design principle for analog Ising machines.

ising machines · adam optimization · continuous-time dynamics · max-cut · momentum

MAdam: Metric-Aware Multi-Objective Adam

arXiv cs.LG · Fengbei Liu, Rachit Saluja, Sunwoo Kwak, Ruibo Wang · 2026-06-02

The paper introduces MAdam (Metric-Aware Multi-Objective Adam), a drop-in wrapper for Adam that resolves two systematic gaps in multi-objective optimization (MOO): weighting mismatch (Adam's second-moment denominator entangling preference vectors with gradient statistics) and geometric mismatch (Adam's adaptive metric distorting Euclidean geometry). MAdam preconditions reconciled directions by the preference-conditioned curvature of the scalarized objective, whitening the input so Adam's second moment collapses to identity. Experiments across multi-task learning, Pareto-front recovery, physics-informed neural networks, and medical imaging show MAdam consistently outperforms Adam for all solver families.

multi-objective optimization · adam optimizer · preconditioning · pareto front · gradient statistics

Denoise First, Orthogonalize Later: Understanding Momentum in Muon via Spectral Filtering

arXiv cs.LG · Xianliang Li, Zihan Zhang, Weiyang Liu, Han Bao · 2026-06-02

The work theoretically explains momentum's role in Muon by proving it acts as a spectral filter that suppresses gradient perturbations while preserving signal, thereby stabilizing orthogonalization. Under a signal-plus-perturbation gradient model, the analysis shows momentum enlarges the spectral gap between signal and noise, improving subspace alignment when applied before orthogonalization. Empirical validation includes large language model pretraining, demonstrating superior performance versus momentum-free or reversed-order variants. The findings provide a framework for analyzing momentum in matrix-based optimizers.

spectral filtering · momentum · orthogonalization · gradient perturbation · singular subspaces

CoralBay: A Self-Supervised CT Foundation Model

arXiv cs.LG · Ioannis Gatopoulos, Nicolas Känzig, Sebastian Otálora, Fei Tang · 2026-06-02

The authors introduce CoralBay, a self-supervised 3D CT foundation model that extends DINO with a hierarchical 3D Swin backbone and multi-scale feature distillation. The method captures volumetric spatial continuity and tissue properties through self-distillation of concatenated features, enabling efficient transfer learning across radiological tasks. CoralBay demonstrates consistent performance on diverse anatomical targets and contributes a standardized 3D radiology benchmark via the open-source eva framework.

self-supervised learning · 3d swin transformer · medical imaging · representation learning · hounsfield units

Attribution via Distributional Paths for Information Revelation

arXiv cs.LG · Kieran A. Murphy, Shameen Shrestha · 2026-06-02

The paper introduces Reveal-IG, a path-based attribution method that lifts feature importance computation from input space to a space of structured probe distributions. Unlike traditional methods like Integrated Gradients that traverse raw input values, Reveal-IG progressively reveals information about the input through distributional paths, attributing changes in the model's expected output. This approach retains completeness, avoids input-space path artifacts, and handles multiscale image probes and feature-wise uncertainty. Evaluations on ImageNet classification and tabular regression show Reveal-IG produces stable, signed attributions, outperforming on sign-sensitive metrics while remaining competitive otherwise.

feature attribution · integrated gradients · distributional paths · multiscale probes · tabular regression

Privacy-Robust Incrementality Measurement for Advertising Systems under Signal Loss

arXiv cs.LG · Prashant Shekhar, Caroline Howard · 2026-06-02

This paper introduces a decision-theoretic framework for privacy-robust incrementality measurement in advertising systems, addressing signal degradation from privacy-preserving mechanisms such as match-rate loss, aggregation-threshold suppression, and randomized reporting noise. The method formulates privacy-constrained measurement as a robust causal decision problem, projecting observation-compatible fibers onto the incrementality functional to yield certified, rejected, and unresolved decisions. Experiments on 2.0M Criteo Uplift and 64K Hillstrom email datasets demonstrate positive clean conversion lifts (0.00112 and 0.00495, respectively), with population certification surviving mild and severe degradation. The framework provides finite-sample certification, sample-complexity guarantees, and a minimax lower bound, establishing a sharp decision frontier for causal claims under signal loss.

incrementality measurement · signal degradation · robust causal decision · privacy-preserving mechanisms · decision frontier

Visual Instruction Tuning Aligns Modalities through Abstraction

arXiv cs.LG · Luis Palacios, Lorenzo Basile, Diego Doimo, Alberto Cazzaniga · 2026-06-02

This study demonstrates that visual instruction tuning embeds visual features directly into intermediate semantic layers of Large Language Models (LLMs), bypassing early unimodal processing layers. Through probing analyses and causal interventions, the authors identify these intermediate layers as the semantic core for vision-language processing, crucial for performance across multimodal benchmarks. Fine-tuning aligns visual features with pre-existing textual abstractions, extending and strengthening the abstraction phase. Restricting fine-tuning to intermediate layers preserves performance on vision-centric benchmarks while reducing training time, indicating multimodal integration as a localized phenomenon driven by LLM internal abstraction engines.

visual instruction tuning · large language model · multimodal benchmarks · abstraction phase · semantic core

Explainable Forecasting of Scientific Breakthroughs from Concept Network Dynamics

arXiv cs.LG · Thomas Maillart, Thibaut Chataing, Ntorina Antoni, David Dosu · 2026-06-02

The authors propose an explainable machine-learning method for forecasting scientific breakthroughs by analyzing structural precursors in OpenAlex concept networks. The approach employs a two-stage LightGBM model that predicts concept pair formation and, through an added regression stage, the intensity of future link weights, using 59 semantic and topological features. Validation across four domains achieves ROC-AUC scores between 0.954 and 0.967, outperforming prior models (~0.90 AUC), while maintaining explainability through auditable structural features. Feature attribution identifies Adamic-Adar similarity and degree-based Hadamard measures as key drivers, suggesting breakthroughs emerge in tightly connected sub-networks. The method is demonstrated in quantum annealing and AI-enabled quantum architectures, aligning with expert expectations.

concept networks · lightgbm · adamic-adar · hadamard measures · roc-auc
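
The two feature families named in this summary can be illustrated on a toy graph. The sketch below is not the authors' feature pipeline: it uses the standard Adamic-Adar index, and it assumes that a "degree-based Hadamard measure" means the product of the two endpoint degrees. The graph and the function names are hypothetical.

```python
import math

# Toy undirected concept graph (hypothetical data), stored as adjacency sets.
GRAPH = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

def adamic_adar(graph, u, v):
    """Standard Adamic-Adar index: sum of 1/log(degree) over common neighbors."""
    common = graph[u] & graph[v]
    # Common neighbors always have degree >= 2, so log(degree) > 0;
    # the guard only protects against malformed input.
    return sum(1.0 / math.log(len(graph[w])) for w in common if len(graph[w]) > 1)

def degree_product(graph, u, v):
    """Assumed reading of a degree-based Hadamard feature: deg(u) * deg(v)."""
    return len(graph[u]) * len(graph[v])

# Score the currently unlinked pair (b, d), which shares neighbors a and c.
pair_score = adamic_adar(GRAPH, "b", "d")
pair_degree_feature = degree_product(GRAPH, "b", "d")
```

A pair scores highly when it shares many low-degree neighbors, which matches the summary's reading that new links tend to form inside tightly connected sub-networks.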

Two-Action Apple Tasting with Switching Costs

arXiv cs.LG · Tommaso Cesari, Roberto Colomboni · 2026-06-02

The paper establishes tight regret bounds for the two-action apple-tasting problem with switching costs against an oblivious adversary. The problem is formulated as a decision between a revealing action (reward 0, reveals hidden value) and a blind action (reward x_t, no revelation), with a unit cost incurred for action switches. Contrary to prior conjectures of Ω(T^(2/3)) regret, the authors prove that the minimax expected regret scales as Θ(√T), specifically bounded between (1/(2√3))√T and 2√3√T. This result resolves a key obstruction in the classification of feedback graphs with switching costs.

regret bounds · switching costs · oblivious adversary · feedback graphs · minimax regret
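
To get a feel for the constants quoted above, the following sketch simply evaluates the two stated bounds for a given horizon. The function name and the example horizon are illustrative; only the two formulas come from the summary.

```python
import math

def apple_tasting_regret_bounds(horizon):
    """Evaluate the stated bounds (1/(2*sqrt(3)))*sqrt(T) and 2*sqrt(3)*sqrt(T)."""
    root_t = math.sqrt(horizon)
    lower = root_t / (2.0 * math.sqrt(3.0))
    upper = 2.0 * math.sqrt(3.0) * root_t
    return lower, upper

lower, upper = apple_tasting_regret_bounds(10_000)
# The gap between the bounds is a constant factor, independent of T.
gap_factor = upper / lower
# For comparison, the previously conjectured T^(2/3) growth at the same horizon.
conjectured = 10_000 ** (2.0 / 3.0)
```

The ratio of the upper to the lower bound is (2√3)² = 12 for every horizon, so the result pins down the minimax regret up to that constant.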

Text-attributed Graph Condensation via Text Selection and Attribute Matching

arXiv cs.LG · Haowei Han, Yuxiang Wang, Guojia Wan, Hao Wang · 2026-06-02

The paper introduces TAGSAM, a method for condensing Text-Attributed Graphs (TAGs) while preserving training accuracy. TAGSAM employs two key techniques: subgraph text selection, which merges representative text chunks by maximizing mutual information, and attribute similarity matching, which aligns stable similarity matrices to address high variance in graph topology condensation. Evaluated against six baselines, TAGSAM improves accuracy by 4.9% on average at the same compression ratio and maintains competitive performance even at 1% of the original TAG size.

text-attributed graph · graph neural network · mutual information · similarity matrices · condensation

Online Learning with Gradient-Variation Interval Regret

arXiv cs.LG · Yan-Feng Xie, Shuche Wang, Peng Zhao, Zhi-Hua Zhou · 2026-06-02

The paper introduces the first online learning algorithm with interval regret bounds scaling with gradient variation, a fundamental measure of cumulative gradient changes in online functions. The method employs a two-layer online ensemble structure, achieving theoretical guarantees that adapt to problem-dependent quantities while maintaining minimax-optimal rates. A Lipschitz- and smoothness-agnostic variant is proposed, enabled by a novel Lipschitz-adaptive meta algorithm, eliminating the need for hyperparameter tuning. The approach also provides versatile bounds for interval dynamic regret and offers the first piecewise characterization for stochastic extended adversarial optimization. Experimental results validate the theoretical findings.

interval regret · gradient variation · online ensemble · dynamic regret · stochastic optimization

Dynamic Short Convolutions Improve Transformers

arXiv cs.LG · Oliver Sieberling, Bharat Runwal, Rameswar Panda, Yoon Kim · 2026-06-02

The paper introduces dynamic short convolutions as a novel neural network primitive to enhance Transformer architectures. Unlike static convolutions, dynamic variants employ input-dependent filters, preserving locality bias while increasing expressivity. Experiments demonstrate improved performance on associative recall tasks and language modeling across scales (150M to 2B parameters), with dynamic convolutions outperforming standard Transformers and static convolutional variants. Scaling laws indicate a 1.33× compute advantage when applied to key, query, and value vectors, and a 1.60× advantage when added after every linear layer. Efficient training is enabled via custom Triton kernels, making dynamic convolutions a scalable and hardware-efficient enhancement for Transformer-based language models.

dynamic convolutions · transformers · associative recall · scaling laws · triton kernels

Finding Needles in the Haystack: Transductive Active Labeling in Ecology

arXiv cs.LG · Rupa Kurinchi-Vendhan, Sara Beery · 2026-06-02

The paper critiques inductive evaluation in ecological active learning, proposing transductive labeling to prioritize rare-class discovery over predictive performance. It introduces a metric quantifying sampling difficulty for 'needles in the haystack'—rare instances embedded among abundant classes—and demonstrates that human-in-the-loop considerations prevent premature stopping. A hybrid stopping criterion combining predictive performance with discovery metrics, inspired by ecological rarefaction curves, improves rare-class recovery by 18-32% in long-tailed datasets. Results show transductive objectives better align with ecological monitoring goals where discovery efficiency is paramount.

transductive learning · active labeling · long-tailed distribution · rarefaction curves · sampling difficulty

A Quantitative Approximation Framework for Flow Distillation in Diffusion Models

arXiv cs.LG · Weiguo Gao, Ming Li, Lei Shi, Hanfei Zhou · 2026-06-02

The authors present a quantitative approximation framework for flow distillation in diffusion models, analyzing few-step sampling as error propagation under composed flow maps. They focus on trajectory distillation for probability-flow ODEs, identifying local approximation error amplification in low-noise multimodal regimes. Using a Gaussian-mixture Ornstein-Uhlenbeck setting, they separate score field approximation (achieving L^p(p_t) guarantees with ReLU-ReQU networks) from dynamical stability control (via time-integrated Jacobian bounds). Theoretical results show deep residual compositions efficiently approximate long-horizon transport, with error controlled by stability amplification. Experiments demonstrate a 51.9% relative MSE reduction using their stability-balanced non-uniform time grid versus uniform grids.

diffusion distillation · probability-flow ode · ornstein-uhlenbeck process · flow map stability · residual compositions

Easy-to-Use Shielding for Reinforcement Learning

arXiv cs.LG · Stefan Pranger, Bettina Könighofer · 2026-06-02

The authors present tempestpy, a Python library integrating Tempest-based shield synthesis with the Gymnasium API to enable safe reinforcement learning (RL) without requiring formal methods expertise. The method extends Tempest's capabilities to stochastic multiplayer games while preserving formal safety guarantees, and introduces MiniGridSafe environments for transparent safety evaluation. Experiments demonstrate shielded versus unshielded RL performance across multiple environments, showing improved safety during exploration.

reinforcement learning · safe exploration · shield synthesis · stochastic games · gymnasium api

Limit Analysis of Graph Neural Networks with Wireless Conflict Graphs

arXiv cs.LG · Romina Garcia Camargo, Zhiyang Wang, Alejandro Ribeiro · 2026-06-02

The paper establishes theoretical transferability bounds for Graph Neural Networks (GNNs) operating on conflict graphs derived from sparse Random Geometric Graphs (RGGs) in wireless networks. By leveraging the structural similarity between RGGs and Deterministic Grid Graphs (DGGs), the authors derive performance guarantees when scaling GNN-based resource allocation policies. Empirical validation on link scheduling tasks demonstrates that the proposed GNN policies outperform existing benchmarks at scale, with analysis of how theoretical assumptions affect practical performance.

graph neural networks · random geometric graphs · transferability · wireless networks · link scheduling

Training-Free Multi-Concept LoRA Composition with Prompt-Aware Weighting

arXiv cs.LG · Georgios Tsoumplekas, Stella Bounareli, Vasileios Argyriou · 2026-06-02

The paper introduces a training-free method for multi-concept LoRA composition in text-to-image generation, addressing interference issues through prompt-aware weighting. Two techniques, W-Switch and W-Composite, dynamically weight LoRA modules based on the semantic influence of their trigger words in the input prompt. Evaluated on the ComposLoRA testbed, the approach improves visual quality (measured by a novel image-based similarity metric) and identity preservation, with validation from LLM-based assessment and user studies.

low-rank adaptation · multi-concept customization · prompt-aware weighting · text-to-image generation · identity preservation

Expert-Aware Causal Tracing of Factual Recall in Sparse MoE Language Models

arXiv cs.LG · Yuetian Lu, Ali Modarressi, Yihong Liu, Hinrich Schütze · 2026-06-02

The study introduces expert-aware causal tracing for sparse mixture-of-experts (MoE) language models, addressing how routed expert contributions mediate factual recall. Using CounterFact facts, the method corrupts subject-token embeddings with noise and tests clean MoE-block outputs or expert-level updates to restore true-vs-foil logit contrast. Results on Qwen3-30B-A3B-Base identify layer 44 and expert L44E069 as critical, while Mixtral-8x7B-v0.1 shows mid-layer signal without singleton expert localization, indicating model-dependent expert-level tracing.

mixture-of-experts · causal tracing · factual recall · sparse models · expert routing

Bregman meets Lévy: Stochastic mirror descent with heavy-tailed noise in continuous and discrete time

arXiv cs.LG · Pierre-Louis Cauvin, Panayotis Mertikopoulos · 2026-06-02

The paper establishes convergence guarantees for stochastic mirror descent (SMD) under heavy-tailed noise with infinite variance, introducing a continuous-time model called Lévy mirror flow (LMF) as a stochastic differential equation driven by centered Lévy noise with finite p-th moments (1 < p ≤ 2). LMF arises as the scaling limit of SMD in heavy-tailed regimes, exhibiting jump discontinuities of arbitrary magnitude when p < 2. Despite this singular behavior, LMF achieves ε-optimality in O(ε^(-p/(p-1))) time for convex objectives and O~(ε^(-1/(p-1))) time for strongly convex objectives. These results extend to discrete-time SMD variants under heavy-tailed noise.

stochastic mirror descent · lévy noise · heavy-tailed noise · convergence guarantees · stochastic differential equation
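
The two rates quoted above depend only on the moment index p, and a short calculation shows how quickly they degrade as p approaches 1. This is a sketch of the exponent arithmetic only; the function name is illustrative, and the logarithmic factors hidden by the O~ notation are ignored.

```python
def lmf_rate_exponents(p):
    """Return (a, b) such that the stated times scale as eps^(-a) and eps^(-b).

    a = p / (p - 1) is the convex case, b = 1 / (p - 1) the strongly convex case.
    """
    if not 1.0 < p <= 2.0:
        raise ValueError("the summary assumes 1 < p <= 2")
    return p / (p - 1.0), 1.0 / (p - 1.0)

# Finite variance (p = 2) versus a heavier tail (p = 1.5).
finite_variance = lmf_rate_exponents(2.0)
heavy_tail = lmf_rate_exponents(1.5)
```

At p = 2 the exponents are 2 and 1, the familiar finite-variance rates; at p = 1.5 they grow to 3 and 2.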

Neural Navigation Functions for Zero-Shot Generalizable Motion Planning

arXiv cs.LG · Benjamin D. Shaffer, Pei-An Hsieh, Brooks Kinch, Nathaniel Trask · 2026-06-02

The paper introduces Neural Navigation Functions (Neural-NF), a learned reactive navigation function enabling zero-shot transfer across unseen environments. The method combines data-driven adaptation with structured elliptic planning, where Laplacian-derived features map to local PDE coefficients, producing globally consistent value functions via boundary value problem solving. The approach guarantees collision-free policies, monotonic descent, and a global minimum at the goal by construction, with a linearly-solvable optimal-control interpretation. Experiments show Neural-NF achieves 5× better zero-shot transfer performance compared to learned planners predicting value functions directly.

neural navigation functions · zero-shot transfer · elliptic planner · boundary value problem · optimal-control

Resource-Constrained Adaptive Inference for Sequential Pricing

arXiv cs.LG · Ruicheng Ao, Jiashuo Jiang, David Simchi-Levi · 2026-06-02

The paper introduces a target-aware pricing controller for resource-constrained adaptive inference in sequential pricing, addressing support-exclusion failures where fixed-price inference becomes impossible. The method employs localized debiasing to produce studentized intervals, with interval width governed by a realized information clock, and certifies feasible target bands while logging continuous local densities. Results demonstrate calibration within certified bands and diagnostic abstention when resource states collapse target support, showing polynomial rates for polynomial target mass but insufficient shrinkage for pure $1/t$ targets without additional local movement.

adaptive inference · sequential pricing · support-exclusion · localized debiasing · studentized intervals

Conformal Language Modeling via Posterior Sampling

arXiv cs.LG · Nicolas Emmenegger, Theo X. Olausson, Armando Solar-Lezama, Chara Podimata · 2026-06-02

The paper proposes conformal language modeling via posterior sampling to address hallucinations in Large Language Models (LLMs). Unlike post-hoc filtering methods, it samples from approximations to an LLM posterior conditioned on calibrated, high-scoring regions, ensuring coherence and utility. The method includes a calibration procedure for conditional sequential generation, achieving target risk control. Empirical evaluations on open-ended biography generation and mathematical problem solving demonstrate comparable statistical guarantees to prior work but with higher downstream utility.

conformal prediction · posterior sampling · large language models · risk control · sequential generation

Compress then Merge: From Multiple LoRAs into One Low-Rank Adapter

arXiv cs.LG · Zhengbao He, Ruiqi Ding, Zhehao Huang, Ruikai Yang · 2026-06-02

The paper introduces Compress-then-Merge (CtM), a method for merging multiple Low-Rank Adapters (LoRAs) into a single rank-r LoRA while preserving low-rank structure. Unlike Merge-then-Compress pipelines, CtM enforces the rank-r constraint before merging by computing shared r-dimensional subspaces using LoRA weights, projecting adapters into these subspaces, and merging in the reduced space. This approach avoids post-hoc truncation and ensures efficient computation. Experiments demonstrate that CtM outperforms single-LoRA-output baselines and narrows the performance gap to full-parameter merging methods across various models and tasks.

low-rank adaptation · compress-then-merge · rank constraint · shared subspaces · truncated svd

When Graph Tokens Sink: A Mechanistic Analysis of Graph Language Models

arXiv cs.LG · Ding Zhang, Runtao Zhou, Wenqing Zheng, Rizal Fathony · 2026-06-02

This work mechanistically analyzes how Large Language Models (LLMs) process graph tokens in Graph Language Models (GLMs), revealing a decoupling between activation patterns and semantic utility. Through interventions (pruning, repositioning, swapping) on representative GLM architectures, the study identifies graph sink tokens—activation-level outliers with biased positional distributions—that dominate hidden states but do not correlate with graph information utilization. Results show these tokens neither attract significant attention weights nor contribute meaningfully to downstream tasks, indicating limitations in current graph-token construction and alignment mechanisms.

graph language models · activation saliency · attention sinks · graph tokens · hidden-state dimensions

Multi$^2$: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments

arXiv cs.LG · Sangeun Park, Minhae Kwon · 2026-06-02

The paper introduces Multi$^2$, a hierarchical multi-agent framework for LLM-based decision-making that addresses objective drift in long-horizon interactions. The method decomposes agent behavior into two levels: a high-level agent (System 1) generates context-aware sub-goals via supervised fine-tuning (SFT), while a low-level agent (System 2) executes atomic actions through offline-to-online reinforcement learning (RL). Evaluations across interactive environments show Multi$^2$ outperforms baselines in robustness and coordination, accompanied by the release of three hierarchical benchmark datasets for training and evaluation.

hierarchical multi-agent · objective drift · supervised fine-tuning · offline-to-online rl · long-horizon interaction

Speedrunning Tabular Foundation Model Pretraining

arXiv cs.LG · Salih Bora Ozturk, Alexander Pfefferle, Frank Hutter · 2026-06-02

The authors introduce a community speedrun benchmark for accelerating tabular foundation model pretraining, addressing the high computational cost bottleneck. Their method involves modifying a single-file training script for nanoTabPFN, with participants competing to reach a fixed downstream ROC AUC target on subsampled TabArena using one NVIDIA L40S GPU. The current record achieves an 81x speedup (0.92 minutes vs. 74.32-minute baseline) while using 22x fewer synthetic datasets, demonstrating the protocol's effectiveness for accumulating and verifying pretraining improvements.

tabular foundation models · pretraining speedup · roc auc · nanotabpfn · tabarena
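
The headline speedup follows directly from the two wall-clock figures in the summary. The check below uses only those two numbers; the variable names are illustrative.

```python
# Wall-clock minutes quoted in the summary.
baseline_minutes = 74.32
record_minutes = 0.92

# Speedup is the ratio of baseline time to record time.
speedup = baseline_minutes / record_minutes
rounded_speedup = round(speedup)

# The record run expressed in seconds, for a more intuitive scale.
record_seconds = record_minutes * 60.0
```

The ratio is roughly 80.8, which rounds to the reported 81x; the record run takes about 55 seconds.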

A 3D Isovist World Model -- Revealing a City's Unseen Geometry and Its Emergent Cross-City Signature

arXiv cs.LG · Xuhui Lin, Stephen Law, Nanjiang Chen, Kunyao Li · 2026-06-02

The paper introduces a 3D isovist world model for embodied navigation that predicts navigable geometry rather than appearance, encoding open space as spherical visibility-depth maps. The model uses a depth residual formulation with self-rollout scheduled sampling and a persistent latent bird's-eye-view map for cross-path consistency. Key findings show emergent cross-city signatures, with city identity linearly decodable from temporal latents (surpassing single-frame baselines), despite being trained on Manhattan and Paris without explicit city labels.

3d isovist · embodied navigation · visibility-depth map · self-rollout sampling · spatial signature

Set-Preserving Calibration from Conformal P-Values to E-Values

arXiv cs.LG · Nabil Alami, Jad Zakharia, Souhaib Ben Taieb · 2026-06-02

The paper introduces a novel P2E calibrator that converts conformal p-values into e-values while preserving prediction sets, addressing limitations of classical p-to-e calibrators in conformal prediction (CP). The method ensures set preservation and avoids conservatism, theoretically and empirically demonstrating efficiency gains over existing calibrators. Applications in cross-conformal prediction (CCP) and conformal aggregation (CA) show improved coverage guarantees (1-α) and efficiency compared to standard baselines, enabling better use of e-value merging and randomization techniques.

conformal prediction · e-values · p-values · uncertainty quantification · statistical efficiency
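
For context on the conservatism mentioned above, the sketch below shows one classical p-to-e calibrator, the power family e(p) = κ·p^(κ-1) with 0 < κ < 1, and not the paper's new P2E calibrator. It compares the p-value cutoff implied by rejecting when e ≥ 1/α with the usual cutoff p ≤ α. Function names and the chosen κ and α are illustrative.

```python
def power_calibrator(p, kappa=0.5):
    """Classical power calibrator: maps a p-value to an e-value kappa * p**(kappa - 1)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must lie in (0, 1)")
    return kappa * p ** (kappa - 1.0)

def implied_p_cutoff(alpha, kappa=0.5):
    """Largest p for which the calibrated e-value still reaches 1 / alpha."""
    # Solve kappa * p**(kappa - 1) >= 1 / alpha for p.
    return (kappa * alpha) ** (1.0 / (1.0 - kappa))

alpha = 0.1
e_at_one_percent = power_calibrator(0.01)   # e-value for p = 0.01
cutoff_after_calibration = implied_p_cutoff(alpha)
```

With κ = 0.5 and α = 0.1, the e-value test only rejects for p up to about 0.0025 instead of 0.1, so prediction sets built from the calibrated e-values are larger than the original conformal sets. This is the kind of loss a set-preserving calibrator is meant to avoid.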

Training a Predictive Coding Network on ImageNet using Equilibrium Propagation

arXiv cs.LG · Tugdual Kerjan, Rasmus Høier, Benjamin Scellier · 2026-06-02

The authors present the first successful training of a predictive coding network (PCN) on ImageNet using equilibrium propagation (EP), achieving 13.23% top-5 test error with a 10-layer convolutional architecture (VGG10). Their method combines centered EP with a novel equilibration scheme for PCNs, bridging computational neuroscience and machine learning. The result approaches the 12.2% backpropagation baseline, demonstrating scalability of both PCNs and EP to large-scale vision tasks. This suggests EP's limitations in physical systems may stem from implementation challenges rather than theoretical constraints.

predictive coding networks · equilibrium propagation · energy-based models · imagenet · vgg10

Few-Shot Prediction for Pulsar Noise with Long Short-Term Memory Network

arXiv cs.LG · Qingye Tang, Dechao An, Haoran Peng, Yuqi Ouyang · 2026-06-02

The paper proposes a few-shot learning method for predicting pulsar timing residuals in data-scarce scenarios using an LSTM network optimized via model-agnostic meta-learning. Particle swarm optimization automates hyperparameter tuning, enhancing prediction accuracy. Evaluated on IPTA's second data release, the method achieves robust generalization across high-frequency domains with only 10% fine-tuning data, while maintaining efficiency (16.86 MB memory, 18 ms per prediction).

few-shot learning · long short-term memory · model-agnostic meta-learning · particle swarm optimization · pulsar timing residuals

Analytical Evaluation of DCA Convergence Properties for Minimizing Prediction Functions of Gaussian RBF Support Vector Regression

arXiv cs.LG · Yohei Kakimoto, Yuto Omae, Hirotaka Takahashi · 2026-06-02

The article presents a framework for applying the difference of convex functions algorithm (DCA) to minimize prediction functions of Gaussian RBF support vector regression (RBF-SVR) models. By exploiting the analytical structure of the RBF kernel, the authors derive closed-form bounds for the strong convexity parameter (μ) and gradient Lipschitz constant (L), both dependent on the post-training dual-coefficient sum (Cα), kernel parameter (γ), and decomposition parameter (ρ). Numerical experiments on six benchmarks show that Cαρ primarily characterizes DCA convergence properties and initial-point dependence, enabling pre-training assessment via SVR hyperparameters (C, γ).

nonconvex optimization · support vector regression · gaussian rbf kernel · dc algorithm · convergence analysis

A Robust Optimization Approach to Sparse Principal Component Analysis

arXiv cs.LG · David Vävinggren, Francis Bach, André M. H. Teixeira, Dave Zachariah · 2026-06-02

Adversarial PCA (AdvPCA) introduces a robust optimization approach to sparse principal component analysis, addressing the limitations of dense representations in high-dimensional data. The method achieves sparsity by optimizing reconstruction against bounded worst-case latent space perturbations, admitting a closed-form reduction. This leads to an iterative algorithm alternating between adversarial linear regression-style updates for the sparse encoder and orthogonal updates for the decoder. Theoretical characterization enables data-adaptive parameterization, facilitating effective out-of-the-box performance. Empirical validation on synthetic and real-world genomics data demonstrates the method's efficacy.

sparse principal component analysis · robust optimization · adversarial pca · latent space perturbations · data-adaptive parameterization

How Many Trees in a Random Forest? A Revisited Approach with Plateau Search and Optuna Integration

arXiv cs.LG · Vadim Porvatov, Andrey Dukhovny, Andrey Lange · 2026-06-02

The paper introduces a triplet-based plateau-search algorithm for optimizing the number of trees in Random Forest, addressing limitations of standard hyperparameter optimization (HPO) methods like Tree-structured Parzen Estimator (TPE) and Hyperband. The method removes the ensemble size from direct TPE search, instead adaptively tracking a near-minimal sufficient size by monitoring relative changes in out-of-bag (OOB) scores across a triplet of forest sizes. Theoretical analysis links the OOB-score criterion to the gap between current and limiting scores, with asymptotic variance estimates provided. Experiments reveal deviations from common heuristics: smaller sizes suffice for classical benchmarks, while larger ensembles are needed for high-dimensional bioinformatics datasets like Arcene and Dorothea.

random forest · hyperparameter optimization · out-of-bag score · plateau search · ensemble size
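
The idea of tracking a plateau across a triplet of forest sizes can be sketched as follows. This is an illustrative reading of the summary, not the authors' algorithm: the geometric triplet (n, 2n, 4n), the tolerance, and the synthetic score curve `toy_oob_score` are all assumptions made for the example. In practice the score function would fit a forest and return its out-of-bag score.

```python
def plateau_search(score_at, start=16, factor=2, tol=1e-3, max_size=4096):
    """Return the smallest n whose triplet (n, f*n, f*f*n) shows a flat score curve.

    The curve counts as flat when both consecutive relative score changes
    inside the triplet are below tol.
    """
    n = start
    while n * factor * factor <= max_size:
        s1, s2, s3 = (score_at(n * factor ** k) for k in range(3))
        rel_12 = abs(s2 - s1) / abs(s2)
        rel_23 = abs(s3 - s2) / abs(s3)
        if rel_12 < tol and rel_23 < tol:
            return n
        n *= factor  # slide the triplet towards larger forests
    return max_size  # no plateau found within the size budget

def toy_oob_score(n_trees):
    """Synthetic stand-in for an OOB score that saturates at 0.90."""
    return 0.90 - 0.5 / n_trees

strict_size = plateau_search(toy_oob_score, tol=1e-3)
loose_size = plateau_search(toy_oob_score, tol=1e-2)
```

A looser tolerance stops at a smaller forest, which is the trade-off such a criterion exposes between accuracy and ensemble size.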

Demystifying Pipeline Parallelism: First Theory for PipeDream

arXiv cs.LG · Ivan Ilin, Peter Richtárik · 2026-06-02

The paper presents three contributions to pipeline parallelism theory, focusing on PipeDream (PD). First, it introduces Randomized PipeDream (RPD), providing the first nonconvex convergence guarantee for PD-style methods via a stale block-SGD abstraction. Second, it quantifies PD's steady-state delay as scaling quadratically with stages (S) and proves the stale-read term scales as Θ(S⁴/K). Third, it compares PD with LocalSGD, showing PD outperforms on quadratic objectives and small language-modeling tasks, while LocalSGD excels in logistic regression as stages increase. Experiments validate these theoretical findings.

pipeline parallelism · stale block-sgd · nonconvex convergence · stale-read delay · localsgd

HiSE: A Lightweight Hierarchical Semantic Explainer for Heterogeneous Graph Neural Networks

arXiv cs.LG · Zongrui Li, Yuhang Zhao, Ying Zhao, Yuanzhao Guo · 2026-06-02

HiSE introduces a lightweight hierarchical semantic explainer for heterogeneous graph neural networks (HGNNs), addressing interpretability challenges in high-stakes applications. The method employs local LASSO-based surrogate models for sparse feature representations at the semantic level and adaptively characterizes cross-semantic contributions via KL divergence. Evaluations show HiSE outperforms existing methods in fidelity, robustness, and cross-semantic explanation capability, while maintaining low computational overhead for large-scale heterogeneous graphs.

heterogeneous graph neural networks · semantic hierarchy · lasso · kl divergence · interpretability

Low-Frequency Shortcuts in Texture-Driven Visual Learning

arXiv cs.LG · Utku Şirin, Cathy Hou, David Alvarez-Melis, Stratos Idreos · 2026-06-02

The paper analyzes shortcut learning in texture-driven visual domains, revealing a prevalence of low-frequency shortcuts (LFCs) where models rely on skewed spectral behavior rather than high-frequency details. By pruning LFCs during training and testing, the method achieves up to 8% improvement in in-distribution accuracy and enhances robustness to low-frequency corruptions by 40%, though with a trade-off for high-frequency corruptions. Results demonstrate that balanced spectral behavior improves generalization, while OOD performance depends on the interaction between low- and high-frequency feature dependencies.

shortcut learning · texture-driven domains · low-frequency shortcuts · spectral behavior · out-of-distribution robustness

Topology-Aware Gaussian Graph Repair for Robust Graph Neural Networks

arXiv cs.LG · Anubha Goel, Juho Kanniainen · 2026-06-02

The paper proposes Topology-Aware Gaussian Repair (TAGR), a lightweight graph repair framework for robust message passing in graph neural networks. TAGR constructs a sparse feature-neighborhood graph using an adaptive Gaussian kernel and combines it with a topology-aware residual correction of the observed graph. The Gaussian repair introduces auxiliary edges between feature-similar nodes, while residual correction preserves and reweights the original topology based on local feature and structural consistency. Experiments on citation networks demonstrate that TAGR improves GNN robustness under noisy-edge and missing-edge settings, with Gaussian repair providing the main robustness gain and residual correction enhancing stability for incomplete graphs.

graph neural networks · topology-aware · gaussian repair · residual correction · sparse graph
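
The feature-neighborhood graph in the TAGR summary can be illustrated with a minimal sketch. It links each node to its k nearest neighbors in feature space and weights the links with a Gaussian kernel. The bandwidth rule (distance to the k-th nearest neighbor) and the function name `adaptive_gaussian_knn` are assumptions for illustration, not details taken from the paper, and the residual correction step is not covered.

```python
import math


def adaptive_gaussian_knn(features, k):
    """Return {i: [(j, weight), ...]} linking node i to its k nearest
    neighbours in feature space, weighted by a Gaussian kernel."""
    n = len(features)
    dist = [[math.dist(features[i], features[j]) for j in range(n)]
            for i in range(n)]
    # Neighbours of each node, closest first (ties keep index order).
    order = [sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
             for i in range(n)]
    # Assumed bandwidth rule: sigma_i = distance to the k-th nearest neighbour.
    sigma = [max(dist[i][order[i][k - 1]], 1e-12) for i in range(n)]
    graph = {}
    for i in range(n):
        graph[i] = [
            (j, math.exp(-dist[i][j] ** 2 / (sigma[i] * sigma[j])))
            for j in order[i][:k]
        ]
    return graph
```

Because each node keeps only k auxiliary edges, the result stays sparse; an isolated node still gets neighbors, but with a much smaller weight.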

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

arXiv cs.LG · Lorenz K. Muller, Philippe Bich, Chiara Boretti, Hyun-Min Chang · 2026-06-02

KVarN introduces a calibration-free KV-cache quantization method for autoregressive decoding in large language models, addressing error accumulation from token-scale inaccuracies. The technique employs Hadamard rotation and dual-scaling variance normalization across K and V matrices, mitigating quantization errors that propagate across timesteps. Evaluated on MATH500, AIME24, and HumanEval, KVarN achieves state-of-the-art 2-bit quantization performance, with a vLLM implementation available.

kv-cache · quantization · autoregressive decoding · hadamard rotation · variance normalization
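
As a rough illustration of the Hadamard-rotation idea mentioned in the KVarN summary, the sketch below rotates a vector with an orthonormal Walsh-Hadamard transform, applies a plain uniform 2-bit quantizer, and rotates back. This is not KVarN itself: the dual-scaling variance normalization is omitted, and the min-max quantizer and all function names are assumptions chosen for illustration.

```python
import math


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]


def quantize_2bit(x):
    """Uniform min-max quantizer with 4 levels (a hypothetical stand-in)."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return list(x)
    step = (hi - lo) / 3.0
    return [lo + round((v - lo) / step) * step for v in x]


def rotate_quantize_restore(x):
    """Quantize in the rotated basis, then undo the rotation.
    The normalized transform is its own inverse."""
    return fwht(quantize_2bit(fwht(x)))
```

The rotation spreads a single large entry across all coordinates, which is the usual motivation for rotating before low-bit quantization.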

PerchRL: Vision-Based Agile Perching on Inclined Platforms under Rapid and Irregular Motion

arXiv cs.LG · Zihong Lu, Zongzhuo Liu, Huaxu Li, Jinqiang Cui · 2026-06-02

PerchRL introduces a reinforcement learning framework for vision-based agile perching of quadrotors on rapidly moving inclined platforms, addressing challenges from limited field of view. The method employs a two-stage learning strategy: state-based pre-training followed by vision-based fine-tuning, enhanced by randomized platform trajectories and temporal augmentation for generalization. Vision-based fine-tuning incorporates visibility-aware state augmentation and active perception rewards to handle intermittent visual loss. Experiments in simulation and real-world settings demonstrate PerchRL's feasibility, stability, real-time performance, and adaptability across distinct quadrotor platforms. The source code will be publicly released.

reinforcement learning · vision-based perching · quadrotors · temporal augmentation · active perception

Flicker-DDPM: Accelerating Denoising Diffusion via 1/f Colored Noise Injection

arXiv cs.LG · Kexiang Mao · 2026-06-02

Flicker-DDPM introduces a novel denoising diffusion probabilistic model that incorporates 1/f colored noise inspired by self-organized criticality, replacing isotropic white noise in the forward process. The method employs a spatial correlation kernel σ(d) = (d + 1)^{-η} to generate power-law spectra noise, enabling adaptation to datasets with diverse spectral characteristics. Theoretical analysis demonstrates that spectrally matched colored noise linearizes the reverse trajectory, explaining sampling acceleration. On CIFAR-10, Flicker-DDPM achieves comparable or superior generation quality to standard DDPMs while reducing sampling steps by 3.33×, with minimal computational overhead per step.

denoising diffusion probabilistic models · self-organized criticality · power-law spectra · spatial correlation kernel · sampling acceleration
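
To make the kernel σ(d) = (d + 1)^{-η} from the Flicker-DDPM summary concrete, here is a small 1-D sketch. It assumes the kernel is used as mixing weights over white Gaussian noise, with each row rescaled to unit L2 norm so every output sample keeps unit variance. How the paper actually applies the kernel is not stated in the summary, so `mixing_weights` and `colored_noise` are hypothetical helpers.

```python
import math
import random


def kernel(d, eta):
    """Spatial correlation kernel sigma(d) = (d + 1)^(-eta)."""
    return (d + 1.0) ** (-eta)


def mixing_weights(n, eta):
    """Row i holds kernel weights over |i - j|, scaled to unit L2 norm."""
    rows = []
    for i in range(n):
        w = [kernel(abs(i - j), eta) for j in range(n)]
        norm = math.sqrt(sum(v * v for v in w))
        rows.append([v / norm for v in w])
    return rows


def colored_noise(n, eta, rng):
    """Mix i.i.d. standard normal samples into spatially correlated noise."""
    white = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(w * z for w, z in zip(row, white))
            for row in mixing_weights(n, eta)]
```

A call such as `colored_noise(8, 1.0, random.Random(0))` returns eight correlated samples. Larger η concentrates each row on its own position, so the output approaches white noise.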

Link Prediction or Perdition: the Seeds of Instability in Knowledge Graph Embeddings

arXiv cs.LG · Guillaume Méroué, Fabien Gandon, Pierre Monnin · 2026-06-02

This work systematically analyzes the stability of knowledge graph embedding models (KGEMs) for link prediction, revealing critical limitations in current evaluation protocols. Through isolation of stochastic factors—initialization, triple ordering, negative sampling, dropout, and hardware—the study demonstrates that each factor independently induces comparable instability in predictions and embedding spaces. Results show that high-performance models exhibit divergent triple-level predictions and variable embeddings, with no guarantee that better MRR configurations enhance stability. Voting marginally improves stability, underscoring concerns about KGEM reliability for knowledge graph completion.

knowledge graph embeddings · link prediction · stability analysis · stochastic factors · evaluation protocols

Mitigating False Credit Propagation: Probabilistic Graphical Reward Aggregation for Rubric-Based Reinforcement Learning

arXiv cs.LG · Can Lv, Mingju Chen, Heng Chang, Shiji Zhou · 2026-06-02

The paper introduces GEAR (Graphical Event Aggregation for Rubric rewards), a probabilistic graphical framework addressing False Credit Propagation (FCP) in rubric-based reinforcement learning. GEAR models criterion outcomes as latent Bernoulli events in a typed rubric graph, propagates soft suppression from unsupported parent events, and computes normalized expected utility via linear-time aggregation. Evaluated on HealthBench, WritingBench, and PLawBench with two policy backbones, GEAR achieves up to 15.5% relative improvement over flat aggregation and reduces FCP leakage by 96.5% while preserving licensed utility better than deterministic gating.

false credit propagation · rubric-based reinforcement learning · probabilistic graphical model · bernoulli event · reward aggregation

Speech Emotion Recognition using Attention-based LSTM-Network with Residual Connection

arXiv cs.LG · Daniil Krasnoproshin, Maxim Vashkevich · 2026-06-02

The paper proposes ResLSTM-SA, a lightweight LSTM-based architecture with residual connections and soft attention for speech emotion recognition. The model combines residual skip connections with attention mechanisms to improve gradient flow and feature extraction while maintaining low parameter counts (46.8k). Evaluated on RAVDESS under speaker-independent conditions, ResLSTM-SA-h64 achieves 0.6517 unweighted average recall, outperforming conventional attention-LSTMs and hybrid CNN-LSTM baselines with three orders of magnitude fewer parameters than self-supervised alternatives.

speech emotion recognition · residual connection · attention mechanism · lstm · lightweight architecture
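
The unweighted average recall (UAR) quoted for ResLSTM-SA is the mean of per-class recalls, so every emotion class counts equally regardless of how many samples it has. A minimal sketch of the metric follows; the function name is chosen here for illustration.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)
```

For labels `[0, 0, 0, 1]` and predictions `[0, 0, 1, 1]`, the per-class recalls are 2/3 and 1, so the UAR is 5/6 while plain accuracy is 3/4.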

The Impact of Temporal Granularity on Socio-Demographic Inference from Household Load Profiles

arXiv cs.LG · Dejan Radovanovic, Maximilian Schirl, Andreas Unterweger, Günther Eibl · 2026-06-02

This paper investigates how temporal granularity affects socio-demographic inference from smart meter data, analyzing resolutions from 15 minutes to 7 days across 1,589 households. The authors introduce an evaluation framework where classifiers (XGBoost outperforming alternatives) trained on year-round data must generalize to arbitrary weeks. Results reveal two performance plateaus (15min-1hr and 1-7 days), competitive performance of handcrafted/tsfresh features versus CNN embeddings, and differing feature importance for static (dwelling size) versus dynamic (swimming pool) attributes. The study provides insights into privacy-utility trade-offs in smart metering.

temporal granularity · socio-demographic inference · smart meter data · feature importance · xgboost

APIC: Amortized Physics-Informed Calibration using Neural Processes

arXiv cs.LG · Aishwarya Venkataramanan, Sai Karthikeya Vemuri, Joachim Denzler · 2026-06-02

The authors propose Amortized Physics-Informed Calibration (APIC), a scalable Bayesian framework extending the Kennedy-O'Hagan approach for physics model calibration. APIC employs Neural Processes with a two-branch latent architecture to disentangle instance-specific physical parameters from shared structural discrepancies, enabling amortized inference across system realizations. Experiments on damped spring oscillators, Lotka-Volterra systems, and advection-diffusion PDEs demonstrate improved parameter recovery and consistent discrepancy identification compared to baseline methods, while maintaining uncertainty quantification.

physics-informed calibration · neural processes · bayesian inference · discrepancy modeling · amortized inference

RogueMerge: Robust and Unified Attacks against LLM Model Merging

arXiv cs.LG · Jinghuai Zhang, Yetian He, Kunlin Cai, Han Zhao · 2026-06-02

RogueMerge introduces a unified framework for robust attacks against LLM model merging, addressing three key challenges: autoregressive decoding, unknown merging configurations, and prompt generalization. The method replaces static arithmetic with joint optimization, formulates attack injection as a stochastic min-max problem solved via meta-learning, and employs distributionally robust optimization with a tractable Taylor approximation. Evaluated across four threats, six merging algorithms, and 170+ merged LLMs, RogueMerge outperforms existing attacks, maintains stability across merging settings, and resists standard defenses.

model merging · autoregressive decoding · stochastic min-max · distributionally robust optimization · task vectors

IdEst: Assessing Self-Supervised Learning Representations via Intrinsic Dimension

arXiv cs.LG · Julie Mordacq, Vicky Kalogeiton, Steve Oudot · 2026-06-02

The paper introduces IdEst, a method for evaluating self-supervised learning (SSL) representations by estimating their intrinsic dimension (ID) using the Minimum Spanning Tree dimension estimator ($\mathrm{dim}_\mathrm{MST}$). This approach addresses limitations of linear probing by providing a computationally efficient, hyperparameter-robust alternative that reveals geometric properties of the representation space. Experiments across diverse datasets, architectures, and SSL objectives demonstrate strong correlation between IdEst estimates and downstream linear probe performance, while enabling efficient hyperparameter selection. The results establish intrinsic dimensionality as a principled geometric proxy for SSL representation quality.

self-supervised learning · intrinsic dimension · minimum spanning tree · representation evaluation · geometric structure

Lingo_Research_Group at SemEval-2026 Task 9: Evaluating Prompt Variants for Polarization Detection

arXiv cs.LG · Pritam Kadasi, Anuj Tiwari, Mayank Singh · 2026-06-02

The paper presents a systematic evaluation of prompt variants for polarization detection in SemEval-2026 Task 9, covering binary detection, type classification, and manifestation identification. Twelve prompt designs were tested on aya-101 and Gemma3-27B, with the latter selected for final submission. Results show macro F1-scores of 0.762 (Subtask 1), 0.587 (Subtask 2), and 0.444 (Subtask 3), with accuracies of 0.819, 0.678, and 0.498 respectively across 22 languages. Prompt-based methods excel in coarse-grained detection but struggle with fine-grained sociolinguistic classification.

polarization detection · prompt engineering · multilingual classification · semantic evaluation · in-context learning

Tailoring Strictly Proper Scoring Rules for Downstream Tasks: An Application to Causal Inference

arXiv cs.LG · Roman Plaud, Alexandre Perez-Lebel, Antoine Saillenfest, Thomas Bonald · 2026-06-02

The authors propose a framework for deriving task-specific strictly proper scoring rules by matching local curvature of downstream error metrics, addressing the disconnect between probabilistic training objectives and estimation tasks. Focusing on Inverse Probability Weighting (IPW) for Average Treatment Effect (ATE) estimation, they derive a closed-form loss function and canonical probability mapping compatible with standard models. Evaluations on causal inference benchmarks show their method outperforms likelihood-based and covariate-balancing approaches in reducing bias and variance.

strictly proper scoring rules · inverse probability weighting · average treatment effect · local curvature matching · causal inference
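
For context on the downstream task in the scoring-rule summary, the standard IPW estimate of the ATE reweights outcomes by the inverse of the estimated propensity score. The sketch below shows only that textbook estimator, not the paper's tailored loss; the propensities are taken as given inputs, whereas in practice they would come from a fitted model.

```python
def ipw_ate(outcomes, treatments, propensities):
    """Inverse probability weighting estimate of the average treatment effect.

    outcomes:     observed outcome y_i
    treatments:   1 if unit i was treated, else 0
    propensities: estimated P(treated | covariates), strictly between 0 and 1
    """
    n = len(outcomes)
    treated = sum(t * y / e
                  for y, t, e in zip(outcomes, treatments, propensities)) / n
    control = sum((1 - t) * y / (1 - e)
                  for y, t, e in zip(outcomes, treatments, propensities)) / n
    return treated - control
```

Propensities close to 0 or 1 inflate the weights, which is one reason the quality of the propensity model matters for the bias and variance of the estimate.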

Validation-Gated Multi-Agent Governance for Online Adaptation of Thermal-Hydraulic Surrogate Models under Operating-Regime Shift

arXiv cs.LG · Doyeong Lim, Seungyoon Lee, In Cheol Bang · 2026-06-02

A validation-gated multi-agent framework is proposed for online adaptation of thermal-hydraulic surrogate models under operating-regime shifts, addressing condition-locking in offline-selected models. The framework employs role-separated agents (Monitor, Diagnosis, Adaptation, Safety-Auditor, Orchestrator) to diagnose errors, prioritize model families, and review promotions, with deterministic champion-challenger gates controlling model replacement. Seven surrogate families were evaluated via blocked three-fold cross-validation, selecting a temporal Fourier neural operator as the initial champion for 60-s-history-to-10-s-trajectory forecasting. The MA-Full mode achieved the lowest mean error (5.72 MAE) and 35.8% warning-exceedance ratio, a 19.0% improvement over static deployment. Validated promotions to Transformer and graph neural network models demonstrate auditable surrogate evolution with retained deployment authority.

thermal-hydraulic surrogate · champion-challenger gates · temporal fourier neural operator · operating-regime shift · validation-gated adaptation

A Graph Foundation Model with Spectral Parsing and Prototype-Guided Spatial Propagation

arXiv cs.LG · Ankang Yang, Jitao Zhao, Dongxiao He, Liang Yang · 2026-06-02

The paper proposes SPG, a graph foundation model addressing cross-graph transfer challenges through spectral parsing and prototype-guided spatial propagation. SPG employs learnable Chebyshev filters to decompose node features into frequency-specific components, aligning propagation behaviors with spectral characteristics. It constructs a Gromov-Wasserstein prototype geometry to distill transferable structural relations beyond predefined substructures. Experiments show improved cross-domain generalization, demonstrating the model's ability to handle diverse graph structures and signal frequencies. The approach outperforms existing methods by disentangling spectral components and leveraging prototype-based relational distillation.

graph foundation model · spectral parsing · chebyshev filters · gromov-wasserstein · cross-domain generalization
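
The Chebyshev filters named in the SPG summary build on the recurrence T_0 = I, T_1 = L, T_k = 2 L T_{k-1} - T_{k-2}. The sketch below applies a fixed-coefficient filter to a node signal. It assumes `l_scaled` is a graph operator already rescaled so its spectrum lies in [-1, 1]; in SPG the coefficients are learned, whereas here they are plain inputs.

```python
def matvec(matrix, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def chebyshev_filter(l_scaled, x, coeffs):
    """Return sum_k coeffs[k] * T_k(l_scaled) @ x via the three-term recurrence."""
    t_prev = list(x)                      # T_0 x
    out = [coeffs[0] * v for v in t_prev]
    if len(coeffs) == 1:
        return out
    t_curr = matvec(l_scaled, x)          # T_1 x
    out = [o + coeffs[1] * v for o, v in zip(out, t_curr)]
    for c in coeffs[2:]:
        # T_k x = 2 L (T_{k-1} x) - T_{k-2} x
        t_next = [2.0 * a - b
                  for a, b in zip(matvec(l_scaled, t_curr), t_prev)]
        out = [o + c * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out
```

Each additional coefficient reaches one hop further in the graph, so the filter order controls how local the response is.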

From Script to Semantics: Prompting Strategies for African NLI

arXiv cs.LG · Anuj Tiwari, Terry Oko-odion, Hannah Nwokocha · 2026-06-02

This study evaluates five prompting strategies for Natural Language Inference (NLI) in low-resource African languages (Swahili, Yoruba, Hausa) using the AfriXNLI benchmark. Focusing on pure prompting without fine-tuning, the research tests Baseline, Script-Aware, Language-Specific, Contrastive, and Native-Label Self-Translation (NL-STP) strategies on Llama3.2-3B and Gemma3-4B models. Results show Contrastive prompting outperforms others, improving class balance and accuracy while avoiding neutral class collapse. Prompt design proves critical, outperforming few-shot and Chain-of-Thought baselines in multilingual NLI tasks.

nli · prompting · low-resource · african languages · contrastive

Combining Statistical Features and Deep Encodings for Rehearsal-Based Class-Incremental Time Series Classification

arXiv cs.LG · Pablo García-Santaclara, Bruno Fernández-Castro, Rebeca Pilar Díaz-Redondo · 2026-06-02

The paper introduces a dual-stream feature extraction pipeline for class-incremental continual learning in multivariate time series classification, combining deep temporal embeddings from a pre-trained foundation model with statistical features. The method addresses challenges of temporal data structure and catastrophic forgetting. Evaluated on five benchmark datasets, it achieves competitive accuracy and low forgetting rates across all configurations.

class-incremental · continual learning · multivariate time series · dual-stream · foundation model

A Geometric Lens on Physics-Aligned Data Compression

arXiv cs.LG · Aleix Segui, Wesley Armour · 2026-06-02

The paper develops a geometric framework to analyze rate-distortion tradeoffs in physics-aligned data compression, where physics-informed losses often improve observable preservation at the cost of standard reconstruction fidelity. The authors show that this tradeoff arises from misalignment between latent-space sensitivity directions induced by the entropy model, physical observable, and distortion metric, formalized through a local tangent-space rate-distortion law. They propose an alignment diagnostic based on eigenspace overlap, validated experimentally across scientific domains to correlate with observed tradeoffs between data and physics-space metrics.

physics-informed compression · rate-distortion tradeoff · latent-space sensitivity · anisotropic error allocation · eigenspace alignment

Let There Be Light: Reflection, Refraction and Scattering for Neural Operators

arXiv cs.LG · Keke Wu, Yixuan Zhang, Jingrun Chen · 2026-06-02

The paper introduces Light-inspired neural operator (LiNO), a novel architecture for learning mappings between function spaces in parametric PDEs, inspired by light transport mechanisms. LiNO decomposes latent evolution into three physically interpretable components: reflection and refraction for adaptive pointwise feature transformations, and scattering for nonlocal propagation. The authors reformulate scattering as a normalized pairwise kernel with positional bias, then develop an efficient linear-complexity variant using global propagation and local diffusion. The design targets the trade-offs among physical interpretability, nonlocality, mesh scalability, and computational cost found in existing neural operators.

neural operators · light transport · parametric pdes · nonlocal propagation · mesh scalability

Hierarchies of Calibration: Classification meets Regression

arXiv cs.LG · Johannes Resin, Lu Yang, Tilmann Gneiting · 2026-06-02

The paper systematically reviews and extends calibration concepts for probabilistic predictions across classification and regression tasks, introducing modal calibration for nominal outcomes and distinguishing full, partial, and average calibration. It analyzes hierarchical relations between calibration notions for various data types (real-valued, continuous, count, nominal, binary) and demonstrates the logical independence of double probability integral transform (PIT) calibration from existing discrete-outcome calibration concepts. The work provides algorithmic tools for constructing examples and counterexamples, while generalizing results on calibration expressed via predictive distribution properties (means, quantiles, event probabilities).

calibration · probabilistic predictions · modal calibration · probability integral transform · hierarchical relations

Right Makes Might: Aligning Verified Hidden States Empowers RL Reasoning

arXiv cs.LG · Ziyue Wang, Aomufei Yuan, Yongfu Zhu, Shuai Dong · 2026-06-02

Hidden-Align introduces an auxiliary loss function for Reinforcement Learning from Verifiable Rewards (RLVR) that aligns the last-layer hidden states of correct rollouts at the anchor token, promoting a unified 'correct decision' representation. The method leverages the geometric structure of hidden states, specifically their convergence at the anchor token (cosine similarity ~0.84), while accounting for residual variance from unique reasoning paths. Implemented with zero training and inference overhead, Hidden-Align improves average pass@1 by 3.8, 6.2, and 5.4 percentage points on Qwen3-1.7B, 4B, and 14B models across eight mathematical reasoning benchmarks, with consistent pass@k gains validated through ablation studies.

reinforcement learning · hidden states · anchor token · auxiliary loss · mathematical reasoning

Learning Temporal Causal Structure via Smooth Differentiable Optimization

arXiv cs.LG · Tong Zhao, Ce Guo, Wayne Luk, Emil Lupu · 2026-06-02

The paper introduces a differentiable optimization approach for learning acyclic temporal causal structures in multivariate time series. By employing the Gumbel-Sinkhorn operator to learn variable permutations and triangularizing the instantaneous coefficient matrix of a Structural Vector Autoregressive (SVAR) model, the method converts acyclicity constraints into a continuous parameterization. This enables efficient gradient-based optimization without multi-stage pipelines or complex algebraic constraints. Evaluated on three real-world benchmarks against 12 baselines, the method achieves superior discovery accuracy and efficiency, with a 6x speedup on large-scale datasets compared to prior work.

causal discovery · structural vector autoregressive · gumbel-sinkhorn operator · differentiable optimization · acyclicity constraint
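
The Gumbel-Sinkhorn operator mentioned in the causal-discovery summary relaxes a permutation into a near doubly-stochastic matrix: Gumbel noise is added to a score matrix, which is then alternately row- and column-normalized in log space. The sketch below shows only that generic operator under assumed defaults (`tau`, `n_iters`, the seeded generator); it does not reproduce the paper's SVAR parameterization.

```python
import math
import random


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn(log_alpha, n_iters=50):
    """Alternate row and column normalization in log space."""
    n = len(log_alpha)
    la = [row[:] for row in log_alpha]
    for _ in range(n_iters):
        for i in range(n):                      # rows sum to 1
            z = logsumexp(la[i])
            la[i] = [v - z for v in la[i]]
        for j in range(n):                      # columns sum to 1
            z = logsumexp([la[i][j] for i in range(n)])
            for i in range(n):
                la[i][j] -= z
    return [[math.exp(v) for v in row] for row in la]


def gumbel_sinkhorn(scores, tau=1.0, n_iters=50, rng=None):
    """Perturb scores with Gumbel noise, scale by 1/tau, then run Sinkhorn."""
    rng = rng or random.Random(0)

    def gumbel():
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        return -math.log(-math.log(u))

    noisy = [[(s + gumbel()) / tau for s in row] for row in scores]
    return sinkhorn(noisy, n_iters)
```

Lowering `tau` pushes the output toward a hard permutation, at the cost of sharper gradients during optimization.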

Sample-Size Scaling of the African Languages NLI Evaluation

arXiv cs.LG · Anuj Tiwari, Oluwapelumi Ogunremu, Terry Oko-odion, Jesujuwon Egbewale · 2026-06-02

This study challenges the assumption that increasing labeled data monotonically improves performance in low-resource African languages by analyzing sample-size scaling effects on natural language inference (NLI) across 16 languages using AfriXNLI. The authors evaluate two multilingual transformers (XLM-R Large and AfroXLM-R Large, ~0.6B parameters) with controlled sample sizes (50-500 examples) via random subsampling. Results reveal language-dependent, often non-monotonic scaling patterns, including early saturation, performance drops, and high variance in low-data regimes, suggesting data volume alone is insufficient for reliable NLI improvements in African languages.

sample-size scaling · natural language inference · low-resource languages · multilingual transformers · non-monotonic performance

An Asymptotic Theory of Chain-of-Thought in In-Context Learning

arXiv cs.LG · Kaito Takanami, Cengiz Pehlevan · 2026-06-02

The paper develops an asymptotic theory of chain-of-thought (CoT) reasoning depth in in-context learning for linear regression weight prediction. Using random matrix theory under high-dimensional asymptotics, the authors derive exact generalization error formulas as functions of CoT depth, pretraining data, and context length. Results reveal phase transitions between exponential/polynomial improvement regimes, saturation, and overthinking, with optimal depth scaling characterized by pretraining and context richness. Experiments on learned linear and softmax attention models validate theoretical predictions.

chain-of-thought · in-context learning · random matrix theory · generalization error · linear regression

Bayesian Tensor Decomposition with Diffusion Model Prior

arXiv cs.LG · Zerui Tao, Qibin Zhao · 2026-06-02

The paper introduces DiffBCP, a Bayesian CP tensor decomposition framework combining a cumulative shrinkage process prior for rank selection with a pre-trained diffusion model as an implicit data prior. The method employs a split Gibbs sampler for tractable inference, separating conjugate updates for CP factors from diffusion-guided denoising steps, aided by a noise-adaptive coupling schedule. Experiments demonstrate improved performance in image inpainting and denoising tasks, including out-of-distribution scenarios, outperforming Bayesian, nonlinear, and plug-and-play tensor decomposition baselines.

bayesian tensor decomposition · diffusion model prior · cp decomposition · split gibbs sampler · cumulative shrinkage process

Critical evaluation of PINN for FWD inverse analysis and differentiable FEM as an alternative

arXiv cs.LG · Yongjin Choi, Hyeonbin Moon, Seunghwa Ryu · 2026-06-02

This study evaluates physics-informed neural networks (PINNs) and differentiable finite element method (DiffFEM) for inverse analysis of multilayer pavement systems using falling weight deflectometer (FWD) backcalculation. A synthetic benchmark reveals that standard PINNs fail to recover layer moduli due to domain discontinuities, while extended PINNs with domain decomposition (XPINNs) exhibit sensitivity to loss weighting, architecture, and noise. DiffFEM, enforcing physics as a hard constraint, outperforms PINNs in accuracy, stability, and computational efficiency. The findings suggest DiffFEM's practical advantages over PINNs when an efficient differentiable forward solver is available.

physics-informed neural networks · differentiable finite element method · falling weight deflectometer · inverse analysis · domain decomposition

DECA: Decentralizing Block-Wise Adam for Efficient LLM Full-Parameter Fine-Tuning on Non-IID Data

arXiv cs.LG · Yunsheng Yuan, Shaowei Li, Kai Wang, Zhongyuan Sun · 2026-06-02

DECA introduces a decentralized framework for full-parameter fine-tuning (FPFT) of large language models (LLMs) on non-IID data, addressing resource constraints and privacy concerns. The method partitions model parameters into disjoint blocks and employs sequential block-wise Adam optimization, reducing resource consumption while maintaining decentralized adaptation. DECA stabilizes training through first- and second-order block-wise moment estimates using fresh local gradients and consensus-derived discrepancy signals. Theoretical analysis and experiments demonstrate that DECA achieves fast convergence, strong downstream performance, and significant resource efficiency.

full-parameter fine-tuning · block-wise adam · non-iid data · decentralized optimization · client drift

Fast Organic Crystal Structure Prediction with Unit Cell Flow Matching

arXiv cs.LG · Alston Lo, Luka Mucko, Austin H. Cheng, Andy Cai · 2026-06-02

Clari introduces a fast flow matching model for organic crystal structure prediction (CSP), reducing computational cost from minutes to seconds per molecule. The method generates redundancy-free unit cells using pure pair-bias attention, requiring only atom types and bonds as input without RDKit sanitization. Clari achieves a 15-30× speedup over OXtal while improving solve rates, supports explicit hydrogens for energy ranking, and introduces the CSD Teaching Subset for benchmarking. Results show maintained performance with 5-8× speedup when selecting top-30 crystals by energy.

crystal structure prediction · flow matching · unit cell generation · pair-bias attention · energy ranking

FinStressTS: A Parametric Synthetic Benchmark for Time-Series Forecasting in Finance

arXiv cs.LG · Jiaze Sun, Kelvin J. L. Koa, Ruiyang Ni, Yize Liu · 2026-06-02

FinStressTS introduces a parametric synthetic benchmark for financial time-series forecasting, addressing limitations of real-world data by isolating six mechanism families: volatility clustering, multi-scale persistence, heavy-tailed shocks, regime switching, self-exciting jumps, and zero-inflated processes. The benchmark evaluates 15 models across point (NMAE) and probabilistic (CRPS) forecasting tasks, revealing mechanism-dependent performance: autoregressive models often outperform Transformers in volatility- and jump-driven environments, while parametric probabilistic models excel in stationary settings. Neural models show data inefficiency, with gains primarily in complex regimes. The framework enables failure mode diagnosis and risk-aware forecasting advancement.

synthetic benchmark · time-series forecasting · probabilistic forecasting · regime switching · signal-to-noise ratio

GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations

arXiv cs.LG · Jonggwon Park, Seongeun Lee, Junhyun Park, Hannah Yun · 2026-06-02

GLINT introduces a sparsely gated vision-language alignment framework for fine-grained radiology representations, addressing the mismatch between global supervision and localized findings in medical images. The method combines Sparsely Gated Alignment, which activates only relevant patches via sigmoid gates over a separate embedding space, with Dense Feature Regularization to preserve fine-grained patch features by anchoring to frozen SSL teachers (DINOv3 for 2D CXR, V-JEPA 2.1 for 3D CT). GLINT achieves zero-shot classification, grounding, and segmentation from free-text queries, outperforming SSL encoders and medical VLMs on downstream tasks, with particularly strong gains in localization-sensitive tasks.

vision-language models · sparse gating · zero-shot segmentation · self-supervised learning · radiology representations

Auditing Engagement Incentives in the Kidfluencer Ecosystem: A Multimodal Weak Supervision Approach

arXiv cs.LG · Zijing Wei, Chao Peter Yang, Xuanjie Chen · 2026-06-02

This study introduces a multimodal weak supervision approach to audit engagement incentives in the kidfluencer ecosystem, addressing the lack of scalable exploitation metrics. The method aggregates noisy labeling functions—including LLM-based text classification and GPT-4 Vision analysis—across six dimensions to assign probabilistic exploitation scores to 5,051 videos from 79 channels. Results show strong human alignment (macro-average F1=0.911) and significant engagement premiums for performative labor (65.6% view boost) and emotional bait (56.0%), with exploitation scores correlating with views (Spearman ρ=0.229, p<10^-50). Commercial content showed no premium, highlighting platform rewards for identity commodification over traditional advertising.

weak supervision · multimodal analysis · engagement premium · probabilistic scoring · kidfluencer ecosystem

SketchSong: Hierarchical Song Generation with Sketch Planning and Fine-Grained Multi-Track Modeling

arXiv cs.LG · Xiaoyue Duan, Nanxing Hu, Yutang Feng, Xudong Yan · 2026-06-02

SketchSong introduces a hierarchical framework for song generation addressing two key challenges: incoherent arrangements due to limited song-level planning and coarse musical part modeling. The method employs sketch planning to predict high-level structure before generating detailed audio tokens, and fine-grained multi-track modeling (vocals, bass, drums, other instruments) to capture part interactions. Evaluations show SketchSong outperforms baselines in objective metrics and human listening tests, achieving competitive results against post-trained systems without preference optimization.

hierarchical generation · sketch planning · multi-track modeling · arrangement coherence · audio tokens

FederatedSkill: Federated Learning for Agentic Skill Evolution

arXiv cs.LG · Jingbo Yang, Guanyu Yao, Yang Zhang, Ramana Rao Kompella · 2026-06-02

FederatedSkill introduces a privacy-preserving framework for collaborative agent skill evolution using federated learning. Instead of sharing raw trajectories, it employs semantic skill diffs as communication units, enabling personalized skill libraries while preserving privacy. A server-side evolution agent aggregates these diffs to model client-specific capabilities dynamically, avoiding suboptimal global averaging. Evaluated across 20 agent task families, FederatedSkill achieves a 44.4% higher success rate and 37.5% lower computational cost compared to self-evolving baselines.

federated learning · skill evolution · semantic skill diffs · privacy-preserving · agentic tasks

How Visible Are Silent Manipulation Failures? An Observability Study of False-Success Detection in Simulated Robot Episodes

arXiv cs.LG · Aarav Bedi · 2026-06-02

The study investigates the observability of false-success episodes in robot manipulation tasks, where imitation-learning policies erroneously label failed episodes as successful. Using a simulated testbed with bimanual ALOHA tasks, the authors compare proprioception-based and vision-based detectors to assess recoverability of false successes. Results show varying recoverability: cube transfer tasks are nearly fully recoverable from joint data, while peg insertion requires vision to close the gap. Proprioceptive separability relies on velocity differences below realistic noise floors, indicating simulator-inflated optimism. The pipeline for generation and evaluation is publicly released.

imitation-learning · proprioception · false-success · bimanual · observability

HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models

arXiv cs.LG · Shuang Liu, Yuxuan Bo, Qiuyang Zhao, Caiyue Huang · 2026-06-02

The paper introduces HARVE, a training-free method for improving reward-model robustness against hacking by editing the reward-head vector. HARVE identifies a multi-directional hacking subspace from residual stream directions associated with specific hacking subcategories and removes the reward-head's aligned component, using only contrastive gold-hacked examples without gradient updates. Evaluated on RewardHackBench with 13 hacking patterns across eight reward models, HARVE enhances robustness, outperforms fine-tuning baselines, and preserves general capabilities. Analysis suggests reward hacking manifests as a multidimensional residual-space structure rather than isolated cues.

reward hacking · residual stream · reward-head vector · contrastive examples · training-free
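
The core editing step in the HARVE summary, removing the reward-head component that lies in a hacking subspace, can be sketched as an orthogonal projection. The code below assumes the hacking directions are already given as plain vectors. How the paper extracts them from contrastive gold-hacked examples is not shown, and the helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(directions, eps=1e-10):
    """Gram-Schmidt over the given directions, dropping near-dependent ones."""
    basis = []
    for d in directions:
        v = list(d)
        for q in basis:
            c = dot(v, q)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(dot(v, v))
        if norm > eps:
            basis.append([vi / norm for vi in v])
    return basis


def remove_subspace(head, directions):
    """Project the head vector onto the orthogonal complement of the subspace."""
    edited = list(head)
    for q in orthonormal_basis(directions):
        c = dot(edited, q)
        edited = [e - c * qi for e, qi in zip(edited, q)]
    return edited
```

For example, `remove_subspace([1.0, 2.0, 3.0], [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])` leaves only the third coordinate, since the two directions span the first two axes. No gradient step is involved, which matches the training-free framing of the summary.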

Synthetic Hallucinations, Real Gains: Hard Negatives from Frontier Models for FIM Hallucination Mitigation

arXiv cs.LG · Mahdi Erfanian, Nelson Daniel Troncoso, Aashna Garg, Amabel Gale · 2026-06-02

The paper introduces a novel method to mitigate Fill-in-the-Middle (FIM) hallucinations in code autocompletion by synthesizing hard negatives using frontier models. It leverages a multilingual dataset from GitHub, generating plausible-but-wrong completions across eight languages and four hallucination types. Fine-tuning Qwen2.5-Coder-7B-Instruct on a curated 100K-row subset improves Delulu exact match by +18.8 points and edit similarity by +0.22, with consistent gains across HumanEval-Infilling and SAFIM benchmarks. Ablations analyze size, type mix, language coverage, base-model family, and difficulty-aware fool rate. The pipeline source code is released for reproducibility.

fill-in-the-middlehallucination mitigationhard negativesmultilingual datasetfine-tuning

Rethinking Neural Width for Alternating Current Optimal Power Flow Proxies

arXiv cs.LG · Dhruvi Khandelwal, Anurag Basistha, Ayushi Jolotia, Parikshit Pareek · 2026-06-02

The paper introduces Loss-Guided Neural Densification (LG-ND), an algorithm that incrementally determines the necessary width of neural networks for approximating the Alternating Current Optimal Power Flow (ACOPF) manifold. LG-ND expands the network topology only when performance plateaus, ensuring minimal architectural complexity. Empirical evaluations across various IEEE systems demonstrate that LG-ND achieves comparable performance to existing baselines while using up to ten times fewer neurons per layer. This architectural minimalism is crucial for formal verification in safety-critical grid operations.

neural densificationacopf manifoldformal verificationarchitectural minimalismieee systems
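
The "expand only when performance plateaus" rule can be sketched as a small controller over a loss trace. The plateau test, the `patience` and `min_rel_improvement` parameters, the fixed growth step, and the loss values are all assumptions for illustration; the summary does not specify how LG-ND detects plateaus or how many neurons it adds.

```python
def should_widen(losses, patience=3, min_rel_improvement=0.01):
    """True when the last `patience` losses fail to beat the earlier best by the given margin."""
    if len(losses) <= patience:
        return False
    earlier_best = min(losses[:-patience])
    recent_best = min(losses[-patience:])
    return recent_best > earlier_best * (1.0 - min_rel_improvement)

def width_schedule(loss_trace, start_width=8, step=8, patience=3):
    """Return the layer width in use after each epoch of a (hypothetical) loss trace."""
    width, history, widths = start_width, [], []
    for loss in loss_trace:
        history.append(loss)
        if should_widen(history, patience):
            width += step   # grow the layer only once training has stalled
            history = []    # restart plateau detection for the wider network
        widths.append(width)
    return widths

trace = [1.0, 0.5, 0.3, 0.3, 0.3, 0.3, 0.2, 0.1]  # made-up positive losses
widths = width_schedule(trace)
```

In a real training loop the widening step would also have to initialize the new neurons; that part is omitted here.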

TiWeaver: Unified Temporal Dynamics Modeling via Contextual Patching

arXiv cs.LG · Zhe Li, Jindong Tian, Hao Miao, Zhi Lei · 2026-06-02

TiWeaver introduces a unified framework for multivariate time series forecasting that adaptively handles diverse temporal dynamics and inter-channel dependencies. The method combines a Graph-Guided Adaptive Tokenizer (G$^2$AT) for contextually coherent patching based on temporal density and representation consistency, with a Fine-grained Asynchronous Dependency Extractor (FADE) for modeling asynchronous inter-channel and long-term dependencies. Evaluated on 12 real-world datasets, TiWeaver achieves state-of-the-art performance, outperforming existing methods by up to 25%.

multivariate time seriestemporal dynamicsadaptive tokenizerasynchronous dependenciescontextual patching

Learning to See via Epiretinal Implant Stimulation in silico with Model-Based Deep Reinforcement Learning

arXiv cs.LG · Jacob Lavoie, Marwan Besrour, William Lemaire, Jean Rouat · 2026-06-02

The study introduces a deep reinforcement learning approach for optimizing image rendering via epiretinal implants by strategically combining isotropic and anisotropic phosphenes. Using a model-based framework (rlretina) with the axon map model, the method trains an agent to assemble shapes that enhance image intelligibility for virtual patients with varying retinal configurations. Results demonstrate superior performance over naive methods, measured through psychophysically validated metrics, advancing techniques for artificial vision restoration.

epiretinal implantsdeep reinforcement learningaxon map modelphosphene renderingstroke-based rendering

Trans GAN-WT: A Feature Extraction and Interactive Learning-Based Anomaly Detection Model for Wind Turbine Time Series Data

arXiv cs.LG · Jingzhe Kang · 2026-06-02

TransGAN-WT introduces a Transformer-GAN fusion model for anomaly detection in wind turbine time series data, addressing limitations in relational modeling and anomaly data utilization. The model amplifies reconstruction errors to reduce minor deviation leakage, employs autoregressive inference for multimodal feature extraction, and constructs a temporal feature extraction module for interactive learning across time scales. Evaluated on real-world wind turbine datasets, TransGAN-WT achieves a 96.10% average F1 score, outperforming state-of-the-art baselines by 5.84% and 2.89%, with a 0.06% false positive rate. Statistical significance is confirmed via Wilcoxon signed-rank test.

transformergenerative adversarial networkanomaly detectionmultimodal feature extractiontemporal feature extraction

Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation

arXiv cs.LG · Dongsheng Wang, Dawei Su, Hui Huang · 2026-06-02

The paper proposes KeyVT, a hierarchical view-to-token transportation method for zero-shot 3D question answering using 2D Vision-Language Models (VLMs). The approach combines pixel features with camera parameters to select spatially consistent and task-relevant views, then employs optimal transport (OT) to identify representative tokens across views, minimizing redundancy. Evaluated on three benchmarks, KeyVT outperforms tuning-free methods and matches training-based approaches in performance.

zero-shot learningvision-language modelsoptimal transport3d scene understandinghierarchical representation

FGRPO: Federated GRPO with Adaptive Aggregation on Non-IID Data

arXiv cs.LG · Pengyu Chen, Shaowei Li, Kai Wang, Yunsheng Yuan · 2026-06-02

The paper introduces Federated GRPO (FGRPO), a decentralized framework for fine-tuning reasoning models across heterogeneous data owners while preserving privacy. FGRPO adapts Group Relative Policy Optimization (GRPO) to federated learning by incorporating an adaptive aggregation mechanism that prioritizes clients based on relative performance gain, mitigating instability from divergent reward scales. The method demonstrates robust convergence on non-IID data by dynamically weighting updates according to local task difficulty and historical baselines.

federated learninggroup relative policy optimizationnon-iid dataadaptive aggregationreasoning models
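
A rough sketch of the two ingredients named above. The group-relative advantage is the standard GRPO normalization; the aggregation rule (a softmax over each client's gain relative to its own historical baseline) is an assumption chosen to show how relative gains sidestep divergent reward scales, and may differ from the paper's mechanism.

```python
import math
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def aggregation_weights(current, baseline, temperature=1.0, eps=1e-8):
    """Assumed rule: softmax over each client's gain relative to its own baseline."""
    gains = [(c - b) / (abs(b) + eps) for c, b in zip(current, baseline)]
    top = max(gains)
    exps = [math.exp((g - top) / temperature) for g in gains]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(updates, weights):
    """Weighted average of client update vectors."""
    dim = len(updates[0])
    return [sum(w * u[i] for w, u in zip(weights, updates)) for i in range(dim)]

advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Two hypothetical clients whose rewards live on very different scales.
weights = aggregation_weights(current=[0.6, 60.0], baseline=[0.5, 58.0])
global_update = aggregate([[1.0, 0.0], [0.0, 1.0]], weights)
```

The first client improves by 20% and the second by roughly 3%, so the first receives the larger weight even though its raw reward is a hundred times smaller.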

Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR

arXiv cs.LG · Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu · 2026-06-02

The paper identifies correct-set turnover in reinforcement learning with verifiable rewards (RLVR), where models lose previously mastered solutions during training, and proposes retention-aware optimization as a solution. The authors formalize this phenomenon, demonstrate the repair-window principle (regression repair costs grow exponentially with delay), and introduce a review mechanism that periodically reintroduces mastered prompts without additional rollout overhead. Evaluated on 20 benchmarks across image-text, video, and text-only tasks using Qwen3-VL and Qwen2.5-Math, the method outperforms GRPO, DAPO, and replay baselines while maintaining generalizability across modalities.

correct-set turnoverrlvrrepair-window principleretention-aware optimizationpre-rollout batch replacement

Multi-component Causal Tracing in Large Language Models

arXiv cs.LG · Zirui Yan, Dennis Wei, Dmitriy A. Katz, Prasanna Sattigeri · 2026-06-02

The paper introduces a unified framework for multi-component causal tracing in large language models (LLMs), extending prior single-component or single-layer approaches. The method employs flexible interventions across various metrics and addresses combinatorial complexity through soft interventions and metric transformation, converting the problem into a continuous optimization task. This enables efficient identification of critical subsets of components (e.g., attention heads, multi-layer perceptron neurons) influencing target metrics like accuracy and fairness. Experimental results show the framework outperforms existing baselines in identifying high-impact components. Code is available at https://github.com/ZiruiYan/multi-component-causal-tracing.

causal tracinglarge language modelssoft interventionsmetric transformationcombinatorial complexity

RMPrior: Bridging Propagation Priors and Diffusion Refinement for Efficient Radio Map Construction

arXiv cs.LG · Zixuan Guo, Xiucheng Wang, Nan Cheng · 2026-06-02

RMPrior introduces a mid-start sampling strategy that bridges propagation priors and diffusion refinement for efficient radio map construction. The method perturbs a matched propagation prior to an intermediate diffusion timestep, allowing a pretrained diffusion backbone to focus on multipath-aware refinement rather than full reconstruction from noise. Theoretical analysis provides an upper bound on the initialization gap and characterizes prior-quality sensitivity under aggressive truncation. Experiments on IRT4HighRes demonstrate a 2.01× speedup and improved NMSE, RMSE, SSIM, and PSNR at P_start=0.5 compared to the full-step baseline. Prior-quality ablation confirms reconstruction quality tracks prior fidelity, amplified under shorter reverse trajectories.

diffusion modelspropagation priorsradio map constructionmultipath-aware refinementmid-start sampling
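
Mid-start sampling can be illustrated with the standard DDPM forward-noising formula: the prior is noised to an intermediate timestep and the reverse process runs only from there. The linear beta schedule, `T=1000`, and the reading of P_start as a fraction of the full trajectory are assumptions; the denoising backbone itself is not shown.

```python
import math
import random

def alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    out, prod = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out

def mid_start(prior, p_start, T=1000, seed=0):
    """Noise a propagation prior to timestep round(p_start * T).

    Returns the perturbed sample and the number of reverse steps left to run.
    """
    rng = random.Random(seed)
    t_start = max(1, int(round(p_start * T)))
    a = alpha_bar(T)[t_start - 1]
    noisy = [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
             for x in prior]
    return noisy, t_start

prior_map = [0.2, -0.5, 1.0]  # hypothetical flattened prior radio map
x_t, steps_left = mid_start(prior_map, p_start=0.5)
```

Under these assumptions, P_start=0.5 leaves half of the reverse steps to run, which is in line with the roughly 2× speedup reported above.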

Trajectory-Aware Node Contributions and the Limits of Static Controllability

arXiv cs.LG · Valentina Kuskova, Dmitry Zaytsev, Michael Coppedge · 2026-06-02

The authors introduce 'emergent contribution (EC)', a finite-horizon measure quantifying a node's dynamical leverage in complex networks, computed from Jacobians of differentiable models. EC generalizes average controllability to nonlinear, time-varying systems and reduces to it in the linear, time-invariant case. Using a synthetic family with known ground truth, they construct a phase diagram analyzing agreement between EC and average controllability across nonlinearity, regime structure, persistence, and perturbation amplitude. Results show agreement under static or smoothly drifting dynamics, divergence under persistent regime switching, and degradation at extreme perturbations. EC's utility is demonstrated on five real systems, revealing variance-leverage dissociation not captured by static centralities.

emergent contributionaverage controllabilityjacobiansnonlinear dynamicsphase diagram

What Do Students Learn? A Feature-Level Analysis of Dark Knowledge

arXiv cs.LG · Seungu Kang, Songkuk Kim · 2026-06-02

This work investigates feature-level learning mechanisms in Knowledge Distillation (KD) through the Interaction Tensor framework, revealing that KD acts as a regularizer by pruning low-frequency features and promoting reusable feature sets. The authors identify structural information in dataset-level confusion matrices analogous to teacher Dark Knowledge, leading to Confusion Distillation (CD), a teacher-free self-distillation method using evolving confusion patterns as soft targets. CD achieves 1.2% higher accuracy than CS-KD and PS-KD on CIFAR-100 with ResNet-34/50 while being computationally efficient.

knowledge distillationinteraction tensorconfusion distillationfeature regularizationdark knowledge

Will Accurate Fields Mislead Photonic Design? From Global Accuracy to Port Readout

arXiv cs.LG · Yitian Zhang, Yonghong Chen, Youming Chen, Yiyang Li · 2026-06-02

The paper introduces PaNO, a propagation-aligned neural operator for photonic design, addressing the mismatch between global field accuracy and port-readout fidelity in neural field surrogates. By organizing latent states around boundary structure, modal content, and propagation dynamics, PaNO improves design-relevant metrics despite higher global error. The variant PaNO-R2 further optimizes residual field components near ports. Evaluated on a 15-wavelength MMI benchmark (4608 fields), PaNO reduces NeurOLight's port-power error from 0.2018 to 0.0739, while PaNO-R2 achieves the lowest errors across cMAE (72.7%), propagation-profile (72.5%), and port-power metrics.

neural field surrogatesphotonic designpropagation-alignedport-readout fidelitymmi benchmark

A Fast Screening Approach for High-dimensional Outcomes and High-dimensional Predictors

arXiv cs.LG · Hongju Park, Zhenyao Ye, Shuo Chen · 2026-06-02

The authors propose Graph Independence Dual Screening (GIDS), a novel screening framework that simultaneously reduces dimensionality in both predictors and outcomes for high-dimensional multimodal data. GIDS employs computationally efficient algorithms to identify sparse interaction structures, supported by theoretical guarantees. In simulations, GIDS outperforms predictor-only screening methods. Applied to Alzheimer's Disease Neuroimaging Initiative data (865,353 DNA methylation sites and 49,386 transcripts), GIDS reduced features to ~9,000 CpGs and ~2,000 transcripts, revealing blockwise regulatory interactions with biological interpretability.

dimensionality reductionmultimodal datainteraction screeninghigh-dimensional statisticsgenomic feature selection

MOSAIC: Efficient Mixture-of-Agent Scheduling via Adaptive Aggregation and Inference Concurrency

arXiv cs.LG · Saptarshi Mitra, Yifan Zhang, Rachid Karami, Phyo Pyae Moe Aung · 2026-06-02

MOSAIC introduces an efficient scheduling framework for Mixture-of-Agents (MoA) systems, addressing GPU idling and throughput collapse caused by load imbalances. The method combines an Integer Linear Program (ILP) scheduler for joint optimization of expert placement and prompt assignment with confidence-aware adaptive aggregation to bypass heavy final aggregator LLMs for consensus queries. Evaluated on a 4-GPU system, MOSAIC achieves up to 2.5x expert-stage, 4.23x aggregator-stage, and 1.7-2.3x end-to-end speedups over baselines while maintaining accuracy within 0.1 percentage points.

mixture-of-agentsinteger linear programgpu schedulingadaptive aggregationload balancing
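
The aggregator bypass can be sketched as a simple agreement check: when enough experts return the same answer, the expensive aggregation call is skipped. The agreement threshold, the exact-match comparison of answers, and the stand-in aggregator are assumptions; the summary does not say how MOSAIC measures confidence.

```python
from collections import Counter

def aggregate_with_bypass(answers, aggregator, min_agreement=0.75):
    """Return (final_answer, aggregator_was_called)."""
    top_answer, votes = Counter(answers).most_common(1)[0]
    if votes / len(answers) >= min_agreement:
        return top_answer, False       # consensus: skip the heavy aggregator
    return aggregator(answers), True   # disagreement: fall back to the aggregator

calls = []

def fake_aggregator(answers):
    """Stand-in for the final aggregator LLM; records that it was invoked."""
    calls.append(list(answers))
    return "aggregated:" + "|".join(sorted(set(answers)))

consensus = aggregate_with_bypass(["42", "42", "42", "41"], fake_aggregator)
contested = aggregate_with_bypass(["42", "41", "40", "42"], fake_aggregator)
```

Only the contested query reaches the aggregator, so aggregator-stage work falls with the share of queries on which the experts agree.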

CoughSense: Five-Class Respiratory Disease Classification via Whisper Encoder Fine-Tuning and Dual-Encoder Cross-Attention Fusion with Balanced Contrastive Learning

arXiv cs.LG · Nikhil Vincent · 2026-06-02

CoughSense introduces a five-class respiratory disease classifier using cough recordings, achieving 82.3% balanced accuracy via Whisper encoder fine-tuning and dual-encoder cross-attention fusion. Key innovations include active-frame QKV attention pooling to handle Whisper's 30-second input window (focusing on the first 200 of 1500 tokens), balanced contrastive learning for 19:1 class imbalance, and domain adaptation for multi-dataset training. The system outperforms ImageNet-pretrained EfficientNet-B2 by 11.1 points and a ViT baseline by 29.6 points, with the dual-encoder variant reaching 85.4% accuracy. Active-frame pooling contributes +5.1 points in ablations.

active-frame poolingwhisper encoderdual-encoder fusionbalanced contrastive learningrespiratory classification

Neural Networks Provably Learn Spectral Representations for Group Composition

arXiv cs.LG · Jianliang He, Leda Wang, Fengzhuo Zhang, Siyu Chen · 2026-06-02

The paper proves that two-layer neural networks trained on group composition tasks develop structured spectral representations through gradient flow dynamics. By analyzing the training process in the Fourier domain, the authors show that neurons converge to irreducible group representations with rotational rank-one alignment of cross-layer coefficients. For Abelian groups, random initialization leads to uniform diversification across representations and Haar-uniform phases, approximating the indicator function via majority vote. Theoretical results include exponential convergence rates for both phase alignment and representation competition.

spectral representationsgroup compositionirreducible representationgradient flowhaar-uniform phases

DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference

arXiv cs.LG · Kathiravan Palaniappan · 2026-06-02

DriftSched introduces an adaptive QoS-aware scheduling framework for multi-tenant LLM inference on NVIDIA L4 GPUs, addressing runtime token drift caused by output length estimation errors. The method combines workload classification, token-budget estimation, tenant-aware queue management, and drift compensation, evaluating FIFO, Priority, Weighted, SJF, and Aging Priority policies. Results show adaptive bias correction reduces estimation error by 38.8% (MAE) and 40.5% (RMSE), with SJF achieving 42% lower median latency and 16% lower P99 latency versus FIFO under GPU contention.

multi-tenant schedulingtoken driftqos-awarellm inferenceadaptive bias correction
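
One way to picture drift compensation feeding a shortest-job-first (SJF) policy: keep a running estimate of how far actual output lengths deviate from predictions per workload class, and sort the queue by the corrected estimate. The exponential-moving-average update, the class and field names, and the token counts are illustrative assumptions rather than the paper's estimator.

```python
class DriftCompensator:
    """Tracks per-workload bias between predicted and actual output tokens."""

    def __init__(self, smoothing=0.2):
        self.smoothing = smoothing
        self.bias = {}

    def corrected(self, workload, predicted):
        """Predicted length plus the learned bias, floored at one token."""
        return max(1.0, predicted + self.bias.get(workload, 0.0))

    def observe(self, workload, predicted, actual):
        """Exponential moving average of the signed estimation error."""
        old = self.bias.get(workload, 0.0)
        error = actual - predicted
        self.bias[workload] = (1.0 - self.smoothing) * old + self.smoothing * error

def sjf_order(requests, compensator):
    """Shortest-job-first on drift-corrected token estimates."""
    return sorted(
        requests,
        key=lambda r: compensator.corrected(r["workload"], r["predicted_tokens"]),
    )

queue = [
    {"id": "a", "workload": "chat", "predicted_tokens": 100},
    {"id": "b", "workload": "code", "predicted_tokens": 150},
]
comp = DriftCompensator()
before = [r["id"] for r in sjf_order(queue, comp)]

# Two finished chat requests ran far longer than predicted.
comp.observe("chat", predicted=100, actual=300)
comp.observe("chat", predicted=100, actual=300)
after = [r["id"] for r in sjf_order(queue, comp)]
```

After the two observations the chat request is no longer treated as the shorter job, so the code request moves to the front.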

Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving

arXiv cs.LG · Chunan Shi, Yilei Chen, Yilin Chen, Xupeng Miao · 2026-06-01

The paper introduces AsymCache, a computation-latency-aware KV-cache management system for LLM inference that optimizes both cache hit rate and GPU attention kernel efficiency. The method combines Multi-Segment Attention (MSA) for non-contiguous KV processing, a position-aware eviction policy, and an adaptive chunking scheduler. Experiments demonstrate 1.90-2.03x faster time-to-first-token (TTFT) and 1.62-1.71x improved time-per-output-token (TPOT) versus baselines, with an 18.1% latency reduction in agent serving systems like Continuum.

kv-cachellm inferencemulti-segment attentiongpu optimizationcache management

KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators

arXiv cs.LG · Taras Sereda, Burak Bartan, Ankita Nayak, Tom St. John · 2026-06-01

KForge introduces a cross-platform framework for AI accelerator kernel generation using a dual-agent LLM-driven approach. The system employs a generation agent for iterative kernel refinement via compilation feedback and a performance-analysis agent for profiling-based optimization recommendations. The refinement loop alternates between functional correctness passes and performance optimization passes. Evaluated on NVIDIA B200 and Intel Arc B580, KForge achieves a 2.12% throughput improvement over TensorRT-LLM on GPT-OSS-20B inference and a 5.13× geometric mean speedup on KernelBench Level 2 workloads compared to PyTorch variants, primarily through operator fusion and mixed-precision execution.

kernel generationllm-drivencross-platformoperator fusionmixed-precision

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

arXiv cs.LG · Ryle Goehausen, Marcus Sousa · 2026-06-01

The paper introduces Gate AI, a rigorous evaluation methodology for LLM security benchmarks addressing two common weaknesses: per-dataset threshold tuning and undisclosed operating points. The approach employs 5-fold cross-validation across 16 public benchmarks (12,111 samples) with StratifiedKFold and StratifiedGroupKFold for leakage detection, selecting a single global operating point (max F1 at FPR ≤1%). Generalization is assessed through multiple diagnostics including leave-one-dataset-out cross-validation, adversarial validation, and paraphrase-invariance probes. Results demonstrate robust evaluation by enforcing uniform thresholds and comprehensive leakage checks.

prompt-injectionjailbreak detectorscross-validationnear-duplicate detectiongeneralization diagnostics
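
Selecting one global operating point (maximum F1 subject to an FPR cap) over pooled data can be sketched as a threshold sweep. The scores and labels below are made up, and the function is a generic illustration rather than the Gate AI code.

```python
def pick_operating_point(scores, labels, max_fpr=0.01):
    """Return (threshold, f1, fpr) maximizing F1 among thresholds with FPR <= max_fpr.

    A sample is flagged when score >= threshold; labels are 1 (attack) or 0 (benign).
    Returns None if no threshold satisfies the FPR cap.
    """
    negatives = labels.count(0)
    best = None
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
        fpr = fp / negatives if negatives else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if fpr <= max_fpr and (best is None or f1 > best[1]):
            best = (t, f1, fpr)
    return best

# Hypothetical detector scores pooled across all benchmarks before thresholding.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

strict = pick_operating_point(scores, labels, max_fpr=0.01)
loose = pick_operating_point(scores, labels, max_fpr=0.25)
```

Because the sweep runs once on the pooled samples, every benchmark is then scored at the same threshold, which is the opposite of per-dataset tuning.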

The Road Ahead in Autonomous Driving: The KITScenes Multimodal Dataset

arXiv cs.LG · Richard Schwarzkopf, Fabian Immel, Alexander Blumberg, Jonas Merkert · 2026-06-01

The KITScenes Multimodal dataset addresses limitations in existing autonomous driving datasets by offering high-fidelity sensor data and comprehensive HD maps. It features synchronized global-shutter cameras, 400m-range lidar, 4D imaging radar, and redundant GNSS/INS, alongside 3D-mapped traffic elements with full topological connectivity. Recorded in diverse European urban environments, the dataset enables four novel benchmarks: online HD map construction, long-range depth estimation, novel view synthesis, and end-to-end driving. This resource expands geographic diversity and sensor capabilities for embodied AI research.

autonomous drivingmultimodal datasethd maps4d imaging radartopological connectivity

Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference

arXiv cs.LG · Siva Rajesh Kasa, Yasong Dai, Sumit Negi, Hongdong Li · 2026-06-01

Fast-dLLM++ introduces Fréchet profile decoding to accelerate diffusion large language model inference by leveraging heterogeneous token confidence profiles. The method generalizes Fast-dLLM's parallel decoding through a training-free extension that selects commit sets from sorted confidence profiles rather than relying on worst-case confidence assumptions. Experiments on GSM8K, MATH, HumanEval, and MBPP with LLaDA-8B demonstrate up to 37% throughput improvement at comparable accuracy, validating the theoretical heterogeneity bonus.

diffusion llmparallel decodingconfidence profilekv cachingthroughput optimization

From Non-Convex to Strongly Convex: Curvature-Adaptive FTPL for Online Optimization

arXiv cs.LG · Moses Charikar, Chirag Pabbaraju, Ambuj Tewari · 2026-06-01

The authors present a curvature-adaptive Follow-the-Perturbed-Leader (FTPL) algorithm for online optimization with non-convex losses, achieving optimal regret bounds without prior curvature knowledge. By dynamically adjusting perturbation scales based on past information, the method interpolates between $O(\sqrt{T})$ regret for general non-convex Lipschitz losses and $O(\log T)$ under sufficient cumulative curvature, matching strongly convex rates. Theoretical analysis includes matching lower bounds, demonstrating the fundamental tradeoff between worst-case regret and curvature acceleration.

online optimizationnon-convex lossesfollow-the-perturbed-leadercurvature adaptivityregret bounds

BYORn: Bootstrap Your Own Responses to Defend Large Vision-Language Models Against Backdoor Attacks

arXiv cs.LG · Ivan Sabolić, Marin Oršić, Josip Šarić, Sven Lončarić · 2026-06-01

The paper introduces BYORn, a defense framework against backdoor attacks in autoregressive vision-language models during supervised fine-tuning. The method detects semantically implausible poisoned responses by leveraging pretrained model knowledge, replacing them with model-generated alternatives to disrupt trigger-target correlations. Experiments show BYORn improves robustness (reducing attack success rate by 30-50% across datasets) while maintaining clean-task performance, establishing a superior trade-off frontier, and remains effective against adaptive attacks.

backdoor defensevision-language modelsautoregressive generationsemantic alignmentadaptive attacks

Outsmarting the Chameleon: Counterfactual Decoupling for Tactical OOD Shifts in Live Streaming Risk Assessment

arXiv cs.LG · Yiran Qiao, Jing Chen, Jiaqi Xu, Yang Liu · 2026-06-01

The paper introduces Latent-Predictive Counterfactual Decoupling (LPCD), a framework for detecting adversarial risks in live streaming by addressing tactical out-of-distribution (OOD) shifts. LPCD models latent intent and narrative variation, enforcing counterfactual consistency to stabilize risk prediction under evolving tactics. It includes a lightweight calibration step at inference to mitigate tactic-induced shifts. Evaluations on industrial datasets and production traffic show LPCD outperforms state-of-the-art baselines in moderating adversarial risks.

latent counterfactual consistencytactical ood shiftlive streaming riskadversarial intentparameter-free calibration

ERP-XTTN: Interpretable Prototype-Guided Cross-Attention for Cross-Subject ERP Classification

arXiv cs.LG · Charlotte Genevier Wyman, Leanne Hirshfield · 2026-06-01

ERP-XTTN introduces a prototype-guided cross-attention architecture for interpretable, calibration-free cross-subject ERP classification, leveraging query-key-only cross-attention to route EEG patches to fixed difference-wave prototypes derived from training-fold extrema. Evaluated across three datasets (BNCI Horizon 2020, HRI Cursor, ERP CORE) spanning eight ERP components, ERP-XTTN achieves competitive performance with mean AUROC gaps of .018 (3-channel) and .034 (full montage) against EEGNet and xDAWN+RG baselines. The architecture reveals neurophysiologically explicable error patterns, demonstrating false positives resembling true positives more than true negatives. ERP-XTTN generalizes across diverse ERPs under causal, calibration-free conditions with minimal interpretability cost at reduced montages.

cross-attentionerp classificationdifference-wave prototypesloso evaluationneurophysiological interpretability

Hierarchical RBF-KAN and RBF-SKAN Architectures for Multidimensional Function Approximation and Random Field Learning

arXiv cs.LG · Mingtao Xia, Qijing Shen · 2026-06-01

The manuscript introduces hierarchical RBF-KAN and RBF-SKAN architectures for multidimensional function approximation and random field learning, employing radial basis functions as activations. Theoretical analysis demonstrates universal approximation capabilities, with quantitative estimates showing reduced curse of dimensionality in high-dimensional function learning. Empirical results validate effectiveness in learning multivariate functions and random field models under the Wasserstein-2 metric.

radial basis functionskolmogorov-arnold networksuniversal approximationwasserstein-2 metricrandom field learning

Fast Unlearning at Scale via Margin Self-Correction

arXiv cs.LG · Federico Di Gennaro, Alexander Shevchenko, Fanny Yang · 2026-06-01

The paper introduces MArgin Self-Correction (MASC), an efficient unlearning method for language models that eliminates unnecessary computation in existing approaches. MASC actively reduces the logit gap between original and alternative next tokens for forget sequences, using an online stopping rule without downstream evaluation. Evaluated on TOFU, MUSE News, and MUSE Books, MASC achieves competitive forget-retain trade-offs at reduced computational cost compared to baselines. Results show improved trade-offs with increasing model size, maintaining forget metrics while enhancing retain utility.

language-model unlearninglogit gaponline stopping ruleforget-retain trade-offmargin self-correction

Data-Driven Forecasting of Three-Component Seismograms Using Transformer Architectures

arXiv cs.LG · Waleed Esmail, Stuart Russell, Jana Klinge, Alexander Kappes · 2026-06-01

The paper introduces SeismoGPT, a transformer-based autoregressive model for forecasting three-component seismic waveforms in the time domain. The method formulates forecasting as a physically constrained continuation problem, using waveform context from P-wave arrival to beyond S-wave arrival, with recursive future motion generation. Evaluated on synthetic seismograms (depth 5-100 km, distance 10-90°, magnitude 3-7), the model achieves median normalized cross-correlation >0.93 across configurations, preserving phase coherence and spectral energy. Failure cases stem from phase drift during autoregressive rollout, not unphysical generation.

seismogramstransformerautoregressivewaveformcross-correlation
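
The headline metric can be illustrated with a zero-lag normalized cross-correlation between a forecast and a reference trace. Whether the paper evaluates at zero lag or maximizes over lags is not stated, so the zero-lag form and the synthetic sine traces are assumptions.

```python
import math

def normalized_cross_correlation(pred, target):
    """Zero-lag NCC in [-1, 1]; 1 means identical shape up to scale and offset."""
    mp = sum(pred) / len(pred)
    mt = sum(target) / len(target)
    num = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    den = math.sqrt(sum((p - mp) ** 2 for p in pred)
                    * sum((t - mt) ** 2 for t in target))
    return num / den if den else 0.0

n, period = 200, 50
reference = [math.sin(2 * math.pi * k / period) for k in range(n)]
in_phase = [2.0 * x for x in reference]  # same phase, different amplitude
drifted = [math.sin(2 * math.pi * k / period + math.pi / 2) for k in range(n)]

ncc_in_phase = normalized_cross_correlation(in_phase, reference)
ncc_drifted = normalized_cross_correlation(drifted, reference)
```

A quarter-period phase shift drives the score to about zero even though the drifted waveform is still a clean sinusoid, which shows how phase drift alone can lower this metric.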

Fairness Definitions and Metrics in Deep Reinforcement Learning for Drug Discovery in Healthcare: A Rapid Evidence Review

arXiv cs.LG · Esmaeil Shakeri, Ronnie de Souza Santos, Behrouz Far · 2026-06-01

The review synthesizes fairness definitions and metrics for deep reinforcement learning (DRL) in de novo molecular design, addressing dataset composition, reward design, and evaluation parity. It analyzes (i) scaffold versus random splits for distribution shift, (ii) bias in reward functions (e.g., QED, docking scores), and (iii) fairness metrics across disease areas and chemotypes. A PRISMA-style screening of literature (2017 onward) links parity outcomes to dataset and reward choices, providing practical guidance for reporting distribution and outcome parity in DRL-driven drug discovery, particularly for cancer targets.

deep reinforcement learningmolecular designfairness metricsscaffold splitreward bias

Multi-Modal Machine Learning for Breast Cancer Recurrence Prediction

arXiv cs.LG · Jiahao Shao, Xudong Wang, Anam Nawaz Khan, Christopher Brett · 2026-06-01

The study demonstrates that multi-modal integration of clinical data improves breast cancer recurrence prediction accuracy compared to single-modal methods. The proposed approach combines structured treatment records with unstructured pathology reports and clinician notes, using rule-based regular expression extraction and precedence-based conflict reconciliation to extract tumor characteristics from free text. Evaluations across multiple machine learning models show consistent performance gains from multi-modal inputs over traditional single-source feature sets.

multi-modal learningrecurrence predictionrule-based extractionconflict reconciliationclinical text processing
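
Rule-based extraction with precedence-based reconciliation might look like the following sketch. The two regular expressions, the field names, the sample texts, and the rule that pathology reports outrank clinician notes are all hypothetical; the study's actual patterns and precedence order are not given in the summary.

```python
import re

# Hypothetical precedence: earlier sources win when values conflict.
SOURCE_PRECEDENCE = ["pathology_report", "clinician_note"]

PATTERNS = {
    "tumor_size_mm": re.compile(r"tumou?r size[:\s]+(\d+(?:\.\d+)?)\s*mm", re.I),
    "er_status": re.compile(r"\bER[:\s-]*(positive|negative)\b", re.I),
}

def extract(text):
    """Apply each pattern once and return the first match per field."""
    found = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[field] = match.group(1).lower()
    return found

def reconcile(documents):
    """Merge per-source extractions, keeping the highest-precedence value per field."""
    merged = {}
    for source in SOURCE_PRECEDENCE:
        for field, value in extract(documents.get(source, "")).items():
            merged.setdefault(field, (value, source))
    return merged

documents = {
    "clinician_note": "Pt seen today. Tumor size: 18 mm, ER positive.",
    "pathology_report": "Invasive ductal carcinoma. Tumour size 22 mm.",
}
features = reconcile(documents)
```

The conflicting tumor size resolves to the pathology value, while ER status, which appears only in the note, is still recovered.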

A Nonmonotone Gradient-Based Algorithm for Symmetric Nonnegative Matrix Factorization and Graph Clustering

arXiv cs.LG · Ryan Swart, Johannes Brust · 2026-06-01

The authors propose SNMPBB, the first nonmonotone projected Barzilai-Borwein method for symmetric nonnegative matrix factorization (Symmetric NMF), addressing previous limitations of gradient-based approaches. The method extends to graph clustering via graph Laplacian regularization (Graph-SNMPBB) and large-scale problems using low-rank approximations (LAI-SNMPBB). Theoretical analysis proves global convergence to stationary points and preservation of curvature information with randomized approximations. Experiments show SNMPBB achieves 6× speedup over SymANLS on synthetic data with comparable residuals, matches or exceeds SymANLS accuracy on six real-world clustering benchmarks, and outperforms LAI-SymPGNCG on 34 SuiteSparse matrices in runtime and residual quality.

symmetric nmfbarzilai-borwein methodgraph clusteringlow-rank approximationglobal convergence

Scalable Uncertainty Quantification for Extreme Weather Forecasting via Empirical Neural Tangent Kernels

arXiv cs.LG · Jose Marie Antonio Miñoza, Rex Gregor Laylo, Sebastian C. Ibañez · 2026-06-01

This paper introduces Neural Tangent Kernel-based uncertainty quantification (NTK-UQ) for extreme weather forecasting, addressing the lack of uncertainty estimates in deep learning weather models. The method leverages last-layer empirical features and analyzes architecture-dependent UQ quality through variance collapse mechanisms and feature space properties. Theoretical insights reveal that spectral operators require aggressive truncation (k ≤ 10), while attention-based models support full-rank computation. Independent Component Analysis outperforms singular value decomposition by exploiting higher-order statistics for extreme-event feature isolation. NTK-UQ achieves 31-37% sharper prediction intervals at 90% coverage compared to conformal prediction, with adaptive intervals scaling by event severity. The framework operates without retraining, requiring only a single matrix-vector product per sample.

uncertainty quantificationneural tangent kernelextreme weatherindependent component analysisconformal prediction

Are we really tilting? The mechanics of reward guidance in flow and diffusion models

arXiv cs.LG · Sanjit Dandapanthula, Nicholas M. Boffi · 2026-06-01

The paper identifies fundamental causes of reward hacking in reward-guided diffusion models, showing it stems from finite-particle plug-in estimation of the Doob h-function even in simple Gaussian and Gaussian mixture settings. Through closed-form analysis, the authors isolate two failure modes: within-mode reward hacking and poor mode selection. They propose a reward damping schedule to correct within-mode bias without added compute and clarify best-of-n sampling's role in mode selection. Experiments on Gaussian mixtures, 2D checkerboard, and FLUX.1 text-to-image generation validate these insights.

reward hackingdoob h-functiondiffusion modelsgaussian mixturebest-of-n sampling

RRISE: Robust Radius Inference via a Surrogate Estimator

arXiv cs.LG · Jong-Ik Park, Shreyas Chaudhari, Carlee Joe-Wong, José M. F. Moura · 2026-06-01

RRISE introduces a framework for efficient randomized smoothing (RS) by replacing per-input Monte Carlo (MC) sampling with a single forward pass through a learned surrogate model. The surrogate is trained against precomputed MC class-count targets using soft-label cross-entropy loss and calibrated via a conformal step to produce provably conservative certified radii. RRISE achieves deployment-verifiable certificates, matching fixed-budget MC certified accuracy within 0.84 percentage points while reducing computational cost by replacing up to 10^4 base-model evaluations per query. On CIFAR-100 and Tiny ImageNet, RRISE outperforms prior offline-surrogate methods by 1.23x to 1.91x in certified accuracy, enabling practical certified robustness in repeated-deployment settings.

randomized smoothing · surrogate model · certified robustness · conformal calibration · monte carlo sampling
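
To make concrete what the per-input Monte Carlo step costs, the sketch below computes a certified L2 radius from noisy-vote counts in the usual randomized smoothing way (radius = sigma times the inverse normal CDF of a lower bound on the top-class probability). This is the baseline that RRISE is described as replacing, not RRISE itself. The Hoeffding bound is a simplifying assumption (Clopper-Pearson bounds are more common in practice), and the counts are hypothetical.

```python
import math
from statistics import NormalDist

def certified_radius(top_count, n_samples, sigma, alpha=0.001):
    """L2 radius sigma * Phi^{-1}(p_lower) for the top class, or None to abstain."""
    p_hat = top_count / n_samples
    # Hoeffding lower confidence bound (assumption: simpler and looser
    # than the Clopper-Pearson bound commonly used in practice).
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n_samples))
    if p_lower <= 0.5:
        return None  # cannot certify at this confidence level
    return sigma * NormalDist().inv_cdf(p_lower)

# Hypothetical counts: 9,900 of 10,000 noisy copies vote for the top class.
radius = certified_radius(top_count=9900, n_samples=10_000, sigma=0.25)
```

Each certificate here needs `n_samples` base-model evaluations, which is the per-query cost a learned surrogate would avoid.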

Forgetting is Not Erasure: Recovering Latent Knowledge via Transport Keys

arXiv cs.LG · Archie Chaudhury · 2026-06-01

The paper challenges the view that catastrophic forgetting in continual learning primarily stems from feature erasure, demonstrating instead that interface drift between network stages accounts for significant performance drops. The authors introduce transport keys—compact interface-alignment operators learned from paired anchor activations—to recover latent knowledge via model stitching. Experiments on split CIFAR-100 with ResNet-style and vision transformer architectures show transport keys restore most original Task A performance after sequential Task B training, suggesting continual learning requires better computation indexing rather than just weight-change prevention.

catastrophic forgetting · interface drift · transport keys · model stitching · continual learning

GRZO: Group-Relative Zeroth-Order Optimization for Large Language Model Fine-Tuning

arXiv cs.LG · Liyan Tan, Yequan Zhao, Yifan Yang, Ruijie Zhang · 2026-06-01

GRZO introduces group-relative zeroth-order optimization for memory-efficient LLM fine-tuning, addressing high variance in gradient estimation by drawing one pseudo-independent perturbation per mini-batch example and aggregating losses via group-relative normalization. This method increases effective gradient-direction count to batch size without extra forward passes, maintaining inference-level memory usage. Theoretical analysis shows directional unbiasedness and variance reduction proportional to batch size, yielding tighter convergence bounds than MeZO. Experiments on RoBERTa-large, Llama3-8B, and OPT-13B demonstrate +3.0 accuracy improvement over MeZO on Llama3-8B with 23% lower GPU memory, and +6.0 average gain when applied to sparse/low-rank/quantized ZO variants.

zeroth-order optimization · gradient estimation · group-relative normalization · memory-efficient fine-tuning · nonconvex convergence
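
The sketch below is one possible reading of the summary above, not the paper's algorithm: each mini-batch example gets its own Gaussian perturbation, the resulting losses are standardized within the batch, and the standardized losses weight the perturbation directions. The function names, the toy quadratic loss, and all hyperparameters are assumptions made for illustration.

```python
import random
import statistics

def zo_group_relative_step(theta, loss_fn, batch, eps=1e-3, lr=0.05, rng=random):
    """One zeroth-order update: one Gaussian direction per example,
    losses standardized within the batch before weighting directions."""
    directions, losses = [], []
    for example in batch:
        z = [rng.gauss(0.0, 1.0) for _ in theta]
        perturbed = [t + eps * zi for t, zi in zip(theta, z)]
        directions.append(z)
        losses.append(loss_fn(perturbed, example))  # forward pass only
    mean = statistics.fmean(losses)
    std = statistics.pstdev(losses) + 1e-12
    # Group-relative weights: an above-average loss pushes away from its direction.
    weights = [(loss - mean) / std for loss in losses]
    grad = [sum(w * z[j] for w, z in zip(weights, directions)) / len(batch)
            for j in range(len(theta))]
    return [t - lr * g for t, g in zip(theta, grad)]

def squared_error(params, target):
    return sum((p - target) ** 2 for p in params)

rng = random.Random(0)
theta = [1.0] * 5
batch = [0.0] * 8  # eight identical toy examples with target 0
initial = squared_error(theta, 0.0)
for _ in range(300):
    theta = zo_group_relative_step(theta, squared_error, batch, rng=rng)
final = squared_error(theta, 0.0)
```

No gradients are stored and each example costs one forward pass, which is in line with the memory claim in the summary. Because the weights are standardized, the step length in this toy version does not shrink near the optimum.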

RESCAST-100K: A Comprehensive Dataset for Cross-Domain Residential Load and Indoor Temperature Forecasting

arXiv cs.LG · Jainam Dhruva, Yousaf Raza, A. B. Siddique, Simone Silvestri · 2026-06-01

The paper introduces RESCAST-100K, a large-scale benchmark for cross-domain residential load and temperature forecasting, addressing data scarcity in existing datasets. It features 100,000 EnergyPlus-simulated U.S. homes with 15-minute time series data for load, HVAC, and temperature, plus weather and building covariates. The benchmark supports evaluation of transfer learning and domain adaptation across geography, climate, and construction types. Cross-attention and MLP-mixer models outperform recurrent and transformer baselines in zero-shot domain generalization tests. The dataset integrates five real-world residential datasets for sim-to-real validation.

residential forecasting · domain adaptation · energyplus · mlp-mixer · sim-to-real

A Systematic Evaluation of Current Architectures in Wind Power Forecasting

arXiv cs.LG · Vinicius Bortolini, Gilson Adamczuk Oliveira, Erick Oliveira Rodrigues, Matheus Henrique Dal Molin Ribeiro · 2026-06-01

This systematic literature review evaluates hybrid architectures for interval wind power forecasting, combining deep learning, modal decomposition, and statistical methods to address wind uncertainty. Using Latent Dirichlet Allocation (LDA) for topic modeling, the study analyzes patterns in approaches where techniques like Variational Mode Decomposition (VMD) and Ensemble Empirical Mode Decomposition (EEMD) decompose input data into frequency components, fed into models such as LSTM or ELM for bound-specific prediction. Results show improved accuracy and narrower intervals without coverage loss, though challenges persist in standardization, computational cost, and real-world validation.

interval forecasting · latent dirichlet allocation · variational mode decomposition · ensemble empirical mode decomposition · long short-term memory

Spectral-Progressive Thought Flow for Lightweight Multimodal Reasoning

arXiv cs.LG · Yixian Shen, Zhiheng Yang, Qi Bi, Changshuo Wang · 2026-06-01

The paper introduces Spectral-Progressive Thought Flow (SpecFlow), a lightweight multimodal reasoning framework that reduces computational overhead by representing intermediate visual thoughts in a fixed-size discrete cosine space. The method leverages energy compaction to maintain global layout while progressively adding high-frequency details, and uses classifier-free guidance to align visual state updates with textual reasoning traces. Experiments demonstrate that SpecFlow achieves competitive reasoning performance while reducing computation and KV cache costs by up to 2.1× compared to baseline approaches.

multimodal reasoning · discrete cosine space · classifier-free guidance · kv cache · energy compaction
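
Energy compaction in a discrete cosine basis can be illustrated in one dimension. The sketch below is only an illustration of the idea (SpecFlow operates on visual states, and its actual transform and coefficient budget are not given in the summary); the signal and the `budget` parameter are assumptions.

```python
import math

def dct2(signal):
    """Unnormalized DCT-II of a 1D sequence."""
    n = len(signal)
    return [sum(x * math.cos(math.pi / n * (i + 0.5) * k)
                for i, x in enumerate(signal))
            for k in range(n)]

def idct2(coeffs):
    """Inverse of dct2 (a scaled DCT-III)."""
    n = len(coeffs)
    return [coeffs[0] / n
            + 2.0 / n * sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k)
                            for k in range(1, n))
            for i in range(n)]

def keep_low_frequencies(coeffs, budget):
    """Zero every coefficient past the first `budget` (a fixed-size state)."""
    return [c if k < budget else 0.0 for k, c in enumerate(coeffs)]

# Hypothetical smooth signal: an offset plus the first cosine basis function,
# so all of its energy sits in the first two coefficients.
signal = [math.cos(math.pi * (i + 0.5) / 16) + 0.5 for i in range(16)]
coarse = idct2(keep_low_frequencies(dct2(signal), budget=2))
max_error = max(abs(a - b) for a, b in zip(signal, coarse))
```

Raising `budget` step by step restores higher-frequency detail while the representation size stays bounded by the budget.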

Learning Coherent Representations: A Topological Approach to Interpretability

arXiv cs.LG · Sigurd Gaukstad, Melvin Vaupel, Valdemar Kargård Olsen, Erik Hermansen · 2026-06-01

The paper introduces coherence, a geometric property for interpretable neural representations, inspired by neural coding in biological systems. Coherence ensures features activate on contiguous regions of state space, unlike traditional sparse activations. The authors propose Coh, a differentiable objective based on Fréchet variance, to enforce coherence during training. Theoretical analysis shows coherent matrices induce compatible topological structure between samples and features. Experiments on synthetic data, rotated MNIST, and BERT token embeddings demonstrate that coherence yields interpretable features and feature spaces. The approach contrasts with sparsity by emphasizing geometric connectivity over mere activation rarity.

coherence · interpretability · topological · fréchet variance · auto-encoder

Mitigating Spurious Correlations with Memorization-Guided Dataset De-Biasing

arXiv cs.LG · Arda Fazla, Abolfazl Hashemi · 2026-06-01

The paper introduces a two-stage sample scoring function to mitigate spurious correlations in datasets by disentangling the learning dynamics of core and spurious features. The method evaluates feature difficulty separately and prioritizes informative samples, enabling effective dataset de-biasing without requiring group labels. Experiments show that training a standard ERM model on the selected samples (as little as 10% of the original data) outperforms state-of-the-art debiasing techniques.

spurious correlations · dataset de-biasing · sample scoring · erm model · learning dynamics

Qift: Shift-Friendly No-Zero W2 Post-Training Quantization for Rotated W2A4/KV4 LLM Inference

arXiv cs.LG · Chi-Wei Huang, Chia-Chi Tsai · 2026-06-01

Qift introduces a shift-friendly, no-zero W2 post-training quantization method for rotated W2A4/KV4 LLM inference, addressing the collapse of standard W2 level sets under aggressive quantization. The method leverages a Hadamard-rotated quantization pipeline, revealing that pretrained weights in LLaMA-2-7B and LLaMA-3.1-8B are nearly zero-centered and Gaussian-like. Qift proposes fixed no-zero W2 level sets {+/-0.5, +/-1.5} and {+/-1, +/-4}, optimizing inner/outer centroid ratios. Results show consistent improvements in perplexity, downstream accuracy, and GPTQ residual behavior, narrowing the gap to W3A4 while maintaining half the transformer layers at two-bit precision.

quantization · llama · hadamard · perplexity · transformer
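
A small sketch of what a no-zero 2-bit level set means in practice: every weight is mapped to one of four nonzero levels, so small weights never collapse to exactly zero. The two level sets come from the summary above; the nearest-level rounding rule, the `scale` value, and the sample weights are assumptions, and the Hadamard rotation and GPTQ steps are not shown.

```python
LEVELS_UNIFORM = (-1.5, -0.5, 0.5, 1.5)   # {+/-0.5, +/-1.5} from the summary
LEVELS_SHIFT = (-4.0, -1.0, 1.0, 4.0)     # {+/-1, +/-4}: powers of two

def quantize_w2(weights, levels, scale):
    """Map each weight to the nearest level (in units of `scale`).
    Returns the 2-bit codes and the dequantized values."""
    codes = []
    for w in weights:
        scaled = w / scale
        codes.append(min(range(len(levels)),
                         key=lambda i: abs(levels[i] - scaled)))
    dequantized = [levels[c] * scale for c in codes]
    return codes, dequantized

weights = [0.02, -0.04, 0.24, -0.40]  # hypothetical, roughly zero-centered
codes, deq = quantize_w2(weights, LEVELS_UNIFORM, scale=0.2)
```

With the `LEVELS_SHIFT` set, every level is a signed power of two, so dequantization could in principle be done with bit shifts instead of multiplications.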

Cosmos 3: Omnimodal World Models for Physical AI

arXiv cs.LG · Aditi, Niket Agarwal, Arslan Ali, Jon Allen · 2026-06-01

Cosmos 3 introduces a family of omnimodal world models capable of jointly processing and generating language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. The framework supports flexible input-output configurations, unifying vision-language models, video generators, world simulators, and world-action models into a single scalable backbone for Physical AI. Evaluations demonstrate state-of-the-art performance across diverse understanding and generation tasks, with Cosmos 3 ranked as the best open-source Text-to-Image and Image-to-Video model by Artificial Analysis and the best policy model by RoboArena. The project releases code, model checkpoints, synthetic datasets, and benchmarks under the OpenMDW-1.1 License.

omnimodal · mixture-of-transformers · physical ai · world simulators · openmdw-1.1

Neutrino Fingerprints: Image-Based Encodings of IceCube Events for CNN Direction Reconstruction

arXiv cs.LG · Floriano Tori, Brecht Verbeken, Vincent Ginis · 2026-06-01

The paper introduces neutrino fingerprints, compact 72×72×3 image representations encoding IceCube detector data as color channels for convolutional processing. A ResNet18 model processes these transformed sparse pulse data, achieving 1.10 rad mean angular error in neutrino direction reconstruction. This approach rivals complex architectures while providing interpretable baselines for IceCube event analysis, demonstrated on 140 million simulated events from the IceCube-Neutrinos Kaggle competition.

neutrino fingerprints · icecube · resnet18 · angular error · convolutional processing

QUIVER: Quantum-Informed Views for Enhanced Representations in Large ML Models

arXiv cs.LG · Aritra Bal, Michael Binder, Markus Klute, Benedikt Maier · 2026-06-01

The paper introduces QUIVER, a method to enhance classical machine learning models by incorporating quantum-geometric features derived from variational quantum circuits (VQCs). The approach uses the quantum Fisher information matrix to capture higher-order correlations and intrinsic geometric structure, providing a complementary modality to classical features. Experiments on QM9 (molecule property prediction) and JetClass (LHC jet flavor classification) show measurable performance improvements, demonstrating the utility of quantum-informed views even without fault-tolerant quantum hardware.

quantum fisher information · variational quantum circuit · feature augmentation · quantum-geometric features · multimodal learning

One Transit Is All You Need: Detecting Exoplanets Through Learned Stellar Behaviour with EXOVEIL

arXiv cs.LG · Pratik Priyanshu · 2026-06-01

EXOVEIL introduces a novel transit detection system capable of identifying exoplanets from single-transit events, overcoming limitations of existing phase-folded methods. The system employs a Transformer-based world model trained on 16,499 Kepler light curves using transit-masked self-supervised learning, coupled with a matched-filter detector and XGBoost classifier for false positive reduction. It achieves AUC 0.938 on Kepler DR25, recovers 32% of single transits at 1000 ppm depth, and identifies 179 new transit-like signals in Kepler data, including 46 monotransit candidates. Zero-shot transfer to TESS data yields 100% recovery of confirmed planets, with detection sensitivity reaching 100 ppm at PLATO cadence. Conformal prediction ensures 95.9% empirical coverage.

transformer · self-supervised learning · matched-filter · conformal prediction · monotransit

Hybrid Adaptive Kalman Filtering for Data-Efficient Joint Tracking and Classification

arXiv cs.LG · Jiho Lee, Nisar R. Ahmed, Rebecca Russell · 2026-06-01

The authors propose a self-supervised Hybrid Adaptive Kalman Filter that learns structured corrections to system dynamics and process noise covariance from measurements alone while preserving probabilistic consistency. The method combines model-based Kalman filtering with learned adaptations, enabling computation of innovation likelihood for model classification via generalized Bayesian inference. Experiments on real-world and simulated datasets show improved estimation accuracy, statistical consistency, and robust classification performance in both low-data and large-data regimes.

kalman filtering · self-supervised learning · process noise covariance · generalized bayesian inference · model classification
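
The link between filtering and classification in the summary above rests on the innovation likelihood. The sketch below shows that idea with a plain scalar Kalman filter and two hand-picked candidate models; it does not include the learned corrections or the generalized Bayesian inference the authors describe, and the model parameters and measurements are hypothetical.

```python
import math

def kalman_step(x, p, z, a=1.0, q=0.0, r=1.0):
    """One scalar predict/update cycle.
    Returns posterior mean, posterior variance, innovation log-likelihood."""
    x_pred = a * x                 # predicted state
    p_pred = a * a * p + q         # predicted variance
    innovation = z - x_pred
    s = p_pred + r                 # innovation variance
    gain = p_pred / s
    x_new = x_pred + gain * innovation
    p_new = (1.0 - gain) * p_pred
    log_lik = -0.5 * (math.log(2.0 * math.pi * s) + innovation ** 2 / s)
    return x_new, p_new, log_lik

def classify(measurements, models):
    """Pick the model whose filter best explains the data, measured by
    the summed innovation log-likelihood."""
    scores = {}
    for name, params in models.items():
        x, p, total = 0.0, 1.0, 0.0
        for z in measurements:
            x, p, log_lik = kalman_step(x, p, z, **params)
            total += log_lik
        scores[name] = total
    return max(scores, key=scores.get), scores

# Hypothetical dynamics classes and a measurement track that doubles each step.
models = {"static": {"a": 1.0, "q": 0.01}, "growing": {"a": 2.0, "q": 0.01}}
label, scores = classify([1.0, 2.0, 4.0, 8.0, 16.0], models)
```

The model whose predictions leave small innovations accumulates the higher log-likelihood, so the doubling track is assigned to the `growing` class.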

Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models

arXiv cs.LG · Alexander Guha · 2026-06-01

The paper introduces a framework for estimating the representational capacity of transformer language models based on geometric constraints in latent space. By analyzing pairwise cosine similarity distributions across embedding matrices, the authors quantify the deviation ε from perfect orthogonality and identify two model classes: those maintaining near-orthogonal structure (low ε) and those lacking it (high ε). They derive an adjusted capacity formula showing exponential sensitivity to ε, revealing that larger models prioritize tighter orthogonality over raw capacity. The modified formula reduces prediction error by 100× without additional parameters.

representational capacity · near-orthogonality · embedding matrix · cosine similarity · johnson-lindenstrauss lemma
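
A rough sketch of measuring deviation from orthogonality through pairwise cosine similarity follows. Defining ε as the largest absolute off-diagonal cosine is an assumption made here for illustration; the paper may use a different statistic of the similarity distribution. The matrices are toy examples.

```python
import math
from itertools import combinations

def orthogonality_gap(embeddings):
    """Largest |cosine similarity| over all distinct row pairs."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))
    return max(abs(cosine(u, v)) for u, v in combinations(embeddings, 2))

basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
skewed = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # three vectors in two dimensions
eps_basis = orthogonality_gap(basis)
eps_skewed = orthogonality_gap(skewed)
```

Packing three vectors into two dimensions forces a nonzero gap, which is the trade-off between feature count and orthogonality that the summary describes.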

📰 Industry Media (14)

How to Build a Document Intelligence Backend with iii Using Workers, Functions, and Cron Triggers

MarkTechPost · Sana Hassan · 2026-06-03

The tutorial demonstrates building a document intelligence backend using the iii engine, implementing modular text processing through worker functions and multiple invocation methods. Key components include installing the iii engine (v0.1.0), registering Python functions for text normalization, tokenization (preserving alphanumeric tokens), sentiment analysis (using predefined word lists), and keyword extraction (via frequency counting). The system supports several execution modes: direct invocation (3 sample documents processed), HTTP endpoints (POST /analyze), fire-and-forget triggers, and cron-based heartbeats (2-second intervals), while maintaining shared state for aggregate statistics.

document intelligence · worker functions · cron triggers · sentiment analysis · keyword extraction
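
The text-processing steps listed in the summary above can be sketched in plain Python. This deliberately leaves out the iii engine, since its registration and trigger API is not described here; the word lists, function names, and the `min_len` filter are all illustrative assumptions.

```python
import re
from collections import Counter

POSITIVE = {"good", "great", "excellent"}   # hypothetical word lists
NEGATIVE = {"bad", "poor", "terrible"}

def normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def tokenize(text):
    """Keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text)

def sentiment(tokens):
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def keywords(tokens, top_n=3, min_len=4):
    """Most frequent tokens of at least `min_len` characters."""
    counts = Counter(t for t in tokens if len(t) >= min_len)
    return [word for word, _ in counts.most_common(top_n)]

def analyze(text):
    tokens = tokenize(normalize(text))
    return {"tokens": len(tokens), "sentiment": sentiment(tokens),
            "keywords": keywords(tokens)}

report = analyze("Great parser, great docs.  The parser API is good!")
```

In the tutorial's setting, a function like `analyze` would be the unit registered with the engine and then reached through direct calls, HTTP, or scheduled triggers.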

Google DeepMind Releases Gemma 4 12B: An Encoder-Free Multimodal Model with Native audio that runs on a 16 GB laptop

MarkTechPost · Asif Razzaq · 2026-06-03

Google DeepMind introduces Gemma 4 12B, a 12-billion-parameter decoder-only multimodal transformer that eliminates traditional encoders for vision and audio processing. The model directly ingests raw images (via 48×48 patches with coordinate-based position encoding) and audio (16 kHz frames projected into token space), achieving unified weight updates during fine-tuning. Benchmarks show performance approaching the 26B MoE variant at half the memory footprint, enabling local execution on 16GB devices. The Apache 2.0-licensed model supports text, image, video, and native audio inputs, with demonstrated capabilities in ASR, diarization, and agentic workflows.

decoder-only · multimodal · unified memory · rope · conformer

Nous Research Releases Hermes Desktop: A Native Cross-Platform Front End for Hermes Agent v0.15.2 with Streaming Tool Output

MarkTechPost · Michal Sutter · 2026-06-03

Nous Research introduces Hermes Desktop, a native cross-platform GUI for Hermes Agent v0.15.2, enabling no-terminal interaction with streaming tool output and persistent session state. The desktop integrates with existing CLI and gateway configurations, sharing memory, skills, and API keys. It features a closed learning loop for skill self-improvement, sandboxed execution across five backends (local, Docker, SSH, Singularity, Modal), and supports multi-platform task continuation via messaging gateways. The MIT-licensed, model-agnostic system includes built-in tools like web search, image generation, and text-to-speech, accessible through the Model Context Protocol (MCP).

closed learning loop · model context protocol · sandboxed execution · persistent memory · streaming tool output

NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation

MarkTechPost · Asif Razzaq · 2026-06-03

NVIDIA introduces Cosmos 3, a unified Mixture-of-Transformers (MoT) foundation model integrating physical reasoning, world generation, and action generation. The architecture comprises two towers: an autoregressive vision-language model (VLM) for reasoning and a diffusion-based generator for physics-aware video and action sequences. Cosmos 3-Nano (16B) targets workstation GPUs, while Cosmos 3-Super (64B) scales to datacenter GPUs. Evaluations show state-of-the-art performance on VANTAGE-Bench, TAR, R-Bench, and Artificial Analysis leaderboards. The release includes open-source checkpoints, datasets, and training scripts under the OpenMDW-1.1 license.

mixture-of-transformers · diffusion-based · vision-language model · physics-aware · autoregressive

How to Fine-Tune LFM2 Using QLoRA and DPO: A Complete Step-by-Step Coding Tutorial on Google Colab

MarkTechPost · Sana Hassan · 2026-06-03

The tutorial presents a complete pipeline for fine-tuning Liquid AI's 1.2B-parameter LFM2 model using QLoRA and Direct Preference Optimization (DPO) on Google Colab. It demonstrates a three-stage workflow: (1) loading the base model with 4-bit quantization via BitsAndBytes, (2) supervised fine-tuning (SFT) using a 500-sample chat dataset with LoRA adapters (r=16), and (3) optional DPO training with 40 steps on human preference data. Results show measurable improvements in response quality, with final model checkpoints achieving 1024-token context handling while maintaining GPU memory efficiency through gradient checkpointing and BF16 mixed precision.

qlora · dpo · lora adapters · bitsandbytes · trl
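
As a back-of-the-envelope illustration of why rank-16 LoRA adapters keep fine-tuning cheap, the sketch below counts trainable parameters for one adapted weight matrix. The 2048 by 2048 projection size is a hypothetical stand-in, not LFM2's actual layer shape.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the low-rank pair A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

d_in = d_out = 2048  # hypothetical projection size, not LFM2's actual shape
full = d_in * d_out
adapter = lora_trainable_params(d_in, d_out, rank=16)
fraction = adapter / full
```

Under these assumed dimensions the adapter trains about 1.6% of the parameters of the matrix it modifies, while the frozen base weights can stay in 4-bit form.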

TinyFish Launches BigSet: An Open-Source Multi-Agent System That Builds Structured Live Datasets from Plain-English Descriptions

MarkTechPost · Asif Razzaq · 2026-06-02

BigSet introduces an open-source multi-agent system for automated dataset construction from natural language descriptions. The system employs a two-tier architecture: Claude Sonnet infers schema specifications, while Qwen orchestrates web discovery and sub-agent execution for data extraction. Agents operate within a constrained tool budget (≤6 calls per agent) and perform deduplication with source attribution. The pipeline generates structured datasets (CSV/XLSX) in 2-5 minutes, supporting scheduled refreshes. Implementation utilizes Next.js, Fastify, Convex, and OpenRouter APIs, with security enforced through capability boundaries in workflow infrastructure.

multi-agent system · schema inference · deduplication · capability boundary · tool budget

Alibaba’s Qwen Team Launches Qwen3.7-Plus, Adding Vision, Deep Reasoning, Tool Invocation, and Autonomous Iteration on the Bailian Platform

MarkTechPost · Michal Sutter · 2026-06-02

Alibaba's Qwen team introduces Qwen3.7-Plus, a multimodal LLM with image/video understanding and agentic capabilities, deployed via the Bailian platform. The model combines visual input processing (ranked #16 on Vision Arena) with five novel functionalities: deep reasoning, self-programming, tool invocation, verification/testing, and autonomous iteration. Bailian's Agentic RL mechanism refines performance through execution feedback, while safety guardrails constrain autonomous operations. The text-only Qwen3.7-Max variant achieved 56.6 on the Artificial Analysis Intelligence Index, marking a competitive Chinese model release.

multimodal llm · agentic rl · vision arena · tool invocation · autonomous iteration

JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines

MarkTechPost · Asif Razzaq · 2026-06-02

JetBrains released Mellum2, a 12B parameter Mixture-of-Experts (MoE) model specialized for software engineering tasks, with 2.5B active parameters per token. The model features a 131K token context window, grouped-query attention, and multi-token prediction for speculative decoding. Pre-trained on 10.6T tokens via a three-phase curriculum, it achieves 78.4 on EvalPlus and 66.3 on BFCL v3 benchmarks. Designed as a component in multi-model pipelines, Mellum2 excels in routing, RAG summarization, and agent sub-tasks while maintaining low-latency inference. Released under Apache 2.0, it supports vLLM deployment with optional tool-calling integration.

moe · gqa · yarn · rag · vllm

How to Speed Up Transformer Training Using NVIDIA Apex (FusedAdam, FusedLayerNorm) and Native torch.amp

MarkTechPost · Sana Hassan · 2026-06-02

This tutorial evaluates NVIDIA Apex's fused optimizers and normalization layers for accelerating Transformer training on GPUs. It benchmarks FusedAdam against PyTorch AdamW, compares FusedLayerNorm and FusedRMSNorm with standard PyTorch layers, and integrates these components with mixed-precision training using torch.amp. Results show FusedAdam achieves ~1.2x speedup on optimizer-bound steps, while FusedLayerNorm reduces forward-backward pass time by ~1.3x. End-to-end Transformer training with fused components and mixed precision yields a 1.2x throughput increase over vanilla FP32 PyTorch.

fusedadam · fusedlayernorm · mixed-precision · transformer · torch.amp

How E.ON uses SAP S/4HANA to modernise the grid with AI

AI News · Ryan Daws · 2026-06-03

E.ON demonstrates how SAP S/4HANA enables AI-driven grid modernization through enterprise data standardization, achieving a 77% reduction in IT downtime over five years. The utility migrated from legacy ERP systems to an in-memory database architecture, facilitating real-time telemetry processing for predictive maintenance and customer service automation. By deprecating isolated innovation hubs and adopting a BizDevOps model with 1,000+ in-house specialists (including 500 data experts), E.ON embedded ML applications directly into core systems serving 47 million users while maintaining cybersecurity controls.

sap s/4hana · predictive maintenance · in-memory database · bizdevops · operational technology

Walmart’s AI workflows meet the realities of the balance sheet

AI News · Joe Green · 2026-06-03

Walmart implemented token-based usage limits for its internal LLM-powered AI assistant, Code Puppy, to control escalating costs from per-token billing models. The tool, initially deployed without restrictions for tasks like spreadsheet analysis and presentation creation, faced excessive demand across its 2.1 million employees. Walmart now enforces per-employee token allocations and provides guidance on selecting appropriate AI tools for specific tasks. This shift reflects broader enterprise challenges in balancing AI-driven productivity gains against operational costs, exacerbated by recursive model usage and multi-agent workflows. The policy aims to optimize ROI while curbing inefficient practices like 'token maxxing' and frontier model overuse for trivial tasks.

llm · token-based · multi-agent · recursive · productivity

Microsoft’s Majorana 2 quantum chip is also a case study for agentic AI in R&D

AI News · Dashveenjit Kaur · 2026-06-03

Microsoft's Majorana 2 quantum chip demonstrates a 1,000x improvement in qubit reliability over its predecessor, with a mean qubit lifetime of 20 seconds, enabled by agentic AI in R&D. The Microsoft Discovery platform automated workflows, accelerated qubit measurements from weeks to real-time, and synthesized correlations across decades of siloed data. Key innovations include switching superconducting materials from aluminum to lead and leveraging AI for parallel voltage adjustments across hundreds of parameters. The platform, now generally available, integrates specialized AI agents, a Discovery Engine, and enterprise-grade security. Microsoft's quantum roadmap has accelerated, targeting commercial scalability by 2029.

qubit · agentic ai · superconducting · workflow automation · quantum chip

Anthropic IPO filing marks AI maturing into enterprise utility

AI News · Ryan Daws · 2026-06-02

Anthropic's IPO filing signals generative AI's transition from research to enterprise utility, emphasizing structured pricing and corporate adoption. The move pressures model providers to balance capital expenditures with public market demands, potentially triggering industry consolidation. Enterprise contracts become critical for revenue growth, while consumer markets remain insufficient for cost recovery. Public valuation frameworks may reshape venture-backed tech companies' approaches to capital markets.

ipo · enterprise utility · capital expenditures · public valuation · consolidation

GitHub Copilot users see token-based price hikes

AI News · Joe Green · 2026-06-02

GitHub Copilot transitioned to token-based billing on June 1, 2026, introducing a credit system tied to AI model usage. Users receive monthly credits based on subscription tiers (e.g., 3,900 credits for Copilot Enterprise at $39/month), with tokens consumed per inference task. Token costs vary by model, e.g., $1.75 per million input tokens for ChatGPT-5.2, $14 per million output tokens, and $0.175 per million cached tokens. Early user feedback indicates rapid credit depletion, with some reporting costs of ~$0.35 per line update. This shift reflects the high operational costs of LLMs, prompting users to reassess ROI and explore alternative platforms.

token-based billing · llm · inference task · cached tokens · credit system
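
The per-token prices quoted above translate into request costs as sketched below. The credit value of $0.01 is an assumption inferred from the 3,900 credits at $39/month figure, and the token counts are a hypothetical request, not measured usage.

```python
PRICE_PER_MILLION_USD = {"input": 1.75, "output": 14.0, "cached": 0.175}
USD_PER_CREDIT = 39.0 / 3900  # assumption inferred from the Enterprise tier

def request_cost_usd(input_tokens, output_tokens, cached_tokens=0):
    """Cost of one request at the quoted per-million-token prices."""
    usage = {"input": input_tokens, "output": output_tokens,
             "cached": cached_tokens}
    return sum(PRICE_PER_MILLION_USD[kind] * count / 1_000_000
               for kind, count in usage.items())

# Hypothetical request: 20k fresh input, 80k cached input, 2k output tokens.
cost = request_cost_usd(input_tokens=20_000, output_tokens=2_000,
                        cached_tokens=80_000)
credits = cost / USD_PER_CREDIT
requests_per_month = int(3900 // credits)
```

Under these assumptions a single request costs about $0.077, so the monthly allotment covers roughly 500 such requests, which helps explain the reports of rapid credit depletion.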


Generated automatically at 2026-06-03 22:20 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.