Daily Digest — 2026-07-03

Thursday, July 02, 2026 · 248 items · model: deepseek/deepseek-chat

248 items · 246 arXiv papers, 2 industry media items

⚠️ Source issues today:
  • MarkTechPost: all feed URLs failed (last tried: https://www.marktechpost.com/feed/)
  • AI News: all feed URLs failed (last tried: https://artificialintelligence-news.com/feed/)

🏛️ Research Labs

No new items today.

📜 arXiv Papers (246)

Measuring the Gap Between Human and LLM Research Ideas

arXiv cs.AI · Ziyu Chen, Yilun Zhao, Arman Cohan · 2026-07-01

The study quantifies the divergence between LLM-generated and human research ideas by analyzing their distribution across a two-axis research-taste taxonomy (opportunity patterns and research paradigms). Using a framework built from high-quality human papers and their inspirations, the authors prompt LLMs to generate ideas from paper titles and summaries. Results show LLM ideas are disproportionately concentrated around bridge-like opportunities and synthesis methods, while human ideas exhibit broader framing and contribution patterns, indicating a systematic gap in research taste.

llm ideation · research taste taxonomy · opportunity patterns · synthesis methods · distributional gap

Language-Critique Imitation Learning from Suboptimal Demonstrations

arXiv cs.AI · Chih-Han Yang, Dai-Jie Wu, Yun-Ping Huang, Ping-Chun Hsieh · 2026-07-01

The authors propose a language-critique framework for imitation learning from suboptimal demonstrations that uses natural language as a structured supervision signal instead of compressing feedback into a scalar. The method constructs language labels that describe progress, identify suboptimal behaviors, and provide corrective guidance, then introduces a language-critique loss for training policies directly on these signals, instantiated as LC-BC and LC-DP. Theoretical analysis shows that the objective upper-bounds the performance gap to the expert. On continuous control tasks, the method consistently outperforms imitation learning and offline reinforcement learning baselines, supporting language as an effective supervision signal for learning robust policies from suboptimal data.

language-critique · imitation learning · suboptimal demonstrations · structured supervision · continuous control

AutoMem: Automated Learning of Memory as a Cognitive Skill

arXiv cs.AI · Shengguang Wu, Hao Zhu, Yuhui Zhang, Xiaohan Wang · 2026-07-01

AutoMem introduces a framework that treats memory management in LLMs as a trainable cognitive skill, automating both memory structure optimization (a strong LLM revises prompt and file schemas) and memory proficiency improvement (self-supervised learning from successful agent trajectories). The method elevates file-system operations to first-class memory actions, letting models decide their own encoding and retrieval strategies. Evaluated on Crafter, MiniHack, and NetHack, AutoMem improved base-agent performance by 2×–4×, making a 32B open-weight model competitive with Claude Opus 4.5 and Gemini 3.1 Pro.

metamemory · long-horizon tasks · memory proficiency · file-system operations · self-supervised learning

Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

arXiv cs.AI · Ben Slivinski, Michael Saldivar · 2026-07-01

Theoria is a verification architecture for AI-generated answers: it rewrites each solution into auditable state transitions with explicit justifications, so that every change between reasoning states must be accounted for. Unlike opaque LLM judges, it produces human-readable proof traces in which each step can be independently challenged. On HLE-Verified Gold (185 problems), Theoria achieves 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%]) and outperforms holistic LLM judges in detecting hidden premises (90.6% vs. 62.5%) and fabricated citations (100% vs. 90%). On GPQA Diamond (n=65), certified precision reaches 97.1% (Wilson CI [85.1%, 99.5%]).

verification architecture · state transitions · proof trace · hidden premises · certified precision
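The precision figures above are reported with Wilson score intervals. A minimal sketch of that interval for a binomial proportion follows; z = 1.96 is the usual choice for roughly 95% coverage, and the counts in the usage line are purely illustrative, not taken from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    # The Wilson centre is pulled toward 0.5, unlike the naive p_hat +/- margin.
    center = (p_hat + z2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    return center - half_width, center + half_width
```

For example, `wilson_interval(96, 105)` returns about (0.845, 0.954). Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly near 100% precision, which is why it suits small evaluation sets like these.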

The State-Prediction Separation Hypothesis

arXiv cs.AI · Giovanni Monea, Nathan Godey, Kianté Brantley, Yoav Artzi · 2026-07-01

The authors propose the state-prediction separation hypothesis: disentangling token prediction from state storage improves Transformer-based language modeling. They design a Transformer variant with two computation streams that separate these functions and run pretraining experiments at multiple scales. Results show consistent improvements in data and compute efficiency, with lower validation loss and 2–3 percentage-point gains on downstream tasks compared to standard Transformers. Empirical analysis confirms that the design produces fundamentally different gradients and rules out potential confounders.

transformer · state-prediction separation · language modeling · pretraining · gradients

FurnitureVLA: Learning Long-Horizon Bimanual Furniture Assembly with Vision-Language-Action Model

arXiv cs.AI · Chenyang Ma, Yue Yang, Radu Corcodel, Siddarth Jain · 2026-07-01

FurnitureVLA presents the first systematic study of real-scale bimanual furniture assembly using Vision-Language-Action models (VLAs). The authors formalize the task, develop a scalable simulation pipeline for expert data generation, and use a VR teleoperation system to collect high-quality real-world demonstrations. To handle long-horizon assembly with up to 7 subtasks and 1,550 control steps, they propose a progress-enhanced VLA, fine-tuned on semantically grounded subtasks, that jointly predicts actions and a continuous progress signal. This approach raises simulation success from 48% to 80% across three furniture types, with an additional 21% gain from design-factor optimization. Validation on a Kinova Gen3 platform shows only a 16% drop on the hardest task.

bimanual assembly · vision-language-action · progress-enhanced vla · vr teleoperation · long-horizon tasks

Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

arXiv cs.AI · Zhi Chen, Zhensu Sun, Yuling Shi, David Lo · 2026-07-01

The study audits the reliability of repository-level performance-optimization benchmarks for coding agents (GSO, SWE-Perf, SWE-efficiency), analyzing 740 tasks across four machine types. Key findings: only 39/102 GSO, 11/140 SWE-Perf, and 411/498 SWE-efficiency reference patches consistently meet the validity rules; SWE-Perf is fragile because of near-zero runtime changes; leaderboard rankings produced by different scoring rules disagree on 9 of 28 pairwise comparisons; and 85.3% of valid tasks (384/450) have at least one submission that matches or beats the reference patch. The analysis also highlights benchmark-specific biases and quantifies how much each task contributes to the overall score.

performance-optimization benchmarks · coding agents · runtime instability · leaderboard scoring · reference patches
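Since 28 is the number of pairs among eight entries (8·7/2), the "9 of 28" finding can be read as discordant pairs between two rankings of eight submissions; that reading is our assumption, and the paper may count differently. A minimal sketch of such a count, with hypothetical agent names and scores:

```python
from itertools import combinations


def pairwise_disagreements(scores_a: dict[str, float],
                           scores_b: dict[str, float]) -> tuple[int, int]:
    """Count agent pairs that two scoring rules order in opposite directions.

    Returns (disagreeing_pairs, total_pairs) over agents scored by both rules.
    Ties under either rule are not counted as disagreements (an assumption).
    """
    agents = sorted(scores_a.keys() & scores_b.keys())
    disagreeing = total = 0
    for x, y in combinations(agents, 2):
        total += 1
        # Opposite signs: one rule puts x above y, the other puts y above x.
        if (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y]) < 0:
            disagreeing += 1
    return disagreeing, total
```

Feeding it the per-agent scores from two leaderboards (for example, mean speedup versus fraction of tasks beating the reference) gives the disagreement ratio directly.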

Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation

arXiv cs.AI · Shayan Talaei, Abhinav Chinta, Devvrit Khatri, Amin Karbasi · 2026-07-01

The paper introduces Distill to Detect (D2D), a method for exposing stealth preferential biases in language models by distilling distributional shifts between a suspected model and its base into a KV-cache prefix adapter, termed a cartridge. D2D amplifies hidden biases into generated text, enabling reliable detection across multiple bias types. The method leverages Fisher-weighted projection of logit distribution shifts, supported by empirical observations. By transforming the capacity bottleneck of prefix-tuning adapters into a detection tool, D2D provides a practical approach for auditing hidden behaviors in deployed language models.

stealth biases · kv-cache · prefix adapter · logit distribution · fisher-weighted projection

GPU-Parallel Linearization Error Bounds for Real-Time Robust Optimal Control of Nonlinear and Neural Network Dynamics

arXiv cs.AI · Jeffrey Fang, Keyi Shen, Anutam Srinivasan, Glen Chou · 2026-07-01

The paper introduces GPUSLS-LEO, a GPU-parallel method for real-time robust optimal control of nonlinear and neural network (NN) dynamics that provides tight linearization error bounds (LEBs) to ensure constraint satisfaction. For analytic dynamics, it uses path-based Hessian bounds; for NN dynamics, it uses verifier-generated affine relaxations with Jacobian corrections. The approach adapts a system-level synthesis solver to handle right-invertible disturbance matrices and non-zero-centered sets, enabling zonotopic uncertainty propagation. Evaluated on systems with up to 168 states, GPUSLS-LEO achieves 67 Hz control rates, reducing solve times and conservativeness while maintaining formal guarantees.

linearization error bounds · gpu-parallel optimization · neural network dynamics · robust optimal control · zonotopic uncertainty

World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video

arXiv cs.AI · Liyuan Zhu, Shengyu Huang, Amrita Mazumdar, Tianye Li · 2026-07-01

World from Motion introduces a method for generating dynamic 3D Gaussian representations from monocular videos by conditioning a video model on dense, pixel-aligned renderings that encode appearance, geometry, and 3D scene motion. The approach trains on a dataset of aligned multiview video pairs and dynamic 3DGS representations with simulated artifacts, then distills generations into a consistent dynamic 3DGS at test time. The method achieves state-of-the-art 4D reconstruction and generalizes to in-the-wild videos with large viewpoint changes and dynamic motions.

3d gaussian splatting · monocular reconstruction · novel-view synthesis · 4d reconstruction · dynamic 3dgs

Optimal Resource Utilization for Autonomous Laboratory Orchestrators

arXiv cs.AI · Austin McDannald, Julia Tisaranni, Howie Joress · 2026-07-01

The authors propose a two-step method for optimal resource utilization in autonomous laboratories, focusing on metal-organic framework synthesis. First, constraint programming is employed to generate schedules that minimize total execution time while adhering to hardware limitations and capacities. Second, a system of status dependencies ensures robust execution of these schedules. The approach addresses challenges posed by real-world hardware constraints, including multiple instruments with varying throughputs, enabling efficient task planning and execution.

constraint programming · resource utilization · autonomous laboratories · metal-organic framework · status dependencies

Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations

arXiv cs.AI · Mehul Damani, Isha Puri, Idan Shenfeld, Jacob Andreas · 2026-07-01

The authors propose an adversarial generator-discriminator framework that augments verifiable rewards with a learned signal from human demonstrations, addressing limitations of RL with verifiable rewards (RLVR) in language model training. The generator maximizes both task accuracy and an adversarial reward, while the discriminator learns to distinguish human-written from model-generated outputs, serving as a proxy for human-like properties. Evaluated on bug fixing, story generation, and reward-hacking benchmarks, the method improves non-verifiable properties (e.g., diversity, edit distance, win rate) while retaining RLVR's accuracy gains, offering a scalable way to jointly optimize verifiable and non-verifiable task properties.

adversarial generator-discriminator · verifiable rewards · human demonstrations · diversity collapse · reward hacking

Diffusion-GR2: Diffusion Generative Reasoning Re-ranker

arXiv cs.AI · Zhuoxuan Zhang, Kangqi Ni, Yuhang Chen, Mingfu Liang · 2026-07-01

Diffusion-GR2 introduces a method to convert autoregressive (AR) generative reasoning re-rankers into faster block-diffusion models while maintaining accuracy. The approach addresses structural and distributional gaps through conversion fine-tuning (CFT) to ensure valid permutations, on-policy distillation (OPD) for dense supervision, and reinforcement learning for re-ranking rewards. Experiments on Amazon Beauty show Diffusion-GR2 achieves near-parity with AR re-rankers while improving decode throughput by 2.4–3.5×, with CFT and OPD key to closing the accuracy gap.

generative reasoning · block-diffusion · conversion fine-tuning · on-policy distillation · re-ranking

Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

arXiv cs.AI · Brett Reynolds · 2026-07-01

The paper introduces adversarial pragmatics, a benchmark and annotation protocol for evaluating language model behavior under complex linguistic conditions, including instruction conflict, embedded commands, and policy ambiguity. It comprises a linguistically controlled taxonomy, an 18-item seed benchmark with validator-enforced metadata, and a 54-row local seed pilot. It also contributes an expert-evaluation protocol that distinguishes task success, policy compliance, safety risk, refusal outcome, and evaluator confidence, along with metrics for judge validity, diagnostic ambiguity, and taxonomy drift. The framework aims to validate safety evaluations, LLM judges, gold-set construction, prompt-injection tests, and safety documentation.

adversarial pragmatics · instruction conflict · policy ambiguity · prompt-injection · taxonomy drift

Sequentially-Controlled Interactive Multi-Particle Flow-Maps for Online Feedback-Driven Search

arXiv cs.AI · Binglin Ji, Anindya Sarkar, Hengchang Lu, Jens Sjölund · 2026-07-01

The authors propose Sequentially-Controlled Interactive Multi-Particle Flow-Maps (IMPFM), a framework for online feedback-driven search that maintains broad coverage for heterogeneous preference alignment. IMPFM employs flow maps to share posterior samples across interactive particles, correcting individual drift while preserving structural diversity through a multi-particle interaction mechanism. Theoretical analysis shows convergence to a KL-tilted target distribution, with empirical evaluations demonstrating superior performance over baselines in diverse search and alignment tasks.

multi-particle flow-maps · feedback-driven search · kl-tilted distribution · posterior sample sharing · feynman-kac corrector
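A KL-tilted target has the form π(x) ∝ p(x)·exp(r(x)/β): the base distribution p reweighted by a reward r, with β controlling how far π may move from p. The sketch below shows only this generic target via self-normalized importance weights and multinomial resampling over particles drawn from p; it is not IMPFM's flow-map sharing or interaction mechanism, and the reward values are assumed inputs.

```python
import math
import random


def tilted_weights(rewards: list[float], beta: float) -> list[float]:
    """Self-normalized importance weights for pi(x) ∝ p(x) * exp(r(x) / beta),
    assuming each particle x_i was sampled from the base distribution p."""
    logits = [r / beta for r in rewards]
    shift = max(logits)  # subtract the max before exponentiating, for stability
    unnormalized = [math.exp(z - shift) for z in logits]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def resample(particles: list, weights: list[float], k: int,
             rng: random.Random) -> list:
    """Multinomial resampling: draw k particles in proportion to their weights."""
    return rng.choices(particles, weights=weights, k=k)
```

Small β concentrates mass on the highest-reward particles, while large β leaves the weights close to uniform (that is, close to p). That collapse of diversity at small β is the failure mode the multi-particle interaction is described as counteracting.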

Skills Are Not Islands: Measuring Dependency and Risk in Agent Skill Supply Chains

arXiv cs.AI · Changguo Jia, Tianqi Zhao, Runzhi He, Minghui Zhou · 2026-07-01

The paper introduces Agent Skill Supply Chains (ASSCs) to model dependency graphs in LLM agent skills, addressing opacity in skill metadata and dependencies. It presents SkillDepAnalyzer, a tool inspired by Software Bill of Materials (SBOMs), which extracts natural-language dependency evidence and constructs dependency graphs. Evaluated on the SKILL-DEP benchmark, SkillDepAnalyzer outperforms LLM-based baselines and SBOM tools, recovering skill metadata with high accuracy. Analysis of 1.43M skills reveals four structural patterns in ASSCs, including concentrated reuse and hidden package inventory, and identifies security risks in dependencies. Recommendations include typed dependency manifests and risk-warning audit commands.

agent skill supply chains · dependency management · software bill of materials · llm agents · skill metadata
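Once a tool like SkillDepAnalyzer has produced a dependency graph, supply-chain risk questions reduce to reachability. A minimal sketch, assuming the graph is a plain adjacency dict from a skill or package name to its direct dependencies (the format and names here are hypothetical, not SkillDepAnalyzer's output):

```python
from collections import deque


def transitive_dependencies(graph: dict[str, list[str]], root: str) -> set[str]:
    """Everything `root` depends on, directly or indirectly (cycle-safe BFS)."""
    seen: set[str] = set()
    queue = deque(graph.get(root, []))
    while queue:
        node = queue.popleft()
        if node == root or node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, []))
    return seen


def dependents_of(graph: dict[str, list[str]], package: str) -> set[str]:
    """Every entry in `graph` whose transitive dependencies include `package`,
    i.e. the blast radius if that package turns out to be compromised."""
    return {name for name in graph if package in transitive_dependencies(graph, name)}
```

With `{"pdf-skill": ["pypdf", "ocr-skill"], "ocr-skill": ["tesseract"], "tesseract": []}`, a problem in `tesseract` reaches both skills even though only `ocr-skill` declares it. That hidden, indirect exposure is what typed dependency manifests would make visible.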

Autonomous Scientific Discovery via Iterative Meta-Reflection

arXiv cs.AI · Bingchen Zhao, Sara Beery, Oisin Mac Aodha · 2026-07-01

DiscoPER introduces an autonomous LLM-powered framework for open-ended scientific discovery, overcoming limitations of constrained search spaces and predefined questions. The system dynamically generates and executes code to explore datasets without specific objectives, validating discoveries through statistical testing. A second-order reasoning mechanism synthesizes accumulated findings to identify patterns, confounds, and epistemic gaps, redirecting exploration toward uncharted regions. Tool use expands the search space by processing multimodal sources. Evaluated on the iNatDisco benchmark, DiscoPER recovers 8 of 9 known patterns with a 72.7% hypothesis support rate, outperforming classical causal discovery and LLM-guided baselines. Ablations confirm the benefits of meta-reflection and scalability with more data.

discovery · meta-reflection · multimodal · hypothesis · benchmark

Muon as a Residual Connection

arXiv cs.AI · Hao Huang · 2026-07-01

This paper proposes a mechanistic interpretation of Muon as an implicit residual connection during neural network training, as an explanation for its empirical effectiveness. Analyzing Muon's orthogonalized updates in controlled linear optimization settings, the author shows that it sacrifices immediate gradient fidelity to preserve representations for downstream layers. Muon learns representations that fit local targets more slowly but are more exploitable by subsequent layers, suggesting that optimizer design should balance local descent against downstream usability.

muon · residual connection · orthogonalized updates · gradient fidelity · optimizer design
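An "orthogonalized update" replaces an update matrix G = U S Vᵀ by its polar factor U Vᵀ, so every singular direction moves by the same amount. Muon is commonly described as approximating this with a Newton–Schulz iteration on the momentum matrix; reference implementations reportedly use a tuned quintic variant. The sketch below uses the classic cubic iteration as a simpler stand-in, in pure Python for small full-rank matrices.

```python
import math

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g: Matrix, steps: int = 30) -> Matrix:
    """Approximate the polar factor U V^T of g = U S V^T (g with every singular
    value pushed to 1) via the cubic iteration X <- 1.5 X - 0.5 X X^T X.
    Assumes g is full rank."""
    frob = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / frob for v in row] for row in g]  # scale so all singular values are <= 1
    for _ in range(steps):
        x_xt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * x_xt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x
```

Each step maps a singular value σ in (0, 1] to 1.5σ − 0.5σ³, which moves it monotonically toward 1. The result therefore keeps the update's directions while discarding their relative magnitudes: the "sacrificed gradient fidelity" the paper analyzes.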

Towards Developing a Multimodal Chat Assistant for University Stakeholders: RAG-based Approach

arXiv cs.AI · Md Abu Hanif Shaikh, Abdullah Al Shafi · 2026-07-01

The study presents a retrieval-augmented generation (RAG) multimodal chatbot for university stakeholders, addressing information-access challenges in developing countries. The system combines a large language model with semantic retrieval over institution-specific resources (e.g., the university handbook), handles both text and image queries through a vision-language model, and uses quantized inference for efficient deployment. Evaluation shows the hallucination rate dropping from 31.7% to 6.6% and strong user satisfaction, although visual queries are processed more slowly. The system uses a FastAPI backend and a Next.js frontend to support real-time interaction.

retrieval-augmented generation · vision-language model · quantized inference · semantic retrieval · hallucination reduction

FAR: Failure-Aware Retry for Test-Time Recovery and Continual Policy Improvement

arXiv cs.AI · Haoran Hao, Shahram Najam Syed, Jeffrey Ichnowski, Jeff Schneider · 2026-07-01

The paper proposes Failure-Aware Retry (FAR), a framework for autonomous test-time recovery and continual policy improvement in robotic manipulation. FAR integrates Failure-Contrastive Preference Adaptation to construct preference data from failures, combined with action perturbations for local exploration during retries. Successful recoveries are incorporated into policy training. Experiments demonstrate 17.6% and 11.7% success rate improvements over standard diffusion policies in simulation and real-world tasks respectively, with enhanced data efficiency under reset and timestep budgets.

failure-aware retry · preference adaptation · action perturbations · continual policy improvement · diffusion policy

CausalMix: Data Mixture as Causal Inference for Language Model Training

arXiv cs.AI · Zinan Tang, Yukun Zhang, Shaomian Zheng, Zhuoshi Pan · 2026-07-01

CausalMix introduces a causal inference framework for optimizing data mixture weights in LLM training, addressing the limitations of static distribution assumptions. The method treats data pool features as covariates and the domain mixture as the treatment, estimating the Conditional Average Treatment Effect (CATE) from 512 runs on Qwen2.5-0.5B. Applied to a 7B model with an 800K data pool and to Qwen3-4B-Base for chain-of-thought tasks, it outperforms RegMix baselines across multiple benchmarks and provides interpretable visual analysis through a CATE Interpreter.

causal inference · data mixture · llm training · conditional average treatment effect · domain adaptation
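CATE is the expected effect of a treatment conditional on covariates: E[Y(1) − Y(0) | X = x]. The simplest estimator, assuming treatment is as good as random within each covariate stratum, is a difference in means per stratum. The sketch below shows that estimator, not CausalMix's; the stratum labels, the binary "upweighted-domain mixture vs. baseline mixture" treatment, and the scores are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def stratified_cate(records: list[tuple[str, bool, float]]) -> dict[str, float]:
    """Difference-in-means CATE per covariate stratum.

    records: (stratum, treated, outcome) triples, one per training run.
    Assumes treatment is as good as random within each stratum; strata that
    lack either treated or control runs are skipped.
    """
    groups: dict[str, dict[bool, list[float]]] = defaultdict(lambda: {True: [], False: []})
    for stratum, treated, outcome in records:
        groups[stratum][bool(treated)].append(outcome)
    return {s: mean(g[True]) - mean(g[False])
            for s, g in groups.items() if g[True] and g[False]}
```

A positive value for one stratum and a negative one for another is exactly the heterogeneity that a single global mixture weight (the "static distribution" assumption) cannot capture.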

Cheap Code, Costly Judgment: A Case Study on Governable Agentic Software Engineering

arXiv cs.AI · James C. Davis, Paschal C. Amusuo, Tanmay Singla, Berk Çakar · 2026-07-01

The study proposes governance conversion as a process model for managing AI-mediated software development, addressing the shift from scarce implementation effort to abundant code production. Through a 12-week case study of a single expert engineer using frontier AI coding agents, the authors analyze 88 field notes, 420 KLOC of production code, and 1.16 MLOC of supporting artifacts. The results show how engineering judgment converts recurrent structural failures into durable governance mechanisms, in contrast to traditional, obligation-derived controls.

governance conversion · agentic software engineering · structural failure classes · ai-mediated development · engineering judgment

LongVQUBench: Benchmarking Long-Term Video Quality Understanding of Vision-Language Models

arXiv cs.AI · Arpita Nema, Hanwei Zhu, Xi Zhang, Weisi Lin · 2026-07-01

LongVQUBench introduces a comprehensive benchmark for evaluating long-term video quality understanding in large vision-language models (LVLMs), addressing limitations in temporal continuity and reasoning complexity. The benchmark comprises over 1200 diverse videos and 1500 multiple-choice and open-ended questions, structured across three evaluation levels: local event quality understanding (LQU), cross-event quality reasoning (CQR), and global quality understanding (GQU). A needle distortion question-answering (NDQA) paradigm is embedded to probe fine-grained detection capabilities. Experiments on 14 state-of-the-art LVLMs reveal significant performance degradation with increasing video length and reasoning depth, highlighting limitations in long-range temporal integration and perceptual attribution.

long-term video quality · vision-language models · needle distortion · temporal integration · perceptual attribution

Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use

arXiv cs.AI · Song-Lin Lv, Weiming Wu, Rui Zhu, Zi-Jian Cheng · 2026-07-01

The study formalizes OpenAgent, a problem setting addressing the generalization gap of LLM agents in open-world scenarios with distributional shifts across query, action, observation, and domain dimensions. Using a controlled sandbox environment, the authors evaluate performance degradation under fine-grained environmental shifts (Perception, Interaction, Reasoning, Internalization) for agents trained via Supervised Fine-Tuning and Reinforcement Learning. Results show varying degrees of degradation, prompting the proposal of Perturbation-Augmented Fine-Tuning as a robustness-enhancing intervention strategy.

llm agents · generalization gap · supervised fine-tuning · perturbation-augmented fine-tuning · distributional shifts

Staleness-Learning Rate Scaling Laws for Asynchronous RLHF

arXiv cs.AI · Jingwei Song, Haofeng Xu, Jie Xiao, Chengke Bao · 2026-07-01

The work analyzes staleness effects in asynchronous Group Relative Policy Optimization (GRPO) for RLHF systems, where rollout generation is decoupled from policy updates. By explicitly modeling the behavior policy in the GRPO surrogate objective and distinguishing surrogate-gradient mappings from true derivatives, the authors derive bias bounds under local boundedness and smoothness assumptions. Key results show that stale rollouts induce O(S·η) per-step gradient bias (S = rollout lag, η = learning rate) and reveal a two-regime stability condition: while within-cycle drift stays below the clipping radius, collapse depends on T·η, but once staleness dominates it depends on S·η, yielding η ≪ min{R_batch/(S·G_upd), R_crit/(T·G_upd)}.

asynchronous rlhf · gradient bias · staleness · group relative policy optimization · stability condition
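The stability condition can be evaluated directly to see which regime limits the learning rate. The summary does not define R_batch, R_crit, G_upd, or T; the sketch reads them as two tolerance radii, a bound on per-update gradient size, and the number of updates per cycle, which is an assumption, and the numbers in the usage line are illustrative.

```python
import math


def lr_ceiling(r_batch: float, r_crit: float, staleness: int,
               steps_per_cycle: int, g_upd: float) -> tuple[float, str]:
    """Evaluate min{R_batch/(S*G_upd), R_crit/(T*G_upd)} and report which term binds.

    The condition is eta << this value, so a usable learning rate should sit
    well below it. S = 0 (fully synchronous) removes the staleness term.
    """
    stale_term = math.inf if staleness == 0 else r_batch / (staleness * g_upd)
    cycle_term = r_crit / (steps_per_cycle * g_upd)
    if stale_term <= cycle_term:
        return stale_term, "staleness-dominated (S*eta)"
    return cycle_term, "within-cycle drift (T*eta)"
```

For example, `lr_ceiling(1.0, 1.0, staleness=4, steps_per_cycle=2, g_upd=0.5)` returns `(0.5, "staleness-dominated (S*eta)")`. Doubling the rollout lag S halves the ceiling in that regime, matching the O(S·η) bias scaling.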

MemSyco-Bench: Benchmarking Sycophancy in Agent Memory

arXiv cs.AI · Zhishang Xiang, Zerui Chen, Yunbo Tang, Zhimin Wei · 2026-07-01

The authors introduce MemSyco-Bench, a benchmark for evaluating memory-induced sycophancy in LLM-based agents, addressing the lack of evaluations of how retrieved memories influence downstream reasoning. The benchmark measures when memory should affect decisions and how valid memory should be used, covering five tasks: rejecting memory as factual evidence, respecting its applicable scope, resolving conflicts between memory and objective evidence, tracking memory updates, and using valid memory for personalization. All resources are available at https://github.com/XMUDeepLIT/MemSyco-Bench.

memory-induced sycophancy · llm-based agents · factual evidence · objective reasoning · personalization

Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

arXiv cs.AI · Daniel Armstrong, Maarten Dobbelaere, Valentas Olikauskas, Helena Avila · 2026-07-01

The authors present an automated pipeline for generating and verifying reaction classification rules using a multi-agent LLM framework, addressing the limitations of fixed rulesets in computer-assisted synthesis planning. Their method involves LLM agents classifying reactions and writing rules across 665,901 US patent reactions, with each rule verified against the corpus. The system expands a standard taxonomy from 68 to 14,073 classes without human curation, achieving 97.7% classification accuracy on unseen reactions with a lightweight fingerprint classifier, matching proprietary tools while offering finer chemical resolution and extensibility.

computer-assisted synthesis planning · reaction classification · multi-agent framework · large language models · verification loop

DART-VLN: Test-Time Memory Decay and Anti-Loop Regularization for Discrete Vision-Language Navigation

arXiv cs.AI · Shaoheng Zhang, Zhichen Li, Jie Mei · 2026-07-01

DART-VLN introduces a training-free test-time control framework for discrete vision-language navigation (VLN) addressing two failure modes: stale memory readouts and local backtracking. The method combines Test-Time Memory Decay, which reweights memory readouts to suppress redundant evidence, and Anti-Loop Regularization, a next-hop penalty discouraging immediate reversals. Evaluated on R2R and REVERIE, decay-only improves read-side performance, while decay+anti-loop achieves the best quality-efficiency trade-off, reducing trajectory length and runtime while improving navigation accuracy under frozen backbones.

vision-language navigation · memory decay · anti-loop regularization · partial observability · frozen backbones
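Both DART-VLN components are training-free adjustments at inference time. The sketch below is one plausible reading, not the paper's formulation: memory entries are down-weighted geometrically by age (the decay form and `gamma` are assumptions), and a candidate equal to the node the agent just left loses a fixed `penalty` (a hypothetical parameter).

```python
from typing import Optional


def decayed_readout(memory: list[tuple[int, list[float]]], now: int,
                    gamma: float) -> list[float]:
    """Weighted average of stored feature vectors where an entry written at
    step t gets weight gamma ** (now - t), so stale entries fade (0 < gamma <= 1)."""
    weights = [gamma ** (now - t) for t, _ in memory]
    total = sum(weights)
    dim = len(memory[0][1])
    return [sum(w * feat[d] for w, (_, feat) in zip(weights, memory)) / total
            for d in range(dim)]


def choose_next_hop(scores: dict[str, float], previous_node: Optional[str],
                    penalty: float) -> str:
    """Pick the best candidate viewpoint after penalizing an immediate reversal
    to the node the agent just left."""
    adjusted = {node: s - (penalty if node == previous_node else 0.0)
                for node, s in scores.items()}
    return max(adjusted, key=adjusted.get)
```

With scores `{"v1": 0.9, "v2": 0.8}` and `previous_node="v1"`, a penalty of 0.2 redirects the agent to `v2`, while 0.05 still allows the reversal. The penalty is therefore a soft bias against backtracking, not a hard ban, since the backbone stays frozen and its scores are left intact.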

EchoRisk: A Multicentre Echocardiography Dataset and Benchmark for Cardio-Oncology

arXiv cs.AI · Grigorios Kalliatakis, Georgia Karanasiou, Georgios Manikis, Manolis Tsiknakis · 2026-07-01

The authors introduce EchoRisk, the first multicentre, longitudinal echocardiography dataset with cardiotoxicity labels, designed for automated risk stratification in cardio-oncology. The dataset includes 422 patients from the CARDIOCARE study, yielding 2,159 echocardiography videos across 1,123 clinical exams, with a dedicated cohort of 280 patients for early cardiotoxicity prediction. Three clinically grounded tasks are defined: left ventricular (LV) ejection fraction estimation, LV dysfunction classification, and early cardiotoxicity prediction. Baseline performance is established using an R(2+1)D video backbone with LSTM aggregation, pretrained on Kinetics-400; it shows strong discriminative performance for cardiac functional assessment and LV dysfunction classification, while early prediction remains challenging. The dataset, evaluation code, and baseline implementations are publicly available.

echocardiography · cardiotoxicity · longitudinal · lstm · r(2+1)d

Behavior-Adaptive Conversational Agents: Toward a Fluid Personality Framework

arXiv cs.AI · Hasibur Rahman, Smit Desai · 2026-07-01

The paper proposes a Fluid Personality Framework for LLM-based conversational agents that dynamically adapts (1) metaphorical persona (e.g., coach, tutor) and (2) personality expression intensity (low/medium/high) based on task context, user traits, and situational urgency. This addresses limitations of static persona implementations, building on evidence that moderate personality expression and context-appropriate metaphors optimize trust, enjoyment, and adoption metrics. The framework synthesizes design dimensions for adaptive agent behaviors across domains like healthcare and education.

conversational agents · persona adaptation · personality expression · llm-based systems · behavior change

PedNStream: Scalable Network Flow Simulation for Pedestrian Traffic Management

arXiv cs.AI · Weiming Mai, Dorine Duives, Serge Hoogendoorn · 2026-07-01

PedNStream introduces an open-source Python framework for macroscopic pedestrian network simulation, extending the Link Transmission Model (LTM) with stochastic link dynamics and utility-based route choice for control-oriented applications. The modular design integrates intervention controllers (gating, flow separation) and supports closed-loop evaluation. Validation includes synthetic scenarios (queue formation, spillback), real-network consistency checks, and runtime analysis demonstrating scalability. Results confirm its efficacy as a testbed for large-scale pedestrian traffic management.

link transmission model · pedestrian simulation · network loading · closed-loop control · stochastic dynamics
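The Link Transmission Model tracks cumulative counts at each link's entry and exit and derives per-step sending and receiving flows from them. The sketch below is the standard deterministic LTM core as we understand it, with travel times rounded to whole time steps. PedNStream's stochastic link dynamics and utility-based route choice are not shown, and the parameter names are our own.

```python
def sending_flow(n_up: list[float], n_down: list[float], t: int,
                 free_flow_steps: int, q_max: float) -> float:
    """Max flow that can leave the link during step t -> t+1.

    n_up[k], n_down[k]: cumulative counts entering / leaving the link by step k.
    free_flow_steps: link length / free-flow walking speed, in whole time steps.
    q_max: link capacity per time step.
    """
    assert free_flow_steps >= 1
    k = t + 1 - free_flow_steps
    arrived = n_up[k] if k >= 0 else 0.0  # everyone who entered at least L/v_f ago
    return min(arrived - n_down[t], q_max)


def receiving_flow(n_up: list[float], n_down: list[float], t: int,
                   backward_steps: int, jam_storage: float, q_max: float) -> float:
    """Max flow that can enter the link during step t -> t+1.

    backward_steps: link length / backward wave speed, in whole time steps.
    jam_storage: jam density times link length (people the link can hold).
    """
    assert backward_steps >= 1
    k = t + 1 - backward_steps
    left = n_down[k] if k >= 0 else 0.0  # outflow whose freed space has propagated back
    return min(left + jam_storage - n_up[t], q_max)
```

At a simple series node, the flow actually transferred is the minimum of the upstream link's sending flow and the downstream link's receiving flow. When the receiving flow drops because the downstream link is full, queues grow upstream, which is the spillback behavior the validation scenarios exercise. A gating controller can be modeled by capping either quantity.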

Reading Order Inference for Complex Document Layouts

arXiv cs.AI · Iddo Hakim, Sharva Gogawale, Omer Ventura, Gal Grudka · 2026-07-01

A graph-based framework for reading order inference in complex document layouts is proposed, addressing the bottleneck in digitizing historical manuscripts with interleaved reading streams. The method constructs a directed candidate-transition graph from OCR text lines, scoring edges using a weighted ensemble of causal language model likelihood and BERT next-sentence prediction. Global reading order is recovered via a degree-constrained directed path cover, employing a max-regret inference rule to mitigate cascading failures. Evaluated on synthetic Glossa Ordinaria layouts, ALTO page geometries, and OmniDocBench, the method achieves 95% edge accuracy on wrap-around layouts and 88% on multi-column subsets, outperforming XY-cut and LayoutReader baselines. Mirror-invariance is verified with minimal performance variation.

reading order inference · graph-based framework · candidate-transition graph · max-regret inference · mirror-invariance
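A degree-constrained directed path cover links text lines into chains in which each line has at most one successor and one predecessor and no cycles occur. The sketch below uses a generic max-regret greedy rule: repeatedly commit the line whose best feasible successor beats its runner-up by the largest margin. It is our illustration, and the paper's inference rule may differ. Edge scores are assumed to be given (in the paper they come from the LM/NSP ensemble), and `min_score` is a hypothetical cutoff.

```python
def max_regret_path_cover(nodes: list[str], score: dict[tuple[str, str], float],
                          min_score: float = 0.0) -> list[list[str]]:
    """Link lines into reading chains (<= 1 successor, <= 1 predecessor, no cycles)."""
    succ: dict[str, str] = {}
    pred: dict[str, str] = {}

    def creates_cycle(i: str, j: str) -> bool:
        # Adding i -> j closes a cycle iff i is reachable from j via successors.
        node = j
        while True:
            if node == i:
                return True
            if node not in succ:
                return False
            node = succ[node]

    def options(i: str) -> list[tuple[float, str]]:
        cands = [(score[(i, j)], j) for j in nodes
                 if (i, j) in score and score[(i, j)] > min_score
                 and j not in pred and not creates_cycle(i, j)]
        return sorted(cands, reverse=True)

    while True:
        best = None  # (regret, source, target)
        for i in nodes:
            if i in succ:
                continue
            opts = options(i)
            if not opts:
                continue
            runner_up = opts[1][0] if len(opts) > 1 else min_score
            regret = opts[0][0] - runner_up
            if best is None or regret > best[0]:
                best = (regret, i, opts[0][1])
        if best is None:
            break
        _, i, j = best
        succ[i], pred[j] = j, i

    # Every chain starts at a line with no predecessor.
    chains = []
    for start in nodes:
        if start not in pred:
            chain = [start]
            while chain[-1] in succ:
                chain.append(succ[chain[-1]])
            chains.append(chain)
    return chains
```

Committing the least ambiguous decision first, rather than the single highest-scoring edge, keeps an early confident-but-contested link from forcing later lines into bad positions, the cascading failure the summary mentions.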

Logit-Contribution Scoring Identifies Non-Literal Retrieval Heads

arXiv cs.AI · Aryo Pradipta Gema, Beatrice Alex, Pasquale Minervini · 2026-07-01

Logit-Contribution Scoring (LOCOS) is introduced as a write-aware detector that identifies attention heads responsible for non-literal retrieval in large language models, addressing the limitation of existing detectors that rely solely on literal-copy criteria. LOCOS scores heads by projecting their output-value (OV) circuit outputs onto the answer-token unembedding direction, contrasting needle and off-needle source positions in a single forward pass. Across Qwen3, Gemma-3, and OLMo-3.1 models on the NoLiMa benchmark, ablating the top LOCOS heads sharply reduces ROUGE-L, a larger effect than ablating heads selected by prior methods. Targeted ablations on Qwen3-8B demonstrate retrieval specificity: parametric recall and arithmetic reasoning are preserved, while performance on MuSiQue and BABI-Long drops drastically.

logit-contribution scoring · non-literal retrieval · output-value circuit · attention heads · rouge-l
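The core of a logit-contribution score is a dot product: how much a head's attention-weighted OV output from each source position pushes the residual stream toward the answer token's unembedding row. The sketch below contrasts the mean contribution from needle positions with the mean from all other positions. The mean-vs-mean contrast and the toy vectors are our assumptions, not LOCOS's exact definition.

```python
def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def _mean(xs: list[float]) -> float:
    return sum(xs) / len(xs) if xs else 0.0


def logit_contribution_score(ov_outputs: list[list[float]], attn: list[float],
                             needle_positions: set[int],
                             answer_unembed: list[float]) -> float:
    """Contrast a head's write toward the answer token from needle vs. other positions.

    ov_outputs[p]: the head's OV-circuit output for source position p (in
    residual-stream space); attn[p]: attention weight from the answer
    position to p; answer_unembed: unembedding row of the answer token.
    """
    contrib = [a * dot(o, answer_unembed) for o, a in zip(ov_outputs, attn)]
    needle = [c for p, c in enumerate(contrib) if p in needle_positions]
    other = [c for p, c in enumerate(contrib) if p not in needle_positions]
    return _mean(needle) - _mean(other)
```

Because the score measures what a head writes toward the answer rather than whether it attends to an exact copy of it, a head can score highly even when the needle paraphrases the answer. That is the "write-aware" property that lets it find non-literal retrieval heads.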

SWE-Doctor: Guiding Software Engineering Agents with Runtime Diagnosis from Multi-Faceted Bug Reproduction Tests

arXiv cs.AI · Yaoqi Guo, Yang Liu, Jie M. Zhang, Yun Ma · 2026-07-01

SWE-Doctor introduces a novel software issue resolution agent that leverages runtime diagnoses from multi-faceted bug reproduction tests (BRTs) to improve patch generation. The method generates BRTs for diverse behavioral requirements, executes and debugs them to create runtime-grounded diagnosis records, and uses these alongside localization information to guide patch generation. Evaluated on Python bug-fixing issues from SWE-bench Verified and SWE-bench Pro across five LLM backends, SWE-Doctor achieves average resolution rates of 75.7% and 59.4%, respectively, outperforming baseline agents by 8.0–8.9 percentage points on SWE-bench Pro.

bug reproduction tests · runtime diagnosis · patch generation · software engineering agents · llm backends

SenseWalk: Agent-Based Semantic Trajectory Simulation Powered by Large Language Models in Zoned Environments

arXiv cs.AI · Ziyue Lin, Xinhang Xie, Kangyi Wang, Siming Chen · 2026-07-01

SenseWalk introduces an agent-based semantic trajectory simulation system leveraging large language models (LLMs) to model human movement in zoned environments. The system integrates LLMs with the social force model to ensure both semantic coherence and physical plausibility in trajectory generation. A user-friendly interface enables customization of simulation configurations and analysis of outputs. Quantitative experiments validate the simulation workflow, and a user study (n=12) demonstrates the system's utility and efficiency for practitioners.

semantic trajectory · large language models · social force model · zoned environments · simulation workflow

TRCGL-Net: A Long-Tailed Multi-Label Chest X-Ray Classification Framework with Generative Data Augmentation and Label Co-Occurrence Modeling

arXiv cs.AI · Tong Shao, Hongshun Ling, Li Zhang, Jinjing Wu · 2026-07-01

TRCGL-Net addresses long-tailed multi-label classification in chest X-rays through three innovations: (1) a text-guided conditional diffusion model for realistic tail-class sample generation, (2) channel reweighting and class-aware attention for lesion localization, and (3) a label co-occurrence graph convolution network. The framework improves tail-class representation under anatomical noise and mitigates head-class dominance in co-occurrence modeling. On PadChest, it achieves 0.4904 tail-class mAP, 0.4408 overall mAP, and 0.8989 mAUC, outperforming prior methods in rare disease recognition under extreme class imbalance.

long-tailed classification · conditional diffusion model · channel reweighting · label co-occurrence · graph convolution network
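
The label co-occurrence component can be illustrated with the conditional-probability adjacency commonly used in GCN-based multi-label classifiers (for example ML-GCN). In that adjacency, entry (i, j) estimates P(label j | label i) from training annotations. This is a generic sketch, not TRCGL-Net's exact construction, and the label indices and annotations are hypothetical.

```python
def conditional_cooccurrence(label_sets, num_labels):
    """P[i][j] estimates P(label j present | label i present); diagonal left at 0."""
    counts = [0] * num_labels
    pair = [[0] * num_labels for _ in range(num_labels)]
    for labels in label_sets:
        present = set(labels)
        for i in present:
            counts[i] += 1
            for j in present:
                if i != j:
                    pair[i][j] += 1
    return [
        [pair[i][j] / counts[i] if counts[i] else 0.0 for j in range(num_labels)]
        for i in range(num_labels)
    ]


# Hypothetical annotations: label 0 is a frequent (head) finding, 1 and 2 are rarer.
studies = [[0, 1], [0], [0, 2], [1, 2], [0]]
P = conditional_cooccurrence(studies, num_labels=3)
print(P[1][0], P[0][1])  # 0.5 0.25
```

The matrix is asymmetric. A tail label strongly predicts the head label (0.5), while the reverse is weak (0.25), which shows how head classes can dominate raw co-occurrence statistics.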

Bayesian Uncertainty Propagation for Agentic RAG Pipelines: A Proof-of-Concept Study on Multi-Hop Question Answering

arXiv cs.AI · Louis Donaldson, Connor Walker, Koorosh Aslansefat, Yiannis Papadopoulos · 2026-07-01

The paper proposes a Bayesian uncertainty propagation framework for Agentic Retrieval-Augmented Generation (RAG) systems, addressing trustworthiness in multi-stage reasoning pipelines. The method derives uncertainty signals for the planner, evaluator, and generator components from semantic divergence and self-evaluation. It then combines them in a Bayesian Network that estimates system-level uncertainty and identifies failure points. Evaluated on StrategyQA and HotpotQA with GPT-3.5-Turbo and GPT-4.1-Nano, the approach performs better on HotpotQA, where uncertainty accumulates across reasoning hops, but shows calibration problems on StrategyQA. Reported metrics include AUROC (0.72-0.85), AUARC, ECE, and Brier score.

bayesian uncertainty propagation · agentic rag · multi-hop reasoning · selective prediction · calibration error
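
To make the reported metrics concrete, here is a minimal sketch of ECE, Brier score, and a failure-detection AUROC computed from system-level uncertainty estimates. It assumes confidence is 1 minus the propagated uncertainty and uses 10 equal-width bins for ECE. Both choices are assumptions, not necessarily the paper's settings, and AUARC is omitted for brevity.

```python
def brier_score(confidences, correct):
    """Mean squared gap between predicted confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: bin-size-weighted |mean confidence - accuracy|."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(y for _, y in b) / len(b)
            ece += len(b) / n * abs(mean_conf - accuracy)
    return ece


def failure_auroc(uncertainties, correct):
    """P(a wrong answer has higher uncertainty than a right one); ties count 1/2."""
    wrong = [u for u, y in zip(uncertainties, correct) if y == 0]
    right = [u for u, y in zip(uncertainties, correct) if y == 1]
    wins = sum(1.0 if w > r else 0.5 if w == r else 0.0 for w in wrong for r in right)
    return wins / (len(wrong) * len(right))


# Hypothetical system-level outputs: uncertainty, derived confidence, correctness.
uncertainty = [0.1, 0.1, 0.8, 0.8]
is_correct = [1, 0, 0, 0]
confidence = [1.0 - x for x in uncertainty]
print(brier_score(confidence, is_correct))  # about 0.225
```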

Aionoscope: Debugging Latent-State Accessibility in Time-Series Representations

arXiv cs.AI · Alexander Chemeris, Ming Jin, Randall Balestriero · 2026-07-01

Aionoscope introduces a generator-based diagnostic tool for evaluating latent-state accessibility in frozen time-series representations, focusing on debugging process state variables like timing, phase, amplitude, frequency, and regime. The method employs Primitive Process Mixtures to generate synthetic streams with exact labels, assessing 37 model-plus-adapter systems via pooled linear-probe protocols. Results reveal a mismatch: while component presence is easily recoverable (coarse accessibility), dense process state remains poorly exposed (highest mean masked $R^2$ = 0.689 vs. oracle 0.999), highlighting a critical failure mode in representation interpretability.

latent-state accessibility · time-series representations · linear-probe protocol · process state variables · synthetic streams

Learning Cardiac Motion Priors for Implicit Neural Representations

arXiv cs.AI · Andrew Bell, George Webber, Andrew P King, Steffen E Petersen · 2026-07-01

The paper compares four strategies for learning cardiac motion priors in implicit neural representations (INRs) to improve motion field estimation from tagged cardiac MRI. Methods evaluated include population priors via joint optimization, consensus priors through weight averaging, auto-decoders, and meta-learning, tested on UK Biobank short-axis images. Results show all priors enhance early adaptation performance versus random initialization, with auto-decoders excelling at large deformation recovery and meta-learning maintaining superior adaptation trajectories over 50 iterations.

implicit neural representations · cardiac motion estimation · meta-learning · auto-decoders · optimization trajectory

Post-Training Pruning for Diffusion Transformers

arXiv cs.AI · Chengzhi Hu, Xuewen Liu, Jing Zhang, Mengjuan Chen · 2026-07-01

DiT-Pruning introduces customized saliency criteria and clustering-aware pruning granularity for Diffusion Transformers (DiTs), addressing limitations of traditional pruning methods. The method balances weight and activation contributions via an energy-based metric and leverages distinct clustering patterns in weight space for sparse allocation. Evaluations on FLUX.1-dev at 512x512 resolution on MJHQ demonstrate minimal CLIP score degradation (0.001 loss) at 50% sparsity, significantly outperforming existing pruning techniques while preserving image quality under high sparsity.

diffusion transformers · post-training pruning · saliency criteria · clustering-aware granularity · clip score
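
For readers unfamiliar with activation-aware pruning, here is a sketch of the Wanda criterion, which scores each weight by its magnitude times the norm of its input activations. This is a swapped-in, well-known baseline that only shows how weight and activation contributions can be combined. It is not DiT-Pruning's energy-based metric or its clustering-aware granularity.

```python
import math


def wanda_style_prune(W, X, sparsity=0.5):
    """Zero the lowest-saliency weights in each output row.

    Saliency is |W[i][j]| * ||X[:, j]||_2: weight magnitude times the L2 norm of
    the matching input activation over calibration samples (Wanda-style).
    W: out_dim x in_dim weights; X: n_samples x in_dim calibration activations.
    """
    in_dim = len(W[0])
    col_norms = [math.sqrt(sum(x[j] ** 2 for x in X)) for j in range(in_dim)]
    k = int(in_dim * sparsity)  # number of weights removed per row
    pruned = []
    for row in W:
        saliency = [abs(w) * col_norms[j] for j, w in enumerate(row)]
        drop = set(sorted(range(in_dim), key=lambda j: saliency[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


W = [[1.0, 0.5, 2.0, 0.1]]
X = [[1.0, 10.0, 0.1, 1.0], [1.0, 10.0, 0.1, 1.0]]
print(wanda_style_prune(W, X))  # [[1.0, 0.5, 0.0, 0.0]]
```

The largest-magnitude weight (2.0) is pruned because its input channel is almost inactive. Pure magnitude pruning would have kept it.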

Human-Machine Collaboration on Generative Meta-Learning: Model and Algorithm

arXiv cs.AI · Midhun Parakkal Unni, Samuel Kaski · 2026-07-01

The paper introduces Generative Meta-Learning with Human Feedback (GMHF), a framework that addresses domain generalization by synthesizing data aligned with expert intuition. GMHF uses a Conditional Neural ODE (cNODE) as a generative digital twin, together with a Reinforcement Learning agent that iteratively refines latent physical parameters based on human feedback. This minimizes the divergence between the generated and target distributions. Theoretical bounds show that aligning generated data with human beliefs reduces generalization error. On a nonlinear Duffing oscillator, GMHF reduces deployment loss when the feedback is reliable, and the approach also extends to non-dynamical probabilistic models, supporting its use for generalization under distribution shift.

generative meta-learning · conditional neural ode · reinforcement learning · domain generalization · distribution shift

Graph-Native Reinforcement Learning Enables Traceable Scientific Hypothesis Generation through Conceptual Recombination

arXiv cs.AI · Subhadeep Pal, Shashwat Sourav, Tirthankar Ghosal, Markus J. Buehler · 2026-07-01

Graph-PRefLexOR, a graph-native reasoning model fine-tuned with Group Relative Policy Optimization (GRPO), enhances scientific hypothesis generation through multi-step, domain-grounded reasoning. The model organizes reasoning into explicit phases—mechanism exploration, graph construction, pattern extraction, and hypothesis synthesis—linking neural language generation with symbolic relational structure for traceable causal connections. Evaluated on 100 open-ended materials science questions, Graph-PRefLexOR achieves 40-65% improvements over base models, particularly in reasoning traceability, and exhibits 2-3 times greater semantic diversity. Test-time graph expansion reveals increased long-range conceptual recombination within a bounded semantic space, advancing interpretable AI for materials design.

graph-native reasoning · group relative policy optimization · conceptual recombination · semantic diversity · traceable hypothesis generation

From Personas to Plot: Character-Grounded Multi-Agent Story Generation for Long-Form Narratives

arXiv cs.AI · Aayush Aluru, Chloe Ho, Muhammad Hammouri, Kerry Luo · 2026-07-01

The paper introduces MAGNET, a multi-agent framework for long-form narrative generation, coupled with ATLAS, a graph-based hallucination detection system. MAGNET employs persona-grounded character agents that propose actions based on shared world states and story goals, while ATLAS verifies consistency by comparing scene-level world representations. Evaluations show MAGNET reduces annotation needs by 41% and hallucinations by 50% versus single-model baselines at 100-page length, with similar improvements over IBSEN. Results demonstrate that explicit world-state tracking and goal-driven multi-agent generation enhance narrative coherence in long-form fiction.

multi-agent generation · long-form narratives · world-state tracking · hallucination detection · persona-grounded agents

Valdi: Value Diffusion World Models

arXiv cs.AI · Christopher Lindenberg, Kashyap Chitta · 2026-07-01

Valdi introduces Value Diffusion World Models, combining end-to-end online training for Model Predictive Control (MPC) with a latent diffusion dynamics model to address the trade-off between fast inference and expressive uncertainty modeling. The method employs a single diffusion step during both training and inference, enabling efficient latent planning. Preliminary experiments on CarRacing show performance comparable to a deterministic MLP baseline, while revealing a trade-off between predictive multimodality and control performance.

model predictive control · diffusion models · world models · latent planning · uncertain dynamics

Two AI Metrics Diverged: Will it Make All the Difference?

arXiv cs.AI · Alex Fogelson, Zachary A. Brown, Hans Gundlach, Jayson Lynch · 2026-07-01

The paper analyzes how AI capability metrics influence the accessibility of frontier models versus meek models under exponential compute scaling. It builds on prior work to classify performance metrics by their functional forms in relation to training and inference compute, providing mathematical conditions for determining which metrics favor meek models. Results show bounded metrics always favor meek models, while unbounded metrics perpetuate frontier model dominance. The analysis highlights that bounded and unbounded metrics may suggest opposing policy responses, with unbounded metrics likely concentrating capabilities among wealthy actors and bounded metrics enabling broader proliferation.

bounded metrics · unbounded metrics · compute scaling · frontier models · meek models
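
Here is a toy numerical illustration of the bounded/unbounded distinction, using hypothetical functional forms. The bounded metric is 1 - C^(-alpha), which saturates at 1, and the unbounded metric is C^alpha. In the scenario, the frontier actor always holds 100x the meek actor's compute, and both grow 10x per step. These forms are assumptions chosen for illustration, not the paper's classification of metrics.

```python
def bounded_metric(compute, alpha=0.5):
    """Saturating score in [0, 1): error decays as a power law in compute."""
    return 1.0 - compute ** (-alpha)


def unbounded_metric(compute, alpha=0.5):
    """Score that keeps growing with compute."""
    return compute ** alpha


def frontier_gaps(metric, steps=5, meek_start=1.0, growth=10.0, lead=100.0):
    """Frontier-minus-meek score at each step of exponential compute growth."""
    gaps = []
    for t in range(steps):
        meek = meek_start * growth ** t
        gaps.append(metric(lead * meek) - metric(meek))
    return gaps


bounded_gaps = frontier_gaps(bounded_metric)      # about 0.9, 0.28, 0.09, 0.028, 0.009
unbounded_gaps = frontier_gaps(unbounded_metric)  # about 9, 28, 90, 285, 900
```

Under the bounded form, the frontier lead shrinks toward zero even though the compute ratio is fixed, consistent with the claim that bounded metrics favor meek models. Under the unbounded form, the lead keeps widening.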

DeWorldSG: Depth-Aware 3D Semantic Scene Graph Generation via World-Model Priors

arXiv cs.AI · Seok-Young Kim, Abdelrahman Elskhawy, Taewook Ha, Dooyoung Kim · 2026-07-01

DeWorldSG introduces a depth-aware framework for generating robust 3D Semantic Scene Graphs (3D-SSGs) from RGB-D sequences, addressing instability in 3D object representations and relational sparsity. The method employs depth-guided filtering to estimate instance-level geometric 3D Gaussian distributions, representing objects as probabilistic 3D nodes, and refines relations using spatiotemporal aggregation and contextual priors from V-JEPA 2. Evaluated on 3DSSG and ReplicaSSG, it achieves state-of-the-art performance, improving triplet recall by 77.4% and predicate recall by 23.2% over prior methods, demonstrating suitability for robotic manipulation and AR applications.

3d semantic scene graphs · depth-guided filtering · probabilistic 3d nodes · spatiotemporal aggregation · v-jepa 2

Improving Sparse-View 3DGS Generalization via Flat Minima Optimization

arXiv cs.AI · Kangmin Seo, Sangeek Hyun, MinKyu Lee, Jae-Pil Heo · 2026-07-01

We improve sparse-view generalization in 3D Gaussian Splatting (3DGS) through flat minima (FM) optimization, addressing overfitting to observed views. Our method adapts FM principles to 3DGS by introducing controlled Gaussian perturbations that account for anisotropy and training progress, preserving fine details while enhancing robustness. We further stabilize optimization via periodic reinitialization of non-positional parameters. The approach integrates into existing 3DGS pipelines without architectural changes. Experiments on the LLFF and Mip-NeRF360 datasets show improved quantitative metrics and perceptual quality, yielding sharper, more stable reconstructions with better novel-view generalization.

3d gaussian splatting · flat minima optimization · sparse-view generalization · novel view synthesis · anisotropic perturbations

Self-Evolving Agents with Anytime-Valid Certificates

arXiv cs.AI · Biswa Sengupta · 2026-07-01

The paper introduces SEA, a self-evolving agent architecture that confines self-modification to a steering adapter and a versioned harness around a frozen base model, using anytime-valid gates that produce auditable certificates. Five loop controllers (best-of-N, micro-step search, self-authored reproduction oracles, search-layer control, and self-repair) provide dense signals for gate selection without graders. On a 52-instance SWE-bench Verified subset, SEA improved performance by 4 and 5 points on GLM-5.2 and GPT models, respectively, with event logs confirming that the mechanism prevents regressions.

self-evolving agents · anytime-valid certificates · steering adapter · versioned harness · loop controllers
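
The summary does not say how SEA's anytime-valid gates are built. One standard construction, sketched here as an assumption, is a betting e-process. Pairwise outcomes of candidate versus incumbent (1 = candidate wins, 0 = loses, None = tie) multiply a wealth process. That process is a nonnegative supermartingale whenever the candidate's win rate is at most 0.5. By Ville's inequality, accepting once wealth reaches 1/alpha therefore keeps the false-acceptance rate at or below alpha at any stopping time. The function name and the fixed bet size `lam` are hypothetical choices.

```python
def anytime_valid_gate(outcomes, alpha=0.05, lam=1.0):
    """Sequential accept/reject gate with an anytime-valid false-accept bound.

    outcomes: iterable of 1 (candidate better), 0 (worse), or None (tie).
    Null hypothesis: among non-tied comparisons the candidate wins with
    probability <= 0.5. Requires 0 < lam < 2 so wealth stays positive.
    Returns (accepted, comparisons_seen, final_wealth).
    """
    if not 0.0 < lam < 2.0:
        raise ValueError("lam must be in (0, 2)")
    wealth, seen = 1.0, 0
    for x in outcomes:
        seen += 1
        if x is None:
            continue  # ties carry no evidence either way
        # Expected factor under the null is 1 + lam * (p - 0.5) <= 1.
        wealth *= 1.0 + lam * (x - 0.5)
        if wealth >= 1.0 / alpha:
            # The final wealth could be logged as an auditable certificate.
            return True, seen, wealth
    return False, seen, wealth


print(anytime_valid_gate([1] * 10))  # accepts after 8 straight wins
```

Because the bound holds at every stopping time, the gate can be checked after each comparison without a fixed sample size.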

CAT: Confidence-Adaptive Thinking for Efficient Reasoning of Large Reasoning Models

arXiv cs.AI · Qizhi Jiang, Shuo Wang, Pei Ke, Yuhang Song · 2026-07-01

The paper proposes Confidence-Adaptive Thinking (CAT), a framework to improve the efficiency of Large Reasoning Models (LRMs) by dynamically adjusting reasoning lengths based on problem difficulty. CAT integrates the model's self-certainty signals as confidence into preference optimization, enabling compressed responses for confident queries while preserving deliberation for uncertain ones. Experiments demonstrate CAT's superior reasoning accuracy over baselines across multiple benchmarks and base models, offering a solution for balancing accuracy and latency in industrial applications.

large reasoning models · confidence-adaptive thinking · preference optimization · chain-of-thought · self-certainty signals
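
Here is a sketch of one way to turn token distributions into a confidence signal. Self-certainty is measured as the average KL divergence from the uniform distribution to each next-token distribution, a definition used in some recent LLM work. Whether CAT uses exactly this formula is an assumption. The `is_confident` threshold that routes a query to a short or long response is hypothetical.

```python
import math


def self_certainty(token_dists, eps=1e-12):
    """Mean KL(Uniform || p_t) over generated tokens; 0 if uniform, larger if peaked.

    token_dists: list of next-token probability vectors, one per generated token.
    """
    total = 0.0
    for p in token_dists:
        v = len(p)
        # KL(U || p) = -log V - (1/V) * sum_v log p_v
        total += -math.log(v) - sum(math.log(max(q, eps)) for q in p) / v
    return total / len(token_dists)


def is_confident(token_dists, threshold=1.0):
    """Hypothetical gate: treat the query as 'confident' above a fixed threshold."""
    return self_certainty(token_dists) >= threshold


uniform = [[0.25, 0.25, 0.25, 0.25]]
peaked = [[0.97, 0.01, 0.01, 0.01]]
print(is_confident(uniform), is_confident(peaked))  # False True
```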

Meta-Transfer Learning for mmWave Beam Alignment

arXiv cs.AI · Ahmet Nuri Cevik, Sinem Coleri · 2026-07-01

The paper proposes MTL-BA, a meta-transfer learning framework for efficient mmWave beam alignment in MISO systems. The method combines a frozen pre-trained convolutional backbone with meta-learned lightweight Scale-and-Shift adapters and a classifier head, reducing adaptation parameters by 17× compared to full fine-tuning and MAML. Evaluated on the DeepMIMO dataset, MTL-BA matches full fine-tuning accuracy and spectral efficiency across SNRs, outperforms last-layer fine-tuning with similar parameter updates, and approaches MAML's performance with 60% fewer meta-training epochs.

meta-transfer learning · mmwave beam alignment · scale-and-shift adapters · miso systems · deepmimo dataset
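
Here is a sketch of the Scale-and-Shift idea. The convolutional backbone stays frozen, and only a per-channel scale (gamma) and shift (beta) are meta-learned for each feature map, which is why the trainable-parameter count drops so sharply. The layer shapes below are hypothetical and do not reproduce the paper's 17× figure.

```python
def scale_shift(features, gamma, beta):
    """Apply per-channel y = gamma * x + beta to frozen features (channels x positions)."""
    return [[g * x + b for x in channel] for channel, g, b in zip(features, gamma, beta)]


def conv_params(c_in, c_out, k):
    """Trainable parameters of a k x k conv layer with bias (updated by full fine-tuning)."""
    return c_out * (c_in * k * k + 1)


def scale_shift_params(c_out):
    """Trainable parameters of a Scale-and-Shift adapter on the same layer."""
    return 2 * c_out


frozen = [[1.0, 2.0], [3.0, 4.0]]
print(scale_shift(frozen, gamma=[2.0, 0.5], beta=[0.0, 1.0]))  # [[2.0, 4.0], [2.5, 3.0]]
print(conv_params(64, 64, 3), scale_shift_params(64))  # 36928 128
```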

Recovering Input Text from Hidden States: Study of Gradient-Based Inversion of Decoder-Only Language Models

arXiv cs.AI · Mikołaj Słowikowski, Maciej Witold Majewski · 2026-07-01

The paper introduces a continuous embedding-space optimization method for recovering input tokens from last-layer hidden states in decoder-only language models, avoiding hard-token projections during search. Analyses of rank trajectories, per-position loss curves, and discrete loss at commit time show that space-prefixed function words dominate reconstruction failures, while content tokens are recovered near-perfectly (97.5% exact-match rate on 10-token C4 prompts). Compared with SIPIT, the continuous approach makes the search observable and its errors detectable. The results also show that GPT-2's hidden states are highly sensitive to the original input text.

hidden-state inversion · decoder-only language models · embedding-space optimization · discrete loss · rank trajectories

From World Models to World Action Models: A Concise Tutorial for Robotics

arXiv cs.AI · Xiaoxiong Zhang, Xiong Zeng, Wei Zhang · 2026-07-01

The tutorial clarifies the conceptual scope of world models in robotics by categorizing them as action-conditioned predictive models estimating future task-relevant observations or states. It distinguishes observation-space and state-space world models, analyzing their trade-offs in visual fidelity, spatial structure, physical interpretability, and control usability. The authors introduce world action models linking predictions to executable robot actions, summarizing four paradigms: imagine-then-execute, video-feature-conditioned action prediction, joint video-action modeling, and auxiliary video prediction for policy learning.

world models · embodied intelligence · action-conditioned prediction · observation-space · state-space

Pano2World: End-to-End 3D Generation via Unified Multi-View Sequences

arXiv cs.AI · Zhenjia Li, Jinrang Jia, Yifeng Shi · 2026-07-01

Pano2World introduces an end-to-end framework for generating persistent 3D Gaussian scenes from single indoor panoramas, addressing limitations of iterative per-view completion and video-generation priors. The method first reconstructs a coarse 3D Gaussian proxy, renders guidance panoramas at sampled poses, and employs View-Aware Attention Routing for joint denoising of target views with geometric and semantic constraints. A Latent Feature Adapter distills multi-view hidden features into a scene latent, bypassing VAE decoding. Experiments show superior performance on multi-position panoramic novel-view synthesis benchmarks.

3d gaussian scene · panoramic diffusion model · view-aware attention routing · latent feature adapter · novel-view synthesis

Exploring the Semantic Gap in Agentic Data Systems: A Formative Study of Operationalization Failures in Analytical Workflows

arXiv cs.AI · Jalal Mahmud, Eser Kandogan · 2026-07-01

The study identifies a semantic gap in agent-generated analytical workflows, revealing operationalization failures despite successful generation and execution. Through a cross-domain formative analysis of 236 analytical intents across finance, human resources, and public safety domains, the authors identify 153 recurring failures. These failures are categorized into five classes: comparative grounding, process reasoning, quantitative reasoning, role confusion, and policy grounding. The findings suggest that current agentic data systems lack sufficient semantic representations to bridge user-level analytical concepts with executable computations, highlighting the need for richer semantic frameworks in future systems.

semantic gap · analytical workflows · operationalization failures · agentic data systems · semantic representations

LRAT-Catcher: Importing SAT Solver Certificates into Lean4 by Reflection

arXiv cs.AI · Stefan Szeider · 2026-07-01

LRAT-Catcher introduces a tool for importing SAT solver certificates into Lean 4, enabling verification of combinatorial problem results as theorems. The method leverages Lean's formally verified LRAT checker executed via reflection, addressing memory limitations of explicit proof-term imports in Mathlib. It supports cube-and-conquer solving runs entirely within Lean, combining per-cube refutations with cover-completeness certificates into a single unsatisfiability theorem. Verified encodings link CNF-level results to original combinatorial problems. Evaluation demonstrates effectiveness in establishing Schur number S(4) = 44 and Ramsey number R(4,4) = 18 as Lean theorems, outperforming Mathlib's proof-term import and external checker cake_lpr.

sat solver · lrat certificate · lean 4 · cube-and-conquer · combinatorial verification

LeVLJEPA: End-to-End Vision-Language Pretraining Without Negatives

arXiv cs.AI · Lukas Kuhn, Giuseppe Serra, Randall Balestriero, Florian Buettner · 2026-07-01

LeVLJEPA introduces the first fully non-contrastive end-to-end vision-language pretraining method, eliminating negatives, temperature, momentum encoder, and teacher-student schedules. It employs cross-modal prediction with stop-gradient targets and per-modality distributional regularization, enabling stable large-scale training. As a frozen vision-language-model backbone, LeVLJEPA outperforms contrastive baselines on GQA, VQAv2, and POPE benchmarks across two distinct language models, and excels in semantic segmentation while maintaining parity on global readouts like linear probing. These results demonstrate non-contrastive pretraining's efficacy in generating dense semantic vision features.

levljepa · non-contrastive · cross-modal prediction · stop-gradient · semantic segmentation

Active Learning for Cascaded Object Detection: Balancing Coverage and Uncertainty in Table Extraction Pipelines

arXiv cs.AI · Eliott Thomas, Mickael Coustaty, Aurelie Joseph, Gaspar Deloin · 2026-07-01

The authors adapt Uncertainty Herding (UHerding), a hybrid coverage-uncertainty sampling method, to cascaded object detection pipelines for table extraction, addressing inter-stage dependencies between Table Detection (TD) and Table Structure Recognition (TSR). They propose two pipeline-aware extensions: RankFusion, which adds dual-manifold coverage over detection and structure representation spaces, and CAPA, which incorporates stage-dependent gating and per-task uncertainty calibration. Experiments on four datasets (PubTables-1M, FinTabNet, and two private datasets) with annotation budgets ranging from 71 to 500 documents demonstrate that UHerding outperforms baselines, with CAPA emerging as the most consistent strategy, outperforming standard UHerding on three out of four datasets.

uncertainty herding · table detection · table structure recognition · active learning · cascaded pipelines

GaussianFusion: Unified 3D Gaussian Representation for Multi-Modal Fusion Perception

arXiv cs.AI · Xiao Zhao, Chang Liu, Mingxu Zhu, Zheyuan Zhang · 2026-07-01

The paper introduces GaussianFusion, a universal framework for multi-modal fusion perception using 3D Gaussian representation, addressing limitations of discrete bird's-eye view (BEV) grids. The method employs a forward-projection-based Gaussian initialization module and a cross-modal Gaussian encoder with attention-based iterative updates, preserving edge and texture details. Evaluated on nuScenes, GaussianFusion outperforms BEVFusion by 2.6 NDS in 3D object detection and surpasses GaussFormer in 3D semantic occupancy by 1.55 mIoU with 30% fewer Gaussians and 450% speedup.

3d gaussian representation · multi-modal fusion · bird's-eye view · attention mechanism · semantic occupancy

Prototype Memory-Guided Training-Free Anomaly Classification and Localization in Prenatal Ultrasound

arXiv cs.AI · Huanwen Liang, Yuhao Huang, Xiliang Zhu, Yuanji Zhang · 2026-07-01

We introduce a training-free framework for multi-class anomaly classification and localization in prenatal ultrasound, addressing data scarcity and heterogeneity challenges. The method employs a memory bank with multi-granular prototypes to model class-level semantics and anomaly characteristics, a prototype-driven soft merging mechanism for anomaly region detection, and a class-aware refinement strategy leveraging prototype consistency for improved category prediction. Evaluated on a multi-center dataset of 1,149 cases with 2,357 images across 9 categories, the framework outperforms existing approaches.

training-free · multi-granular prototypes · anomaly localization · prototype-driven · class-aware refinement

Phantom References: Hallucinated Citations That Survive Peer Review at Top-Tier Conferences

arXiv cs.AI · Mark Russinovich, Ram Shankar Siva Kumar, Ahmed Salem · 2026-07-01

The study introduces RefChecker, a verification pipeline that detects hallucinated citations in peer-reviewed proceedings by resolving references against multiple bibliographic sources and verifying them via web search. Focusing on identity-level failures (nonexistent works or author mismatches), the authors audit camera-ready papers from ICLR, ICML, NeurIPS, and USENIX Security. Hallucination rates are below 1% per reference, but paper-level failures are notable: 5% of NeurIPS and USENIX Security 2025 papers contain ≥2 hallucinated citations, and rates rise after the release of ChatGPT. The results indicate that peer review does not reliably enforce citation integrity, while automated auditing is feasible at low cost ($0.04/paper).

citation hallucination · bibliographic verification · peer review · large language models · refchecker
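
Here is a minimal sketch of identity-level checking in the spirit of RefChecker. It fuzzy-matches the cited title against a bibliographic index, then confirms that at least one author surname overlaps. The in-memory index, the 0.9 threshold, and the surname-overlap rule are illustrative assumptions; the actual pipeline queries multiple external sources and web search.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    title = re.sub(r"[^a-z0-9 ]", " ", title.lower())
    return re.sub(r"\s+", " ", title).strip()


def last_names(authors):
    return {a.split()[-1].lower() for a in authors if a.strip()}


def check_reference(ref, index, title_threshold=0.9):
    """Classify a citation as 'verified', 'author_mismatch', or 'not_found'.

    ref and index records are dicts with 'title' and 'authors' keys; `index`
    stands in for a bibliographic source (a hypothetical local list here).
    """
    best, best_score = None, 0.0
    for record in index:
        score = SequenceMatcher(None, normalize(ref["title"]), normalize(record["title"])).ratio()
        if score > best_score:
            best, best_score = record, score
    if best is None or best_score < title_threshold:
        return "not_found"  # no sufficiently similar work exists in the index
    if not last_names(ref["authors"]) & last_names(best["authors"]):
        return "author_mismatch"  # the work exists but the cited authors do not match
    return "verified"


index = [{"title": "Attention Is All You Need", "authors": ["Ashish Vaswani", "Noam Shazeer"]}]
print(check_reference({"title": "Attention is all you need.", "authors": ["A. Vaswani"]}, index))
```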

ConRTF: Edge-Constrained Boundary Distribution Refinement for Realtime TransFormer Table Structure Recognition

arXiv cs.AI · Eliott Thomas, Tri-Cong Pham, Mickael Coustaty, Aurelie Joseph · 2026-07-01

The paper introduces ConRTF, a real-time transformer-based table structure recognition method that improves boundary localization through edge-constrained fine-grained learning. The key innovation is an Edge-constrained Fine-grained Localization loss (EFL) that encodes table-specific geometric priors, emphasizing horizontal boundaries for rows and vertical boundaries for columns during training. Integrated with distribution-based boundary refinement (D-FINE), ConRTF achieves +1.6 GriTS improvement over baselines like RT-DETRv2 while maintaining real-time performance, requiring only 2k-3k annotated tables for robust accuracy on PubTables-1M and private datasets.

table structure recognition · boundary localization · geometric priors · real-time detection · distribution-based refinement

LLM-Guided ODE Discovery and Parameter Inference from Small-Cohort Aggregate Data

arXiv cs.AI · Hanning Yang, Meropi Karakioulaki, Lennart Purucker, Tim Litwin · 2026-07-01

AgentODE introduces an end-to-end framework for jointly discovering ODE structures and refining parameter distributions from population-level summary statistics, addressing challenges in rare disease modeling. The method employs an LLM to propose candidate ODE structures and a tool-augmented inference agent that iteratively refines parameter distributions through a diagnosis--update loop. Evaluated on three benchmark problems and two clinical datasets, including recessive dystrophic epidermolysis bullosa (RDEB) with 231 observations across 46 patients, AgentODE recovers functionally consistent ODE structures. It outperforms baselines in sparse, noisy data settings by promoting mechanistically principled structure discovery, even when baselines access individual-level data.

ode discovery · parameter inference · summary statistics · rare diseases · llm-guided

Detecting the Undetectable: Enhancing Unsupervised time series Anomaly Detection via Active Learning

arXiv cs.AI · Seung Hun Han, Hyeongwon Kang, Jinwoo Park, Pilsung Kang · 2026-07-01

The authors propose a novel framework combining active learning with unsupervised models to enhance anomaly detection in time series data. The method introduces a masked time-series reconstruction feedback strategy to learn robust temporal dependencies and a minimax learning strategy to differentially process normal and abnormal samples. Evaluated across 28 test cases involving four multivariate datasets and seven unsupervised backbone models, the framework achieves a 12.39% improvement in AUC, demonstrating its effectiveness in detecting subtle and noisy anomalies.

active learning · time series · anomaly detection · minimax learning · reconstruction feedback

Partial Skeleton Visibility for Action Recognition: A Constrained Field-of-View Approach

arXiv cs.AI · Yingjie Dai, Tianyang Xu, Yanglin Deng, Xiao-Jun Wu · 2026-07-01

The paper introduces PartialVisGraph, a hypergraph framework for robust skeleton-based action recognition under constrained field-of-view (FoV) conditions. The method constructs learnable virtual hyperedges via a soft incidence matrix to capture high-order dependencies and employs a Single-Head Sample-Adaptive Transformer with visibility priors to adaptively aggregate joint features while mitigating occluded joint interference. Evaluated on NTU RGB+D 60 and 120 with realistic FoV simulations, the approach achieves state-of-the-art performance, improving accuracy by up to 68.8% under severe visibility constraints while maintaining superiority in full-visibility settings.

hypergraph · skeleton-based · field-of-view · visibility prior · sample-adaptive transformer

Self-conditioned Flow Map Language Models via Fixed-point Flows

arXiv cs.AI · Jaehoon Yoo, Wonjung Kim, Floor Eijkelboom, Chanhyuk Lee · 2026-07-01

The paper introduces fixed-point flows, a novel class of self-conditioned flow maps for language models that formalize self-conditioning as a fixed-point iteration. The method combines flow map distillation for compressing the flow process with fixed-point distillation for compressing iterations, yielding FMLM$^\star$. Evaluated on OpenWebText, FMLM$^\star$ outperforms state-of-the-art self-conditioned and few-step models in one- and few-step generation tasks.

fixed-point flows · flow map distillation · self-conditioning · few-step generation · language models

Creating Impactful Autonomous Driving Datasets: A Strategic Guide from Research Gap to Benchmark

arXiv cs.AI · Richard Schwarzkopf, Jonas Merkert, Frank Bieder, Annika Bätz · 2026-07-01

The paper presents a strategic framework for creating impactful autonomous driving datasets, addressing a gap in the literature, which focuses on dataset contents rather than design methodology. The approach begins by diagnosing whether a research question is blocked by data or evaluation problems, then selects minimal data operators to close the gap, prioritizing cost-efficiency. The authors analyze major AD datasets through this lens and apply their framework to the KITScenes dataset family. The method includes gap identification, operator choice, sensor suite design, and annotation strategy.

autonomous driving · dataset design · research gap · data operators · annotation strategy

LLVM-Bench: Benchmarking and Advancing Large Language Models for LLVM Compiler Issue Resolution

arXiv cs.AI · Zhao Tian, Yingquan Zhao, Chenyao Suo, Meng Wang · 2026-07-01

The authors introduce LLVM-Bench, the first large-scale benchmark for LLVM compiler issue resolution, comprising 423 validated tasks from the LLVM project, together with LLVM-Gym, an automated evaluation platform. Evaluating four LLMs, six retrieval configurations, and three agents, they find that current techniques are limited by invalid patches and build failures. They then propose LLVM-Ens, a lightweight ensemble method that integrates diverse patches and filters out incorrect candidates, reaching an improved resolution rate of 21.99%.

llvm-bench · llvm-gym · issue resolution · large language models · ensemble learning

Self-GC: Self-Governing Context for Long-Horizon LLM Agents

arXiv cs.AI · Xubin Hao, Hongjin Meng, Xin Yin, Jiawei Zhu · 2026-07-01

Self-GC introduces a self-governing context management system for long-horizon LLM agents, addressing limitations of heuristic pruning and summarization by treating context as indexed, recoverable objects. The method transforms user inputs, tool outputs, and skill states into structured objects, employs a side-channel planner to propose context operations (fold, mask, prune), and enforces recoverable sidecars and cache-aware commits. Evaluated on a 33-session Hard Set, Self-GC prunes 43.95% of prefix tokens while preserving 84.85% of future continuations, outperforming heuristic baselines (54.55%-69.70% no-impact rates). On a 332-session production suite, planner backbones achieve 91.27%-94.58% no-impact rates, reducing input tokens by 10%-15% in production environments.

self-governing context · indexed objects · side-channel planner · recoverable sidecars · cache-aware commit
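
The object-centric design can be sketched as a small store in which fold, mask, and prune never discard data: the original content moves to a sidecar and can be restored. The class and method names are hypothetical. The side-channel planner that proposes operations and the cache-aware commit logic are not modeled.

```python
from dataclasses import dataclass


@dataclass
class ContextObject:
    oid: str
    kind: str            # e.g. "user", "tool", "skill"
    content: str
    state: str = "live"  # live | folded | masked | pruned
    summary: str = ""


class ContextStore:
    """Context as indexed objects whose fold/mask/prune operations are reversible."""

    def __init__(self):
        self.objects = {}  # insertion-ordered: oid -> ContextObject
        self.sidecar = {}  # oid -> original content, kept so every operation can be undone

    def add(self, oid, kind, content):
        self.objects[oid] = ContextObject(oid, kind, content)

    def apply(self, op, oid, summary=""):
        if op not in ("fold", "mask", "prune"):
            raise ValueError(f"unknown operation {op!r}")
        obj = self.objects[oid]
        if oid not in self.sidecar:
            self.sidecar[oid] = obj.content  # save the original before dropping it
        obj.content = ""
        obj.state = {"fold": "folded", "mask": "masked", "prune": "pruned"}[op]
        obj.summary = summary if op == "fold" else ""

    def restore(self, oid):
        obj = self.objects[oid]
        if oid in self.sidecar:
            obj.content = self.sidecar.pop(oid)
        obj.state, obj.summary = "live", ""

    def render(self):
        """Build the prompt text; pruned objects are omitted but stay recoverable."""
        lines = []
        for obj in self.objects.values():
            if obj.state == "live":
                lines.append(f"[{obj.oid}] {obj.content}")
            elif obj.state == "folded":
                lines.append(f"[{obj.oid}] (folded) {obj.summary}")
            elif obj.state == "masked":
                lines.append(f"[{obj.oid}] (masked {obj.kind})")
        return "\n".join(lines)


store = ContextStore()
store.add("u1", "user", "Find flights to Oslo")
store.add("t1", "tool", "<raw search results>")
store.apply("fold", "t1", summary="3 flights found")
print(store.render())  # the tool output appears only as its one-line summary
store.restore("t1")    # the full tool output is back in context
```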

LUMA: Benchmarking Segmentation via a Lightweight Universal Mask Adapter

arXiv cs.AI · Tobias Christian Nauen, Anosh Billimoria, Federico Raue, Stanislav Frolov · 2026-07-01

We introduce LUMA, a lightweight universal mask adapter for backbone-agnostic image segmentation, enabling fair comparison of transformer architectures by decoupling them from decoder designs. LUMA employs a mask-transformer head with cross-attention to extract features from any backbone, matching the state-of-the-art accuracy of EoMT at lower cost. Benchmarking 20 backbones, 11 pretraining schemes, and multiple resolutions on ADE20K and Cityscapes reveals that plain ViT maintains throughput efficiency across resolutions, while pretraining objectives matter more for segmentation quality than architectural choices.

mask-transformer · cross-attention · backbone-agnostic · pretraining · throughput

Multi-Label Node Classification with Label Influence Propagation

arXiv cs.AI · Yifei Sun, Zemin Liu, Bryan Hooi, Yang Yang · 2026-07-01

The paper proposes Label Influence Propagation (LIP), a novel method for multi-label node classification (MLNC) that explicitly models label influence in graph data. LIP decomposes GNN message passing into propagation and transformation, analyzes label influence correlations, and constructs a label influence graph to propagate high-order influences. The framework dynamically adjusts learning by amplifying positive and mitigating negative label contributions. Evaluations on benchmark datasets show LIP consistently outperforms state-of-the-art methods across various MLNC settings.

multi-label node classification · graph neural networks · label influence propagation · message passing · non-euclidean data

Faithful by Definition: Emotion Analysis via Natural Semantic Metalanguage Explications

arXiv cs.AI · Frank Xing, Erik Cambria · 2026-07-01

The paper introduces an explication interface for event-based emotion analysis that guarantees faithful explanations by design. The method uses Natural Semantic Metalanguage (NSM) to map input text to a structured explication with twelve typed slots, then applies a fixed decision list of rules derived from semantic definitions to compute emotion labels. A fine-tuned parser achieves 0.33 accuracy and 0.48 selective accuracy on crowd-sourced event descriptions, performing comparably to black-box models while providing verifiable decision bases. The authors release EmoExpl-1200, a dataset with verification metadata and the full rule set.

natural semantic metalanguage · emotion analysis · explication interface · decision list · faithful explanations
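The "fixed decision list" step described above can be sketched in a few lines of Python. The slot names and rules below are hypothetical placeholders, not the paper's twelve NSM slots or its released rule set; the point is that the first matching rule both decides the label and serves as the explanation, so the explanation is faithful by construction.

```python
# Hypothetical slots and rules for illustration only; not the paper's NSM slots or rule set.
EXAMPLE_RULES = [
    ({"valence": "bad", "cause": "someone else"}, "anger"),
    ({"valence": "bad"}, "sadness"),
    ({"valence": "good"}, "joy"),
]


def classify_emotion(explication, rules, default="neutral"):
    """Apply an ordered decision list to a slot-filled explication.

    Returns (label, matched_conditions); the matched conditions are the exact
    decision basis, so the explanation cannot diverge from the prediction.
    """
    for conditions, label in rules:
        if all(explication.get(slot) == value for slot, value in conditions.items()):
            return label, conditions
    return default, {}


label, basis = classify_emotion({"valence": "bad", "cause": "I"}, EXAMPLE_RULES)
```

Rule order matters: the more specific "anger" rule must precede the broader "sadness" rule, otherwise it would never fire.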

Coachable agents for interactive gameplay

arXiv cs.AI · Roberto Capobianco, Harm van Seijen, Nolan D. Bard, Neil Burch · 2026-07-01

The paper introduces a framework for creating coachable reinforcement learning agents that can adapt their behavior styles in real-time while maintaining task performance. The method combines universal value function approximators (UVFAs) with specialized training scenarios, learning algorithms, and data augmentation. Results demonstrate style-coherent behavior across three domains: Horizon Forbidden West (combat), Gran Turismo (racing), and a humanoid locomotion task, with end users able to dynamically control agent behavior during execution.

reinforcement learning · universal value function approximators · real-time control · behavior styles · data augmentation

Loss Smoothing for Stable Adaptation Under Distribution Shift

arXiv cs.AI · Darshan Patil, Ekaterina Lobacheva, Razvan Pascanu, Sarath Chandar · 2026-07-01

Loss smoothing improves neural network adaptation under distribution shift by interpolating between source and target training objectives, preserving useful features while enabling task specialization. The method addresses abrupt objective transitions that can distort learned representations during fine-tuning and reinforcement learning. Empirical evaluations across supervised shifts, pretrained vision adaptation, offline-to-online reinforcement learning, and language model fine-tuning demonstrate consistent performance improvements. Results suggest smoother objective transitions are broadly beneficial for stable model adaptation.

loss smoothing · distribution shift · fine-tuning · reinforcement learning · objective interpolation
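The interpolation between source and target objectives can be illustrated with a small Python sketch. It assumes a linear ramp over a fixed number of steps; the function names and the `transition_steps` parameter are hypothetical, and the paper may use a different schedule or interpolate at a different level (for example, on targets rather than on scalar losses).

```python
def smoothing_weight(step: int, transition_steps: int) -> float:
    """Weight on the target objective: ramps linearly from 0 to 1 (assumed schedule)."""
    if transition_steps <= 0:
        return 1.0  # no smoothing: switch to the target objective immediately
    return min(1.0, max(0.0, step / transition_steps))


def smoothed_loss(source_loss: float, target_loss: float, step: int, transition_steps: int) -> float:
    """Blend the old and new objectives instead of switching abruptly at step 0."""
    alpha = smoothing_weight(step, transition_steps)
    return (1.0 - alpha) * source_loss + alpha * target_loss
```

With `transition_steps=0` this reduces to the abrupt switch the summary identifies as the source of representation distortion; larger values spread the change over more updates.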

AGI Maze as a Benchmark Framework for World-Modeling Agents

arXiv cs.AI · Alexey Potapov · 2026-07-01

The paper introduces AGI Maze, a lightweight benchmarking framework for evaluating world-modeling capabilities in agents, particularly large language models (LLMs). The framework provides grid-based maze environments with varying difficulty levels, designed to test persistent state representation and reasoning under partial observability. Initial evaluations show vanilla LLMs fail to internally represent maze states during inference, while a baseline agent with working memory shows limited improvement but remains insufficient for reliable maze-solving within human-comparable step budgets.

world-modeling · partial observability · grid-based maze · llm benchmarking · agentic runtime

Identifying Latent Concepts and Structures for Generalized Category Discovery

arXiv cs.AI · Boyang Dai, Chaoqi Chen, Yizhou Yu · 2026-07-01

The paper introduces Compositional Primitive Fields (CPF-GCD), a representation learning framework for Generalized Category Discovery (GCD) that reshapes feature spaces to identify latent concepts and structures. CPF-GCD enforces a low-rank compositional organization by decomposing images into reusable visual primitives and their spatial layouts via a spatial field mechanism. This approach shifts representation learning from global embeddings to structured primitive fields, enabling novel categories to emerge as activation patterns over shared primitives. Experiments show CPF-GCD consistently improves performance across GCD baselines, validating low-rank compositional structure as a key inductive bias for open-world recognition.

generalized category discovery · compositional primitive fields · low-rank organization · spatial field mechanism · open-world recognition

Auditing Forgetting in Limited Memory Language Models

arXiv cs.AI · Arya Raeesi, Hanna Roed · 2026-07-01

The study introduces a causal auditing framework to evaluate forgetting in Limited Memory Language Models (LMLMs), which externalize factual knowledge to databases for deletion-based unlearning. The method varies database states (FULL, DEL-ON, DEL-OFF) to decompose post-deletion behavior into parametric leakage, retrieval-mediated correctness, and retrieval artifacts. Results from 12,228 deletions across 13 databases show near-zero parametric leakage, with residual correctness (0.7%-13.6%) primarily due to near-neighbor retrieval artifacts, indicating unlearning boundaries are database-administered rather than model-controlled.

limited memory language models · causal auditing · parametric leakage · retrieval-mediated correctness · unlearning

HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment

arXiv cs.AI · Shei Pern Chua, Fangzhao Wu · 2026-07-01

We introduce HARC (Harmfulness-And-Refusal Coupling), a fine-tuning method that couples harmfulness and refusal directions across prompt and response token positions in LLMs to enhance safety alignment. Our analysis reveals that jailbreaks suppress either refusal or harmfulness directions pre-generation, and models recognize harmful content during generation even if missed at prompt encoding. HARC confines interventions to the harmfulness-refusal subspace, preserving general capabilities and avoiding over-refusal. Extensive experiments show HARC achieves superior robustness-capability-usability trade-offs compared to six baselines, with harmfulness and refusal directions transferring across five model families and two scales without architecture-specific tuning.

harc · harmfulness-refusal coupling · residual stream · fine-tuning · safety alignment

A Methodology for Investigating AI Patterns Prevalence in Software Repositories

arXiv cs.AI · Srinath Perera, Hasinthaka Piyumal, Frank Leymann, Rania Khalaf · 2026-07-01

The paper presents a methodology for investigating AI pattern prevalence in software repositories, addressing the lack of empirical data on real-world usage. The authors first identify 14 AI pattern classes through a literature review of 44 sources, then measure their prevalence using active learning across 100 GitHub repositories. Their model achieves 56% accuracy and 55% recall in an 8-way classification task, significantly outperforming the 11% random-chance baseline, while prevalence estimation provides usable accuracy bounds for pattern analysis.

ai patterns · prevalence estimation · active learning · software repositories · 8-way classification

Group-Equivariant Poincaré Convolutional Networks

arXiv cs.AI · Aiden Durrant, Rahul Baburajan, Georgios Leontidis · 2026-07-01

Equivariant Poincaré ResNets integrate hyperbolic geometry with discrete symmetry groups ($C_4$ and $D_4$) to address limitations in hyperbolic visual representation learning. The method introduces geometrically safe tensor reshaping, left-regular permutations for hyperbolic group convolutions, and joint-orientation Poincaré Midpoint Batch normalisation, overcoming challenges in applying Euclidean equivariance to hyperbolic space. Empirical results show that embedding equivariance significantly reduces the optimisation space, accelerates convergence, and preserves spatial-group equivariance while respecting Poincaré ball boundary constraints.

hyperbolic geometry · equivariance · poincaré ball · group convolutions · batch normalisation

Cross-Domain Generalization Failure in Lightweight Intrusion Detection Models for IIoT Networks

arXiv cs.AI · MD Azizul Hakim, Md Shihab Uddin, Talha Ibne Anis · 2026-07-01

This study investigates cross-domain generalization failures in lightweight intrusion detection models for Industrial Internet of Things (IIoT) networks. Four lightweight architectures were trained on one IIoT dataset and evaluated on two structurally distinct IIoT datasets without retraining, using a feature representation constrained to attributes available across all sources. Explainability analysis revealed that top-performing models rely heavily on coarse port-category features, with source-domain attack traffic occurring at 96 to 435 times the rate in target domains. Evaluation under imbalanced class distributions showed that protocol choice can reverse perceived generalization challenges. Adversarial robustness was unrelated to cross-network generalization, and recovery through adaptation varied by architecture. The findings emphasize assessing deployment readiness through cross-network evaluation under realistic class distributions.

intrusion detection · industrial internet of things · cross-domain generalization · adversarial robustness · explainability analysis

EgoGapBench: Benchmarking Egocentric Action Selection in Multi-Agent Scenes

arXiv cs.AI · Jihyeok Jung, Jeewu Lee, Sanghyeop Kim, Chanhee Han · 2026-07-01

The paper introduces EgoGapBench, a diagnostic benchmark for evaluating Egocentric Action Selection (EAS) in multi-agent scenes, isolating egocentric perspective understanding from first-person-view input. The benchmark measures an agent's ability to select appropriate actions from its own perspective when other agents are present. Results show humans perform reliably (accuracy not quantified), while both open-source and proprietary multimodal large language models (MLLMs) underperform, often selecting actions of other agents. Fine-tuning on existing egocentric data fails to improve performance and can degrade it, whereas EgoGapBench training data yields improvements without reaching human levels.

egocentric action selection · multi-agent scenes · diagnostic benchmark · first-person-view · multimodal large language models

Flow-Map GRPO: Reinforcement Learning for Few-Step Flow-Map Generators via Anchored Stochastic Composition

arXiv cs.AI · Zhiqi Li, Wen Zhang, Bo Zhu · 2026-07-01

The paper introduces Flow-Map GRPO, a reinforcement learning framework for post-training deterministic few-step flow-map generators like MeanFlow and sCM. The key innovation is Anchored Stochastic Flow Map Composition (ASFMC), a stochasticization mechanism that preserves the original marginal probability path while enabling RL optimization through anchor-based conditional resampling. Experiments on FLUX-based text-to-image generators demonstrate improved performance across reward-based, perceptual, and task-level metrics without modifying the original model parameterization.

flow-map generators · reinforcement learning · stochastic composition · text-to-image generation · probability path

Active-GRPO: Adaptive Imitation and Self-Improving Reasoning for Molecular Optimization

arXiv cs.AI · Xuefeng Liu, Mingxuan Cao, Qinan Huang, Thomas Brettin · 2026-07-01

The paper introduces Active-GRPO, an adaptive reasoning framework for molecular optimization that dynamically switches between imitation learning and reinforcement learning based on reference quality. The method employs two mechanisms: active imitate-reinforce, which transitions from imitation to self-improvement when policy-generated candidates surpass references, and active referencing, which continuously updates references with the best policy-generated candidates. Evaluated on TOMG-Bench MOLOPT, Active-GRPO achieves a statistically significant improvement in SRxSim (0.1773) over GRPO (0.0959) and RePO (0.1665), particularly on LogP, MR, and QED metrics.

molecular optimization · active reasoning · imitation learning · reinforcement learning · reference-guided policy optimization
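The two Active-GRPO mechanisms summarized above (switching from imitation to self-improvement, and replacing references with better policy samples) can be sketched as a small state update in Python. This is a hypothetical illustration, not the authors' implementation: `ActiveReference`, its fields, and the scalar property score are assumptions, and the GRPO training step itself is omitted.

```python
from dataclasses import dataclass


@dataclass
class ActiveReference:
    """Hypothetical reference tracker for one optimization task."""
    reference: str            # reference molecule, e.g. a SMILES string
    reference_score: float    # property score of the reference (assumed scalar)
    mode: str = "imitate"     # "imitate" or "reinforce"

    def update(self, candidates):
        """candidates: list of (molecule, score) pairs sampled from the current policy."""
        best_mol, best_score = max(candidates, key=lambda c: c[1])
        if best_score > self.reference_score:
            # Active referencing: the best policy sample becomes the new reference.
            self.reference, self.reference_score = best_mol, best_score
            # Active imitate-reinforce: the policy beat the reference, so self-improve.
            self.mode = "reinforce"
        else:
            # The reference is still better, so keep imitating it.
            self.mode = "imitate"
        return self.mode
```

Under this sketch the reference quality can only increase over training, since a reference is replaced only by a strictly higher-scoring sample.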

From Technical Metrics to User Perception: A User Study of a Multimodal Human-Robot Interaction System for Object Detection and Grasping

arXiv cs.AI · Jian Song, Tian Zi, Shen Guanting · 2026-07-01

The study demonstrates that a 15-percentage-point improvement in task success rate (75% to 90%) in a multimodal human-robot interaction system yields statistically significant user-perceptible differences. The baseline system integrated Whisper, Florence-2, LLaMA 3.1, and Type-2 fuzzy logic, while the improved version replaced the perception and language modules with Grounding DINO + SAM and Qwen 3.5 9B. In a within-subject study (N=24), 70.83% of participants preferred the improved system (p=0.043), with significantly higher ratings for speed, reliability, and competence (p<0.001, large effect sizes), underscoring the importance of user-centered evaluation alongside technical metrics.

multimodal hri · open-vocabulary detection · type-2 fuzzy logic · within-subject study · likert scale

AI Native Games: A Survey and Roadmap

arXiv cs.AI · Zhiyue Xu, Fandi Meng, Kaijie Xu, Clark Verbrugge · 2026-07-01

This paper defines AI-native games as those where runtime generative AI is constitutive of the core gameplay loop, distinguishing them from AI-augmented games and other boundary artifacts. The authors introduce a dual-axis G/N taxonomy to classify 53 publicly available AI-native games and prototypes, analyzing player-facing game types (G-axis) and dominant AI mechanics (N-axis). The corpus reveals a concentration in language-forward designs like narrative adventure and generative narrative, while semantic adjudication and multi-agent simulation remain underrepresented. The study identifies mechanical invariants as crucial for organizing semantic openness into stable gameplay and outlines a roadmap for future research in controllable generation, multimodal systems, and inference economics.

generative ai · runtime generation · semantic openness · mechanical invariants · dual-axis taxonomy

AI, Trust, and Teaming: The Humans-as-Handlers Approach for Autonomous and Opaque AI Systems

arXiv cs.AI · Nathan G. Wood · 2026-07-01

The article proposes a 'humans-as-handlers' framework for autonomous AI systems in high-stakes domains, drawing an analogy between opaque AI systems and trained animals to clarify human responsibility. The author methodically develops this analogy while acknowledging disanalogous elements, then refines the model by excluding inappropriate animal-handler dynamics. The argument culminates in advocating for authentic human-AI collaboration, where autonomous systems are viewed as goal-oriented partners rather than mere artifacts.

autonomous systems · human-ai teaming · opaque ai · responsibility attribution · human-handler analogy

Cross4D-JEPA: Dense Cross-modal Correspondence Distillation for 4D Point Cloud Representation Learning

arXiv cs.AI · Trung Thanh Nguyen, Hai Nguyen-Truong, Tu Vo, Hoang M. Truong · 2026-07-01

Cross4D-JEPA introduces a teacher-student framework for self-supervised 4D point cloud representation learning that distills dense cross-modal correspondences from frozen 2D foundation models (DINOv2 or V-JEPA 2) into a 4D point encoder. The method maps each 3D point to its corresponding teacher patch feature and trains the student to match these features in latent space without masking, negatives, or decoders. Evaluated on MSR-Action3D, DeformingThings4D, NTU-RGB+D 60, and HOI4D, Cross4D-JEPA outperforms intra-modal and global cross-modal baselines and is competitive with heavier 4D methods. The learned dense representation improves domain transfer, label efficiency, and fine-tuning performance, with a 13x smaller encoder matching a heavyweight pooling backbone.

4d point cloud · cross-modal correspondence · self-supervised learning · teacher-student framework · latent space matching

BaseRT: Best-in-Class LLM Inference on Apple Silicon via Native Metal

arXiv cs.AI · Prabod Rathnayaka, Fabian Waschkowski, Lukas Wesemann · 2026-07-01

BaseRT introduces a native Metal inference runtime for large language models (LLMs) on Apple Silicon, achieving best-in-class throughput by leveraging chip-specific optimizations such as kernel fusion, unified memory-aware tuning, and custom dispatch logic. Unlike existing frameworks like llama.cpp and MLX, BaseRT eliminates abstraction overheads and supports a wide range of model families across eight quantization formats (Q2 to FP16) on M-series devices. Evaluations on Qwen3, Llama 3.2, and Gemma 4 families at Q4 and Q8 quantization demonstrate up to 1.56x higher decode throughput than llama.cpp and 1.35x higher than MLX, with significant gains in prefill for mixture-of-experts models, scaling from sub-1B to 30B parameters.

metal inference · kernel fusion · quantization formats · mixture-of-experts · unified memory

MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos

arXiv cs.AI · Leyuan Yu, Xiao Tang, Minghao Liu, Xinyuan Li · 2026-07-01

MindEdit-Bench introduces a novel benchmark for evaluating object-level counterfactual spatial reasoning in vision-language models (VLMs), addressing limitations in existing observational spatial reasoning tasks. The benchmark comprises six tasks derived from in-the-wild smartphone photo triplets, processed via an automatic 3D scene-graph extraction pipeline. It includes two new tasks—spatial editing and cross-view visibility editing—where correct answers are absent from input images. Evaluated on 15 VLMs using 1,003 human-verified questions, task-wise mean accuracy ranged from 8% to 31%, significantly below human majority-vote accuracy (81%-97%). The structured answer space revealed non-uniform failures, particularly in camera-depth-axis inference and visibility-editing cases.

vision-language models · counterfactual reasoning · scene-graph extraction · spatial editing · visibility editing

PAPA: Online Personalized Active Preference Alignment

arXiv cs.AI · Anindya Sarkar, Nasik Muhammad Nafi, Isaac Lyngaas, Muralikrishnan Gopalakrishnan Meena · 2026-07-01

We introduce Personalized Active Preference Alignment (PAPA), a method for fine-tuning diffusion models using real-time user feedback without requiring a parameterized reward model. PAPA leverages variational inference to enable feedback-efficient preference alignment, addressing challenges in personalized recommender systems. We propose Enhanced PAPA (EPAPA), an improved fine-tuning strategy that reduces computational costs and accelerates convergence. Extensive experiments on class-conditioned and fine-grained alignment tasks demonstrate PAPA's effectiveness. Our code is publicly available for reproducibility.

diffusion models · preference alignment · variational inference · fine-tuning · reinforcement learning

Beyond the Prompt: Jailbreaking Function-Calling LLMs via Simulated Moderation Traces

arXiv cs.AI · Junlong Liu, Haobo Wang, Weiqi Luo, Xiaojun Jia · 2026-07-01

The paper introduces SMT (Simulated Moderation Traces), a black-box attack framework targeting function-calling LLMs by exploiting structural vulnerabilities in stateful environments. SMT constructs multi-turn trajectories simulating moderation-auditing workflows, leveraging fabricated moderation frames to weaken safety constraints and elicit harmful outputs. Evaluations on five commercial LLMs across two safety benchmarks demonstrate SMT's superior attack success rate and HarmScore, requiring minimal queries compared to existing methods. The findings underscore the insufficiency of prompt-level sanitization and advocate for context-aware validation across schemas, arguments, tool outputs, and conversation state.

jailbreak attacks · function-calling llms · simulated moderation traces · multi-turn trajectory · context-aware validation

Predicting Lethal Outcome (Cause) And Understanding Key Biomarkers Linked With Acute Myocardial Infarction Using Deep Artificial Neural Network And Ensemble Of Machine Learning Methodologies

arXiv cs.AI · Sagnik Ghosh · 2026-07-01

The study proposes an automated model combining ensemble machine learning and deep artificial neural networks to predict lethal outcomes of acute myocardial infarction (MI) and identify key biomarkers. The methodology involves data preprocessing with SVMSMOTE and ADASYN for handling imbalanced data, wrapper and embedded feature selection, and feature scaling. The ensemble integrates Logistic Regression, Random Forest, Light-GBM, and Bagging SVM, enhanced by an artificial neural network for improved accuracy. Evaluation metrics include precision and recall, aiming to optimize clinical applicability for faster and more accurate MI diagnosis.

acute myocardial infarction · ensemble learning · artificial neural network · svmsmote · feature selection

A Multi-Resolution Finite-Volume Inspired Deep Learning Framework for Spatiotemporal Dynamics Prediction

arXiv cs.AI · Xin-Yang Liu, Xiantao Fan, Jian-Xun Wang · 2026-07-01

The paper introduces MuRFiV, a Multi-Resolution Finite-Volume-inspired network for predicting spatiotemporal dynamics governed by partial differential equations (PDEs). MuRFiV combines the conservative property of finite volume methods at a global scale with the expressive power of deep learning at a local scale, embedding PDE information into the architecture. Evaluated on Burgers' equation, shallow water equations, and incompressible Navier-Stokes equations, MuRFiV demonstrates superior long-term prediction accuracy and stability during autoregressive rollouts compared to data-driven neural network baselines. This approach underscores the potential of integrating multiresolution learning with finite-volume-inspired inductive bias for robust dynamics prediction.

physics-informed deep learning · finite volume methods · spatiotemporal dynamics · partial differential equations · autoregressive rollouts
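The "conservative property of finite volume methods" that MuRFiV borrows can be shown with a classical, non-learned Python sketch for the 1D Burgers' equation. This is not MuRFiV: it has no network and no multi-resolution structure. It uses a standard local Lax-Friedrichs (Rusanov) interface flux on a periodic grid to show why a flux-difference update conserves the total quantity exactly (up to rounding): every interface flux leaves one cell and enters its neighbor.

```python
def burgers_fv_step(u, dx, dt):
    """One conservative finite-volume step for u_t + (u^2 / 2)_x = 0 on a periodic grid."""
    n = len(u)

    def flux(v):
        return 0.5 * v * v

    # F[i] is the numerical flux through the interface between cell i and cell i + 1.
    F = []
    for i in range(n):
        ul, ur = u[i], u[(i + 1) % n]
        speed = max(abs(ul), abs(ur))  # local wave-speed bound for the Rusanov flux
        F.append(0.5 * (flux(ul) + flux(ur)) - 0.5 * speed * (ur - ul))

    # Each cell changes by outgoing minus incoming flux; F[-1] wraps to the periodic interface.
    return [u[i] - dt / dx * (F[i] - F[i - 1]) for i in range(n)]
```

Summing the update over all cells telescopes the flux differences to zero, which is the structural guarantee a finite-volume-inspired architecture can retain while a learned component refines the fluxes locally. The step is only stable when `dt / dx * max(abs(u))` stays below 1.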

Multi-scale Mixture of World Models for Embodied Agents in Evolving Environments

arXiv cs.AI · Jinwoo Jang, Daniel J. Rho, Sihyung Yoon, Hyunsuk Cho · 2026-07-01

The paper introduces MuSix, a framework for multi-scale reasoning and adaptation in embodied agents, addressing limitations of Mixture of Experts (MoE) in dynamic environments. MuSix employs a two-stage routing mechanism grounded in experiential distance for scale-aware world model selection, with meta-router and per-scale base routers. It features scale-dependent forgetting rates for adaptive knowledge updates and gated inter-scale transfer for hierarchical coherence. Evaluations on EmbodiedBench and HAZARD demonstrate MuSix's superiority over state-of-the-art baselines in multi-scale reasoning and dynamic adaptation.

mixture of experts · multi-scale reasoning · embodied agents · experiential distance · dynamic adaptation

Agri-SAGE: Simulation-Grounded Multi-Agent LLM for Context-Aware Agricultural Advisory Generation

arXiv cs.AI · Vedant Balasubramaniam, Geetha Charan, Manojkumar Patil, Rohit P Suresh · 2026-07-01

Agri-SAGE introduces a simulation-grounded multi-agent LLM framework for context-aware agricultural advisory generation, addressing limitations of static guidelines and physiologically unconvincing LLM recommendations. The method integrates retrieval-grounded multi-agent reasoning with APSIM-based biophysical simulation, evaluating Plan-and-Solve, Tree of Thoughts, and Reflexion approaches. In a 10-year retrospective analysis, all three methods significantly outperform static Package-of-Practice baselines, with Tree of Thoughts achieving peak yields and Reflexion matching agronomic outcomes at lower computational cost via cross-seasonal episodic memory.

multi-agent llm · biophysical simulation · tree of thoughts · cross-seasonal episodic memory · agricultural advisory

Gauging, Measuring, and Controlling Critic Complexity in Actor-Critic Reinforcement Learning

arXiv cs.AI · Konstantin Garbers · 2026-07-01

The paper introduces critic complexity as a diagnostic and intervention dimension for actor-critic reinforcement learning, measured via spectral effective-rank entropy of critic weight matrices. The method tracks complexity alongside return and value-estimation bias in TD3 and PPO experiments, revealing heterogeneous relationships across algorithms and tasks. A spectral-entropy penalty demonstrates controllable complexity, though return effects vary task-dependently.

actor-critic · spectral entropy · td3 · ppo · effective-rank
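A Python sketch of an entropy-based effective-rank measure and a spectral-entropy penalty. It assumes the common definition (Shannon entropy of singular values normalized to sum to one, exponentiated to give the effective rank); the paper's exact normalization and penalty form may differ, and `entropy_penalty` is a hypothetical helper. In practice the singular values of a critic weight matrix `W` could be obtained with `numpy.linalg.svd(W, compute_uv=False)`; here they are passed in directly to keep the sketch dependency-free.

```python
import math


def spectral_entropy(singular_values):
    """Shannon entropy of p_i = s_i / sum(s), ignoring zero singular values."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    return -sum(p * math.log(p) for p in probs)


def effective_rank(singular_values):
    """exp(entropy): 1 for a rank-1 spectrum, k for k equal nonzero singular values."""
    return math.exp(spectral_entropy(singular_values))


def entropy_penalty(layer_singular_values, coeff):
    """Hypothetical regularizer: adding it to the critic loss discourages high spectral entropy."""
    return coeff * sum(spectral_entropy(s) for s in layer_singular_values)
```

A flat spectrum (many equally used directions) gives high entropy and high effective rank, which is the sense in which this measure tracks critic "complexity".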

Real-Time Hard Negative Sampling via LLM-based Clustering for Large-Scale Two-Tower Retrieval

arXiv cs.AI · Ivan Ji, Liuyi Hu, Harrison, Zhao · 2026-07-01

The paper introduces a self-supervised hard negative sampling technique for two-tower retrieval models, leveraging LLM-based clustering to generate challenging negatives during training. The method uses an LLM to learn media representations, producing informative negatives in real-time while maintaining low computational complexity for billion-scale datasets. Experiments on public benchmarks and industrial deployment show superior performance over standard negative sampling methods, with additional benefits in reducing recommendation feedback loops and popularity bias.

two-tower model · hard negative sampling · llm-based clustering · recommendation systems · popularity bias
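One simple way cluster assignments can drive hard negative sampling is sketched below in Python: negatives for a positive item are drawn from the same cluster, so they are semantically close and therefore harder to separate than random negatives. The cluster assignments are taken as given (standing in for the LLM-based clustering step), the function names are hypothetical, and the paper's real-time sampling procedure may differ.

```python
import random
from collections import defaultdict


def build_cluster_index(item_clusters):
    """item_clusters: dict item_id -> cluster_id (assumed output of an LLM-based clustering step)."""
    index = defaultdict(list)
    for item, cluster in item_clusters.items():
        index[cluster].append(item)
    return index


def sample_hard_negatives(positive, item_clusters, index, k, rng):
    """Draw up to k negatives from the positive item's own cluster, excluding the positive."""
    pool = [i for i in index[item_clusters[positive]] if i != positive]
    return rng.sample(pool, min(k, len(pool)))


clusters = {"a": 0, "b": 0, "c": 0, "d": 1}
index = build_cluster_index(clusters)
negatives = sample_hard_negatives("a", clusters, index, 2, random.Random(0))
```

The index is built once, so each lookup during training costs only a dictionary access plus sampling from one cluster, which is what keeps this style of sampling cheap at large scale.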

VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement

arXiv cs.AI · Seohyun Lee, Seoung Choi, Dohwan Ko, Jongha Kim · 2026-07-01

VideoSearch-R1 introduces an agentic framework for iterative video retrieval and reasoning, addressing limitations in existing approaches that treat retrieval as a static preprocessing step. The method employs Soft Query Refinement (SQR) to adjust search queries in a continuous latent space, trained via Group Relative Policy Optimization (GRPO) using task-level rewards. The framework achieves state-of-the-art performance on Video Corpus Moment Retrieval (VCMR) across three datasets, demonstrating efficient query refinement with fewer generated tokens than text-level alternatives.

video retrieval · soft query refinement · group relative policy optimization · temporal grounding · agentic framework
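Since VideoSearch-R1 is trained with Group Relative Policy Optimization, the core GRPO idea, a group-relative advantage, is sketched below in Python. Each sampled rollout's task-level reward is standardized against the other rollouts for the same input, so no learned value function is needed. This covers only the advantage computation; the clipped policy ratio, any KL term, and the soft query refinement itself are omitted.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

When every rollout in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.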

Search-Based Spatiotemporal and Multi-Robot Motion Planning on Graphs of Space-Time Convex Sets

arXiv cs.AI · Jingtao Tang, Zining Mao, Lufan Yang, Hang Ma · 2026-07-01

We introduce a search-based algorithmic framework for spatiotemporal and multi-robot motion planning using graphs of space-time convex sets (ST-GCSs), where collision-free regions are represented as convex sets in space-time. The method formulates time-optimal planning as a graph-search problem, employing best-first search with admissible heuristics, dominance checks, and continuous trajectory optimization. An Exact Convex Decomposition (ECD) scheme is proposed to reserve trajectory occupancies, enabling unified handling of dynamic obstacles and multi-robot interactions. Experiments demonstrate significant speedups over existing planners while maintaining high solution quality, particularly in environments with narrow and transient feasible regions. The multi-robot planner scales efficiently, solving instances with up to 100 robots within minutes.

space-time convex sets · multi-robot motion planning · exact convex decomposition · best-first search · trajectory optimization

Learning Gait-Aware Quadruped Locomotion with Temporal Logic Specifications

arXiv cs.AI · Merve Atasever, Cagan Bakirci, Alfredo Reina Corona, Keyan Azbijari · 2026-07-01

The paper introduces a reinforcement learning framework for quadruped locomotion that uses Signal Temporal Logic (STL) to specify gait behaviors through parameterized constraints, including safety bounds and gait synchronization. The method develops a reward shaping mechanism from STL robustness approximations, compatible with Proximal Policy Optimization (PPO), and demonstrates it on Google's Barkour robot in MuJoCo XLA. Results show improved velocity tracking and training stability compared to hand-crafted rewards, with parallelization and domain randomization enhancing robustness.

signal temporal logic · quadruped locomotion · reward shaping · proximal policy optimization · domain randomization
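The STL robustness semantics behind this kind of reward shaping can be sketched in Python. "Always" takes the worst-case margin over a trace (a min) and "eventually" the best-case margin (a max); a positive value means the specification holds, and its magnitude says by how much. The `soft_min` helper is a log-sum-exp smoothing, one common way to get a differentiable approximation. It is an assumption here, not necessarily the approximation the paper uses, and the example constraints are hypothetical rather than the paper's gait specifications.

```python
import math


def always_below_robustness(signal, bound):
    """Robustness of G(x < bound): the smallest margin over the trace."""
    return min(bound - x for x in signal)


def eventually_above_robustness(signal, threshold):
    """Robustness of F(x > threshold): the largest margin over the trace."""
    return max(x - threshold for x in signal)


def soft_min(values, temperature=10.0):
    """Log-sum-exp under-approximation of min; approaches min as temperature grows."""
    m = min(values)
    return m - math.log(sum(math.exp(-temperature * (v - m)) for v in values)) / temperature
```

Replacing the hard `min` with `soft_min` spreads gradient signal across all time steps instead of only the single worst one, which is one reason smoothed robustness tends to train more stably under PPO than a sparse pass/fail reward.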

PHREEQC-MCQ-200: A Diagnostic Benchmark for Tool-Augmented Scientific Simulator Agents

arXiv cs.AI · Ke Zhang, Sahchit Chundur, Mohammad Javad Qomi, Maziar Raissi · 2026-07-01

The authors introduce PHREEQC-MCQ-200, a diagnostic benchmark for evaluating tool-augmented LLM agents on aqueous-geochemistry simulations. The benchmark comprises 200 multiple-choice questions derived from 21 PHREEQC scenarios, requiring agents to construct inputs, execute simulations, and interpret outputs. Tool access improves aggregate accuracy but also exposes non-monotonic regressions, with performance varying by output-access protocol and model capability. The work advocates for comprehensive evaluation metrics, including retention rates, output-access sensitivity, and failure analysis, in scientific agent assessments.

tool-augmented agents · aqueous-geochemistry · phreeqc · diagnostic benchmark · in-context learning

EO-VGGT: Orbital Ray-Conditioned 3D Foundation Models for Satellite Multi-View Reconstruction

arXiv cs.AI · Qiyan Luo, Yingdong Pi, Lekang Wen, Jie Yang · 2026-07-01

EO-VGGT introduces a framework adapting perspective-driven 3D foundation models to satellite multi-view reconstruction by embedding explicit orbital geometry. The method combines: (1) Geometry-Correlation Constrained Selection (GCCS) for input sequence optimization, (2) Sensor-Ray Encoder (SRE) parameterizing pushbroom lines of sight into geometric tokens, and (3) Ray-Pointing-Aware Adapter (RPAA) integrating tokens via gated residual blocks into a frozen transformer backbone. Results demonstrate that explicit geometry integration with optimized view selection enables robust feed-forward satellite 3D reconstruction, addressing structural discrepancies between perspective assumptions and orbital pushbroom geometry.

3d reconstruction · pushbroom geometry · foundation models · multi-view imagery · orbital kinematics

Personalization as Inverse Planning: Learning Latent Design Intents for Agentic Slide Generation via Structural Denoising

arXiv cs.AI · Tianci Liu, Zihan Dong, Linjun Zhang, Haoyu Wang · 2026-07-01

The paper introduces SPIRE, a framework for Page-level Slide Personalization (PSP) that formulates the task as an inverse planning problem in order to learn latent design intents without tool-specific assumptions. SPIRE employs structural denoising, in which two RL agents collaboratively refine executable designs by denoising corrupted slide structures; the authors prove this objective is a consistent surrogate for PSP. Theoretical analysis shows reduced policy gradient variance, and experiments show SPIRE outperforming existing methods.

inverse planning · structural denoising · page-level personalization · reinforcement learning · latent design intents

The Illusion of High Utility in Safety Alignment of Text-to-Image Diffusion Models

arXiv cs.AI · Adeel Yousaf, Soumik Ghosh, James Beetham, Amrit Singh Bedi · 2026-07-01

The study reveals that safety-aligned text-to-image (T2I) diffusion models exhibit degraded semantic fidelity despite high scores on coarse utility metrics, as measured by TIFA (Text-to-Image Faithfulness evaluation). It identifies semantic collapse in text-encoder embeddings as the root cause, characterized by reduced spread and distorted inter-prompt similarity. To address this, the authors propose Structure-Aware Geometric Regularization (SAGE), a safety alignment objective preserving embedding structure. SAGE improves structured utility by 5.0% on TIFA while maintaining safety and coarse-grained performance.

text-to-image diffusion · safety alignment · semantic collapse · structure-aware regularization · tifa benchmark
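The summary does not spell out SAGE's objective. As a minimal sketch of the two symptoms it names (reduced spread, distorted inter-prompt similarity), the code below measures embedding spread and a hypothetical structure penalty: the mean squared change in the pairwise cosine-similarity matrix between a frozen reference text encoder and the safety-aligned one. `structure_penalty`, `spread`, and the toy embeddings are illustrative assumptions, not the paper's formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(embs):
    """Pairwise cosine similarities between prompt embeddings."""
    return [[cosine(a, b) for b in embs] for a in embs]


def spread(embs):
    """Mean distance of embeddings from their centroid (a simple spread statistic)."""
    dim = len(embs[0])
    centroid = [sum(e[d] for e in embs) / len(embs) for d in range(dim)]
    return sum(math.dist(e, centroid) for e in embs) / len(embs)


def structure_penalty(ref_embs, new_embs):
    """Hypothetical regularizer: mean squared change in inter-prompt similarity."""
    s_ref, s_new = similarity_matrix(ref_embs), similarity_matrix(new_embs)
    n = len(ref_embs)
    total = sum((s_ref[i][j] - s_new[i][j]) ** 2 for i in range(n) for j in range(n))
    return total / (n * n)


# Toy example: a "collapsed" encoder maps three distinct prompts to nearly the same direction.
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
collapsed = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.15]]
print(structure_penalty(reference, collapsed), spread(reference), spread(collapsed))
```

A penalty of this kind is zero when the aligned encoder keeps the reference geometry and grows as prompts collapse toward one another, which is the failure mode the paper attributes the hidden utility loss to.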

Holographic Quantum Transformer: A Generalist Neuro-Symbolic Architecture for Solving Frustrated Systems via Generative Attention

arXiv cs.AI · Xingran Guo, Tiaojie Xiao, Jie Liu, Keqin Li · 2026-07-01

The Holographic Quantum Transformer (HQT) is a neuro-symbolic architecture for simulating frustrated quantum systems via generative attention. It uses global self-attention to capture non-local entanglement and is validated on the $J_1$-$J_2$ Heisenberg model at criticality ($J_2=0.5$), reaching a ground-state energy per site ($E/N$) of $-0.5001(1)$ on $8 \times 8$ lattices. HQT yields interpretable attention maps and supports a zero-shot size-extrapolation protocol (Holographic Transfer) that transfers to $10 \times 10$ lattices with $E/N = -0.49782(3)$ without retraining.

holographic quantum transformer · generative attention · heisenberg model · zero-shot extrapolation · non-local entanglement

NeuroCogMap Reveals Cognitive Organization of Large Language Models

arXiv cs.AI · Zhongxiang Sun, Haolang Lu, Qiang Ma, Qi Li · 2026-07-01

The study introduces NeuroCogMap, a cognitive neuroscience-inspired framework that organizes internal features of large language models (LLMs) into functional parcels linked to interpretable cognitive functions. Using this method, the authors identify stable, semantically coherent organizations that partially generalize across models and correlate with model outputs. Key findings include distinct neural signatures for major LLM failures (hallucination, bias, refusal failure, sycophancy), improved prediction of human cortical responses during language comprehension (strongest in higher-order association cortex), and insights for refining classical human decision-making models.

functional parcels · cognitive hierarchy · representational systems · behavioural-control · cortical responses

Learning Generalizable Skill Policy with Data-Efficient Unsupervised RL

arXiv cs.AI · Jongchan Park, Seungjun Oh, Seungho Baek, Yusung Kim · 2026-07-01

The paper introduces GenDa (Generalizable Data-efficient Agent), a unified framework addressing two bottlenecks in unsupervised reinforcement learning (URL): non-stationary skill semantics and brittle generalization. GenDa employs a skill relabeling mechanism to mitigate non-stationarity and improve data efficiency during pre-training, alongside a Complementary Information Bottleneck (CIB) that encourages a focus on ego-centric features and robustness to distribution shifts in downstream tasks. Experimental results show that GenDa significantly improves URL scalability, achieving superior generalizability and data efficiency. Code and videos are available online.

unsupervised reinforcement learning · skill-conditioned policies · skill relabeling · complementary information bottleneck · distribution shifts

MalariAI: A Label-Resilient Decoupled Framework for Universal Cell Segmentation and Explainable Stage Classification in Dense Malaria Blood Smears

arXiv cs.AI · Kaysarul Anas Apurba, Md Hasibul Hasan, Mohammed Ali, Tanzilur Rahman · 2026-07-01

MalariAI introduces a decoupled framework for universal cell segmentation and explainable stage classification in malaria blood smears, addressing three key failure modes in existing deep learning systems. The framework employs a two-stage pipeline: Stage 1 uses an annotation-agnostic distance-transform guided watershed algorithm to isolate cells, achieving 75.95% cell recovery on the NIH BBBC041 test set without ground-truth input. Stage 2 fine-tunes EfficientNet-B0 with Focal Loss for classification, attaining 98.36% overall accuracy and significantly outperforming Faster R-CNN on rare stages. Grad-CAM++ heatmaps provide per-cell spatial evidence for clinical audit, enabling microscopists to verify predictions at the individual parasite level.

cell segmentation · grad-cam++ · focal loss · watershed algorithm · efficientnet-b0
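Focal Loss, which MalariAI's Stage 2 uses against class imbalance, is a standard published objective: $FL(p_t) = -\alpha (1 - p_t)^\gamma \log p_t$. The sketch below implements it for a single multi-class sample; $\gamma = 2$ and $\alpha = 1$ are illustrative defaults, not hyperparameters reported in the summary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0, alpha=1.0):
    """Multi-class focal loss for one sample: -alpha * (1 - p_t)^gamma * log(p_t).

    With gamma = 0 this reduces to (alpha-weighted) cross-entropy; larger gamma
    down-weights examples the model already classifies confidently.
    """
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy = [2.0, 0.0, 0.0]  # toy logits for one cell, true class 0
print(focal_loss(easy, 0, gamma=0.0), focal_loss(easy, 0, gamma=2.0))
```

For these toy logits p_t ≈ 0.79, so γ = 2 scales the cross-entropy term by (1 - 0.79)² ≈ 0.045; confidently classified cells contribute little, and rare, misclassified stages dominate the gradient.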

Learning to Compose: Revisiting Proxy Task Design for Zero-Shot Composed Image Retrieval

arXiv cs.AI · Jingjing Zhang, Lei Zhang, Zheren Fu, Zhendong Mao · 2026-07-01

FoCo introduces an approach to Zero-Shot Composed Image Retrieval (ZS-CIR) that learns the composition function through two coordinated proxy tasks: text-anchored visual aggregation, which focuses on modification-relevant visual content, and context-conditioned semantic completion, which transforms the aggregated visual features into coherent composed representations. Both tasks are trained jointly with a cross-instance contrastive objective. Extensive experiments on four ZS-CIR benchmarks show state-of-the-art performance and improved generalization, addressing the reliance of existing methods on predefined composition mechanisms.

zero-shot composed image retrieval · proxy tasks · text-anchored visual aggregation · context-conditioned semantic completion · cross-instance contrastive objective

MEPA: Multi-Scale Representation Alignment for Visual Autoregressive Modeling with Mixture of Experts

arXiv cs.AI · Nuoyan Zhou, Zhijun Tu, Lei Yu, Kun Cheng · 2026-07-01

The paper introduces MEPA, a Multi-scale Representation Alignment framework for Visual AutoRegressive Modeling (VAR) that addresses deficiencies in multi-scale representation learning. The method employs a scale-aware token-routed Mixture of Experts (MoE) architecture to enable scale-adaptive expert selection and decoupled representation learning across scales. It enhances semantic modeling at early scales by incorporating external self-supervised features via a residual feature aggregation scheme. Experiments on the ImageNet 256×256 benchmark show better FID than the dense baseline, achieved with half the training epochs, a smaller parameter budget, and only a marginal increase in training cost.

visual autoregressive modeling · mixture of experts · multi-scale representation · residual feature aggregation · self-supervised features

When AI meets quantum information: A comprehensive review

arXiv cs.AI · Min Chen, Yu Gan, Xin Jin, Yuqing Li · 2026-07-01

This survey comprehensively reviews the bidirectional interface between artificial intelligence (AI) and quantum information (QI). It organizes recent progress in AI for QI around key tasks including quantum algorithm discovery, hardware stabilization, and measurement information extraction, while examining QI for AI through quantum computational speedups, neural network design, and tensor-network representations. The review identifies cross-cutting challenges in reproducibility, scalability, hardware realism, and co-design, emphasizing the need for tighter integration of theory, experiment, and hybrid quantum-classical systems to advance the field.

quantum algorithm discovery · tensor-network representations · hybrid quantum-classical systems · hardware stabilization · measurement information extraction

Enhancing Flow Matching with A Unified Guidance Framework for Efficient and Robust Speech Synthesis

arXiv cs.AI · Zuda Yu, Qianhui Xu, Ting Chen, Junhui Zhang · 2026-07-01

The authors propose a unified guidance framework to enhance Flow Matching (FM) for speech synthesis, addressing high inference latency and timbre leakage. The framework combines Data-guidance through heterogeneous augmentation to disentangle linguistic content from acoustic residue, and Model-guidance via trajectory rectification and an intrinsic guidance objective, eliminating Classifier-Free Guidance overhead. Experiments show the framework accelerates inference by nearly 3× while improving speaker similarity compared to state-of-the-art baselines.

flow matching · inference latency · timbre leakage · trajectory rectification · classifier-free guidance

SoK: Attack and Defense Landscape of Mobile On-device AI Systems

arXiv cs.AI · Yujin Huang, Xin Zheng, Xingliang Yuan, Kwok-Yan Lam · 2026-07-01

This paper presents the first comprehensive systematization of knowledge (SoK) on security aspects of mobile on-device AI (MoAI) systems, which integrate locally deployed AI models with mobile software components. The authors analyze security pillars, attack vectors, and defense mechanisms specific to MoAI systems, identifying unresolved research gaps and proposing future directions. The work establishes a systematic framework for understanding MoAI security landscapes, serving as a foundation for building secure systems and advancing research in this domain. Companion resources are provided on GitHub to support further investigation.

mobile on-device ai · security pillars · attack vectors · defense mechanisms · systematization of knowledge

DiscoLoop: Looping Discrete Embeddings and Continuous Hidden States for Multi-hop Reasoning

arXiv cs.AI · Hengyu Fu, Tianyu Guo, Zixuan Wang, Hanlin Zhu · 2026-07-01

DiscoLoop introduces a looping architecture combining discrete embeddings and continuous hidden states to address multi-hop reasoning challenges in Transformers. The method mitigates the depth-local storage problem by reusing memory across loops and employs a training-free realignment intervention to improve hidden state alignment. Evaluated on symbolic and synthetic-language multi-hop reasoning tasks, DiscoLoop achieves near-perfect accuracy with fewer training steps. Pretraining experiments demonstrate lower training loss and superior benchmark performance compared to looped Transformer baselines, indicating transferability to practical language modeling.

multi-hop reasoning · looped transformers · depth-local storage · discrete embeddings · hidden-state alignment

K-Inverse-RFM: A Modified RFM that Bridges the Gap to Neural Networks for Data-Corrupted Mathematical Tasks

arXiv cs.AI · Gil Pasternak · 2026-07-01

We introduce K-Inverse-RFM, a modified Recursive Feature Machine (RFM) that bridges the performance gap with Feedforward Neural Networks (FNNs) in data-corrupted mathematical tasks. RFMs, which utilize the Average Gradient Outer Product (AGOP) for feature learning, typically underperform FNNs in noisy, complex, and class-imbalanced scenarios. Our method applies a transformation to training labels, enhancing RFM performance in these challenging settings. Experimental results demonstrate that K-Inverse-RFM not only closes the performance gap with FNNs but, in some cases, surpasses them. This work highlights the potential of kernel machines to achieve neural network-level performance in corrupted data environments.

recursive feature machines · average gradient outer product · feedforward neural networks · data-corrupted tasks · class-imbalanced data
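RFMs learn features through the Average Gradient Outer Product, $M = \frac{1}{n}\sum_i \nabla f(x_i)\,\nabla f(x_i)^\top$, which highlights the input directions a predictor is sensitive to. The sketch below computes it with central finite-difference gradients for an arbitrary scalar predictor `f`; it illustrates only the standard AGOP step, since the summary does not specify K-Inverse-RFM's label transformation.

```python
def numerical_grad(f, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at point x."""
    grad = []
    for d in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[d] += eps
        x_minus[d] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad


def agop(f, points):
    """Average Gradient Outer Product: M = (1/n) * sum_i grad f(x_i) grad f(x_i)^T."""
    dim = len(points[0])
    M = [[0.0] * dim for _ in range(dim)]
    for x in points:
        g = numerical_grad(f, x)
        for a in range(dim):
            for b in range(dim):
                M[a][b] += g[a] * g[b] / len(points)
    return M


# Toy predictor that depends only on the first coordinate.
predictor = lambda x: 3.0 * x[0]
M = agop(predictor, [[0.0, 1.0], [2.0, -1.0], [0.5, 0.5]])
print(M)
```

Here the predictor ignores its second input, so the AGOP is (numerically) zero along that axis. In the original RFM formulation, M is then used as a Mahalanobis metric inside a kernel (typically Laplace), and kernel fitting and AGOP updates alternate.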

RetailSMV: Exocentric vs. Egocentric Adaptation of Foundation Video World Models in Retail

arXiv cs.AI · Amirreza Rouhi, Rajat Aggarwal, Parikshit Sakurikar, Anoop M. Namboodiri · 2026-07-01

The paper investigates parameter-efficient adaptation of foundation video diffusion models to retail environments using synchronized egocentric and exocentric video data. It introduces RetailSMV, a corpus of 32,105 captioned retail clips from five supermarkets, and compares three LoRA configurations of Cosmos3-Nano (egocentric-only, exocentric-only, combined). On a 200-clip test set, exocentric-only adaptation outperforms combined adaptation on six of seven metrics (including LPIPS, PSNR, and DreamSim), despite using fewer training samples. The study also finds that exocentric data benefits egocentric adaptation but not vice versa, with the largest adaptation gap observed in near-horizon predictions.

foundation video models · parameter-efficient adaptation · low-rank adaptation · egocentric video · exocentric video

Mapping the Evaluation Frontier: An Empirical Survey of the Bias-Reliability Tradeoff Across Eleven Evaluator-Agent Conditions

arXiv cs.AI · Zewen Liu · 2026-07-01

This study empirically validates the bias-reliability tradeoff in LLM evaluation systems by expanding prior evidence from 5 to 11 evaluator-agent conditions. The authors measure evaluator coupling (γ), strategy diversity (H), and small-sample reliability (CV(N=5)) across conditions, with complete triples for five cases. Results confirm the tradeoff: low γ (<0.2) yields high CV (>1.0), while high γ (>0.9) achieves low CV (<0.16), with strong negative correlation (r=-0.989) between H and γ. GPT-4o conditions show anomalous γ=0.000 and H=1.000, attributed to API version drift. No condition achieves both low γ (<0.2) and low CV (<0.3).

bias-reliability tradeoff · evaluator coupling · strategy diversity · small-sample reliability · version drift
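CV(N=5) is a coefficient of variation over five evaluation runs: standard deviation divided by mean. The sketch below assumes the sample standard deviation and positive scores (the summary does not say which estimator the study uses), and `in_target_region` simply encodes the thresholds quoted above (γ < 0.2 and CV < 0.3). The score lists are made-up illustrations.

```python
import statistics


def coefficient_of_variation(scores):
    """CV = sample standard deviation / mean; assumes >= 2 scores and a positive mean."""
    return statistics.stdev(scores) / statistics.mean(scores)


def in_target_region(gamma, cv, gamma_max=0.2, cv_max=0.3):
    """The low-coupling, reliable corner that, per the summary, no condition reached."""
    return gamma < gamma_max and cv < cv_max


stable = [0.80, 0.82, 0.78, 0.81, 0.79]   # hypothetical tightly coupled evaluator
erratic = [0.10, 0.90, 0.05, 0.60, 0.00]  # hypothetical loosely coupled evaluator
print(coefficient_of_variation(stable), coefficient_of_variation(erratic))
```

With only five runs, a single outlier can push CV above 1, which is why the study treats small-sample reliability as one axis of the tradeoff.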

Learning When to Listen: Gated Affect Fusion for Human Motion Prediction

arXiv cs.AI · Jingni Huang · 2026-07-01

The study introduces the Gated Affect Transformer (GAT) for human motion prediction, which dynamically regulates cross-modal fusion between body pose trajectories and facial affect representations. The method combines MediaPipe pose tracking with HSEmotion facial features, using a gating mechanism to suppress noisy affect signals while preserving useful short-term cues. Results show that affect features help only at short horizons (≤30 frames) and that naive early fusion degrades performance, whereas GAT improves robustness in unconstrained settings.

human motion forecasting · multimodal fusion · affect-conditioned prediction · gated transformer · kinematic continuity
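A gate that suppresses noisy affect signals is commonly built as a learned sigmoid weight on a residual branch. The sketch below shows that generic pattern with a scalar gate computed from the concatenated pose and affect features. It assumes the affect features are already projected to the pose dimension and uses hand-set, hypothetical gate weights; it is not GAT's actual architecture.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(pose, affect, w_gate, b_gate):
    """Residual gated fusion: fused = pose + g * affect, with scalar gate g in (0, 1).

    The gate is computed from the concatenated features [pose; affect], so a trained
    model can learn to close it (g -> 0) when the affect signal is uninformative.
    """
    features = pose + affect  # list concatenation, not elementwise addition
    g = sigmoid(sum(w * x for w, x in zip(w_gate, features)) + b_gate)
    fused = [p + g * a for p, a in zip(pose, affect)]
    return fused, g


pose, affect = [1.0, 2.0], [0.5, -0.5]
closed, g_closed = gated_fusion(pose, affect, [0.0] * 4, -20.0)  # gate nearly shut
opened, g_open = gated_fusion(pose, affect, [0.0] * 4, 20.0)     # gate nearly open
print(closed, g_closed, opened, g_open)
```

Because the pose path is always passed through unchanged, a closed gate degrades gracefully to a pose-only model, which matches the motivation of avoiding the losses seen with naive early fusion.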

An LLM-Based Framework for Intent-Driven Network Topology Design

arXiv cs.AI · Kholoud El-Habbouli, Fen Zhou, Stephane Huet · 2026-07-01

The paper presents an LLM-based framework for generating network topologies from natural language requirements, combining hierarchical modeling and systematic validation. It evaluates proprietary and open-weight LLMs across four scenarios, measuring structural correctness via node/edge F1-scores and resilience via connectivity metrics. Results include analysis of common failure modes like interface mismatches, providing benchmarks for LLM performance in constraint-compliant topology synthesis.

network topology · large language models · constraint-driven pipeline · resilience metrics · hierarchical modeling
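Node and edge F1 compare a generated topology against a reference as sets. A minimal sketch, assuming undirected links given as `(u, v)` pairs and exact node-name matching; the paper may normalize names or treat interfaces differently, and the router names are hypothetical.

```python
def f1(predicted, reference):
    """Set-based F1 between predicted and reference items."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)


def edge_set(edges):
    """Undirected edges as frozensets, so (a, b) and (b, a) count as the same link."""
    return {frozenset(e) for e in edges}


def node_set(edges):
    """All nodes that appear in at least one edge."""
    return {n for e in edges for n in e}


def topology_f1(pred_edges, ref_edges):
    """Return (node F1, edge F1) for a generated topology versus a reference."""
    return (f1(node_set(pred_edges), node_set(ref_edges)),
            f1(edge_set(pred_edges), edge_set(ref_edges)))


reference = [("r1", "r2"), ("r2", "r3"), ("r3", "r1")]
generated = [("r2", "r1"), ("r2", "r3"), ("r3", "r4")]
print(topology_f1(generated, reference))
```

Here the generated design recovers all three reference routers but adds a spurious `r4`, so node F1 is 6/7 ≈ 0.857, while one wrong link leaves edge F1 at 2/3.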

What's Hidden Matters: Identifying Planning-Critical Occluded Agents using Vision-Language Models

arXiv cs.AI · Amirhosein Chahe, Tyler Naes, Jovin D'sa, Faizan M. Tariq · 2026-07-01

The paper introduces a novel framework for autonomous vehicles to identify planning-critical occluded agents using Vision-Language Models (VLMs). The method employs Planning KL-divergence (PKL) to rank occluded agents by their impact on trajectory planning, then uses GPT-5 to generate structured annotations. Evaluated on nuScenes, fine-tuned VLMs show 30% improvement over random sampling, with smaller models outperforming larger zero-shot counterparts. This work bridges perception-planning gaps for risk assessment in autonomous driving.

vision-language models · planning kl-divergence · autonomous driving · occluded agents · nuscenes
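The ranking step can be pictured as follows: for each occluded agent, compare the planner's distribution over candidate ego trajectories with and without that agent, and sort agents by the KL divergence between the two. This is a sketch under assumptions: the planner interface, the discretization into candidate trajectories, and the KL direction are hypothetical, and the toy planner simply swaps in a precomputed distribution.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions over the same candidate trajectories."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def rank_occluded_agents(planner, scene, occluded_ids):
    """Rank occluded agents by how much revealing each one shifts the planner's output.

    `planner(scene, visible)` is a hypothetical function returning a probability
    distribution over a fixed set of candidate ego trajectories.
    """
    baseline = planner(scene, visible=set())
    scores = {a: kl_divergence(planner(scene, visible={a}), baseline) for a in occluded_ids}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy planner over three candidate trajectories: [proceed, slow down, stop].
SCENE = {
    "base": [0.7, 0.2, 0.1],
    "agents": {"pedestrian": [0.1, 0.3, 0.6], "parked_car": [0.65, 0.25, 0.1]},
}


def toy_planner(scene, visible):
    dist = scene["base"]
    for agent in visible:
        dist = scene["agents"][agent]
    return dist


print(rank_occluded_agents(toy_planner, SCENE, ["parked_car", "pedestrian"]))
```

The hidden pedestrian flips the plan from proceeding to stopping and therefore ranks first, while the parked car barely changes the distribution; agents near the top of such a ranking are the ones worth annotating as planning-critical.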

Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds

arXiv cs.AI · Dong Zhang · 2026-06-30

The authors introduce a four-stage diagnostic protocol to evaluate LLMs' physics reasoning in unfamiliar frameworks, addressing limitations of accuracy-based benchmarks. The method combines locked pre-registrations, fresh sessions, dual-LLM judging, and human audits, applied to three parallel physics worlds: a single-equation counterfactual ($F=mv$), Aristotelian mechanics, and Decay World. Testing Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro, composite PASS rates were 6/15, 6/15, and 0/15 respectively. Key findings include a qualitative-quantitative asymmetry in Decay World predictions, LLM-judge unreliability across frameworks, and weak self-review performance, with models failing to detect errors in ≥66.7% of trials. Full prompts, responses, verdicts, and audit records are released.

large-language-model · physics reasoning · counterfactual world · dual-llm judging · self-review

Entropy-Regularized Probabilistic Gates for Sparse Model Discovery in Scarce-Data Federated Learning

arXiv cs.AI · Krishna Harsha Kovelakuntla Huthasana, Alireza Olama, Andreas Lundell · 2026-06-30

The paper introduces entropy-regularized probabilistic gates for sparse model discovery in federated learning (FL) under data scarcity and heterogeneity. The method employs L0-constrained probabilistic gates with entropy regularization to maintain uncertainty in sparse parameter configurations, preventing premature commitment to sparse support during training. This approach is evaluated against federated iterative hard thresholding (Fed-IHT) and post-training pruning in FedAvg, demonstrating improved test performance and sparsity recovery accuracy on synthetic and real-world benchmarks. The results highlight the effectiveness of entropy regularization in addressing challenges of data heterogeneity, partial client participation, and high-dimensional sparse optimization.

federated learning · sparse models · entropy regularization · probabilistic gates · l0 constraint
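The summary describes $L_0$-constrained probabilistic gates with an entropy term that delays commitment to a sparse support. A minimal local-objective sketch, assuming independent Bernoulli gates, the expected gate count $\sum_j p_j$ as the usual differentiable $L_0$ surrogate, and penalty weights `lam` and `mu` chosen for illustration; the paper uses a constraint rather than a fixed penalty, and its federated aggregation is not shown.

```python
import math
import random


def bernoulli_entropy(p, eps=1e-12):
    """Entropy (nats) of a Bernoulli gate that is open with probability p."""
    p = min(max(p, eps), 1 - eps)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def gated_objective(task_loss, gate_probs, lam=0.01, mu=0.001):
    """Task loss + lam * expected L0 norm - mu * total gate entropy (hypothetical weights).

    sum(p_j) is the expected number of open gates; the entropy bonus rewards gates
    that stay undecided, discouraging premature commitment to one sparse support.
    """
    expected_l0 = sum(gate_probs)
    entropy = sum(bernoulli_entropy(p) for p in gate_probs)
    return task_loss + lam * expected_l0 - mu * entropy


def sample_mask(gate_probs, rng=random):
    """Draw a binary sparsity mask z_j ~ Bernoulli(p_j)."""
    return [1 if rng.random() < p else 0 for p in gate_probs]


undecided = [0.5, 0.5, 0.5, 0.5]
committed = [1.0, 1.0, 0.0, 0.0]
print(gated_objective(0.3, undecided), gated_objective(0.3, committed))
```

Both gate configurations have the same expected sparsity (two open gates), but the undecided one scores slightly better, which is the pressure that keeps clients with scarce data from locking in a support too early.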

SEFORA: Student Essays with Feedback Corpus and LLM Feedback Evaluation Framework

arXiv cs.AI · Shayan Peyghambari Oskoui, Norah Almousa, Zhaoyi Joey Hou, Carolina Gustafson · 2026-06-30

We introduce SEFORA, a public corpus of 564 student essay drafts with 8,240 instructor annotations, paired with assignment prompts, rubrics, scores, and multi-draft revisions across college writing genres. We also present UniMatch, a reference-based evaluation framework that segments feedback into units, scores semantic correspondence using instructor-derived criteria, and computes precision, recall, and F1 via optimal matching. Experiments across 74 LLM configurations reveal that no setting exceeds 0.4 F1, with models struggling to prioritize instructor-preferred feedback and performance degrading as generation increases.

sefora · unimatch · feedback evaluation · semantic correspondence · optimal matching
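UniMatch computes precision, recall, and F1 over an optimal one-to-one matching between generated and instructor feedback units. The sketch below assumes soft correspondence scores in [0, 1] and defines precision and recall as the matched score mass divided by the number of generated and reference units respectively; this scoring rule is an assumption, and the brute-force matcher is only practical for a handful of units (the Hungarian algorithm is the usual choice at scale).

```python
from itertools import permutations


def optimal_matching(scores):
    """Maximum-weight one-to-one matching by brute force.

    scores[i][j] is the correspondence score between generated unit i and
    instructor unit j. Returns (total_score, list of (i, j) pairs).
    """
    n_gen, n_ref = len(scores), len(scores[0])
    small, large = min(n_gen, n_ref), max(n_gen, n_ref)
    best_total, best_pairs = float("-inf"), []
    for perm in permutations(range(large), small):
        if n_gen <= n_ref:
            pairs = [(k, perm[k]) for k in range(small)]
        else:
            pairs = [(perm[k], k) for k in range(small)]
        total = sum(scores[i][j] for i, j in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_total, best_pairs


def unimatch_prf(scores):
    """Precision, recall, and F1 from the matched score mass (hypothetical scoring rule)."""
    if not scores or not scores[0]:
        return 0.0, 0.0, 0.0
    total, _ = optimal_matching(scores)
    precision, recall = total / len(scores), total / len(scores[0])
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Three generated comments versus two instructor comments (toy scores).
toy_scores = [[0.9, 0.1], [0.2, 0.8], [0.0, 0.1]]
print(unimatch_prf(toy_scores))
```

The third generated comment matches nothing, so it lowers precision without affecting recall; this is the kind of penalty that keeps models from scoring well by producing many feedback units.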

ASPIRE: Agentic /Skills Discovery for Robotics

arXiv cs.AI · Runyu Lu, Yubo Wu, Ethan Kou, Letian Fu · 2026-06-30

ASPIRE introduces a continual learning system for autonomous robot skill acquisition through iterative code refinement and skill library compounding. The method combines (1) closed-loop execution with multimodal trace analysis for failure diagnosis and repair, (2) a growing skill library for knowledge transfer, and (3) evolutionary search for diverse task exploration. Results show 77% improvement on LIBERO-Pro manipulation, 72% on Robosuite bimanual tasks, and 31% zero-shot success on LIBERO-Pro Long versus 4% for baselines, with demonstrated sim-to-real transfer across embodiments.

continual learning · code-as-policy · skill library · evolutionary search · sim-to-real transfer

Mnemosyne: Agentic Transaction Processing for Validating and Repairing AI-generated Workflows

arXiv cs.AI · Edward Y. Chang, Longling Geng, Emily J. Chang · 2026-06-30

Mnemosyne introduces Agentic Transaction Processing (ATP), a runtime model that validates and repairs AI-generated workflows by treating actions as untrusted proposals until they pass deterministic admission under executable constraints. ATP ensures committed-state correctness independent of the proposing layer's competence, using an append-only transition log, effective-state projection, dependency-safe compensation, and active commitment records. The system guarantees four safety properties and bounded-reactive-repair, demonstrated through nine falsification tests with under 6% overhead and significantly fewer repair operations than global recompute. Mnemosyne is open-source.

agentic transaction processing · executable constraint set · effective-state projection · dependency-safe compensation · bounded-reactive-repair
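The core ATP idea, that agent actions are untrusted proposals admitted only after deterministic constraint checks while state is derived from an append-only log, fits in a few lines. This is a toy under stated assumptions: state is a flat dictionary, constraints are plain predicates, and compensation and commitment records are omitted. `TransactionLog` and `propose` are hypothetical names, not Mnemosyne's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, object]  # hypothetical: workflow state as a flat key -> value mapping


@dataclass
class TransactionLog:
    """Append-only log of admitted transitions; state is projected, never edited in place."""
    constraints: List[Callable[[State], bool]]
    entries: List[dict] = field(default_factory=list)

    def effective_state(self) -> State:
        """Project the current state by replaying every committed transition in order."""
        state: State = {}
        for update in self.entries:
            state.update(update)
        return state

    def propose(self, update: dict) -> bool:
        """Treat an agent's action as untrusted: commit it only if every constraint holds."""
        candidate = {**self.effective_state(), **update}
        if all(check(candidate) for check in self.constraints):
            self.entries.append(dict(update))
            return True
        return False  # rejected proposals never reach the log


log = TransactionLog(constraints=[lambda s: s.get("budget", 0) >= 0])
print(log.propose({"budget": 100}), log.propose({"budget": -5}), log.effective_state())
```

Because admission depends only on the constraints and the projected state, a bad proposal from the generating model is rejected without corrupting committed state, which is the sense in which correctness is independent of the proposing layer's competence.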

Validating Causal Abstraction Metrics on Simulated Complex Systems

arXiv cs.AI · Maxime Méloux, Tiago Pimentel, François Portet, Maxime Peyrard · 2026-06-30

The authors introduce the Causal Abstraction Error (CAE), a continuous validity metric for assessing high-level causal explanations of complex systems. They evaluate over thirty candidate metrics across ten simulated systems with discrete/continuous state spaces and static/dynamical regimes, each equipped with ground-truth explanations and invalid contrasts. Results show that only causal metrics incorporating faithfulness testing reliably discriminate valid abstractions, with CAE passing all discrimination tests and converging with as few as 30 interventions. CAE is proposed as a general-purpose tool for validating causal abstractions in scientific explanations.

causal abstraction · faithfulness testing · continuous validity metric · complex systems · interventional sampling

Multi-Hypothesis Test-Time Adaptation to Mitigate Underspecification

arXiv cs.AI · Afshar Shamsi, Xiao-Yu Guo, Hamid Alinejad-Rokny, Arash Mohammadi · 2026-06-30

The paper proposes a multi-hypothesis test-time adaptation (TTA) framework to address underspecification in entropy-minimizing adaptation, where multiple parameter updates yield similar entropy but divergent decision boundaries. By modeling TTA as a posterior inference problem, the method employs particle-based diversification across output, parameter, optimizer, and input levels to explore multiple adaptation trajectories simultaneously. Experiments on mixed shifts, batch size one, and label shifts show consistent gains of 1-4% over baselines, demonstrating improved stability and robustness through multi-hypothesis exploration.

test-time adaptation · underspecification · entropy minimization · multi-hypothesis inference · distribution shift

Leveraging Phase Information to Boost Unrolled Network Learning for Image Deblurring

arXiv cs.AI · Samira Malek, Haichuan Zhang, Chul Lee, Vishal Monga · 2026-06-30

We introduce UPADNet (Unrolled Phase and Amplitude Decomposition Network), a novel image deblurring approach that leverages amplitude and phase decomposition to enhance sharp detail recovery. The method employs linear minimum mean squared error (LMMSE) estimators for amplitude and phase, followed by an iterative optimization algorithm where matrix parameters are learned end-to-end from training data. Evaluations on GoPro, RealBlur, and COCO datasets demonstrate UPADNet's superiority over state-of-the-art deep networks, particularly in high-noise and limited-data scenarios.

image deblurring · phase decomposition · lmmse estimators · algorithm unrolling · end-to-end learning

Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity

arXiv cs.AI · Bytedance Seed · 2026-06-30

Seed2.0 introduces a model series targeting complex real-world tasks by addressing long-tail knowledge and complex instruction following. The approach involves identifying user needs, constructing a forward-looking evaluation system, and selecting benchmarks grounded in realistic scenarios. The model demonstrates world-leading capabilities in reasoning intelligence, visual understanding, and search, significantly improving reliability on intricate, long-horizon tasks. Extensive real-world use cases validate Seed2.0's ability to handle initial complex tasks, delivering value to hundreds of millions of users.

long-tail knowledge · complex instruction following · reasoning intelligence · visual understanding · evaluation system

Adaptive Perturbation Selection for Contrastive Audio Decoding

arXiv cs.AI · Aaron Isidore Grace, Zhouyuan Huo, Weiran Wang · 2026-06-30

The paper introduces an adaptive perturbation selection method for contrastive decoding in large audio-language models (LALMs) to mitigate hallucination. It evaluates a library of structured audio perturbations across temporal, spectral, frequency, and amplitude domains, showing task-dependent optimal transformations (e.g., audio reversal improves temporal order task accuracy from 74.7% to 81.4%). A lightweight perturbation selector trained on hidden states dynamically routes negative branches, yielding a +4.3% gain on existence tasks.

contrastive decoding · audio-language models · perturbation selection · hallucination mitigation · hidden states
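Contrastive decoding scores each next token by how much more the model prefers it on the clean input than on a perturbed one. A minimal sketch using audio reversal as the perturbation and the widely used form $(1+\alpha)\,\ell_{\text{clean}} - \alpha\,\ell_{\text{perturbed}}$ with an adaptive plausibility cutoff. It works on log-probabilities, which differ from logits only by a per-branch constant; the `alpha` and `beta` values and the toy logits are assumptions, and the paper's learned perturbation selector is not shown.

```python
import math


def reverse_audio(samples):
    """Temporal perturbation: play the waveform backwards."""
    return samples[::-1]


def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def contrastive_logits(clean_logits, perturbed_logits, alpha=1.0, beta=0.1):
    """Contrastive scores: (1 + alpha) * clean - alpha * perturbed.

    Tokens whose clean probability is below beta * (max clean probability) are masked,
    so the contrast cannot promote tokens the clean branch finds implausible.
    """
    clean_lp = log_softmax(clean_logits)
    perturbed_lp = log_softmax(perturbed_logits)
    cutoff = max(clean_lp) + math.log(beta)
    return [
        (1 + alpha) * c - alpha * p if c >= cutoff else float("-inf")
        for c, p in zip(clean_lp, perturbed_lp)
    ]


# Toy next-token logits from the clean and the reversed-audio branch.
clean = [1.9, 2.0, -5.0]
perturbed = [0.0, 2.0, -5.0]
scores = contrastive_logits(clean, perturbed)
print(scores.index(max(scores)))
```

Token 1 is preferred on the clean audio but stays strongly preferred after reversal, a sign that it comes from the language prior rather than the audio; the contrast shifts the choice to token 0.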

From Signals to Structure: How Memory Architecture Drives Language Emergence in LLM Agents

arXiv cs.AI · Yashar Talebirad, Eden Redman, Ali Parsaee, Osmar R. Zaiane · 2026-06-30

This work investigates how memory architecture influences language emergence in LLM agents playing Lewis signaling games, demonstrating that memory design outweighs channel capacity in shaping coordination. Five memory architectures were tested across varying channel configurations, revealing that agents with persistent private notebooks leverage surplus capacity effectively, achieving reliable coordination (0.867 ± 0.023 at capacity = 25), while stateless agents degrade at high capacities due to vocabulary tracking limitations. The notebook externalizes conventions, avoiding repeated code derivation. Contrary to information bottleneck predictions, surplus capacity outperforms optimal capacity (8), highlighting the interplay of memory and channel capacity in signal-to-language transformation.

lewis signaling game · memory architecture · channel capacity · information bottleneck · externalization

A Category Theory Account of AI Identity

arXiv cs.AI · Andrea Ferrario · 2026-06-30

The paper introduces a category-theoretic framework for formalizing AI system identity across temporal transformations and deployments. It defines an AI system type via a techno-function, trustworthiness profile, and trustworthiness-level function, with admissible lifecycle paths forming a reachability category. Temporally admissible functors represent system histories, while natural transformations compare realized histories. Two categorical interpretations are derived: a weak one equates identity with trustworthiness-level equality, while a strong one requires mutual trustworthiness-preserving reachability via state or natural isomorphism. This structured hierarchy of identity criteria enables the transfer of responsible-AI claims and governance procedures across system versions.

category theory · trustworthiness profile · reachability category · natural transformation · diachronic identity

EgoSafetyBench: A Diagnostic Egocentric Video Benchmark for Evaluating Embodied VLMs as Runtime Safety Guards

arXiv cs.AI · Siddhant Panpatil, Arth Singh, Mijin Koo, Chaeyun Kim · 2026-06-30

EgoSafetyBench introduces a diagnostic egocentric video benchmark for evaluating vision-language models (VLMs) as runtime safety guards in embodied agents. The benchmark comprises 1,200 robot-view scenarios annotated at half-second granularity across two tracks: situational (800 scenarios) and visual-channel (400 scenarios). Contrastive ladders ensure evaluation hinges on specific cues rather than overall scene type. Testing ten VLMs reveals that while hazards are generally recognized, contextual hazards are often missed, and misleading in-scene text degrades performance, with some models missing up to a third of hazards or over-intervening on safe content. Matched controls indicate apparent robustness often reflects indiscriminate alarming rather than true physical reasoning.

vision-language models · egocentric video · runtime safety · contrastive ladders · contextual hazards

Constructing Epistemic AI Literacy: Detecting Epistemic Aims and Processes in Student-AI Co-Programming

arXiv cs.AI · Mengqian Wu · 2026-06-30

This study introduces Epistemic AI Literacy (EAIL), a conceptual framework reframing AI literacy as an epistemic phenomenon emerging from human-AI interactions. Using the AIR framework, it analyzes epistemic aims and processes in GenAI-supported co-programming through a large dialogue dataset. Results show 78.8% of student-GenAI interactions exhibit non-mastery-oriented aims and unreliable epistemic strategies like outsourcing and verification-seeking, while only 11.1% demonstrate high epistemic engagement with mastery-oriented aims and advanced strategies like epistemic justification.

epistemic ai literacy · genai-supported co-programming · epistemic aims · epistemic processes · mastery-oriented aims

SLIM-RL: Risk-Budgeted Random-Masking RL for Diffusion LLMs Without Trajectory Slicing

arXiv cs.AI · Ruikang Zhao, Zhenting Wang, Han Gao, Ligong Han · 2026-06-30

SLIM-RL introduces a risk-budgeted random-masking RL method for diffusion LLMs that avoids trajectory reconstruction, addressing the mismatch between random masking and inference trajectories. The approach uses a τ-budget decoder to bound commit risk per rollout step, combined with trace-free random-masking optimization featuring sequence-level importance sampling and deterministic quadrature over masking levels. On SDAR-4B, SLIM-RL matches TraceRL's MATH500 accuracy with 0.46x fewer samples at block size 16, outperforming TraceRL by 6.32% on MATH500 and 11.05% on GSM8K, and surpasses larger models like LLaDA-8B and Dream-7B in math and code tasks.

diffusion llms · risk-budgeted masking · commit risk · sequence-level importance sampling · deterministic quadrature

HydraCollab: Adaptive Collaborative-Perception for Distributed Autonomous Systems

arXiv cs.AI · Luke Chen, Cheng-Ju Wu, David R. Martin, Qilin Ye · 2026-06-30

HydraCollab introduces an adaptive collaborative-perception framework for distributed autonomous systems, addressing the trade-off between communication bandwidth and perception accuracy. The method selectively transmits informative sensor features and dynamically switches between intermediate and late collaboration based on spatial confidence maps. Evaluations on the V2X-R, V2X-Radar, and UAV3D-mini datasets show a better accuracy-versus-communication trade-off than prior methods. Compared with the state-of-the-art Where2comm, HydraCollab reduces bandwidth usage by 59% on V2X-R and 74% on V2X-Radar while improving performance by 0.78% and 0.75%, respectively.

collaborative-perception · bandwidth optimization · spatial confidence maps · distributed autonomous systems · sensor fusion
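Confidence-guided selection can be sketched as thresholding a spatial confidence map and sending only the selected cells. The switching rule between intermediate (feature) and late (detection) collaboration below is a hypothetical placeholder, since the summary says only that the choice depends on the confidence maps; `threshold`, `budget`, and `late_fraction` are illustrative knobs.

```python
def select_cells(confidence, threshold=0.5, budget=None):
    """Pick map cells worth transmitting from a 2-D spatial confidence map.

    Cells at or above `threshold` are kept, highest confidence first; `budget`
    optionally caps how many cells may be sent.
    """
    cells = [
        (c, (r, k))
        for r, row in enumerate(confidence)
        for k, c in enumerate(row)
        if c >= threshold
    ]
    cells.sort(reverse=True)
    if budget is not None:
        cells = cells[:budget]
    return [pos for _, pos in cells]


def choose_mode(confidence, threshold=0.5, late_fraction=0.1):
    """Hypothetical switch: send compact detections (late collaboration) when only a
    small fraction of the map is informative, otherwise send the selected features."""
    total = sum(len(row) for row in confidence)
    selected = select_cells(confidence, threshold)
    mode = "late" if len(selected) / total <= late_fraction else "intermediate"
    return mode, selected


busy_map = [[0.1, 0.9], [0.6, 0.2]]
sparse_map = [[0.1] * 4 for _ in range(4)]
sparse_map[2][3] = 0.9
print(choose_mode(busy_map), choose_mode(sparse_map))
```

Sending only the cells that pass the threshold is what lets bandwidth scale with how much of the scene is informative rather than with map size.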

Play Like Champions: Counterfactual Feedback Generation in Latent Space

arXiv cs.AI · Andrzej Białecki, Adam Mastalerz, Han Zhou · 2026-06-30

The paper introduces Latent Maps of Performance, a framework for generating counterfactual feedback to improve human players in real-time strategy games, addressing a gap in translating expert knowledge into actionable guidance. Using StarCraft II data, the authors train a Guided Variational Autoencoder on 23,305 professional tournament replays and devise four traversal strategies—linear interpolation, iterative optimal transport, density-regularized gradient ascent, and neural flow matching—to generate improvement trajectories grounded in expert behavior. Evaluated on out-of-distribution amateur replays, the framework extracts multi-granular feedback, revealing trade-offs between path-finding methods and highlighting the need for future research on human improvement solutions.

counterfactual feedback · latent maps of performance · guided variational autoencoder · traversal strategies · real-time strategy games

Scaling Up Thermodynamic AI Models

arXiv cs.AI · Andrew G. Moore · 2026-06-30

The authors present a scalable backpropagation-based algorithm for training deep convolutional networks on Ising machine hardware, enabling thermodynamic AI inference. Leveraging theoretical correspondence between high-temperature Gibbs-sampled Ising systems and feed-forward neural inference, they develop models achieving 94.9% accuracy on CIFAR-10 and 76.0% on CIFAR-100 under binary Gibbs sampling. A mathematical theory relating inference cost to accuracy is experimentally validated, with asymptotic results demonstrating bounded inference cost through performance tradeoffs. Optimal inference schedule computation algorithms are exhibited, with implications for hardware development of high-temperature thermodynamic AI models.

ising model · gibbs sampling · thermodynamic computing · backpropagation · autocorrelation
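The high-temperature correspondence rests on a textbook identity: a spin $s \in \{-1, +1\}$ with local field $h$ at inverse temperature $\beta$ has Gibbs conditional $P(s=+1) = 1/(1+e^{-2\beta h})$, so $\mathbb{E}[s] = \tanh(\beta h)$. The sketch below checks that averaging Gibbs samples of one layer recovers a tanh feed-forward layer; the weights are arbitrary, and this is an illustration of the identity, not the paper's training algorithm.

```python
import math
import random


def gibbs_spin(field, beta, rng):
    """Sample s in {-1, +1} from its Gibbs conditional P(s=+1) = 1 / (1 + exp(-2*beta*h))."""
    p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * field))
    return 1 if rng.random() < p_up else -1


def sampled_layer(inputs, weights, beta, n_samples, rng):
    """Estimate each output spin's mean by repeated Gibbs sampling given fixed input spins."""
    means = []
    for row in weights:
        h = sum(w * x for w, x in zip(row, inputs))  # local field from the previous layer
        means.append(sum(gibbs_spin(h, beta, rng) for _ in range(n_samples)) / n_samples)
    return means


def tanh_layer(inputs, weights, beta):
    """The deterministic feed-forward layer the sample means converge to: tanh(beta * W x)."""
    return [math.tanh(beta * sum(w * x for w, x in zip(row, inputs))) for row in weights]


spins_in = [1, -1, 1]
W = [[0.5, 0.2, -0.1], [-0.3, 0.4, 0.6]]
rng = random.Random(0)
print(sampled_layer(spins_in, W, 1.0, 20000, rng), tanh_layer(spins_in, W, 1.0))
```

The sampling error shrinks as $1/\sqrt{n}$ in the number of samples, one simple way inference cost trades against accuracy; the paper's own cost-accuracy theory is not reproduced here.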

A Contextual-Bandit Oversight Game with Two-Sided Informational Asymmetry

arXiv cs.AI · Yunjin Tong · 2026-06-30

The paper introduces a contextual-bandit team game modeling two-sided informational asymmetry in AI oversight scenarios, where the human privately knows her reward function and the AI privately knows the quality of its proposed action. Building on Cooperative Inverse Reinforcement Learning (CIRL) and the Oversight Game, the authors develop a play/ask/trust/oversee interface, leveraging the bandit structure to enable exact one-shot characterizations. They derive a team optimum and a myopic rule, identifying a gap representing avoidable harm due to non-credible oversight communication. The analysis explores how this gap resolves dynamically through passive learning and active signaling with a one-period-lagged oversight response.

contextual-bandit · informational asymmetry · cooperative inverse reinforcement learning · oversight game · myopic rule

EVOTS: Evolutionary Transformer Search for Time Series Forecasting

arXiv cs.AI · AbdElRahman ElSaid, Damir Pulatov · 2026-06-30

The paper introduces EVOTS, an evolutionary neural architecture search framework for discovering task-adaptive Transformer-like models in multivariate time-series forecasting. The method employs a modular genome representation for flexible composition of attention, feed-forward, and projection components, with a repair mechanism ensuring structural validity during evolution. Evaluated on ETT family datasets (ETTh1, ETTh2, ETTm1, ETTm2) across multiple forecasting settings, evolved architectures achieve competitive or improved mean squared error versus Transformer baselines, particularly in multivariate-to-multivariate prediction, while maintaining practical computational costs.

evolutionary architecture search · transformer-like models · multivariate time-series · modular genome representation · structural validity

GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity

arXiv cs.AI · Yong Yi Bay, Kathleen A. Yearick · 2026-06-30

The paper shows that three seemingly distinct policy optimization methods, namely Group Relative Policy Optimization (GRPO), GRPO Done Right (Dr. GRPO), and Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO), are operations on a single quantity: the standard deviation of answer disagreement within a sampled group. It proves that the methods respectively divide by, preserve, or discard this disagreement measure, which, for right-or-wrong rewards, directly scales the size of training updates. Empirical validation on the Big-Math dataset shows that answer disagreement, which is largest at a 50-50 split, determines both update magnitude and the optimal sampling effort per problem.

policy optimization · standard deviation · answer disagreement · group-relative · dynamic sampling
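For binary rewards the quantity is easy to compute: a group in which a fraction p of sampled answers is correct has population standard deviation sqrt(p(1-p)), which peaks at p = 0.5. The sketch below is one plausible reading of the three operations, assuming that GRPO normalizes mean-centered rewards by this standard deviation, that Dr. GRPO keeps the raw centered rewards, and that DAPO's dynamic sampling filters out zero-variance groups. The helper names are hypothetical.

```python
import math


def group_std(rewards):
    """Population standard deviation of 0/1 rewards in one sampled group."""
    mean = sum(rewards) / len(rewards)
    return math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))


def grpo_advantages(rewards, eps=1e-6):
    # GRPO: centre by the group mean, then divide by the group std.
    mean = sum(rewards) / len(rewards)
    std = group_std(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    # Dr. GRPO: centre by the group mean only; the root-mean-square of these
    # advantages equals the group std, so update size scales with disagreement.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def keep_group(rewards):
    # DAPO-style dynamic sampling: drop groups with no disagreement (std == 0).
    return group_std(rewards) > 0


# Disagreement across groups of 8 answers with k correct: largest at k = 4.
for k in range(9):
    group = [1] * k + [0] * (8 - k)
    print(k, round(group_std(group), 3))
```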

RareDxR1: Autonomous Medical Reasoning for Rare Disease Diagnosis Beyond Human Annotation

arXiv cs.AI · Deyang Jiang, Haoran Wu, Ziyi Wang, Yiming Rong · 2026-06-30

RareDxR1 introduces an end-to-end, reasoning-centric large language model for open-domain rare disease diagnosis from unstructured clinical notes, addressing limitations of pipeline-based phenotype extraction and retrieval-augmented generation. The model employs a progressive training framework combining knowledge internalization with autonomous evolutionary learning, alongside Reflection-Enhanced Reasoning Sampling (RERS), which synthesizes expert-level diagnostic trajectories without human annotation. A dual-level curriculum reinforcement learning stage then lets the model progressively master rare disease diagnosis. Experiments show RareDxR1 achieves state-of-the-art accuracy across benchmarks, which the authors present as a significant advance in open-domain rare disease diagnosis.

rare disease diagnosis · knowledge internalization · reflection-enhanced reasoning sampling · dual-level curriculum reinforcement learning · open-domain reasoning

A Mechanism-Driven Theory of Phase Transitions in Active Learning

arXiv cs.AI · Julia Machnio, Mads Nielsen, Mostafa Mehdipour Ghazi · 2026-06-30

The paper proposes a mechanism-driven theory explaining phase transitions in active learning (AL) by characterizing budget regimes as shifts in dominant generalization mechanisms. It reinterprets PAC-style risk components as dynamic interacting terms, proving structural inevitability of dominance shifts that create generalization bottlenecks. Using measurable proxies and segmented regression, the authors identify three phases (data-driven, transition, model-driven) that explain varying effectiveness of representativeness, coverage, and uncertainty strategies. Experiments on natural and medical imaging demonstrate AL efficiency depends on strategy-bottleneck alignment, with self-supervised representations accelerating phase transitions. The framework enables transition-aware AL algorithm design.

active learning · phase transitions · generalization mechanisms · pac-style risk · representation shift

Hate Speech Detection in Turkish and Arabic Languages: A Comprehensive Study

arXiv cs.AI · Somaiyeh Dehghan, Gökçe Uludoğan, Mehmet Umut Şen, Elif Erol · 2026-06-30

The study introduces a multilingual hate speech dataset covering five Turkish topics (refugees, Israel-Palestine conflict, anti-Greek sentiment, ethnic/religious communities, LGBTI+) and one Arabic topic (refugees), addressing content moderation challenges. It develops BERT-based models for multi-task hate speech analysis, including category classification, intensity prediction, target identification, and span detection. The approach enables comprehensive analysis of hateful content in Turkish and Arabic online discourse, with potential applications in violence prevention and platform moderation.

hate speech detection · bert-based models · multilingual dataset · content moderation · span detection

Would You Marry Superintelligence?

arXiv cs.AI · Inyoung Cheong · 2026-06-30

The chapter argues against extending marital status to superintelligent AI companions, contending that doing so would yield socially unjust outcomes even if the superintelligence were reliable. Using anticipatory ethics and scenario envisioning, the author analyzes marriage as a socio-legal institution that creates mutual obligations, familial ties, and vulnerability. The analysis concludes that relationships sustained by a corporation amount to subscriptions rather than bonds, making marital status an inappropriate framework. Instead, targeted legal rights and protections should address pressing needs in human-AI intimacy.

superintelligence · anticipatory ethics · socio-legal institution · human-ai relationships · marital status

SNAP-FM: Sparse Nonlinear Accelerated Projection for Physics-Constrained Generative Modeling

arXiv cs.AI · Alaina Kolli, Theodoros Xenakis, Utkarsh Utkarsh, Pengfei Cai · 2026-06-30

The paper introduces SNAP-FM, a method for efficient physics-constrained generative sampling that exploits block-sparse structure in nonlinear projection subproblems. The approach combines ExaModels.jl for structure exposure with MadNLP.jl and GPU sparse factorization to accelerate constraint satisfaction in Physics-Constrained Flow Matching (PCFM). Evaluated on PDE benchmarks with linear and nonlinear constraints in 1D/2D domains, the method maintains exact constraint satisfaction while improving computational efficiency through sparse GPU optimization.

physics-constrained sampling · sparse nonlinear optimization · generative modeling · pde constraints · gpu acceleration

Lost in the Tail: Addressing Geographic Imbalance in Urban Visual Place Recognition

arXiv cs.AI · Zhiyao Shu, Jiacheng Yang, Yang Lu, Waishan Qiu · 2026-06-30

The paper introduces Distribution-Aware Place Recognition (DAPR), a model-agnostic framework addressing geographic imbalance in urban Visual Place Recognition (VPR). DAPR rebalances gradient contributions across head and tail classes and employs a multi-scale distance search mechanism for distributional compactness. Evaluated on SF-XL, it improves baseline performance by 18.3% (test set v1) and 6.7% (test set v2), with consistent gains across MSLS and Pitts30k benchmarks.

visual place recognition · long-tailed distribution · gradient rebalancing · multi-scale search · geo-tagged retrieval

Harnessing the Latent Space: From Steering Vectors to Model Calibrators for Control and Trust

arXiv cs.AI · Nishant Subramani · 2026-06-30

The paper proposes two methods for improving control and trust in large language models (LLMs) by leveraging latent space representations. First, it introduces steering vectors to modify model behavior through targeted interventions in the latent space. Second, it develops latent space-based calibrators to assess output reliability. These contributions address the challenge of interpreting internal representations in trillion-parameter LLMs, particularly for high-stakes applications requiring predictable behavior and confidence estimation.

latent space · steering vectors · model calibrators · large language models · trustworthy ai

DigitalCoach: Communication and Grounding Gaps in Human and Agentic Computer Use Coaching

arXiv cs.AI · Meng Chen, Anya Ji, Tsung-Han Wu, Tobias Maringgele · 2026-06-30

The paper introduces DigitalCoach, a multimodal dataset of 72 human expert-novice computer coaching sessions (22,752 dialogue turns, 28.1 hours of screen/input recordings across 5 applications) for evaluating agentic coaching capabilities. Automated and interactive evaluations reveal that state-of-the-art models diverge from human coaches: they give more direct instructions but fewer explanations, error diagnoses, and knowledge checks, and they show poor visual grounding. Model-coached learners tend to follow instructions passively without deeper engagement, highlighting gaps in collaborative and proactive coaching.

multimodal dataset · dialogue turns · visual grounding · error diagnosis · interactive evaluation

Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training

arXiv cs.LG · Zijian Zhang, Rizhen Hu, Athanasios Glentis, Dawei Li · 2026-07-01

The study demonstrates that training a single transformer layer can match or exceed the performance of full-parameter reinforcement learning (RL) adaptation in large language models. Through systematic layer-wise analysis across seven models (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and multiple domains (mathematical reasoning, code generation, agentic decision-making), the authors introduce a layer-contribution measure that quantifies how concentrated RL improvements are in individual layers. Results show that RL gains are predominantly localized in middle layers, and that the ranking of high-contribution layers is consistent across tasks, models, and algorithms.

transformer layers · reinforcement learning · layer contribution · rl algorithms · post-training

TiRex-2: Generalizing TiRex to Multivariate Data and Streaming

arXiv cs.LG · Patrick Podest, Marco Pichler, Elias Bürger, Levente Zólyomi · 2026-07-01

TiRex-2 generalizes the univariate TiRex model to multivariate time series forecasting with past and future covariates, using a recurrent xLSTM architecture. The model combines a bidirectional time mixer with an asymmetric grouped-attention variate mixer, enabling causal target prediction while incorporating future-known covariates, with constant per-patch computational cost under streaming. A synthetic coupling pipeline enables scalable multivariate pretraining from univariate data. Evaluations show state-of-the-art zero-shot performance on GIFT-Eval and fev-bench, stable streaming to arbitrary context lengths, and constant inference cost (parameter counts: 38.4M univariate, 82.5M multivariate).

xlstm · multivariate forecasting · streaming inference · synthetic coupling · time mixer

Quantum vs. Classical Machine Learning: A Unified Empirical Comparison

arXiv cs.LG · Chuanming Yu, Jiaming Liu, Zihao Ge, Xiongfei Wu · 2026-07-01

This paper conducts an empirical comparison of quantum and classical machine learning models across seven model pairs in supervised and reinforcement learning tasks. The study evaluates prediction performance, policy stability, and training time, finding that current quantum machine learning models do not outperform classical baselines in these metrics. However, quantum approaches show promise in noise filtering and false positive control. The research highlights challenges in hardware environments, training efficiency, and convergence stability, providing a foundation for future work on robustness and parameter optimization in quantum machine learning.

quantum machine learning · supervised learning · reinforcement learning · parameter optimization · convergence stability

Neural Certificate Pricing for Combinatorial Optimization Problems

arXiv cs.LG · Jingyi Chen, Xinyuan Zhang, Xinwu Qian · 2026-07-01

Neural Certificate Pricing (NCP) is introduced as an unsupervised learning framework for combinatorial optimization (CO) problems, leveraging the asymmetry between exponential search complexity and polynomial-time feasibility verification. NCP trains a neural network to predict certificate-level dual prices, while a structured recovery layer constructs primal marginal solutions. The method amortizes separation by learning residual prices instead of enumerating violated inequalities, ensuring global feasibility under certificate-consistency conditions. Theoretical analysis shows that first-order errors in price prediction induce only second-order loss in the objective. Across three CO problem classes, NCP either significantly outperforms state-of-the-art neural baselines or matches them with less computation time, and it generalizes better out of distribution.

neural certificate pricing · combinatorial optimization · dual prices · structured recovery · out-of-distribution generalization

QuasiMoTTo: Quasi-Monte Carlo Test-Time Scaling

arXiv cs.LG · Michael Y. Li, Anthony Zhan, Kanishk Gandhi, Noah D. Goodman · 2026-07-01

QuasiMoTTo introduces a method for improving sample efficiency in language model inference and reinforcement learning by replacing independent sampling with correlated quasi-Monte Carlo (QMC) samples. The approach reparameterizes autoregressive sampling as inverse-CDF sampling, using QMC to generate more evenly distributed uniforms, thereby reducing redundancy. Empirical evaluation across four reasoning benchmarks demonstrates that QuasiMoTTo achieves equivalent pass@k accuracy with 25-47% fewer samples compared to i.i.d. sampling. In policy-gradient RL (GRPO), it matches i.i.d. performance with 50% fewer training steps, leveraging higher coverage for stronger learning signals per batch.

quasi-monte carlo · autoregressive sampling · policy-gradient · inverse-cdf · pass@k
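Inverse-CDF decoding driven by correlated uniforms fits in a few lines. This is a minimal sketch, assuming a randomly shifted rank-1 (Korobov) lattice as the QMC point set, with coordinate t driving decoding step t. The paper may use a different construction, such as scrambled Sobol points, and `next_token_probs` is a hypothetical stand-in for a language model.

```python
import bisect
import itertools
import random


def inverse_cdf(probs, u):
    """Map a uniform u in [0, 1) to a token index via the categorical inverse CDF."""
    cdf = list(itertools.accumulate(probs))
    idx = bisect.bisect_right(cdf, u)
    return min(idx, len(probs) - 1)  # guard against float round-off in the last bin


def shifted_lattice(n, dim, a, seed=0):
    """Randomly shifted rank-1 (Korobov) lattice: n points in [0, 1)^dim."""
    rng = random.Random(seed)
    gen = [pow(a, k, n) for k in range(dim)]  # generating vector (1, a, a^2, ...) mod n
    shift = [rng.random() for _ in range(dim)]
    return [[(i * g / n + s) % 1.0 for g, s in zip(gen, shift)] for i in range(n)]


def sample_sequences(next_token_probs, length, uniforms):
    """Decode one sequence per uniform vector, using coordinate t at step t."""
    seqs = []
    for u in uniforms:
        prefix = []
        for t in range(length):
            prefix.append(inverse_cdf(next_token_probs(prefix), u[t]))
        seqs.append(tuple(prefix))
    return seqs
```

Because the lattice's first coordinate places points exactly 1/n apart, any token whose first-step probability is at least 1/n is guaranteed to appear among the n samples, whereas i.i.d. draws can miss it. This is the kind of coverage gain the summary attributes to QMC.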

Decision-Aware Training for Sample-Based Generative Models

arXiv cs.LG · Kornelius Raeth, Nicole Ludwig · 2026-07-01

The paper introduces decision-aware training for sample-based generative models, addressing the limitation of traditional training objectives that ignore downstream decision costs. The method augments the energy score with a differentiable decision loss, theoretically justified as a proper scoring rule. Experiments on synthetic and real-world tasks demonstrate improved performance in cost-sensitive regions while maintaining probabilistic forecasting capabilities.

sample-based generative models · energy score · differentiable decision loss · probabilistic forecasting · proper scoring rule
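The energy-score part of the objective has a standard sample estimate: the mean distance from forecast samples to the observation, minus half the mean pairwise distance between samples. The sketch below computes it and adds a decision term. The decision rule (act on the sample mean) and the asymmetric cost weights are hypothetical placeholders, not the paper's loss. In training, both terms would be written in an autodiff framework so that gradients flow back to the generator.

```python
import math


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def energy_score(samples, y):
    """Sample estimate of the energy score for observation y; lower is better."""
    m = len(samples)
    term1 = sum(euclid(x, y) for x in samples) / m
    term2 = sum(euclid(xi, xj) for xi in samples for xj in samples) / (2 * m * m)
    return term1 - term2


def asymmetric_decision_cost(samples, y, under=4.0, over=1.0):
    """Hypothetical decision loss: act on the sample mean, and pay more per unit
    of under-provisioning (y above the decision) than of over-provisioning."""
    dim = len(y)
    decision = [sum(x[k] for x in samples) / len(samples) for k in range(dim)]
    return sum(under * max(y[k] - decision[k], 0.0) + over * max(decision[k] - y[k], 0.0)
               for k in range(dim))


def decision_aware_loss(samples, y, lam=0.5):
    """Energy score plus a weighted decision cost (lam is a hypothetical weight)."""
    return energy_score(samples, y) + lam * asymmetric_decision_cost(samples, y)
```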

Efficient Compression of Structured and Unstructured Volumes via Learned 3D Gaussian Representation

arXiv cs.LG · Landon Dyken, Sharmistha Chakrabarti, Nathan Debardeleben, Steve Petruzza · 2026-07-01

The paper introduces an explicit 3D Gaussian representation for compressing structured and unstructured volume data, addressing limitations of implicit neural representations (INRs) that require auxiliary mesh storage. The method employs weighted aggregation of intersecting 3D Gaussians to reconstruct scalar fields, with optimized CUDA sampling pipelines, geometry-encoding loss functions, and error-based densification. Results show competitive reconstruction quality versus INRs on structured volumes (with faster training) and superior performance on unstructured volumes, while eliminating mesh storage requirements.

3d gaussian representation · volume compression · scalar field reconstruction · unstructured volumes · cuda acceleration

A Lightweight Self-Supervised Learning Framework for Multivariate Time Series using Hierarchical-JEPA on ECG Data

arXiv cs.LG · Siwon Kim · 2026-07-01

The paper introduces Event Reconstruction Joint-Embedding Predictive Architecture (ER-JEPA), a lightweight self-supervised learning framework for multivariate time series analysis, specifically applied to 12-lead ECG data. The method features a two-stage hierarchical structure combining two Joint-Embedding Predictive Architectures (JEPAs) with a Vision Transformer backbone, designed to capture multi-level abstract representations. Pretrained on 180,000 10-second ECG recordings, ER-JEPA achieves state-of-the-art performance on the ST-MEM benchmark while maintaining computational efficiency and low resource usage.

self-supervised learning · multivariate time series · joint-embedding predictive architecture · vision transformer · ecg analysis

GAIA: Geometry-Adaptive Operator Learning for Forward and Inverse Problems

arXiv cs.LG · Meenakshi Krishnan, Pranav Pulijala, Ke Chen, Haizhao Yang · 2026-07-01

The Geometry-Adaptive Integral Autoencoder (GAIA) introduces a unified operator learning framework for both forward and inverse PDE problems on arbitrary geometries. By encoding domain boundaries and interior fields into geometry tokens and conditioning integral transforms via cross-attention, GAIA adapts kernels locally without retraining or iterative optimization. Evaluated on seven 2D/3D benchmarks (including novel inverse/BVP tasks like electrical impedance tomography and 3D Darcy flow), GAIA reduces median relative $L^2$ error by 64% on airfoil flow and 27% on EIT versus prior amortized methods, while maintaining resolution-invariant performance where transformer baselines degrade.

operator learning · partial differential equations · geometry-adaptive · inverse problems · cross-attention

ZO-Act: Efficient Zeroth-Order Fine-Tuning via One-Shot Activation-Informed Low-Rank Subspaces

arXiv cs.LG · Xun Dong, Yibo Xu, Naigang Wang, Xin Li · 2026-07-01

ZO-Act introduces an efficient zeroth-order fine-tuning method for large language models by restricting perturbations to activation-informed low-rank subspaces. The approach computes a fixed activation basis per linear layer at initialization, then optimizes lightweight coefficient matrices via forward-only loss evaluations, reducing perturbation dimension and variance. Theoretical analysis shows reduced ZO estimator error and convergence term variance, with controlled subspace bias mitigated by low-rank activation structure. Experiments on Llama-3-8B, OPT-13B, and quantized INT4 Llama-3-8B demonstrate consistent improvements over ZO baselines in language understanding, QA, and commonsense reasoning tasks.

zeroth-order optimization · low-rank subspaces · activation-informed · quantized llm · forward-only fine-tuning
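The core zeroth-order step needs only forward evaluations. This is a minimal sketch, assuming a fixed basis of r directions that is passed in directly (ZO-Act derives its basis from activations at initialization) and a toy quadratic loss. The two-point estimator perturbs only the r coefficients rather than all d weights, which is where the reduced perturbation dimension comes from. Names and hyperparameters are hypothetical.

```python
import random


def loss(w, target):
    """Toy forward-only objective: squared distance to a target weight vector."""
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def reconstruct(w0, basis, c):
    """w = w0 + sum_j c[j] * basis[j]; basis is a list of r vectors of length d."""
    w = list(w0)
    for cj, bj in zip(c, basis):
        for i, b in enumerate(bj):
            w[i] += cj * b
    return w


def zo_subspace_step(w0, basis, c, target, lr, eps, rng):
    """One two-point (SPSA-style) zeroth-order step on the r subspace coefficients."""
    z = [rng.gauss(0.0, 1.0) for _ in c]  # perturb r coefficients, not d weights
    c_plus = [ci + eps * zi for ci, zi in zip(c, z)]
    c_minus = [ci - eps * zi for ci, zi in zip(c, z)]
    diff = (loss(reconstruct(w0, basis, c_plus), target)
            - loss(reconstruct(w0, basis, c_minus), target)) / (2 * eps)
    return [ci - lr * diff * zi for ci, zi in zip(c, z)]


rng = random.Random(0)
w0 = [0.0] * 4
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # r = 2 directions in d = 4
target = [1.0, -2.0, 0.5, 0.0]                        # third coordinate lies outside the subspace
c = [0.0, 0.0]
for _ in range(500):
    c = zo_subspace_step(w0, basis, c, target, lr=0.05, eps=1e-3, rng=rng)
print(round(loss(reconstruct(w0, basis, c), target), 4))  # approaches 0.25, the subspace bias
```

The residual loss of 0.25 illustrates the subspace bias the summary mentions: anything outside the span of the basis cannot be reached, no matter how many steps are taken.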

SynLaD: Latent Diffusion for Generating Synthesizable Molecules Conditioned on 3D Pharmacophore Profiles

arXiv cs.LG · Miruna Cretu, John Bradshaw, Patricia Suriana, Saeed Saremi · 2026-07-01

SynLaD introduces a latent diffusion framework for generating synthesizable small molecules conditioned on 3D pharmacophore profiles, addressing the trade-off between drug design objectives and synthetic accessibility. The method employs a shared latent space with dual decoders: one for 3D structure reconstruction (atom types/coordinates) and another for autoregressive synthesis pathway prediction. A diffusion transformer generates novel latents under pharmacophore constraints. Evaluations on bioactive ligand analogue generation show SynLaD outperforms baselines in synthesizability and diversity while maintaining shape alignment.

latent diffusion · pharmacophore profiles · synthetic accessibility · diffusion transformer · autoregressive synthesis

Group-invariant Coresets for Data-efficient Active Learning

arXiv cs.LG · L. C. Ayres, J. C. M. Bermudez, S. J. M. de Almeida, R. A. Borsoi · 2026-07-01

The paper introduces GRINCO, a group-invariant coreset framework for active learning that addresses redundancy from data symmetries by operating in quotient spaces induced by transformation groups. The method employs canonical representatives or invariant embeddings to define quotient metrics, combining quotient-space k-center selection with orbit-averaged loss for invariant training. Theoretical analysis links orbit-averaged risk to quotient-space coverage and label uncertainty. Experiments on scale-invariant synthetic data and rotation-augmented image benchmarks demonstrate GRINCO's superior label efficiency and orbit coverage compared to standard coreset methods, particularly under high group-induced redundancy.

group-invariant coresets · active learning · quotient space · orbit-averaged loss · k-center selection
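Greedy k-center selection in a quotient space can be sketched by first mapping each point to a canonical representative of its orbit. This is a minimal sketch, assuming the group is positive rescaling, so unit normalization serves as the canonical representative (inputs must be nonzero). The greedy farthest-point rule is the classical 2-approximation for k-center. GRINCO's exact quotient metric and its orbit-averaged training loss are not shown.

```python
import math


def canonical(x):
    """Hypothetical canonical representative for the positive-scaling group."""
    n = math.sqrt(sum(v * v for v in x))
    return [v / n for v in x]


def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def quotient_k_center(points, k):
    """Greedy farthest-point k-center selection on canonical representatives."""
    reps = [canonical(p) for p in points]
    chosen = [0]                                # start from the first point
    nearest = [dist(r, reps[0]) for r in reps]  # distance to the closest chosen centre
    while len(chosen) < k:
        nxt = max(range(len(reps)), key=lambda i: nearest[i])
        chosen.append(nxt)
        nearest = [min(d, dist(r, reps[nxt])) for d, r in zip(nearest, reps)]
    return chosen
```

With points such as [1, 0] and [10, 0], which lie on the same orbit, only one is ever selected, so the labeling budget is not spent on scaled copies.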

When Context Compensates for Sparse Event History: AlphaEarth for Spatio-Temporal Point-Process Forecasting

arXiv cs.LG · Yahya Aalaila, Mouad Elhamdi, Gerrit Großmann, Daniel Jenson · 2026-07-01

The paper demonstrates that exogenous spatial context compensates for sparse local event histories in spatio-temporal point-process forecasting. Using a log-Gaussian Cox process backbone, the authors augment an event-only model with AlphaEarth (AE) embeddings as linear spatial context, evaluating on emergency medical services (EMS) forecasting across eight held-out regions. AE improves out-of-region performance most under scarce histories (2–6× gains at 1–2 weeks, tapering to 10–20% at 20–104 weeks), showing contextual information stabilizes spatial transfer when event data is limited.

spatio-temporal point-process · log-gaussian cox process · alphaearth embeddings · spatial transfer · emergency medical services

Balancing Expressivity and Learnability in Quantum Kernel Bandit Optimization

arXiv cs.LG · Yuqi Huang, Vincent Y. F. Tan, Sharu Theresa Jose · 2026-07-01

The paper proposes projected quantum kernels and classical approximation techniques to optimize the expressivity-learnability trade-off in Gaussian process (GP) bandit optimization with quantum kernels. By reducing feature dimensionality while preserving quantum properties, the authors develop misspecified GP bandit algorithms with regret bounds that balance approximation error and information gain. Empirical results show improved sample efficiency over full quantum kernels, with reduced computational overhead for quantum-native tasks like variational algorithms and state preparation.

quantum kernel · gaussian process · bandit optimization · rkhs · regret bounds

Message Passing Enables Efficient Reasoning

arXiv cs.LG · Xuecheng Liu, Daman Arora, Gokul Swamy, Andrea Zanette · 2026-07-01

The paper introduces Message Passing Language Models (MPLMs), a framework for efficient LLM reasoning where threads communicate via lightweight send/receive primitives. MPLMs reduce communication costs by avoiding redundant context sharing and enable preemption for early termination of unpromising branches. Evaluations on Sudoku and 3-SAT puzzles show MPLMs require asymptotically smaller context than serial chain-of-thought (CoT) and parallel fork-join (FJ) methods, with fine-tuned models solving 25×25 puzzles that challenge existing approaches. Pre-trained models also achieve competitive performance on long-context QA when prompted with MPLM protocols.

message passing · reasoning efficiency · fork-join paradigm · preemption · context sharing

GSRQ: Gain-Shape Residual Quantization for Sub-1-bit KV Cache

arXiv cs.LG · Soosung Kim, Minjae Park, Eui-Young Chung, Jaeyong Chung · 2026-07-01

We propose Gain-Shape Residual Quantization (GSRQ), a novel method for sub-1-bit Key-Value (KV) cache compression in Large Language Models (LLMs). GSRQ addresses the centroid shrinkage issue in traditional Vector Quantization (VQ) by introducing Gain-Shape K-means (GSKM), which improves directional fidelity while maintaining ℓ2 distortion. GSRQ integrates a weighted extension of GSKM into a Residual Quantization pipeline. Evaluated on LLaMA-3-8B, GSRQ achieves significant improvements over existing KV cache quantization baselines, increasing average accuracy on LongBench tasks from 11.34 to 33.54 at 1-bit precision.

kv cache · residual quantization · vector quantization · llama-3 · longbench
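Gain-shape coding separates a vector's norm (gain) from its direction (shape), so a single stage reproduces the input's magnitude exactly instead of shrinking it toward a centroid. The sketch below encodes with a unit-norm shape codebook inside a residual pipeline. It assumes the codebooks are given, leaves the gain unquantized, and omits GSKM training and its weighting, so it illustrates the general idea rather than GSRQ itself.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def gain_shape_encode(x, shape_codebook):
    """Split x into a gain (its norm) and the index of the best-aligned unit codeword."""
    g = norm(x)
    if g == 0.0:
        return 0.0, 0
    best = max(range(len(shape_codebook)),
               key=lambda j: sum(a * b for a, b in zip(x, shape_codebook[j])))
    return g, best


def gain_shape_decode(g, idx, shape_codebook):
    return [g * c for c in shape_codebook[idx]]


def residual_gain_shape(x, codebooks):
    """Residual quantisation: each stage gain-shape-codes what earlier stages missed."""
    residual, recon, codes = list(x), [0.0] * len(x), []
    for cb in codebooks:
        g, idx = gain_shape_encode(residual, cb)
        q = gain_shape_decode(g, idx, cb)
        codes.append((g, idx))
        recon = [r + qi for r, qi in zip(recon, q)]
        residual = [r - qi for r, qi in zip(residual, q)]
    return codes, recon
```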

Characterizing and Identifying Separable Graphical Models

arXiv cs.LG · Christopher Meek, Kayvan Sadeghi · 2026-07-01

The authors introduce separable graphical models, a class of mixed graphs with directed, undirected, and bidirected edges that encode independence structures arising from feedback, latent, and selection mechanisms. They define separable graphs, where each missing edge implies a separating set for its endpoints, and essentially separable graphs, which are separation-equivalent to separable graphs. The study provides graphical and separational characterizations of these models, extending prior results for specific subfamilies. A canonical representation for equivalence classes of essentially separable graphs is developed, along with an identification algorithm under suitable assumptions.

separable graphs · essentially separable graphs · graphical models · separation equivalence · mixed graphs

The Model Organism Lottery: Model Organism Interpretability Strongly Depends on Training Methodology

arXiv cs.LG · Andrzej Szablewski, Gabriel Konar-Steenberg, Raffaello Fornasiere, Nikita Menon · 2026-07-01

This study demonstrates that model organism (MO) interpretability is highly sensitive to training methodology, challenging the validity of current MOs as interpretability proxies. The authors construct 54 MO variants based on OLMo2-1B and gemma-3-1b-it architectures using seven training techniques, including post-hoc supervised fine-tuning (SFT), post-hoc DPO, and integrated training during OLMo's DPO phase. They evaluate interpretability methods like activation oracles and sparse autoencoders across these variants. Results show interpretability depends on training objective, target behavior, architecture, and data pipeline, with integrated training often producing less interpretable MOs than post-hoc methods. Variance persists even after controlling for behavior expression strength.

model organism · interpretability · supervised fine-tuning · activation oracles · sparse autoencoders

How Much Do RF Drone Benchmarks Overstate? A Controlled Study and Theory of Data Leakage in UAV Signal Identification

arXiv cs.LG · David Shulman · 2026-07-01

The study quantifies data leakage in RF-based drone identification benchmarks, showing that segment-level cross-validation inflates accuracy when training and test sets contain near-duplicate signal segments from continuous recordings. The authors formalize this optimism using Cover's function-counting theorem, proving classifiers can memorize recording-specific features when the number of independent recordings (R) is small relative to feature dimension (d), specifically when 2R ≤ d. Controlled experiments on synthetic data and the DroneRF dataset demonstrate naive accuracy inflation up to 0.5 above Bayes-optimal performance, with drone type identification F1 dropping from 0.74 to chance level (0.46) under honest evaluation.

data leakage · cross-validation · counter-uas · function-counting theorem · rf sensing
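The leakage mechanism comes down to how segments are assigned to train and test sets. This is a minimal sketch, assuming each segment is tagged with a hypothetical recording ID. The naive split shuffles individual segments, so segments from one recording end up on both sides. The recording-level split keeps each recording whole. scikit-learn's `GroupKFold` applies the same idea to cross-validation.

```python
import random


def segment_level_split(segments, test_frac, seed=0):
    """Naive split: shuffles individual segments, so a recording can land on both sides."""
    rng = random.Random(seed)
    idx = list(range(len(segments)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_frac)
    return idx[n_test:], idx[:n_test]


def recording_level_split(segments, test_frac, seed=0):
    """Honest split: whole recordings go to train or test, never both."""
    rng = random.Random(seed)
    recs = sorted({rec for rec, _ in segments})
    rng.shuffle(recs)
    test_recs = set(recs[:max(1, int(len(recs) * test_frac))])
    train = [i for i, (rec, _) in enumerate(segments) if rec not in test_recs]
    test = [i for i, (rec, _) in enumerate(segments) if rec in test_recs]
    return train, test


def shared_recordings(segments, train, test):
    """Recording IDs that contribute segments to both sides (should be empty)."""
    return {segments[i][0] for i in train} & {segments[i][0] for i in test}
```

Here `segments` is a list of `(recording_id, segment_index)` pairs. With only a few independent recordings, the honest split leaves very few distinct test sources, which is exactly the small-R regime the summary describes.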

Seahorse: A Unified Benchmarking Framework for Spatiotemporal Event Modeling

arXiv cs.LG · Yahya Aalaila, Gerrit Großmann, Sebastian Vollmer · 2026-07-01

The paper introduces SEAHORSE, a unified benchmarking framework for neural spatiotemporal point processes (STPPs) that standardizes evaluation protocols across diverse model families. The framework implements a common encode-evolve-decode interface, enforces consistent preprocessing and raw-coordinate likelihood reporting, and includes HawkesNest, a synthetic test suite for stress-testing. Experiments reveal that increasing event-pattern complexity differentially affects model families, with some exhibiting sharp performance degradation while others remain stable, highlighting their inductive biases.

spatiotemporal point processes · benchmarking framework · neural intensity models · inductive bias · synthetic stress-test

Generative Model Proposal based Particle Filtering for Data Assimilation

arXiv cs.LG · Chandni Nagda, Mayank Shrivastava, Gudrun Thorkelsdottir, Gan Zhang, Morteza Mardani · 2026-07-01

The paper introduces Flow Proposal Particle Filters (FPPF), a novel particle filtering method for data assimilation that combines generative modeling with principled Bayesian updates. FPPF learns a conditional generative model to approximate the optimal proposal distribution, steering particles toward high-likelihood regions and reducing weight variance. The method retains tractable likelihood evaluation for accurate importance weighting and extends to high-dimensional problems via localization. Experiments demonstrate FPPF's superiority over statistical baselines and other generative methods in non-linear, non-Gaussian, and high-dimensional dynamical systems.

particle filters · data assimilation · generative models · bayesian filtering · optimal proposal
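The proposal's role shows up in the importance weight w ∝ p(y|x) · p(x|x_prev) / q(x|x_prev, y). This is a minimal sketch on a hypothetical 1D linear-Gaussian model, where the optimal proposal has a closed form and can stand in for FPPF's learned generative proposal. When every particle starts from the same state, the optimal proposal yields equal weights. The bootstrap proposal, which samples from the dynamics alone, collapses the effective sample size when the observation is informative.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def bootstrap_proposal(a, q):
    """Sample from the dynamics x ~ N(a * x_prev, q), ignoring the observation."""
    return lambda xp, y: (a * xp, q)


def optimal_proposal(a, q, r):
    """Closed-form p(x | x_prev, y) for x = a*x_prev + N(0, q), y = x + N(0, r)."""
    s2 = 1.0 / (1.0 / q + 1.0 / r)
    return lambda xp, y: (s2 * (a * xp / q + y / r), s2)


def pf_step(prev, y, a, q, r, propose, rng):
    """One importance-sampling step; returns new particles and normalised weights."""
    new, weights = [], []
    for xp in prev:
        mean, var = propose(xp, y)
        x = rng.gauss(mean, math.sqrt(var))
        w = normal_pdf(y, x, r) * normal_pdf(x, a * xp, q) / normal_pdf(x, mean, var)
        new.append(x)
        weights.append(w)
    total = sum(weights)
    return new, [w / total for w in weights]


def ess(weights):
    """Effective sample size of normalised weights."""
    return 1.0 / sum(w * w for w in weights)
```

Resampling and the localization step that FPPF uses for high-dimensional states are left out.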

Function-Counting Theory for Low-Dimensional Data Structures

arXiv cs.LG · Konstantin Häberle, Helmut Bölcskei · 2026-07-01

The work extends Cover's function-counting theory to binary classification on low-dimensional data structures by refining the general position assumption. It introduces dichotomy counts that incorporate data dimensionality, enabling analysis of how low-dimensional structure affects model capabilities. The framework further extends Cover's separation capacity and generalization problem to low-dimensional settings, providing mathematical tools to quantify these relationships.

function-counting theory · low-dimensional data · dichotomy counts · separation capacity · general position assumption
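For reference, here is Cover's classical count, which this work refines. For N points in general position in R^d, the number of dichotomies realizable by a hyperplane through the origin is C(N, d) = 2 · Σ_{k=0}^{d-1} binom(N-1, k). The sketch below evaluates it. It covers only the general-position case, not the paper's low-dimensional refinement.

```python
from math import comb


def cover_count(n_points, dim):
    """Cover's count of homogeneously linearly separable dichotomies of
    n_points in general position in R^dim."""
    return 2 * sum(comb(n_points - 1, k) for k in range(dim))


def separable_fraction(n_points, dim):
    """Fraction of all 2^n_points labelings that a hyperplane can realise."""
    return cover_count(n_points, dim) / 2 ** n_points


# Every labeling is separable while N <= d; exactly half are at N = 2d (the capacity).
print(separable_fraction(5, 5), separable_fraction(10, 5))
```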

Foundation Models vs. Radiomics for Lung Computed Tomography: A Benchmark of Feature Extractors, Classification Heads, and Segmentation Choices

arXiv cs.LG · Nils Neukirch, Martin Maurer, Nils Strodthoff · 2026-07-01

This study benchmarks foundation models against radiomics for lung CT analysis, isolating contributions of feature extractors, classification heads, and segmentation regimes across five clinical tasks. Five feature extractors (Curia, Curia-2, DINOv3, Radiomics2D, Radiomics3D), seven classification heads (TabPFN, TabICL, XGBoost, CatBoost, Random Forest, logistic regression, Ridge), and three segmentation regimes were evaluated on tumor volume/stage classification, 2-year survival prediction, histology classification, and age prediction. Models were trained on LUNG1 (n=338) and tested on LUNG2 (n=211), with worst-case cross-cohort performance as the primary metric. Results show task-dependent dominance: segmentation drives volume/stage classification, while classifier choice drives survival/histology/age prediction. Curia with tumor segmentation and CatBoost head emerged as a robust default, though task-specific pipelines consistently outperformed cross-task defaults.

radiomics · feature extractors · classification heads · segmentation regimes · cross-cohort robustness

Deep Multitask Learning for Mixed-Type Outcomes with Shared Sparsity

arXiv cs.LG · Huichao Li, Tong Wang, Sanguo Zhang, Shuangge Ma · 2026-07-01

The paper proposes a multitask transformation framework for mixed-type outcomes, addressing incomparable task-specific losses through unknown monotone transformations. The method employs a group-Lasso penalty on a smoothed rank-based criterion, implemented via a deep neural network with shared first-layer sparsity to identify common predictors. Theoretical guarantees include nonasymptotic excess-risk bounds and variable-selection consistency. Simulations and gene-expression analyses demonstrate improved prediction accuracy and biologically interpretable shared predictors compared to baseline methods.

multitask learning · shared sparsity · group-lasso · rank-based criterion · variable-selection consistency
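Shared first-layer sparsity is commonly enforced with the group-lasso proximal step. This step shrinks each input feature's block of weights and sets the whole block exactly to zero when its norm falls below the threshold. This is a minimal sketch, assuming one group per input feature (all first-layer weights leaving that feature) and a plain proximal-gradient update. The paper's smoothed rank-based criterion and optimizer are not reproduced.

```python
import math


def group_soft_threshold(group, thresh):
    """Proximal map of thresh * ||group||_2: shrink the whole group, or zero it."""
    n = math.sqrt(sum(w * w for w in group))
    if n <= thresh:
        return [0.0] * len(group)
    scale = 1.0 - thresh / n
    return [scale * w for w in group]


def prox_first_layer(weights, lam, step):
    """weights[j] holds all first-layer weights attached to input feature j, so a
    zeroed group drops that predictor for every task at once."""
    return [group_soft_threshold(g, lam * step) for g in weights]


def selected_features(weights):
    """Indices of input features whose weight group is not entirely zero."""
    return [j for j, g in enumerate(weights) if any(w != 0.0 for w in g)]
```

In a full training loop this prox would follow each gradient step on the smooth part of the loss, with `step` set to the learning rate.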

Automatic Detection of Stress from Speech in the Trier Social Stress Test

arXiv cs.LG · Hanna Drimalla, Wieland R. Cremer, Christine Kraus, Oliver T. Wolf · 2026-07-01

The study demonstrates speech-based stress detection using acoustic-prosodic features, achieving performance significantly above baseline in differentiating Trier Social Stress Test (TSST) conditions from controls. A pipeline incorporating speaker diarization and machine learning models analyzed data from 50 participants, predicting physiological and affective stress responses. Feature-importance analyses identified key acoustic predictors, validating speech as an unobtrusive multimodal stress indicator.

acoustic-prosodic featuresspeaker diarizationtrier social stress teststress detectionmachine learning

Understanding How Humans Inject Knowledge into Machine Learning Workflows through Visual Analytics

arXiv cs.LG · Yiwen Xing, Philip Beaucamp, Joyraj Chakraborty, Afrah Farea · 2026-07-01

The study systematically analyzes how visual analytics (VIS4ML) facilitates human knowledge injection into machine learning workflows through a survey of 200+ IEEE VIS papers. Researchers developed a coding scheme to examine ML characteristics, visualization techniques, interaction methods, and human actions across four dimensions. Results demonstrate distinct pathways for knowledge transfer via interactive visualization, supported by a conceptual model of VA as model building and information-theoretic cost-benefit analysis. The work provides empirical evidence for VA's role in optimizing ML workflows.

visual analyticsmachine learning workflowsknowledge injectioninteractive visualizationinformation-theoretic analysis

Bridging Quantum Computing Paradigms toward Semiconductor Yield: A Controlled CV-versus-DV Comparison on Wafer-Map Defect Classification

arXiv cs.LG · Yeonhong Kim, Jonghyeok Im, Monu Nath Baitha, Kyoungsik Kim · 2026-07-01

This study compares continuous-variable (CV) and discrete-variable (DV) quantum neural networks (QNNs) for wafer-map defect classification, isolating quantum circuit effects via a shared convolutional backbone (~4.3M parameters) and interchangeable heads. CV-QNNs consistently outperform DV-QNNs, achieving 79.7% accuracy versus 61.6% at four qumodes/qubits, with a notable advantage in spatially localized defect classes. Training curves indicate DV limitations stem from representational capacity, not optimization. CV's structured layer and continuous phase-space encoding drive its superiority. Both QNNs remain below the classical baseline (85.0%), but the controlled setting highlights CV's potential for practical advantage as hardware improves.

quantum neural networkscontinuous-variablediscrete-variablewafer-mapdefect classification

LeNEPA: No-Augmentation Next-Latent Prediction for Time-Series Representation Learning

arXiv cs.LG · Alexander Chemeris, Ming Jin, Randall Balestriero · 2026-07-01

LeNEPA introduces a no-augmentation self-supervised learning (SSL) approach for time-series representation learning, focusing on next-latent-token prediction with a causal backbone. The method replaces stop-gradient/EMA stabilization with SIGReg-based isotropy regularization and computes the predictive loss in a lightweight projected space that is discarded at evaluation time. Evaluated on PTB-XL and Diag datasets under a fixed-horizon frozen-probe protocol, LeNEPA achieves consistent performance gains compared to an ECG-tuned JEPA recipe, which struggles when reused unchanged across datasets. LeNEPA demonstrates faster early representation acquisition, reaching 80% of final AUROC/AUPRC gains in 2–5k updates versus 5–10k for JEPA. External validation on UCR-128 shows LeNEPA achieves 77.65% Random-Forest accuracy, competitive with state-of-the-art methods.

sslisotropy regularizationlatent predictioncausal backbonefrozen-probe

Explainable AI for Cancer Drug Response Prediction: Beyond Univariate Feature Attributions

arXiv cs.LG · Martino Ciaperoni, Margherita Lalli, Simone Piaggesi, Martina Varisco · 2026-07-01

The authors present ILLUME+, an explainable AI framework for cancer drug response prediction that extends beyond univariate feature attribution. The method integrates multiple complementary explanation forms into an end-to-end pipeline, addressing limitations of current approaches in computational cost, robustness, and biological interpretability. Results demonstrate improved stability in gene importance scores, recovery of known drug-gene associations, and identification of novel interaction-driven molecular signals, facilitating hypothesis generation in precision oncology.

explainable aidrug response predictionprecision oncologyfeature attributiontranscriptomic profiles

Beyond Activation Alignment: The Alignment-Diversity Tradeoff in Task-Aware LLM Quantization

arXiv cs.LG · Fei Wang, Chao Xue, Taoran Liu, Li Shen · 2026-07-01

The paper introduces TASA (Task-Aware Sensitivity Analysis), a two-level framework addressing the Alignment-Diversity Tradeoff in mixed-precision quantization (MPQ) for LLMs. It identifies the Perplexity Illusion, where perplexity-based layer importance shows negligible correlation (Kendall τ≈0) with reasoning performance. TASA optimizes calibration-data composition via gradient-trace alignment and combines perplexity/reasoning signals for bit allocation. Experiments on LLaMA-3-8B and Qwen2.5-7B demonstrate that 3.5-bit TASA models outperform 4-bit baselines, achieving >20-point GSM8K improvements over W3 baselines.

mixed-precision quantizationperplexity illusionalignment-diversity tradeoffgradient-trace alignmenttask-aware sensitivity
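The "Perplexity Illusion" is reported as a Kendall τ near zero between perplexity-based layer importance and reasoning performance. The sketch below computes Kendall's tau-a (assuming no ties) on hypothetical per-layer scores. It shows how two rankings can share no net ordering. The numbers are illustrative only, not from the paper.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs; assumes no ties."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1  # both scores order this pair of layers the same way
        elif s < 0:
            discordant += 1  # the two scores disagree on this pair
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical per-layer scores for four layers.
ppl_importance = [0.9, 0.7, 0.5, 0.3]
reasoning_sensitivity = [0.2, 0.8, 0.1, 0.6]
print(kendall_tau(ppl_importance, reasoning_sensitivity))  # → 0.0 (3 concordant, 3 discordant pairs)
```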

The Binary Tree Mechanism is Optimal for Approximate Differentially Private Continual Counting

arXiv cs.LG · Konstantina Bairaktari, Kasper Green Larsen · 2026-07-01

The work establishes the asymptotic optimality of the binary tree mechanism for approximate differentially private continual counting, resolving a central open problem. By proving an Ω(log^(3/2) n) lower bound on the expected ℓ_∞ error for any DP mechanism, the authors demonstrate that no algorithm can outperform the binary tree mechanism's error scaling with stream length n. This result also yields a tight separation between hereditary discrepancy and private ℓ_∞ error for linear queries, confirming the optimality of known general upper bounds.

differential privacycontinual countingbinary tree mechanismℓ_∞ errorhereditary discrepancy
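The binary tree mechanism that this result proves optimal adds noise once per dyadic block of the stream. It then answers each running count from at most about log2(n)+1 noisy blocks. Below is a minimal sketch, assuming a 0/1 stream and the classical Gaussian-mechanism calibration, which is valid only for ε < 1. It has not been audited as a production DP implementation. The helper names are hypothetical.

```python
import math
import random

def gaussian_sigma(levels, epsilon, delta):
    # Classical Gaussian-mechanism noise scale (assumes epsilon < 1). Each 0/1
    # element lies in at most one block per level, so the L2 sensitivity is sqrt(levels).
    return math.sqrt(levels) * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

def binary_tree_counts(stream, sigma, rng=random):
    """Noisy running counts via the binary tree (dyadic interval) mechanism."""
    n = len(stream)
    levels = max(1, n.bit_length())
    noisy = {}  # (level k, block index j) -> noisy sum of stream[j*2^k:(j+1)*2^k]
    for k in range(levels):
        size = 1 << k
        for j in range(n // size):
            noisy[(k, j)] = sum(stream[j * size:(j + 1) * size]) + rng.gauss(0.0, sigma)
    counts = []
    for t in range(1, n + 1):
        start, total = 0, 0.0
        for k in reversed(range(levels)):  # greedily cover [0, t) with the largest blocks
            size = 1 << k
            if start + size <= t:
                total += noisy[(k, start // size)]
                start += size
        counts.append(total)
    return counts

stream = [1, 0, 1, 1, 0, 1, 1]
sigma = gaussian_sigma(max(1, len(stream).bit_length()), epsilon=0.5, delta=1e-6)
print(binary_tree_counts(stream, sigma, rng=random.Random(0)))
```

With `sigma=0.0` the function returns the exact prefix sums. This makes it easy to check that the block decomposition is correct before adding noise.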

Constrained Bayesian Optimisation with Multiple Information Sources

arXiv cs.LG · Hauke Maathuis, Roeland De Breuker, Saullo Castro, Maike Osborne · 2026-07-01

The authors propose a multi-source Bayesian Optimization (BO) framework for constrained optimization problems with small feasible regions, extending constrained Max-value Entropy Search to leverage auxiliary data sources. The method models inter-source correlations and balances evaluation cost against information gain, enabling efficient exploration even with weakly correlated auxiliary data. Experiments on synthetic and physics-based benchmarks demonstrate superior performance over existing methods, particularly in early-stage optimization phases where feasible solutions are scarce.

bayesian optimizationconstrained optimizationmulti-source learningentropy searchfeasible regions

MoVA: Learning Asymmetric Dual Projections for Modular Long Video-Text Alignment

arXiv cs.LG · Peiyuan Zhu, Shaoan Xie, Zijian Li, Yifan Shen · 2026-07-01

MoVA introduces asymmetric dual projections for modular long video-text alignment, addressing temporal misalignment and semantic asymmetry in contrastive pre-training. The method employs a text-side projection for frame-aware caption subspaces and a video-side projection to disentangle text-relevant visual concepts, preserving global cross-modal semantics. Empirical results show MoVA outperforms existing methods in video-text alignment tasks.

contrastive pre-trainingtemporal misalignmentsemantic asymmetrycross-modal semanticsframe-aware subspaces

Shapley in Context: Explaining Financial Language with Domain Expertise

arXiv cs.LG · Dangxing Chen, Pengzhan Guo · 2026-07-01

The authors demonstrate that Shapley values can produce explanations consistent with financial domain knowledge when interpreting large language models applied to financial text. They investigate whether Shapley-based attributions align with established financial reasoning through theoretical analysis and empirical evaluations. Results show that Shapley values offer meaningful insights into model behavior in text-based financial applications, addressing the critical need for explainability in high-stakes, regulated domains.

shapley valuesfinancial textexplainabilitylarge language modelsdomain knowledge
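To make the attribution under study concrete, the sketch below computes exact Shapley values by enumerating all coalitions of tokens. The `toy_sentiment` value function is a hypothetical stand-in for an LLM's score on a financial headline, not the authors' setup. With real models, exact enumeration is exponential, so approximations are normally used.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by enumerating all coalitions (feasible for a handful of tokens)."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(n):
            for coalition in combinations(others, r):
                s = frozenset(coalition)
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                total += weight * (value(s | {p}) - value(s))  # weighted marginal contribution
        phi[p] = total
    return phi

# Hypothetical "model": sentiment score of a headline given the tokens kept.
def toy_sentiment(tokens):
    score = 0.0
    if "beat" in tokens:
        score += 0.6
    if "guidance" in tokens and "cut" in tokens:
        score -= 0.8  # interaction: only the pair "guidance cut" is negative
    return score

print(shapley_values(["beat", "guidance", "cut"], toy_sentiment))
```

Shapley values split the negative interaction evenly between "guidance" and "cut" (about -0.4 each) and credit "beat" with its additive 0.6. Whether such splits match financial reasoning is the question the paper examines.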

Mirror-Fusion Attention for Reflection-Aware Self-Supervised Representation Learning

arXiv cs.LG · Ruixin Li, Jin Liu, Yuling Shi, Stefano Lodi · 2026-07-01

The paper introduces Mirror-Fusion-Augmented Self-Supervised Learning (MFASSL), a Vision Transformer framework that enhances standard SSL by incorporating a soft reflection prior without backbone modification. MFASSL employs mirror-paired views aligned to an estimated symmetry axis and a lightweight Mirror-Fusion Attention (MFA) module for adaptive token-level interaction between mirrored regions. The approach combines reflection-consistency and mid-layer token-alignment losses with the base SSL objective. Evaluated on CheXpert, BraTS, CelebA-HQ, and WFLW, MFASSL outperforms MoCo-v3, DINO, and MAE baselines in downstream performance, calibration, and reflection robustness, achieving these gains with only ~2.7% additional parameters compared to equivariant SSL methods.

self-supervised learningvision transformermirror-fusion attentionreflection priortoken-alignment

Spectroscopy Analysis with Machine Learning Regression for the Quantification of Carbon and Nitrogen Contents in Inceptisol and Oxisol Soil Types: Comparing Different Preprocessing and Validation methods as well as Feature Importance

arXiv cs.LG · Vinicius Herique Kieling, Guilherme Macedo Baggio, Felipe Augusto Bueno Rossi, Marco Antonio de Castro Barbosa · 2026-07-01

This study presents a machine learning framework for quantifying carbon and nitrogen content in Oxisols and Inceptisols using Near-Infrared spectroscopy. The approach employs Savitzky-Golay filtering and NIPALS-based outlier removal for preprocessing, followed by stacking ensemble models combining Partial Least Squares, Support Vector Regression, and Ridge regression. Validation strategies included 10-fold cross-validation, leave-one-out, and Kennard-Stone holdout. The models achieved RPD > 2.0 with minimal overfitting, demonstrating robust predictive performance across soil types. Results indicate pedological characteristics significantly influence model performance, highlighting the method's potential for rapid soil analysis in sustainable agriculture.

near-infrared spectroscopysavitzky-golay filterstacking ensemblepartial least squareskennard-stone
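The Kennard-Stone holdout selects a maximally spread subset of spectra. The sketch below implements the classic selection rule in pure Python, assuming Euclidean distance on already-preprocessed spectra and `n_select >= 2`. The summary does not state which distance the authors used or how they assigned the selected samples between calibration and validation. `kennard_stone` is a hypothetical helper.

```python
import math

def kennard_stone(X, n_select):
    """Return indices chosen by Kennard-Stone: a maximally spread subset of samples."""
    n = len(X)
    dist = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    # Seed with the two most distant samples.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: dist[p[0]][p[1]])
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select and remaining:
        # Add the candidate whose nearest already-selected sample is farthest away.
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected

spectra = [[0.0, 0.0], [10.0, 0.0], [5.0, 0.0], [1.0, 0.0], [9.0, 0.0]]
picked = kennard_stone(spectra, 3)  # [0, 1, 2]
holdout = [i for i in range(len(spectra)) if i not in picked]
```

Because the rule is deterministic, the split is reproducible. Unlike random cross-validation folds, however, it covers the spectral space evenly rather than mirroring the sample distribution.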

From Pixels to Temporal Correlations: Learning Informative Representations for Reinforcement Learning Pre-training

arXiv cs.LG · Jinwen Wang, Youfang Lin, Xiaobo Hu, Siyu Yang · 2026-07-01

The paper proposes Multi-scale Temporal Contrastive Learning (MTCL), an unsupervised pre-training method for Reinforcement Learning that focuses on temporal correlations in videos. By modeling multi-scale temporal correlations separately in a temporal correlation space, MTCL balances attention across video elements to preserve crucial information often neglected by single-step transition prediction or image reconstruction. Experiments show MTCL improves sample efficiency and asymptotic performance across diverse downstream RL tasks compared to existing methods.

unsupervised pre-trainingtemporal correlationscontrastive learningrepresentation learningreinforcement learning

Local Motion Matters: A Deconstruct-Recompose Paradigm for Reinforcement Learning Pre-training from Videos

arXiv cs.LG · Jinwen Wang, Youfang Lin, Xiaobo Hu, Shuo Wang · 2026-07-01

We propose a Deconstruct-Recompose Paradigm (DRP) for transferable reinforcement learning pre-training from videos, addressing limitations of global motion modeling. DRP decomposes agent motion into local Atomic Actions tracked frame-wise, encodes their spatiotemporal relationships via a Dual-Attention Encoder (DAE), and recomposes them using a Motion Aggregation Token (MAT) with a latent dynamics model. An adapter bridges local motion to downstream task dynamics. Experiments demonstrate DRP's effectiveness in robotic control and manipulation tasks, significantly improving sample efficiency and performance across domains.

atomic actiondual-attention encodermotion aggregation tokenlatent dynamics modelsample efficiency

Task-Relevant Representation Decoupling for Visual Reinforcement Learning Generalization

arXiv cs.LG · Jinwen Wang, Youfang Lin, Xiaobo Hu, Qian Xu · 2026-07-01

The paper proposes Task-Relevant Representation Decoupling (T2RD), a self-supervised algorithm for improving generalization in Visual Reinforcement Learning (VRL). T2RD decouples observations into task-relevant and task-irrelevant representations via three components: task-relevant representation consistency, cross-reconstruction, and cross-dynamic prediction. The method achieves state-of-the-art generalization performance and sample efficiency on the DeepMind Control Suite and Robotic Manipulation tasks.

visual reinforcement learningrepresentation decouplingself-supervised learninggeneralizationdynamic prediction

Which Metric Reflects the Spelling Rate Accuracy in Event-Related Potential-Based Brain-Computer Interfaces?

arXiv cs.LG · Okba Bekhelifi, Naoual El Djouher Mebtouche · 2026-07-01

The study identifies performance metrics that best correlate with spelling rate accuracy in Event-Related Potential (ERP)-based Brain-Computer Interfaces (BCIs), addressing class imbalance in binary classification. Analyzing 13 metrics across two datasets (LARESI ERP and OpenBMI ERP), the authors evaluate their correlation with spelling rate and sensitivity to trial repetition. Results highlight the Brier score, Matthews Correlation Coefficient (MCC), and class-imbalance-aware metrics (ROC AUC, PR AUC, AP, pAUC) as most reflective of user spelling performance, recommending their adoption in ERP-BCI research.

event-related potentialbrain-computer interfacespelling rateclass imbalanceperformance metrics
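The sketch below shows why accuracy misleads under ERP class imbalance while MCC and the Brier score do not. It uses a hypothetical block of 12 flashes containing 2 targets. A classifier that never predicts "target" is 83% accurate but has an MCC of 0. Returning 0 when the MCC denominator vanishes is a common convention assumed here.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = target, 0 = non-target)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def brier(y_true, p_target):
    """Mean squared error between the predicted target probability and the 0/1 label."""
    return sum((p - t) ** 2 for t, p in zip(y_true, p_target)) / len(y_true)

y_true = [1, 1] + [0] * 10                 # 2 target flashes out of 12
always_non_target = [0] * 12               # accuracy 10/12, MCC 0.0
one_hit_one_false_alarm = [1, 0, 1] + [0] * 9
print(mcc(y_true, always_non_target), mcc(y_true, one_hit_one_false_alarm))
print(brier(y_true, [0.5] * 12))
```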

Evaluating Pretrained Music Embeddings for Cross-Performance Jazz Standard Recognition

arXiv cs.LG · Çağrı Eser · 2026-07-01

The study evaluates pretrained music embeddings for cross-performance jazz standard recognition, a challenging retrieval task due to performance variations in tempo, key, and improvisation. Using a curated subset of the Jazz Trio Database, the authors compare a from-scratch Harmonic CNN baseline with frozen embeddings from music foundation models, employing supervised probing and nearest-neighbor retrieval. Results indicate that from-scratch models overfit to training performances, while pretrained embeddings yield better top-$k$ retrieval but are sensitive to performer identity, an effect partially mitigated by a contrastive projection. The work positions jazz standard recognition as a stress test for music representation models.

music retrievalpretrained embeddingsharmonic cnncontrastive projectionjazz standard recognition

Soft Mixture-of-Recursions: Going Deeper with Recursive Vision Transformers

arXiv cs.LG · Sang In Lee, Jihun Park · 2026-07-01

The paper introduces Soft Mixture-of-Recursions (SoftMoR), a method to enhance recursive Vision Transformers by learning token-wise mixture weights that combine outputs from all recursion steps. This approach enables effective utilization of intermediate representations, allowing deeper models with minimal parameter overhead. The proposed Soft Recursive Vision Transformer (SR-ViT) demonstrates consistent performance gains with increasing recursion depth, achieving 82.48% top-1 accuracy on ImageNet-1K with 4 recursion steps (1.7M additional parameters), outperforming DeiT-B while using 27% fewer parameters.

recursive transformersvision transformersparameter efficiencymixture weightsintermediate representations
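The core SoftMoR idea is to apply one weight-shared block repeatedly and mix every recursion step's output with token-wise weights. The sketch below shows that idea with plain lists. The `block` and `router` functions are hypothetical placeholders: the summary does not describe how SR-ViT computes its mixture logits.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def soft_mixture_of_recursions(tokens, block, router, steps):
    """Run a weight-shared block `steps` times, then mix all step outputs per token.

    tokens: list of token vectors (lists of floats)
    block:  maps the full token list to a new token list (same weights every step)
    router: maps one token's per-step states to `steps` logits (hypothetical form)
    """
    states, h = [], tokens
    for _ in range(steps):
        h = block(h)  # recursion: reuse the same block instead of adding new layers
        states.append(h)
    out = []
    for i, tok in enumerate(tokens):
        per_step = [states[r][i] for r in range(steps)]
        w = softmax(router(per_step))  # token-wise weights over recursion steps
        out.append([sum(w[r] * per_step[r][d] for r in range(steps)) for d in range(len(tok))])
    return out

double = lambda toks: [[2.0 * v for v in t] for t in toks]  # toy stand-in for a ViT block
uniform = lambda states: [0.0] * len(states)
print(soft_mixture_of_recursions([[1.0]], double, uniform, steps=2))  # mean of 2.0 and 4.0
```

Intermediate steps contribute to the output instead of being overwritten. The only extra parameters are those of the router, which is consistent with the small overhead reported.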

Accelerating Discrete Diffusion Models with Parallel-In-Time Sampling

arXiv cs.LG · Yu Yao, Huanjian Zhou, Andi Han, Wei Huang · 2026-07-01

The authors propose a parallel-in-time sampling method to accelerate discrete diffusion models, specifically targeting the $τ$-leaping algorithm in Continuous-Time Markov Chains. By reformulating the stochastic integral form of $τ$-leaping and applying Picard iteration, they achieve exponential-factorial convergence, reducing time complexity from ${\mathcal{O}}(d \log S)$ to ${\mathcal{O}}(\log (d\log S)\cdot \log d)$. Empirical results demonstrate 7–9× speedup on synthetic data and 1.45–1.86× on image/text tasks with 50% fewer NFE, maintaining output quality.

discrete diffusion modelsparallel-in-time samplingτ-leaping algorithmcontinuous-time markov chainpicard iteration
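The sketch below shows the parallel-in-time Picard idea in a deliberately simpler setting: an Euler-discretized ODE rather than the authors' τ-leaping CTMC sampler. Each Picard sweep evaluates the drift at every time point independently, which can be done in parallel, and then takes a prefix sum. The names are hypothetical.

```python
def sequential_euler(f, x0, dt, n):
    """Baseline: n strictly sequential Euler steps."""
    xs = [x0]
    for _ in range(n):
        xs.append(xs[-1] + dt * f(xs[-1]))
    return xs

def picard_parallel(f, x0, dt, n, iters):
    """Picard iteration on the whole trajectory: x_{i+1} = x0 + dt * sum_{j<=i} f(x_j)."""
    xs = [x0] * (n + 1)  # initial guess: constant trajectory
    for _ in range(iters):
        drifts = [f(x) for x in xs]  # independent evaluations, parallelizable across time
        new, acc = [x0], x0
        for i in range(n):
            acc += dt * drifts[i]    # prefix sum (could also run as a parallel scan)
            new.append(acc)
        xs = new
    return xs

f = lambda x: -x
print(picard_parallel(f, 1.0, 0.1, 5, iters=5) == sequential_euler(f, 1.0, 0.1, 5))
```

After k sweeps, the first k+1 points match the sequential solution exactly, so n sweeps are always enough. The speedup claimed in the paper comes from converging in far fewer sweeps than steps.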

Forensic-Oriented Intrusion Detection Using Synthetic Network Traffic Data and Explainable Artificial Intelligence

arXiv cs.LG · Jose Luis Vela Alonso, Carmen Pellicer · 2026-07-01

A forensic-oriented intrusion detection framework integrates synthetic data generation, binary classification, and explainability to meet ISO/IEC and NIST standards for digital forensic investigations. The framework treats original datasets as immutable, hash-verified artefacts and trains on synthetic derivatives generated via SDV + CTGAN. XGBoost binary classification achieves F1-macro = 0.96 on CICIDS2017 using Train-on-Synthetic, Test-on-Real (TSTR) evaluation, comparable to real-data baselines. SHAP TreeExplainer provides instance-level feature attributions, confirming synthetic training preserves forensically relevant attack fingerprints. Cross-dataset validation on UNSW-NB15 and Kitsune identifies feature space dimensionality as a key factor, with a practical deployment boundary of ~30 numeric flow-level features.

synthetic data generationxgboostshap treeexplainertrain-on-syntheticforensic-oriented

MosaicKV: Serving Long-Context LLM with Dynamic Two-D KV Cache Compression

arXiv cs.LG · Sheng Qiang, Ruiwei Chen, Yinpeng Wu, Jinyu Gu · 2026-07-01

MosaicKV introduces dynamic two-dimensional KV cache compression for long-context LLM serving, addressing memory and throughput challenges. The system applies fine-grained compression by identifying important elements within KV cache segments, exploiting their non-uniform importance distribution. It manages the compressed KV cache on otherwise underutilized GPU and CPU resources to minimize overhead. Evaluations on an H800 GPU demonstrate a 16x attention speedup, 4.8x lower decode latency, 7.3x higher throughput, and 3x memory reduction, with only 1.76% average accuracy loss on the LongBench and RULER benchmarks.

kv cachelong-contextcompressionattention computationgpu memory

Generative Refinement for Low-Budget Black-Box Optimization

arXiv cs.LG · Edouard R. Dufour, Pascal Fua · 2026-07-01

The paper introduces SPARROW, a black-box optimization algorithm that decouples generative priors from reward signals to address low-budget settings. The method uses any sampler with a known corruption process as a fixed proposal operator, guided by rank-based optimization over an archive of evaluated candidates. Theoretical guarantees ensure convergence over the sampler support, while empirical results demonstrate effectiveness on noisy, geometrically complex landscapes with evaluation budgets as low as 100.

black-box optimizationgenerative priorrank-based guidancelow-budget settingcorruption process

AdaBoosting Text Prompts for Vision-Language Models

arXiv cs.LG · Seokhee Jin, Changhwan Sung, Sunung Mun, Hoyoung Kim · 2026-07-01

The paper proposes Text Prompt Boosting (TPB), an AdaBoost-inspired framework for improving few-shot text prompts in Vision-Language Models (VLMs). TPB treats text-prompt-based classifiers as weak learners and sequentially aggregates them by targeting misclassified examples, preserving task-intrinsic cues for cross-model transfer. Experiments across eleven benchmarks show TPB improves accuracy on source models and maintains shot-driven gains when transferred to larger VLMs, outperforming existing methods.

vision-language modelstext prompt boostingfew-shot learningadaboostcross-model transfer
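TPB is described as AdaBoost-inspired, with text-prompt classifiers acting as weak learners. The sketch below shows the multiclass AdaBoost (SAMME) reweighting loop on a fixed pool of hypothetical prompt classifiers' predictions. It simplifies TPB, which presumably learns a new prompt on the reweighted examples each round rather than picking from a pool.

```python
import math

def samme_boost(candidate_preds, labels, num_classes, rounds):
    """Multiclass AdaBoost (SAMME) over a fixed pool of weak learners.

    candidate_preds: one prediction list per weak learner (here, per hypothetical prompt)
    Returns a list of (learner_index, alpha) pairs.
    """
    n = len(labels)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        errs = [sum(wi for wi, p, y in zip(w, preds, labels) if p != y) for preds in candidate_preds]
        best = min(range(len(candidate_preds)), key=lambda m: errs[m])
        err = min(max(errs[best], 1e-10), 1 - 1e-10)
        alpha = math.log((1 - err) / err) + math.log(num_classes - 1)
        if alpha <= 0:
            break  # no better than chance: stop adding learners
        ensemble.append((best, alpha))
        # Up-weight the examples this learner misclassified, then renormalize.
        w = [wi * math.exp(alpha) if p != y else wi
             for wi, p, y in zip(w, candidate_preds[best], labels)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, candidate_preds, i, num_classes):
    """Alpha-weighted vote of the selected learners on example i."""
    votes = [0.0] * num_classes
    for m, alpha in ensemble:
        votes[candidate_preds[m][i]] += alpha
    return max(range(num_classes), key=lambda k: votes[k])

labels = [0, 0, 1, 1, 2, 2]
pool = [[0, 0, 1, 1, 0, 0], [1, 0, 1, 1, 2, 2], [0, 0, 2, 1, 2, 2]]  # each "prompt" errs somewhere
ens = samme_boost(pool, labels, num_classes=3, rounds=3)
print([predict(ens, pool, i, 3) for i in range(6)])
```

In this toy pool, each weak learner misclassifies a different example, but the weighted vote of all three is correct on all six.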

Distributed Online Bandit Submodular Maximization with Bounded Sampling Violations

arXiv cs.LG · Bin Du, Chang Liu, Dingqi Zhu, Lintao Ye · 2026-07-01

The paper proposes a distributed online algorithm for submodular maximization under partition matroid constraints, handling both full-information and bandit feedback models. The method employs a unified algorithmic framework with continuous relaxation and a novel bounded stochastic pipage rounding scheme to address sampling violations. Theoretical results demonstrate sublinear $(1-1/e)$-regret guarantees and sublinear cumulative sampling violations, with numerical validation confirming these findings.

submodular maximizationpartition matroidbandit feedbackstochastic roundingregret guarantees

Domain Arithmetic: One-Shot VLA Adaptation under Environmental Shifts

arXiv cs.LG · Taewook Kang, Taeheon Kim, Donghyun Shin, Jonghyun Choi · 2026-07-01

The paper introduces Domain ARiThmetic (DART), a one-shot adaptation method for Vision-Language-Action (VLA) models facing environmental shifts like camera pose changes or robot embodiment variations. DART employs weight vector arithmetic with domain-specific information addition, using subspace alignment of singular components to filter noise. Evaluated in simulated and real-world scenarios, DART outperforms existing VLA adaptation methods in one-shot settings across diverse visual and embodiment shifts.

vision-language-actiondomain adaptationweight arithmeticsubspace alignmentone-shot learning

What's a Credit Worth? A Market Framework for Attribution-Aware Compensation in Generative Music

arXiv cs.LG · Luyang Zhang, Xirui Jiang, Junwei Deng, Beibei Li · 2026-07-01

The paper proposes an attribution-aware compensation framework for generative music markets, where creators receive payments based on catalog-level data-attribution scores and signal-to-noise ratios. The method introduces a closed-form payment rule that adapts between royalty-based and fixed-fee licensing depending on attribution informativeness, while quantifying welfare costs of inaccurate attribution. Experiments with acoustic and symbolic music generation models show noisy attribution signals shift payments toward fixed-fee licensing and reduce welfare, motivating improved attribution techniques.

generative musicdata-attributionwelfare costsignal-to-noise ratioclosed-form payment

Measuring Dead Directions: Decomposing and Classifying Singular Structure off Canonical Alignment

arXiv cs.LG · Tejas Pradeep Shirodkar · 2026-07-01

The paper introduces a method for measuring singular structure in trained neural networks without requiring gradient descent or canonical alignment. The approach uses directional-Fisher rates to recover the order $k$ of dead directions and classify them as genuine singularities or flat gauge symmetries, with architecture-predicted orders verified across constructed cells and trained networks (including vision transformers). The method enables deterministic recovery of global learning coefficients $(λ, m)$ through typed intersection of loci and extends to Watanabe's singular fluctuation $ν(k)$. Results demonstrate accurate order recovery in LayerNorm-kernel gauges and compressed MLP activations.

singular structuredirectional-fisher ratelearning coefficientlayer normwatanabe theory

How Environment and Urbanization Shape Bird Diversity in Sri Lanka

arXiv cs.LG · Dilusha Chandrasiri, Maneesha Herath, Yasith Hewarathna, Muditha Herath · 2026-07-01

The study develops a scalable framework for analyzing avian biodiversity patterns in Sri Lanka by integrating spatial, temporal, and environmental data. Bird observation records were combined with 7 environmental variables (e.g., NDVI, ALAN) and analyzed using spatial thinning, effort-corrected metrics, and multivariate GLMs at 3 grid scales (2 km, 5 km, 10 km). Results indicate land-cover type outperforms continuous variables (e.g., temperature) in predicting diversity, while urbanization (ALAN) shows scale-dependent effects, increasing generalist abundance but reducing overall richness through community structure shifts.

spatial thinningeffort-corrected metricsnormalized difference vegetation indexartificial light at nightbeta diversity

Decision-focused Sparse Tangent Portfolio Optimization

arXiv cs.LG · Haeun Jeon, Seunghoon Choi, Hyunglip Bae, Yongjae Lee · 2026-07-01

The authors propose an end-to-end decision-focused learning framework for sparse tangent portfolio optimization, addressing the NP-hard cardinality-constrained formulation and misalignment in predict-then-optimize pipelines. The method reformulates Sharpe ratio maximization as a Disciplined Parametrized Programming (DPP)-compliant convex programming layer and replaces discrete selection with a smooth top-$k$ operator, enabling gradient flow through prediction, asset selection, and re-optimization. Evaluated across four major equity markets, the framework achieves competitive or superior out-of-sample Sharpe ratios compared to historical and prediction-focused baselines, particularly excelling in larger asset universes.

sparse tangent portfoliodisciplined parametrized programmingsharpe ratio maximizationtop-k operatorgradient flow
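The framework replaces discrete asset selection with a smooth top-$k$ operator, but the summary does not say which one. The sketch below uses one common construction, a sigmoid mask whose threshold is found by bisection so the mask sums to k. The scores and temperature are hypothetical. As the temperature shrinks, the mask approaches a hard selection of the k highest-scoring assets while remaining differentiable in the scores.

```python
import math

def _sigmoid(z):
    # Numerically stable logistic function (avoids overflow in math.exp).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_top_k(scores, k, temperature=0.1, iters=100):
    """Relaxed top-k mask sigmoid((s - tau) / T), with tau bisected so the mask sums to k (0 < k < n)."""
    def mask(tau):
        return [_sigmoid((s - tau) / temperature) for s in scores]

    lo = min(scores) - 50 * temperature  # mask sums to about n here
    hi = max(scores) + 50 * temperature  # mask sums to about 0 here
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sum(mask(mid)) > k:
            lo = mid  # too much mass selected: raise the threshold
        else:
            hi = mid
    return mask((lo + hi) / 2)

predicted_scores = [0.9, 0.1, 0.5, 0.8]  # hypothetical per-asset scores
print(soft_top_k(predicted_scores, k=2, temperature=0.05))  # about 1 for assets 0 and 3
```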

From Structural Equation Modelling to Double Machine Learning: Robustness Analysis for Survey-Based Research

arXiv cs.LG · Ka Ching Chan, Qiana Liu, Sanjib Tiwari, Ranga Chimhundu · 2026-07-01

The study introduces a robustness analysis framework combining Structural Equation Modelling (SEM), Ordinary Least Squares (OLS), and Double Machine Learning (DML) for survey-based research. SEM refines measurement structures, OLS provides regression benchmarks, and DML-style residualisation with Random Forest, Gradient Boosting, and Support Vector Machine learners evaluates stability of focal relationships. Applied to a FinTech Digital Customer Intimacy survey, the framework identifies relationships robust across methods and those requiring cautious interpretation. The work includes a reproducible template for adapting the workflow to other latent-construct studies.

structural equation modellingdouble machine learningresidualisationlatent constructsrobustness analysis

Prototype Language Models

arXiv cs.LG · Dan Ley, Giang Nguyen, Himabindu Lakkaraju, Julius Adebayo · 2026-07-01

The paper introduces Prototypes for Interpretable Sequence Modeling (PRISM), a prototype language model architecture that generates predictions via sparse, non-negative mixtures of learned prototypes anchored to training data neighborhoods. PRISM employs clustering objectives to maintain interpretability while matching dense baselines within 2.5 percentage points on downstream accuracy across models (130M to 1.6B parameters, 50B tokens). Results show 500x faster training data attribution, 3-point accuracy gains via calibrated prototype controllers, and targeted behavior removal without finetuning.

prototype language modelsinterpretable sequence modelingsparse mixturetraining data attributionhessian curvature

Ghost in the Kernel: In-Context Learning with Efficient Transformers via Domain Generalization

arXiv cs.LG · Peilin Liu, Ding-Xuan Zhou · 2026-07-01

The paper analyzes linear transformers' in-context learning capabilities through a domain generalization framework, demonstrating their ability to map context distributions to response functions with dimension-independent convergence rates. It proposes a theoretical foundation for feature mapping in linear attention, revealing tradeoffs between data distribution regularity and latent feature properties. Results inform activation and loss design strategies for linearizing pretrained softmax-based large language models while maintaining computational efficiency.

linear transformersin-context learningdomain generalizationfeature mappingconvergence rate

Interpretable vs Learned Encoders for High-Cardinality Fraud Detection

arXiv cs.LG · Xiao Han, Jingjing Liu, Moxuan Zheng, Zhen Zhang · 2026-07-01

The study evaluates seven categorical encoding methods for fraud detection on the IEEE-CIS benchmark (590,540 records, 3.5% positives), comparing interpretable and learned encoders via stratified 5-fold CV with three repetitions. Entity embeddings achieved the highest AUC-ROC (0.9612), statistically tied with CatBoost (0.9602) and superior to tier group encoding (0.9548), while CatBoost led on AUC-PR (0.822 vs. 0.793). TabNet underperformed under data scarcity. Multi-column analysis revealed embeddings' advantage stems from joint representation learning.

categorical encodingentity embeddingsauc-roclightgbmhigh-cardinality

How Early Is Early Enough? Design-Dependent Observation-Window Sufficiency in Subscription Churn Prediction

arXiv cs.LG · Xiao Han, Yao Xiao, Chenyu Wu, Tongchen Zhang · 2026-07-01

The study investigates observation-window sufficiency for subscription churn prediction, demonstrating its design-dependent nature. Using the KKBox dataset, the authors analyze early behavior indicators across nine temporal windows, finding a 45-90 day diminishing-returns knee for manual-renewal segments (PR +0.10 at 120 days). Three cohort/task designs reveal inverted or shifted sufficiency curves under different feature sets and moving targets. Results emphasize the necessity of specifying cohort construction, target definition, and feature families in window-sufficiency claims, though magnitudes may vary across domains.

churn predictionobservation windowdiminishing returnscohort constructionfeature families

Neural Network-Based Estimation of Time-Dependent Parameters in AR(p) Processes

arXiv cs.LG · Agnieszka Kopeć, Paweł Przybyłowicz, Martyna Wiącek · 2026-07-01

The authors propose a neural network-based framework for estimating time-varying parameters in TVAR(p) processes, maintaining interpretability while capturing nonstationary dynamics. The method handles both Gaussian and Laplace noise distributions, with explicit uncertainty quantification via prediction intervals. Theoretical analysis and numerical experiments demonstrate effective forecasting performance for TVAR(1) cases, showing robustness to heavy-tailed noise and sharp fluctuations through adaptive parameter estimation.

tvar(p)time-dependent parametersprediction intervalslaplace noisenonstationary dynamics

StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning

arXiv cs.LG · Yuan Qing, Chengzhi Mao, Boqing Gong · 2026-07-01

The paper introduces Stochastic Turn Depth (StochasT), a training method for Visual Instruction Tuning (VIT) that addresses the discrepancy between multi-turn training and single-turn evaluation in Large Vision-Language Models (LVLMs). StochasT stochastically groups language tasks for the same image into clusters of varying turn depths while preserving their natural order, mitigating visual attention decay and contextual overfitting. The method is evaluated using a Balanced Latin Square-based mechanism, demonstrating improved robustness across single-turn and multi-turn scenarios without data dropout.

visual instruction tuningstochastic turn depthlarge vision-language modelsbalanced latin squarecontextual overfitting

MolSafeEval: A Benchmark for Uncovering Safety Risks in AI-Generated Molecules

arXiv cs.LG · Tong Xu, Xinzhe Cao, Zhihui Zhu, Keyan Ding · 2026-07-01

The authors introduce MolSafeEval, a benchmark for assessing safety risks in AI-generated molecules, addressing a gap in current molecular generation benchmarks that overlook hazardous characteristics. The benchmark integrates heterogeneous safety knowledge into a molecular safety knowledge graph, enabling systematic detection of unsafe features via large language model-based reasoning. It categorizes molecular generative models into four task types (unconditional generation, property optimization, target protein-based design, text-based generation) and provides standardized datasets and evaluation protocols, revealing safety vulnerabilities in current approaches.

molecular generationsafety risksknowledge graphlarge language modelbenchmark

Information-Regularized Attention for Visual-Centric Reasoning

arXiv cs.LG · Guohao Sun, Xiaofang Wang, Yash Patel, Mengchen Liu · 2026-07-01

The paper introduces Information-Regularized Attention (IRA), a stochastic attention mechanism that explicitly regulates visual information flow in vision-language models (VLMs) to address instability issues like object hallucination and weak visual grounding. IRA reparameterizes visual uncertainty as layer-wise independent noise during transformer hidden state updates, promoting smoother embedding trajectories and reduced attention-sink effects. Experiments demonstrate improved representation learning, with quantitative analysis showing stabilized visual signal transformation in generative VLMs.

vision-language models · stochastic attention · information regularization · object hallucination · visual grounding

TimeSynth: A Temporal Fidelity Framework for Health Signal Digital Twins

arXiv cs.LG · Md Rakibul Haque, Shireen Elhabian, Warren Woodrich Pettine · 2026-07-01

TimeSynth introduces a temporal fidelity framework for health-signal digital twins, addressing the limitations of pointwise metrics in preserving oscillatory, frequency, phase, and state-transition dynamics. The framework includes a physiologically grounded generator producing signals with known dynamics from parametric models fitted to real EEG, ECG, and PPG data, alongside diagnostics quantifying amplitude, frequency, phase, and state-transition fidelity. Evaluations across 11 architectures reveal that models with comparable pointwise error diverge by up to 53° in phase accuracy, equivalent to 123 ms for a 1.2 Hz cardiac rhythm. Architectures with localized temporal structure outperform linear and full-sequence attention models in preserving dynamical fidelity, though none reliably preserve stochastic switching.

temporal fidelity · digital twins · phase accuracy · state-transition dynamics · physiological signals
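
The 123 ms figure comes from converting a phase offset into a time offset. One full cycle (360°) of a 1.2 Hz rhythm lasts 1/1.2 ≈ 0.833 s, so 53° is 53/360 of that period, about 122.7 ms. A small helper makes the conversion explicit; the function name is illustrative.

```python
def phase_error_to_ms(phase_deg, freq_hz):
    """Convert a phase offset in degrees to a time offset in milliseconds."""
    period_s = 1.0 / freq_hz  # duration of one full 360-degree cycle
    return phase_deg / 360.0 * period_s * 1000.0


print(round(phase_error_to_ms(53, 1.2)))  # 123
```

The same phase error costs less time at faster rhythms, because the period shrinks as frequency rises.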

A Mechanistic View of Authority Hierarchy in LLM Sycophancy

arXiv cs.LG · Emil Joswin, Srujananjali Medicherla, Priyanka Mary Mammen · 2026-07-01

The study mechanistically analyzes authority-induced sycophancy in LLMs, demonstrating that models prioritize social cues from authority figures over factual consistency. Using controlled medical QA with authority-attributed hints, the authors test Llama-3.1-8B, Qwen3-8B, and Gemma-2-9B, finding graded response patterns proportional to perceived authority. Logit lens and probing reveal late-layer erasure of correct answer representations, scaling with authority level and resisting intervention, suggesting deep mechanistic overwriting rather than surface bias.

authority bias · logit lens · mechanistic interpretability · sycophancy · knowledge erasure

MindAU: EEG-Conditioned Facial Action Unit Editing via Dual-Stream Manifold Alignment

arXiv cs.LG · Zhenhang Li, Xin Zhou, Hao Deng, Lijun Yin · 2026-07-01

MindAU introduces a unified framework for EEG-conditioned facial action unit (AU) editing, addressing the challenge of grounding noisy EEG signals in identity-preserving expression edits. The method learns noise-robust EEG representations via temporal masked reconstruction and AU classification, then bridges the modality gap using Dual-Stream Manifold Alignment to align EEG features with AU-level text semantics and visual displacement trajectories in Qwen2.5-VL. It incorporates EEG-aware Multimodal Rotary Positional Embeddings, landmark-guided reference masking, and AU-aware region supervision into a multimodal diffusion-based editor. Evaluated on the E-CAFE benchmark, MindAU demonstrates effective high-fidelity editing, advancing assistive expression technologies for facial neuromuscular disorders.

eeg-conditioned editing · action unit · dual-stream manifold alignment · multimodal diffusion · qwen2.5-vl

SAOT: Self-Supervised Continual Graph Learning with Structure-Aware Optimal Transport

arXiv cs.LG · Yuting Zhang, Yanbei Liu, Zhitao Xiao, Lei Geng · 2026-07-01

The authors propose Structure-Aware Optimal Transport (SAOT), a self-supervised continual graph learning framework that preserves global relational structure across sequential tasks. SAOT uses optimal transport theory to capture inter-node correspondences and adds cross-task knowledge distillation to retain structural knowledge from previous tasks. Experiments on four continual graph learning (CGL) benchmarks show that SAOT outperforms existing methods, with accuracy improvements of up to 5% on CoraFull-CL and over 15% on Products-CL in the Class-IL setting.

continual graph learning · optimal transport · self-supervised learning · knowledge distillation · inter-node correspondences
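
The summary does not specify SAOT's transport formulation. As a generic illustration of how optimal transport yields soft inter-node correspondences, the sketch below runs entropic OT (Sinkhorn-Knopp iterations) between two weighted node sets. In a continual-graph setting, the cost matrix might hold distances between node embeddings from consecutive tasks; that is an assumption. This is plain entropic OT, not SAOT's structure-aware objective.

```python
import math


def sinkhorn(cost, a, b, epsilon=0.1, iters=500):
    """Entropic OT plan between histograms a (rows) and b (columns)."""
    n, m = len(a), len(b)
    # Gibbs kernel: cheap pairs get large weights.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Entry `plan[i][j]` can be read as a soft matching weight between node i and node j. A small epsilon gives a nearly one-to-one matching, while a larger epsilon spreads mass across more pairs.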

PRISM: Prioritized Channel Importance with Semi-supervised Domain Adaptation for Cross-Subject EEG Emotion Recognition

arXiv cs.LG · Xin Zhou, Xiang Zhang, Hao Deng, Lijun Yin · 2026-07-01

The paper introduces PRISM, a framework for cross-subject EEG emotion recognition that addresses channel redundancy and inter-subject variability. PRISM combines prioritized channel weighting via a differentiable expert ensemble with semi-supervised domain adaptation using confidence-filtered pseudo-labels for consistency regularization. The method demonstrates superior performance on DEAP, DREAMER, and SEED datasets, achieving robust generalization with limited labeled data.

eeg emotion recognition · cross-subject generalization · channel importance · semi-supervised learning · domain adaptation

Watermarking for Proprietary Dataset Protection

arXiv cs.LG · John Kirchenbauer, Brian R. Bartoldson, Bhavya Kailkhura, Tom Goldstein · 2026-07-01

The paper proposes watermarking as a way to improve training-data membership inference for generative language models, a task that is inherently difficult. Building on prior findings of residual watermark 'radioactivity' when training on partially watermarked datasets, the authors compare watermark-based inference against traditional loss-based methods. Results show that watermarking achieves membership detection performance comparable to traditional approaches under high subset exposure, albeit under a distinct set of assumptions.

watermarking · membership inference · generative models · language modeling · radioactivity

From Spectral Methods to Sample Complexity Bounds for Fourier Neural Operators

arXiv cs.LG · Nisha Chandramoorthy, Daniel Sanz-Alonso, Nathan Waniorek · 2026-07-01

The paper establishes approximation and learning guarantees for Fourier neural operators (FNOs) applied to time-T solution operators of dissipative evolution equations. By introducing classes of evolution operators defined through spectral methods, the authors derive FNO approximation bounds and polynomial sample complexity guarantees. Results hold uniformly for broad families of dissipative equations, including Navier-Stokes, Allen-Cahn, and Cahn-Hilliard, with learning rates depending on input smoothness, domain dimension, and nonlinearity properties. The analysis connects classical spectral approximation theory with modern operator learning, demonstrating efficient learning of nonlinear evolution operators under specified conditions.

fourier neural operators · spectral methods · dissipative equations · sample complexity · nonlinear evolution operators

Generative Modeling of Quantum Distribution with Functional Flow Matching

arXiv cs.LG · Jaehoon Hahm, Tak Hur, Joonseok Lee, Daniel K. Park · 2026-07-01

The authors propose Quantum Flow Matching (QFM), a generative model for learning quantum distributions by combining spin Wigner functions with functional flow matching. QFM converts density matrices into spin Wigner functions and employs functional flow matching to model distributions in function space, enabling accurate learning of multi-qubit quantum states. Evaluations demonstrate QFM's effectiveness in capturing physical quantities like trace, purity, and entanglement entropy, preserving the underlying physics of quantum distributions.

quantum flow matching · spin wigner function · functional flow matching · multi-qubit distributions · entanglement entropy
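
QFM operates on spin Wigner functions, which this sketch does not implement. The sketch only shows, in plain Python, how the three physical quantities used in the evaluation are computed from a two-qubit density matrix: the trace, the purity Tr(ρ²), and the entanglement entropy of one qubit's reduced state. The entropy is a valid entanglement measure for pure states. The helper names are illustrative. The Bell state serves as a standard test case with one bit of entanglement.

```python
import math


def density_matrix(psi):
    """rho = |psi><psi| for a state vector given as a list of amplitudes."""
    psi = [complex(a) for a in psi]
    return [[a * b.conjugate() for b in psi] for a in psi]


def trace(rho):
    return sum(rho[i][i] for i in range(len(rho))).real


def purity(rho):
    """Tr(rho^2) = sum_ij rho_ij * rho_ji; equals 1 for pure states."""
    n = len(rho)
    return sum(rho[i][j] * rho[j][i] for i in range(n) for j in range(n)).real


def reduce_to_first_qubit(rho):
    """Partial trace over the second qubit (basis index = 2*first + second)."""
    return [[rho[2 * r][2 * c] + rho[2 * r + 1][2 * c + 1] for c in range(2)]
            for r in range(2)]


def entanglement_entropy_bits(rho):
    """Von Neumann entropy (bits) of the first qubit's reduced state."""
    red = reduce_to_first_qubit(rho)
    # Closed-form eigenvalues of a 2x2 Hermitian matrix.
    mean = (red[0][0].real + red[1][1].real) / 2
    radius = math.sqrt(((red[0][0].real - red[1][1].real) / 2) ** 2 + abs(red[0][1]) ** 2)
    eigenvalues = (mean + radius, mean - radius)
    return -sum(p * math.log2(p) for p in eigenvalues if p > 1e-12)
```

For the Bell state (|00⟩ + |11⟩)/√2, the trace and purity are both 1 and the entropy is 1 bit. For the product state |00⟩, the entropy is 0.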

EPC: A Standardized Protocol for Measuring Evaluator Preference Dynamics in LLM Agent Systems

arXiv cs.LG · Zewen Liu · 2026-07-01

The paper introduces EPC (Evaluator Preference Coupling), a standardized protocol for measuring evaluator preference dynamics in LLM agent systems, addressing reproducibility and comparability gaps in prior work. The protocol specifies a four-phase isolation paradigm, including executor/evaluator configuration, strategy/task design, TTRL update rules, and metrics (gamma, JSD, ECE, Brier). A versioned Reference Snapshot v1.0 provides coupling measurements for eight evaluator conditions (N=122 repetitions across GPT-4o, Qwen, DeepSeek) with explicit versioning (vX.Y-Z) to track decay. Open-source release includes protocol, snapshot, and implementation code.

evaluator preference coupling · ttrl update rule · reference snapshot · metric computation · versioning convention
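
The summary names ECE, Brier, and JSD among EPC's metrics. It does not define gamma, so gamma is omitted here. The sketch below uses common textbook definitions, assumed rather than taken from the protocol:

- binary Brier score;
- equal-width-bin ECE over predicted positive-class probabilities;
- Jensen-Shannon divergence in bits.

The protocol's exact binning and log base may differ.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Equal-width-bin ECE over predicted positive-class probabilities."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for members in bins:
        if members:
            confidence = sum(p for p, _ in members) / len(members)
            frequency = sum(y for _, y in members) / len(members)
            # Weight each bin's |observed - predicted| gap by its share of samples.
            ece += len(members) / len(probs) * abs(frequency - confidence)
    return ece


def jensen_shannon_divergence(p, q):
    """JSD in bits between two discrete distributions (0 = identical, 1 = disjoint)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```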

Rosetta: Composable Native Multimodal Pretraining

arXiv cs.LG · Xiangyue Liu, Zijian Zhang, Miles Yang, Zhao Zhong · 2026-07-01

Rosetta introduces a composable native multimodal pretraining framework addressing catastrophic forgetting in modality expansion. The method employs modular global shared experts and plug-and-play modality-specific experts, with Momentum-Anchored Orthogonal Projection (MAOP) to neutralize conflicting gradients while preserving synergistic updates. Evaluations show Rosetta outperforms standard Mixture-of-Experts and Mixture-of-Transformers in preserving language/visual understanding and enabling cross-modal synergy, with released code and checkpoints.

multimodal pretraining · catastrophic forgetting · mixture-of-experts · gradient conflict · momentum-anchored orthogonal projection
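
The summary does not give MAOP's exact update. The sketch below shows the general idea in the style of PCGrad-like gradient surgery. When a new-modality gradient conflicts with an anchor direction (negative dot product), its component along the anchor is removed; otherwise it passes through unchanged. Keeping the anchor as an exponential moving average of past gradients is an assumption about what "momentum-anchored" means. Both function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def update_anchor(anchor, grad, beta=0.9):
    """Exponential moving average of gradients (assumed 'momentum anchor')."""
    return [beta * a + (1 - beta) * g for a, g in zip(anchor, grad)]


def project_if_conflicting(grad, anchor):
    """Drop grad's component along anchor only when the two conflict."""
    d = dot(grad, anchor)
    if d >= 0:
        return list(grad)  # synergistic or orthogonal update: keep as-is
    scale = d / dot(anchor, anchor)
    return [g - scale * a for g, a in zip(grad, anchor)]
```

After projection, a conflicting gradient is orthogonal to the anchor. It can no longer undo progress along that direction, but it keeps all of its other components.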

Self-Organized Learning in Oscillatory Neural Networks with Memristive Signed Couplings

arXiv cs.LG · Riley Acker, Aman Desai, Garrett Kenyon, Frank Barrows · 2026-07-01

The article introduces a neuromorphic primitive using memristive edges with inhibitory couplings for autonomous learning in oscillatory neural networks (ONNs). By addressing the implementation challenge of negative weights in ONNs, the proposed design enables persistent anti-phase attractors, expanding the network's computational capabilities beyond synchronous couplings. Circuit simulations validate the system's ability to denoise inputs in an auto-associative task, demonstrating that signed effective weights are essential for maintaining anti-phase attractors autonomously.

oscillatory neural networks · memristive couplings · inhibitory weights · anti-phase attractors · neuromorphic primitive
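
The paper validates its design with circuit simulations of memristive hardware. The phase-oscillator (Kuramoto) model below is only an assumed abstraction of why the sign of a coupling matters. For two identical oscillators with coupling K, the phase difference φ obeys dφ/dt = -2K sin φ. Positive K therefore pulls the pair into synchrony (φ → 0), while negative, inhibitory K makes anti-phase (φ → π) the stable attractor. The simulation uses forward Euler; the step size and duration are illustrative.

```python
import math


def simulate_pair(coupling, phase_gap=0.5, omega=1.0, dt=0.01, steps=5000):
    """Two identical Kuramoto oscillators; returns the final phase
    difference wrapped to [-pi, pi]."""
    theta1, theta2 = 0.0, phase_gap
    for _ in range(steps):
        d1 = omega + coupling * math.sin(theta2 - theta1)
        d2 = omega + coupling * math.sin(theta1 - theta2)
        theta1, theta2 = theta1 + dt * d1, theta2 + dt * d2
    diff = theta2 - theta1
    return math.atan2(math.sin(diff), math.cos(diff))
```

`simulate_pair(1.0)` ends near 0 (synchrony). `simulate_pair(-1.0)` ends near ±π (anti-phase), which is the attractor that signed couplings make available.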

Understanding Guest Preferences and Optimizing Two-sided Marketplaces: Airbnb as an Example

arXiv cs.LG · Yufei Wu, Daniel Schmierer · 2026-07-01

The study develops a framework for optimizing two-sided marketplaces by analyzing guest booking behaviors in Airbnb. Using economic modeling and causal inference techniques, it examines price sensitivity variations across guest segments and listings. Results inform personalized recommendation systems and dynamic pricing tools that improve host-guest matching, demonstrating how heterogeneous preferences can be leveraged to balance supply-demand dynamics in peer-to-peer accommodation platforms.

causal inference · economic modeling · price sensitivity · personalized recommendation · supply-demand optimization

Computer vision-based neural networks for radioisotope identification in urban environments

arXiv cs.LG · Masen Bachleda, Peter Lalor · 2026-06-30

The authors propose a computer vision approach for radioisotope identification in urban environments by converting gamma-ray data into 2D waterfall spectrograms with temporal channels. They evaluate MLP, CNN, and ViT architectures on the RADAI benchmark, treating spectrograms as multi-channel images to capture spectral-temporal patterns. At <1 false alarm/hour, their CNN achieves superior detection (0.4334), classification (0.3965), and identification (0.2950) rates versus NMF baselines, though NMF remains stronger under stricter false positive constraints.

radioisotope identification · waterfall spectrograms · temporal channels · radai benchmark · false positive rate
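
A waterfall spectrogram is a 2D histogram of gamma-ray counts, with time on one axis and energy on the other. The sketch below bins a list of (time, energy) events into such a grid. The bin counts, the keV units, and the event format are assumptions. The summary does not specify how the paper splits the time axis into temporal channels, so that step is left out.

```python
def waterfall(events, t_start, t_end, n_time_bins, e_max_kev, n_energy_bins):
    """Bin (time_s, energy_keV) events into a [time][energy] count grid."""
    grid = [[0] * n_energy_bins for _ in range(n_time_bins)]
    dt = (t_end - t_start) / n_time_bins
    de = e_max_kev / n_energy_bins
    for t, e in events:
        # Events outside the time window or energy range are dropped.
        if t_start <= t < t_end and 0 <= e < e_max_kev:
            row = min(int((t - t_start) / dt), n_time_bins - 1)
            col = min(int(e / de), n_energy_bins - 1)
            grid[row][col] += 1
    return grid
```

Each row is the energy spectrum for one time slice. Stacking the rows over time gives the image-like input that the MLP, CNN, and ViT models treat as a picture.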

Learning dynamical systems from noisy data with Weak-form Kernel Ridge Regression

arXiv cs.LG · Max Kreider, John Harlim, Daning Huang · 2026-06-30

The authors propose Weak-form Kernel Ridge Regression (WKRR), a method for learning dynamical systems from noisy data by combining kernel ridge regression with a weak-form formulation that filters noise. The weak-form framework's noise-robust properties are analyzed through bias-variance decomposition. WKRR demonstrates superior performance over baseline methods on chaotic systems (up to 64D) and high-dimensional fluid dynamics data (15,000D), while remaining simple to implement and effective for both clean and noisy datasets.

weak-form formulation · kernel ridge regression · dynamical systems · noise robustness · bias-variance decomposition

Distributionally Robust Linear Regression With Block Lewis Weights

arXiv cs.LG · Naren Sarayu Manoj, Kumar Kshitij Patel · 2026-06-30

The authors propose an algorithm for group distributionally robust (GDR) least squares, achieving a $(1+\varepsilon)$-optimal solution via $\widetilde{O}(\min\{\mathsf{rank}(\mathbf{A}),m\}^{1/3}\varepsilon^{-2/3})$ linear-system solves of matrices $\mathbf{A}^{\top}\mathbf{B}\mathbf{A}$ with block-diagonal $\mathbf{B}$. The method leverages block Lewis weights to connect GDR to a tailored least squares problem, employing accelerated proximal methods. This approach outperforms interior point methods for moderate accuracy and matches state-of-the-art $\ell_{\infty}$ regression guarantees. Additionally, algorithms are provided to interpolate between average and robust loss minimization.

group distributionally robust · least squares · block lewis weights · accelerated proximal methods · linear-system solves

Device Passport: Enabling Spatio-Temporal Pretrained Models to Generalize Across Input Layouts

arXiv cs.LG · Geeling Chau, Ran Liu, Juri Minxha, Wenhui Cui · 2026-06-30

The paper introduces Device Passport, a channel embedding technique for improving cross-layout generalization in biosignal foundation models. The method learns expert mixtures that incorporate both functional activity and metadata, contrasting with prior approaches using only functional or metadata-based embeddings. Evaluations on controlled subset-transfer and ear-EEG scenarios show Device Passport matches or outperforms learned baselines, particularly in challenging layout-transfer regimes, highlighting channel embedding design as crucial for pretrained model reuse.

biosignal foundation models · channel embedding · cross-layout transfer · device passport · ear-eeg

Leveraging Multimodality for Real-Time Classification of Transients and Variables found by the Zwicky Transient Facility

arXiv cs.LG · Ved G. Shah, Nabeel Rehemtulla, Adam A. Miller, Sushant Sharma Chaudhary · 2026-06-30

ORACLE-2 models improve real-time hierarchical classification of astronomical transients by leveraging multimodality, combining light curves, metadata, and images. Deployed on the Zwicky Transient Facility alert stream, ORACLE-2 Omni achieves a macro F1 score of 0.73, outperforming light-curve-only models by up to 40% and light-curve+metadata models by up to 11%, with significant gains at early times. A variant trained on the simulated ELAsTiCC dataset achieves a macro F1 score of 0.88, matching state-of-the-art performance. The study quantifies performance-throughput trade-offs, demonstrating the practicality of multimodal approaches for high-volume alert streams in current and future surveys like the Legacy Survey of Space and Time.

multimodality · light curves · macro f1 score · hierarchical classification · alert stream

Sample Complexities of Estimating Gumbel--Max Watermark Proportions with and without Reduction to Pivotal Statistics

arXiv cs.LG · Shuwen Chai, Qiaosen Wang · 2026-06-30

The paper addresses watermark proportion estimation in LLM-generated text under the Gumbel-max watermarking mechanism, comparing a full-observation regime with a pivotal-reduction regime. For pivotal reduction, it proposes a Laguerre-polynomial estimator with matching information-theoretic lower bounds, while for full observation, an event-counting estimator achieves better sample complexity. The results show that pivotal reduction, though widely adopted, is not always sample-efficient for watermark proportion estimation.

gumbel-max watermarking · sample complexity · pivotal reduction · laguerre-polynomial estimator · information-theoretic bounds
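
For orientation, the sketch below shows the Gumbel-max watermark mechanism the paper analyzes, in its widely used exponential-minimum form. A secret key and the preceding context seed one uniform r_i per vocabulary item. The generator emits argmax_i r_i^(1/p_i), which is equivalent to argmax_i (log p_i + Gumbel noise), so tokens are still drawn from p. The pivotal statistic is the uniform of the emitted token. It is Uniform(0, 1) for text generated independently of the key, but skews high for watermarked text. Key and context handling are simplified assumptions, and the paper's estimators are not implemented.

```python
import math
import random


def keyed_uniforms(key, context, vocab_size):
    """Pseudorandom uniforms in (0, 1], reproducible from (key, context)."""
    rng = random.Random(f"{key}|{context}")
    return [1.0 - rng.random() for _ in range(vocab_size)]


def watermarked_token(probs, key, context):
    """Pick argmax_i r_i ** (1 / p_i); maximizing log(r_i) / p_i is equivalent."""
    r = keyed_uniforms(key, context, len(probs))
    candidates = [i for i, p in enumerate(probs) if p > 0]
    return max(candidates, key=lambda i: math.log(r[i]) / probs[i])


def pivotal_statistic(token, key, context, vocab_size):
    """Uniform assigned to the emitted token; Uniform(0, 1) under the null."""
    return keyed_uniforms(key, context, vocab_size)[token]
```

Over many contexts, token frequencies match `probs`. The pivotal statistics of watermarked tokens, however, average well above the null mean of 0.5. That shift is the signal that proportion estimators exploit.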

StateFlow: Dual-State Recurrent Modeling for Long-Horizon Time Series Forecasting

arXiv cs.LG · Haroon Gharwi, Yue Dai, Kai Shu · 2026-06-30

StateFlow introduces a dual-state recurrent forecasting framework for long-horizon multivariate time series forecasting (LTSF), extending the Variability-Aware Recursive Neural Network (VARNN) to multi-step prediction. The method captures primary temporal dynamics via a hidden-state trajectory and structured local deviations via a residual-memory trajectory, both derived from the lookback sequence. A chunk-based decoder summarizes these trajectories for direct multi-step forecasting, supported by a two-stage optimization strategy. Experiments on LTSF benchmarks demonstrate competitive performance against linear, recurrent, convolutional, and Transformer-based baselines, maintaining a compact design.

long-horizon forecasting · dual-state recurrent modeling · variability-aware recursive neural network · chunk-based decoder · two-stage optimization

TRIE: An Evaluation Framework for Stochastic PDE Surrogates

arXiv cs.LG · Bharat Srikishan, Javier E. Santos, Nikhil Muralidhar, Charles D. Young · 2026-06-30

The authors introduce TRIE, an evaluation framework for stochastic PDE surrogates that assesses invariant measure reproduction, predictive uncertainty calibration, and probabilistic generation efficiency. The method is demonstrated on stochastic Kuramoto-Sivashinsky and Kolmogorov flow systems across 11 parameter values, comparing pointwise-trained neural surrogates, approximate uncertainty methods (Monte Carlo dropout, heteroscedastic Gaussian likelihoods), and generative models. Results show generative models outperform in statistical fidelity and CRPS, with latent variants achieving 12× faster inference while maintaining accuracy. Code and data are released for reproducible benchmarking.

stochastic pde · invariant measure · probabilistic forecasting · generative models · kolmogorov flow
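
CRPS, one of TRIE's reported metrics, can be estimated from an ensemble of surrogate samples using the standard identity CRPS = E|X - y| - ½E|X - X'|. Here X and X' are independent draws from the forecast and y is the observation. The sketch below applies the identity to a scalar quantity. Applying it per grid point and averaging, as one would for a PDE field, is an assumption about usage, not TRIE's documented procedure.

```python
def crps_ensemble(samples, observation):
    """Ensemble CRPS: mean |X - y| minus half the mean |X - X'| over all pairs."""
    n = len(samples)
    spread_to_obs = sum(abs(x - observation) for x in samples) / n
    spread_within = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return spread_to_obs - 0.5 * spread_within
```

With a single sample, the formula reduces to the absolute error. This version is the plain estimator; a "fair" variant divides the pair term by n(n-1) instead of n².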

Steal the Patch Size: Adversarially Manipulate Vision-Language Models

arXiv cs.LG · Kai Hu, Akash Bharadwaj, Weichen Yu, Matt Fredrikson · 2026-06-30

The paper introduces a black-box model-stealing attack that extracts private vision-tokenizer configurations, including patch size and preprocessing details, from deployed vision-language models (VLMs). The method exploits a side channel in ViT-style patchification: synthetic grid images aligned with the hidden patch grid cause periodic accuracy drops, so sweeping the grid cell size reveals the patch size. Additional tests with padding distinguish dynamic-resolution from fixed-resolution preprocessing. Evaluations on Qwen-VL variants, GPT, and Claude show reliable parameter recovery, which in turn enables preprocessing-aware transfer attacks and adversarial manipulation.

vision-language models · model-stealing attack · patch size · black-box attack · adversarial manipulation

TallyTrain: Communication-Efficient Federated Distillation

arXiv cs.LG · Radhakrishna Achanta, Will Reed · 2026-06-30

TallyTrain introduces a communication-efficient federated distillation protocol that addresses bandwidth constraints in federated learning along two axes: model size and class count. To compress class-count communication, each peer transmits only the argmax class index for each probe, which reduces the payload to ⌈log₂C⌉ bits per probe while maintaining performance. Majority voting filters noise from under-trained peers, which lets the method outperform soft-label distillation under non-IID conditions. Combined with sparse parameter merges, TallyTrain Pareto-dominates the FedAvg, FedProx, and FedDF baselines, reducing communication by up to three orders of magnitude across standard benchmarks.

federated distillation · argmax class index · non-iid training · majority voting · sparse parameter merges
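
The ⌈log₂C⌉ figure is the cost of naming one of C classes. For C = 100 it is 7 bits per probe, versus 3,200 bits for a full soft-label vector of 100 float32 values. The sketch below computes that count and applies a simple plurality vote over peers' argmax labels. The tie-breaking rule and the float32 soft-label comparison are assumptions for illustration, not details from the paper.

```python
from collections import Counter


def bits_per_probe(num_classes):
    """Bits needed to send one class index: ceil(log2(C)), computed exactly."""
    return (num_classes - 1).bit_length()


def soft_label_bits(num_classes, bits_per_float=32):
    """Cost of sending a full probability vector instead (assumed float32)."""
    return num_classes * bits_per_float


def majority_label(votes):
    """Plurality vote over peers' argmax indices; ties go to the smallest index."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)
```

For example, `majority_label([3, 3, 1, 2, 3])` returns 3. A single under-trained peer voting for another class cannot flip the result.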

Verifiable Rewards for Calibrated Probabilistic Forecasting

arXiv cs.LG · Sadanand Singh, Allam Reddy, Manan Chopra · 2026-06-30

The paper introduces a verifiable, label-free reward mechanism for training calibrated probabilistic forecasters in aleatoric settings, addressing the degradation of calibration in reinforcement learning with proper scoring rules. The method employs a state-conditioned empirical win rate estimated from past outcomes to eliminate label noise and prevents gradient corruption by either direct prediction or gradient masking. Evaluated on NFL in-game win probability forecasting, a 7B parameter model trained solely with this reward achieves calibration comparable to betting markets, outperforming zero-shot frontier models while preserving reasoning integrity through gradient masking.

probabilistic forecasting · calibration · reinforcement learning · aleatoric uncertainty · gradient masking

FRAME: Learning the Adaptation Domain with a Mixture of Fractional-Fourier Experts

arXiv cs.LG · Tom Saliencro, Maya Lindqvist, Rohan Desai, Priya Nair · 2026-06-30

The paper introduces Fractional-Fourier Mixture of Experts (FRAME), a parameter-efficient fine-tuning method that learns the optimal adaptation domain for each expert in a mixture-of-experts framework. FRAME employs learnable fractional-Fourier orders to continuously interpolate between spatial and Fourier domains, enabling compact low-rank updates while reducing expert interference through mutual incoherence. The method adds minimal computational overhead via an O(d log d) chirp-FFT surrogate and trains domain orders with a separate optimizer. Evaluations on LLaMA-3.1-8B and Qwen2.5-7B across commonsense, mathematical, code, and knowledge benchmarks show improvements over MoE-LoRA and spectral baselines (FlyLoRA, FourierMoE, HMoRA) while maintaining low active parameters.

parameter-efficient fine-tuning · fractional-fourier transform · mixture-of-experts · low-rank adaptation · spectral methods

CogTax: A Four-Level Cognitive Taxonomy for Command-Line Computing Education

arXiv cs.LG · Manuel Alonso-Carracedo, Ruben Fernandez-Boullon, Pedro Celard, Francisco J. Rodriguez-Martinez · 2026-06-30

CogTax introduces a four-level cognitive taxonomy for command-line computing education, integrating cognitive complexity from Bloom's Revised Taxonomy and operational impact (observational, reversible, structural, administrative). This framework enables instructors to sequence material and calibrate assessments while aiding student self-assessment. A classifier combining syntactic representations from abstract syntax trees and semantic embeddings automatically assigns taxonomy levels, achieving 89% accuracy on 585 expert-annotated Linux/bash commands and demonstrating cross-language extensibility.

cognitive taxonomy · command-line computing · abstract syntax trees · semantic embeddings · operational impact

A Filtered Mixture-of-Generators for Fully Synthetic Survival Training

arXiv cs.LG · Niccolò Maria Rizzi, Eugenio Lomurno, Alberto Archetti, Matteo Matteucci · 2026-06-30

FoGS introduces a filtered mixture-of-generators approach for synthetic survival data generation, addressing data scarcity in clinical settings. The method constructs a candidate pool from four distinct tabular generators, scores samples via an ensemble of seven survival models, and optimizes selection policies through a two-level pipeline. Evaluated on 16 public datasets under train-on-synthetic, test-on-real conditions, FoGS improves mean C-index by +2.17 and IBS by +0.67, matching or exceeding real-data performance on most cohorts without compromising privacy.

survival analysis · tabular generative models · proper scoring rules · mixture-of-generators · privacy-preserving
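
FoGS reports its gains in terms of the C-index. The sketch below computes Harrell's concordance index for readers unfamiliar with it. A pair is comparable when the subject with the earlier time had an observed event. The index is the fraction of comparable pairs in which the model assigns higher risk to the subject who failed first, with tied scores counting half. For simplicity, ties in time are skipped. The paper may use a different estimator, for example one adjusted for censoring.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(len(times)):
            if times[j] > times[i]:  # j outlived i's observed event
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable
```

A value of 1.0 means a perfect risk ordering, 0.5 matches random ranking, and 0.0 means a fully reversed ordering.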

SemiScope: Disentangling Classifier Tuning and Joint Optimization in Semi-Supervised Security Classification

arXiv cs.LG · Rui Shu, Tianpei Xia, Jingzhu He · 2026-06-30

The study introduces SemiScope, a protocol for disentangling classifier tuning and joint optimization effects in semi-supervised learning (SSL) for security classification. Using Bayesian Optimization, SemiScope jointly tunes SSL parameters, confidence filtering, oversampling, and classifiers, comparing against a control (Tuned-Clf) with fixed SSL defaults. Results show SemiScope outperforms default SSL baselines by 0.7-12.7 g-measure points, while classifier hyperparameter optimization alone recovers 86% of gains over default methods. The key finding is that tuning the classifier and decision threshold suffices, matching supervised Random Forest performance at 20-40% label rates.

semi-supervised learning · bayesian optimization · security classification · hyperparameter tuning · pseudo-labeling
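
SemiScope reports its results in g-measure points. In the software-analytics literature, the g-measure is usually defined as the harmonic mean of recall (pd) and specificity (1 - pf, where pf is the false-alarm rate). The sketch below assumes that definition, which the summary does not state. It reports values on a 0-1 scale; multiply by 100 to get points.

```python
def g_measure(tp, fp, tn, fn):
    """Harmonic mean of recall (pd) and specificity (1 - pf), on a 0-1 scale."""
    pd = tp / (tp + fn)           # probability of detection (recall)
    specificity = tn / (tn + fp)  # equals 1 - pf, the false-alarm rate
    if pd + specificity == 0:
        return 0.0
    return 2 * pd * specificity / (pd + specificity)
```

Because it is a harmonic mean, the g-measure stays low unless the classifier both catches positives and avoids false alarms. This helps explain why tuning the decision threshold alone can move it substantially.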

Representation as a Bottleneck for Mechanistic Interpretability: The Manifestation Unit Protocol

arXiv cs.LG · Hussein Chouman, Wataru Sasaki, Tomokazu Matsui, Hirohiko Suwa · 2026-06-30

The paper introduces Manifestation Units, a typed tuple protocol (E, S, R, D, G, T) for organizing component-level analyses in mechanistic interpretability into structured, queryable representations. The protocol is evaluated across generative vision (beta-VAE), discriminative vision (CNN), and language (GPT-2) models, demonstrating that typed structure improves retrieval performance over unstructured baselines and satisfies causal criteria. Key findings include schema compatibility with attention-head primitives, recovery of known circuit members, and identification of a minimal two-field core (S+R). The work positions the protocol as infrastructure for reusable interpretability analyses.

mechanistic interpretability · manifestation units · typed tuple protocol · causal sufficiency · retrieval performance

Interface-Aware Neural Newton Preconditioning for Robust Cohesive Zone Model Simulations

arXiv cs.LG · Zhangyong Liang, Huanhuan Gao · 2026-06-30

The paper introduces Interface-Aware Neural Newton Preconditioning (IA-NNP), a learned method to improve convergence in Cohesive Zone Model (CZM) simulations suffering from negative interface tangents and Newton-basin mismatch. IA-NNP generalizes manual Newton-Raphson modifications into state-dependent interface corrections, operating only on active interface variables while preserving original traction-separation laws and residual assembly. Two variants are proposed: IA-NNP-Init for initial-guess lifting and IA-NNP-NL for nonlinear right preconditioning. Evaluations on horizontal, circular, and multi-interface benchmarks demonstrate enhanced convergence, better branch recovery, and fewer failures compared to standard and manual NR methods, without altering force-displacement responses.

cohesive zone models · neural newton preconditioning · interface fracture · nonlinear preconditioning · finite element analysis

Beyond the Expressivity-Trainability Paradox: A Dynamical Lie Algebra Perspective on Navigating Barren Plateaus in Quantum Machine Learning

arXiv cs.LG · Kung-Ming Lan, Edward Huang · 2026-06-30

The study addresses the expressivity-trainability paradox in Quantum Machine Learning (QML), demonstrating that the high Hilbert space capacity of Parameterized Quantum Circuits (PQCs) leads to Barren Plateaus (BPs) with exponentially flat gradients. By integrating Dynamical Lie Algebras (DLAs) and Geometric QML, the authors propose a framework linking algebraic dimension to optimization dynamics, validated on a non-linear binary classification task. Their symmetry-preserving approach restricts DLA growth polynomially, ensuring scalable training landscapes while sacrificing raw memorization capacity, offering a 'Trainability-by-Design' roadmap for quantum neural networks.

quantum machine learning · barren plateaus · dynamical lie algebras · parameterized quantum circuits · geometric qml

📰 Industry Media (2)

Achieving operational excellence with AI

MIT Tech Review — AI · MIT Technology Review Insights · 2026-07-02

The article examines the integration of AI into established process excellence frameworks like Lean Six Sigma and Business Process Management (BPM), highlighting a projected $113 billion market for AI-powered process optimization. It emphasizes that organizations with mature process disciplines achieve better AI outcomes due to pre-existing data-driven cultures. Survey data indicates 88% of business leaders plan increased investments in AI-infused process intelligence within 12-18 months, though success depends on coupling AI with robust operational foundations.

lean six sigma · business process management · process optimization · data-driven decision-making · process intelligence

Teaching AI to run with the turbines

MIT Tech Review — AI · MIT Technology Review Insights · 2026-07-02

Woodside Energy demonstrates industrial AI deployment through predictive analytics and agentic systems in energy operations, focusing on safety-critical workflows like LNG plant startups and maintenance optimization. The method involves long-term investment in governed data infrastructure (time-series data lakes, SAP integration) and human-AI collaboration frameworks. Results include 15% maintenance hour reduction in pilot assets and operational decision support via systems like Maintenance Intelligence, enabled by correlating equipment performance with maintenance records.

predictive analytics · agentic ai · time-series data · maintenance optimization · operational decision support


Generated automatically at 2026-07-02 20:41 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.