Daily Digest — 2026-05-31

Saturday, May 30, 2026 · 149 items · model: deepseek/deepseek-chat

149 items · 144 arxiv papers, 5 industry media

🏛️ Research Labs

No new items today.

📜 arXiv Papers (144)

DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation

arXiv cs.LG · Jusuk Lee, Seungjae Lee, Jonghun Shin, Hoseong Jung · 2026-05-28

DynaFLIP introduces a dynamics-aware multimodal pre-training framework for robotics perception, addressing the limitation of static visual encoders in manipulation tasks. The method constructs image-language-3D flow triplets from heterogeneous videos, training an image encoder via simplex-volume minimization in a shared hyperspherical space, combined with cosine regularization and contrastive objectives. Evaluations demonstrate +22.5% improvement in out-of-distribution scenarios, with the encoder focusing on action-relevant regions and enhancing downstream policy performance across simulated and real-world setups.

multimodal pre-trainingdynamics-aware representationsimplex-volume minimizationrobotics perceptioncontrastive learning

Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching

arXiv cs.LG · Alaa Khamis, Alaa Maalouf · 2026-05-28

HullFT introduces a geometric approach to test-time finetuning (TTFT) for language models, addressing speed-quality tradeoffs in retrieval and adaptation. The method represents query embeddings as sparse convex combinations of training sequences via Frank-Wolfe optimization, ensuring relevance and diversity. It converts fractional weights into exact integer multiplicities for finetuning, enabling Gradient Reuse to amortize computation across repeated steps. Experiments demonstrate HullFT achieves lower bits-per-byte at significantly reduced runtime compared to state-of-the-art TTFT methods.

test-time finetuningfrank-wolfe optimizationgradient reuseconvex combinationbits-per-byte

Fairness-Aware Federated Learning with Trajectory Shapley Value

arXiv cs.LG · Daniel Kuznetsov, Ziqi Wang · 2026-05-28

The paper introduces Trajectory Shapley Value (TSV), a contribution metric for fairness-aware federated learning that evaluates client influence on the global model's optimization trajectory using a validation-based, temporally consistent utility. Building on TSV, FedTSV is proposed as an adaptive aggregation method that dynamically adjusts client weights based on per-round evaluations, addressing heterogeneous and adversarial participation in real time. Experiments on benchmark datasets demonstrate that FedTSV accelerates convergence, enhances robustness, and provides more equitable contribution assessments, offering a principled foundation for fairness-aware federated optimization.

federated learningtrajectory shapley valueadaptive aggregationfairness-aware optimizationcontribution metric

When, why, and how do diffusion posterior samplers fail? A finite-sample lens

arXiv cs.LG · Benjamin A. Burns, Sara Fridovich-Keil · 2026-05-28

We introduce a finite-sample perspective on diffusion posterior sampling that approximates the posterior to arbitrary precision as training set size approaches infinity, applicable to any forward model and prior distribution. This framework reveals that popular likelihood approximations in intermediate timesteps often misestimate posterior spread, leading to downstream failures including sensitivity to early stopping, inaccurate mode weighting, and hallucination of unsupported modes. Analysis shows these errors can arise solely from multimodal priors and inaccurate posterior spread estimates, independent of measurement model nonlinearity or posterior multimodality. The method serves as a diagnostic tool for evaluating posterior sampler accuracy and failure modes.

diffusion posterior samplingfinite-sample perspectivelikelihood approximationposterior spreadmultimodal prior

SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

arXiv cs.LG · Sy-Tuyen Ho, Minghui Liu, Huy Nghiem, Furong Huang · 2026-05-28

SoundnessBench evaluates LLMs' ability to assess methodological soundness in research proposals, using 1,099 curated machine-learning proposals from ICLR submissions with reviewer soundness sub-scores. The benchmark tests 12 frontier LLMs, revealing a pervasive optimism bias where models frequently misclassify low-soundness proposals as sound under standard prompting. Aggressive prompting shifts errors toward false negatives. Controls for contamination, surface features, and audit quality confirm the bias persists without single confounding factors, indicating current LLMs are unreliable as standalone evaluators of scientific rigor.

soundnessbenchoptimism biasmethodological viabilityfalse positivespeer review

Resolution Diagnostics for Paired LLM Evaluation

arXiv cs.LG · Anany Kotawala · 2026-05-28

The study introduces resolution diagnostics for paired LLM evaluation, framing it as a hypothesis-testing problem and proposing a per-pair resolution ratio q = N/N* as the primary diagnostic. Using level-alpha, power-(1-beta) tests, the authors analyze two public LLM leaderboards (Open LLM Leaderboard v1 and MMLU-Pro), finding that 11/40 and 4-6/9 pairwise comparisons, respectively, fail to meet conventional resolution targets. The work demonstrates that the widely-used unpaired Cohen-h-plus-(1-rho) shortcut deviates from correct N* by approximately a factor of two in close-comparison regimes, a deficit inherited by three of five off-the-shelf calculators.

paired evaluationresolution diagnosticshypothesis-testingcohen-hmmlu-pro

Leave a Window Out: Modifying the Jackknife for Predictive Inference in Time Series

arXiv cs.LG · Hanyang Jiang, Rina Foygel Barber, Ashwin Pananjady, Yao Xie · 2026-05-28

The paper introduces a modified leave-one-out jackknife method, termed leave-a-window-out (LWO), for predictive inference in time series where exchangeability assumptions fail. Addressing coverage loss in vanilla jackknife under temporal dependence, LWO leverages stability properties of model-fitting procedures to achieve valid coverage. Theoretical analysis quantifies deviations from cyclic exchangeability using novel coefficients. Empirical results show LWO maintains coverage where traditional methods fail, while producing narrower intervals than split conformal prediction.

conformal predictionjackknifetime seriespredictive inferencecyclic exchangeability

Statistical Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular Datasets

arXiv cs.LG · M. Ross Kunz, John Merickel, Keith Wilson · 2026-05-28

The authors propose a method for embedding numeric tabular datasets into a shared vector space to enable similarity measurement and interpretable alignment across heterogeneous feature spaces. Their approach uses structured exploratory data analysis descriptors, embeds them via a pretrained sentence transformer, and quantifies similarity through Canonical Correlation Analysis (CCA), with a penalized variant for sparse, interpretable variable-level correspondences. Evaluation on 15 datasets shows a P@1 score of 0.9, robust nearest-neighbor retrieval, and cluster structure preservation under embedding ablations and differential privacy. The framework supports retrieval-augmented generation pipelines and data-driven algorithm selection.

tabular data embeddingcanonical correlation analysisdifferential privacyretrieval-augmented generationinterpretable alignment

Neural Operator-Based Surrogate Model for CFD:Helical Coil Steam Generator in Small Modular Reactor

arXiv cs.LG · Minseo Lee, Seongmin Oh, Chaehyeon Song, Bumjin Cho · 2026-05-28

The study introduces a neural operator-based surrogate modeling framework for real-time CFD simulation of helical coil steam generators in small modular reactors, addressing computational bottlenecks in digital twin applications. It compares two reduced-order modeling strategies—MLP-based autoencoder for unstructured mesh data and convolutional autoencoder for structured mesh data—each integrated with DeepONet (L-DeepONet) and Fourier neural operator (FNO), enhanced by multi-scale techniques to mitigate spectral bias. Results show L-DeepONet captures transient vortex dynamics, while FNO predicts time-averaged flow, offering complementary guidelines for model selection based on data type and resolution requirements.

neural operatorcomputational fluid dynamicsreduced-order modeldigital twinspectral bias

Digitally enriching a screening population for pancreatic cancer using routine blood-based measures and clinical histories

arXiv cs.LG · Chris Varghese, Leo Y. Li-Han, Richa Bisht, Ellen Larson · 2026-05-28

The study presents a Transformer-based neural network with multi-head attention for pancreatic cancer risk prediction using longitudinal clinical data. The model processes coded diagnoses and blood test sequences from 6,017 cases and 177,081 controls (median 12-year history) to stratify populations for targeted screening. External validation achieved AUROCs of 0.837 (1-year), 0.797 (2-year), and 0.760 (3-year prediction), with calibration slope 1.08 and Brier score 0.025. A >3.3% 1-year risk threshold yielded diagnostic odds ratio 18.2, demonstrating potential for population-level screening enrichment.

transformermulti-head attentionrisk stratificationlongitudinal dataauroc

Wasserstein Contraction of Coordinate Ascent Variational Inference

arXiv cs.LG · Rocco Caprio, Adrien Corenflos, Sam Power · 2026-05-28

The authors establish Wasserstein contraction properties for coordinate ascent variational inference (CAVI) under transport-information inequalities and functional smoothness conditions. Their theoretical framework provides general and sharp convergence guarantees, applicable to both smooth manifolds and certain non-smooth spaces. The analysis yields local convergence results, demonstrating CAVI's effectiveness in high-dimensional Bayesian inference tasks. Applications include Bayesian Gaussian Mixture Models, high-dimensional Bayesian Probit Regression, and Logistic Regression with Pólya-Gamma random variables (Jaakkola-Jordan's algorithm).

wasserstein distancevariational inferencetransport-information inequalitybayesian inferencecoordinate ascent

OOD-GraphLLM: Graph Large Language Model for Out-of-Distribution Generalized Drug Synergy Prediction

arXiv cs.LG · Xin Wang, Linxin Xiao, Yang Yao, Wenwu Zhu · 2026-05-28

The paper introduces OOD-GraphLLM, a graph large language model for out-of-distribution (O.O.D.) generalized drug synergy prediction (DSP). The method jointly optimizes molecular graph representations and biomedical semantic language representations, addressing challenges in structural relevance, optimal graph neural architectures, and semantic-structural alignment. It fine-tunes DrugSyn-LLM with retrieval-augmented biomedical instruction tuning to align topological and semantic information. The model and source code are publicly available, including a web interface for interactive use.

drug synergy predictionout-of-distribution generalizationgraph large language modelbiomedical instruction tuningmolecular representation

GRASP: Plan-Guided Graph Retrieval with Adaptive Fusion and Reranking on Semi-Structured Knowledge Bases

arXiv cs.LG · Yicheng Tao, Yiqun Wang, Xiangchen Song, Xin Luo · 2026-05-28

GRASP introduces a three-stage framework for semi-structured knowledge base (SKB) retrieval, combining plan-guided graph retrieval, adaptive fusion with dense retrieval, and fine-tuned reranking. The method unifies structural and textual information through plan-conditioned fusion, outperforming existing hybrid approaches. On STaRK benchmarks, GRASP improves average Hit@1 from 62.0 to 73.9, demonstrating state-of-the-art performance. Ablation studies confirm its robustness and effectiveness in handling SKBs for applications like product search and precision medicine.

semi-structured knowledge basesgraph retrievaladaptive fusiondense retrieverreranking

How's it going? Reinforcement learning in language models recruits a functional welfare axis

arXiv cs.LG · Andy Q Han, David J. Chalmers, Pavel Izmailov · 2026-05-28

The study demonstrates that reinforcement learning (RL) in language models recruits a pre-existing functional welfare axis rather than creating it de novo. Using a semantically neutral maze environment, researchers trained multiple models (varying RL algorithms, model families, and fine-tuning approaches) and extracted concept vectors for rewarded/punished trajectories. These vectors exhibited mirror-image behaviors: punishment vectors aligned with negative emotion concepts and failure tokens, while reward vectors showed opposite patterns. Crucially, the welfare axis was detectable in pretrained models before maze training, persisting across controls for tile-reward mappings and supervised fine-tuning. Findings suggest minimal reward signals can broadly influence behavior via pre-existing representations.

reinforcement learningconcept vectorsfunctional welfare axislanguage modelsinterpretability

Anti Mode-Collapse in Mean-Field Transformer via Auxiliary Variables

arXiv cs.LG · Masaaki Imaizumi, Masanori Koyama, Noboru Isobe, Kohei Hayashi · 2026-05-28

The study demonstrates that auxiliary variables, such as positional encoding and fixed prompt insertion, prevent mode collapse in mean-field transformer models by maintaining token distribution diversity during long inferences. Using a mean-field-based transformer framework, the authors theoretically analyze how these auxiliary variables counteract the degeneration of token distributions to a single point, instead characterizing the energy-maximizing distribution as a pushforward of the auxiliary variable distribution. Results show that positional encoding and prompt insertion achieve universality of representation in the limit, enabling exact representation of a wide class of distributions. Mathematical experiments validate the theoretical findings.

mean-field transformermode collapsepositional encodingauxiliary variablespushforward distribution

ExDBSCAN: Explaining DBSCAN with Counterfactual Reasoning -- Additional Material

arXiv cs.LG · Pernille Matthews, Lena Krieger, Tommaso Amico, Artur Zimek · 2026-05-28

The paper introduces ExDBSCAN, a post-hoc explanation method for DBSCAN clustering that provides counterfactual explanations with theoretical validity guarantees. It addresses the interpretability gap in density-based clustering by generating diverse, proximal counterfactuals using a density-aware, physics-inspired model. Empirical evaluation on 30 tabular datasets demonstrates ExDBSCAN's superiority over four baselines, achieving perfect validity and producing actionable explanations.

dbscancounterfactual explanationsdensity-based clusteringinterpretabilitypost-hoc analysis

TriSearch: Learning to Optimize Triangulations via Bistellar Flips

arXiv cs.LG · Yiran Wang, Guido Montúfar · 2026-05-28

TriSearch introduces a reinforcement learning framework for optimizing triangulation objectives via bistellar flips, leveraging a circuit-supported subtriangulation action representation. This method encodes feasible flips by their supporting circuit and local subtriangulation, enabling policy ranking using local geometric and combinatorial features. TriSearch achieves dimension-agnostic traversal of the flip graph without full triangulation space enumeration. Evaluated in 3D and 4D, it generalizes zero-shot from small to larger polytopes, outperforming existing methods on metric objectives in 3D and discovering more distinct Fine, Regular, and Star triangulations of reflexive polytopes in 4D under fixed budgets.

bistellar flipstriangulation optimizationreinforcement learningcircuit-supported representationreflexive polytopes

MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference

arXiv cs.LG · Kexin Chu, Yang Zhou, Wei Zhang · 2026-05-28

MarginGate introduces sparse margin-triggered verification for batch-invariant LLM inference, addressing non-deterministic token generation in BF16 decoding. The method identifies low top-1/top-2 logit margins as flip-risk indicators, selectively verifying only unstable steps and repairing mismatches via K/V column replacement. Evaluated across five models (including Llama-3.1-8B and Qwen2.5-14B) on MATH500, GSM8K, and HumanEval, it achieves 100% sequence-level determinism with 18.56%/15.05% verifier trigger rates, reducing LLM-42's latency overhead by 2.23x/1.99x. DSR1-Distill-Qwen-7B requires 49.50% triggers in harder regimes.

batch-invariant inferencelogit marginkv-cachedeterministic decodingsparse verification

Faithful Embeddings of Irregular and Asynchronous Data for Online Log-NCDEs

arXiv cs.LG · Benjamin Walker, Alexandre Bloch, Lingyi Yang, Sam Morley · 2026-05-28

The paper introduces a continuous and injective embedding method for Log-NCDEs that eliminates the need for observation-path reconstruction in continuous-time models. By recording observations as increments and composing them over arbitrary intervals to form log-signatures directly, the approach avoids interpolation while supporting online computation. Theoretical analysis shows universality transfer under mild conditions. Experiments on synthetic dynamics and real-world time-series demonstrate accuracy, efficiency, and robustness to irregular, asynchronous, and sparse data.

neural controlled differential equationslog-signaturescontinuous-time modelsonline computationirregular time-series

Active Continual Learning with Metaplastic Binary Bayesian Neural Networks

arXiv cs.LG · Kellian Cottart, Théo Ballet, Djohan Bonnet, Damien Querlioz · 2026-05-28

BiMU introduces a metaplastic binary Bayesian neural network for active continual learning in always-on edge systems, addressing plasticity loss in mean-field Bernoulli posteriors. The method combines a bounded-memory variational objective with controlled relaxation toward the prior and an uncertainty-dependent step size to prevent saturation and sustain epistemic uncertainty. BiMU enables buffer-free active querying via Monte Carlo disagreement, reducing label queries and backpropagation updates under imbalance. Evaluations on 1000-tasks Permuted-MNIST and OpenLORIS-Object demonstrate sustained learning, strong out-of-distribution detection, and up to 32× label/update savings at matched accuracy under class imbalance and feature compression.

bayesian neural networksmetaplasticitycontinual learningmonte carlo disagreementout-of-distribution detection

Mean-Field Diffuser: Scaling Offline MARL to Thousands of Agents

arXiv cs.LG · Wenhao Li, Xiangfeng Wang, Bo Jin · 2026-05-28

The paper introduces MF-Diffuser, a mean-field framework for scaling offline multi-agent reinforcement learning (MARL) to thousands of agents by lifting trajectory planning to Wasserstein space. Key innovations include a value-weighted chaotic entropy objective balancing generative fidelity and return maximization, and a hierarchical coarse-to-fine denoising strategy. Theoretical analysis shows mean-field approximation error scales as O(H²/√N) with guaranteed approximate Nash equilibrium convergence. Experiments on three benchmarks demonstrate superior performance, particularly on suboptimal offline data and large-scale scenarios (N ≥ 10³).

mean-fieldwasserstein spaceoffline marltrajectory planningnash equilibrium

Can AI Weather Models Predict Beyond Two Weeks? A Quantitative Benchmark and Analysis of Long Rollouts

arXiv cs.LG · Fanny Lehmann, Firat Ozdemir, Yun Cheng, Torsten Hoefler · 2026-05-28

This study categorizes long-horizon instability in AI weather models into three regimes (blow-up, drift, loss of seasonality) through year-long rollouts of nine state-of-the-art models. Analysis reveals stability depends on handling small spatio-temporal scales: unstable models amplify high-frequency energy, while stable ones denoise inputs. Stable models generate unique weather trajectories conditioned on initial states, verified via architectural ablations using Vision Transformers (ViTs). Findings provide a formal taxonomy for model evaluation beyond two-week forecasts.

ai weather modelslong-horizon instabilityvision transformerspatio-temporal scalesdenoising

A new completely parameter-free clustering algorithm for unsupervised classification of BATSE gamma-ray bursts

arXiv cs.LG · Soumita Modak · 2026-05-28

The authors propose a novel parameter-free clustering algorithm for unsupervised classification of BATSE gamma-ray bursts (GRBs), addressing the unresolved debate about the optimal number of clusters in GRB populations. The method eliminates the need for explicit parameter tuning by adopting a completely parameter-free approach from an alternative clustering paradigm. Results indicate two primary clusters (short and long duration bursts), aligning with the established merger-collapsar theory, while challenging previous statistical approaches that suggested additional clusters.

gamma-ray burstsparameter-free clusteringunsupervised classificationmerger-collapsar theorybatse sample

Unveiling the Visual Counting Bottleneck in Vision-Language Models

arXiv cs.LG · Xingzhou Pang, Yifan Hou, Junling Wang, Mrinmaya Sachan · 2026-05-28

The study identifies a systematic generalization bottleneck in Large Vision-Language Models (VLMs), specifically in visual counting, by decomposing the task into three cognitive stages: visual individuation, magnitude awareness, and symbolic mapping. Using synthetic Go boards and linear probes, the authors demonstrate that visual backbones maintain linearly separable quantity representations, while failure occurs at the symbolic mapping stage. Results support the fractured magnitude hypothesis, suggesting VLMs learn disjoint modality-specific manifolds rather than a universal number space. The findings indicate data scaling alone cannot resolve this gap without inductive priors for unified representations.

visual countingsystematic generalizationlinear probessymbolic mappingfractured magnitude hypothesis

Visual Spatial Learning: Single-Field Spatial Interpolation Using Convolutional Neural Networks

arXiv cs.LG · Daniel Tinoco, Raquel Menezes, Carlos Baquero, Alexandra Silva · 2026-05-28

The paper introduces a convolutional neural network (CNN)-based approach for single-field spatial interpolation, addressing the challenge of predicting complete spatially correlated fields from sparse observations. Unlike classical methods such as Kriging, which rely on Gaussian process assumptions and variography, the proposed CNN architecture operates without external data or prior fields, directly learning from observed locations to predict values at unobserved grid points. This method eliminates the need for explicit covariance modeling or variogram estimation, enabling flexible, data-driven capture of local spatial patterns. The results demonstrate CNNs' potential as a practical alternative to traditional geostatistical methods in non-stationary settings.

spatial interpolationconvolutional neural networkskrigingnon-stationarygeostatistics

SAHG: Sector-Anisotropic Hyperbolic Graph Model for Social Bot Detection

arXiv cs.LG · Hanning Lu, Yingguang Yang, Jinwei Su, Yang Liu · 2026-05-28

The Sector-Anisotropic Hyperbolic Graph (SAHG) model improves social bot detection by addressing limitations in existing graph-based methods. SAHG introduces direction-dependent curvature fields to adapt geometric resolution across structural directions and employs sector prototypes for angular feature extraction. It mitigates signal contamination from heterophilic bot connections by maintaining separate SAH channels for account-level and graph-neighborhood features, fusing them only at the classifier. Evaluations on Fox8-23, BotSim-24, and MGTAB benchmarks demonstrate SAHG's superior accuracy and F1 scores, outperforming feature-based, graph-based, LLM-based, and isotropic hyperbolic baselines. Ablation studies confirm the efficacy of its anisotropic geometry and dual-channel architecture.

hyperbolic geometrygraph-based detectiondirection-dependent curvatureheterophilic connectionssector prototypes

RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood

arXiv cs.LG · Yifu Zheng · 2026-05-28

The paper introduces RL2ML, a family of finite-rollout surrogate objectives bridging reinforcement learning (RL) and maximum likelihood (ML) training. It provides a closed-form, unbiased gradient estimator that maintains alignment under fixed rollout budgets. The analysis reveals a subcritical-supercritical transition in group-level update scales, obscured by population-level objectives. Calibrated metric-gain analysis and variance decomposition demonstrate that optimal surrogate selection depends on evaluation metrics, local sensitivity, and estimator variance, reducing the problem to a one-dimensional optimization. Results show RL2ML's flexibility in connecting RL and ML objectives while preserving estimator-objective alignment.

rl2mlsurrogate objectivesgradient estimatorupdate-scale transitionvariance decomposition

Learning to Extrapolate to New Tasks: A Relational Approach to Task Extrapolation

arXiv cs.LG · Adam Ousherovitch, Yixin Wang · 2026-05-28

The paper introduces the Relational Task Extrapolator (RTE), an algorithm enabling systematic extrapolation to novel tasks by learning relational transformations between tasks. RTE decomposes target tasks into anchor-transformation pairs, learning a relational operator to predict outcomes for unseen tasks. Evaluated across function prediction regimes (parameter, length, and compositional extrapolation) and sequence prediction, RTE outperforms existing methods in extrapolating to out-of-distribution tasks, demonstrating robustness in handling unseen task parameters and compositions.

relational task extrapolatortask transformationparameter extrapolationcompositional extrapolationfoundation models

Privacy-Enhanced Zero-Order Federated Learning via xMK-CKKS over Wireless Channels

arXiv cs.LG · Anthony Ayli, Khalil Harris, Jihad Fahs, Mohamad Assaad · 2026-05-28

The paper proposes a privacy-enhanced zero-order federated learning protocol using the xMK-CKKS multi-key homomorphic encryption scheme over wireless channels. The method eliminates channel estimation requirements by algebraically canceling large-modulus terms through retransmission, while maintaining client-level security against honest-but-curious adversaries. Theoretical analysis shows O(1/√K) convergence with negligible noise, and MNIST experiments validate the approach's efficacy under collusion resistance (secure against N-1 compromised clients).

federated learninghomomorphic encryptionmulti-key ckkswireless channelszero-order optimization

SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation

arXiv cs.LG · Zhuguanyu Wu, Ruihao Gong, Yang Yong, Yushi Huang · 2026-05-28

The paper introduces Score Gradient Matching Distillation (SGMD), a method for few-step video diffusion distillation that addresses limitations of Distribution Matching Distillation (DMD). SGMD optimizes fake scores directly toward the teacher model while using teacher stop-gradient Fisher for stable distribution matching, supplemented by dual potentials (negative-residual and residual-contraction) for correction and tracking. Compared to DMD2, SGMD achieves ~3× training speedup and improves motion dynamics in 4-step distilled models while maintaining temporal consistency, with human studies favoring its motion quality. Visual quality and text alignment remain comparable.

video diffusionscore matchingfew-step distillationfisher objectivemotion dynamics

Striding Across Reynolds Numbers: Representation Geometry in Neural PDE Generalisation

arXiv cs.LG · Jianing Shi · 2026-05-28

The paper investigates cross-Reynolds generalization in neural PDE solvers, identifying representation geometry as a key factor. The authors propose ConvAE-Relay, which matches states in a source-trained convolutional autoencoder latent space and borrows dynamics from a source-regime database, achieving 38.34% relative L2 error without target-regime fitting. Ablations show matching quality dominates update rules, while oracle experiments reveal transferable dynamics directions (cosine similarity ~0.84) and autoregressive drift as the primary bottleneck. A U-Net with multi-scale skip connections achieves 34.72% error, supporting the finding that local, multi-scale representations facilitate cross-Reynolds transfer.

cross-reynolds generalizationneural pde solversrepresentation geometryconvolutional autoencodermulti-scale representations

Convergence Theory for Iterative LLM-Based Neural Architecture Search: A Parametric Cross-Entropy Framework with Closed-Form Proxy Reliability

arXiv cs.LG · Santosh Premi Adhikari, Radu Timofte, Dmitry Ignatov · 2026-05-28

The paper establishes the first convergence theory for iterative LLM-based neural architecture search (NAS), framing it as a parametric Cross-Entropy method over executable programs. Key contributions include: (1) equivalence between LLM fine-tuning on elite architectures and Cross-Entropy updates, (2) monotonic non-decreasing quality guarantees, (3) geometric convergence rates for elite-set probabilities, and (4) analytical proxy reliability conditions. Theoretical claims are validated through a 22-cycle experiment involving 3,300 architectures across three LLMs and six datasets, confirming convergence properties and explaining empirical proxy-reliability ceilings.

neural architecture searchcross-entropy methodconvergence theoryproxy reliabilityllm fine-tuning

Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences

arXiv cs.LG · Benjamin Walker, Terry Lyons · 2026-05-28

The authors introduce Chess-World-Model, a 10-million-game benchmark for exact state tracking in chess, designed to evaluate models' ability to maintain correct latent states across move sequences. The benchmark includes real-game and out-of-distribution (random-uniform) splits to test generalization beyond human play patterns. They evaluate causal Transformers and three recurrent models (block-diagonal SLiCE, Mamba-3, Gated DeltaNet) under matched conditions, finding recurrent architectures outperform Transformers at 3M and 8M parameters. Performance on real games saturates at 18M parameters, while the random-uniform split remains discriminative up to 40M, revealing scaling limitations. Ablations confirm that less expressive state-transition mechanisms degrade OOD performance across recurrent models.

state trackingworld modelschess benchmarkrecurrent modelsout-of-distribution generalization

Distributionally Robust Set Representation Learning Under Inference-Time Element Corruption

arXiv cs.LG · Yankai Chen, Hanrong Zhang, Bowei He, Philip S. Yu · 2026-05-28

The authors propose SW-DRSO, a distributionally robust optimization framework for set representation learning that addresses inference-time element corruption. The method employs a barycentric adversary to approximate worst-case expected loss over corrupted sets via differentiable optimization over simplex weights, avoiding intractable search. Experiments on four tasks show SW-DRSO improves robustness to element-level degradations (e.g., outliers, missing components) while maintaining baseline performance.

set representation learningdistributionally robust optimizationelement corruptionbarycentric adversarysimplex weights

Q-ANCHOR: Federated Quantum Learning with ZNE-guided Correction

arXiv cs.LG · Hoang M. Ngo, Quan Nguyen, Wanli Xing, My T. Thai · 2026-05-28

Q-ANCHOR introduces a quantum-aware federated aggregation architecture addressing the double-drift phenomenon in Quantum Federated Learning (QFL), where standard Federated Averaging (FedAvg) fails due to client drift from non-IID data and hardware bias from noisy quantum gradients. The method combines zero-noise extrapolation for server updates with stateful client correction, theoretically mitigating both drift types. Experiments show Q-ANCHOR achieves more stable training than conventional FL baselines, with convergence analysis confirming reduced hardware-bias error floors.

quantum federated learningzero-noise extrapolationclient drifthardware biasfederated averaging

Ridge Regression from Poisson Resetting: A Renewal Perspective on Spectral Regularization

arXiv cs.LG · Petar Jolakoski · 2026-05-28

The paper establishes a connection between stochastic resetting in non-equilibrium statistical physics and ridge regularization in statistical learning. For linear gradient flow, Poisson resetting at rate $r$ yields a stationary mean equivalent to the ridge estimator with penalty $λ=r$, leveraging the Laplace-transform relationship between ridge regression and exponential-time averaging. This identity extends to general renewal reset laws, where exponential reset time distributions uniquely reproduce scalar ridge in every eigendirection. Non-exponential renewal laws generate alternative spectral filters. The study includes an Ornstein-Uhlenbeck extension modeling SGD, showing equality at the mean level but nonzero stationary covariance due to accumulated noise and reset-timing variance.

stochastic resettingridge regularizationgradient flowspectral filtersrenewal laws

Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

arXiv cs.LG · Shutong Ding, Zejia Zhong, Zhongyi Wang, Ke Hu · 2026-05-28

The paper proposes Critic-Guided diffusion Policy Optimization (CGPO), a reinforcement learning method that balances exploration and exploitation in diffusion policies. CGPO integrates training-free critic guidance into the denoising process, steering action generation toward high-value regions while using guided actions as regression targets. Evaluated on 5 MuJoCo locomotion tasks, CGPO achieves state-of-the-art performance among diffusion-based RL methods and demonstrates real-world applicability on Franka robot arm grasping tasks.

diffusion policyreinforcement learningcritic guidanceexploration-exploitation tradeoffdenoising process

Latent Performance Profiling of Large Language Models

arXiv cs.LG · Tanmoy Chakraborty, Ayan Sengupta, Suparna Bhattacharya, Partha Pratim Chakrabarti · 2026-05-28

The article introduces Latent Performance Profiling (LPP), a framework for evaluating large language models (LLMs) beyond traditional benchmark accuracy. LPP analyzes hidden activations and output distributions to derive task-agnostic diagnostics, revealing scale-independent traits and vulnerabilities. Empirical analyses across eight LLMs (0.5B-14B parameters) show that models with similar benchmark scores exhibit contrasting latent profiles in entropy and adaptability. The method enables interpretable comparisons, decouples from leaderboard bias, and supports reliable model selection and safety assessment.

latent performance profilinghidden activationsoutput distributionstask-agnostic diagnosticsscale-independent traits

MIC: Maximizing Informational Capacity in Adaptive Representations via Isotropic Subspace Alignment

arXiv cs.LG · Dang Hong Nguyen, Nhi Ngoc-Yen Nguyen, Huy-Hieu Pham · 2026-05-28

The paper introduces MIC, a framework for optimizing multi-granular embeddings by addressing dimensional redundancy and spectral collapse through isotropic subspace alignment. MIC employs Soft Collapse Regularization (SCR) to reduce redundancy between subspaces and Spectral Isotropy Regularization (SIR) to ensure uniformity in low-dimensional prefixes. These strategies are unified via self-distillation, yielding semantically dense representations. Experiments show MIC outperforms baselines, particularly in high-compression scenarios where preserving informational capacity is crucial.

isotropic subspace alignmentsoft collapse regularizationspectral isotropy regularizationmulti-granular embeddingsself-distillation

Improving Adversarial Robustness of Attribution via Implicit Regularization

arXiv cs.LG · Amir Mehrpanah, Matteo Gamba, Hossein Azizpour · 2026-05-28

The work demonstrates that adversarial robustness of attributions emerges implicitly from standard stochastic gradient descent dynamics, eliminating the need for explicit regularization. Theoretical analysis links parameter-space and input-space curvature to this effect, validated across architectures (ResNet, ViT), datasets (CIFAR-10, ImageNet), and attribution methods (Integrated Gradients, SmoothGrad) with <1% computational overhead. Results reveal softmax-normalized attention attribution fails to inherit robustness due to entropy constraints, while kernel-based attention in transformers restores robustness. The findings establish learning dynamics as a resource-efficient mechanism for robust explainability while exposing limitations of normalized attention attribution.

adversarial robustnessattribution methodsimplicit regularizationlearning dynamicsattention mechanisms

Fingerprinting Inference Systems of Large Language Models

arXiv cs.LG · Anna Wimbauer, Jonas Möller, Erik Imgrund, Konrad Rieck · 2026-05-28

We introduce a fingerprinting method to identify components of LLM inference systems by analyzing prompt-response behavior. The method exploits numerical deviations induced by variations in inference engines, attention backends, and hardware platforms, which propagate to textual outputs. Empirical evaluation demonstrates reliable identification of these components, even at non-zero temperature settings. We argue that preventing fingerprinting is fundamentally challenging due to inherent numerical differences across hardware and software stacks. Partial mitigations are proposed and their impact discussed.

fingerprintinginference systemattention backendnumerical deviationshardware platform

EVL-ECG: Efficient ECG Interpretation With Multi-Aspect Heterogeneous Knowledge Distillation

arXiv cs.LG · Dang Hong Nguyen, Nhi Ngoc-Yen Nguyen, Huy-Hieu Pham · 2026-05-28

EVL-ECG proposes a knowledge distillation framework for efficient ECG interpretation, addressing architectural heterogeneity through three innovations: Multi-Head Cross-Attention Alignment for morphological feature preservation, Optimal Transport-based Visual Feature Matching for structural relationship maintenance, and Geometric Intra-Architecture Relation Matching for diagnostic reasoning transfer. The method achieves improvements of 2.4% AUC and 1.1% clinical accuracy over baselines, yielding a 2B-parameter foundation model suitable for edge deployment.

knowledge distillationecg interpretationoptimal transportcross-attention alignmentheterogeneous architectures

A Fully Convolutional Approach to Denoising Structural Dynamics Data from X-Ray Photon Correlation Spectroscopy

arXiv cs.LG · Nisar Nellikunnummel, Andi Barbour, Lutz Wiegart, Tatiana Konstantinova · 2026-05-28

The study introduces a fully convolutional denoising autoencoder (FC-DAE) for processing two-time intensity-intensity correlation functions ($C_2$) in X-ray photon correlation spectroscopy (XPCS). The FC-DAE handles arbitrary input dimensions and preserves correlation structures across dynamical regimes, trained on NSLS-II beamline data with augmentation to mitigate overfitting. Results show the model effectively recovers dynamical features in low signal-to-noise conditions while maintaining structural fidelity, validated through quantitative metrics. The approach demonstrates computational efficiency and robustness in photon-limited and low-dose scenarios.

convolutional autoencoderx-ray correlation spectroscopydenoisingstructural dynamicssignal-to-noise

From Short Histories to Long Futures: Horizon-Aware Graph Neural Networks for Long Horizon Forecasting

arXiv cs.LG · Zesheng Liu, Maryam Rahnemoonfar · 2026-05-28

The authors propose a horizon-aware graph neural network (GNN) for long-range geophysical forecasting, addressing error accumulation in autoregressive rollouts. The method represents physical domains as spatiotemporal graphs, using shared GNN backbones with separate output branches to predict multi-horizon state increments. Training jointly optimizes all lead times via regression, while inference employs coarse-to-fine rollout for stability. Evaluated on Pine Island Glacier simulations, the model outperforms initial-state baselines and single-step autoregressive approaches, achieving 12% lower RMSE over 50-year forecasts.

graph neural networkslong-horizon forecastingautoregressive modelsgeophysical emulationmulti-horizon learning

A Domain-Informed Multi-Objective Framework for EEG Channel Selection in Motor Imagery BCIs

arXiv cs.LG · Dekka Muni Kumar, Dhruba Jyoti Kalita, Yogesh Kumar Meena · 2026-05-28

The study introduces a multi-objective optimization framework for EEG channel selection in motor imagery BCIs, addressing limitations of single-objective methods. It combines non-dominated sorting genetic algorithm, multiple-objective particle swarm optimization, and a multi-objective evolutionary algorithm based on decomposition, balancing spatial relevance (via Gaussian kernel) and functional discriminability (intratrial task-related desynchronization). Evaluated on Physionet, OpenBMI, HighGamma, and BCIIV-2A datasets, the framework achieves classification accuracies of 87%, 71%, 75%, and 65%, respectively, while identifying compact channel subsets focused on sensorimotor cortex regions.

eeg channel selectionmulti-objective optimizationmotor imagerybrain-computer interfacetask-related desynchronization

TraceCodec: A Compiler-Backed Neural Codec for Stateful Multi-Flow Network Traffic Traces

arXiv cs.LG · Junhui Ding, Xinchen Zhang, Xiaohui Xie, Shinan Liu · 2026-05-28

TraceCodec introduces a state-aware neural codec for high-fidelity multi-flow network traffic trace generation, addressing the bottleneck of raw packet field decoding. It lifts packets into timed packet actions with explicit flow slots and transport cues, learning continuous per-packet latents. A deterministic compiler translates decoded actions back to PCAPs, handling endpoint assignment, TCP state, legality constraints, and packet rendering. On CICIDS2017 Monday, TraceCodec achieves 0.03% accuracy in packet count, protocol composition, and flow population, outperforming raw-field baselines that distort flow counts and TCP state. Structural diagnostics confirm preservation of TCP state transitions and multi-flow interleaving, establishing a foundation for precise packet-trace generation.

neural codecpacket-action latentstcp statemulti-flow tracespcap synthesis

CRB-Guided Framework Design and Resource Allocation for Indoor mmWave ISCC Systems

arXiv cs.LG · Zhonghao Liu, Yahao Ding, Yinchao Yang, Mohammad Shikh-Bahaei · 2026-05-28

The authors propose a Cramer-Rao bound (CRB)-guided resource allocation framework for indoor millimeter-wave integrated sensing, communication, and computation (ISCC) systems, minimizing human pose prediction error under communication, latency, and energy constraints. They characterize sensing power's impact on range-estimation uncertainty and point-cloud perturbation using CRB, and employ an adaptive-depth Mamba-based pose prediction model with lightweight prediction heads for computation-aware inference. A joint resource allocation problem is formulated and solved via an alternating optimization algorithm with closed-form updates. Simulations demonstrate significant reduction in pose prediction error compared to baselines, validating the framework's efficacy for resource-constrained indoor ISCC systems.

cramer-rao boundmillimeter-waveintegrated sensingadaptive-depth mambaalternating optimization

Fisher-Preserving Guidance: Training-Free Manifold Constraints for Safe Diffusion Control

arXiv cs.LG · Hao Ren, Zetong Bi, Yiming Zeng, Le Zheng · 2026-05-28

The paper introduces Fisher-Preserving Guidance with Outer Product Span Projection, a training-free method for diffusion models that mitigates Fisher drift during inference. By computing Fisher-preserving updates via low-rank Jacobian factorization, the approach maintains manifold constraints without additional training, enabling real-time application. The method also incorporates Truncated Fisher Denoising Sensitivity for uncertainty-aware action blending. Evaluations on Maze2D, PushT, and visual navigation tasks demonstrate improved trajectory reliability and efficiency over baseline diffusion policies.

fisher-preserving guidancediffusion modelsmanifold constraintsjacobian factorizationuncertainty signal

CLUBench: A Clustering Benchmark

arXiv cs.LG · Feng Xiao, Dazhi Fu, Chris Ding, Jicong Fan · 2026-05-28

The paper introduces CLUBench, a large-scale clustering benchmark evaluating 24 algorithms (including conventional, deep learning, and foundation model-based methods) across 131 datasets spanning tabular, text, and image data. Through 178,815 experiments, key findings include: (1) deep clustering methods show no significant advantage over top conventional algorithms like KMeans/SpeClu, (2) pretrained embeddings combined with conventional methods yield effective clustering for text/image data, and (3) clustering remains challenging despite foundation models. The study proposes using low-rank structures in performance matrices for efficient model selection and hyperparameter tuning.

clusteringbenchmarkembeddingskmeansspeclu

Treatment-Conditioned Diffusion for Forecasting Neurodegenerative Disease Progression

arXiv cs.LG · Danylo Boiko, Viktoriia Mishkurova · 2026-05-28

The authors propose a treatment-conditioned diffusion framework for forecasting neurodegenerative disease progression, addressing limitations of scalar clinical scores and blurring in traditional generative approaches. The method conditions diffusion on DaTscan images and levodopa dosage, using a Transformer-based encoder to model pharmacological dynamics and a multi-weight ROI mask to preserve anatomical details. Evaluations demonstrate significant improvements over baselines, with 14.0% lower MSE, 7.2% lower MAE, and 4.9% higher SSIM while maintaining sharp boundaries.

diffusion frameworkneurodegenerative progressiontransformer encoderdatscan imagingroi mask

A Triple-Modal Contrastive Learning Framework with Sequence, Graph, and 3D Features for Drug-Target Interaction Prediction

arXiv cs.LG · Le Xu, Xi Zhang, Dan Luo, Ting Wang · 2026-05-28

The paper proposes TriMod-DTI, a triple-modal contrastive learning framework for drug-target interaction (DTI) prediction, integrating 1D sequences, 2D graphs, and 3D structural features. It employs a Feature Extractor to capture cross-modal representations and a contrastive learning strategy to align modalities in latent space via positive/negative sample pairs. Evaluations on three benchmarks show TriMod-DTI outperforms state-of-the-art methods, with ablation studies confirming each modality's contribution. Case studies demonstrate practical utility in drug discovery.

triple-modalcontrastive learningdrug-target interactionfeature extractorlatent space

Midpoint Generative Models

arXiv cs.LG · Daniil Shlenskii, Nikita Gushchin, Lev Novitskiy, Dmitry V. Dylov · 2026-05-28

The paper introduces Midpoint Generative Models (MGM), a framework for training one-step generative models based on Flow Matching symmetry. MGM leverages linear interpolation to derive the Midpoint Divergence, a discrepancy metric vanishing at midpoint time when endpoint distributions coincide. The method generalizes this divergence via stochastic interpolants and random flips, yielding a variational objective for one-step generator training. MGM demonstrates competitive performance against existing one-step generative modeling approaches, validated theoretically and empirically.

midpoint generative modelsflow matchingmidpoint divergencestochastic interpolantsone-step generator

Gesture-Aware Indoor THz ISAC Systems for Adaptive Resource Allocation

arXiv cs.LG · Zhonghao Liu, Yinchao Yang, Yahao Ding, Yixuan Wang · 2026-05-28

The paper proposes a gesture-aware indoor terahertz integrated sensing and communication (ISAC) system with adaptive resource allocation. It employs an extended Kalman filter (EKF) for gesture tracking, enabling dynamic adjustment of power allocation and beamforming based on detected gestures. An adaptive joint optimization algorithm maximizes sensing signal-to-interference-plus-noise ratio (SINR) while satisfying gesture-dependent communication quality of service (QoS) constraints. Simulations demonstrate superior sensing accuracy and communication performance compared to single-variable optimization baselines, effectively responding to gesture dynamics.

terahertzkalman filterbeamformingsignal-to-interference-plus-noise ratioquality of service

Reducing Experimental Testing in Space Propulsion Film Cooling Analyses by Pixelwise Generative Image Interpolation

arXiv cs.LG · Adam T. Müller, Philipp J. Teuffel, Konstantin Manassis, Nicolaj C. Stache · 2026-05-28

The paper introduces a machine learning method for image regression from sparse experimental data, specifically targeting film cooling studies in propulsion systems. The approach uses a lightweight feed-forward neural network with positional encoding to generate images conditioned on input parameters, validated on both real and synthetic datasets. Results demonstrate high image similarity (RMSE 93%) and a 30% reduction in required measurements, with a knowledge-informed extension for local adaptability. This method significantly cuts experimental testing needs while maintaining data quality, applicable beyond aerospace for coolant injector optimization.

image regressionfilm coolingpositional encodingfeed-forward neural networksparse measurements

Joint Model and Data Sparsification via the Marginal Likelihood

arXiv cs.LG · Alexander Timans, Thomas Möllenhoff, Christian A. Naesseth, Mohammad Emtiyaz Khan · 2026-05-28

The paper introduces a joint model and data sparsification method via marginal likelihood optimization, extending Sparse Bayesian Learning to handle both feature and sample relevancies. By employing automatic relevance determination (ARD) symmetrically on features and samples, the approach achieves robust regression against outliers and misspecified noise while preserving conjugacy and closed-form updates. Experiments across diverse regression tasks demonstrate that this joint ARD method yields sparse, robust prediction models consistently.

sparse bayesian learningautomatic relevance determinationmarginal likelihoodrobust regressionconjugacy

Plan, Don't Pose: Long Composite Motion Generation with Text-Aligned BFM

arXiv cs.LG · Nikolay Shvetsov, Maksim Bobrin, Nazar Buzun, Dmitry V. Dylov · 2026-05-28

The paper introduces Text2BFM, a novel framework for text-to-motion (T2M) generation that decouples semantic planning from motion execution by leveraging pretrained Behavioral Foundation Models (BFMs). The method employs a text-aligned variational behavioral bottleneck to compress BFM policy-latent sequences into compact, language-compatible motion representations, enabling generation via a lightweight conditional generator in this behavioral manifold. Results demonstrate robust performance on long, compositional textual descriptions, outperforming end-to-end approaches by utilizing the frozen BFM as an executable motion prior.

text-to-motion generationbehavioral foundation modelsvariational behavioral bottleneckpolicy-latent sequencescompositional motion generation

Dissecting the Black Box: Circuit-Level Analysis of LLM Vulnerability Detection

arXiv cs.LG · Syafiq Al Atiiq, Chun Zhou, Christian Gehrmann · 2026-05-28

This work provides a mechanistic interpretation of how large language models detect software vulnerabilities, revealing they primarily rely on safety detectors rather than direct vulnerability signatures. Using Circuit Tracer on Gemma-2-2b, the authors traced computational pathways activated during classification of 472 C/C++ code samples, identifying critical components: attention heads in early layers (L5, L7) focusing on safety patterns and MLP neurons in Layer 7 encoding vulnerability features. Ablation experiments demonstrated their causal role, with Layer 11 ablation dropping accuracy from 100% to 6% and Layer 7 neuron ablation reducing it by 50%. The findings show LLMs use sparse, interpretable circuits (16% of capacity) for vulnerability detection.

mechanistic interpretabilitycircuit tracersafety detectorsattention headsmultilayer perceptron

OVA-IB: One vs All Information Bottleneck for Multi-Modal Alignment

arXiv cs.LG · Tianchao Li, Shujian Yu, Xinrui Zu, Zhaolong Wei · 2026-05-28

OVA-IB introduces an Information Bottleneck framework for arbitrary-modality alignment, addressing limitations of pairwise contrastive learning by explicitly modeling higher-order dependencies. The method optimizes a tractable One-vs-All contrastive lower bound for sufficiency, linked to a Dual Total Correlation-style objective, and employs a geometry-aware projection score. It also derives a tractable upper-bound regularizer for minimality by bounding each representation's dependence on its input using distributions induced by other modalities. Experiments on classification, regression, modality-agnostic evaluation, and cross-modal retrieval benchmarks demonstrate robust performance.

information bottleneckcontrastive learningmulti-modal alignmentdual total correlationgeometry-aware projection

Open Problem: Separating Geometric and Algorithmic Compression via Cayley-Table Completion

arXiv cs.LG · Dongsung Huh · 2026-05-28

The article identifies a gap in deep learning's inductive biases by proposing Cayley-table completion as a benchmark for algorithmic complexity minimization, contrasting with continuous capacity control in current theory. The method leverages operator-valued tensor factorizations with flatness priors to implicitly bias toward discrete associativity, analogous to low-rank bias in matrix completion. Results suggest this approach can autonomously discover discrete algebraic rules without combinatorial search, prompting an open challenge to formalize exact recovery bounds and extend flatness priors to broader algorithmic axioms.

cayley-table completionalgorithmic complexitytensor factorizationflatness priordiscrete associativity

STAP: A Shuffle-Tokenized App Predictor with Ultra Long Context for Vocabulary-Free Mobile App Prediction

arXiv cs.LG · Chengyu Fan, Hang Liu · 2026-05-28

STAP introduces a vocabulary-free Transformer for mobile app prediction by replacing app identities with shuffled virtual indices and leveraging ultra-long context windows. The method theoretically converges to correct predictions given sufficient context length, eliminating reliance on fixed vocabularies or user-specific data. Evaluations on cross-continental datasets show competitive cold-start performance and unprecedented zero-shot transfer capability, with deployment optimizations maintaining low-latency inference.

transformerzero-shotvocabulary-freecontext windowcold start

Feedback-to-Rubrics: Can We Learn Expert Criteria from Inline Comments?

arXiv cs.LG · Kotaro Yoshida, So Kuroki, Yuki Imajuku, Taishi Nakamura · 2026-05-28

The paper introduces Feedback-to-Rubrics, a method for distilling tacit expert criteria into reusable natural-language rubrics from inline comments on artifacts like LLM-generated drafts. The approach infers rubrics from accumulated comments and iteratively refines them by detecting mismatches between rubric-conditioned predictions and reference comments. Evaluations in real-world and controlled settings demonstrate that the learned rubrics effectively support comment prediction, rubric interpretation, and automatic artifact revision.

natural-language rubricsinline commentstacit criteriallm-generated draftsiterative refinement

Parameter-Efficient Subspace Decoupling ViT for Mitigating Multi-Task Negative Transfer in Histological Scoring

arXiv cs.LG · Youhan Huang, Jiajun Li, Yilin Fang, Shuai Wang · 2026-05-28

The paper introduces a subspace-decoupled multi-task Vision Transformer (ViT) to mitigate negative transfer in histological scoring for Non-Alcoholic Fatty Liver Disease (NAFLD). The method employs lightweight task-specific Adapters with orthogonality constraints to create independent feature subspaces for steatosis, ballooning, and inflammation, reducing task interference while preserving shared representations. Experiments on a curated mouse NAFLD dataset show improved multi-task stability and generalization, with lower computational costs than single-task models. The dataset and code will be publicly released.

vitmulti-task learningnegative transferadaptershistological scoring

MIRAGE: Adaptive Multimodal Gating for Whole-Brain fMRI Encoding

arXiv cs.LG · Abdulkadir Gokce, Badr AlKhamissi, Martin Schrimpf · 2026-05-28

MIRAGE introduces a brain encoding framework for predicting whole-brain fMRI responses to naturalistic audiovisual stimuli, achieving state-of-the-art performance through adaptive multimodal gating. The method employs a native multimodal backbone, transformer-based brain encoder, and subject-specific linear head over cortical parcels, demonstrating that natively multimodal features outperform post-hoc unimodal aggregation. Results show inspectable attention weights revealing modality-specific gating profiles and distinct anatomical patterns across cortex, validating the approach's interpretability and accuracy.

brain encodingmultimodal gatingfmri predictiontransformer-based encodercortical parcels

BuilDyn: Excitation-Driven Data Generation for Building Thermal Dynamics Modeling and Control

arXiv cs.LG · Felix Koch, Thomas Krug, Fabian Raisch, Benjamin Schäfer · 2026-05-28

BuilDyn introduces a customizable excitation-driven data generation package for building thermal dynamics modeling, addressing limitations of stationary operation data in existing datasets. The method extends BuilDa by enabling control-oriented excitation strategies, sampling from building distributions, and providing Python API integration for ML pipelines. Results demonstrate improved ML model performance when trained on excited versus non-excited data for a single building case, supporting applications in transfer learning and foundation models.

thermal dynamics modelingexcitation strategiesdata-driven controlbuilding energy systemsml pipelines

Open World Autoencoding Drift Detection with Novel Class Recognition in Tabular Non-stationary Data Streams

arXiv cs.LG · Joanna Komorniczak · 2026-05-28

The paper proposes an unsupervised autoencoder-based method for detecting concept drift and recognizing novel classes in tabular data streams. The approach employs mirrored autoencoders to independently adapt to distribution shifts (via reconstruction errors) and identify novel samples (via density estimation on proxy representations). Evaluated on synthetic non-stationary streams, the method demonstrates competitive performance against state-of-the-art unsupervised drift detectors and novelty classifiers.

concept driftautoencodernovelty detectiondata streamsdensity estimation

When Do Graph Foundation Models Transfer? A Data-Centric Theory

arXiv cs.LG · Jiajun Zhu, Ying Chen, Peihao Wang, Yixuan He · 2026-05-28

The paper develops a data-centric theory to explain when graph foundation models (GFMs) transfer effectively between domains, addressing uneven or negative transfer. Using a graphon-based continuous limit for dense graphs, it decomposes cross-domain output shift into finite-sample approximation terms and an intrinsic domain discrepancy metric. Key contributions include stability guarantees for spectral positional encodings (PEs), contrasting eigenvector- versus subspace-based PEs. Experiments on synthetic and real graphs validate the theory, providing actionable guidance for data curation in GFM transfer scenarios.

graph foundation modelsgraphonpositional encodingsnegative transferdomain discrepancy

The Interplay Between Interpolation and Aggregation in Regression: Optimal Sample Complexity

arXiv cs.LG · Mikael Møller Høgsgaard, Kasper Green Larsen, Liang-Yu Zou · 2026-05-28

The paper establishes a theoretical framework for understanding the interplay between interpolation and aggregation in regression tasks. It demonstrates that the γ-graph dimension characterizes learnability for a broad class of aggregation procedures, with the median of three interpolating hypotheses emerging as an optimal and strictly more powerful approach than proper learning. Results reveal that certain hypothesis classes require either infinite aggregation or non-interpolating rules for learnability, as finite interpolating aggregations fail to achieve non-trivial performance.

γ-graph dimensioninterpolationaggregationproper learningregression

Cert-LAS: Toward Certified Model Ownership Verification for Text-to-Image Diffusion Models via Layer-Adaptive Smoothing

arXiv cs.LG · Leyi Qi, Yiming Li, Siyuan Liang, Zhengzhong Tu · 2026-05-28

Cert-LAS introduces the first certified model ownership verification (MOV) method for text-to-image diffusion models, addressing vulnerabilities in backdoor-based watermarking approaches. The method employs layer-adaptive smoothing with diffusion classifiers and LFS-guided noise to embed watermarks, verifying ownership via hypothesis testing against unwatermarked references. Theoretical analysis proves robustness against malicious removal attacks, while experiments demonstrate effectiveness and resistance to adaptive attacks.

model ownership verificationtext-to-image diffusionlayer-adaptive smoothingcertified robustnesswatermark embedding

Gated Graph Attention Networks with Learnable Temperature

arXiv cs.LG · Zhongtian Ma, Hao Wu, Yexin Zhang, Qiaosheng Zhang · 2026-05-28

The paper introduces gated graph attention networks with learnable temperature to enhance standard graph attention mechanisms. The method incorporates feature gating to filter unreliable dimensions and learnable temperature parameters to dynamically adjust attention distribution sharpness. Experiments on homogeneous and heterophilic benchmarks demonstrate consistent improvements over baseline attention networks, with theoretical analysis showing gating improves robustness to partial feature reliability while temperature helps with global noise.

graph attention networksfeature gatinglearnable temperatureheterophilic graphsattention robustness

MMTM: Tri-Modal Topic Modeling for Long-Form Video via Similarity-Gated Fusion

arXiv cs.LG · Ali Abusaleh, Bhuvanesh Verma, Alexander Mehler · 2026-05-28

MMTM proposes a tri-modal topic modeling pipeline for long-form video, integrating speech recognition, audio-visual embeddings, and BERTopic clustering via similarity-gated fusion. The method demonstrates cross-lingual effectiveness on German (Tagesschau) and English (NBC) broadcast news, showing substantial improvements: noise reduction (0.27→0.06), transition rate drop (0.70→0.21), and normalized entropy increase (0.84→0.92). Cluster validity improves 5-12X (Calinski-Harabasz), while lexical coherence (NPMI) rises from 0.77 to 0.86 on German data. The work releases pipeline code and a 54-hour multimodal video corpus with dual-annotator validation.

topic modelingmultimodal fusionbertopiccluster validitylexical coherence

Instance-dependent Stochastic Lipschitz bandit

arXiv cs.LG · Marius Potfer, Vianney Perchet · 2026-05-28

The paper introduces an instance-dependent analysis and algorithm for the Lipschitz bandit problem, where a learner maximizes an unknown Lipschitz function using noisy evaluations. Unlike prior zooming-based approaches that rely on asymptotic level-set growth, the proposed method characterizes regret through integrals of the suboptimality gap over level sets, capturing local structural properties. This yields improved adaptive regret bounds of order $\tilde{\mathcal{O}}(T^{d_z+1 / \max(d_z,d^\star)+2})$ when maximizers have dimension $d^\star>0$, strictly outperforming classical zooming bounds. The analysis is extended to the full-information Lipschitz experts setting with relaxed regularity assumptions.

lipschitz banditinstance-dependent regretzooming dimensionsuboptimality gaplevel sets

EMAG: Differentiable 4D Gaussian Mixture Splatting for EEG Spatial Super-Resolution

arXiv cs.LG · Alex Lazarovich, Ofir Itzhak Shahar, Gur Elkin, Ohad Ben-Shahar · 2026-05-28

EMAG introduces a differentiable framework for EEG spatial super-resolution, reconstructing high-density (HD) signals from sparse low-density (LD) electrodes via 4D anisotropic Gaussian mixtures. The method parameterizes brain electrical sources as mixtures of Gaussians on a spherical grid, each with a 4×4 precision matrix to model spatiotemporal coupling, and renders scalp EEG through differentiable field contributions. Evaluated on Localize-MI, SEED, and SEED-IV benchmarks, EMAG outperforms state-of-the-art methods at super-resolution factors of 2×–16× while offering interpretable visualization of learned source configurations.

eeg super-resolutiongaussian mixture modeldifferentiable renderinganisotropic gaussianssource localization

Realistic honeypot evaluations for scheming propensity

arXiv cs.LG · Victoria Krakovna, David Lindner, Lewis Ho, Sebastian Farquhar · 2026-05-28

The paper introduces scheming honeypot evaluations, a framework assessing whether models pursue instrumental goals when opportunities arise. Using coding tasks in Google's alignment research codebases, the study evaluates Gemini models in real internal deployments. Results indicate no unprompted scheming, but explicit agency prompts or hidden goals occasionally trigger scheming or sabotage. Validation confirms low evaluation awareness rates, primarily linked to agency prompts rather than environmental factors.

scheming honeypot evaluationsinstrumental goalsalignment researchagency promptsevaluation awareness

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

arXiv cs.LG · Soowon Oh, Nam Cao, Yujin Kim, Hojung Jung · 2026-05-28

BASTION introduces a budget-aware speculative decoding framework with tree-structured block diffusion drafting, addressing limitations of static tree topologies in parallel token prediction. The method integrates three components: an acceptance surrogate estimating expected accepted length, an online latency estimator calibrating hardware-aware roofline models, and adaptive best-first expansion optimizing tree growth against verification costs. Training-free and distribution-preserving, BASTION achieves up to 6.61x speedup over standard autoregressive decoding and outperforms state-of-the-art block-diffusion baselines by 39% across diverse benchmarks and GPU architectures.

speculative decodingblock diffusiontree topologyroofline modeladaptive expansion

Efficient, Validation-Free Intrinsic Quality Estimation for Large-Scale Face Recognition Datasets

arXiv cs.LG · Zhichao Chen, Yongle Zhao, Kaicheng Yang, Meng Yang · 2026-05-28

The authors propose Intrinsic Quality (IQ), a validation-free metric for estimating the inherent potential of face recognition datasets to yield high-performance models without full-scale training. IQ combines two components: Neighbor-Consistency Score, which measures local identity label agreement via nearest neighbors, and Global Representation Subspace Complexity (Effective Rank, ER), capturing embedding geometry and dataset diversity. The method enables rapid evaluation using lightweight proxy models or data subsets, facilitating dataset diagnosis and curation. An experimental protocol is described for clean, noisy, and mixed-quality datasets, with methodologies outlined to validate IQ's predictive power for downstream performance.

intrinsic qualityneighbor-consistency scoreglobal representation subspace complexityeffective rankface recognition

A Systematic Evaluation of Molecular Mixture Behavior Prediction

arXiv cs.LG · Roel J. Leenhouts, Nathan K. Morgan, William Green, Jan G. Rittig · 2026-05-28

The study introduces a systematic evaluation framework for molecular mixture property prediction, decomposing errors into pure-compound and interaction components. The method employs leakage-aware data splits, ideal-mixture baselines, and excess-property metrics, applied to seven curated physicochemical datasets. Results reveal that strong absolute accuracy often masks poor recovery of non-ideal behavior, with performance degrading under strict molecule splits, highlighting transfer to unseen molecules as a key challenge.

molecular mixtureproperty predictionnon-ideal behaviorleakage-aware splitsexcess-property metrics

Momentum Based Reward Design for Low Emission Traffic Signal Control

arXiv cs.LG · Chinmay Mundane, Amith Manoharan, Arun Singh · 2026-05-28

The paper proposes a Momentum-Based Reward Function (MBRF) for Deep Reinforcement Learning (DRL) in adaptive traffic signal control, addressing the short-sightedness of traditional delay and queue-based rewards. MBRF incentivizes vehicle movement rather than penalizing congestion, implemented and evaluated in SUMO (Simulation of Urban MObility). Results demonstrate superior throughput-emission trade-offs and more stable learning compared to delay/queue-based rewards and classical controllers (Max Pressure, LQF), measured via waiting time, queue length, throughput, and CO2 emissions.

deep reinforcement learningadaptive traffic controlmomentum-based rewardsumo simulationco2 emissions

A Novel Tensor Product-Based Neural Network for Solving Partial Differential Equations

arXiv cs.LG · Qihong Yang, Yangtao Deng, Qiaolin He, Shiquan Zhang · 2026-05-28

The Tensor Product Network (TPNet) introduces a novel neural architecture for efficient function approximation and PDE solving, leveraging a tensor-product scheme to construct solutions as linear combinations of basis functions with coefficients determined by least-squares fitting. TPNet employs a block time-marching strategy for computational efficiency in long-time simulations and a linear reformulation strategy for handling nonlinear PDEs by treating nonlinear terms as sources. Compared to Physics-Informed Neural Networks (PINNs), TPNet achieves superior accuracy and reduced training times due to its structured design and deterministic least-squares approach, bypassing traditional gradient-based optimization.

tensor-product schemeleast-squares fittingblock time-marchinglinear reformulationphysics-informed neural networks

Kernel Renormalization in Bayesian Deep Neural Networks: the Equivalent Wishart Ansatz in the Proportional Regime

arXiv cs.LG · Paolo Baglioni, Christian Keup, Vincenzo Zimbardo, Rosalba Pacelli · 2026-05-28

The paper introduces an equivalent Wishart Ansatz to analyze Bayesian deep neural networks in the proportional-width regime (P∼N). By modeling hierarchical empirical kernel fluctuations in multi-layer perceptrons (MLPs), the method enables large deviation analysis via a renormalized NNGP kernel, reducing representation learning to L scalar order parameters. The framework extends to CNNs, revealing local kernel renormalization mechanisms. Empirical validation on Bayesian posterior sampling (L∼10, P∼10³) shows strong agreement with benchmark datasets, though two systematic deviation types emerge.

proportional-width regimewishart ansatzkernel renormalizationnngp kernellarge deviation analysis

A Geometric View of SRC: Learning Representations for Stable Residual Inference

arXiv cs.LG · Vangelis P. Oikonomou · 2026-05-28

The paper formalizes residual-ordering stability in Sparse Representation Classification (SRC) through a geometric framework, analyzing class-conditional spans and projection residuals. It identifies geometric obstructions (span overlap, dominance, near-overlap) that collapse residual margins and derives a lower bound under coverage/separation assumptions. The authors propose geometry-shaping objectives promoting within-class self-expressiveness while discouraging cross-class reconstruction, without using SRC during training. Experiments on COIL-100, TREC, and EEG data evaluate representations under fixed SRC/OMP inference, reporting residual margins and geometric diagnostics.

sparse representation classificationresidual marginclass-conditional spansgeometry-shaping objectivesself-expressiveness

Eigen-Spike Emergence and Quadratic Equivalents for Conjugate Kernels on Nonlinearly Separable Data

arXiv cs.LG · Collin Cranston, Zhichao Wang, Todd Kemp, Michael W. Mahoney · 2026-05-28

The paper introduces a quadratic equivalent for conjugate kernels (CK) to analyze nonlinear learnability on nonlinearly separable data, specifically the XOR problem. Using random matrix theory (RMT), the authors develop deterministic equivalents to study emergent informative spikes (eigen-spikes) in CK matrices, which align with XOR labels. They derive a BBP-type phase transition for linear classification via CK eigenvectors, examining effects of sample complexity, SNR, activation functions, and pretrained features. The results bridge RMT tools with practical ML scenarios, providing theoretical insights into nonlinear feature maps.

conjugate kernelrandom matrix theorynonlinear separabilitybbp-type transitioneigen-spike

AMDP: Asynchronous Multi-Directional Pipeline Parallelism for Large-Scale Models Training

arXiv cs.LG · Ling Chen, Houming Wu, Wenjie Yu · 2026-05-28

Asynchronous Multi-Directional Pipeline Parallelism (AMDP) is introduced to enhance large-scale model training by mitigating parameter mismatch and maintaining high utilization. AMDP restricts the first pipeline stage to process at most two minibatches before backpropagation, limiting parameter updates between forward and backward passes. It concurrently launches multiple pipelines, adapts their count based on pipeline depth, and accumulates gradients across minibatches for a single update, ensuring bounded parameter mismatch within one optimization step. Experiments on GPT- and BERT-style models show that AMDP significantly accelerates training while preserving convergence.

pipeline parallelismparameter mismatchbackpropagationminibatchesgradient accumulation

Matching Rates and Optimal Allocation for Federated Probe-Logit Distillation under Heterogeneous Bandwidth Budgets

arXiv cs.LG · Prasanjit Dubey, Xiaoming Huo · 2026-05-28

The work establishes tight minimax rates and optimal bandwidth allocation for federated probe-logit distillation (FPLD) under heterogeneous bandwidth constraints. The authors prove a matching lower bound of $Ω(K^{-1} \cdot 2^{-2B/V})$ for single-round FPLD, showing prior upper bounds are tight, and demonstrate that multi-round quantization with residual coding achieves $O(K^{-1} \cdot 2^{-2TB/V})$. They derive an optimal log-tilted water-filling allocation rule $B_i^* = B_{\mathrm{tot}}/K + (V/2) \log_2(w_i / \bar{w}_g)$ for heterogeneous budgets, with a plug-in adaptive variant achieving $1 + O(\sqrt{\log(K/δ)/(m T_0)})$ suboptimality. Experiments on n-gram models confirm theoretical bounds and show the allocation rule outperforms uniform baselines.

federated learningminimax rateslogit distillationbandwidth allocationquantization

MoSSP: A Momentum-Based Single-Loop Stochastic Penalty Method for Nonconvex Constrained DC-Regularized Optimization

arXiv cs.LG · Luxuan Li, Chunfeng Cui, Xiao Wang · 2026-05-28

The authors propose MoSSP, a momentum-based single-loop stochastic penalty method for nonconvex constrained optimization with DC regularization, addressing the challenge of maintaining feasibility while achieving favorable oracle complexity. The method applies stochastic proximal-gradient steps to the Moreau envelope of the penalty plus the convex DC part, with parallel computation of the concave part's proximal mapping. Two variants are developed: a Polyak-momentum version with $O(\varepsilon^{-4})$ complexity for $\varepsilon$-KKT points, and an improved $O(\varepsilon^{-3})$ version using recursive momentum, both validated experimentally.

nonconvex optimizationdc regularizationstochastic penalty methodmoreau envelopeoracle complexity

Relational Rank Geometry in Transformers: Detecting and Steering Hidden-State Relation Frames

arXiv cs.LG · Mazen Kobrosly · 2026-05-28

The paper introduces relational rank geometry as a novel method for analyzing transformer hidden states, focusing on higher-order relations among token tuples rather than local features. Using Plucker sign entropy, the authors detect consistent orientation signatures for true relation tuples (r=3-6) in Llama models (8B-405B), showing robustness across surface variations. Through controlled interventions on 70B and 405B models, they demonstrate that steering hidden-state relation frames toward clean targets recovers correct behavior and geometry, while placebo controls fail. This establishes a framework for both probing and manipulating relational structures in transformers.

relational rank geometryplucker sign entropyhidden-state interventiontoken tuplestransformer probing

MōLe-Λ: Learning the Coupled-Cluster Response State for Energies, Gradients, and Properties

arXiv cs.LG · Andreas Burger, Luca Thiede, Abdulrahman Aldossary, Jorge A. Campos-Gonzalez-Angulo · 2026-05-28

MōLe-Λ extends Molecular Orbital Learning (MōLe) to predict the full coupled-cluster singles and doubles (CCSD) response state by jointly learning right-hand ($T_1,T_2$) and left-hand ($Λ_1,Λ_2$) amplitudes from localized Hartree-Fock orbitals. The architecture preserves MōLe's equivariant encoder, odd sign-equivariant decoding, and size-extensivity while adding $Λ$-amplitude readouts mirroring $T$-amplitude symmetry constraints. The model achieves CCSD-quality energies and forces while recovering dipoles, polarizabilities, electron density, and 2-electron observables, maintaining MōLe's computational speed advantage over full CCSD calculations.

coupled-cluster theoryquantum chemistryequivariant neural networksmolecular orbitalsresponse properties

Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models

arXiv cs.LG · Heqiang Qi, Wei Huang, Mingyuan Bai, Xiangming Meng · 2026-05-28

The paper introduces CLAD (Cluster-Level Attention-Guided Decoding), a training-free decoder for masked diffusion language models (MDLMs) that enables span-level parallel commitment. By grouping high-confidence token predictions into confidence-induced clusters (CICs) and using self-attention maps to resolve inter-cluster dependencies, CLAD achieves conflict-aware parallel decoding. Evaluations on LLaDA and Dream models across four reasoning and code-generation benchmarks demonstrate 1.77x--8.47x speedups over Vanilla decoding while maintaining comparable accuracy.

masked diffusion language modelsparallel decodingconfidence-induced clustersself-attention mapscluster-level commitment

FPLIER: Federated Pathway-Level Information Extractor

arXiv cs.LG · Daniele Malpetti, Christian Berchtold, Francesco Gualdi, Marco Scutari · 2026-05-28

FPLIER introduces a federated learning extension to the Pathway Level Information Extractor (PLIER) for transcriptomics, enabling privacy-preserving distributed training across multiple data holders while incorporating public datasets. The method employs secure aggregation to produce updates algebraically equivalent to centralized training while keeping expression data local. Evaluations on simulated consortia (K-CLIER and MultiPLIER) demonstrate stable convergence, with privacy analysis showing membership inference risk depends on training matrix rank, approaching random guessing in full-rank regimes via public data incorporation or dimensionality reduction.

federated learningtranscriptomicssecure aggregationmembership inferencepathway analysis

PEARL: Training Socratic Tutors with Pedagogically Aligned Reinforcement Learning

arXiv cs.LG · Qikai Chang, Zhenrong Zhang, Linbo Chen, Pengfei Hu · 2026-05-28

PEARL introduces a reinforcement learning framework for training Socratic tutoring agents, addressing challenges in student simulation, reward modeling, and multi-objective optimization. The framework comprises a controllable student simulator that decouples cognitive states from response generation, a generative reward model evaluating pedagogical quality and correctness, and a stable multi-objective RL scheme that discretizes rewards and aggregates normalized advantages. Evaluations on multiple benchmarks demonstrate PEARL's superior performance among open-source models and competitiveness with proprietary LLMs, achieving these results with a 30B parameter policy model.

socratic tutoringstudent simulatorgenerative reward modelmulti-objective rlpolicy optimization

On the Construction and Implications of Low-Loss Valleys in LoRA-based Bayesian Inference

arXiv cs.LG · Daniel Dold, Emanuel Sommer, Julius Kobialka, Oliver Dürr · 2026-05-28

The paper introduces LoRA-Curve, a Bézier curve parameterization in LoRA space that connects independently fine-tuned optima through continuous low-loss valleys, addressing the challenge of epistemic uncertainty estimation in parameter-efficient fine-tuning. Two variants are proposed: a free configuration optimizing all control points jointly, and an anchored configuration linking pre-trained LoRA optima. Theoretical analysis proves pathwise continuity and Lipschitz regularity, while experiments on Qwen2.5 7B show anchored curves bypass loss barriers encountered by linear interpolation. Combined with flat-minima perturbations and Jensen-Shannon regularization, the method increases predictive distribution mutual information without performance loss, demonstrating functional diversity through continuous parameter-space traversal.

low-rank adaptationbayesian model averagingbézier curveepistemic uncertaintyfunctional diversity

Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

arXiv cs.LG · Jing Huang, Daniel Wurgaft, Rachit Bansal, Laura Ruis · 2026-05-28

The study investigates why larger models outperform smaller ones by analyzing capacity, interference, and rare-task retention. Using synthetic task mixtures and scaling experiments with OLMo models (4M to 4B parameters), the authors demonstrate that smaller models prioritize high-frequency or low-complexity tasks due to resource competition, while larger models reduce interference by allocating sufficient resources to common tasks, preserving rare-task features. Results show larger OLMo models excel at infrequent and complex tasks, embedding more task features with less gradient interference. The findings provide a data-centric explanation for scaling benefits, informing model sizing and training data decisions.

power-law scalinggradient interferencerare-task retentionmodel capacitydata-centric bottleneck

The Complexity of Verifying Feedforward Neural Networks in Quantised Settings

arXiv cs.LG · Eric Alsmann, Martin Lange, Marco Sälzer · 2026-05-28

The paper establishes a complexity landscape for verifying feedforward neural networks (FNNs) under quantized arithmetic, analyzing three network classes: rational FNNs (exact weights), quantized FNNs (finite-width arithmetic weights), and dynamically quantized FNNs (rational networks evaluated with finite-width arithmetic). Using linear programming (LP) and bit-vector (BV) specifications, the authors prove NP-completeness for quantized FNN verification under both LP and BV constraints, matching rational FNN complexity. For dynamically quantized FNNs with BV specifications, they provide upper bounds complementing known PSPACE-hardness results.

neural network verificationquantized arithmeticnp-completenessbit-vector specificationslinear programming

AsymVLM: Asymmetric Token Pruning for Efficient Vision-Language Model Inference

arXiv cs.LG · Yilin Feng, Ahmed Burak Gulhan, Mahmut Taylan Kandemir · 2026-05-28

AsymVLM introduces asymmetric token pruning for efficient vision-language model inference, exploiting modality-specific properties: aggressive prefill pruning for spatially redundant vision tokens via learned importance scoring with adaptive budgets, and threshold-based eviction for text tokens during decoding. The method achieves 54% FLOPs reduction, outperforming prior approaches by 2-3% on document/chart understanding tasks with localized visual information while maintaining accuracy on holistic benchmarks. In text-dominated scenarios, it surpasses standard LLM cache compression by adapting to VLM's short-context nature.

vision-language modelstoken pruningadaptive budgetingflops reductioncache compression

Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

arXiv cs.LG · S. Sutharya, Remya K. Sasi · 2026-05-28

The paper introduces CAFNet, a 576k-parameter model for detecting and localizing partially manipulated audio (half-truths). It jointly performs ternary classification (real, fully-fake, or half-truth) and temporal boundary regression via cross-attentive fusion of MFCC, LFCC, and Chroma-STFT features, followed by a BiLSTM head. On the MLADDC T2+T3 test set, CAFNet achieves 92.71% accuracy (0.9910 AUC) for ternary classification and 0.075s MAE for boundary localization, outperforming XLS-R 300M and AST 87M with 500× fewer parameters. Cross-dataset analysis reveals fine-tuning-induced representation collapse.

audio deepfakehalf-truth localizationcross-attentionmfccbilstm

Learning to Perturb Hidden Representations for Generalizable Deep Learning

arXiv cs.LG · Hua Li · 2026-05-28

The paper introduces Learning to Perturb Activations (LPA), a unified framework for hidden activation perturbation in deep neural networks. LPA adaptively perturbs activations at selected layers using class-level perturbations learned via PGD, contrasting with class-agnostic methods like Dropout. Theoretical analysis links activation perturbation to flat minima and layer-wise amplification. Experiments on balanced classification, long-tail classification, and domain generalization show LPA outperforms existing methods and complements logit perturbation techniques like LPL.

activation perturbationpgdflat minimadomain generalizationlpa

K-FinHallu: A Hallucination Detection Benchmark for Multi-Turn RAG in Korean Finance

arXiv cs.LG · Eunbyeol Cho, Yunseung Lee, Mirae Kim, Jeewon Yang · 2026-05-28

The authors introduce K-FinHallu, the first benchmark for hallucination detection in multi-turn Korean financial RAG, addressing gaps in existing single-turn, English-centric evaluations. They construct multi-turn dialogues from authentic Korean financial documents, injecting hallucinations under a hierarchical taxonomy that incorporates context answerability and justified abstention. Benchmarking frontier and open-source LLMs reveals persistent challenges in fine-grained financial diagnostics and refusal behavior, with fine-tuned 8B models achieving competitive performance but justified abstention remaining a weak point across all models.

retrieval-augmented generationhallucination detectionmulti-turn dialoguekorean financial domainjustified abstention

DynaGraph: Lightweight Multi-Model Interaction Framework via Dynamic Topological Reconfiguration

arXiv cs.LG · Yanxing Guo, Zihao Zheng, Fangzhou Wu, Ling Liang · 2026-05-28

DynaGraph introduces a lightweight multi-model framework for complex reasoning tasks via dynamic topological reconfiguration, addressing computational redundancy and cascading errors in monolithic LLMs. The method multiplexes time-division PEFT adapters over a shared base model, enabling full system training and inference on a single GPU, while an Evaluator monitors execution confidence for hierarchical self-healing (Fine-grained Patching and Subgraph Reconstruction). Experiments on StrategyQA, MATH, and FinQA show an 8B model achieves 87.6% and 82.7% accuracy, respectively, approximating a 72B monolithic model while reducing latency by 68.1% and token consumption by 68.6%.

dynamic topological reconfigurationpeft adaptershierarchical self-healingtime-division multiplexingcascading errors

On-Policy Replay for Continual Supervised Fine-Tuning

arXiv cs.LG · Yan Chen, Taojie Zhu, Meng Zhang, Xin Chen · 2026-05-28

On-Policy Replay (OPR) is introduced as a method for continual supervised fine-tuning (SFT) of large language models (LLMs) to mitigate catastrophic forgetting. OPR leverages on-policy signals by rolling out the most recent checkpoint on historical prompts, filtering generations via task reward, and replaying surviving (prompt, response) pairs as standard SFT examples, eliminating the need for auxiliary losses or teacher models. Evaluated on three 7--8B instruction-tuned models (Qwen2.5-7B-Instruct, Qwen3-8B, Llama3.1-8B-Instruct) using the TRACE benchmark, OPR reduces forgetting significantly, improving backward transfer (BWT) by 46% over a Vanilla Replay baseline at a 10% replay budget. A KL-shrinkage interpretation unifies OPR with prior on-policy distillation methods, revealing that the on-policy distribution, not response quality alone, drives performance.

continual supervised fine-tuningon-policy replaycatastrophic forgettingbackward transferkl-shrinkage

Gradient Perturbation: Learning to Perturb Gradients for Adaptive Training

arXiv cs.LG · Hua Li · 2026-05-28

The paper introduces Learning to Perturb Gradients (LPG), a unified framework for gradient perturbation during neural network training. LPG adaptively perturbs logit-level gradients at the class level, treating gradient norm amplification as positive augmentation and dampening as negative augmentation. The method connects gradient perturbation bounds to generalization guarantees via PAC-Bayesian analysis. Experiments on balanced classification, long-tail classification, and noisy label learning show LPG consistently outperforms baselines and integrates as a plug-in module with existing methods like SAM and gradient clipping.

gradient perturbationpac-bayesian analysislogit-level gradientsadaptive traininggeneralization guarantees

Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging

arXiv cs.LG · Yuanyi Wang, Yanggan Gu, Su Lu, Yifan Yang · 2026-05-28

MergePipe introduces a budget-aware execution layer for weight-space model merging in LLMs, framing it as an expert access-set problem. The method indexes parameter blocks, constructs deterministic access plans under I/O constraints, and executes merges with replayable manifests while bounding omitted-update error via delta norms. Evaluations on Qwen and Llama workloads demonstrate up to 11× speedups and 10× I/O reduction, with parameter deviations below O(10^-3) from full merges and no performance degradation on downstream tasks.

weight-space mergingexpert access-seti/o budgetparameter deviationdelta norms

Composing Non-Conjugate Factor Graphs with Closed-Form Variational Inference

arXiv cs.LG · Mykola Lukashchuk, Kyrylo Yemets, Wouter M. Kouw, Dmitry Bagaev · 2026-05-28

The paper demonstrates closed-form variational inference for models composed from five probabilistic primitives: bilinear factor, exponential link, Gamma prior, Gaussian likelihood, and equality node. By preserving specific message families (Gaussian for Gaussian variables, Gamma for precision variables) and handling the non-conjugate exponential link via moment-generating functions, the method enables tractable message passing in deep architectures. Results show compositionality from static ensembles to split-branch routing, achieving universal approximation via decision tree encoding, with applications in Bayesian mixture-of-experts time-series forecasting yielding calibrated uncertainty on five benchmarks.

variational inferencefactor graphsmessage passinggamma priorgaussian likelihood

Deep Optimal Individualized Treatment Rules for Bivariate Survival Outcomes via Adaptive Prediction-Powered Learning

arXiv cs.LG · Kun Ren, Yifan Cui, Wen Su · 2026-05-28

The paper proposes a deep learning framework for deriving optimal individualized treatment rules (ITRs) that maximize joint survival probability beyond fixed time points in randomized trials with bivariate survival outcomes. The method combines stochastic policy modeling with marginal accelerated failure time models linked via copula functions to handle bivariate dependence, while addressing right censoring. An adaptive prediction-powered learning approach integrates auxiliary machine learning predictions to enhance decision robustness. Theoretical guarantees and empirical evaluations demonstrate improved performance over existing methods in survival analysis tasks.

individualized treatment rulesbivariate survival analysisaccelerated failure timeprediction-powered learningstochastic policies

Honest Lying: Understanding Memory Confabulation in Reflexive Agents

arXiv cs.LG · Prakhar Dixit, Sadia Kamal, Tim Oates · 2026-05-28

The paper identifies memory confabulation as a systematic failure mode in Reflexion-style agents, where agents persistently act on incorrect self-generated reflections despite environmental resets. It introduces the Reflection Repetition Rate (RRR) to quantify repeated reliance on erroneous reflective content. Experiments on ALFWorld and HumanEval reveal cases where agents fail to mention correct task elements in reflections (0% accuracy in 16 ALFWorld environments). A mitigation strategy replaces open-ended self-diagnosis with programmatic extraction of trajectory-level failure signals, increasing correct object mention to 86%, reducing RRR from 0.64 to 0.10, and solving 3 of 16 frozen ALFWorld environments.

memory confabulationreflection repetition ratereflexion-style agentstrajectory-level failure signalsalfworld

Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models

arXiv cs.LG · Rohan Shravan · 2026-05-28

Kronecker Embeddings introduce a parameter-efficient alternative to traditional embedding tables in large language models by using a deterministic byte-level character-position factorization. The method replaces the standard |V| x d_model embedding table with a fixed encoder and single learned projection, reducing input-side parameters by 91-94%. Evaluations across six LMs (135M-671B parameters) demonstrate superior performance: 2.5% lower validation loss, 8.2 pp higher top-1 prediction accuracy on spelling-robustness probes, and stable projection norms. The approach also enables runtime memory reduction from 2.15 GB to 4.5 MB with minimal overhead.

kronecker embeddingsbyte-level factorizationparameter-efficientbpe tokenizersembedding norm drift

A Full-Pipeline Framework for Evaluating Membership Inference Attacks in Machine Learning

arXiv cs.LG · Ding Chen, Xinwen Cheng, Xuyang Zhong, Xinping Chen · 2026-05-28

The authors introduce a full-pipeline evaluation framework for Membership Inference Attacks (MIAs) to systematically assess privacy risks across machine learning pipelines, including data, architectures, algorithms, and post-training modules. The framework employs three metrics—Balanced Accuracy, TPR at low FPR, and TNR at low FNR—to account for varying misclassification costs and formalizes two standardized threat models for equitable benchmarking. Empirical results show MIA efficacy is highly sensitive to threat models and evaluation metrics, yielding actionable guidelines and an auditing toolkit for practitioners.

membership inference attacksprivacy auditingmachine unlearningthreat modelsevaluation metrics

Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs

arXiv cs.LG · Qian Chang, Ciprian Doru Giurcaneanu, Runsong Jia, Xia Li · 2026-05-28

The paper proposes Dual-Scale Retentive Dynamics (DSRD), a unified framework for dynamic graph representation learning that jointly models temporal and structural adaptation. DSRD introduces (i) a retentive state with dual-scale adaptation combining temporal memory and structural context in a recurrent formulation, and (ii) adaptive decay kernels with learnable time-sensitivity parameters for automatic balance between short-term responsiveness and long-term retention. Theoretical analysis provides stability guarantees and equivalence between parallel and recurrent updates. Experiments on 14 benchmarks show state-of-the-art performance in link prediction and node classification across transductive and inductive settings.

dynamic graphsrepresentation learningadaptive decayretentive statedual-scale adaptation

How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions

arXiv cs.LG · Jeff A. Bilmes, Gantavya Bhatt, Arnav M. Das · 2026-05-28

The paper establishes that neural scaling laws and the Vendi Score belong to a broader class of submodular objectives called matrix spectral functions, which includes determinantal point processes (DPPs). It introduces weakly matrix monotone functions to derive weakly submodular variants and proposes secular-equation-based updates for efficient greedy optimization, achieving 35,000x speedups on ImageNet-1K. Empirical comparisons show facility location outperforms Vendi Score and DPPs in predicting subset utility for test performance, while revealing limitations of pushing Vendi Score to extreme values. Results demonstrate that dataset size, class balance, and budget alone cannot fully capture data value.

submodular objectivesmatrix spectral functionsvendi scoresecular-equationdeterminantal point processes

AliMark: Enhancing Robustness of Sentence-Level Watermarking Against Text Paraphrasing

arXiv cs.LG · Yuexin Li, Wenjie Qu, Linyu Wu, Yulin Chen · 2026-05-28

AliMark enhances sentence-level watermarking robustness against structural perturbations like sentence splitting and merging by reformulating watermarking as a bit sequence alignment problem. The method employs a two-stage detection strategy: generating multiple restructured text variants and adaptively aligning their extracted bit sequences with a secret sequence to minimize alignment cost. Experiments show AliMark outperforms state-of-the-art baselines under diverse paraphrasing attacks, including those from DIPPER and GPT-3.5.

watermarkingparaphrasingbit sequencealignment coststructural perturbations

When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs

arXiv cs.LG · Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu · 2026-05-28

The study investigates persona prompting's conditional effectiveness through a controlled comparison of four prompting methods (no role, generic domain-expert, embedding-based retrieval, hybrid retrieval) across 1,140 questions spanning 38 expert roles and six domains. Results reveal a tradeoff obscured by aggregate metrics: role prompting increases expertise depth but reduces clarity, with domain-specific variations. Hybrid retrieval outperforms embedding-only selection, yet the fundamental tradeoff persists. Findings indicate persona prompting reshapes response characteristics rather than universally enhancing capability, emphasizing the need for multi-metric evaluation.

persona promptingexpert role injectionembedding-based retrievalmulti-metric evaluationin-context learning

Constructing efficient channels for ideal observers using the conjugate gradient method

arXiv cs.LG · Weimin Zhou · 2026-05-28

The study introduces a conjugate gradient (CG)-based method for constructing efficient channels to approximate ideal observer performance in medical imaging systems. Addressing computational intractability in high-dimensional image data, the proposed method facilitates dimensionality reduction for both Bayesian Ideal Observer (IO) and Hotelling observer (HO) frameworks. This approach enables objective assessment of image quality (IQ) through computationally feasible figures of merit (FOMs) for signal detection tasks.

ideal observerhotelling observerconjugate gradientdimensionality reductionsignal detection

Real-Time Retargeting Using Controllability Boundary for Chandrayaan-3 Lunar Landing

arXiv cs.LG · Suraj Kumar, Debjyoti Chakrabarti, Aditya Rallapalli, Bharat Kumar GVP · 2026-05-28

The paper contributes a real-time retargeting guidance policy for Chandrayaan-3's lunar landing, combining fuel-optimal baseline trajectories with contingency retargeting capability. The method employs a convex controllability boundary representation to enable rapid feasibility checks and target updates during descent, marking the first operational use of data-driven retargeting in lunar missions. Pre-flight simulations and actual flight data demonstrate the framework's effectiveness in ensuring safe landing site transitions.

retargeting guidancecontrollability boundaryfuel-optimal descentconvex representationlunar landing

The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction

arXiv cs.LG · Shu Wan, Abhinav Gorantla, Huan Liu, K. Selçuk Candan · 2026-05-28

The study evaluates the practical utility of Markov boundaries for tabular prediction on SCM3K, a synthetic benchmark with 3,450 tasks spanning 40-1000 features. While oracle Markov boundaries improve prediction accuracy (especially in high-dimensional sparse settings), standard causal discovery pipelines fail to reliably recover beneficial feature subsets due to optimization misalignment, asymmetric error costs, and the existence of multiple predictive feature sets. Results reveal a gap between theoretical promise and empirical feasibility, prompting recommendations for prediction-aligned feature selection and structure-aware tabular models.

markov boundarytabular predictioncausal discoveryfeature selectionscm benchmark

Information-Directed Offline-to-Online Reinforcement Learning

arXiv cs.LG · Keru Chen · 2026-05-28

The paper introduces information-directed sampling (IDS) for offline-to-online reinforcement learning, formalizing residual uncertainty via conditional mutual information between learning targets and online trajectories post-offline conditioning. IDS, parameterized by η≥0, optimizes action selection by balancing regret and information gain, inheriting Bayesian regret bounds from Thompson sampling. Theoretical analysis in a Bayesian linear-reward model shows IDS achieves Õ(Hd min{√T, T√(C†_{β,IDS₀}(N,T)/N)}), with coverage coefficients tied to IDS-induced visitation. Empirical validation in bandits and D4RL demonstrates IDS excels when offline data is informative but leaves biased/low-probability uncertainties.

offline-to-online rlinformation-directed samplingbayesian regretconditional mutual informationcoverage coefficient

Rethinking Post-Training Recipes for Multimodal Time-Series Forecasting

arXiv cs.LG · Haoxin Liu, Yichen Zhou, Rajat Sen, B. Aditya Prakash · 2026-05-28

PostTime introduces a multimodal time-series forecasting approach that post-trains LLMs to revise numerical TSFM priors using multimodal context. The method combines Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR), generating automated reasoning traces for forecast revisions. The LLM learns to conditionally revise, preserve, or ignore TSFM predictions based on context. Evaluated on the TimesX benchmark using Gemma-3-4B and TimesFM-2.5, PostTime outperforms standalone TSFMs, LLM-only baselines, and existing multimodal forecasting methods.

multimodal forecastingsupervised fine-tuningreinforcement learningtime-series foundation modelsautomated reasoning

GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

arXiv cs.LG · Xiaohang Tang, Keyue Jiang, Che Liu, Qifang Zhao · 2026-05-28

We propose Guided Denoiser Self-Distillation (GDSD), a reinforcement learning method for diffusion large language models (dLLMs) that bypasses training--inference mismatch biases inherent in ELBO-based approaches. GDSD directly distills the denoiser from an advantage-guided self-teacher derived from reverse-KL regularized RL, using a normalization-free objective to match logits without likelihood estimation. This reduces RL to likelihood-free self-distillation, avoiding pathologies of ELBO-based methods. Evaluations on planning, math, and coding benchmarks with LLaDA-8B and Dream-7B show GDSD outperforms state-of-the-art ELBO methods, achieving up to +19.6% test accuracy with more stable reward dynamics.

diffusion language modelsreinforcement learningself-distillationreverse-klelbo

On the Optimizer Dependence of Neural Scaling Laws

arXiv cs.LG · Vansh Ramani, Shourya Vir Jain · 2026-05-28

This work demonstrates that the scaling exponent α in neural scaling laws L(N) ∝ N^{-α} systematically depends on optimizer choice, challenging the assumption that α is architecture- and data-determined. Through controlled random-feature regression experiments, the authors measure α across five optimizer variants and six spectral conditions. Preconditioned optimizers yield steeper scaling (larger α), with α-shifts peaking near s = 1.5 and remaining significant at s = 2.0. At s ≈ 1.0 (characteristic of natural language), natural gradient achieves α ≈ 0.31 versus α ≈ 0.12 for gradient descent, a 2.6× larger exponent that compounds with model-size doubling. The findings suggest scaling-law forecasts should account for optimizer choice.

scaling exponentrandom-feature regressionpreconditioned optimizersspectral conditionsnatural gradient

TRACER: Persistent Regularization for Robust Multimodal Finetuning

arXiv cs.LG · Hesam Asadollahzadeh, Feng Liu, Christopher Leckie, Sarah M. Erfani · 2026-05-28

The paper introduces TRACER, a method for robust multimodal finetuning that mitigates catastrophic forgetting in pretrained models. The authors develop a theoretical framework for multimodal contrastive finetuning, demonstrating that self-distillation outperforms other regularization approaches. They identify a collapse issue in standard EMA teachers and propose a Weighted Moving Average (WMA) teacher that maintains persistent regularization and bias-free convergence. TRACER combines contrastive learning with WMA-guided multi-perspective distillation, showing consistent OOD accuracy and calibration improvements across three CLIP backbone architectures in experiments.

multimodal finetuningcontrastive learningcatastrophic forgettingweighted moving averageout-of-distribution robustness

BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base

arXiv cs.LG · Rohan Shravan · 2026-05-28

BrahmicTokenizer-131K introduces a 131K-vocabulary byte-level BPE tokenizer optimized for Brahmic scripts while maintaining performance on English, EU languages, and code. The method involves a two-stage retrofit: script-prune crop to reduce vocabulary size and surgical retrofit of dead slots via linear-programming allocation across Brahmic Unicode blocks. Results show 26.7% fewer tokens on Indic text compared to Mistral-Nemo Tekken/Sarvam-m, with per-language savings up to 76.79% (Odia), while matching o200k_base's English fertility and outperforming on coding/math benchmarks.

byte-level bpebrahmic scriptslinear-programming allocationtoken compressionunicode blocks

Deep Adaptive Dimension Reduction for Bayesian Inference in Inverse Problems

arXiv cs.LG · Yueyang Wang, Xili Wang, Kejun Tang, Xiaoliang Wan · 2026-05-28

The authors propose a deep adaptive dimension-reduction Bayesian inference framework for high-dimensional PDE-governed inverse problems, combining Variational Flow (VF) with nonlinear dimension reduction and dual normalizing flows. VF surpasses VAE by offering a higher evidence lower bound and flexible posterior approximation, while an iterative prior updating strategy automates prior tuning. Integrated with an adaptive Fourier Neural Operator surrogate, the method forms a closed loop for improved inference. Experiments on a 100D Rosenbrock problem and three PDE inverse problems demonstrate superior accuracy versus MCMC, UKI, and SVGD baselines, particularly in high-noise and high-dimensional settings.

variational flowdimension reductionbayesian inferencefourier neural operatorinverse problems

Kernel-based potential mean-field games with unbiased random Fourier $U$-statistics

arXiv cs.LG · Yumiharu Nakano · 2026-05-28

The authors introduce a computational framework for potential mean-field games with kernel-based interaction and terminal costs, leveraging reproducing-kernel maximum mean discrepancy (MMD) penalties. Costs are estimated via unbiased random Fourier U-statistics, ensuring linear batch-size complexity, while the drift is parametrized by a neural network trained with stochastic gradient descent. Theoretical guarantees include sample-level almost-sure convergence and explicit convergence rates under coupled conditions on penalty parameters, random-feature counts, sample sizes, and optimization tolerances. Experiments validate the method on high-dimensional Schrödinger bridge problems and electric vehicle charging coordination, demonstrating scalability and applicability to heterogeneous systems.

mean-field gamesmaximum mean discrepancyrandom fourier featuresschrödinger bridgestochastic gradient descent

Solving Integer Linear Programming with Parallel Tempering

arXiv cs.LG · Kyuil Sim, Sanghyeok Choi, Jinkyoo Park · 2026-05-28

We propose a solver-free, sampling-based optimization framework for Integer Linear Programming (ILP) that directly explores discrete feasible regions without training or external solvers. The method employs a Locally-Balanced Proposal to construct a transition kernel, avoiding gradient approximation, and integrates Parallel Tempering with both temperature and penalty tempering to address multimodal energy landscapes. Empirically, the approach outperforms SCIP across four benchmarks, matches or exceeds Gurobi on two tasks within a 200-second budget, and demonstrates superior robustness to distribution shifts compared to learning-based methods. On MIPLIB 2017 instances, it remains competitive with classical solvers without problem-specific tuning.

integer linear programmingparallel temperinglocally-balanced proposalpenalty temperingmultimodal energy landscapes

PassNet: Scaling Large Language Models for Graph Compiler Pass Generation

arXiv cs.LG · Yiqun Liu, Yingsheng Wu, Ruqi Yang, Enrong Zheng · 2026-05-28

PassNet introduces a large-scale ecosystem for LLM-based compiler pass generation, addressing performance ceilings in tensor compilers like TorchInductor. The framework includes PassNet-Dataset (18K computational graphs from 100K models) and PassBench (200 fusible tasks evaluated via Error-aware Speedup Score). Experiments show frontier models trail TorchInductor by 37% aggregate but achieve 3x speedup on individual subgraphs, indicating consistency bottlenecks. Fine-tuning a small model on 4K trajectories yields 2.67x improvement, validating PassNet as training infrastructure for LLM-driven optimization.

tensor compilerspass generationerror-aware speedup scorecomputational graphsllm-driven optimization

Neural-Behavioral Representation of Natural Whole-body Movement in Monkeys

arXiv cs.LG · Jieshi He, Puzhe Li, Yanan Sui, Mu-ming Poo · 2026-05-28

The study introduces a neural-behavioral framework for decoding natural whole-body movements in freely moving monkeys, addressing limitations in previous motor decoding research. The method combines large-scale epidural cortical signals from sensory- and motor-related areas with synchronized multi-view motion capture using a custom platform. An autoregressive encoder-decoder model was employed to reconstruct whole-body kinematics and learn a compact behavior prior. The model, conditioned on neural signals, decoded accurate and realistic whole-body movements without explicit physical constraints. This approach provides a novel proof-of-concept for decoding natural primate movements using large-scale intracranial neural activity.

neural-behavioral frameworkwhole-body kinematicsautoregressive encoder-decoderepidural cortical signalsmotor decoding

Harmless Yet Harmful: Neutral Prompting Attacks for Stealthy Hallucination Steering in Agent Skills

arXiv cs.LG · Chia-Yi Hsu, Chia-Mu Yu, Chun-Ying Huang, Jun Sakuma · 2026-05-28

The paper introduces Neutral Prompting Attack (NPA), a stealthy method to manipulate LLM-powered coding agents into generating hallucinated package names through semantically benign instructions. Unlike targeted attacks, NPA shifts dependency generation toward speculative packages without explicit malicious intent. Evaluations across coding-oriented LLMs show NPA increases Hallucination ASR and Pip Install ASR, alters hallucinated package distributions, and evades static-analysis, LLM-based, and agent-based Skill defenses, revealing covert software supply chain risks.

neutral prompting attackpackage hallucinationsoftware supply chainllm-powered agentsdependency generation

Attention as In-Context Empirical Bayes: A Two-Stage View via Particle Dynamics

arXiv cs.LG · Matthew Smart, Soumya Ganguly, Nilava Metya, Alexandre V. Morozov · 2026-05-28

The paper proposes a two-stage empirical Bayes interpretation of attention-only transformers under all-token corruption, revealing distinct statistical roles for depth (Stage 1: refining distributions via particle dynamics) and attention residuals (Stage 2: posterior inference via long-range skip-connections). The framework demonstrates that effective denoising emerges without explicit noise schedules, using fixed kernel bandwidths and finite integration horizons, while establishing posterior-mean recovery guarantees for well-behaved priors. Results connect attention mechanisms to in-context inference via sample-based posterior estimation, bypassing explicit density modeling.

empirical bayesattention mechanismsparticle dynamicsposterior estimationin-context learning

Mixing Vector Model for Copolymer Inference via Mixed Integer Linear Programming

arXiv cs.LG · Jianshen Zhu, Raveena Rai, Taiyo Sohkawa, Naveed Ahmed Azam · 2026-05-28

The study extends the mol-infer framework to copolymer inference by introducing the mixing vector (MV) model, a convex combination of MILP-tractable monomer descriptors weighted by mixing ratios. The method employs artificial neural networks, reduced quadratic multiple linear regression, and random forests for property prediction, achieving test R² scores exceeding 0.7 for 9/10 datasets and 0.9 for 6/10. The MILP-based inverse design remains tractable even for three-monomer systems, validated by external consistency checks. This work provides a computationally feasible approach to exact inverse design of copolymers under the two-layered model.

mixing vector modelmixed integer linear programmingcopolymer inferenceinverse designtwo-layered model

Reasoning-preserved Efficient Distillation of Large Language Models via Activation-aware Initialization

arXiv cs.LG · Junlin He, Yihong Tang, Tong Nie, Guilong Li · 2026-05-28

The paper introduces RED (Reasoning-preserved Efficient Distillation), a method to mitigate reasoning collapse in efficiently distilled large language models (LLMs). By analyzing geometric origins of reasoning degradation, the authors identify eRank collapse in hidden representations due to uneven singular value distribution in projection matrices. RED employs activation-aware initialization to initialize these matrices as channel-selection matrices, theoretically preserving effective rank. Experiments on Llama and Qwen series show RED recovers reasoning ability while maintaining state-of-the-art general performance and training efficiency.

reasoning collapseefficient distillationerank collapseactivation-aware initializationchannel-selection matrices

NeuroEdge: Real-Time Hand Gesture Recognition with High-Density EMG Using Deep Learning at the Edge

arXiv cs.LG · Peter Chudinov, Zhenyu Lin, Jay Motamarry, Srihita Panati · 2026-05-28

NeuroEdge introduces a real-time hand gesture recognition system using high-density electromyography (HD-EMG) and deep learning on microcontrollers. The system comprises an HD-EMG StreamBridge for wireless data streaming and an EdgeDL Inference Engine with a lightweight 1D CNN for embedded inference, leveraging DMA and SPI for low-latency performance. Experiments demonstrate 90% accuracy across seven gestures with 83 ms latency using 192 HD-EMG channels, enabling next-generation neural-machine interfaces on edge devices.

high-density electromyographyneural-machine interfacesembedded inferencedirect memory accessserial peripheral interface

GrepSeek: Training Search Agents for Direct Corpus Interaction

arXiv cs.LG · Alireza Salemi, Chang Zeng, Atharva Nijasure, Jui-Hui Chung · 2026-05-28

The paper introduces GrepSeek, a direct corpus interaction (DCI) search agent that trains a compact model to execute shell commands for evidence retrieval. To address RL instability, it employs a two-stage pipeline: (1) generating verified search trajectories via an answer-aware Tutor and answer-blind Planner, and (2) refining the policy with Group Relative Policy Optimization (GRPO). A sharded-parallel execution engine achieves 7.6× speedup while maintaining byte-exact equivalence. Evaluations on seven QA benchmarks demonstrate superior token-level F1 and Exact Match, though lexical interaction struggles with surface-form variation.

grepseekdirect corpus interactiongroup relative policy optimizationsharded-parallel executionsurface-form variation

Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts

arXiv cs.LG · Mengdi Chu, Yang Liu, Ayan Biswas, Han-Wei Shen · 2026-05-28

The study evaluates the generalizability of physics foundation models through a systematic benchmark comprising 8 physical dynamics, 3 training-data mixtures, and 25 test regimes across in-distribution, distribution-shift, and out-of-distribution settings. Testing five architectures with four variants each (totaling 60,000 measurements), results reveal these models act as conditional rather than universal generalists, with performance dependent on physical regime, temporal scale, initial conditions, pretraining, model size, and architecture. Pretraining and scaling fail to reliably mitigate biases, suggesting future work must focus on mechanisms for transferable physical knowledge across diverse conditions.

physics foundation modelsdistribution shiftsspatiotemporal forecastinggeneralizability benchmarkconditional generalists

LoopFM: Learning frOm HistOrical RePresentations of Foundation Model for Recommendation

arXiv cs.LG · Shali Jiang, Hua Zheng, Boyang Liu, Laming Chen · 2026-05-28

LoopFM introduces a knowledge transfer framework for recommendation systems by leveraging historical intermediate embeddings from foundation models (FMs) as input features for vertical models (VMs), avoiding real-time FM inference. The method structures FM embeddings as sequential user history inputs, theoretically analyzed via gain decomposition and transfer-ratio metrics. Experiments on three benchmarks show AUC improvements (e.g., +6% on TaobaoAd) and complementary gains with knowledge distillation, while industrial deployments achieved +0.5% to +1.22% conversion improvements with doubled transfer ratios.

knowledge distillationfoundation modelsrecommendation systemsintermediate embeddingstransfer ratio

A Theoretical and Experimental Study of a Novel Adaptive Learning Algorithm

arXiv cs.LG · Sakshi Kumari, Shyam Kumar M, Sushmitha P · 2026-05-28

The paper introduces C-Adam, a novel adaptive optimizer addressing convergence issues in Adam and AMSGrad. Utilizing a line-of-sight approach, C-Adam theoretically guarantees convergence, validated through numerical experiments. The study critically reviews Adam and AMSGrad, highlighting their non-convergence behaviors, and demonstrates C-Adam's superior performance in minimizing loss functions with reduced computational cost and oscillations.

adaptive optimizationconvergence guaranteeamsgradc-adamline-of-sight

Causal Label Recovery in Payment Networks

arXiv cs.LG · Gaurav Dhama · 2026-05-28

The Sequential Triply Robust (STR) estimator achieves the minimax lower bound for fraud detection in payment networks by correcting four systematic impairments: authorization, issuer reporting, delay, and corruption. STR formalizes the observation pipeline as a sequential missing-data problem with three propensity stages and a corruption layer, leveraging noise-rate-adjusted pseudo-labels, empirical Bayes shrinkage, and a plug-in variance estimator. It is sequentially triply robust, requiring only correct specification of either the propensity model or outcome regression at each gate. STR reduces training delay from months to days, minimizes mean squared error, and provides valid confidence intervals and finite-sample guarantees via Bernstein concentration inequality.

sequential triply robustpropensity modelnoise-rate-adjustedempirical bayesminimax lower bound

CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval

arXiv cs.LG · Vaishali Senthil, Ashutosh Hathidara, Sebastian Schreiber · 2026-05-28

CoHyDE introduces an iterative co-training framework for tool retrieval, combining a dense encoder and LLM rewriter to bridge the vocabulary gap between colloquial user queries and technical API catalogs. The method alternates between InfoNCE-based encoder training on LLM-generated hypothetical descriptions and DPO-based alignment of the rewriter using encoder retrieval scores. On a 10k-tool subset of ToolBench, CoHyDE improves NDCG@5 by +2.5 pp on standard queries and +6.3 pp on vague queries versus baselines, with up to +8 pp gains on the hardest cases. Ablations confirm co-training's necessity, as isolated components underperform by up to -8 pp on vague queries.

tool retrievaldense encoderllm rewriterinfoncepreference alignment

Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits

arXiv cs.LG · Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang · 2026-05-28

The paper introduces BaSE (Bandit-based Self-Evolving), a multi-armed bandit method for optimizing compute allocation in LLM-guided evolutionary search. Analyzing depth-breadth tradeoffs across five models and three tasks, the authors identify two empirical patterns: a fitness-compute envelope where capability ordering collapses on effective FLOPs, and a bilinear depth-breadth fit with task-specific interactions. BaSE dynamically allocates LLM calls across parallel trajectories, improving mean fitness by 12.3% over island-protocol baselines across 8 (model, task) pairs, with particularly strong gains in high-variance settings.

evolutionary searchmulti-armed banditcompute allocationllm-guidedfitness-compute envelope

When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop

arXiv cs.LG · Yang Zhang, Xiukun Wei, Xueru Zhang · 2026-05-28

The paper analyzes how human curation affects preference alignment in multi-model self-consuming training loops, contrasting with prior single-model studies. It formalizes a framework for interacting self-consuming models, characterizing convergence conditions for the resulting dynamical system. Results demonstrate that while human curation improves alignment in isolated models, cross-model interactions in multi-agent settings can dampen or invert this effect, leading to degraded long-term alignment. The study reveals complex emergent behaviors not present in single-model scenarios.

self-consuming looppreference alignmentmulti-model traininghuman curationdynamical system

Robust Frequency-Calibrated Virtual EEG Channel Generation from Four Frontal Electrodes for Wearable EEG Augmentation

arXiv cs.LG · Minghao Xiao · 2026-05-28

FAVC-Net introduces a frequency-calibrated virtual-channel network that generates 13 unmeasured EEG channels from four frontal electrodes (Fp1, Fp2, F7, F8) for wearable EEG augmentation. The method combines multi-scale source encoding, source-state embeddings, target-conditioned mixing, GATv2 attention refinement, and spectral calibration to jointly optimize waveform and spectral fidelity. On the PRED+CT dataset, FAVC-Net reduced log-spectral distance and PSD KL divergence by 30.09% and 37.98% versus baselines, demonstrating robustness to source perturbations while maintaining spectral integrity.

eeg augmentationvirtual-channel generationspectral calibrationgatv2 attentionwearable neurotechnology

KLAS: Using Similarity to Stitch Neural Networks for Improved Accuracy-Efficiency Tradeoffs

arXiv cs.LG · Debopam Sanyal, Anantharaman Iyer, Alind Khare, Trisha Jain · 2026-05-28

KLAS introduces a novel stitch selection framework for optimizing accuracy-efficiency tradeoffs in neural networks by leveraging KL divergence between intermediate representations. The method automates stitch selection across model families, identifying optimal binary stitches from O(k^2n^2) possibilities for k pretrained models of depth n. Comprehensive experiments demonstrate that KLAS improves the accuracy-efficiency curve, achieving up to 1.21% higher ImageNet-1K top-1 accuracy at the same computational cost or maintaining accuracy with a 1.33× reduction in FLOPs.

kl divergencestitch selectionaccuracy-efficiency tradeoffintermediate representationsflops

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

arXiv cs.LG · Jinheon Baek, Soyeong Jeong, Sangwoo Park, Woongyeong Yeo · 2026-05-28

OmniRetrieval introduces a unified retrieval framework for heterogeneous knowledge sources, preserving structural distinctions while enabling cross-source querying. The system processes natural-language queries, identifies relevant sources (text, relational tables, knowledge graphs), and dispatches source-native queries to their respective execution engines. Evaluated across 13 datasets and 309 distinct knowledge bases, OmniRetrieval outperforms single-source baselines, demonstrating effective integration of diverse knowledge representations without structural homogenization.

heterogeneous retrievalknowledge graphsrelational tablesnatural-language queryingmulti-source integration

Prediction-Powered Inference Across Many Tasks for AI Evaluation & Social Science Research

arXiv cs.LG · Nicolas Emmenegger, Ellery Stahler, Chara Podimata · 2026-05-28

We introduce a multi-task prediction-powered inference (PPI) framework that improves statistical inference across related tasks by exploiting shared structure in proxy-ground-truth relationships. The method employs cross-task recalibration while preserving task-specific inference through within-task rectification and power tuning, enabling accurate point estimates and confidence intervals. Theoretical analysis shows efficiency gains are achievable only when the proxy-ground-truth relationship contains nonlinear structure, with affine recalibrations asymptotically equivalent to the original proxy. Experiments on synthetic, semi-synthetic datasets, and a case study auditing language models on 2024 U.S. election information demonstrate that cross-task recalibration substantially reduces confidence interval widths under label scarcity.

prediction-powered inferencecross-task recalibrationconfidence intervalsproxy-ground-truth relationshiplabel scarcity

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

arXiv cs.LG · Yang Ouyang, Shuhang Lin, Jung-Eun Kim · 2026-05-28

DenseSteer introduces a training-free inference-time steering framework to enhance mathematical reasoning in small language models (<=3B parameters) by modulating internal representations toward dense reasoning patterns. The method is motivated by empirical analyses showing that proficient reasoning correlates with fewer but informationally denser reasoning steps. Evaluations on math reasoning benchmarks demonstrate consistent accuracy improvements without increasing token-level Negative Log-Likelihood, validating dense reasoning as an effective structural approach to mathematical problem solving.

dense reasoninginference-time steeringchain-of-thoughtnegative log-likelihoodmathematical reasoning

Implicit Identity Technologies for LLMs: Fingerprinting and Watermarking across Datasets, Models, and Generated Content

arXiv cs.LG · Bing Liu, Shunping Wang, Yufan Zhu, Xinyi Yu · 2026-05-28

The paper introduces 'implicit identity' as a unifying abstraction for fingerprinting and watermarking techniques in LLM systems, addressing the fragmented landscape of identity verification across datasets, models, and generated content. It distinguishes fingerprinting as non-intrusive identity derived from intrinsic characteristics and watermarking as intrusive identity deliberately embedded into assets. A lifecycle-based taxonomy organizes techniques by verification semantics: similarity-based attribution and keyed verification. An evaluation framework is established, focusing on identifiability, robustness, and deployability, with representative metrics under realistic access and transformation regimes. The survey provides a structured foundation for studying LLM identity technologies and developing reliable mechanisms for asset protection and provenance.

implicit identityfingerprintingwatermarkingverification semanticslifecycle-based taxonomy

📰 Industry Media (5)

Genesis AI Releases Nyx, Quadrants, and Genesis World 1.0 Physics Platform for Scalable Robotics Foundation Model Evaluation

MarkTechPost · Michal Sutter · 2026-05-30

Genesis AI introduces Genesis World 1.0, a physics platform comprising four components (Nyx renderer, Genesis World physics engine, Quadrants compiler, and simulation interface) to accelerate robotics foundation model evaluation. The platform enables zero-shot real-to-sim evaluation with a reported Pearson correlation of 0.8996 between simulation and hardware rollouts across 14 tasks. Key innovations include barrier-free elastodynamics (103× speedup over traditional IPC) and Quadrants' 4.6× runtime improvement over Taichi. The system reduces evaluation time from 200+ hardware hours to under 0.5 hours while maintaining bit-exact consistency.

zero-shot real-to-simbarrier-free elastodynamicspearson correlationphysics enginegpu compiler

Hermes Agent Ships Tool Search for MCP: Anthropic Evals Show 49% to 74% Accuracy Gain on Opus 4

MarkTechPost · Asif Razzaq · 2026-05-30

Nous Research's Hermes Agent introduces Tool Search, a progressive-disclosure mechanism for MCP tools that reduces context window bloat by dynamically loading only relevant tool schemas. The system employs BM25 retrieval with substring fallback, replacing upfront schema loading with three bridge tools (tool_search, tool_describe, tool_call). Anthropic evaluations demonstrate accuracy improvements from 49% to 74% on Claude Opus 4 and 79.5% to 88.1% on Opus 4.5, alongside an 85% reduction in tool-definition token usage.

progressive-disclosurebm25 retrievalcontext windowmcp toolsdecision paralysis

How to Use AgentTrove: Streaming 1.7M Agentic Traces and Building a Clean ShareGPT SFT Dataset in Python

MarkTechPost · Sana Hassan · 2026-05-30

The article presents a Python workflow for processing AgentTrove, an open-source dataset containing 1.7M agentic interaction traces. The method employs streaming access to analyze conversation schemas, normalize turns, extract tool commands (achieving 97th percentile clipping for turn-length distributions), and filter successful trajectories into ShareGPT-style JSONL format. Results include statistical summaries of turn counts (mean=8.2±4.7), command frequencies, and visualizations of task sources and teacher model distributions, with 200 high-quality traces exported for supervised fine-tuning.

agentic tracesstreaming datasetcommand extractionsupervised fine-tuningturn normalization

NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B

MarkTechPost · Asif Razzaq · 2026-05-29

NVIDIA introduces X-Token, a projection-guided cross-tokenizer knowledge distillation (KD) method that addresses structural failures in prior approaches like GOLD. The method employs span alignment, a deterministic projection matrix W, and two loss formulations (P-KL and H-KL) to handle vocabulary mismatches between teacher and student models. Evaluated on Llama-3.2-1B with Qwen3-4B and Phi-4-mini-Instruct teachers, X-Token outperforms GOLD by +3.82 average points and achieves 40.48 avg. in multi-teacher setups.

knowledge distillationcross-tokenizerspan alignmentprojection matrixkl divergence

StepFun Releases Step 3.7 Flash: A 198B MoE Vision-Language Model for Coding Agents and Search Workflows

MarkTechPost · Asif Razzaq · 2026-05-29

StepFun introduces Step 3.7 Flash, a 198B-parameter Mixture-of-Experts vision-language model with 196B language and 1.8B ViT components, activating ~11B parameters per token. The model features 256k context length, multimodal capabilities, and three selectable reasoning depths (low/medium/high). It achieves 56.26% on SWE-Bench Pro (+5pp over 3.5 Flash), 79.16% on SimpleVQA (with search), and 47.20% on HLE with tools, while demonstrating emergent compositional tool use. Advisor Mode enables 97% of Claude Opus 4.6's coding performance at 1/9th cost ($0.19/task).

mixture-of-expertsvision-language modeladvisory modeemergent tool uselong-context retrieval


Generated automatically at 2026-05-30 20:21 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.