Daily Digest — 2026-05-27
209 items · 200 arXiv papers, 9 industry media
🏛️ Research Labs
No new items today.
📜 arXiv Papers (200)
From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
The paper identifies system scaling as the next bottleneck in agentic AI, proposing the concept of 'harness scaling' to design auditable, persistent, and modular architectures around foundation models. It highlights three core bottlenecks: context governance, trustworthy memory, and dynamic skill routing, supported by orchestration and governance mechanisms. The authors introduce CheetahClaws, a Python-native reference harness, and compare it with Claude Code and OpenClaw. They argue that future progress in agentic AI will depend equally on system design and foundation model improvements.
agentic ai · system scaling · foundation models · cheetahclaws · context governance
Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation
The paper introduces a method for subject-driven image generation that enhances identity preservation while following textual instructions by conditioning diffusion models on Multimodal Large Language Models (MLLMs). The approach employs a Dual Layer Aggregation (DLA) module to fuse multi-level MLLM features and a multi-stage denoising strategy to balance semantic information from MLLM and fine-detail identity from VAE. Experiments show improved performance in harmonizing multimodal understanding with identity preservation, reducing copy-paste artifacts, and achieving higher human preference scores.
subject-driven generation · multimodal large language models · diffusion models · dual layer aggregation · identity preservation
Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning
The paper introduces Prism, a plug-in reproducible infrastructure for scalable Multimodal Continual Instruction Tuning (MCIT). It addresses engineering bottlenecks in MCIT research by decoupling algorithmic development from backbone MLLM implementation via a lightweight plugin registration mechanism, enabling method integration without codebase modifications. Prism supports large-scale training pipelines, facilitating reproducible and scalable experimentation. The code is publicly available.
multimodal continual instruction tuning · plugin registration mechanism · large-scale training pipeline · reproducible infrastructure · mllm codebase
Looped Diffusion Language Models
The paper introduces LoopMDM, a looped masked diffusion model (MDM) that selectively reuses early-middle transformer layers to improve training efficiency and performance. This approach yields depth-scaling benefits without parameter overhead and enables flexible compute scaling at inference by varying loop counts. LoopMDM matches same-size MDMs with 3.3× fewer training FLOPs, outperforms them by up to 8.5 points on GSM8K, and surpasses deeper non-looped MDMs. Adaptive loop adjustment during sampling further enhances compute efficiency. Attention analysis suggests looping improves masked position interactions.
masked diffusion models · transformer architectures · compute scaling · depth-scaling · attention analysis
Forgetting in Language Models: Capacity, Optimization, and Self-Generated Replay
The paper demonstrates that language models can mitigate catastrophic forgetting by leveraging self-generated samples as replay data, nearly eliminating performance degradation on prior tasks. This approach contrasts with traditional methods requiring stored exemplars. Findings indicate that forgetting persists when models operate near capacity saturation, necessitating low learning rates for retention at the cost of training efficiency. Self-generated replay breaks this tradeoff, enabling rapid finetuning with high learning rates while preserving prior knowledge.
catastrophic forgetting · language models · self-generated replay · capacity saturation · finetuning
Goal-driven Bayesian Optimal Experimental Design for Robust Decision-Making Under Model Uncertainty
The paper introduces GoBOED, a goal-driven Bayesian optimal experimental design framework that optimizes experiments for specific decision-making objectives rather than general information gain. By combining an amortized variational posterior surrogate with a differentiable convex decision layer, GoBOED enables gradient-based design optimization focused on decision quality. Theoretical analysis shows GoBOED gradients ignore parameter directions irrelevant to decisions, justifying its superior alignment with objectives. Empirical evaluations in source localization, epidemic management, and pharmacokinetic control demonstrate GoBOED's wider near-optimal design windows compared to goal-agnostic BOED approaches.
bayesian optimal design · decision-focused · variational posterior · differentiable convex layer · parameter uncertainty
OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization
The paper introduces Orthogonal Residual Projection (ORP), a geometric algorithm-hardware co-design framework for Power-of-Two (PoT) quantization in Transformers. ORP addresses the Low Angular Resolution Regime limitation in sub-4-bit quantization by formulating it as a dual-basis geometric projection, synthesizing higher-resolution residual lattices using shift-and-add operations. The method reduces calibration time for LLaMA-2-7B to 15 minutes and achieves a perplexity of 6.10 under 3-bit constraints, outperforming MAC-intensive baselines like AWQ. Silicon-level RTL synthesis at 28nm confirms ORP's efficacy in mitigating timing bottlenecks.
orthogonal residual projection · power-of-two quantization · low angular resolution regime · shift-and-add operations · rtl synthesis
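To make the power-of-two idea concrete, here is a minimal sketch of plain PoT rounding plus a two-term residual. It is not ORP's dual-basis projection: the function names, the exponent range, and the greedy residual step are all assumptions chosen for illustration. Multiplying by 2^-k is a bit shift in hardware, so a two-term weight costs two shifts and one add instead of a multiply.

```python
import math

def pot_quantize(w, min_exp=-8, max_exp=0):
    """Round w to the nearest signed power of two (nearest in log2 space).

    Hypothetical helper: the exponent range is an illustrative choice.
    Zero maps to zero; tiny values clamp to 2**min_exp.
    """
    if w == 0.0:
        return 0.0
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))
    return math.copysign(2.0 ** exp, w)

def two_term_pot(w, min_exp=-8, max_exp=0):
    """Approximate w as the sum of two signed powers of two.

    The second term quantizes the residual left by the first, which is
    the generic shift-and-add refinement, not the paper's method.
    """
    first = pot_quantize(w, min_exp, max_exp)
    residual = pot_quantize(w - first, min_exp, max_exp)
    return first + residual

if __name__ == "__main__":
    for w in (0.75, 0.3, -0.5):
        print(w, pot_quantize(w), two_term_pot(w))
```

In this toy setting the second term removes part of the coarse rounding error that a single power of two leaves behind; how ORP chooses its residual lattice is not described in the summary above.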
DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking
The paper introduces DiscoverPhysics, a novel benchmark evaluating LLMs' ability to discover physical laws in simulated worlds with non-standard physics. The benchmark comprises 22 procedurally generated worlds featuring varied gravitational models and hidden interactions, requiring agents to design experiments, analyze trajectory data, and formulate explanatory theories. Evaluations across eleven frontier models reveal that even top performers solve only 50% of worlds, with particular difficulty in latent structure discovery. Open-source models significantly underperform commercial counterparts in experimental design and hypothesis refinement. Results indicate a dissociation between predictive accuracy and conceptual understanding, emphasizing the need for iterative hypothesis testing.
physics discovery · interactive benchmark · n-body simulation · hypothesis refinement · latent structure
Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning
The paper establishes global convergence guarantees for Wasserstein policy gradient (WPG) in entropy-regularized reinforcement learning (RL), addressing a gap in theoretical understanding. By leveraging the Bellman structure of soft Q-functions, the analysis substitutes convexity with a Bellman-based argument: a KL representation of soft Bellman residuals, contraction properties linking residuals to optimality gaps, and a resolvent identity connecting value improvement to Fisher information. Combined with a uniform log-Sobolev inequality for Gibbs policies, this yields a distributional Polyak–Łojasiewicz condition, enabling geometric convergence up to discretization error. The results demonstrate that entropy-regularized RL exhibits favorable PL-type geometry despite non-convexity.
wasserstein policy gradient · entropy-regularized rl · bellman residual · polyak–łojasiewicz condition · log-sobolev inequality
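For readers unfamiliar with why a Polyak–Łojasiewicz condition gives geometric convergence without convexity, the classical Euclidean version is sketched below. This is the textbook statement for an L-smooth function f with minimum value f^*, not the paper's distributional variant; the symbols \mu, L, and x_k are the usual ones and are assumptions relative to the summary.

```latex
% PL inequality with constant \mu > 0:
\frac{1}{2}\,\lVert \nabla f(x) \rVert^{2} \;\ge\; \mu \,\bigl(f(x) - f^{*}\bigr)
\quad \text{for all } x.

% Gradient descent with step size 1/L on an L-smooth f satisfies
f(x_{k+1}) \;\le\; f(x_k) - \frac{1}{2L}\,\lVert \nabla f(x_k) \rVert^{2}
\;\le\; f(x_k) - \frac{\mu}{L}\,\bigl(f(x_k) - f^{*}\bigr),

% so the optimality gap contracts geometrically:
f(x_{k}) - f^{*} \;\le\; \Bigl(1 - \frac{\mu}{L}\Bigr)^{k}\,\bigl(f(x_0) - f^{*}\bigr).
```

The "up to discretization error" qualifier in the summary would appear as an additive term on the right-hand side; its form is not given here.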
Active Query Synthesis for Preference Learning
The authors introduce Info-Synth, an active query synthesis framework for preference learning that addresses feedback reliability and computational efficiency. The method employs a confidence-aware response model to handle ambiguous pairwise comparisons and maximizes a mutual information-based objective in continuous space for query generation. Two strategies, Pair M-dist and Pair Opt-dist, extend Info-Synth to finite query pools. Evaluations on synthetic preference learning, constrained text summarization, and robot controller tuning demonstrate the framework's versatility and performance.
active learning · preference learning · mutual information · query synthesis · confidence-aware model
Rethinking Weak Supervision in Anomaly Detection: A Comprehensive Benchmark
The paper introduces WSADBench, a unified benchmark for evaluating weakly supervised anomaly detection (WSAD) across three supervision scenarios: incomplete, inexact, and inaccurate. It standardizes evaluation protocols for 36 algorithms across 4 modalities, varying label quantity, granularity, and quality. Based on 700K experiments, key findings include: (i) intrinsic correlations between WSAD scenarios, (ii) specialized WSAD methods excel only in extreme label-scarcity, (iii) inconsistent utility of unlabeled data, and (iv) asymmetric sensitivity to label noise. The benchmark reveals that tabular foundation models outperform specialized methods as supervision increases.
weakly supervised anomaly detection · wsadbench · tabular foundation models · label noise · benchmark
Conditional KRR: Injecting Unpenalized Features into Kernel Methods with Applications to Kernel Thresholding
The paper introduces conditional kernel ridge regression (conditional KRR), a method combining linear regression on unpenalized features with standard KRR on residuals, using conditionally positive definite kernels. Theoretical analysis shows the method's test risk reduces to standard KRR with a residual kernel, plus an O(1/√N) term dependent on feature class F. Experiments demonstrate conditional KRR outperforms standard KRR when F captures dominant signal components, particularly for Mercer eigenfunctions or random feature representations of K.
conditional kernel ridge regression · conditionally positive definite kernels · native space norm · mercer decomposition · random features
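The "linear regression on unpenalized features, then KRR on residuals" description can be illustrated with a small two-stage sketch. This is a simplification under stated assumptions, not the paper's estimator: the feature class is fixed to [1, x], the kernel is a Gaussian RBF, the ridge convention is (K + lam I), and every function name is hypothetical.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def rbf(u, v, gamma=1.0):
    """Gaussian kernel on scalars (an illustrative kernel choice)."""
    return math.exp(-gamma * (u - v) ** 2)

def fit_two_stage(xs, ys, lam=1e-3):
    """Stage 1: unpenalized least squares on [1, x]. Stage 2: KRR on residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Ridge system (K + lam * I) alpha = residuals.
    k = [[rbf(a, b) + (lam if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    alpha = solve(k, resid)

    def predict(x):
        kernel_part = sum(a * rbf(x, xi) for a, xi in zip(alpha, xs))
        return intercept + slope * x + kernel_part

    return predict, resid

if __name__ == "__main__":
    xs = [0.0, 0.5, 1.0, 1.5, 2.0]
    predict, _ = fit_two_stage(xs, [2.0 * x + 1.0 for x in xs])
    print(predict(3.0))  # linear signal extrapolates without ridge shrinkage
```

The point of the design is that the ridge penalty only touches what the linear stage leaves unexplained, so a signal that lies entirely in the feature class is not shrunk.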
Paris 2.0: A Decentralized Diffusion Model for Video Generation
Paris 2.0 introduces the first decentralized diffusion model for video generation, extending Paris 1.0's decentralized training framework to temporally coherent outputs. The method adapts decentralized computation, previously proven viable for image generation, to video synthesis without relying on monolithic GPU clusters. In low-resolution text-to-video tasks, Paris 2.0 reduces Fréchet Video Distance (FVD) by 2.0× (from 561.04 to 279.01) versus a centralized counterpart under matched compute, while improving CLIP text-video alignment and aesthetic scores.
decentralized diffusion model · frechet video distance · text-to-video · clip similarity · temporal coherence
Neuronal Stochastic Attention Circuit (NSAC) for Probabilistic Representation Learning
The paper introduces Neuronal Stochastic Attention Circuit (NSAC), a biologically-inspired continuous-time attention architecture that models attention logits via an Ornstein-Uhlenbeck stochastic differential equation with input-dependent gates from C. elegans Neuronal Circuit Policies. NSAC propagates stochasticity through Gaussian-distributed logits and logistic-normal attention weights, optimized via a two-term objective combining Gaussian negative log-likelihood and epistemic-separation regularization. Evaluated on irregular function approximation, multivariate regression, long-range forecasting, Industry 4.0, and autonomous lane-keeping, NSAC achieves competitive accuracy with well-calibrated uncertainty estimates while maintaining neuronal-level interpretability.
neuronal stochastic attention circuit · ornstein-uhlenbeck process · logistic-normal distribution · epistemic-separation regularizer · continuous-time attention
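The chain "Ornstein-Uhlenbeck logits, then logistic-normal weights" can be sketched in a few lines. This is a toy under stated assumptions: the drift target `mu`, the constants `theta` and `sigma`, the Euler-Maruyama discretization, and all function names are illustrative, and the input-dependent gates and Neuronal Circuit Policy wiring from the paper are omitted.

```python
import math
import random

def ou_step(z, mu, theta, sigma, dt, rng):
    """One Euler-Maruyama step of dz = theta * (mu - z) dt + sigma dW."""
    return [zi + theta * (mi - zi) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            for zi, mi in zip(z, mu)]

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def sample_attention(mu, theta=1.0, sigma=0.3, dt=0.05, steps=40, seed=0):
    """Evolve Gaussian logits toward mu, then map them to attention weights.

    A softmax of Gaussian logits follows a logistic-normal distribution,
    so repeated calls with different seeds give a spread of weight vectors.
    """
    rng = random.Random(seed)
    z = [0.0] * len(mu)
    for _ in range(steps):
        z = ou_step(z, mu, theta, sigma, dt, rng)
    return softmax(z)

if __name__ == "__main__":
    for seed in range(3):
        print(sample_attention([2.0, 0.0, -1.0], seed=seed))
```

Drawing several seeds and looking at the spread of the weights gives a rough picture of the uncertainty that the architecture propagates through attention.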
Accelerating Bayesian inverse design in computational fluid dynamics using neural operators
This work introduces neural operator-accelerated Bayesian inference for uncertainty-aware inverse design in computational fluid dynamics (CFD), achieving a 1000× speedup over traditional methods. The approach embeds a Deep Operator Network surrogate within a No-U-Turn Sampler MCMC loop while preserving posterior structure, validated on quasi-one-dimensional nozzle flow with cubic B-spline parameterization. Results show surrogate-based inference matches CFD reference posteriors in under 1 second, and a direct inverse neural operator enables single-shot deterministic reconstruction.
bayesian inverse design · neural operators · computational fluid dynamics · markov chain monte carlo · deep operator network
When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges
The work identifies two failure modes in multi-objective prompt optimization for LLM judges: gradient dilution during optimization and instruction interference during inference. It evaluates five decomposition modes of textual gradient optimizers by varying cross-task information sharing among loss, gradient, and optimizer LLMs. Results show optimization fails in 6/10 configurations, with gradient specificity dropping 59% (9.0 to 3.7) under joint criteria processing, and naive instruction combination lowering Spearman's rho by 5.3%.
multi-objective optimization · textual gradient methods · llm judges · gradient dilution · instruction interference
CITYREP: A Unified Benchmark for Urban Representations Across Cities, Tasks, and Modalities
CityRep introduces a unified benchmark for evaluating urban representation learning across cities, tasks, and modalities, addressing limitations of current evaluations. It features a spatial unit-agnostic framework, block-based spatial splits to mitigate spatial leakage, and an extensible multi-city, multi-task suite spanning 8 cities and 8 tasks. The benchmark evaluates 11 urban representation models, revealing that random splits inflate performance and alter rankings, while performance varies significantly across cities and tasks. CityRep provides datasets, evaluation pipelines, and diagnostic tools to support reproducible research and fair comparison in urban representation learning.
urban representation learning · spatial leakage · multi-task benchmark · generalization-aware evaluation · urban foundation models
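A block-based spatial split, in its simplest form, assigns whole grid cells to either train or test so that near-duplicate neighbours never straddle the boundary. The sketch below illustrates that general idea only: the grid size, the every-k-th-block rule, and the function names are assumptions, and CityRep's actual splitting procedure is not described in the summary.

```python
import math

def block_id(lon, lat, block_deg=0.05):
    """Grid cell containing a point (block size in degrees is illustrative)."""
    return (math.floor(lon / block_deg), math.floor(lat / block_deg))

def spatial_split(points, test_every=5, block_deg=0.05):
    """Send every test_every-th occupied block to test, the rest to train."""
    blocks = sorted({block_id(lon, lat, block_deg) for lon, lat in points})
    test_blocks = {b for i, b in enumerate(blocks) if i % test_every == 0}
    train = [p for p in points if block_id(p[0], p[1], block_deg) not in test_blocks]
    test = [p for p in points if block_id(p[0], p[1], block_deg) in test_blocks]
    return train, test

if __name__ == "__main__":
    pts = [(0.01, 0.01), (0.02, 0.03), (0.06, 0.01), (0.11, 0.01),
           (0.16, 0.01), (0.21, 0.01), (0.26, 0.01)]
    train, test = spatial_split(pts)
    print(len(train), len(test))
```

A random split would scatter the two points in the first block across train and test, which is the kind of leakage the benchmark reports as inflating scores.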
Length Generalization with Log-Depth Recurrent Units
We introduce MLP-LDRU, a Log-Depth Recurrent Unit designed to address length generalization challenges in neural networks by approximating recurrence through parallel reduction with associativity-biased operators. The model is evaluated on 21 regular-language tasks, including standard benchmarks and new prefix languages, achieving 100% out-of-distribution accuracy on 18 tasks and ≥99.9% on the remaining 3 when increasing max training length, surpassing comparable recurrent and attention-based models. MLP-LDRU also demonstrates competitive performance on ListOps and NLP classification benchmarks beyond regular languages.
length generalization · log-depth recurrent unit · regular-language tasks · parallel reduction · associativity-biased operators
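The reason associativity matters for log-depth recurrence can be shown with a hand-written operator. The sketch below is not MLP-LDRU, which learns its operators: it composes the exact transition maps of a parity automaton with a pairwise tree reduction, and all names are hypothetical. Because composition is associative, the tree gives the same answer as a left-to-right scan while needing only about log2(n) sequential levels.

```python
def compose(f, g):
    """Apply f first, then g. A map is a tuple: index = state, value = next state."""
    return tuple(g[s] for s in f)

def tree_reduce(items, op):
    """Combine adjacent pairs level by level; returns (result, depth).

    Order is preserved, so any associative op gives the sequential answer.
    Expects a non-empty sequence.
    """
    level, depth = list(items), 0
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries over to the next level
        level, depth = nxt, depth + 1
    return level[0], depth

# Parity automaton: state 0 = even number of 1s seen, state 1 = odd.
STEP = {"0": (0, 1), "1": (1, 0)}

def parity(bits):
    """Final state from start state 0, plus the reduction depth used."""
    total, depth = tree_reduce([STEP[b] for b in bits], compose)
    return total[0], depth

if __name__ == "__main__":
    print(parity("1101"))
    print(parity("1" * 1024))
```

A string of 1024 symbols needs 10 levels here rather than 1024 sequential steps, which is the structural property a log-depth unit relies on.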
Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution
SKILD introduces a scale-invariant diffusion model unifying image generation and continuous super-resolution within a single unconditional framework. Leveraging scale invariance, it designs a forward process attenuating image content from fine to coarse scales while injecting spectrum-matched Gaussian noise, making scale an explicit coordinate of diffusion dynamics. The reverse process performs both tasks by varying only the starting timestep, requiring no task-specific architecture or retraining per scale factor. SKILD achieves FID 2.65 and Inception Score 9.63 on CIFAR-10, outperforms conditional models in ImageNet super-resolution, and accurately reconstructs critical Ising models.
scale invariance · diffusion model · super-resolution · spectrum-matched noise · ising models
A Multimodal 3D Foundation Model for Light Sheet Fluorescence Microscopy Enables Few-Shot Segmentation, Classification, and Deblurring
We introduce a multimodal 3D foundation model for light sheet fluorescence microscopy (LSM) that enables few-shot segmentation, classification, and deblurring. The model is pretrained on a large curated dataset of 3D images across organisms, stains, and imaging protocols, learning transferable volumetric representations via joint optimization of masked reconstruction and image-text alignment. Evaluations demonstrate consistent improvements over baselines in downstream tasks, both in standard metrics and expert assessments, while drastically reducing annotation requirements. Pretrained weights and code are publicly available.
light sheet fluorescence microscopy · volumetric representation · masked reconstruction · few-shot learning · image-text alignment
Retrieval-Augmented Detection of Potentially Abusive Clauses in Chilean Terms of Service
The study introduces a retrieval-augmented generation framework for detecting and classifying abusive clauses in Chilean Terms of Service, alongside the Chilean Abusive Terms of Service Extended corpus (100 contracts, 10,029 clauses in 24 categories). The method combines dense-sparse retrieval, reranking, and prompt augmentation to support local open-weight language models. Results show retrieval-augmented prompting improves performance, enabling local models to approach cloud-based systems at lower computational cost. The paper also quantifies token efficiency, but the figures are not given in this summary.
retrieval-augmented generation · dense-sparse retrieval · open-weight language models · contract annotation · prompt augmentation
AdvantageFlow: Advantage-Weighted Least Squares for RL in Flow Models
AdvantageFlow introduces a forward-process reinforcement learning algorithm for rectified flow models, optimizing an advantage-weighted forward-process prediction loss instead of the reverse process like Flow-GRPO. The method addresses instability in optimization with negative advantages by employing rollout policy regularization to reduce variance and fit a local reward-improving target distribution. Evaluated on image generation tasks using Stable Diffusion 3.5 Medium, AdvantageFlow outperforms both Flow-GRPO and a state-of-the-art forward-process RL baseline in negative-aware fine-tuning.
advantageflow · rectified flow models · rollout policy regularization · stable diffusion 3.5 medium · flow-grpo
Learning in Low-Dimensional Subspaces: Orthogonal Bottlenecks for Reinforcement Learning
This work introduces orthogonal bottlenecks, a representation-level prior for deep reinforcement learning that constrains encoder features to low-dimensional subspaces via fixed orthonormal projections. The method requires no auxiliary objectives, pretraining, or RL algorithm modifications, preserving expressivity when the bottleneck dimension exceeds the intrinsic rank of the optimal value function. Empirical results across single and multi-task benchmarks show baseline performance is maintained or improved above task-dependent threshold dimensions, with value representations compressible to extremely low dimensions without loss. Analysis reveals orthogonal bottlenecks stabilize feature norms and increase effective rank, supporting their role as lightweight, architecture-agnostic mechanisms for shaping RL representations.
orthogonal bottlenecks · representation-level prior · low-dimensional subspaces · effective rank · reinforcement learning
Statistical Inference for Stochastic Gradient Descent Beyond Finite Variance
We present a model-agnostic methodology for constructing asymptotically valid confidence regions from Stochastic Gradient Descent (SGD) trajectories in both finite- and infinite-variance regimes. The approach leverages a joint weak convergence result for the Polyak-Ruppert averaged estimator and an empirical second-moment normalizer derived from stochastic gradients, yielding a self-normalized statistic where tail-dependent scaling terms cancel. A subsampling calibration scheme estimates critical values without requiring explicit estimation of tail indices or stable-law parameters. Simulations demonstrate reliable coverage across various settings, establishing the method as a practical tool for uncertainty quantification in stochastic optimization.
stochastic gradient descent · polyak-ruppert averaging · self-normalized statistic · subsampling calibration · confidence regions
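Polyak-Ruppert averaging, the estimator the confidence regions are built around, is simply the running mean of the SGD iterates. The sketch below shows it on a one-dimensional quadratic with Gaussian (finite-variance) gradient noise. The step-size schedule, the objective, and the function names are assumptions; the paper's self-normalized statistic and subsampling calibration are not implemented here.

```python
import random

def averaged_sgd(grad_sample, x0, steps, lr0=0.5, decay=0.6, seed=0):
    """Run SGD with step size lr0 * t**(-decay) and track the iterate mean.

    Returns (last_iterate, polyak_ruppert_average).
    """
    rng = random.Random(seed)
    x, avg = x0, 0.0
    for t in range(1, steps + 1):
        x -= lr0 * t ** (-decay) * grad_sample(x, rng)
        avg += (x - avg) / t  # incremental mean of x_1 .. x_t
    return x, avg

def noisy_grad(x, rng):
    """Gradient of 0.5 * (x - 3)**2 plus standard Gaussian noise."""
    return (x - 3.0) + rng.gauss(0.0, 1.0)

if __name__ == "__main__":
    last, avg = averaged_sgd(noisy_grad, x0=0.0, steps=20000)
    print(last, avg)
```

The averaged iterate is typically much closer to the minimizer at 3.0 than the last iterate, which is why inference procedures target the average. Swapping the Gaussian noise for a heavy-tailed draw would move this toy into the infinite-variance regime the paper addresses.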
Causal methods for LLM development and evaluation
The paper advocates for integrating causal inference methods into large language model (LLM) development and evaluation pipelines, identifying three key contributions. First, it demonstrates how causal methods address confounding, distribution shifts, and biased evaluation in logged data settings. Second, it systematically maps causal opportunities across pretraining, alignment, routing, agentic workflows, and evaluation stages. Third, it identifies new research directions for causal LLM development. The authors argue causal methods provide principled solutions to current empirical fragility in LLM pipelines despite their underutilization.
causal inference · llm development · distribution shifts · logged data · confounding
Deployment-complete benchmarking
The paper introduces deployment-complete benchmarking, a method to evaluate whether benchmark evidence sufficiently determines deployment actions. By analyzing evidence fibers and using completion curves, it identifies ambiguities in benchmark-to-deployment transitions. Results show poor transferability (10.07% coverage) of benchmark-channel conformal coverage to unmeasured deployment channels, while response-rank intervals achieved 94.91% coverage. Audits revealed significant incompleteness, with 97.9% mixed fibers in Tox21 and zero median certifiable fraction in Matbench/JARVIS. The method reduced false decisions from 1.19% to 0.027% in Tox21 and 20.3% to 0.128% in JARVIS replays.
deployment-complete benchmarking · evidence fibers · completion curves · conformal coverage · response-rank intervals
Fuzzy PyTorch: Rapid Numerical Variability Evaluation for Deep Learning Models
Fuzzy PyTorch introduces a framework for rapid evaluation of numerical variability in deep learning models, addressing floating-point arithmetic uncertainty. It integrates stochastic arithmetic into PyTorch via Probabilistic Rounding with Instruction Set Management, interfacing with the Verificarlo compiler, and offers stochastic rounding and novel up-down rounding modes. Comparative evaluations demonstrate runtime reductions of 5× to 60× versus Verrou, while maintaining model performance across architectures ranging from 1 to 341 million parameters. The framework provides scalable, efficient, and practical solutions for quantifying floating-point uncertainty without compromising computational efficiency.
stochastic arithmetic · probabilistic rounding · numerical variability · floating-point uncertainty · instruction set management
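Stochastic rounding itself is easy to state: round up with probability equal to the fractional distance to the lower neighbour, so the result is unbiased in expectation. The sketch below applies it to a fixed grid of multiples of `step`. It is a toy illustration with hypothetical names, not Fuzzy PyTorch's instruction-level implementation, and the up-down mode mentioned above is not shown.

```python
import math
import random

def stochastic_round(x, step, rng):
    """Round x to a multiple of step, choosing the upper neighbour with
    probability equal to x's fractional position between the two neighbours."""
    lo = math.floor(x / step)
    frac = x / step - lo
    return (lo + (1 if rng.random() < frac else 0)) * step

if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, 1.0, rng) for _ in range(10000)]
    print(sum(samples) / len(samples))  # close to 0.3 on average
```

Running a computation many times under this rule and measuring the spread of the outputs is the basic recipe for estimating how many digits of a result are numerically significant.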
Creative Quality Alignment: Expert Tacit Knowledge Transfer via Chain-of-Thought Fine-Tuning
The paper empirically validates the creative quality metric from Calibrated Surprise (Zou & Xu, 2026a) under strict engineering constraints: minimal data (100 expert chain-of-thought annotations from BC Protocol) and a small base model. It introduces Creative Quality Alignment (CQA), addressing dataset biases toward craft knowledge by emphasizing audience modeling and reality-logic coverage. Theoretically, it demonstrates that in single-conditional-distribution LLMs, calibrating appreciation transfers to generation via architectural duality, explaining why few CoT examples suffice, unlike empirical approaches like LIMA.
creative quality alignment · chain-of-thought · expert tacit knowledge · conditional distribution architecture · data bias
Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio
The paper introduces a robust, gradient-free watermarking method for synthetic audio, leveraging vocabulary redundancy in discrete token representations. By analyzing token error impacts and employing community detection for vocabulary reduction, the method achieves significant detectability improvements without finetuning tokenizers. Experimental results demonstrate orders-of-magnitude gains in watermark detectability and inherent robustness to audio modifications, establishing a new state-of-the-art for token-level watermarks in multimedia.
watermarking · tokenization · community detection · discretization · robustness
Mapping the Schedule x Bit-Width Boundary in Sub-100M Quantisation-Aware Training
The study investigates whether optimal learning-rate schedules vary by bit-width in quantisation-aware training (QAT) for sub-100M decoder language models. Through extensive experiments (720-run factorial grid and 625-run follow-up) across FP16/INT8/INT6/INT4 precisions, model sizes (5M-350M), and training configurations, the primary hypothesis—that INT6 QAT requires distinct schedules—is falsified. Results show a consistent optimal warmdown fraction of 33% (wd33) across bit-widths, with INT4 exhibiting a noise-dominated regime below 50M and decisive wd33 preference above 50M. Practical recommendations include reusing FP16 schedules for INT8/INT6 QAT and adopting wd33 for INT4 models ≥50M.
quantisation-aware training · learning-rate schedule · warmdown fraction · sub-100m models · bit-width
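A "warmdown fraction of 33%" can be read as: hold the learning rate at its peak, then decay it over the final third of training. The sketch below encodes that reading with a linear decay to zero. The decay shape, the absence of warmup, and the function name are assumptions, since the summary does not specify them.

```python
def lr_at(step, total_steps, peak_lr, warmdown_frac=0.33):
    """Constant learning rate, then linear decay over the final fraction of steps."""
    warmdown_steps = round(total_steps * warmdown_frac)
    start = total_steps - warmdown_steps
    if step < start or warmdown_steps == 0:
        return peak_lr
    return peak_lr * (total_steps - step) / warmdown_steps

if __name__ == "__main__":
    for s in (0, 670, 835, 1000):
        print(s, lr_at(s, total_steps=1000, peak_lr=1.0))
```

Under the paper's finding, the same `warmdown_frac` would be reused across FP16, INT8, and INT6 runs rather than tuned per bit-width.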
QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability
The paper introduces QUIET, a diagnostic benchmark for evaluating large language models' (LLMs) creative generation capability through multi-blank cascaded story cloze tasks. QUIET features N blanks (10-20) with explicit content constraints and cascade dependencies, requiring open-ended generation. An automated scoring protocol based on information-theoretic principles operationalizes the 'calibrated surprise' framework, combining constraint satisfaction and surprise metrics. This method avoids subjective human grading, providing an objective measure of creative capability.
multi-blank cascaded story cloze · creative generation capability · calibrated surprise · information-theoretic scoring · constraint satisfaction
Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization
Step-TP introduces a step-level dataset for tensor program optimization with chain-of-thought reasoning, addressing limitations in existing LLM-guided approaches. The dataset provides grounded, atomic supervision through a token-efficient intermediate representation that deterministically lowers to TVM TIR, enabling reliable multi-step optimization. It decomposes complex trajectories into interpretable single-step decisions with structured CoT supervision and explicit IR-to-IR state transitions. Strategy filtering balances coverage while preventing shortcut exploitation. The dataset and implementation are publicly available on GitHub.
tensor program optimization · chain-of-thought reasoning · intermediate representation · strategy filtering · tvm tir
Small Models, Strong Priors: Architectural Inductive Bias for Parameter-Efficient Neural PDE Solvers
The paper proposes WaveLiT, a parameter-efficient neural PDE solver combining architectural inductive biases to compete with larger foundation models. The architecture integrates discrete wavelet transforms for multi-resolution tokenization, augmented linear attention, shared-weight multiscale feature pyramids, and wavelet-domain auxiliary losses. Evaluated on eight TheWell benchmarks, 1-10M-parameter WaveLiT models match or exceed 100-1000× larger models, particularly excelling in wave/acoustic-dominated systems where wavelet priors align with dynamics. A 10M-parameter multi-task variant shows interpretable transfer patterns, performing best on wavelet-aligned dynamics and worst on chaotic advection. Results indicate architectural priors outweigh scale for PDE solving, with failure patterns revealing prior content.
neural pde solvers · wavelet transform · inductive bias · parameter efficiency · multiscale feature pyramid
STaT: Resolving Shape Distortion in Non-Stationary Time Series via Tri-Modal Synergy
STaT introduces a tri-modal architecture for non-stationary time series forecasting, addressing shape distortion in existing multi-modal approaches. The method combines symbolic (discrete tokenization for structural patterns), temporal (sequential dependencies), and textual (domain semantics) modalities via Symbolic-Temporal-Textual Alignment. Evaluations on eight benchmarks show STaT improves magnitude metrics by up to 8.9% and reduces shape distortion by up to 8.5% compared to conventional methods.
non-stationary time series · multi-modal fusion · symbolic tokenization · shape distortion · temporal dependencies
From Latent Space to Training Data: Explainable Specialization in Minimal MLPs
The study identifies a design principle for prototype-recoverability-aware training in minimal one-hidden-layer MLPs, demonstrating that repulsive structural losses require compatible attractors to prevent latent geometry collapse. Using Gaussian-activation MLPs with width equal to dataset size, the authors evaluate three structural losses—coverage, separation, and overlap—against a standard fitting baseline on uniformly sampled one-dimensional datasets. Coverage regularization achieves the lowest mean reconstruction error across 480 runs (N = 3 to N = 100) and enhances prototype-usage specialization, while overlap penalties systematically degrade performance by pushing prototype centers outside the training input convex hull. Separation exhibits mixed effects, with expulsion occurring only at large temperatures.
mlp · prototype-recoverability · latent geometry · structural loss · gaussian-activation
Building an Adversarial Malware Dataset by Family and Type: Generation, Evasion, and Poisoning Evaluation
The study introduces a novel adversarial malware dataset derived from RawMal-TF, comprising 44,347 family-labelled and 33,596 type-labelled PE files generated via adversarial techniques. These samples achieve evasion rates of 98.35% and 92.20% against the EMBER classifier, respectively. The dataset includes metadata such as EMBER scores and VirusTotal classifications. Additionally, the work demonstrates the vulnerability of malware classifiers to data poisoning, showing that injecting 0.5% mislabelled adversarial samples increases evasion rates from 26.1% to 92.8%. The dataset is publicly released to support research on adversarial malware and classifier robustness.
adversarial malware · ember classifier · evasion rate · data poisoning · pe files
Quantitative Evaluation of the Severity of Posttraumatic Stress Disorder through Transfer Learning from Specific Phobia Data
The study proposes a transfer learning approach using multivariate kernel density estimation (MKDE) to objectively assess PTSD severity through physiological signals. Heart rate (HR) and galvanic skin response (GSR) data from 21 military participants were analyzed, leveraging a fear-response model pre-trained on arachnophobia data. The model achieved 86% accuracy in PTSD classification (PCL-M threshold: 36) with a mean absolute error (MAE) of 5.6 and 17% mean absolute percentage error in severity estimation, demonstrating potential for clinical screening applications.
transfer learning · multivariate kernel density estimation · galvanic skin response · ptsd severity · physiological signals
Multi-Agent Systems are Mixtures of Experts: Who Becomes an Influencer?
The paper investigates multi-agent LLM deliberation through Friedkin-Johnsen opinion dynamics, revealing input-dependent parameters that transform deliberation into a mixture of experts. By analyzing stubbornness, influence, and opinion change, the study demonstrates that dynamic routing based on agent competence enables multi-agent systems to surpass single agents and static ensembles. Empirical analysis focuses on observable proxies for latent competence: self-assessed confidence, perceived confidence, and initial alignment with peers.
multi-agent systems · friedkin-johnsen dynamics · mixture of experts · opinion change · latent competence
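The Friedkin-Johnsen model updates each agent's opinion as a blend of its neighbours' current opinions and its own initial opinion, weighted by stubbornness. The sketch below uses fixed constants for the influence matrix `w` and the `stubborn` vector. In the paper these parameters depend on the input, which is what turns deliberation into a mixture of experts; that dependence is not modeled here, and the names are hypothetical.

```python
def fj_step(x, x0, w, stubborn):
    """One update: x_i <- (1 - s_i) * sum_j w[i][j] * x_j + s_i * x0_i.

    w is row-stochastic (each row sums to 1); s_i in [0, 1] is stubbornness.
    """
    n = len(x)
    return [(1.0 - stubborn[i]) * sum(w[i][j] * x[j] for j in range(n))
            + stubborn[i] * x0[i]
            for i in range(n)]

def fj_run(x0, w, stubborn, steps=200):
    """Iterate the update from the initial opinions x0."""
    x = list(x0)
    for _ in range(steps):
        x = fj_step(x, x0, w, stubborn)
    return x

if __name__ == "__main__":
    w = [[0.5, 0.5], [0.5, 0.5]]
    # Agent 0 never moves; agent 1 is fully open and is pulled to agent 0.
    print(fj_run([1.0, 0.0], w, stubborn=[1.0, 0.0]))
    # With no stubbornness the pair averages to consensus.
    print(fj_run([1.0, 0.0], w, stubborn=[0.0, 0.0]))
```

In the first run the stubborn agent ends up determining the group's answer, which is the sense in which it becomes the "influencer"; routing stubbornness toward the competent agent is the behaviour the paper analyzes.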
Does Continued Pretraining on a Learner Corpus Improve Automated Essay Scoring on English Proficiency Tests? Evidence from EFCAMDAT
This study evaluates domain-adaptive continued pretraining (DAPT) on the EFCAMDAT learner corpus to enhance transformer-based automated essay scoring (AES) for English proficiency tests. Using three transformer encoders, the research assesses DAPT's impact on FCE and IELTS datasets through in-domain scoring and few-shot cross-dataset transfer. Results indicate mixed efficacy, with proficiency-aligned subsets outperforming full-corpus DAPT for B1–B2 FCE data but failing to improve cross-dataset transfer consistently. Findings suggest DAPT benefits in-domain AES when pretraining data aligns with assessment settings.
domain-adaptive pretrainingtransformer encodersautomated essay scoringefcamdatenglish proficiency tests
Joint Optimization of Training and Inference in Federated Edge Learning via Constrained Multi-Objective Deep Reinforcement Learning
The paper proposes a joint optimization framework for federated edge learning (FEEL) that simultaneously manages training and inference on resource-constrained devices. It introduces a tandem-queue mechanism linking inference requests to training data, incorporates temporal dynamics via data/model freshness metrics, and formulates the problem as a multi-objective Markov decision process (MOMDP). The solution, constrained multi-objective proximal policy optimization (C-MOPPO), learns Pareto-optimal policies balancing accuracy, latency, and energy. Experiments show C-MOPPO outperforms baselines in achieving dense, high-quality trade-offs across objectives.
federated edge learningmulti-objective optimizationmarkov decision processproximal policy optimizationedge intelligence
Universal Activation Verbalizer: A Unified Framework for Cross-Model Activation Explanation
The Universal Activation Verbalizer (UAV) framework enables cross-model activation explanation by using a shared decoder to interpret activations from heterogeneous donor models. UAV employs a lightweight adapter to convert donor activations into soft tokens in the decoder's embedding space, supporting adapter-only transfer via a frozen decoder-side LoRA. Evaluated across classification, fact retrieval, and gist summarization tasks, UAV matches self-explanation baselines while facilitating cross-model verbalization across different model families and scales. Ablation studies indicate decoder-side tuning primarily enhances task behavior, while the adapter supplies activation-grounded factual and semantic information for faithful explanations.
universal activation verbalizercross-model explanationadapter-only transferdecoder-side loraactivation-grounded
Reading the Finetuning Prior: Verbatim Content Recovery via Contrastive Decoding Diffing
The paper introduces Contrastive Decoding Diffing (CDD), a black-box method for verbatim content recovery from finetuned language models without weight access. CDD leverages output-level logit distributions, bypasses chat templates, uses vague pre-fills, and amplifies logit-space differences between base and finetuned models. It achieves exact recovery of implanted facts (drug names, vote counts, etc.) across four architectures (1B-32B parameters), outperforming the white-box Activation Difference Lens (ADL) with a 170x speedup. CDD also exposes data pipeline artifacts, demonstrating end-to-end fingerprinting from generator artifacts to model weights. Validation shows near-perfect recovery in single-dataset settings and correct identification in mixed-dataset scenarios.
contrastive decoding diffinglogit-space differenceverbatim recoveryfinetuning priormodel fingerprinting
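The idea of amplifying logit-space differences can be illustrated with a small sketch. The scoring rule below (finetuned log-probabilities plus alpha times their gap to the base model) is an assumed, generic contrastive form, not the paper's exact formula; `alpha` and the toy logits are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def amplified_scores(base_logits, finetuned_logits, alpha=1.0):
    """Score tokens by finetuned log-probs plus an amplified base-vs-finetuned gap.

    Assumed form: score = lp_ft + alpha * (lp_ft - lp_base).
    The result is an unnormalized score, not a probability distribution.
    """
    lp_base = log_softmax(base_logits)
    lp_ft = log_softmax(finetuned_logits)
    return lp_ft + alpha * (lp_ft - lp_base)

# Hypothetical next-token logits over a 3-token vocabulary.
base_logits = [2.0, 1.0, 0.0]
finetuned_logits = [2.0, 1.8, 0.0]   # finetuning raised token 1, but not enough to win

plain_choice = int(np.argmax(finetuned_logits))
contrastive_choice = int(np.argmax(amplified_scores(base_logits, finetuned_logits)))
print(plain_choice, contrastive_choice)
```

Greedy decoding from the finetuned model alone still picks token 0 here, while the amplified score surfaces token 1, the token whose probability finetuning increased the most.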
Predicting Stock Price Direction on Earnings Announcement Days using Multi-modal Deep Learning
The study evaluates multi-modal deep learning for predicting stock price direction during earnings announcements, combining fundamental metrics, technical indicators, and FinBERT-derived sentiment scores. It compares LSTM and Transformer architectures against logistic regression, with and without sentiment features. Results show the Transformer is the strongest model at detecting volatile movements, achieving higher macro F1-scores, while sentiment features consistently enhance performance.
finbertlstmtransformermacro f1-scoresentiment scores
Merge-Bench: Resolve Merge Conflicts with Large Language Models
The paper introduces Merge-Bench, a dataset of 7,938 real-world merge conflicts from 1,439 GitHub repositories, with ground truth from developer commits. It presents LLMergeJ, a 14B-parameter model trained via Group Relative Policy Optimization (GRPO) for resolving Java merge conflicts. Evaluations show LLMergeJ outperforms three commercial LLMs (except Gemini 2.5 Pro), with top models resolving <60% of conflicts across 11 languages.
merge conflictsgroup relative policy optimizationlarge language modelsversion controlreinforcement learning
Capability and Robustness Cannot Both Be Free: An Information-Theoretic Bound for Vision-Language-Action Models
The paper establishes an information-theoretic bound proving that capability and robustness in Vision-Language-Action (VLA) models cannot be simultaneously maximized. Using mutual information measures and the Data Processing Inequality, it derives a policy-independent budget constraining the sum of task performance (capability) and adversarial robustness. The bound is validated empirically on 252 Gaussian-VLA cells and 48 OpenVLA-7B × LIBERO × PGD configurations, with zero violations observed. A corollary tightens the bound by restricting the adversarial channel to policy-relevant subspaces, revealing OpenVLA-7B already consumes ~24% of its ~31-nat budget.
vision-language-action modelsmutual informationadversarial robustnessdata processing inequalitytask entropy
Optimal and Order-optimal Gated Priority-based Greedy Policies for Two-layer Multi-item Order Fulfillment
The paper introduces Gated Priority-based Greedy policies for real-time multi-item order fulfillment in two-layer e-commerce distribution networks, addressing the trade-off between immediate cost savings and inventory preservation. Using an adversarial online model with multiple front distribution centers (FDCs), a regional center (RDC), and time-varying costs, the authors derive competitive-ratio guarantees and near-matching lower bounds. Numerical experiments demonstrate superior performance against myopic and forecast-based benchmarks, providing managerial insights on inventory protection and order splitting.
online fulfillmentcompetitive-ratiotwo-layer distributiongreedy policiesmulti-item orders
Conformalised imprecise inference for robust extrapolation under limited data
The paper introduces a conformalised imprecise inference framework for robust extrapolation under distributional shift, addressing limitations in existing uncertainty quantification methods. The model-agnostic approach augments predictive models with imprecision and distance awareness, yielding valid probability boxes (p-boxes) that maintain coverage guarantees while adaptively expanding uncertainty in extrapolation regimes. Experiments on synthetic and benchmark datasets demonstrate improved robustness and reliable coverage compared to standard probabilistic methods, particularly in data-limited scenarios.
conformal predictionimprecise probabilitydistributional shiftuncertainty quantificationprobability boxes
The Quantization Benefits of Residual-Free Transformers
The work demonstrates that residual connections in transformers amplify non-Gaussian activation distributions, increasing quantization error compared to residual-free architectures. Through kurtosis analysis and controlled experiments, it shows residual mixing exacerbates heavy-tailed activations, while dense mixing contracts them. The authors enable trainable residual-free transformers via orthogonal initialization, second-order optimization, and depth-scaled attention temperature, achieving near-Gaussian activations. Results on language tasks show marginally lower full-precision accuracy but significantly improved 4-bit quantization robustness (e.g., <1% drop vs. >5% in residual models), revealing an accuracy-compressibility trade-off in transformer design.
quantization errorresidual-free transformersexcess kurtosisorthogonal initializationattention temperature
The Timing Dependencies of Trust: Speed, Accuracy, and cBCI Neuro-Decoupling in Human-AI Teams
This study examines how AI intervention timing (Fast/Less-Accurate vs. Slow/Accurate) affects Human-AI team performance in a cBCI-mediated drone task. Using a 2D Adaptive Riemannian Oracle to map spatial covariance, 17 operators performed search tasks under cognitive workload. Fast AI induced blind compliance (50.2% accuracy), while Slow AI caused hesitation (61.1% accuracy) but eventual recovery (100%). Hybrid Fusion improved Fast AI teams by 7.6% and accelerated Slow AI teams by 6.9%, demonstrating that cBCI synergy depends on temporal trust dynamics.
cbcriemannian oraclehybrid fusiontemporal dynamicscognitive workload
UNATE: UNsupervised ATomic Embedding for crystal structures property prediction
UNATE proposes unsupervised atomic embeddings for crystal property prediction, addressing data scarcity by leveraging unlabeled structural information. The framework combines a denoising autoencoder with contrastive learning to learn robust atomic representations, which replace raw atomic numbers as input features. Experiments demonstrate a 2.7% accuracy improvement over full-data baselines, with gains of up to 10% when only 25% of the labeled data is available.
unsupervised learningatomic embeddingscontrastive learningdenoising autoencodermaterials discovery
When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards
The paper introduces Reinforcement Learning with Active Verifiable Rewards (RLAVR), a method that combines actively acquired ground-truth labels with pseudo-labels to stabilize training in Reinforcement Learning with Verifiable Rewards (RLVR). The authors propose the Corrective Advantage Gap (CAG) metric to identify high-value samples and develop Correction-Aware Reliability Estimation (CARE), a practical acquisition policy. Experiments across various domains, model families, and scales demonstrate RLAVR's effectiveness in improving performance under limited annotation budgets.
reinforcement learningverifiable rewardsactive learningcorrective advantage gaplabel acquisition
Minimax Limits of k-Fold Cross-Validation via Majority
(No summary returned.)
TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning
The paper introduces Trajectory-Informed Advantage Reweighting (TIAR) for LLM abstention learning, extending ternary reward approaches with dynamic reward reweighting during Group Relative Policy Optimization (GRPO) training. Methodologically, it leverages trajectory-based confidence indicators to calculate abstention advantages, focusing on hallucination reduction rather than truthfulness improvement. Evaluated on AbstentionBench, TIAR achieves state-of-the-art abstention F1 scores across 5/6 categories, outperforming static ternary baselines on 17/31 datasets while maintaining baseline accuracy.
trajectory-informedadvantage reweightingabstention learninggrpohallucination reduction
Geometric Evolution Maps: Extracting Stable Concept Probes from Transformer Residual Streams
The paper introduces Geometric Evolution Maps (GEMs), a method for identifying stable concept probe directions in transformer residual streams by tracking directional trajectories and detecting handoff layers where concept representations cease rotating. GEMs analyze 23 architectures (70M-14B parameters) across 17 concept types, showing that concept representations undergo substantial directional rotation (mean entry-to-exit cosine similarity 0.233 in Concept Allocation Zones). GEM-extracted probes outperform peak-layer probes in 66.2% of 391 concept-model pairs, with performance varying by attention type (MHA models favor handoff in 78.3% of cases vs. 47.1% for GQA). An adaptive ablation rule improves probe quality in 75.9% of near-final-layer cases (+7.44pp mean gain).
geometric evolution mapsconcept probesresidual streamsconcept allocation zonedirectional rotation
Context-Instrumental Data Distillation for Kubernetes Manifest Generation: Method and Experimental Evaluation
The paper introduces context-instrumental data distillation for fine-tuning Small Language Models (≤4B params) to generate Kubernetes manifests. The method combines synthetic data generation (via DeepSeek-V4 Flash API) and reverse instruction extraction from real YAMLs, filtered by validators and domain context. Unlike KL-divergence distillation, it uses supervised fine-tuning (LoRA on Qwen2.5-Coder-1.5B-Instruct, CPU-only). On K8s-Distill-Pilot (200 test samples), strict formatting yielded 91.5% full-pass@1, outperforming naive dataset scaling.
small language modelskubernetes manifestsdata distillationlora fine-tuningdomain-specific languages
Clarify, Abstain or Answer? Strategising in Conversation with Belief-Augmented Generation
The paper introduces Belief-Augmented Generation (BAG), a method that grounds large language models (LLMs) in their own belief state by prompting them to reason over K sampled responses. This enables strategic decisions—answering, clarifying, or abstaining—in conversational settings. Evaluated in a multi-turn ambiguous QA task, BAG improves accuracy across six LLMs and yields more faithful strategy decisions than prompt-only baselines, though distinguishing clarification from abstention remains challenging.
belief-augmented generationlarge language modelsprobabilistic uncertaintyselective predictionmulti-turn qa
Branched Signature Kernel Solvers for ODEs with rough Single-Trajectory signals
The authors introduce a branched signature kernel solver for ODEs driven by single observed trajectories of rough signals, addressing applications like earthquake engineering and finance. The method combines a count-sampling construction to generate nested training paths from a single observation and a kernel-collocation framework that places ansatzes on derivatives or integrated solutions. A universal approximation theorem is proven using the Hairer–Kelly morphism, and the solver is extended to online settings with linear updates or Newton steps. Experiments on six benchmarks demonstrate accurate and stable performance across diverse regimes.
branched signature kernelcount-samplingkernel-collocationhairer–kelly morphismonline updates
Visual-Redundancy-Controlled Parallel Decoding for Diffusion-Based Multimodal Large Language Models
The paper introduces Visual-Redundancy-Controlled Decoding (VRCD), an inference-time method for diffusion-based multimodal LLMs (dMLLMs) that mitigates visual redundancy in parallel token decoding. VRCD quantifies redundancy via a Visual Redundancy Index (VRI) and uses token-to-image attention to prioritize visually complementary positions, reducing step-level grounding overlap. Evaluated on M^3CoT and MMBench, VRCD achieves relative accuracy gains of 18.8% and 6.9% respectively over confidence-based decoding, with minimal runtime overhead.
diffusion-based mllmsparallel decodingvisual redundancy indextoken-to-image attentionmultimodal benchmarks
On Reliability of Efficient Membership Inference Vulnerability Evaluation
The work identifies two reliability flaws in efficient membership inference attack (MIA) evaluation pipelines and proposes corrective measures. First, it demonstrates that concatenating MIA scores across multiple individuals for low-FPR TPR estimation creates miscalibration across per-sample FPRs, undermining differential privacy audits; a post-processing calibration method is introduced. Second, it reveals a finite population bias in Carlini et al.'s (2022) likelihood-ratio attack (LiRA) implementation, causing upward bias in per-sample vulnerability estimates. The analysis focuses on statistical miscalibration and computational efficiency trade-offs in MIA vulnerability assessment.
membership inference attacksfalse positive ratelikelihood-ratio attackdifferential privacyfinite population bias
Geometry Adaptive Counterfactual Distribution Learning with Diffusion-Guided Smoothing
The authors propose geometry-adaptive estimators for counterfactual distribution learning in high-dimensional outcomes, addressing limitations of isotropic smoothing. Their method integrates diffusion-informed smoothing for counterfactual densities and diffusion-informed score smoothing, combining causal nuisance adjustment with geometry-adaptive localization driven by diffusion score information. This approach removes first-order nuisance bias while aligning smoothing with local outcome geometry, yielding asymptotic expansions, risk bounds, and inference procedures. Under structural geometry conditions, stochastic error is governed by an effective dimension induced by the diffusion-guided kernel rather than the ambient dimension. Semi-synthetic experiments on CelebA demonstrate steeper error decay, validating the effective-dimension theory.
counterfactual distributiondiffusion-guided smoothinggeometry-adaptive localizationnuisance biaseffective dimension
On the Benefits of Free Exploration for Regret Minimization in Multi-Armed Bandits
The paper introduces a stochastic multi-armed bandit problem with a free exploration phase before regret accumulation, formalizing it as regret minimization with free exploration. The authors propose UFE-KLUCB-H, a two-phase algorithm combining a principled free exploration policy (UFE) and a history-aware regret minimization policy (KLUCB-H). Instance-dependent upper bounds show UFE-KLUCB-H achieves strictly lower regret than policies without free exploration, while lower bounds demonstrate near-optimality for two-valued bandits. Simulations confirm the benefits of forced exploration and adaptivity.
multi-armed banditsregret minimizationfree explorationinstance-dependent boundsklucb-h
NPSolver: Neural Poisson Solver with Iterative Physics Supervision
The paper introduces NPSolver, a neural Poisson solver trained via iterative physics supervision without solution labels, addressing instability in physics-informed training and data scarcity. The method uses preconditioned conjugate gradient (PCG) steps to refine predictions, providing a stable training signal, with theoretical justification for stop-gradient optimization. A Boundary-Aware Transolver (BA-Transolver) architecture explicitly handles mixed boundary conditions. Evaluations on 2D/3D irregular geometries show NPSolver outperforms physics-informed and data-driven baselines, with demonstrated efficacy in thermal boundary control tasks.
poisson equationneural operatorphysics-informed trainingpreconditioned conjugate gradientboundary-aware architecture
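The refinement step mentioned in this summary can be sketched with a textbook preconditioned conjugate gradient (PCG) loop. Everything specific here is an assumption for illustration: a 1D Poisson matrix, a Jacobi (diagonal) preconditioner, and a zero vector standing in for a network prediction; the paper's preconditioner and training loop are not reproduced.

```python
import numpy as np

def pcg(A, b, x0, inv_diag, num_steps, tol=1e-12):
    """Run up to num_steps PCG iterations on A x = b, starting from x0.

    inv_diag is the inverse of diag(A), i.e. a Jacobi preconditioner.
    """
    x = np.array(x0, dtype=float)
    r = b - A @ x                  # residual of the current guess
    z = inv_diag * r               # preconditioned residual
    p = z.copy()                   # first search direction
    rz = r @ z
    for _ in range(num_steps):
        if np.linalg.norm(r) < tol:
            break                  # already converged; avoids dividing by ~0
        Ap = A @ p
        step = rz / (p @ Ap)
        x += step * p
        r -= step * Ap
        z = inv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Hypothetical problem: 1D Poisson operator tridiag(-1, 2, -1) with unit forcing.
n = 32
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
inv_diag = 1.0 / np.diag(A)

prediction = np.zeros(n)                       # stand-in for a network output
refined = pcg(A, b, prediction, inv_diag, num_steps=5)
converged = pcg(A, b, prediction, inv_diag, num_steps=2 * n)

x_star = np.linalg.solve(A, b)

def energy_error(x):
    """Squared A-norm of the error, which CG reduces at every step."""
    e = x - x_star
    return float(e @ (A @ e))

print(energy_error(prediction), energy_error(refined), energy_error(converged))
```

In a label-free training setup of the kind described, a few such steps applied to the model's own prediction could serve as a fixed (stop-gradient) target; that coupling is left out of the sketch.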
Efficient Benchmarking Is Just Feature Selection and Multiple Regression
The paper demonstrates that efficient benchmarking of LLMs can be significantly improved by reformulating it as a multiple regression problem with feature selection. The proposed method combines kernel ridge regression for score prediction with minimum redundancy maximum relevance (mRMR) for optimal question subset selection. Results show superior performance in prediction error (MAE/RMSE) and ranking correlation (Spearman ρ/Kendall τ) across benchmarks, while being computationally faster and more stable than existing approaches.
efficient benchmarkingkernel ridge regressionminimum redundancy maximum relevancefeature selectionmultiple regression
MDGMIX: Boundary-Aware Subgraph Mixing for Multi-Domain Graph Pre-Training
MDGMIX introduces a boundary-aware subgraph mixing framework for efficient multi-domain graph pre-training, addressing data redundancy in existing joint training approaches. The method constructs challenging mixed-domain subgraphs via boundary node selection, employing hierarchical discrimination (coarse-grained domain discrimination and fine-grained domain decomposition losses) to separate shared and domain-specific patterns. Experiments show MDGMIX outperforms baselines in few-shot classification while improving time/memory efficiency, aided by a lightweight prompt weighting mechanism for knowledge transfer.
multi-domain graph pre-trainingboundary-aware subgraph mixinghierarchical discriminationfew-shot classificationprompt weighting
Concept Unlearning via Cross-Attention Activation Projection for Diffusion Models
PURE introduces a closed-form concept-unlearning method for text-to-image diffusion models by projecting cross-attention activations rather than relying on text embeddings. The approach constructs forget and retain bases from per-layer cross-attention activations during denoising, applying a single linear projector to key and value weights. Evaluated on a holistic benchmark with ten concepts, PURE reduces target leakage under paraphrased and adversarial prompts while maintaining retain-concept fidelity, achieving superior forget-retain trade-offs compared to existing methods.
concept unlearningcross-attention activationdiffusion modelsclosed-form methodtext-to-image generation
Invariant-Based Weight Sharing for Message Passing
The paper introduces ShareGNN, a structure-aware weight sharing principle for message-passing neural networks (MPNNs) that leverages graph invariants to enable systematic weight reuse across structurally equivalent subgraphs. The method employs a novel encoder-decoder architecture with learnable adjacency and transformer-like connectivity, providing explicit control over model complexity. Experiments on synthetic and real-world datasets demonstrate improved performance over standard MPNNs, competitive expressivity beyond the 1-WL test, and scalability to large graphs.
mpnnsgraph invariantsweight sharingencoder-decoder1-wl test
DeGRe: Dense-supervised Generative Reranking for Recommendation
DeGRe introduces a dense-supervised generative reranking framework to address heuristic label bias and credit assignment problems in multi-stage recommender systems. The method employs an offline-online decoupled design, utilizing a Lookahead Evaluator with cumulative regression and beam search to mine high-value sequences offline, then distilling step-wise value estimations into a lightweight Online Generator for efficient greedy decoding during online inference. Experiments show DeGRe outperforms baselines on public benchmarks and industrial datasets, with successful deployment on Taobao Flash Shopping improving online recommendations.
generative rerankinglookahead evaluatorcumulative regressionbeam searchgreedy decoding
Latent Representation Alignment for Offline Goal-Conditioned Reinforcement Learning
The paper proposes Latent-Aligned Value Learning (LAVL), an offline goal-conditioned reinforcement learning (GCRL) method addressing erroneous generalization in value functions for long-horizon tasks. LAVL integrates latent-representation-based value generalization with hierarchical planning, introducing inductive bias to improve reliability. Evaluated on OGBench, LAVL outperforms existing methods on 20/22 datasets, particularly excelling in long-horizon and trajectory-stitching tasks where prior approaches degrade. The code is publicly available.
offline reinforcement learninggoal-conditioned rllatent representationvalue functionhierarchical planning
The Behavioral Credibility Trilemma: When Calibrated Autonomy Becomes Impossible
The paper establishes the Behavioral Credibility Trilemma, proving no RL policy with confidence-gated autonomy can simultaneously maximize helpfulness, calibration, and autonomy under rational oversight for tasks beyond agent competence. Using geometric analysis, the Behavioral Perturbation Lemma quantifies confidence inflation (scaling as $w_A/(2 w_C)$ for Brier score) and detection requirements ($\Omega(1/\Delta^2)$ observations). Theoretical results show the principal's optimal oversight rule must be non-affine, making the trilemma unconditional across log-concave policy families. A 540-configuration Best-of-N experiment confirms five pre-registered hypotheses (effect sizes $d = 1.10$ to $5.32$) and reveals plateau-truncated frontier geometry in achievable $(H, C, A)$ space.
reinforcement learningconfidence calibrationstrict propernesslog-concave densitiesbehavioral trilemma
FLOATBench: A Dataset and Benchmark for Floating Offshore Wind Turbine Tower Fatigue
FLOATBench introduces a public benchmark dataset for floating offshore wind turbine (FOWT) tower fatigue prediction, addressing the lack of standardized evaluation in the field. The dataset comprises 582,120 per-section fatigue-damage labels derived from 19,404 high-fidelity OpenFAST simulations across three 22 MW FOWT tower geometries. It features a regime-aware alpha-shape partition of the joint wind/wave operating envelope, stratifying test points into in-train, interpolation, and extrapolation regimes. The benchmark includes a reproducible evaluation harness with three protocol levels: random validation, within-tower regime-aware evaluation, and cross-tower transfer. The regime-aware protocol reveals rank shifts between global and extrapolation performance, highlighting limitations of random-split leaderboards.
fatigue-damage predictionopenfast simulationsregime-aware evaluationalpha-shape partitiontabular surrogate modeling
Machine Learning Multiscale Interactions
The paper introduces Multiscale Structural Ensemble (MuSE), a hierarchical model addressing multiscale interactions in physical systems through Soft Coarse-Graining Pooling. MuSE integrates with MLFFs like SO3krates, MACE, and PaiNN to capture long-range many-body effects across molecules and materials. Benchmarks demonstrate MuSE's accuracy in Hessian-based tests, biomolecular folding, and molecule-graphene nanostructures, outperforming existing long-range ML models in quantum-mechanical interaction modeling.
multiscale structural ensemblesoft coarse-graining poolingmachine learning force fieldslong-range many-body effectsquantum-mechanical interactions
PowLU: An Activation Function for Stable Pre-Training of LLMs
The paper introduces Power Linear Unit (PowLU), a stable activation function designed for large-scale LLM pre-training, addressing numerical instability in SwiGLU caused by its quadratic amplification of large inputs. PowLU employs a rational power function to achieve adaptive nonlinearity, improving representation ability and training stability in spike regions. Theoretical justification for PowLU's properties is provided. Scaling law experiments confirm consistent performance across model sizes, and empirical results with the Ling architecture (7.9B and 124B parameters) show PowLU achieves competitive results against SwiGLU and SwiGLU-Clip, enhancing LLM scalability.
activation functionlarge language modelsnumerical instabilityscaling lawadaptive nonlinearity
How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws
The paper introduces quality-aware functional scaling laws to optimize joint scheduling of data quality and batch size in LLM training, revealing two regimes for high-quality data utilization. In noise-limited phases, high-quality data acts as a signal amplifier via reduced batch sizes; in signal-limited phases, it suppresses noise via late-stage placement. The proposed Drop-Stable-Rampup method outperforms Warmup-Stable-Decay and Cosine-decay by +1.70 and +2.98 average accuracy respectively on a 15B Mixture-of-Experts model, with GSM8K (+4.23) and MATH (+2.80) showing notable gains.
functional scaling lawsdata-quality schedulingnoise-limited regimesignal-limited regimedrop-stable-rampup
Evaluating passing decision-making in professional football: An enhanced MPNN approach to Receiver Selection
The paper introduces a Graph Neural Network (GNN) framework for predicting Receiver Selection in football by modeling on-field interactions as dynamic graphs. The method employs a Message-Passing Neural Network (MPNN) with nodes representing players (positional/contextual features) and edges encoding passing-line metrics (distance, angle, pressure). Trained on synchronized tracking and event data via an optimized Needleman-Wunsch Algorithm pipeline, the model achieves competitive accuracy in identifying the actual receiver and state-of-the-art top-3 accuracy. It additionally quantifies option likelihood, threat, and creativity, enabling rapid analysis of >1,000 passes.
graph neural networkmessage-passing neural networkreceiver selectionneedleman-wunsch algorithmdynamic graphs
Don't Retrain, Just Reuse: Recovering Dual-Target Molecules from Single-Target Diffusion Models
The authors propose REUSE, a hierarchical evolutionary input-space search framework that recovers dual-target molecules from frozen single-target diffusion models without retraining or modifying the denoising process. REUSE formulates the task as a constrained multi-objective optimization problem, combining pair-conditioned exploration with structured multi-stage selection to enforce dual-target affinity, chemical quality, and diversity. Experiments demonstrate that REUSE achieves a 20.9-percentage-point improvement in Dual High Affinity over prior baselines while maintaining competitive molecular quality, outperforming methods that modify the diffusion process.
dual-target moleculesdiffusion modelsmulti-objective optimizationinput-space searchevolutionary framework
PAC Learning with Bandit Feedback: Sharp Sample Complexity in the Realizable Setting
The work characterizes the optimal sample complexity of multiclass PAC learning with bandit feedback in the realizable setting, sharp up to logarithmic factors. It introduces the bandit DS dimension, a combinatorial measure based on generalized pseudo-boxes that aggregates neighbor counts across coordinates, contrasting with the DS dimension's coordinate counting. A ListCascade-based algorithm achieves the derived upper bound, connecting bandit learning to list learning. Theoretical results show sample complexity scaling with total neighbor counts rather than coordinate-wise structure.
pac learningbandit feedbackds dimensionsample complexityrealizable setting
Stochastic Estimation of the Layer-wise Hessian Trace for Monitoring Neural-network Training
The authors propose a stochastic estimator for the layer-wise trace of the Hessian matrix during neural-network training, addressing the inaccessibility of explicit curvature information in large models (P∼10^6–10^8). The method combines Hutchinson's stochastic trace estimator with a single Hessian-vector product, enabling unbiased per-layer trace estimates via one backward pass. Theoretical analysis reveals weight-sharing introduces bias unless layer-wise Hessians are assembled before differentiation, and derives variance bounds leading to a recommended probe count K∈[5,10]. Applied to ResNet-18/34 and VGG-11 on CIFAR-10/100, the estimator detects label memorization with 179/180 true positives at 16/120 false alarms using a cumulative-sum decision rule.
hessian tracestochastic estimatorweight sharinglabel memorizationresnet
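A minimal sketch of the layer-wise Hutchinson idea: with a Rademacher probe z, the quantity z_l · (Hz)_l is an unbiased estimate of the trace of layer l's diagonal Hessian block, so one Hessian-vector product yields every per-layer estimate. The explicit matrix, layer split, and probe count below are hypothetical stand-ins; in real training the product would come from automatic differentiation.

```python
import numpy as np

def layerwise_hutchinson_trace(hvp, layer_slices, dim, num_probes=10, seed=0):
    """Estimate the trace of each layer's diagonal Hessian block.

    hvp          : function mapping a vector v to H @ v
    layer_slices : list of slices, one per layer, into the flat parameter vector
    """
    rng = np.random.default_rng(seed)
    totals = np.zeros(len(layer_slices))
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=dim)    # Rademacher probe
        hz = hvp(z)                              # one Hessian-vector product
        for i, sl in enumerate(layer_slices):
            totals[i] += z[sl] @ hz[sl]          # z_l . (H z)_l
    return totals / num_probes

# Hypothetical "Hessian": a random symmetric matrix split into two layers.
rng = np.random.default_rng(42)
M = rng.standard_normal((20, 20))
H = M + M.T
layers = [slice(0, 8), slice(8, 20)]

true_traces = np.array([np.trace(H[sl, sl]) for sl in layers])
estimates = layerwise_hutchinson_trace(lambda v: H @ v, layers, dim=20, num_probes=2000)
print(true_traces, estimates)
```

The demo uses many probes only to make the agreement easy to see; the summary's recommended range of 5 to 10 probes trades accuracy for cost.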
Opportunistic Target Selection: Early Directional Commitment for Query-Efficient Black-Box Adversarial Attacks
The paper introduces Opportunistic Target Selection (OTS), a query-efficient wrapper for black-box adversarial attacks that mitigates class drift by switching untargeted attacks to targeted objectives early. OTS operates without gradient access, architectural changes, or target-class knowledge, functioning as a margin-loss surrogate. Evaluated on three score-based attacks (SimBA, Square Attack, Bandits) across five ImageNet classifiers (4,500 runs), OTS achieves up to +27 pp success rate improvement and 43% query reduction on ResNet-50 for random-search attacks, though it proves redundant for gradient-estimation or margin-loss attacks. Bimodal difficulty distributions on adversarially-trained models nullify its benefits.
black-box adversarial attacksclass driftopportunistic target selectionquery efficiencymargin-loss surrogate
Closed-Form Node Classification with Exact Graph Unlearning
The paper introduces a closed-form framework for node classification that matches or exceeds gradient-trained GNNs while enabling exact graph unlearning. For assortative graphs, it combines SGC-style propagation with Ridge regression; for heterophilous graphs, it proposes LCF-Net, a layer-wise closed-form network with Gaussian kernel-Ridge heads. Evaluated on 14 benchmarks (including ogbn-arxiv and ogbn-proteins), the method outperforms vanilla 2-layer GCN/SAGE/GAT on 9/9 datasets and ties tuned deep models within one standard deviation on 9/12 small benchmarks. The deterministic solutions permit exact unlearning for graph modifications, with 21–45× speedups over full re-solving and 10^6× over retraining, while theoretical analysis proves K-hop locality for Ridge components.
closed-form solversgraph unlearningridge regressionnode classificationheterophilous graphs
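A rough sketch of the assortative-graph recipe in the item above (SGC-style propagation followed by a Ridge solve) is given below. It assumes the usual symmetric normalization with self-loops and a plain one-hot Ridge target; the function name, the toy graph, and the hyperparameters are illustrative and not taken from the paper, and LCF-Net and the fast unlearning update are not shown.

```python
import numpy as np

def sgc_ridge_fit(adj, features, labels_onehot, train_idx, hops=2, lam=1.0):
    """Propagate features over the graph, then solve Ridge regression in closed form."""
    n = adj.shape[0]
    a_hat = adj + np.eye(n)                                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    s = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]   # D^-1/2 (A + I) D^-1/2
    z = features.astype(float)
    for _ in range(hops):
        z = s @ z                                           # parameter-free propagation
    zt = z[train_idx]
    gram = zt.T @ zt + lam * np.eye(z.shape[1])
    w = np.linalg.solve(gram, zt.T @ labels_onehot[train_idx])
    return z, w

# Toy graph: two disconnected triangles, one labeled node in each.
block = np.ones((3, 3)) - np.eye(3)
adj = np.block([[block, np.zeros((3, 3))], [np.zeros((3, 3)), block]])
features = np.array([[1.0, 0.0], [0.5, 0.5], [0.5, 0.5],
                     [0.0, 1.0], [0.5, 0.5], [0.5, 0.5]])
labels_onehot = np.eye(2)[np.array([0, 0, 0, 1, 1, 1])]
z, w = sgc_ridge_fit(adj, features, labels_onehot, train_idx=[0, 3], lam=0.1)
pred = (z @ w).argmax(axis=1)
print(pred)
```

Because the solve is deterministic, refitting after deleting a node or edge reproduces exactly the model that would have been obtained without it, which is the property the exact-unlearning claim rests on.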
StrTransformer: Source-Wise Structured Transformers for Unsupervised Blind Source Recovery
StrTransformer introduces a source-structured Transformer framework for unsupervised blind source recovery, replacing latent variable encoders with direct optimization of a latent source matrix and observation-space mixer. Each source trajectory is processed by a dedicated Transformer branch employing multi-scale patch tokens, random masking, and locality-biased attention, with structural constraints enforced via masked patch reconstruction energy. An ordered multi-scale controller promotes branch specialization through learned patch-scale weights and locality attention slopes. Theoretical analysis examines objective decoupling/coupling and symmetry reduction, while empirical results demonstrate branch convergence to distinct temporal-scale structures and source-aligned latent trajectories.
blind source recoverystructured transformersmulti-scale patcheslocality-biased attentionpermutation symmetry
3D Magnetic Field Reconstruction and Mapping with Physics-Informed Neural Networks
This study introduces a Physics-Informed Neural Network (PINN) framework for high-precision 3D magnetic field reconstruction, integrating Maxwell's equations into the loss function to enforce divergence-free and curl-free conditions. The method incorporates physics-residual losses at measurement points, ensuring physical consistency beyond random collocation. Validation achieves $10^{-4}$ reconstruction accuracy in simulations (10× improvement over benchmarks) and sub-percent relative accuracy ($10^{-3}$ level) in experimental coil assembly tests, demonstrating robust performance in restricted sensor environments.
physics-informed neural networksmagnetic field reconstructionmaxwell's equationsdivergence-freecurl-free
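The divergence-free and curl-free conditions in the item above can be checked numerically at a point, as in the sketch below. It assumes central finite differences in place of the automatic differentiation a PINN would normally use, and the function name and test field are hypothetical.

```python
import numpy as np

def divergence_and_curl(field, point, h=1e-5):
    """Approximate div B and curl B at a point with central differences.

    field: callable mapping a 3-vector position to a 3-vector field value.
    """
    p = np.asarray(point, dtype=float)
    jac = np.zeros((3, 3))                      # jac[i, j] = dB_i / dx_j
    for j in range(3):
        step = np.zeros(3)
        step[j] = h
        jac[:, j] = (field(p + step) - field(p - step)) / (2.0 * h)
    div = np.trace(jac)
    curl = np.array([jac[2, 1] - jac[1, 2],
                     jac[0, 2] - jac[2, 0],
                     jac[1, 0] - jac[0, 1]])
    return div, curl

# B = (x, y, -2z) is the negative gradient of a harmonic potential,
# so both residuals should vanish.
source_free = lambda r: np.array([r[0], r[1], -2.0 * r[2]])
div, curl = divergence_and_curl(source_free, [0.3, -0.2, 0.5])
physics_residual = div ** 2 + float(curl @ curl)   # would be added to the data loss
print(div, curl, physics_residual)
```

In a training loop, a residual of this kind would be evaluated at measurement points as well as at collocation points, per the summary.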
Reinforcement Learning from Denoising Feedback
The paper introduces Reinforcement Learning from Denoising Feedback (RLDF), a novel paradigm for policy loss estimation in diffusion language models (dLLMs). RLDF leverages feedback from rollout and training processes, optimizing models toward clipped clean states from intermediate noisy states with weighted timestep sampling. Experiments show RLDF improves performance and generalizability across LLaDA and Dream architectures on multiple reasoning benchmarks. The work also presents Drift, a training framework for dLLMs.
reinforcement learningdenoising feedbackdiffusion language modelspolicy loss estimationdrift framework
Insuring Every Action: An Authority Frontier Framework for Runtime Actuarial Control of Autonomous AI Agents
The paper contributes a benchmark-ready framework for runtime actuarial control of autonomous AI agents' side-effect-bearing actions. The proposed Actuarial Action Interface (AAI) enforces deterministic runtime contracts via (i) a quote-bind-commit protocol with capability tokens, (ii) a seven-class action taxonomy for authority normalization, and (iii) pathwise reserve coverage under α-spending. Evaluated across four agentic environments (database, refunds, retail, airline), AAI exhibits domain-specific reserve demands (22x variance in Capital@50) while preventing realized loss in a live Postgres panel with three Azure-hosted models. The Authority Frontier primitive quantifies released autonomy per reserve level, revealing low-reserve refusal patterns.
actuarial action interfaceauthority frontierruntime contractreserve capitalside-effect-bearing actions
When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift
The study identifies a representational failure mode in weak-to-strong (W2S) preference learning under distribution shift, where strong models fine-tuned on weak labels fail to transfer across preference domains. To address this, the authors propose Representation Anchoring (Anchor), a regularizer that constrains representation drift during fine-tuning while permitting task-adaptive updates. Experiments across multiple preference datasets and model families show Anchor improves out-of-distribution transfer by 15-30% while maintaining in-distribution performance, revealing limitations in current W2S reward modeling paradigms.
weak-to-strong generalizationpreference learningrepresentation anchoringdistribution shiftreward modeling
CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
CUA-Gym introduces a scalable pipeline for generating verifiable training data for computer-use agents (CUAs), addressing the bottleneck of deterministic reward construction. The method employs Generator and Discriminator agents to co-generate task instructions, environment states, and reward functions, with iterative refinement via an orchestrator and quality filtering via LLM voting and agent rollouts. The pipeline produces CUA-Gym (32,112 verified RLVR tuples across 110 environments) and CUA-Gym-Hub (mock web applications). Trained agents (A3B, A17B) achieve 62.1% and 72.6% on OSWorld-Verified, outperforming prior open-source CUAs and demonstrating transfer to WebArena.
reinforcement learning with verifiable rewardscomputer-use agentsgenerator-discriminator pipelinellm majority votinggspo optimization
Analogies between Transformer Layers and Power Method
The paper establishes an analogy between transformer layer operations (projections, layer normalizations) and the power method, demonstrating that tokens progressively align with the principal eigenvector of a matrix formed by the product of output and value weight matrices. In transformers with shared weights across layers, this alignment becomes empirically pronounced and analytically tractable. The theoretical framework further enables steering transformer outputs toward arbitrary token-space directions by leveraging eigenvector properties.
transformerpower methodeigenvectorlayer normalizationshared weights
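The power-method side of the analogy above can be sketched in a few lines. The matrix here is a small symmetric stand-in rather than an actual product of output and value weights, and plain L2 normalization is used as a simplified proxy for layer normalization; both are assumptions made for illustration.

```python
import numpy as np

def power_iteration(m, x0, steps=100):
    """Repeatedly apply m and renormalize; x converges to the dominant eigenvector."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(steps):
        x = m @ x                      # "layer": the same linear map every time (shared weights)
        x = x / np.linalg.norm(x)      # normalization step, standing in for layer norm
    return x

m = np.array([[2.0, 1.0],
              [1.0, 2.0]])             # eigenvalues 3 and 1
x = power_iteration(m, np.array([1.0, 0.0]))
principal = np.array([1.0, 1.0]) / np.sqrt(2.0)
alignment = abs(float(x @ principal))  # cosine with the principal eigenvector
print(alignment)
```

The misaligned component shrinks by the eigenvalue ratio (here 1/3) per step, which gives some intuition for why alignment would grow with depth when weights are shared.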
Courtroom Analogy: New Perspective on Uncertainty-Aware Classification
The paper proposes a courtroom analogy for uncertainty-aware classification, framing it as a structured debate among class-specific advocates. Methodologically, it introduces Mixture of Dirichlet EXperts (MoDEX), a neural architecture that models each advocate's opinion as a Dirichlet distribution with decomposed concentration parameters (shared evidence and class-specific advocacy), yielding interpretable uncertainty aggregation. Experiments show MoDEX achieves state-of-the-art uncertainty quantification performance while providing semantically meaningful uncertainty estimates.
uncertainty quantificationdirichlet distributioninterpretabilityclassificationneural architecture
Towards the Connection between Activation Sparsity and Flat Minima
This work establishes a theoretical connection between activation sparsity in MLP blocks of Transformers and flat minima in loss landscapes, proposing that sparsity emerges from the ratio between augmented flatness and the product of input norm and activation gradient. The authors introduce derivative sparsity, which generalizes activation sparsity under ReLU and enables backward propagation pruning. Three plug-and-play methods are proposed to encourage sparsity by manipulating this ratio. Experiments on ImageNet-1K and C4 datasets demonstrate 36% improvement in inference sparsity and 50% in training sparsity compared to vanilla Transformers, indicating significant computational cost reduction.
activation sparsityflat minimamlp blocksderivative sparsitytransformers
Learning Sparse Compositional Functions with Norm-Constrained Neural Networks
The paper develops a theoretical framework for analyzing norm-constrained deep neural networks learning sparse compositional functions represented by directed acyclic graphs (DAGs). By measuring complexity via parameter norms rather than counts, the work establishes approximation rates and excess risk bounds in overparameterized regimes. Results demonstrate that deep networks avoid the curse of dimensionality by exploiting hierarchical structure, with applications to multi-index models, binary trees, and general compositional architectures. The analysis covers all efficiently Turing-computable functions through their sparse compositional representations.
sparse compositional functionsnorm-constrained networksdirected acyclic graphsapproximation ratescurse of dimensionality
Decoding Stimulus Reconstruction-Based Auditory Attention Robustly in Unbalanced EEG Datasets
This study introduces a leave-one-paired-envelope-out (LOPEO) cross-validation protocol to address inflated decoding accuracy in stimulus reconstruction-based auditory attention decoding (AAD) from EEG signals on unbalanced datasets. Using three publicly available EEG-AAD datasets (KUL, DTU, NJU cEEGrid), the authors demonstrate that deep neural networks (DNNs) tend to overestimate performance on unbalanced data. LOPEO effectively mitigates this issue, providing a robust evaluation framework for existing unbalanced datasets. The results validate LOPEO's efficacy in preventing performance overestimation, offering a principled solution for AAD research with imbalanced data.
auditory attention decodingelectroencephalogramstimulus reconstructioncross-validationdeep neural networks
DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning
The paper introduces Dynamic Variance-adaptive Advantage Optimization (DVAO), a method for multi-reward reinforcement learning that dynamically adjusts combination weights based on empirical reward variance. DVAO addresses limitations of Reward Combination and Advantage Combination by maintaining bounded advantage magnitudes and incorporating cross-objective regularization. Experiments on mathematical reasoning and tool-use benchmarks with Qwen3 and Qwen2.5 models show DVAO outperforms baselines, achieving superior Pareto frontiers and training stability.
reinforcement learningadvantage optimizationmulti-rewarddynamic variancepareto frontier
Generalized Evidential Deep Learning: From a Bayesian Perspective
The authors formalize Evidential Deep Learning (EDL) within a generalized Bayesian framework, providing theoretical grounding for prior specification, posterior updates, and training objectives. Their proposed Generalized Evidential Deep Learning (GEDL) unifies existing EDL variants by explicitly disentangling components and linking them to Bayesian distributional uncertainty via asymptotic analysis. Experiments show GEDL achieves comparable performance to specialized variants in classification, uncertainty estimation, and OOD detection while offering systematic extensibility.
evidential deep learningbayesian frameworkuncertainty estimationood detectionasymptotic analysis
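For readers unfamiliar with the EDL setup that the item above generalizes, the sketch below shows the common convention of mapping non-negative evidence to Dirichlet parameters under a uniform prior (alpha = evidence + 1). This is the widely used baseline form, assumed here for illustration; GEDL's generalized prior and update rules are not reproduced.

```python
import numpy as np

def dirichlet_summary(evidence):
    """Turn per-class evidence into expected probabilities and a vacuity score."""
    evidence = np.asarray(evidence, dtype=float)
    alpha = evidence + 1.0            # uniform Dirichlet prior (assumed convention)
    strength = alpha.sum()            # total concentration
    probs = alpha / strength          # Dirichlet mean
    vacuity = len(alpha) / strength   # 1.0 with no evidence, approaches 0 with more
    return probs, vacuity

probs_none, vac_none = dirichlet_summary([0.0, 0.0, 0.0])
probs_some, vac_some = dirichlet_summary([9.0, 0.0, 0.0])
print(probs_none, vac_none)
print(probs_some, vac_some)
```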
Optimal Design for Multinomial Logit Model with Applications to Best Assortment Identification
The authors propose an optimal experimental design framework for multinomial logit (MNL) bandits, addressing computational intractability in combinatorial action spaces. The framework combines two approaches: (i) a 0-1 mixed-integer linear program (MILP) with solver-certified early stopping for exact or certified-approximate solutions, and (ii) a polynomial-time lifted design using a tractable surrogate objective. Near G-optimality guarantees are established via the Kiefer-Wolfowitz equivalence theorem, characterizing statistical-computational trade-offs. As an application, they develop a best assortment identification algorithm for MNL bandits with linear utilities and non-uniform revenues, achieving instance-dependent sample complexity of Õ(d log N / Δ²), where d is feature dimension, N is the number of arms, and Δ is the minimum revenue gap.
multinomial logitoptimal designmixed-integer linear programkiefer-wolfowitzsample complexity
Nonstationary Generalized Linear Bandits with Discounted Online Mirror Descent
The paper introduces DOMD-GLB, a computationally efficient algorithm for nonstationary generalized linear bandits (GLBs) using discounted online mirror descent (DOMD) for parameter estimation. Unlike prior MLE-based approaches requiring O(t) memory/computation per round, DOMD-GLB maintains O(1) costs while handling time-varying parameters via nonlinear link functions. Theoretical analysis yields dynamic regret bounds of Õ(c_μ^{-1/2}d^{3/4}P_T^{1/4}T^{3/4}) for drifting environments and Õ(c_μ^{-1/3}d^{2/3}Γ_T^{1/3}T^{2/3}) for piecewise-stationary cases, where d is feature dimension, P_T path length, and Γ_T change points. This constitutes the first GLB method with time-invariant per-round complexity.
generalized linear banditsdiscounted online mirror descentnonstationary environmentsdynamic regretcomputational efficiency
Extreme Region Policy Distillation
Extreme Region Policy Distillation (ERPD) addresses the trade-off between sample efficiency and asymptotic performance in reinforcement learning for large language models by decoupling these objectives into a two-stage framework. The first stage performs weakly constrained off-policy optimization on fixed data to extract maximal training signals, while the second stage distills these signals into the base policy under trust-region constraints to prevent harmful drift. ERPD achieves comparable or better performance with reduced KL divergence, demonstrating that much initial divergence is unnecessary. Experiments on mathematical reasoning show ERPD improves strong base models where on-policy training plateaus and reliably enhances weak teachers.
reinforcement learningoff-policy optimizationtrust-region constraintskl divergencepolicy distillation
Learning Latent Dynamical Causal Processes for Single-Cell Perturbation Prediction
The authors propose CITE-VAE, a latent dynamical causal generative model for single-cell perturbation prediction that jointly captures unobserved cellular programs, perturbation-conditioned mechanisms, and temporal evolution. The framework is grounded in identifiability theory, proving latent causal variables are recoverable under standard equivalence classes. Experiments on Causal-3DIdent validate theoretical guarantees, while real-world CRISPR perturbation data demonstrate improved OOD generalization over baselines (specific metrics not provided).
latent causal variablessingle-cell perturbationood generalizationdynamical causal modelidentifiability analysis
Geometric Flow Matching for Molecular Conformation Generation via Manifold Decomposition
GO-Flow introduces manifold-aware flow matching for molecular conformation generation by decomposing the process into three physically motivated subspaces: translation (linear optimal transport), rotation (geodesic flows on $SO(3)$), and conformation (entropic optimal transport). This approach aligns generative paths with molecular degrees of freedom, leveraging equivariant architectures for rotation-consistent generation. On GEOM-Drugs and GEOM-QM9, GO-Flow achieves SOTA quality, enabling high-fidelity sampling in 50 steps by learning straighter probability paths on intrinsic manifolds.
flow matchingmanifold decompositionoptimal transportequivariant architecturesconformation generation
Rao-Blackwellized Score Matching on Manifolds
The paper introduces Rao-Blackwellized score matching for denoising on smooth embedded manifolds, addressing the singularity in tangent denoising targets under ambient Gaussian corruption. By conditioning on the nearest-point projection, the method derives the unique L²-optimal predictor among estimators dependent on projected observations. A small-noise expansion reveals that the canonical target equals the intrinsic Riemannian score, corrected by an explicit order-σ² term comprising intrinsic Tweedie and extrinsic curvature components. Results show exact reduction to Gaussian denoising in flat cases and simplification to scalar factors on Sᵈ, with cancellation of extrinsic corrections on S².
rao-blackwellizeddenoising score matchingriemannian scoretweedie correctionweingarten operator
RotMoLE: Enhancing Mixture of Low-Rank Experts through Rotational Gating Mechanism
RotMoLE introduces a rotational gating mechanism to enhance Mixture of Low-rank Experts (MoE-LoRA) for improved representation and generalization in complex scenarios. Unlike conventional scalar reweighting, RotMoLE applies rotation transformations to selected experts, enabling superior exploitation and specialization, particularly with limited expert candidates. The method leverages low-rank structures inherent in MoE-LoRA to implement this mechanism. Empirical validation demonstrates RotMoLE's effectiveness in multi-task and multilingual training scenarios, addressing challenges in adapting Large Language Models to diverse specialized knowledge domains.
mixture of expertslow-rank adaptersrotational gatingparameter-efficient fine-tuningmultilingual training
Learning Permutation from Structure Without Supervision
The paper introduces an entropy-adaptive Gumbel-Sinkhorn formulation for learning permutations from structural objectives without supervision. The method locally modulates temperature based on assignment uncertainty, allowing confident assignments to discretize early while preserving exploration in ambiguous regions. Experiments on sorting, jigsaw reconstruction, and routing tasks demonstrate improved training stability and permutation quality over fixed-temperature baselines, particularly for larger problem sizes and higher ambiguity.
permutation learninggumbel-sinkhornunsupervised learningdoubly stochastic matricesentropy adaptation
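The fixed-temperature baseline that the item above improves on can be sketched as below: Gumbel noise is added to a score matrix, and alternating row and column normalization in log space pushes the result toward a doubly stochastic matrix. The paper's entropy-adaptive temperature rule is not reproduced; `row_entropy` only shows the kind of per-row uncertainty signal such a rule could use, and all names are hypothetical.

```python
import numpy as np

def _logsumexp(x, axis):
    # Numerically stable log-sum-exp along one axis (dimension kept).
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def gumbel_sinkhorn(scores, temperature=1.0, noise_scale=1.0, iters=50, seed=0):
    """Relax a permutation: perturb scores, then Sinkhorn-normalize in log space."""
    rng = np.random.default_rng(seed)
    noise = noise_scale * rng.gumbel(size=scores.shape)
    log_p = (scores + noise) / temperature       # lower temperature -> sharper assignment
    for _ in range(iters):
        log_p = log_p - _logsumexp(log_p, axis=1)  # normalize rows
        log_p = log_p - _logsumexp(log_p, axis=0)  # normalize columns
    return np.exp(log_p)

def row_entropy(p):
    # Entropy of each row's soft assignment; high values mark ambiguous rows.
    return -(p * np.log(p + 1e-12)).sum(axis=1)

scores = np.array([[1.0, 0.2, 0.1],
                   [0.3, 0.9, 0.2],
                   [0.1, 0.4, 1.2]])
p = gumbel_sinkhorn(scores, temperature=0.5, noise_scale=0.0)  # noise off for a repeatable demo
print(np.round(p, 3))
print(row_entropy(p))
```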
BC Protocol: Structured Dual-Expert Dialogue for Eliciting High-Quality Chain-of-Thought Post-Training Data
The BC Protocol introduces a structured dual-expert dialogue method for generating high-quality chain-of-thought (CoT) data in LLM post-training, addressing limitations of crowdsourcing, solo expert writing, and RLHF. It pairs domain experts (crystallized intelligence) with knowledge engineers (fluid intelligence) to externalize implicit reasoning, guided by a Participant Aptitude Model and the 'Selection-over-Prescription' principle. In a narrative fiction experiment (n=40), BC Protocol-generated CoT significantly outperformed solo-expert CoT in reasoning naturalness (Group A mean 4.80 vs. Group B 1.30, p=2.4×10⁻⁸, Cliff's δ=1.0) across three judge models (GPT-4o, Claude Opus 4.5, Gemini 2.5 Pro).
chain-of-thoughtpost-trainingelicitationcrystallized intelligenceparticipant aptitude model
'Si'multaneous 'S'patial-'T'emporal Message Passing for Dynamic Graph Representation Learning
The paper introduces SiST-GNN, a dynamic graph neural network that simultaneously processes spatial and temporal signals through unified message passing, avoiding the limitations of sequential temporal-first or spatial-first approaches. The method maintains recurrent node states to capture historical trajectories, pairs them with current features, and performs graph convolution on this temporally augmented structure. Evaluated across 14 model-dataset combinations, SiST-GNN achieves 109-277% and 68-194% relative improvements in link prediction over prior methods in fixed-split and live-update settings respectively, while also outperforming discrete-time baselines by 7-22% in node classification tasks.
dynamic graph neural networksmessage passingtemporal augmentationlink predictionnode classification
TopoAlign: Topology-Aware Visual Representation Alignment
TopoAlign introduces a topology-aware framework for comparative analysis of neural representations using mapper graphs from topological data analysis. The method jointly analyzes representation graphs via force-directed layout optimization, identifies local correspondences through automated structural matching, and enables motif-based queries with membrane visualizations. Evaluations on language and multimodal models demonstrate its capability to reveal structural alignment patterns missed by geometric approaches.
representation alignmentmapper graphstopological data analysisforce-directed layoutstructural matching
A Multimodal Framework for Dementia Detection via Linguistic and Acoustic Representation Learning
We propose a multimodal deep learning framework for dementia detection that jointly models linguistic and acoustic features. Speech recordings are processed via HuBERT with attentive statistics pooling, while transcripts are encoded using BERT. An Audio-Text Fusion mechanism combines modalities, enhanced by a MINE objective to maximize mutual information. Evaluated on the ADReSS Challenge and PROCESS-2 datasets, our approach demonstrates robust performance in speech-based dementia assessment.
multimodal fusionhubertmine objectiveattentive statistics poolingaudio-text fusion
DeepSeekMath Meets Order Book: Group-Aware Policy Optimization for High-Frequency Directional Trading
The paper introduces group-aware policy optimization methods for high-frequency trading on limit order books, leveraging Order-Flow-based state models and policy-gradient techniques. It proposes variants of Proximal Policy Optimization (PPO), including GRPO and GSPO, which incorporate group-normalized updates and downside-aware shaping, outperforming traditional value-based RL methods like Q-learning. Backtests on financial assets AMZN, AAPL, and GOOG demonstrate improved net average PnL, profitability, and drawdown metrics. Results validate the adequacy of Order-Flow signals as state representations and the superiority of group-aware PPO surrogates over value-based baselines in high-frequency trading scenarios.
policy-gradient methodsorder-flow signalsgroup-normalized updatesdownside-aware shapinghigh-frequency trading
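The "group-normalized updates" in the item above refer to GRPO-style advantages, where each sampled action's reward is standardized against the other samples in its group rather than against a learned value baseline. A minimal sketch follows, assuming the standard mean/std normalization; the paper's downside-aware shaping and its trading rewards are not reproduced, and the reward values are made up for illustration.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within each group (last axis)."""
    r = np.asarray(rewards, dtype=float)
    mean = r.mean(axis=-1, keepdims=True)
    std = r.std(axis=-1, keepdims=True)
    return (r - mean) / (std + eps)    # eps guards against zero-variance groups

# Two groups of three sampled actions each (illustrative rewards).
rewards = [[1.0, 2.0, 3.0],
           [0.5, 0.5, 0.5]]
adv = group_normalized_advantages(rewards)
print(adv)
```

A group whose samples all earn the same reward yields zero advantage everywhere, so it contributes no gradient signal.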
From DPPs to $k$-DPPs: identifiability analysis via spectral decomposition
This work characterizes the identifiability structure of $k$-DPPs through the spectral decomposition $L=U\Lambda U^{\top}$, contrasting it with full DPPs. The analysis reveals that $k$-DPPs exhibit fundamentally different identifiability properties: spectral parameters become identifiable only up to a common scale, and eigenspace rotations are identifiable solely through squared minors of the eigenvector matrix. The authors precisely quantify this identifiability gap via three explicit invariances (scale, sign similarity, and eigenspace rotation) and a dimension-counting theorem, demonstrating additional continuous non-identifiability when $\binom{N}{k}$ …
determinantal point processesspectral decompositionidentifiabilityelementary symmetric polynomialseigenspace rotation
SAE-FD: Sparse Autoencoder Feature Distillation for Continual Learning of Large Language Models
SAE-FD introduces Sparse Autoencoder Feature Distillation for continual learning in large language models, addressing catastrophic forgetting through sparse feature space regularization. The method leverages a pre-trained Sparse Autoencoder to decompose dense activations into an overcomplete sparse basis, reducing representational entanglement and enabling targeted regularization with minimal interference to new-task learning. Evaluations on two continual learning benchmarks across three architectures demonstrate SAE-FD's superiority over existing regularization-based methods, achieving 52.70% average accuracy with -0.46 backward transfer.
sparse autoencoderfeature distillationcontinual learningcatastrophic forgettingregularization
Guided Flow Matching for Forward and Inverse PDE Problems with Sparse Observations: Algorithm and Theory
FM4PDE introduces a flow-matching generative framework for solving forward and inverse PDE problems with sparse observations, learning joint distributions of coefficients/solutions. The method employs guided sampling via composite losses (measurement agreement, PDE residual reduction) with deterministic, stochastic, and hybrid variants, supported by theoretical error guarantees. Deterministic optimization achieves logarithmic complexity under coercivity, while adaptive stochastic guidance attains polynomial-time bounds by addressing noise-floor bias. Experiments on static/time-dependent PDE benchmarks show superior accuracy and faster inference versus diffusion models.
flow matchingsparse pde reconstructionadaptive guidancedeterministic-stochastic hybriderror guarantees
Relative Repairability: A Calibration-Based Diagnostic for High-Sparsity Post-Pruning Allocation
The paper introduces Relative Repairability (RR), a calibration-based diagnostic for high-sparsity post-pruning allocation in neural networks. RR evaluates the residual activation distortion after channelwise variance matching repair, estimating the fraction of unrecoverable damage using unlabeled calibration data. Experiments on ResNet18, ResNet34, and VGG16 BN across CIFAR10 and CIFAR100 demonstrate RR's utility near architecture-dependent recoverability transitions, where it outperforms ERK and LAMP in specific sparsity ranges. Findings highlight the importance of allocating both retained weights and repairable damage in high-sparsity pruning.
relative repairabilityhigh-sparsity pruningactivation distortionchannelwise variance matchingrecoverability transition
Accelerated Dynamic Importance Weighting with Versatile Divergence-Minimizing Estimators
The paper proposes Accelerated Dynamic Importance Weighting (ADIW), a unified framework for deep learning under joint distribution shift. ADIW improves efficiency over Dynamic Importance Weighting (DIW) by using lightweight projected gradient descent with warm-start initialization, and generalizes DIW to support multiple divergence-minimizing weight estimators (Kullback-Leibler, squared distance, Wasserstein-1). Theoretical convergence guarantees are provided, and empirical results show ADIW achieves state-of-the-art performance while being significantly more computationally efficient than prior methods.
importance weightingdistribution shiftdivergence minimizationkernel mean matchinggradient descent
SafetyRepro: Configuration-Conditional Rank Instability on Alignment Benchmarks
The paper introduces SafetyRepro, a method to quantify configuration-induced rank instability in foundation-model alignment benchmarks. It proposes a finite-envelope proposition linking pairwise disagreement rates to strict ordering reversals, validated via a commit-stamped evaluation protocol. Results demonstrate that benchmark configuration choices alone can reverse pairwise safety verdicts (e.g., 'A is safer than B') across all tested benchmarks, exposing a critical failure mode in comparative evaluations.
configuration-conditionalrank instabilitypairwise disagreementalignment benchmarksstrict reversal
JacQuant: STE-Free Quantization-Aware Training via Learned Jacobian Surrogates
JacQuant introduces a STE-free quantization-aware training framework that learns lightweight Jacobian surrogates to model local parameter sensitivity, stabilizing and accelerating training without modifying forward quantizers. The method employs data-driven diagonal or block-diagonal surrogates compatible with common weight/activation quantizers, proving convergence for non-convex objectives and linear rates under PL conditions. Evaluated on ≤2-bit LLM benchmarks, JacQuant consistently outperforms STE-based QAT in accuracy while maintaining negligible runtime overhead under practical group sizes.
quantization-aware trainingjacobian surrogatestraight-through estimatorlow-precision modelsnon-convex optimization
Mean-Shift PCA by Knockoff Mean
The paper introduces a two-stage PCA algorithm that removes mean-shift noise by deliberately adding knockoff mean-shift perturbations. Leveraging Random Matrix Theory, the authors prove that mean-shift contamination creates spectrally separable spikes while leaving the original eigenspace asymptotically invariant. The proposed method identifies and eliminates contaminated components using standard PCA operations, addressing a limitation of Robust PCA in high-dimensional regimes with mean-shift mixtures. Theoretical guarantees show spectral stability independent of mixture weights.
robust pcamean-shift contaminationrandom matrix theoryspectral separationknockoff perturbation
From Simulation to Enaction: Post-trained language models recognize and react to their own generations
The study demonstrates that post-trained language models implicitly recognize and adapt to their own on-policy generations, unlike pretrained models. Through entropy analysis across model families and sizes, it reveals a 3–4× reduction in on-policy output entropy compared to off-policy, linked to an internal representation of input surprise. The work also identifies distinct mechanisms for implicit (via entropy modulation) versus explicit (verbal report) recognition of on-policy contexts, with evidence from topic-specific response prefills.
post-trainingon-policyoutput entropyinput surpriseprefill
Different Statistical Perspectives for Understanding Generalisation in Graph Neural Networks
The paper systematizes three statistical frameworks for analyzing generalization in Graph Neural Networks (GNNs). First, learning-theory approaches derive uniform convergence bounds via hypothesis class complexity and expressivity through graph isomorphism tests. Second, infinite-parameter asymptotics approximate GNNs using Gaussian processes, neural tangent kernels, or graphon operators to study stability. Third, random graph models (e.g., contextual stochastic block models) enable non-asymptotic error rate analysis via high-dimensional statistics. Each framework's key results and limitations are discussed, highlighting open questions in GNN theory.
graph neural networksuniform convergenceneural tangent kernelgraphon operatorsstochastic block model
BigMac: Breaking the Pareto Frontier of Compute and Memory in Multimodal LLM Training
BigMac introduces a training pipeline for multimodal large language models (MLLMs) that breaks the Pareto frontier between compute and memory efficiency. The method nests encoder and generator computations into the original LLM pipeline, giving O(1) activation memory complexity for these components while leaving the LLM's own activation memory complexity unchanged. This design optimizes computation and memory simultaneously, yielding a 1.08×-1.9× training speedup over baseline systems with stable memory usage as batch size increases.
multimodal llmpareto frontieractivation memorynested pipelinetraining speedup
A Signal-Language Foundation Model for Broad-Spectrum Cardiovascular Assessment from Routine Electrocardiography
We introduce ECG Contrastive Language-Image Pre-training (ECGCLIP), a signal-language foundation model for broad-spectrum cardiovascular assessment from routine electrocardiography. ECGCLIP aligns ECG waveforms with expert diagnostic reports via contrastive learning, pre-trained on 2,837,962 ECG studies from 1,324,856 patients. Evaluated on 89 downstream tasks across nine external cohorts (~1.5M ECGs), ECGCLIP-R34 achieved strong performance for atrial fibrillation (PRAUC 0.900) and ST-segment elevation myocardial infarction (PRAUC 0.383), with robust generalization to rare diseases like Ebstein anomaly (PRAUC 0.253). ECGCLIP matched baseline performance with only 10% of training data, demonstrating data efficiency. Feature visualization revealed clinically meaningful representations aligned with electrocardiographic criteria.
contrastive learningelectrocardiographysignal-languageprauccardiovascular assessment
Missing Pattern Recognized Diffusion Imputation Model for Missing Not At Random
The paper introduces PRDIM, a diffusion-based imputation model addressing Missing Not at Random (MNAR) data by explicitly modeling missing patterns. It employs a pattern recognizer within an EM framework to iteratively maximize the joint distribution likelihood of observed values and missing masks. Experiments demonstrate PRDIM's superior imputation performance across diverse data modalities under MNAR conditions.
missing not at randomdiffusion modelexpectation-maximizationpattern recognizerdata imputation
Rethinking Feature Alignment in Generalist Graph Anomaly Detection: A Relational Fingerprint-based Approach
The paper introduces ReFi-GAD, a generalist graph anomaly detection (GAD) method addressing feature alignment limitations in existing approaches. Current methods rely on PCA-based projection, neglecting feature semantics and causing negative transfer. ReFi-GAD employs a Relational Fingerprint (ReFi) to encode anomaly-indicative cues from contextual and structural perspectives, combined with a transformer-based encoder and SNR-guided refinement for domain adaptation. Evaluations on 14 datasets show ReFi-GAD outperforms state-of-the-art methods.
generalist anomaly detectionrelational fingerprintfeature alignmenttransformer encodersnr-guided refinement
SeqRoute: Global Budget-Aware Sequential LLM Routing via Offline Reinforcement Learning
SeqRoute introduces a global budget-aware sequential LLM routing framework that treats multi-turn interactions as a finite-horizon Markov Decision Process, solved via offline reinforcement learning. It incorporates remaining budget into the state space and employs Conservative Q-Learning (CQL) to strategically allocate resources, alongside Hindsight Budget Relabeling (HBR) to expand training data by simulating trajectories under diverse budgets. A dynamic λ-sweep mechanism enables zero-shot Pareto frontier navigation. Evaluations show SeqRoute reduces operational costs by 6.0-73.5%, maintains or improves quality, and suppresses bankruptcy rates to under 1%, outperforming baselines across the Pareto frontier.
offline reinforcement learningmarkov decision processbudget-aware routinghindsight budget relabelingpareto frontier
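The relabeling idea can be illustrated with a minimal sketch. It is not SeqRoute's implementation: the tuple layout, the function name `relabel_with_budgets`, and the rule that an unaffordable step ends the episode with zero reward are all assumptions made for illustration. It replays one logged dialogue under several hypothetical starting budgets and records the remaining budget as part of each state.

```python
def relabel_with_budgets(trajectory, budgets):
    """Replay one logged dialogue under several hypothetical starting budgets.

    trajectory: list of (action, cost, reward) tuples, one per turn.
    Returns transitions (turn, remaining_budget, action, reward, done).
    """
    transitions = []
    for budget in budgets:
        remaining = budget
        for turn, (action, cost, reward) in enumerate(trajectory):
            if cost > remaining:
                # Assumed convention: an unaffordable step is a terminal
                # "bankruptcy" transition with zero reward.
                transitions.append((turn, remaining, action, 0.0, True))
                break
            done = turn == len(trajectory) - 1
            transitions.append((turn, remaining, action, reward, done))
            remaining -= cost
    return transitions


# One logged three-turn dialogue with made-up costs and quality rewards.
logged = [("small", 1.0, 0.6), ("large", 4.0, 0.9), ("small", 1.0, 0.7)]
data = relabel_with_budgets(logged, budgets=[3.0, 6.0])
```

Under the budget of 3.0 the second turn becomes a bankruptcy step, while under 6.0 the whole dialogue is affordable, so a single log yields training data for both regimes.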
Capture-Calibrate-Coach: A Graph-Based Framework for Knowledge Monitoring Estimation and Adaptive Feedback
The paper introduces Capture-Calibrate-Coach (3C), a graph-based framework for metacognitive learning support that jointly estimates knowledge monitoring and delivers adaptive feedback. The method constructs a heterogeneous learner-concept graph from self-reports, infers latent perceived states via a heterogeneous GNN, and classifies learners into five metacognitive patterns for personalized coaching. Evaluation on 684 students shows 85.21% AUC in latent state prediction, while a 47-participant user study confirms the perceived utility of feedback addressing both knowledge gaps and calibration errors.
knowledge monitoringheterogeneous graph neural networkmetacognitive patternsadaptive feedbackself-regulated learning
Generating 3D models from sketches of human faces using a combined approach of Convolutional Neural Networks, Procedural Modeling, and Contour Mapping
The authors present a novel method for generating 3D facial models from sketches by integrating expression detection with model generation. Their approach combines Convolutional Neural Networks (CNNs) trained on a custom dataset to detect Facial Action Coding System (FACS) Action Units, a parametric 3D face model (Valley Girl) for expression duplication, and Active Snake Contours for contour alignment. According to the authors, this is the first use of CNNs for sketch-based expression detection in the literature, enabling more accurate 3D model generation that preserves facial expressions from input sketches.
convolutional neural networksparametric 3d face modelfacial action coding systemactive snake contourssketch-based modeling
Autoregression-Free Neural Operators for Time-Dependent PDEs
The authors propose Autoregression-Free Neural Operators (AFNO), a novel framework for solving time-dependent partial differential equations (PDEs) without autoregressive rollout. AFNO maps PDE time evolution into a latent space and models continuous-time vector fields using flow matching, enabling stable long-horizon predictions and explicit conditioning on physical parameters. Theoretical analysis and experiments on six PDE benchmarks show AFNO reduces rollout errors and improves prediction stability compared to autoregressive baselines.
neural operatorspartial differential equationsautoregressive rolloutflow matchinglatent space
EMA-Nesterov: Stabilizing Nesterov's Lookahead for Accelerated Deep Learning Optimization
EMA-Nesterov introduces a stabilized lookahead optimization method for deep learning by replacing Nesterov's standard lookahead direction with an exponential moving average (EMA) of parameter updates. This modification captures low-frequency trends in optimization trajectories through EMA's low-pass filtering, maintaining adaptability via geometric weighting while avoiding instability from noisy short-horizon updates. Theoretical analysis confirms accelerated convergence rates analogous to Nesterov's method in convex settings. Empirical evaluations on language model pre-training demonstrate broad applicability across Adam, SOAP, Muon, and NanoGPT settings, outperforming prior lookahead methods in stability and performance.
exponential moving averagelookahead optimizationnesterov accelerationlow-pass filterconvergence rate
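One way to read the EMA lookahead idea is the toy sketch below. It is an illustration under stated assumptions, not the authors' update rule: the function name `ema_lookahead_gd`, the hyperparameters `lr`, `beta`, and `gamma`, and the 1-D quadratic objective are all hypothetical. The gradient is evaluated at a point shifted along an exponential moving average of past updates instead of along the most recent update alone.

```python
def ema_lookahead_gd(grad, x0, lr=0.1, beta=0.9, gamma=1.0, steps=200):
    """Gradient descent with the gradient taken at x + gamma * ema,
    where ema is an exponential moving average of past updates.
    Illustrative only; not the paper's exact algorithm."""
    x = x0
    ema = 0.0  # low-pass filtered update direction
    for _ in range(steps):
        lookahead = x + gamma * ema       # peek along the smoothed direction
        update = -lr * grad(lookahead)    # step computed at the lookahead point
        ema = beta * ema + (1.0 - beta) * update
        x = x + update
    return x


# Toy objective f(x) = 0.5 * (x - 3)^2, so grad f(x) = x - 3.
x_final = ema_lookahead_gd(lambda x: x - 3.0, x0=0.0)
```

On this toy problem the iterate converges to the minimizer at 3. The sketch says nothing about acceleration or about the stability claims in the summary.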
A Context Augmented Multi-Play Multi-Armed Bandit Algorithm for Fast Channel Allocation in Opportunistic Spectrum Access
The authors propose a context-augmented multi-play multi-armed bandit (MP-MAB) algorithm for channel allocation in opportunistic spectrum access (OSA), addressing limitations of existing methods by incorporating channel noise as a perturbation of the reward function. They model the correlation between channel state information and noise using both linear and nonlinear approaches, deriving index policies that learn these correlations via a linear model and neural network, respectively. The policies adjust the upper confidence bound using estimated noise values. Numerical experiments demonstrate reduced regret and more rational sub-optimal arm selection compared to existing methods.
multi-armed banditchannel allocationopportunistic spectrum accessupper confidence boundchannel noise
ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks
ViroBench introduces the first comprehensive benchmark for evaluating nucleotide foundation models (NFMs) in viral genomics, addressing biological understanding and biosecurity risks across 18 scenarios and 4 task types. The study evaluates 66 NFMs, revealing three key findings: performance degradation under phylogenetic and temporal shifts, decoupling between statistical likelihood and biological validity in generation tasks, and the critical importance of taxonomic diversity over parameter scale in pretraining. A lightweight baseline trained on diverse data achieves a 67.5% performance gain. ViroBench provides interpretable evaluations and a reproducible framework, with datasets and code publicly available.
nucleotide foundation modelsvirobenchbiosecurity riskphylogenetic shifttaxonomic diversity
Learning manifold diffusion semigroups from graph transition matrices
(No summary returned.)
Not only where, But when: Temporal Scheduling for RLVR
The paper introduces temporal scheduling of credit allocation criteria during RLVR (Reinforcement Learning with Verifiable Rewards) optimization for LLMs, arguing that dynamic scheduling of learning signals improves upon static token-level credit assignment. The method prioritizes targeted tokens early in training before gradually shifting to general optimization, using trajectory percentiles to distinguish policy behaviors. Experiments on mathematical and reasoning benchmarks show temporal scheduling yields healthier policy entropy dynamics and consistent performance gains over standard RLVR approaches.
rlvrcredit allocationtemporal schedulingpolicy entropytrajectory percentiles
PDEInvBench: A Comprehensive Dataset and Design Space Exploration of Neural Networks for PDE Inverse Problems
The authors introduce PDEInvBench, a benchmark dataset for evaluating neural networks on inverse problems in partial differential equations (PDEs), addressing the gap in existing benchmarks focused on forward problems. The dataset includes time-dependent and time-independent PDE simulations with in-distribution and out-of-distribution evaluation splits. Through systematic exploration of optimization procedures, problem representations, and scaling, they find that two-stage training (supervised pre-training + test-time fine-tuning), PDE derivative features, and diverse initial conditions yield optimal performance. Results demonstrate consistent accuracy improvements from these design choices across varied physical behaviors.
pde inverse problemsbenchmark datasetneural networkstest-time traininginductive biases
Certified Robustness from Approximate Gaussian Mixture Structures in Pretrained Latent Spaces
The work introduces a framework for certifiably robust classifiers by exploiting approximate Gaussian mixture structures in pretrained latent spaces. The authors derive necessary and sufficient conditions for robust classifiers in the Gaussian mixture setting, then extend this to cases where the latent distribution is ε-close (in KL divergence) to a mixture, proving graceful degradation of certified accuracy. The method achieves state-of-the-art or competitive certified accuracy on CIFAR-10 and ImageNet while maintaining clean performance and low computational overhead, demonstrating practical certifiable robustness via approximate latent structure.
certified robustnessgaussian mixturelatent spacekl divergenceadversarial perturbations
Parameter-Efficient CT Reconstruction via Deep Graph Laplacian Regularization
The authors propose Deep Graph Laplacian Regularization (Deep GLR), a parameter-efficient low-dose CT (LDCT) reconstruction method combining quadratic graph regularization with lightweight CNNs in a Proximal Forward-Backward Splitting framework. Deep GLR achieves 30.70 dB PSNR on LoDoPaB-CT (6.33 dB improvement over filtered backprojection) using only 91,848 parameters trained on 1,000 samples, yielding 5.8× better parameter efficiency and 30× better data efficiency per dB than benchmarks. The learned graph bandwidth (ε=1.25) suggests interpretable priors, though a 13 dB gap remains versus state-of-the-art methods.
low-dose computed tomographygraph laplacian regularizationproximal forward-backward splittingparameter efficiencymedical imaging
ERNIE-Image Technical Report
ERNIE-Image introduces an 8B-parameter single-stream Diffusion Transformer (DiT) for text-to-image generation, aiming to close the performance gap between open-source and proprietary models. The method employs a bottom-up pre-training pipeline combining fine-grained image categorization, dense captioning, aesthetic scoring, and hierarchical sampling, followed by top-down post-training with diversified prompts and stabilized Direct Preference Optimization (DPO). The system includes ERNIE-Image-Turbo for 8-step generation via MT-DMD distillation and a Prompt Enhancer for practical deployment. Evaluations show state-of-the-art open-source performance in instruction following, text rendering, and aesthetics, with released models and the ERNIE-Image-Aes-1K benchmark for reproducible assessment.
diffusion transformerdirect preference optimizationaesthetic assessmentinstruction followingtext-to-image generation
Parallel Differentiable Reachability for Learning and Planning with Certified Neural Dynamics and Controllers
The paper introduces a parallelizable, differentiable reachability framework in JAX for certifying neural dynamics models and controllers in continuous- and discrete-time systems. The method unifies Taylor-model flowpipe construction with CROWN-style linear bound propagation, preserving affine dependencies while enabling GPU-batched computation and automatic differentiation. Applications include certified training for reachability-friendly models and reachability-aware sampling-based MPC with gradient refinement. Experiments on non-prehensile manipulation and quadrotor tasks (up to 72D) demonstrate certified reachable-set over-approximations under bounded uncertainty during online planning.
differentiable reachabilitytaylor-model flowpipecrown-style boundscertified trainingreachability-aware mpc
A general tensor-structured compression scheme for efficient large language models
The paper introduces Tensor Mixture (MixT), a tensor-structured compression scheme for efficient large language models (LLMs) that replaces dense linear layers with mixtures of tensor operators. Operating generically on linear projections, MixT is applicable to Transformer-based LLMs and other dense neural mappings. Evaluated on Qwen3-8B and LLaMA2-7B, MixT preserves MMLU accuracy until model-specific boundaries, where output entropy, prediction entropy, and inter-layer geometry shift. At LLaMA2-7B's boundary, MixT reduces parameters by 47.5%, inference FLOPs by 37.1%, training FLOPs by 52.1%, and peak memory by 60.4%.
tensor-structured compressionlarge language modelslinear projectionsmmlu accuracyinference flops
CausalFlow: Causal Attribution and Counterfactual Repair for LLM Agent Failures
CausalFlow introduces an interventional framework for diagnosing and repairing LLM agent failures through causal attribution. The method models execution traces as sequential chains, computes Causal Responsibility Scores via step-level counterfactual intervention, and generates minimally edited repairs that flip outcomes to success. Evaluated on four benchmarks (mathematical reasoning, code generation, question answering, medical browsing), CausalFlow produces validated minimal repairs with high minimality and causal-consensus scores, outperforming heuristic refinement in complex retrieval settings while enabling reliable improvement across diverse tasks.
causal attributioncounterfactual interventionexecution tracesminimal repairspreference optimization
UWM-JEPA: Predictive World Models That Imagine in Belief Space
The Unitary World Model Joint Embedding Predictive Architecture (UWM-JEPA) introduces a density-matrix latent on a joint system-environment space with a learned unitary predictor, preserving the joint-state spectrum during rollout to prevent uncertainty dissipation. This architecture outperforms parameter-matched LSTM-JEPA baselines on a hidden-velocity indicator task, achieving 0.77 accuracy under counterfactual action sequences versus the baseline's 0.53 majority-class accuracy. UWM-JEPA also demonstrates superior robustness in blind rollout, losing fewer than ten points of probe R^2 at short horizons compared to vector-latent baselines losing forty-one and sixty-eight. The results highlight the importance of latent geometry and predictor dynamics, not just context-encoding capacity, for JEPA world models in partially observed environments.
joint embedding predictive architecturedensity-matrix latentunitary predictorblind rolloutcounterfactual action
Electricity Consumption Forecasting: An Approach Using Cooperative Ensemble Learning with SHapley Additive exPlanations
The study proposes a cooperative ensemble learning approach (Weaker Separator Booster) for electricity consumption forecasting, combining LSTM, RF, SVR, and XGBoost with SHAP-based feature selection and GA/PSO hyperparameter optimization. Using 7-year data from two campuses of Federal Institute of Paraná, the model achieved sMAPE of 13.90% (MAE: 1990.87 kWh) and 18.72% (MAE: 465.02 kWh), outperforming individual methods. SHAP analysis identified lagged time-series values as dominant predictors, with minimal climatic influence.
ensemble learningshapley valueshyperparameter optimizationelectricity forecastingsmape
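For readers unfamiliar with the two reported metrics, here is a minimal sketch of one common sMAPE definition alongside MAE. Several sMAPE variants exist and the summary does not say which one the study uses, so the formula below is an assumption, and the function names are illustrative.

```python
def smape(actual, forecast):
    """Symmetric MAPE in percent: mean of 2|F - A| / (|A| + |F|).
    Pairs where both values are zero contribute zero error."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = abs(a) + abs(f)
        total += 0.0 if denom == 0 else 2.0 * abs(f - a) / denom
    return 100.0 * total / len(actual)


def mae(actual, forecast):
    """Mean absolute error, in the units of the data (kWh in the summary)."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(f - a) for a, f in zip(actual, forecast)) / len(actual)
```

Because sMAPE is scale-free and MAE is not, a campus with lower consumption can show a smaller MAE together with a larger sMAPE, which matches the pattern in the two reported results.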
When Interpretability Becomes a Liability: Adversarial Attacks on CBM Concept Layers
The paper identifies a novel vulnerability in Concept Bottleneck Models (CBMs) where adversarial attacks can manipulate concept layers to induce misclassification. It develops a theoretical framework to quantify concept-space robustness and introduces SPECTRA, a defense method using semantic perturbation-based regularization. Experiments on CUB-200-2011 show SPECTRA increases required perturbation norms from 0.46 to 4,200 while maintaining classification accuracy within 2.2% of baseline.
concept bottleneck modelsadversarial attacksinterpretabilityrobustness regularizationsemantic perturbations
Algorithms with Polynomially-Improved Approximation Factors for the $2 \rightarrow q$ Norm, and Applications
The paper presents polynomial-time approximation algorithms for the $2 \rightarrow q$ matrix norm, achieving improved approximation factors over previous baselines. The authors develop novel techniques to surpass the $d^{1/4}$-approximation baseline, notably achieving $d^{1/8}$-approximation for the $q=4$ case. Their approach involves constructing sum-of-squares certificates, which also enables applications in robust statistics (mean/covariance estimation, regression) and clustering under $q$-th moment constraints. The results address open problems in combinatorial optimization, quantum information, and algorithmic statistics, while circumventing hardness barriers implied by the Exponential Time Hypothesis.
matrix normapproximation algorithmssum-of-squaresrobust estimationexponential time hypothesis
A Principled Self-Referenced Early Stopping Approach for Deep Image Prior
We propose a principled early stopping framework for Deep Image Prior (DIP) that addresses overfitting to noisy measurements by constructing pseudo self-referenced images. Our approach leverages theoretical insights on single-reference validation, pseudo-validation estimation, and shared noise impact, enabling robust overfitting detection without requiring precise noise level estimates. Three novel algorithms are introduced for inverse imaging problems (IIPs), including natural image restoration and medical image reconstruction. Extensive experiments demonstrate consistent performance improvements over existing DIP early stopping methods across varying noise levels and types.
deep image priorearly stoppinginverse imaging problemsoverfitting detectionpseudo-validation
Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction
Eureka introduces an LLM-driven framework for agentic feature engineering, where features are generated as executable programs rather than static transformations. The method employs three stages: (1) a domain-expert SFT-tuned agent produces structured feature plans, (2) an LLM translates plans into Python code via chain-of-thought reasoning, and (3) a GRPO-based alignment engine optimizes code quality via dual-channel rewards. Evaluated on 7 benchmarks and Alibaba Cloud GPU demand prediction, Eureka outperforms AutoFE and LLM baselines, improving demand fulfillment by 16% and reducing resource migration by 33%.
agentic feature engineeringchain-of-thought reasoninggrpoautofeself-evolving alignment
Choosing Online Experiment Designs under Interference in Ads, Recommendations, and Member-Experience Systems
The paper contributes an interference-aware experiment design framework for online systems, addressing uncertainty in exposure mechanisms like graph spillovers and temporal carryover. It formulates robust design selection over an ambiguity set, evaluating six implementable designs by worst-case planning risk, which combines exposure bias, variance, and operational cost. Theoretical guarantees include Wasserstein-distance bounds on design bias and minimax tightness under Lipschitz exposure response. Empirical evaluations on Criteo ads, Open Bandit-bts/men, and KuaiRand datasets demonstrate varying design recommendations, with robust risks ranging from 1.295 to 2.240. The framework outputs justified design choices or uncertainty shortlists based on mechanism-robust decisions.
interference-aware designwasserstein distancelipschitz exposurerobust riskmechanism-robust
Label-NTK Alignments and A Tighter Convergence Bound in the NTK Regime
The authors derive sharper convergence guarantees for neural network optimization in the Neural Tangent Kernel (NTK) regime by characterizing Label-NTK and Residual-NTK alignment, where label and residual projections onto NTK eigenvectors scale with corresponding eigenvalues. This approach yields a refined convergence bound dependent on the full NTK spectrum, significantly improving over classical worst-case results that rely on the smallest eigenvalue. Theoretical justification under mild data assumptions is provided, along with improved generalization bounds. Empirical validation on MLPs and CNNs across multiple datasets demonstrates alignment with practical training dynamics.
neural tangent kerneleigen-spectrumconvergence boundgeneralization boundlabel alignment
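For context, the textbook kernel-regime calculation that this kind of bound refines can be written as follows. This is the standard gradient-flow derivation with a fixed kernel, not the paper's result. The symbols $K$, $\lambda_i$, $v_i$, and $r(t)$ are notation introduced here.

```latex
% Assumptions: squared loss, gradient flow, and a fixed NTK Gram matrix
% K = \sum_i \lambda_i v_i v_i^\top, with residual r(t) = f_t(X) - y.
\begin{align}
\dot r(t) &= -K\, r(t)
  \;\Longrightarrow\;
  r(t) = e^{-Kt}\, r(0), \\
\|r(t)\|_2^2 &= \sum_{i} e^{-2\lambda_i t}\,\bigl(v_i^\top r(0)\bigr)^2
  \;\le\; e^{-2\lambda_{\min} t}\,\|r(0)\|_2^2 .
\end{align}
```

The inequality is the classical worst-case rate governed by the smallest eigenvalue. When the projections $v_i^\top r(0)$ concentrate on directions with large $\lambda_i$, the exact sum decays much faster than that bound suggests.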
Latent Q-Barrier Shielding for Safe In-Context Reinforcement Learning
The paper introduces Latent Q-Barrier Shielding, a method for safe in-context reinforcement learning (ICRL) that improves reward-safety tradeoffs under out-of-distribution deployment shifts. The approach learns a context representation, latent dynamics, and an ensemble cost critic before deployment, enabling action filtering or reweighting based on remaining budget and predicted future cost without test-time parameter updates. A theoretical result establishes a conditional, error-decomposed barrier-margin guarantee for budget-safe continuations. Empirical evaluation across five safe ICRL benchmarks demonstrates improved returns in four benchmarks and reduced average episode costs in all five compared to a strong baseline.
in-context reinforcement learninglatent dynamicsensemble cost criticbarrier-marginout-of-distribution
First, do no harm: Breaking suicidogenic echo chambers in media recommendation
The paper introduces RankAid, a re-ranking method for recommender systems that mitigates suicidogenic echo chambers by jointly optimizing clinical safety and predictive relevance. The approach operates as an add-on layer to existing models, dynamically penalizing harmful content and promoting therapeutic items based on user vulnerability levels. Evaluation on the MovieLens 1M dataset, with risk annotations from large language models, demonstrates effective blocking of harmful recommendations during crisis periods while maintaining controlled accuracy degradation (measured by NDCG). The system allows tunable intervention severity through asymmetric hyperparameters aligned with clinical guidelines.
recommender systemssuicidogenic echo chambersre-rankingclinical safetyndcg
Quantifying Empirical Compute-Supervision Tradeoffs in RLVR
This study empirically challenges the theoretical prediction that compute scaling can compensate for imperfect supervision in reinforcement learning with verifiable rewards (RLVR). In controlled experiments on Qwen2.5 (0.5B, 1.5B) models trained with GRPO on GSM8K, the authors systematically vary verifier noise levels and compute resources (rollouts per prompt). Results show persistent accuracy gaps despite compute scaling, with diminishing returns and asymmetric effects: false negatives degrade performance more rapidly than false positives. These findings demonstrate that verifier quality and compute are not interchangeable, emphasizing the importance of reducing false negatives over pure compute scaling.
reinforcement learningverifiable rewardscompute scalingfalse negativesgrpo
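The experimental knobs described above can be sketched as follows. This is a hedged illustration, not the study's code: `noisy_verdict`, its `fn_rate` and `fp_rate` parameters, and the eight-rollout example are hypothetical. The advantage function uses the group-standardized form commonly associated with GRPO, and using the population standard deviation is a choice made here.

```python
import random
import statistics


def noisy_verdict(is_correct, fn_rate, fp_rate, rng):
    """Flip a ground-truth verdict with the given false-negative and
    false-positive rates, returning a 0/1 reward."""
    if is_correct:
        return 0.0 if rng.random() < fn_rate else 1.0
    return 1.0 if rng.random() < fp_rate else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


rng = random.Random(0)
# Ground-truth correctness of 8 rollouts for a single prompt (made up).
truth = [True, True, False, False, True, False, False, False]
rewards = [noisy_verdict(t, fn_rate=0.3, fp_rate=0.0, rng=rng) for t in truth]
advantages = group_relative_advantages(rewards)
```

Each false negative lowers the reward of a correct rollout and shifts the group mean, so correct samples can end up with negative advantages.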
Constraint-Anchored Attribution: Feasibility-Certified Counterfactuals and Bonferroni-PAC Sufficient Subsets for Neural CO Policies
The paper introduces Constraint-Anchored Attribution (CAA), a method for explaining neural combinatorial-optimization policies via three components: (i) constraint-family decomposition using LP-relaxation duals, (ii) feasibility-certified counterfactuals via a CSP model, and (iii) Bonferroni-PAC sufficient subsets with Hoeffding testing. Evaluated on CVRPTW, Orienteering, and Flexible Job-Shop Scheduling problems, CAA achieves 96.5% and 77.2% alignment with counterfactual signals (vs 75.0% and 35.2% for gradient baselines), with exact agreement in no-gain scenarios. PAC subsets average 5.0 nodes per step (ε=δ=0.2).
combinatorial optimizationcounterfactual explanationlp-relaxationpac learningconstraint satisfaction
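As background for the Hoeffding and Bonferroni ingredients, the sketch below computes the sample size needed to estimate a [0, 1]-bounded mean within ε at confidence 1 − δ, with the failure budget optionally split across several tests. How CAA actually combines these pieces is not described in the summary, so the function `hoeffding_samples` and its `num_tests` parameter are assumptions for illustration.

```python
import math


def hoeffding_samples(eps, delta, num_tests=1):
    """Smallest n with 2 * exp(-2 * n * eps**2) <= delta / num_tests,
    for the mean of i.i.d. samples bounded in [0, 1]."""
    if not (eps > 0 and 0 < delta < 1 and num_tests >= 1):
        raise ValueError("need eps > 0, 0 < delta < 1, num_tests >= 1")
    per_test_delta = delta / num_tests  # Bonferroni split of the failure budget
    return math.ceil(math.log(2.0 / per_test_delta) / (2.0 * eps ** 2))


# The summary's setting eps = delta = 0.2, for one test and for five tests.
n_single = hoeffding_samples(0.2, 0.2)
n_five = hoeffding_samples(0.2, 0.2, num_tests=5)
```

The Bonferroni correction enters only through a logarithm, so running many simultaneous tests raises the per-test sample count slowly.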
On the Epistemic Uncertainty of Overparametrized Neural Networks
The work investigates epistemic uncertainty in overparametrized neural networks, challenging the conventional view that it vanishes with increasing data. Through the lens of parameter non-identifiability, the authors characterize discrete and continuous sources of residual uncertainty, emphasizing that substantial parameter uncertainty persists even when the underlying function is fully identified. Focusing on one-hidden-layer ReLU networks, they analyze the posterior structure and validate theoretical insights empirically. The findings highlight the nuanced relationship between parameter uncertainty and predictive variability in overparametrized models.
epistemic uncertaintynon-identifiabilityoverparametrized networksrelu networksposterior structure
A Blended Likelihood Approach for Achieving Fairness Using Naive Bayes
The Bias Mitigating Naive Bayes (BMNB) classifier introduces fairness-awareness into Naive Bayes through a blended likelihood approach and adaptive thresholding. The in-processing stage combines group-specific and pooled likelihood estimates via a tunable parameter α, while post-processing calibrates outputs with group-specific decision boundaries. BMNB achieves Disparate Impact (DI) values of 1.000, 1.171, and 0.997 and Equal Opportunity Difference (EOD) values of -0.217, -0.226, and -0.053 on Adult, ProPublica, and Framingham datasets, respectively, maintaining computational efficiency. Ablation studies confirm the synergy of blended likelihood and adaptive thresholding.
naive bayesfairness-awareblended likelihooddisparate impactadaptive thresholding
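A minimal sketch of the quantities named above follows. The convex-combination form of the blend is an assumption, since the summary only says the two estimates are combined via α. The metric functions follow the usual definitions of Disparate Impact and Equal Opportunity Difference, and all function names are illustrative.

```python
def blended_likelihood(group_lik, pooled_lik, alpha):
    """Convex blend of a group-specific and a pooled likelihood estimate.
    alpha = 1 uses only the group estimate, alpha = 0 only the pooled one."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * group_lik + (1.0 - alpha) * pooled_lik


def disparate_impact(preds, groups, unprivileged, privileged):
    """Ratio of positive-prediction rates: unprivileged over privileged."""
    def rate(g):
        members = [p for p, grp in zip(preds, groups) if grp == g]
        return sum(members) / len(members)
    return rate(unprivileged) / rate(privileged)


def equal_opportunity_difference(preds, labels, groups, unprivileged, privileged):
    """Difference in true-positive rates: unprivileged minus privileged."""
    def tpr(g):
        hits = [p for p, y, grp in zip(preds, labels, groups)
                if grp == g and y == 1]
        return sum(hits) / len(hits)
    return tpr(unprivileged) - tpr(privileged)
```

Under these definitions a DI of 1.0 and an EOD of 0.0 indicate parity, which is the scale on which the reported values should be read.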
Continuous-Depth Field Theory for Transformer Patching and Mechanistic Interpretability
The paper introduces a field-theoretic framework for mechanistic interpretability in Transformers, formalizing patching interventions as source insertions in a depth-token field. By treating the residual stream as a continuous field, it models patch effects via sensitivity fields, downstream propagation via empirical Green functions, and patch selection via adjoint variational problems. Experiments on GPT-2-style models demonstrate local linearity in responses, anisotropic propagation patterns across depth and token positions, and transferable behavior via prompt-induced residual displacements. The results establish sensitivity fields and Green operators as foundational tools for systematic patching analysis.
mechanistic interpretabilityresidual streamgreen functionsensitivity fieldadjoint variational
Data-Specific Hyper-Parameter Design: A Paradigm Shift in Reservoir Computing
The paper introduces a data-specific hyper-parameter design paradigm for reservoir computing, departing from traditional random reservoir constructions. By analyzing deterministic dynamical systems geometrically, the authors propose aligning reservoir state increments within input-determined subspaces via cone concentration, theoretically reducing ridge-regression error. For echo state networks, they develop a constructive reservoir matrix design maintaining Krylov-chain closure in relevant subspaces while controlling orthogonal mixing. Spectral diagnostics identify predictive information concentration versus spectral pollution. Experiments demonstrate consistent performance improvements over random reservoirs.
reservoir computingecho state networksridge regressionkrylov-chainspectral pollution
Personalized Federated Learning by Energy-Efficient UAV Communications
(No summary returned.)
Evolving Causal Regulatory Networks (ECR-Net)
ECR-Net introduces a bio-inspired framework for adaptive causal discovery by modeling data-generating processes as dynamic Gene Regulatory Networks (GRNs) rather than static graphs. The method employs evolutionary search to optimize regulatory graph topologies, using statistical property shifts as signals for environmental shocks and parsimoniously modifying causal links. This approach enables robust generalization in non-stationary systems by capturing structural adaptation mechanisms.
gene regulatory networkscausal discoveryevolutionary searchstructural adaptationnon-stationary systems
Multi-Objective Learning for Diffusion Models: A Statistical Theory under Semi-Supervised Learning
The paper develops a multi-objective learning (MOL) framework for diffusion models in semi-supervised settings, where paired samples are scarce but unlabeled condition data are abundant. The method employs a two-stage procedure: first training lightweight specialist models on limited paired data, then distilling them into a generalist model via pseudo-sample generation. Theoretical analysis shows generalization bounds where paired sample complexity depends only on specialist model class complexity, extended to sequential decision-making with diffusion policies. Experiments on robotic control and image restoration validate the approach.
diffusion modelsmulti-objective learningsemi-supervised learninggeneralization boundspseudo-sample generation
Influence-Inspired Spectral Rotations for Extreme Low-Bit LLM Quantization
The paper introduces influence-inspired spectral rotations for extreme low-bit weight-only quantization in large language models (LLMs), building on Walsh-Hadamard transform (WHT) geometry from prior theory. The method involves WHT-rotating each linear layer's weight matrix and rescaling columns by per-coordinate Walsh-basis activation energy before quantization, biasing rounding toward high-spectral-energy channels. Evaluated on decoder-only models (135M–1.5B parameters), the approach reduces WikiText-2 perplexity by 15–58% at W2A16 versus vanilla auto-round, with extensions addressing Qwen3 attention and MoE architectures. Results show device-invariant execution (PPL ±0.1) on Intel hardware, though theoretical transfer from Boolean influence remains unproven.
quantizationwalsh-hadamard transformperplexitylow-bitllm
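The rotate-then-rescale step can be sketched in pure Python as below. The fast Walsh-Hadamard transform is standard. The `rotate_and_rescale` helper and its `energy` argument are hypothetical stand-ins for the per-coordinate activation energies mentioned in the summary, and the paper's exact scaling rule is not reproduced here.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform.
    len(vec) must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def rotate_and_rescale(row, energy):
    """Rotate one weight row into the Walsh basis, then scale each
    coordinate by a caller-supplied energy weight (hypothetical values)."""
    return [w * e for w, e in zip(fwht(row), energy)]
```

With the orthonormal scaling the transform is its own inverse and preserves the Euclidean norm, so a rotation of this kind can be undone exactly after dequantization.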
Hide to Guide: Learning via Semantic Masking
We introduce Semantic Masked Expert Policy Optimization (SMEPO), a novel reinforcement learning with verifiable rewards (RLVR) method that employs fine-grained semantic masking to guide language models on reasoning-intensive tasks. SMEPO selectively masks reward-relevant semantic spans in expert traces while preserving problem-solving structure, transforming hard problems into fill-in-the-blank exercises. This approach prevents reward hacking by forcing models to reconstruct critical content rather than copying expert traces. Evaluated across math, code, and agentic search domains, SMEPO improves accuracy by up to 3.2 points over GRPO and reduces training time by up to 4.2x, demonstrating effective exploration and learning efficiency.
semantic maskingreinforcement learningverifiable rewardsreward hackingexpert traces
Localization then Neutralization: Gradient-guided Token Suppression against Visual Prompt Injection Attack
We propose Gradient Token Masking (GTM), a defense against visual prompt injection attacks on multimodal large language models. GTM localizes critical image tokens via Hidden-State Gradient Norm scoring, which is theoretically guaranteed to align with full adversarial loss gradients, and neutralizes them through masking. This method requires only a single forward-backward pass to identify and suppress a small subset of tokens, effectively disrupting adversarial attack paths. Experiments on prompt injection and multimodal jailbreak attacks demonstrate that GTM reduces attack success rates to near zero while maintaining model utility with minimal computational overhead.
gradient token maskingvisual prompt injectionhidden-state gradient normmultimodal jailbreakadversarial loss
Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models
The authors propose trusted-direction projection, a method to mitigate reward hacking in reinforcement learning for language models by constraining gradients to a clean reference subspace. They analyze reward hacking through the geometry of parameter updates, identifying that hacking exhibits larger directional change than clean runs via dominant singular directions. Experiments on mathematical reasoning tasks demonstrate that the approach delays shortcut exploitation and maintains task performance better than unconstrained optimization.
reward hackingreinforcement learninglanguage modelsgradient projectionparameter updates
Growing a Neural Network in Breadth, Depth, and Time
The authors propose a framework for jointly optimizing neural network architectures across breadth, depth, and temporal recurrence via differentiable cost terms within a recurrent convolutional network. Their method treats the network as a finite subset of an infinite lattice, applying backpropagation to balance task performance against spatial and temporal resource constraints. Results demonstrate trade-offs between these dimensions for accuracy, with emergent computational graphs adapting to task complexity and occlusion (increased recurrence). Notably, model recurrence steps correlate with human reaction times in object recognition (r=0.72).
differentiable architecture searchrecurrent convolutional networksresource-constrained optimizationemergent computational graphsnormative neural modeling
Nyström Kernel Stein Discrepancy Tests
The paper establishes that Nyström-accelerated Kernel Stein Discrepancy (KSD) preserves the asymptotic properties of quadratic-time bootstrap-based goodness-of-fit (GoF) tests while reducing computational cost. By proving that the accelerated method maintains asymptotic level and local consistency, the work enables efficient GoF testing for spherical and functional data. Empirical results demonstrate statistical parity with traditional KSD tests, achieving runtime improvements without accuracy loss.
kernel stein discrepancynyström approximationgoodness-of-fit testingbootstrap methodscomputational efficiency
Rejoinder: The ICML 2023 Ranking Experiment: Examining Author Self-Assessment in ML/AI Peer Review
The rejoinder addresses critiques of the ICML 2023 Ranking Experiment, which evaluates author self-assessment in ML/AI peer review. It reframes peer review as a statistical estimation problem and proposes the Isotonic Mechanism to mitigate equity and strategic concerns. The response integrates reviewer rankings and structured metadata as complementary signals and explores a human-centered framework for peer review in the context of generative AI. The discussion emphasizes practical deployment challenges and theoretical implications for improving review processes.
peer reviewstatistical estimationisotonic mechanismreviewer rankingsgenerative ai
Grow-Prune-Freeze Networks: Adaptive & Continual Learning Technique for Olfactory Navigation
The paper introduces Grow-Prune-Freeze (GPF) networks, an adaptive continual learning framework for olfactory navigation in non-stationary environments. GPF dynamically modifies policy networks by growing/pruning/freezing layers based on world complexity, grounded in non-linear random matrix theory extensions of Pennington & Worah (2017). The method achieves 94% success in turbulent plume navigation (a partially observable benchmark) via Expected SARSA, with evidence suggesting generalization to Atari RL, image classification, and autoregressive LMs. Theoretical analysis shows preserved eigenvalue composition during layer expansion.
continual learningolfactory navigationrandom matrix theoryexpected sarsanon-stationary environments
Learning Treatment Effects during Resource Allocation via Priority-Queue Randomization
The authors propose an experimental design framework for estimating treatment effects during resource allocation via priority-queue randomization, addressing challenges in public service programs. Their method randomizes incoming applicants into priority queues based on risk scores, allocating treatments across queues in priority order as budgets permit. They characterize identifiable causal effects: standard estimands under exogenous arrivals and local treatment effects under endogenous arrivals via queue randomization as an instrument. Additionally, they develop optimized queue-assignment designs balancing statistical efficiency with prioritization of high-need applicants, demonstrating that iid efficiency bounds remain valid despite treatment assignment dependencies. The framework is validated using data from a U.S. county housing allocation program.
treatment effectspriority queuescausal inferenceresource allocationstatistical efficiency
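A toy version of the queue-randomization step might look like the following. The assignment rule (probability of the high-priority queue equal to the risk score), the two-queue setup, and the function names are assumptions for illustration only; the paper's optimized designs are not reproduced here.

```python
import random

def assign_queue(risk, rng):
    """Assumed rule: an applicant with risk in [0, 1] enters the
    high-priority queue (0) with probability equal to that risk."""
    return 0 if rng.random() < risk else 1

def allocate(applicants, budget, rng):
    """applicants: list of (applicant_id, risk) in arrival order.
    Returns the random queue assignment and the set of treated ids."""
    queues = {aid: assign_queue(risk, rng) for aid, risk in applicants}
    # sorted() is stable, so arrival order is preserved within each queue.
    order = sorted(applicants, key=lambda a: queues[a[0]])
    treated = {aid for aid, _ in order[:budget]}
    return queues, treated
```

Since queue membership is random given the risk score, it can play the role of an instrument for treatment receipt, which is the identification idea the summary refers to.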
AME-TS: Anchored Mixture-of-Experts for Time Series Forecasting
AME-TS introduces a structure-guided sparse time series foundation model that improves Mixture-of-Experts (MoE) routing by aligning expert specialization with interpretable temporal structure. The method employs a lightweight regime predictor to estimate series-level descriptors (e.g., forecastability, seasonality, trend, sparsity) and maps them to a soft structural prior over experts, guiding token-level routing during training. On the GIFT-Eval benchmark, AME-TS achieves superior accuracy-efficiency tradeoffs across model scales, outperforming existing models at small scales and remaining competitive at larger scales while activating fewer parameters. Fine-tuning on the M5 dataset demonstrates more interpretable routing geometry and stable expert specialization compared to standard MoE.
mixture-of-expertstime series forecastingsparse routingstructural priortoken-level routing
Abduction-Deduction Entanglement: Domain Generalization via Representation Transplants
The paper introduces a domain generalization framework leveraging causal invariance through representation transplants. By factorizing predictions into abduction (inferring unobserved variables) and deduction (label prediction) maps, the method constrains valid abduction-deduction ensembles via source data. Representation transplants linearly transform representations to manipulate abduction while preserving deduction, enabling search over plausible target distributions. Theoretical analysis shows minimax-optimal target prediction under ideal optimization. Empirical results demonstrate competitive performance on domain generalization benchmarks.
domain generalizationcausal invariancerepresentation transplantabduction-deduction entanglementminimax optimization
Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling
The paper introduces stochastic backtracking for efficient test-time scaling in language models, addressing premature pruning in existing PRM-guided methods by maintaining a persistent pool of historical prefixes. Two mechanisms are proposed: Subpool Selection (Top-N within random subpools to revive promising prefixes) and Power Backtrack Sequential Monte Carlo (SMC-style resampling with powered PRM scores). Evaluations on mathematical reasoning benchmarks show improved accuracy per token count and equivalent accuracy with fewer tokens compared to frontier-only PRM baselines.
test-time scalingprm-guided searchstochastic backtrackingsequential monte carlomathematical reasoning
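One plausible reading of Subpool Selection is sketched below: draw a random subpool from the persistent pool of historical prefixes, then keep the Top-N by PRM score. The data layout, parameter names, and the single-subpool simplification are assumptions, not details confirmed by the summary.

```python
import random

def subpool_select(pool, subpool_size, top_n, rng):
    """pool: list of (prefix, prm_score) pairs kept from all earlier steps.
    Samples a random subpool without replacement, then returns its
    top_n entries by score, highest first."""
    subpool = rng.sample(pool, min(subpool_size, len(pool)))
    subpool.sort(key=lambda item: item[1], reverse=True)
    return subpool[:top_n]
```

Because the subpool is random, a prefix that is not in the global Top-N still has a chance of being revived, which is the behavior the summary contrasts with frontier-only pruning.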
ASTRO: Adaptive Spatio-Temporal Reinforcement Optimization for GNN Powered Anomaly Detection in Cyber Physical Systems
The paper introduces ASTRO, a reinforcement learning-based anomaly detection framework for IIoT/CPS that dynamically optimizes decision thresholds via DQN. The method combines GNNs (for spatial sensor relations), temporal modeling, and multi-head attention (for salient time steps) to generate adaptive anomaly scores. Evaluated on SWaT and WADI datasets, ASTRO achieves F1-scores of 0.990 and 0.788 respectively, outperforming baselines by 14% on WADI's 127-device network while demonstrating consistent generalization.
anomaly detectiongraph neural networksreinforcement learningmulti-head attentioncyber-physical systems
Theoretical Analysis of Sparse Optimization with Reparameterization, Weight Decay, and Adaptive Learning Rate
The paper proposes ReWA, a sparse optimization method combining reparameterization, weight decay, and adaptive learning rates to address instability in ℓ_p regularization (0
sparse optimizationℓ_p regularizationreparameterizationadaptive learning rateweight decay
Blocked Gibbs meets Diffusion Transformers: Unsupervised Learning for Constraint Optimization
BloGDiT introduces blocked Gibbs sampling into Diffusion Transformers for constraint optimization, addressing limitations of standard diffusion in handling discrete variables and global constraints. The method replaces joint Gaussian denoising with blocked Gaussian denoising, iteratively resampling variable blocks while annealing block sizes to enable targeted edits. Evaluated on Sudoku, Graph Coloring, Maximum Independent Set, and MaxCut, BloGDiT matches or surpasses existing methods, demonstrating the efficacy of blocked Gibbs diffusion as an inductive bias for Transformer-based constraint solving.
diffusion transformersblocked gibbs samplingconstraint optimizationdiscrete variablesannealed block resampling
PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration
The authors propose PQDT, a Pseudo-Query Dual Transformer for robust point cloud restoration that handles diverse degradations (incompleteness, noise, outliers) through a unified architecture. The method introduces a Pseudo-Query module within a Transformer backbone, decomposing geometric translation into two cooperative stages to preserve local details while enhancing structural clarity. Experiments on curated benchmarks demonstrate state-of-the-art performance in joint completion, deformation, and denoising tasks, outperforming specialized single-task approaches. The work provides a point-only backbone for versatile 3D perception without requiring global bottleneck features.
point cloud restorationtransformer architecturegeometric translationlocal detail preservationdegradation robustness
Optimizing Multidimensional Scaling in Gini Metric Spaces
The authors propose Gini Multidimensional Scaling (Gini MDS), an extension of Euclidean MDS using a Gini pseudo-distance based on values and ranks with a tunable hyperparameter. This method enables flexible exploration of latent configurations for improved embedding alignment with observed dissimilarities. Experiments on 16 UCI datasets with outliers and noisy MNIST images demonstrate Gini MDS's robustness, outperforming standard Euclidean MDS. The implementation leverages PyTorch for GPU acceleration and computational efficiency compared to sklearn's MDS.
gini multidimensional scalingpseudo-distancelatent configurationseuclidean mdsgpu acceleration
Inference-Time Alignment of Diffusion Models via Trust-Region Iterative Twisted Sequential Monte Carlo
The paper introduces Trust-Region Iterative Twisted Sequential Monte Carlo (TRI-TSMC), a method for inference-time alignment of diffusion models without weight updates. It addresses limitations of existing Sequential Monte Carlo (SMC) approaches—such as weight degeneracy and high-variance estimates—by learning twisting functions via a trust-region framework with closed-form KL-constrained updates and weighted maximum-likelihood projections. Theoretical analysis shows optimal twisting yields zero-variance sampling, while empirical results demonstrate improved alignment in discrete diffusion text generation and text-to-image tasks under fixed inference budgets.
sequential monte carlodiffusion modelsinference-time alignmenttwisting functionstrust-region optimization
Trust-Aware Joint Feature-Prediction Discrepancy for Robust Domain Adaptation
The authors propose trust-aware domain adaptation, introducing Joint Feature-Prediction Discrepancy (JFPD) to jointly model domain divergence in feature and prediction spaces while weighting contributions by sample-specific trust. Trust is quantified via two mechanisms: uncertainty-aware trust based on prediction entropy and semantic-alignment trust derived from prototype similarity. JFPD prioritizes confident, semantically consistent samples and suppresses noisy ones, providing reliability-aware domain discrepancy estimates. Integrated into a training objective, JFPD guides adaptation toward trustworthy target-domain regions. Experiments on standard benchmarks show superior adaptation performance and discrepancy estimates correlating with target-domain error, addressing trust modeling in feature-prediction interaction for domain adaptation.
domain adaptationdiscrepancy estimationuncertainty-aware trustsemantic-alignment trustfeature-prediction interaction
Courant: a State-Adaptive Perceiver-Based Neural Surrogate with Local Support and Interpretable Field Decomposition
The authors propose Courant, a Perceiver-based neural surrogate model featuring state-adaptive latent queries and local support in physical space, mimicking adaptive hp-refinement in numerical solvers. The architecture employs shared random Fourier feature embeddings, lightweight decoding, and trains end-to-end with L_2 loss on steady/transient simulation data. Results show competitive accuracy, with interpretable latents exhibiting multiscale geometric specialization and coherent structure tracking in time-dependent cases, enabling geometry-anchored field decomposition.
perceiver-basedhp-refinementfourier featurelatent queriesfield decomposition
Counterfactually Safe Reinforcement Learning
The authors propose a counterfactual safety framework for reinforcement learning that minimizes individual harm while maximizing expected return. They formalize individual harm as the event where an action yields a strictly worse outcome than a baseline alternative and introduce a two-stage procedure for learning harm-aware policies. Theoretical analysis establishes finite-sample properties, derives an upper bound on sub-optimality gap, and demonstrates controlled harm rates. Empirical evaluation on simulated and real-world datasets validates the approach's effectiveness in balancing safety and performance.
reinforcement learningcounterfactual safetyindividual harmsub-optimality gapfinite-sample properties
Revisiting Pre-Propagation GNNs: Robust Diffusion Operators and Hidden-State Re-Propagation
The paper introduces robust graph diffusion operators and a few-shot hidden-state re-propagation scheme to enhance pre-propagation GNNs (PPGNNs). PPGNNs decouple feature propagation from transformation, enabling efficient mini-batch training on dense compute accelerators but lag behind message-passing GNNs in accuracy, particularly on heterophilic graphs. The proposed methods bridge this gap, matching message-passing GNN accuracy while preserving training efficiency, as validated on standard benchmarks.
pre-propagation gnnsgraph diffusion operatorsheterophilic graphshidden-state re-propagationmini-batch training
Uncertainty-DTW for Sequences and Visual Tokens
We introduce uncertainty-DTW (uDTW), a probabilistic framework for aligning structured data that models pairwise correspondences with heteroscedastic uncertainty. uDTW employs a Maximum Likelihood Estimate objective combining precision-weighted matching to suppress unreliable features and log-variance regularization to prevent degenerate solutions. This approach generalizes from temporal sequences to tokenized visual representations, enabling structured matching over visual tokens while providing interpretable uncertainty estimates. Evaluations across diverse domains demonstrate consistent improvements over state-of-the-art methods, with learned uncertainty correlating with semantic importance. The framework establishes uncertainty-aware alignment as a robust and interpretable method for learning from structured data.
heteroscedastic uncertaintydynamic time warpingvisual tokensmaximum likelihood estimatestructured matching
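The two terms named above (precision-weighted matching plus a log-variance penalty) can be dropped into a standard dynamic-time-warping recursion. The sketch below works on scalar sequences and assumes the pairwise variance is the sum of two per-element variances; both choices are illustrative assumptions rather than the paper's formulation.

```python
import math

def udtw_cost(xs, ys, var_x, var_y):
    """DTW where each matched pair (i, j) costs
    (x_i - y_j)^2 / var_ij + log(var_ij), with var_ij = var_x[i] + var_y[j].
    All variances must be positive."""
    n, m = len(xs), len(ys)
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            var = var_x[i - 1] + var_y[j - 1]
            cell = (xs[i - 1] - ys[j - 1]) ** 2 / var + math.log(var)
            dp[i][j] = cell + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]
```

A large variance shrinks the squared-distance term for an unreliable pair, while the log term charges for claiming that variance, so setting every variance very high is not free.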
Leveraging Gauge Freedom for Learning Non-Gradient Population Dynamics of Stochastic Systems
We introduce Non-Gradient Inference Flows (NGIF), a method for inferring non-gradient population dynamics in stochastic systems by leveraging gauge freedom in vector field selection. NGIF employs a weak formulation of the continuity equation to parameterize general vector fields, enabling criteria beyond minimal kinetic energy. Experiments on low- and high-dimensional physics problems demonstrate that NGIF improves distributional accuracy and better captures non-potential transport compared to gradient-restricted baselines.
population dynamicsgauge freedomcontinuity equationvector fieldskinetic energy
RECTOR: Priority-Aware Rule-Based Reranking for Compliance-Aware Autonomous Driving Trajectory Selection
RECTOR introduces a rule-based reranking layer for autonomous driving trajectory selection, enforcing a tiered priority of safety > legal > road > comfort constraints via differentiable proxies and scene-conditioned applicability. The method employs a deterministic ε-lexicographic rule to preserve cross-tier priorities without retraining the underlying predictor. Evaluated on the Waymo Open Motion Dataset (43,219 instances, K=6), RECTOR reduces safety+legal violations from 28.58% to 20.42% and total violations from 40.32% to 32.41% compared to confidence-only selection, demonstrating robustness under adversarial confidence corruption (∼96% rejection rate).
trajectory selectionlexicographic optimizationdifferentiable proxiesrule-based rerankingautonomous driving
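A common way to realize an ε-lexicographic rule is tier-by-tier filtering: at each tier, keep only the candidates within ε of the best violation score, then move to the next tier. The sketch below follows that reading; the dictionary fields, the ε value, and the final confidence tie-break are assumptions, not RECTOR's actual interface.

```python
TIERS = ("safety", "legal", "road", "comfort")

def select_trajectory(candidates, eps=0.05):
    """candidates: list of dicts holding a 'confidence' value and one
    violation score per tier (lower is better). Higher tiers are
    filtered first, so a lower tier can never override them."""
    survivors = list(candidates)
    for tier in TIERS:
        best = min(c[tier] for c in survivors)
        survivors = [c for c in survivors if c[tier] <= best + eps]
    # Assumed tie-break: predictor confidence among the remaining candidates.
    return max(survivors, key=lambda c: c["confidence"])
```

The filtering touches only the predictor's outputs, which is consistent with the summary's point that no retraining of the underlying predictor is needed.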
Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression
The paper characterizes the rate-distortion limits of KV cache compression in autoregressive language models through sequential Wyner-Ziv coding, with next-step queries as decoder side information. Empirical analysis across four models (0.5-3B parameters) reveals polynomial (not geometric) decay in next-token distribution sensitivity to context truncation, validated via power-law fits and positional encoding ablations. Theoretical results show suffix-only cache policies achieve distortion ε with window size Θ(ε^{-1/α}), where α is the power-law exponent; a block-Markov scheme matches this bound under certain conditions. Practical evaluations confirm recency-based eviction outperforms random retention by two orders of magnitude.
kv cache compressionwyner-ziv codingautoregressive language modelspower-law decayrate-distortion
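The window-size scaling follows from inverting the power law: if truncation sensitivity at window w behaves like c * w^(-α), then requiring it to be at most ε gives w >= (c / ε)^(1/α). The helper below does that arithmetic and, for contrast, the logarithmic window that a geometric decay ρ^w would need; the constant c and both function names are assumptions for illustration.

```python
import math

def polynomial_window(eps, alpha, c=1.0):
    """Smallest integer w with c * w**(-alpha) <= eps (up to float rounding)."""
    return math.ceil((c / eps) ** (1.0 / alpha))

def geometric_window(eps, rho, c=1.0):
    """Smallest integer w with c * rho**w <= eps, for 0 < rho < 1 (contrast case)."""
    return max(0, math.ceil(math.log(c / eps) / math.log(1.0 / rho)))
```

With α = 0.5, halving ε quadruples the polynomial window, whereas under geometric decay the window would grow only by an additive constant.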
Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses, Evaluation, and Future Directions
This survey establishes a unified lifecycle framework for analyzing security threats and defenses in LLM fine-tuning, categorizing interventions into pre-tuning, during-tuning, and post-tuning phases. Through systematic review and cross-phase empirical evaluation on 1B-4B parameter models, it reveals scale-dependent attack dynamics (e.g., failed cross-lingual backdoor transfer) and limitations of single-phase defenses. Key findings include non-monotonic attack effectiveness across model generations and safety alignment vulnerabilities from benign samples, highlighting needs for configuration-robust and composable defenses.
fine-tuning lifecyclecross-lingual backdoorsafety alignmentweight-editing attacksembedding-space attacks
QML-PipeGuard: Drift-Aware Behavioral Fingerprinting for Quantum Machine Learning Pipeline Integrity
QML-PipeGuard introduces a contract-based framework for ensuring quantum machine learning (QML) pipeline integrity against hardware drift and adversarial channel substitution. The method employs behavioral fingerprinting via tomographically structured measurements, operating in drift-monitoring and adversarial-detection modes, with theoretical guarantees (tight frame-bound C=√3 for single-qubit Pauli family) and finite-shot sample-complexity bounds. Validation on IBM Heron r2 (ibm_fez) with a two-qubit QSVM pipeline confirms detection of adversarial channels within 1.4×10⁴ shots while tolerating natural hardware drift.
quantum machine learningbehavioral fingerprintingtomographic measurementchannel substitutionsample-complexity
Reinforcement Learning for Laser Additive Manufacturing Scan-Order Optimisation: A Bilevel Proxy-FEA Diagnostic Framework for Reward and World-Model Diagnosis
The paper proposes a bilevel Proxy-FEA diagnostic framework for evaluating reward functions and world models in reinforcement learning (RL) for laser additive manufacturing scan-order optimization. The method combines lightweight thermo-inspired proxies for rapid candidate generation with sparse Abaqus FEA simulations for reference validation, tested on an LDED32 stripe benchmark with ten scan strategies. Results reveal a stress-distortion trade-off, identify center_out as a robust compromise strategy, and show that current path-based proxies primarily capture distortion (U3) and correlate only weakly with FEA, highlighting the risks of proxy-only RL reward designs.
reinforcement learningscan-order optimizationfinite-element analysisproxy metricsthermo-mechanical objectives
GL-LFGNN: A Global-Local Dual-branch Causal Graph Neural Network Based on Liang-Kleeman Information Flow for EEG Emotion Recognition
The paper introduces GL-LFGNN, a global-local dual-branch causal graph neural network for EEG emotion recognition, leveraging Liang-Kleeman information flow theory to model asymmetric neural causal interactions. Unlike conventional GNNs using symmetric adjacency matrices, it quantifies directed causal strength via dynamical systems theory, integrating whole-brain connectivity with region-specific processing. Evaluated on the MEEG dataset, GL-LFGNN achieves 86.17% (Arousal) and 86.71% (Valence) accuracy with only 37K parameters, outperforming state-of-the-art models in both efficiency and interpretability.
eeg emotion recognitionliang-kleeman information flowcausal graph neural networkdynamical systems theoryglobal-local dual-branch
Random Neural Network Expressivity for Non-Linear Partial Differential Equations
This work investigates the expressivity of random neural networks (RaNNs) for approximating solutions to non-linear partial differential equations (PDEs). The authors derive error bounds for RaNN approximations to time-dependent Sobolev functions, achieving a dimension-free approximation rate of 1/2 for sufficiently regular functions. Theoretical results are applied to Porous Medium Equations and Compressible Navier-Stokes Equations, demonstrating RaNNs' capability to approximate solutions efficiently. Numerical experiments validate the derived convergence rates, extending their applicability beyond the theoretical setting.
random neural networksnon-linear pdessobolev functionserror boundsporous medium equations
Scale When Needed: Adaptive Neuron-level Mixed Precision Quantization Aware Training
Neuron-Level Mixed-Precision Quantization-Aware Training (NMP-QAT) introduces adaptive precision allocation at the neuron level, enabling independent learning of discrete precision per neuron during training. The method employs differentiable surrogates and straight-through estimators to expand bit-width only when training signals necessitate, while maintaining a fully discrete inference graph. NMP-QAT adapts both weights and activations, reducing memory movement. Evaluated on telecom and non-telecom datasets across MLP and tabular foundation models, it achieves superior compression-accuracy trade-offs compared to mixed-precision QAT baselines, making it suitable for Green AI deployments on resource-constrained 6G edge devices.
quantization-aware trainingmixed-precisionneuron-levelstraight-through estimator6g edge devices
Multimodality Stacking with Blockwise missing values and application to the PIONeeR biomarkers study for prediction of resistance to immunotherapy
The study introduces Multimodality Stacking with Blockwise missing values (MSB), a late-fusion framework for survival analysis that handles incomplete multimodal datasets by independently modeling modality-specific features before aggregating predictions via cross-validated stacking. Validated on the PIONeeR study (n=443 patients, 378 biomarkers across 8 sources), MSB outperformed baselines in predicting progression-free survival for NSCLC patients under immunotherapy, with C-index improvements of 15.9% for linear models, 5.4% for random survival forests, and 2.1% for gradient boosting (all p<0.05). MSB also reduced generalization gaps (train-test difference: 0.055 vs 0.380) and identified key predictive biomarkers without bias from missing data patterns.
multimodal stackingblockwise missingnesssurvival analysislate-fusionbiomarker integration
TRACE: A taxonomy-grounded synthetic dataset for teaching-program generation and session interpretation in Applied Behavior Analysis
The paper introduces TRACE (Taxonomy-Referenced ABA Clinical Examples), a synthetic 2,999-example instruction-tuning dataset for Applied Behavior Analysis (ABA), addressing the lack of publicly available clinical data due to HIPAA restrictions. TRACE covers two ABA tasks: teaching-program generation (Discrete Trial Training, Natural Environment Teaching, Task Analysis) and multi-session behavioral interpretation (12 trajectory patterns, 13 target behaviors). Examples are generated deterministically via a taxonomy-driven method grounded in ABA literature, with full provenance tracking. The dataset is released under CC BY-NC 4.0 (data) and MIT (code), with stratified splits (2,549 train, 149 validation, 281 test, 20 sanity).
synthetic datasetinstruction-tuningapplied behavior analysistaxonomy-driven generationclinical documentation
MimirRAG: A Multi-Agent RAG Framework for Financial Data Retrieval with Metadata Integration
The paper introduces MimirRAG, a multi-agent RAG framework for financial data retrieval, featuring metadata integration, table-aware chunking, and agentic workflows. The system employs structure-preserving PDF parsing, hybrid search, and context-aware generation with numerical reasoning. Evaluation on FinanceBench shows 89.3% accuracy, outperforming baselines, with expert validation emphasizing trust calibration and user personalization. The study identifies metadata integration, table-aware chunking, and agentic workflows as key enablers for effective financial RAG systems.
retrieval-augmented generationmetadata integrationtable-aware chunkingagentic workflownumerical reasoning
A perspective on fluid mechanical environments for challenges in reinforcement learning
The paper proposes fluid mechanics as a testbed for reinforcement learning (RL) in high-dimensional, nonstationary environments, focusing on nonlinear instabilities like droplet breakup and rogue waves. It introduces two RL problem formulations with specified state/action spaces and reward functions, leveraging preserved invariances in fluid dynamics. The authors demonstrate environment generation using Dedalus for stationary navigation tasks, suggesting future work on RL for industrial/scientific flow challenges.
reinforcement learningfluid mechanicsnonlinear instabilitiesnonstationary environmentsdedalus
Convex-Neural RRT*: Fast and Reliable Learning-Guided Sampling for High-Quality Robot Path Planning
Convex-Neural RRT* introduces neural-guided sampling for high-quality robot path planning by predicting informative waypoint regions and extracting convex candidate regions to focus exploration. The method combines neural network predictions with geometric constraints, preserving global exploration while improving efficiency. Evaluated against Neural RRT*, Neural Informed RRT*, RRT*, and LTA* across 18 benchmark maps, it reduces computation time by 30-75% versus neural-guided variants and by 88-98% versus LTA*, achieving a 5% average path length reduction over classical RRT* with a 99% success rate across obstacle densities.
sampling-based planningneural guidanceconvex regionsrrt*robot navigation
Metropolis-Scale Resilient and Trustworthy Traffic Flow Inference Using Multi-Source Data
The Task-Aware Attentive Neural Process (TA-ANP) is introduced as a unified probabilistic framework for resilient and trustworthy global traffic state inference (GTSI) by fusing floating car data (FCD) with sparse fixed-detector measurements. TA-ANP leverages neural processes for rapid adaptation to sensing configuration changes and employs a task-aware multi-query attention module to handle three GTSI sub-tasks while mitigating cross-task interference. Uncertainty is quantified using Monte Carlo Dropout for both aleatoric and epistemic uncertainty. Evaluated on the Metropolitan Multi-Source Traffic Dataset (MMTD) with 2,371 road segments, TA-ANP achieves state-of-the-art performance across sub-tasks and demonstrates superior resilience in sensing lifecycle scenarios.
neural processesuncertainty quantificationtraffic state inferencemulti-source datamonte carlo dropout
Mitigating Gradient Pathology in PINNs through Aligned Constraint
The paper proposes Constraint-Aligned loss with Manifold Lifting (CAML) to mitigate gradient pathology in Physics-Informed Neural Networks (PINNs). By reformulating zeroth-order terms into aligned constraints and introducing a delay factor to bypass high-curvature regions, CAML resolves gradient conflicts between PDE residuals and boundary constraints. Experiments show CAML improves numerical stability and training efficiency for complex PINN problems, outperforming adaptive weighting and hard constraint methods. The method is supported by systematic analysis of gradient pathology through loss landscape and optimization dynamics perspectives.
physics-informed neural networksgradient pathologypartial differential equationsconstraint alignmentloss landscape
Scaling up Energy-Aware Multi-Agent Reinforcement Learning for Mission-Oriented Drone Networks with Individual Reward
We propose an energy-aware multi-agent reinforcement learning (MARL) model for mission-oriented drone networks, addressing dynamic environments and limited battery capacity. The model leverages Deep Q-Networks (DQN) with individual reward functions based on task execution progress and remaining battery levels. Simulation studies demonstrate that the model achieves at least 80% success rate across task locations and lengths, scaling robustly with environment size and agent numbers. Compared to shared reward MARL, our approach improves energy efficiency and success rates, reaching nearly 100% success at 40% task density.
multi-agent reinforcement learningdeep q-networksenergy-awaredrone networksindividual reward
Selective Test-Time Compute Scaling for Click-Through Rate Prediction via Uncertainty-Triggered Feature Path Exploration
The paper proposes UTTSI, a training-free framework for Click-Through Rate (CTR) prediction that dynamically scales test-time compute based on per-instance uncertainty. The method combines model logit confidence with data-level frequency priors to distinguish epistemic uncertainty, then applies adaptive feature filtering and stochastic feature-path exploration for uncertain instances, aggregating predictions via consistency-weighted ensembling. Experiments across four datasets and three architectures show statistically significant improvements over baselines, with a 5.3% relative CTR gain in online A/B testing, while maintaining average overhead at 2.8× base model cost.
click-through rate predictiontest-time computeuncertainty estimationadaptive feature filteringconsistency-weighted ensembling
Self-Balancing Gradient Allocation for Heterogeneity-Aware Feature Generation in Click-Through Rate Prediction
HeteGenCTR addresses generative difficulty imbalance in click-through rate prediction by introducing per-field learnable difficulty parameters jointly trained with the denoising network. The method employs a self-balancing loss that reallocates gradient budget toward harder fields and a difficulty-guided attention mechanism that suppresses easy fields while amplifying cross-field information flow. Both components utilize the same learned signal, maintaining consistency throughout training. Experiments on five CTR benchmarks and a seven-day online A/B test show statistically significant improvements over state-of-the-art baselines, particularly benefiting cold-start and long-tail users.
click-through rate prediction · generative difficulty imbalance · self-balancing loss · difficulty-guided attention · cold-start users
Learning, locomotion, and navigation of soft synthetic snakes in three-dimensional, heterogeneous environments
The study presents a computational framework for enabling soft synthetic snakes to navigate unstructured 3D terrains, combining bio-inspired actuation and sensing models with reinforcement learning. Locomotion primitives are first trained in homogeneous environments, then composed into adaptive strategies for complex landscapes. The method demonstrates robustness in high-fidelity 3D environments reconstructed from real-world imaging, achieving reliable navigation for continuum systems.
soft robotics · reinforcement learning · continuum systems · bio-inspired actuation · 3d navigation
Benchmarking non-conformity score functions in conformal prediction
The paper benchmarks non-conformity score functions in conformal prediction, addressing a gap in comparative analysis. It reviews existing score functions, proposes novel modifications, and introduces an evaluation method for prediction set sizes. Experiments compare score functions' efficacy, particularly in class-conditional conformal prediction with imbalanced datasets. Results demonstrate variability in prediction set sizes across different score functions, highlighting their impact on conformal prediction's utility.
conformal prediction · non-conformity score · prediction sets · model calibration · imbalanced classes
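For readers unfamiliar with the setup, split conformal prediction with the common score `1 - p(true class)` can be sketched as follows. This is a generic illustration, not one of the score functions proposed in the paper; the calibration data is a toy example, and the function names are assumptions.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: every class must be included.
        return float("inf")
    return sorted(scores)[rank - 1]

def prediction_set(probs, threshold):
    """Classes whose non-conformity score 1 - p does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]

# Toy calibration data: (predicted class probabilities, true label).
calibration = [
    ([0.7, 0.2, 0.1], 0),
    ([0.1, 0.8, 0.1], 1),
    ([0.3, 0.3, 0.4], 2),
    ([0.6, 0.3, 0.1], 1),
    ([0.2, 0.1, 0.7], 2),
    ([0.5, 0.4, 0.1], 0),
    ([0.1, 0.6, 0.3], 1),
    ([0.25, 0.25, 0.5], 2),
    ([0.9, 0.05, 0.05], 0),
]

# Non-conformity score of each calibration example: 1 - probability of the true class.
scores = [1.0 - probs[label] for probs, label in calibration]
q_hat = conformal_quantile(scores, alpha=0.2)

# Prediction set for a new example at roughly 80% target coverage.
new_set = prediction_set([0.5, 0.45, 0.05], q_hat)
```

Swapping in a different score function changes only the two lines that compute `scores` and the membership test, which is why prediction set size is a natural axis for comparison. A class-conditional variant would compute one threshold per class from that class's calibration examples.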
Large Language Model Selection with Limited Annotations
The paper introduces SELECT-LLM, the first framework for active model selection of Large Language Models (LLMs) with limited annotations. The method selects informative queries based on expected information gain derived from pairwise similarities between candidate model outputs, requiring no architectural assumptions or weight access. Evaluated across 23 datasets, 156 models, and diverse tasks, SELECT-LLM reduces annotation costs by up to 81.8% for best model selection and 84.78% for near-best selection, outperforming all baselines.
large language models · active learning · model selection · information gain · annotation efficiency
Bridging the Gap: Enabling Soft Actor Critic for High Performance Legged Locomotion
This work bridges the performance gap between Soft Actor-Critic (SAC) and Proximal Policy Optimization (PPO) for legged robot locomotion by introducing targeted algorithmic modifications. The proposed enhancements include policy initialization strategies, timeout-aware critic targets, and multi-step return estimation, enabling stable large-scale SAC training. Evaluated across multiple legged robot platforms and diverse locomotion tasks, the modified SAC achieves parity with PPO's empirical performance while maintaining its off-policy advantages for sim-to-real transfer and online adaptation.
soft actor-critic · proximal policy optimization · sim-to-real transfer · legged locomotion · multi-step return
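Two of the listed modifications, multi-step returns and timeout-aware critic targets, can be illustrated together. The sketch below follows the usual convention that a true terminal state contributes no future value, while an episode cut off by a time limit still bootstraps from the critic. It is an assumption-laden illustration, not the paper's implementation: the function name and argument layout are hypothetical, and in SAC the bootstrap value would be the entropy-regularized (soft) value, which is passed in here as a plain number.

```python
def n_step_target(rewards, terminated, truncated, bootstrap_values, gamma=0.99):
    """n-step TD target for the first transition of a short window.

    rewards[i]           reward received at step i
    terminated[i]        True if step i reached a true terminal state (e.g. a fall)
    truncated[i]         True if step i ended only because of the episode time limit
    bootstrap_values[i]  critic estimate for the state reached after step i
    """
    if not rewards:
        raise ValueError("window must contain at least one transition")
    target, discount = 0.0, 1.0
    for i, reward in enumerate(rewards):
        target += discount * reward
        discount *= gamma
        if terminated[i]:
            return target  # true terminal: no value beyond this step
        if truncated[i]:
            # Timeout: the task would have continued, so keep the bootstrap term.
            return target + discount * bootstrap_values[i]
    # Window ended without an episode boundary: bootstrap from the last state.
    return target + discount * bootstrap_values[-1]
```

Treating a timeout as a terminal state would teach the critic that value drops to zero near the time limit, which is one plausible source of instability that a timeout-aware target avoids.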
📰 Industry Media (9)
Rethinking organizational design in the age of agentic AI
The article introduces Agentic Business Transformation (ABT) as a framework for integrating AI agents into organizational structures, contrasting it with incremental AI adoption. Drawing on industry data (85% organizational ambition vs. 76% infrastructure readiness) and expert analysis from PwC and Ema, it identifies three ABT pillars: technology stack redesign for agentic workflows (e.g., cross-system tacit knowledge), workforce restructuring for hybrid human-AI teams, and outcome-based metrics replacing activity tracking. Early adopters report 30-50% process acceleration and 3x ROI shifts when prioritizing systemic over point solutions.
agentic business transformation · tacit knowledge · hybrid workforce · outcome-based metrics · systems-level change
A reality check on the AI jobs hysteria
Analysis of US Bureau of Labor Statistics data reveals minimal large-scale AI-driven labor market disruption, with unemployment rates in AI-exposed occupations lower than less-exposed sectors. Stanford Digital Economy Lab's study of 950 occupations using ADP payroll data identifies a 16% decline in entry-level jobs for 22-25-year-olds in high-exposure fields (e.g., software development) post-2024, contrasting with growth for older workers. Task-based analysis shows automation-prone roles declining while augmentation-focused roles expand. Only 20% of companies currently deploy AI, suggesting gradual adoption. Emerging diagnostic tools track sector-specific AI adoption rates (~40% workforce penetration) and productivity impacts.
labor economics · task automation · occupational exposure · adp dataset · codified knowledge
It’s time to address the looming crisis in entry-level work.
Recent studies highlight a concerning trend in early-career employment due to AI adoption, particularly in AI-exposed occupations. A Stanford Digital Economy Lab working paper (2025) found a 16% relative decline in employment for workers aged 22-25 in such roles, while experienced workers remained unaffected. This suggests firms are substituting AI for junior tasks traditionally used for skill-building. The Federal Reserve Bank of New York reported rising unemployment (5.6%) and underemployment (42.5%) among recent graduates in Q4 2025. To mitigate this, educational institutions must integrate AI literacy, prompt-based workflows, and verification skills into curricula, while governments and firms should incentivize structured, AI-augmented entry-level roles to preserve long-term workforce development.
early-career employment · ai-exposed occupations · prompt-based workflows · verification skills · ai-augmented roles
Meet OmniVoice Studio: A Local, Open-Source Alternative to ElevenLabs
OmniVoice Studio presents an open-source, locally executable alternative to ElevenLabs' cloud-based voice AI services, offering six core functionalities: voice cloning via 3-second audio clips using zero-shot diffusion-based TTS (supporting 600+ languages), voice design parameterization, video dubbing with WhisperX transcription and Demucs audio separation, real-time dictation, speaker diarization via Pyannote, and batch processing. The architecture combines React/FastAPI with CUDA/MPS/ROCm GPU support, featuring pluggable TTS engines (OmniVoice, CosyVoice 3, MLX-Audio, VoxCPM2, MOSS-TTS-Nano, KittenTTS) and neural watermarking via AudioSeal. Benchmarks show 646-language TTS coverage and 99-language ASR, with CPU fallback for ≤8GB VRAM systems.
zero-shot learning · diffusion-based tts · speaker diarization · pluggable backend · neural watermarking
Design a Complete Multimodal RLVR Pipeline with Open-MM-RL, Vision-Language Prompting, Reward Scoring, and GRPO Export
The tutorial introduces a multimodal RLVR pipeline leveraging the TuringEnterprises/Open-MM-RL dataset for vision-language reasoning tasks. It details dataset preprocessing, including schema analysis, domain-specific visualization, and LaTeX block extraction. A verifiable reward function is implemented for exact, numeric, and symbolic answer grading, alongside a LaTeX-to-SymPy converter for mathematical evaluation. The pipeline integrates SmolVLM for inference and exports data in GRPO format for RL training. Initial tests yield a mean reward of 0.3 over six samples, demonstrating the pipeline's utility for multimodal RL applications.
multimodal rlvr · vision-language prompting · latex-to-sympy · reward scoring · grpo export
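A verifiable reward of the kind described, restricted to the exact and numeric cases, might look like the sketch below. It is not the tutorial's code: the normalization rules, tolerance, and function names are assumptions, and the symbolic case (which the tutorial handles through a LaTeX-to-SymPy converter) is deliberately left out so the example needs only the standard library.

```python
import math

def normalize(answer):
    """Trim whitespace, a trailing period, and surrounding $ delimiters; lowercase."""
    return answer.strip().rstrip(".").strip("$").strip().lower()

def verifiable_reward(prediction, reference, rel_tol=1e-6):
    """Return 1.0 for an exact or numerically equal answer, else 0.0."""
    pred, ref = normalize(prediction), normalize(reference)
    if pred == ref:
        return 1.0  # exact string match after normalization
    try:
        # Numeric match within a relative tolerance.
        return 1.0 if math.isclose(float(pred), float(ref), rel_tol=rel_tol) else 0.0
    except ValueError:
        # Not parseable as numbers; a symbolic equivalence check would go here.
        return 0.0

def mean_reward(pairs):
    """Average reward over (prediction, reference) pairs."""
    return sum(verifiable_reward(p, r) for p, r in pairs) / len(pairs)
```

With this reduced grader, algebraically equivalent answers such as `x+1` and `1+x` score 0.0, which is the gap a symbolic comparison is meant to close.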
Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving
Together AI introduces OSCAR, an attention-aware 2-bit KV cache quantization system for long-context LLM serving, addressing channel-wise outliers via offline-calibrated rotations derived from query and score-weighted value covariances. The method combines optimal eigenvector-aligned rotations (UQ/US), Walsh-Hadamard transforms, and permuted bit-reversal to achieve 8× memory reduction with minimal accuracy drop (e.g., −0.02 for Qwen3-32B). Integrated into SGLang's paged attention system, OSCAR yields 3× decode speedup at 100K context while maintaining near-BF16 accuracy across benchmarks like AIME25 and RULER-NIAH.
kv cache quantization · attention-aware rotation · paged attention · channel-wise outliers · walsh-hadamard transform
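Why a rotation helps with channel-wise outliers can be shown with a plain Walsh-Hadamard transform followed by 2-bit uniform quantization. This sketch covers only that generic building block: OSCAR's calibrated rotations, permuted bit-reversal, and SGLang integration are not reproduced, and the min-max quantizer, helper names, and toy vector are assumptions. The 8× figure is consistent with going from 16-bit to 2-bit values, ignoring any per-group metadata.

```python
import math

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]

def quantize_2bit(vec):
    """Min-max uniform quantization to 4 levels; returns (codes, offset, step)."""
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / 3 or 1.0  # avoid a zero step for constant vectors
    codes = [min(3, max(0, round((v - lo) / step))) for v in vec]
    return codes, lo, step

def dequantize(codes, lo, step):
    return [lo + c * step for c in codes]

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vector with one outlier channel dominating the value range.
x = [8.0, 0.1, -0.1, 0.2, 0.0, -0.2, 0.1, -0.1]

# Quantize directly: the outlier stretches the grid and the small values collapse.
err_plain = squared_error(x, dequantize(*quantize_2bit(x)))

# Rotate, quantize, dequantize, rotate back (the orthonormal transform is its own inverse).
rotated = fwht(x)
err_rotated = squared_error(x, fwht(dequantize(*quantize_2bit(rotated))))
```

The rotation spreads the outlier's energy across all coordinates, so the rotated vector has a much narrower range and the four quantization levels are used more evenly.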
Step by Step Guide to Build and Compare FedAvg and FedProx Federated Learning on Non-IID CIFAR-10 with NVIDIA FLARE
This tutorial implements a federated learning experiment comparing the FedAvg and FedProx algorithms on non-IID CIFAR-10 data using NVIDIA FLARE. The authors partition CIFAR-10 across 3 clients using a Dirichlet distribution (α=0.3) to simulate label imbalance, then train a CNN for 5 communication rounds with 1 local epoch per round. The results track global test accuracy across rounds and compare FedProx (μ=0.1) with FedAvg under heterogeneous data conditions. The implementation uses NVFlare's Job API for server orchestration and its Client API for local training, model exchange, and aggregation.
federated learning · non-iid · dirichlet distribution · nvflare · fedprox
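The three ingredients named in this summary (a Dirichlet label split, FedAvg aggregation, and the FedProx proximal term) can be sketched without any framework. The code below is a standalone illustration that does not use or imitate NVIDIA FLARE's APIs; the function names are hypothetical, and models are represented as flat lists of floats for brevity.

```python
import random

def dirichlet(alpha, k, rng):
    """Sample a k-way Dirichlet(alpha) vector using gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:
        return [1.0 / k] * k  # numerical fallback for vanishing draws
    return [d / total for d in draws]

def dirichlet_partition(labels, num_clients=3, alpha=0.3, seed=0):
    """Split sample indices across clients with per-class Dirichlet proportions."""
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        shares = dirichlet(alpha, num_clients, rng)
        start, cumulative = 0, 0.0
        for c, share in enumerate(shares):
            cumulative += share
            # The last client takes the remainder so every index is assigned once.
            end = len(idx) if c == num_clients - 1 else round(cumulative * len(idx))
            clients[c].extend(idx[start:end])
            start = end
    return clients

def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
            for j in range(dim)]

def fedprox_gradient(grad, local_w, global_w, mu=0.1):
    """Add the gradient of the proximal term (mu / 2) * ||w - w_global||^2."""
    return [g + mu * (w - wg) for g, w, wg in zip(grad, local_w, global_w)]
```

A small α such as 0.3 concentrates each class on a few clients, which produces the label imbalance the tutorial studies. The proximal term then pulls each local update back toward the global model, and setting `mu=0` recovers the plain FedAvg local gradient.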
Autonomous AI systems test governance in physical environments
The Infocomm Media Development Authority (IMDA) of Singapore released version 1.5 of its Model AI Governance Framework, addressing risks posed by autonomous AI systems in physical environments. The framework emphasizes iterative risk assessment, human oversight, technical controls (e.g., least-privilege access), and continuous monitoring through simulation and telemetry. Case studies from Grab and OCBC Bank demonstrate deployment challenges, including reliability testing and task-level autonomy. A Reuters/Nikkei survey indicates 34% of Japanese firms are adopting AI robots, primarily in manufacturing. The framework highlights amplified physical risks compared to digital systems, necessitating multi-stakeholder accountability across the AI value chain.
agentic ai · least-privilege access · telemetry monitoring · embodied ai · iterative testing
Proving the case on day two at TechEx North America
The TechEx North America conference addressed key challenges in enterprise AI adoption, focusing on transitioning from experimental pilots to durable systems. Sessions analyzed governance frameworks, risk control, and ROI measurement, emphasizing cross-functional collaboration and data lineage. Agentic AI emerged as a critical area, requiring formal evaluation and boundary definitions for system-level actions. Cybersecurity tracks highlighted the 'GenAI velocity gap', where adoption outpaces security oversight, necessitating zero-trust architectures for AI systems and workflows. Government transformation cases demonstrated AI's role in public service reliability and explainability. The conference underscored that successful AI implementation depends on organizational change readiness, data quality, and accountable outcome alignment.
agentic ai · governance frameworks · genai velocity gap · zero-trust architectures · change readiness
Generated automatically at 2026-05-26 21:18 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.
