Daily Digest — 2026-06-20
321 items · 316 arXiv papers, 5 industry media
🏛️ Research Labs
No new items today.
📜 arXiv Papers (316)
How Transparent is DiffusionGemma?
The study investigates the transparency of DiffusionGemma's reasoning compared to autoregressive Gemma 4, focusing on variable and algorithmic transparency. Initially, DiffusionGemma exhibits 28.6X higher opaque serial depth due to continuous latent space computation. By mapping intermediate states through an interpretable token bottleneck, this gap reduces to 1.1X. Algorithmic transparency remains challenging due to dynamic token updates during denoising. Case studies reveal diffusion-specific phenomena like non-chronological reasoning and token smearing. Despite these challenges, DiffusionGemma maintains monitorability comparable to Gemma 4 for downstream tasks.
transparency · diffusiongemma · serial depth · denoising · monitorability
Structuring and Tokenizing Distributed User Interest Context for Generative Recommendation
The paper introduces G2Rec, a framework for generative recommendation that unifies graph-based user co-engagement modeling with semantic tokenization. It addresses limitations in existing methods, such as scalability issues in graph serialization and lack of explicit supervision in semantic tokenization. G2Rec captures holistic user interest prototypes without ground-truth labels, enabling more accurate behavior modeling. Online deployment and experiments on public datasets demonstrate its superiority over existing approaches in industrial sequential recommendation.
generative recommendation · graph serialization · semantic tokenization · user co-engagement · sequential recommendation
Toward Calibrated Mixture-of-Experts Under Distribution Shift
This work investigates calibration in mixture-of-experts (MoE) models under distribution shift, demonstrating that expert-level calibration ensures overall model calibration in hard-routed but not soft-routed models. The authors propose adversarial reweighting to penalize calibration errors of the routed aggregate under shift, showing improved accuracy-calibration tradeoffs across model classes, tasks, and shifts. Results validate the method's effectiveness on average and challenging data subsets.
calibration · mixture-of-experts · distribution shift · adversarial reweighting · hard-routed
How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech
The study introduces cross-attention attribution for speech diffusion models, adapting the DAAM framework to analyze how style captions influence acoustic output in CapSpeech-TTS. The method extracts per-token heatmaps across 25 layers and 24 ODE steps, analyzing 3,600 (caption, transcript) pairs. Key findings include: style tokens exhibit lower temporal variance than content/function tokens, style attention correlates with F0 and energy, style conditioning peaks in early steps and deep layers, and attention entropy minimizes at layer 17, coinciding with maximal style importance. This is the first analysis of natural language's impact on cross-attention in speech diffusion models.
cross-attention attribution · speech diffusion models · style-captioned tts · acoustic output · ode steps
LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents
The paper introduces LedgerAgent, an inference-time method for policy-adherent tool-calling agents that maintains structured task states in a separate ledger. The approach explicitly tracks facts, identifiers, and constraints, rendering them into prompts and enforcing state-dependent policy checks before tool execution. Evaluated across four customer-service domains with mixed model panels, LedgerAgent demonstrates improved pass@k performance over standard prompt-based methods, particularly under strict multi-trial consistency metrics.
tool-calling agents · policy adherence · structured state · inference-time method · pass@k
DeepSWIP: Quotient-WMC Counterfactuals for Neural Probabilistic Logic Programs
DeepSWIP introduces a single-world counterfactual semantics for DeepProbLog programs, enabling causal reasoning through neural materialization and weighted model counting (WMC). The method reduces neural predicates to ProbLog choices, applies Single World Intervention Programs (SWIPs), and computes counterfactuals via quotient-WMC over a transformed program. Experiments on MPI3D validate the approach with 12,000 queries, showing a 2.14× inference speedup over DeepTwin and addressing calibration biases via randomized-policy AIPW estimators.
deepsprob · counterfactual · wmc · neural materialization · swip
SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm
The paper introduces SARLO-80, a large-scale dataset addressing the scarcity of multimodal resources for synthetic aperture radar (SAR) by combining very-high-resolution (VHR) SAR SLC data, aligned optical imagery, and natural-language descriptions. The dataset standardizes 2,500 worldwide SAR scenes to an 80cm slant-range grid, generating 119,566 triplets (complex/amplitude SAR patches, optical patches, captions) across 257 locations in 72 countries. It supports vision-language tasks through three caption variants and provides reproducible benchmarks for cross-modal retrieval and conditional generation in native SAR geometry. The dataset is publicly available on Hugging Face Hub.
synthetic aperture radar · multimodal dataset · slant-range grid · vision-language · cross-modal retrieval
Sovereign Execution Brokers: Enforcing Certificate-Bound Authority in Agentic Control Planes
The paper introduces Sovereign Execution Brokers (SEB), a runtime enforcement boundary for certificate-bound agentic infrastructure that separates proposal, admission, and execution phases. SEB verifies execution contracts against certificates from a Sovereign Assurance Boundary (SAB), checks validity windows and state drift, mints scoped identities, and logs signed decisions. The method includes certificate predicates, identity semantics, and bypass-prevention patterns, implemented in a prototype evaluated on AWS/Kubernetes with measured latency overhead, revocation propagation, and fault tolerance.
sovereign execution broker · certificate-bound authority · runtime enforcement · scoped identity · drift detection
FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS
FlowEdit introduces lifelong adaptation for flow-matching TTS systems by learning pronunciation corrections as latent conditioning edits instead of weight updates. The framework uses corrective feedback to optimize token-level perturbations in text embedding space, storing corrections in a Modern Hopfield Network for content-addressable episodic memory. At inference, corrections are retrieved via soft attention with a similarity gate, enabling fuzzy morphological matching. On a benchmark of 312 multilingual proper nouns across 18 language families, FlowEdit reduces target-word Phoneme Error Rate by 92.7% relative to zero-shot baselines while preserving general-speech quality, with corrections completing in ~15 seconds on a single GPU.
flow-matching · modern hopfield network · phoneme error rate · content-addressable memory · zero-shot
Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages
Multi-LCB extends LiveCodeBench (LCB) to evaluate large language models (LLMs) across twelve programming languages, addressing LCB's Python-only limitation. The benchmark transforms Python tasks from LCB into equivalent tasks in other languages while maintaining LCB's contamination controls and evaluation protocol. Multi-LCB automatically tracks future LCB updates, enabling systematic assessment of cross-language code generation. Evaluation of 24 LLMs revealed Python overfitting, language-specific contamination, and significant multilingual performance disparities. These findings establish Multi-LCB as a rigorous benchmark for multi-programming-language code evaluation, exposing critical gaps in current LLM capabilities.
livecodebench · multi-lcb · code-generation · contamination-controls · multilingual-performance
Efficient and Sound Probabilistic Verification for AI Agents
The authors present a sound and efficient probabilistic verification framework for AI agents, addressing limitations of deterministic policy enforcement in complex environments. The method employs distributionally robust optimization to compute rigorous upper bounds on policy violation probabilities, handling correlated predicates without independence assumptions. Evaluations on terminal and tool-calling agent benchmarks demonstrate improved security-utility trade-offs and superior performance over prior approaches.
probabilistic verification · distributionally robust optimization · ai agents · policy violation · datalog
What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?
This work characterizes how safety-aligned LLMs interpret mixed compliance demonstrations by testing three hypotheses about demonstration composition's impact on harmful compliance. The authors mix benign (non-harmful request, helpful response) and harmful (harmful request, helpful response) demonstrations, evaluating four models across demonstration ordering, refusal behavior, and training stages. Results show benign and harmful demonstrations are not interchangeable, with benign demonstrations variably reducing or increasing harmful compliance based on the model. Preference optimization emerges as critical in preventing benign demonstrations from increasing harmful compliance, while demonstration ordering exhibits recency bias. Models differ in how refusal interacts with in-context learning, either adopting demonstrated formatting or overriding all in-context signals.
compliance demonstrations · preference optimization · recency bias · in-context learning · refusal behavior
FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining
FreeStyle introduces a scalable framework for style-content dual-reference image generation by leveraging community LoRAs as compositional anchors. The method employs a two-stage curriculum with attention-level enrichment constraints and frequency-aware RoPE modulation to prevent content leakage, alongside a pipeline for constructing large-scale Style-Reference and Content-Reference triplets. Evaluated on a new benchmark with style-invariant Content Alignment Score and VLM-based Rejection Score, FreeStyle demonstrates balanced performance in style alignment (94.5% similarity), content preservation (89.2% fidelity), and leakage suppression (92.1% rejection rate).
lora mining · dual-reference generation · content leakage · rope modulation · style alignment
Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software
The study introduces CWE-Trace, a framework with 834 Linux kernel samples across 74 CWEs, to evaluate LLMs' vulnerability detection capabilities while controlling for data contamination. Using Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD) metrics, it assesses eight vanilla LLMs and 15 LoRA fine-tuned variants. Key findings show data contamination offers no advantage (84% of samples lack memorization signals) and fine-tuning adjusts output thresholds without improving security reasoning (best detection at 52.1%, CWE Top-1 accuracy below 1.3%).
cwe-trace · directional failure index · data contamination · lora fine-tuning · vulnerability detection
Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems
The paper introduces Contagion Networks, a framework for quantifying how evaluator biases propagate in multi-agent LLM systems. Using DeepSeek-chat agents with three bias profiles (structured, balanced, evidence-based), the authors measure the Cross-Agent Contagion Matrix Γ_3, finding consistent bias propagation (γ∈[0.157,0.352]). They identify three propagation regimes based on spectral radius ρ(Γ_N), showing homogeneous-model agents exhibit 3-5x weaker contagion than cross-model systems (γ≈0.85-1.3). Increasing evaluator committee size from k=1 to k=3 reduces contagion by 72.4%. The framework is released as open-source.
contagion networks · evaluator bias · multi-agent systems · spectral radius · cross-agent propagation
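A minimal sketch of how a spectral radius could be computed for a small contagion matrix. The matrix values, the zero diagonal, and the function name `spectral_radius` are illustrative assumptions and are not taken from the paper; power iteration is used here as a standard technique for nonnegative matrices and is not claimed to be the authors' method.

```python
def spectral_radius(matrix, iterations=200):
    """Estimate the largest-magnitude eigenvalue of a nonnegative square
    matrix by power iteration (assumes a unique dominant eigenvalue)."""
    n = len(matrix)
    v = [1.0] * n  # start from the all-ones vector
    rho = 0.0
    for _ in range(iterations):
        # w = matrix @ v, written out in plain Python
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        rho = max(abs(x) for x in w)  # max-norm of the new iterate
        if rho == 0.0:
            return 0.0
        v = [x / rho for x in w]  # renormalize so the max entry is 1
    return rho


# Hypothetical 3-agent contagion matrix: entry [i][j] is the assumed
# strength with which agent j's bias leaks into agent i.
gamma = [
    [0.0, 0.2, 0.3],
    [0.3, 0.0, 0.2],
    [0.2, 0.3, 0.0],
]
print(spectral_radius(gamma))
```

Under the additional assumption of a linear update x_{t+1} = Γ x_t, a spectral radius below 1 means a bias perturbation decays over rounds, while a value above 1 means it can grow. The summary does not state where the paper draws its three regime boundaries.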
Optimal Order of Multi-Agent and General Many-Body Systems
The paper develops a framework for analyzing multi-agent systems with feedback loops, focusing on agent-level variables: power, measuring influence on collective outcomes, and response functions, determining reactions to observations. It derives macroscopic properties such as total power, useful power, entropy, order, fragility, and mobility from these variables. A system-level utility function, parameterized by a risk-appetite coefficient, is introduced to balance productivity, stability, and adaptability, identifying an optimal degree of order. The analysis indicates that stronger synchronization can enhance collective output but may increase systemic fragility and reduce mobility. The framework suggests that tuning agent power distributions and response functions offers a route to both understanding and improving collective behavior.
multi-agent systems · feedback loops · response functions · systemic fragility · optimal order
UltraQuant: 4-bit KV Caching for Context-Heavy Agents
The paper introduces UltraQuant, a 4-bit KV-cache compression method for context-heavy AI agents, addressing challenges in cache residency and serving throughput. It employs TurboQuant-style rotation, codebook quantization, and optimizations like asymmetric K/V treatment and Walsh-Hadamard rotation. The method demonstrates a 3.47x reduction in P50 time-to-first-token for cache-pressured scenarios and a 1.63x throughput improvement over FP8 baselines, validated on AMD GPUs with specialized kernels and CDNA4 support.
quantization · kv-cache · throughput · amd · cdna4
Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems
This work introduces detect-and-misdirect defense strategies against model-guided automated attacks on agentic AI systems, addressing limitations of conventional detect-and-block approaches. The authors propose a probabilistic model analyzing interactions between target systems, defense mechanisms, and attacker-automated judges, showing that predictable refusals in detect-and-block strategies enable attackers to achieve near-certain success rates. They evaluate Contextual Misdirection via Progressive Engagement (CMPE), a lightweight conversational misdirection method replacing predictable refusal text with strategically misleading responses. CMPE reduces attack success rate upper bounds by up to two orders of magnitude on jailbreak benchmarks and nearly eliminates verified success in PAIR and GPTFuzz attack runs.
agentic ai · prompt-injection · jailbreak attacks · contextual misdirection · automated judge
Context-Aware Hierarchical Bayesian Modeling of IVF Laboratory Environmental Conditions
The study demonstrates that context-aware temporal features of IVF laboratory environments significantly improve pregnancy rate predictions over raw sensor averages. By engineering 55 features capturing thermal stability, humidity adherence, and stress dynamics, the model achieves 1.27% cross-validated error. A hierarchical Bayesian Beta regression with partial pooling leverages data from Asian and Northern European clinics, reducing error by 64% for ages 35-39 (R²=0.86) and showing transferable environmental signals.
ivf · hierarchical bayesian · beta regression · thermal stability · partial pooling
Repurposing a Speech Classifier for Guided Diffusion-Based Speech Generation
The work demonstrates that pretrained speech classifiers can be repurposed for conditional speech generation via diffusion models, eliminating the need for separate classifier and diffusion models. The method attaches a lightweight subnetwork to a frozen noise-conditioned classifier in log-Mel space, reusing intermediate representations while training only the subnetwork with Denoising Score Matching. This approach achieves high-quality speech synthesis within a single-backbone model, reducing memory and computational costs compared to traditional classifier-guided diffusion.
classifier guidance · diffusion models · denoising score matching · speech synthesis · log-mel space
Interpretable Sperm Morphology Classification via Attention-Guided Deep Learning
An attention-guided deep learning framework is proposed for interpretable sperm morphology classification, addressing clinical adoption barriers. The method integrates a pretrained EfficientNet-B0 with a Convolutional Block Attention Module (CBAM) to focus on key sperm head regions, enhancing both accuracy and interpretability. Evaluated on SMIDS and HuSHem datasets, the model achieves 90.2% and 93.9% accuracy (macro F1 scores of 0.913 and 0.948), outperforming SimpleCNN and standard EfficientNet-B0. Grad-CAM++ visualizations demonstrate feature influence on model decisions, validating its clinical utility for automated sperm analysis in fertility clinics.
attention-guided · efficientnet-b0 · convolutional block attention module · sperm morphology · grad-cam++
Multi-View Decompilation for LLM-Based Malware Classification
The study demonstrates that using multiple decompiler views (Ghidra and RetDec) improves LLM-based malware classification by enhancing recall on malicious samples. A benchmark of benign and malicious programs was compiled and decompiled with both tools, creating matched pseudo-C views. Experiments across various LLM families showed that multi-decompiler prompting increases malicious-class F1 scores, with error analyses revealing complementary evidence from different decompilers. This approach offers a training-free method to enhance malware triage.
decompilation · llm · malware classification · ghidra · retdec
LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems
The paper introduces NRT-Bench, a benchmark for evaluating LLM agent robustness in safety-critical systems through multi-turn red-teaming. Using a simulated nuclear power plant control room with five LLM-backed operator roles and six critical safety functions (CSFs), the study tests four frontier models under adaptive adversarial attacks. Results show 8.7%-12.1% of attack sessions compromise CSFs, with vulnerabilities being model-specific and non-overlapping. Defensive measures exhibit model-dependent efficacy, sometimes increasing vulnerability. The study releases simulation tools for reproducible safety testing.
llm · red-teaming · critical safety functions · adversarial robustness · multi-turn attacks
DataMagic: Transforming Tabular Data into Data Insight Video
DataMagic introduces an end-to-end system for transforming tabular data into narrative data-insight videos, addressing limitations in existing tools by ensuring data fidelity and narrative coherence. The system employs DVSpec, a declarative specification that binds visual elements to data fields, and a Generate-then-Orchestrate multi-agent architecture to optimize scene generation and narrative flow. Evaluation on 109 real-world samples demonstrates its effectiveness, with additional support for interactive data exploration through structured provenance-based Q&A.
data videos · declarative specification · multi-agent architecture · narrative coherence · data provenance
Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe
The study identifies Shrinkage Bias as a fundamental limitation in E2M1-based FP4 training for LLMs, caused by geometric asymmetry in representable bins. Proposing UFP4 as a solution, the method employs uniform grids (E1M2/INT4) and applies Random Hadamard Transform (RHT) to all three training GEMMs while limiting stochastic rounding to dY. Experiments on Dense 1.5B, MoE 7.9B, and MoE 124B models demonstrate UFP4's superior performance, achieving lower BF16-relative loss degradation compared to E2M1 baselines, validated through scaling-law analysis and ablations.
shrinkage bias · fp4 training · random hadamard transform · uniform grids · quantization quality
CRAX: Fast Safe Reinforcement Learning Benchmarking
The paper introduces CRAX, a fast safe RL benchmarking framework built on MuJoCo XLA (MJX) with 3D physics, achieving ~100x speedups over CPU-based alternatives through JAX vectorization and hardware acceleration. The benchmark includes six environment suites and three agent tasks at varying difficulty levels. Evaluation of six safe RL methods reveals no dominant approach, highlighting performance-safety trade-offs, while demonstrating benefits of curriculum learning and safety transfer across difficulty levels.
safe reinforcement learning · hardware acceleration · curriculum learning · 3d physics simulation · benchmarking
AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning
AutoPass introduces a multi-agent LLM framework for compiler performance tuning, leveraging compiler-internal states and runtime feedback to guide optimization decisions. Unlike black-box approaches, it enables LLMs to query intermediate representations and orchestrate compiler options iteratively. The method operates without training or fine-tuning, demonstrating applicability across benchmarks and platforms. Evaluated on LLVM for x86-64 and ARM64, AutoPass achieves geometric-mean speedups of 1.043x and 1.117x over LLVM -O3, outperforming expert heuristics and classical autotuning.
llm · compiler · autotuning · intermediate representation · runtime feedback
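For readers unfamiliar with how a geometric-mean speedup aggregates per-benchmark results, here is a minimal sketch. The timings and the function name `geomean_speedup` are made up for illustration and do not come from the paper.

```python
from statistics import geometric_mean


def geomean_speedup(baseline_times, tuned_times):
    """Geometric mean of per-benchmark speedups (baseline time / tuned time)."""
    if len(baseline_times) != len(tuned_times):
        raise ValueError("need one tuned time per baseline time")
    ratios = [b / t for b, t in zip(baseline_times, tuned_times)]
    return geometric_mean(ratios)


# Hypothetical runtimes in seconds: one benchmark gets 2x faster,
# the other gets 2x slower.
baseline = [2.0, 8.0]
tuned = [1.0, 16.0]
print(geomean_speedup(baseline, tuned))
```

In this made-up case the per-benchmark ratios are 2.0 and 0.5, so the geometric mean is approximately 1.0 (no net gain), whereas an arithmetic mean of the same ratios would report 1.25. That symmetry between speedups and slowdowns is the usual reason for summarizing ratios with a geometric mean.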
Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining
The paper investigates automated skill library generation for computer-using agents through interaction trajectory mining. A three-stage pipeline segments GUI trajectories, clusters segments into candidate skills, and trains skill-aware policies. Results show 5/8 clusters achieve ≥0.95 purity against InteraSkill Workflows labels, but GRPO only improves skill-step accuracy from 18.5% to 20.5% on IW, with minimal impact on BrowseComp+. The study demonstrates trajectory mining can reveal inspectable skill structures, though current methods lack cross-domain policy improvement.
skill libraries · interaction trajectory mining · gui trajectories · skill-step accuracy · cross-domain policy
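The purity figure can be illustrated with a short sketch, assuming the common definition of cluster purity as the share of a cluster's members that carry its majority label. Whether the paper uses exactly this definition is an assumption, and the labels and function name below are hypothetical.

```python
from collections import Counter


def cluster_purity(member_labels):
    """Fraction of a cluster's members that share its most common label."""
    if not member_labels:
        raise ValueError("cluster must not be empty")
    majority_count = Counter(member_labels).most_common(1)[0][1]
    return majority_count / len(member_labels)


# Hypothetical cluster of 20 trajectory segments: 19 labelled "login"
# and one stray "search" segment.
cluster = ["login"] * 19 + ["search"]
print(cluster_purity(cluster))
```

With these made-up labels, 19 of 20 segments agree with the majority label, which sits exactly at the 0.95 threshold mentioned in the summary.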
Robust $Q$-learning for mean-field control under Wasserstein uncertainty in common noise
A robust $Q$-learning algorithm is proposed for discrete-time mean-field control problems under Wasserstein uncertainty in common noise laws. The method integrates a quantization-and-projection scheme with a Wasserstein dual reformulation on the common-noise space, ensuring convergence with finite-time iteration bounds for both synchronous and asynchronous learning schemes. Numerical experiments on systemic risk and epidemic models demonstrate the asynchronous implementation's performance relative to idealized Bellman iteration, highlighting robustness-performance tradeoffs under common-noise misspecification and validating the algorithm's convergence behavior.
q-learning · mean-field control · wasserstein uncertainty · common noise · asynchronous learning
SoftSkill: Behavioral Compression for Contextual Adaptation
SoftSkill introduces a method for compressing natural-language agent skills into compact continuous context objects, refined by a trainable soft delta while keeping the base model frozen. The approach uses a frozen-backbone architecture to tune soft skills via next-token prediction, deploying them as latent behavioral priors during inference. On Qwen3.5-4B, a 32-token SoftSkill prefix improves accuracy by 8.3 points on SearchQA, 42.1 on LiveMath, and 1.3 on DocVQA, outperforming SkillOpt by 5.2 and 12.5 points on SearchQA and LiveMath while reducing token overhead. Results suggest latent controls can outperform Markdown-based skill reinterpretation.
behavioral compression · continuous context · latent priors · frozen-backbone · next-token prediction
Leveraging systems' non-linearity to tackle the scarcity of data in the design of Intelligent Fault Diagnosis Systems
A novel Deep Transfer Learning (DTL) approach is proposed for designing vibration-based Intelligent Fault Diagnosis Systems (IFDS) under severe data scarcity. The method leverages intrinsic non-linearities in real-world systems through a periodic multi-excitation level procedure, generating images analyzable by pre-trained Convolutional Neural Networks (CNNs). A new data visualization technique and its augmentation method are introduced to address data limitations. Experimental validation on a railway pantograph structure demonstrates the effectiveness of the proposed approach in fault diagnosis.
deep transfer learning · intelligent fault diagnosis · convolutional neural networks · data visualization · non-linearity
Boundary Embedding Shaping with Adaptive Contrastive Learning for Graph Structural Disentanglement
The paper introduces Boundary Embedding Shaping (BES), a plug-in module for GNNs that addresses graph structural entanglement by adaptively suppressing spurious correlations near decision boundaries. Using contrastive learning, BES selectively reduces noise in boundary-region embeddings without significant parameter overhead. Experiments show BES improves GCN performance by 3.3% on average (up to 5.0% on WikiCS) for node classification and enhances link prediction accuracy, outperforming existing methods in boundary discrimination.
graph neural networks · structural disentanglement · contrastive learning · node classification · link prediction
ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval
The paper introduces ELVA, a rule-based RL framework addressing grain blindness in Universal Multimodal Retrieval (UMR) by leveraging ranking-driven Multimodal Large Language Models (MLLMs). ELVA extends Reinforcement Learning with Verifiable Rewards (RLVR) to retrieval tasks, using rule-based rewards to optimize negative sample ranking while widening positive-negative similarity gaps, without requiring explicit ranking labels. The method demonstrates state-of-the-art performance on standard benchmarks and achieves a 13.1% improvement on the newly introduced MRBench for multi-grain query evaluation.
universal multimodal retrieval · contrastive learning · grain blindness · reinforcement learning · multimodal large language models
Lagrange: An Open-Vocabulary, Energy-Based Sparse Framework for Generalized End-to-End Driving
The paper introduces Lagrange, an open-vocabulary energy-based framework for end-to-end autonomous driving that addresses the trade-off between representational efficiency and generalization. The method leverages Vision-Language Models (VLMs) to encode class-agnostic object proposals into semantic visual tokens, processed via an intent-driven masked cross-attention module. These tokens are decoded into a continuous energy field, enabling Lagrangian action minimization for kinematically valid trajectory planning. Evaluations on nuScenes and CODA benchmarks demonstrate robust performance in both standard and long-tail scenarios.
masked latent fields · vision-language models · lagrangian action minimization · open-vocabulary reasoning · kinematic feasibility
Confidence-Aware Automated Assessment of Student-Drawn Scientific Models
The paper introduces a confidence-aware automated scoring system for student-generated scientific drawings using a Vision Transformer (ViT) with parameter-efficient adaptation. The method leverages test-time predictive distributions to derive response-level confidence, enabling selective automation by scoring high-confidence responses automatically while deferring uncertain cases for human review. Experiments on six NGSS-aligned middle school assessment items demonstrate improved scoring reliability and a practical trade-off between automated coverage and scoring risk, validating the approach for trustworthy educational assessment.
vision transformer · parameter-efficient adaptation · confidence-aware scoring · ngss-aligned · predictive distributions
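The selective-automation idea described above can be sketched as a simple confidence threshold. This sketch assumes confidence is the maximum class probability of a single predictive distribution; the paper derives confidence from test-time predictive distributions in a way the summary does not specify, so the confidence measure, the threshold value, and the function name are all illustrative assumptions.

```python
def route_responses(prob_rows, threshold=0.8):
    """Split responses into auto-scored and human-review groups.

    prob_rows: one list of class probabilities per student response.
    Returns (auto, deferred), where auto holds (index, predicted_score)
    pairs and deferred holds indices sent to a human scorer.
    """
    auto, deferred = [], []
    for idx, probs in enumerate(prob_rows):
        confidence = max(probs)           # assumed confidence measure
        predicted = probs.index(confidence)
        if confidence >= threshold:
            auto.append((idx, predicted))
        else:
            deferred.append(idx)
    return auto, deferred


# Hypothetical predictive distributions over three score levels.
rows = [
    [0.90, 0.05, 0.05],  # confident, scored automatically
    [0.40, 0.35, 0.25],  # uncertain, deferred to a human
    [0.10, 0.85, 0.05],  # confident, scored automatically
]
auto, deferred = route_responses(rows)
print(auto, deferred)
```

Raising `threshold` defers more responses to human review and lowers the risk of automated mis-scoring; lowering it increases automated coverage. This is the coverage-versus-risk trade-off the summary refers to.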
Editorial Alignment: A Participatory Approach to Engaging Editorial Expertise in LLM-mediated Knowledge Dissemination
The paper proposes editorial alignment as a participatory design practice for adapting LLM interfaces to institutional editorial standards. Through design workshops with a Nordic public knowledge institution, the authors develop an LLM-enabled encyclopedia interface that translates editorial values into technical alignment objectives. The approach positions editorial standards as design artifacts, enabling editor participation in LLM-mediated knowledge dissemination while maintaining institutional authority over content curation.
editorial alignment · participatory ai · llm-mediated knowledge dissemination · design artifacts · alignment objectives
The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse
The Meaning Intelligence Framework (MIF) introduces a nine-dimension annotation schema for Nigerian public discourse, addressing context failure by separating surface sentiment from true communicative intent. The framework evaluates register, surface sentiment, true intent, irony, coded subtext, risk tier, annotator confidence, speaker emotion, and recommended communications action. Using a 30-item calibration dataset across Standard English, Nigerian English, Nigerian Pidgin, and code-mixed registers, the study evaluates Gemini 2.5 Flash under zero-shot and schema-informed prompting. Results show a Register Gap: zero-shot register classification accuracy improves from 33.3% to 73.3% with MIF schema, and the composite Meaning Intelligence Score increases by 5.4 points, with notable gains in coded-subtext detection and strategic action recommendation.
meaning intelligence framework · register gap · context failure · schema-informed prompting · coded-subtext detection
Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think
This work demonstrates that Vision-Language-Action (VLA) models exhibit significant layer-wise redundancy despite training on diverse physical trajectories. The authors propose a training-free structural compression pipeline using Centered Kernel Alignment to identify and remove redundant layers, achieving 50% depth reduction in both VLM backbone and policy head. The compressed models show 40-50% faster fine-tuning and 30% faster inference while maintaining performance, validated across 3 simulation benchmarks (LIBERO, RoboCasa, SimplerEnv) and 10 real-world manipulation tasks with 4 robotic embodiments.
vision-language-action · layer redundancy · centered kernel alignment · structural compression · robotic manipulation
Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference
The paper proposes MACR, a novel framework for explicit knowledge conflict resolution in LLMs that moves beyond binary source privileging. The method combines adaptive knowledge assessment via modified semantic entropy with a multi-agent reasoning system featuring three specialized agents for rule induction, conflict analysis, and inconsistency resolution. Experiments show MACR outperforms state-of-the-art baselines across benchmarks while providing interpretable conflict resolutions.
knowledge conflict resolution · multi-agent reasoning · semantic entropy · parametric knowledge · in-context learning
SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs
SPOT-E introduces a test-time entropy shaping method for frozen vision-language models (VLMs) to improve performance on evidence-intensive tasks. The approach leverages answer-span prediction entropy as a feedback signal, resolving ambiguity in entropy minimization through low-entropy anchors and an entropy-shaping objective. SPOT-E generates question-conditioned spotlights optimized per instance using Group Relative Policy Optimization (GRPO), enabling lightweight tuning without model retraining. Evaluations across multiple benchmarks and VLM families demonstrate consistent performance gains and enhanced robustness under visual corruptions. The method is implemented as a plug-and-play module with publicly available code.
vision-language modelsentropy shapinggroup relative policy optimizationtest-time adaptationevidence-intensive tasks
A Multi-Agent system for Multi-Objective constrained optimization
MAMO (Multi-Agent system for Multi-Objective constrained optimization) introduces a multi-agent reinforcement learning approach to autonomously balance cost minimization and constraint satisfaction in dynamic environments. The method decouples task execution from objective design by formulating reward weight selection as a learning problem, addressing the manual tuning challenge inherent in Lagrangian-inspired formulations. This enables more robust optimization in non-stationary settings where the relative importance of objectives and constraints may vary.
multi-agent reinforcement learningconstrained optimizationlagrangian formulationdynamic environmentsreward weight selection
ScholarQuest: A Taxonomy-Guided Benchmark for Agentic Academic Paper Search in Open Literature Environments
ScholarQuest introduces a taxonomy-guided benchmark for evaluating LLM-based agentic academic paper search in open literature environments. The benchmark comprises 1,000+ computer science topics and four research intents (method-oriented, setting-anchored, comparison-based, scope-controlled), with scalable answer construction and a shared retrieval backend (ScholarBase). Results show agentic methods outperform single-shot retrieval (0.314 Recall@100, 0.355 Recall@All), but significant gaps remain. The benchmark provides multi-dimensional evaluation via search efficiency, intent-level robustness, and failure case analysis.
agentic searchrecall@ktaxonomy-guidedretrieval backendresearch intents
Thermodynamic Measure of Intelligence
The paper proposes a thermodynamic measure of intelligence defined as the lawful amplification of rare but valid futures, where systems increase probabilities of unlikely yet admissible outcomes. Methodologically, it formalizes recursive self-simulation as a necessary architecture, where systems model themselves within their environment. Key results show that high rare-valid lift requires faithful internal simulation of rare-valid futures, and near-optimal lift is achievable given high simulation fidelity and effective policies. The framework enables universal intelligence measurement across passive matter, feedback controllers, LLMs, and humans.
thermodynamic intelligencerare-valid liftrecursive self-simulationlawful amplificationmaxwell-demon
QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation
The paper introduces QMFOL, an automated framework for generating quantifiable monadic first-order logic reasoning tasks to evaluate large language models (LLMs). The method constructs formal logical structures with controlled complexity (depth, width, label types, distractors), translates them into natural language via LLMs, and verifies consistency using an external prover. QMFOLBench, a benchmark with 2880 instances across 960 configurations, is used to evaluate six large reasoning models and two LLMs; results show performance degradation with increasing logical complexity, higher accuracy on True-labeled instances, and sensitivity to semantic content.
monadic first-order logicdeductive reasoningbenchmark generationlogical complexityround-trip verification
Learner-based Concept Drift Detection: Analysis and Evaluation
The study provides a theoretical analysis and empirical evaluation of concept drift detection methods in non-stationary streaming environments. It examines drift characteristics and categorizes detection algorithms, testing them on synthetic and real-world datasets with abrupt and gradual drift scenarios. Results highlight the performance variations across detectors, offering insights into their applicability for maintaining predictive accuracy in evolving data distributions.
concept driftdrift detectionnon-stationary datastreaming environmentspredictive performance
Augmenting Game AI with Deep Reinforcement Learning
The paper proposes a reinforcement learning framework for enhancing game AI to improve character believability and player immersion. It addresses current limitations in behavioral complexity by leveraging machine learning models trained through game interactions or player data. The work presents case studies of RL-augmented game AI implementations, analyzes deployment challenges in modern games, and identifies key research bottlenecks for broader industry adoption.
reinforcement learninggame aibehavioral complexityplayer immersionmachine learning deployment
FlowMaps: Modeling Long-Term Multimodal Object Dynamics with Flow Matching
FlowMaps introduces a latent flow matching model for predicting multimodal distributions over future 3D object locations in dynamic household environments, conditioned on past human interactions. The method learns spatio-temporal patterns in object dynamics through continuous flow matching, enabling generalization to unseen environments with similar routines. Evaluated on 600+ dynamic Object Navigation episodes in simulated and real-world settings, FlowMaps outperforms state-of-the-art approaches by leveraging learned object dependencies and temporal evolution.
flow matchingmultimodal distributionsobject dynamics3d scene understandingrobotic navigation
Beyond Accuracy: Measuring Logical Compliance of Predictive Models
The paper introduces the Rule Violation Score (RVS), a novel metric evaluating logical compliance of predictive models beyond traditional accuracy metrics. RVS quantifies adherence to hard and soft logical constraints, computable via automatically generated SQL queries for Horn rules. Applied to knowledge graph link prediction and relational regression benchmarks, RVS reveals significant differences in logical compliance among models with comparable accuracy, highlighting limitations of standard evaluation metrics.
rule violation scorelogical compliancehorn rulesknowledge graphrelational regression
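To make the idea of checking a Horn rule with a SQL query concrete, here is a minimal sketch using Python's built-in sqlite3 module. The rule (parent(x, y) AND parent(y, z) implies grandparent(x, z)), the table layout, and the score defined as the fraction of violated body groundings are all assumptions for illustration; the summary does not give the paper's exact RVS definition.

```python
import sqlite3

def violation_rate(parent, grandparent):
    """Fraction of groundings of parent(x,y), parent(y,z) lacking grandparent(x,z).

    Hypothetical scoring rule: violated body groundings / all body groundings.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE parent (a TEXT, b TEXT)")
    con.execute("CREATE TABLE grandparent (a TEXT, b TEXT)")
    con.executemany("INSERT INTO parent VALUES (?, ?)", parent)
    con.executemany("INSERT INTO grandparent VALUES (?, ?)", grandparent)
    total, violated = con.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(NOT EXISTS (
                   SELECT 1 FROM grandparent g
                   WHERE g.a = p1.a AND g.b = p2.b)), 0)
        FROM parent p1
        JOIN parent p2 ON p1.b = p2.a
        """
    ).fetchone()
    con.close()
    return violated / total if total else 0.0

# Predicted facts: the model missed grandparent(ann, dan).
parent = [("ann", "bob"), ("bob", "cat"), ("bob", "dan")]
grandparent = [("ann", "cat")]
print(violation_rate(parent, grandparent))
```

Two models with the same link-prediction accuracy can differ on a count like this, which is the kind of gap the metric is meant to expose.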
Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact
The study demonstrates that apparent psychological profiles of instruction-tuned LLMs are primarily measurement artifacts rather than intrinsic model properties. Using a psychometric framework, researchers administered personality and risk-preference instruments to 56 LLMs and human samples, revealing that 81-90% of between-model variation stems from directional response bias rather than targeted traits. Key findings include the persistence of bias regardless of model capability, the dependence of instrument reliability on a newly defined property called response orthogonality, and the malleability of profiles through item selection. The work argues against uncritical use of human psychological instruments for LLM assessment.
response biaspsychometric frameworkinstruction-tuned llmsresponse orthogonalitymeasurement artifact
HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training
HilDA introduces a self-supervised pretraining framework for LiDAR backbones that enhances cross-modal knowledge distillation from Vision Foundation Models (VFMs) to LiDAR. The method combines hierarchical distillation (multi-layer distillation for semantic alignment and global context distillation for scene-level semantics) with a temporal occupancy diffusion objective for spatiotemporal consistency. Evaluated on 3D object detection, scene flow, and semantic occupancy prediction, HilDA achieves state-of-the-art performance on cross-modal distillation benchmarks, outperforming prior distillation approaches.
hierarchical distillationdiffusion objectivelidar pretrainingcross-modal distillationspatiotemporal consistency
Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs
The paper introduces RS-Neg, the first benchmark for evaluating negation comprehension in Remote Sensing Multimodal Large Language Models (MLLMs), addressing a critical gap in real-world applications like disaster response. It proposes an automated data generation pipeline using LLMs to synthesize negation queries and a dynamic visual focus module for verification. Results show advanced RS MLLMs struggle with negation, exhibiting hallucinations and performance drops. The authors present NeFo, a test-time learning method incorporating negation logic, which improves performance using 5% unlabeled test samples and generalizes to unseen tasks.
negation comprehensionmultimodal large language modelsremote sensingtest-time learningbenchmark evaluation
MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization
MedRLM introduces a recursive multimodal framework for clinical decision support, addressing limitations of single-step prompting in medical LLMs by treating patient cases as inspectable environments. The system employs specialized agents for text, EHR, imaging, sensors, and guidelines, connected via a Clinical Evidence Graph Memory, with recursive triggering for abnormal patterns and uncertainty-gated refinement. Evaluation uses public and credentialed datasets across EHR, radiology, ECG, and ICU time series, aiming to transition from static QA to workflow-aware support.
clinical decision supportmultimodal reasoningelectronic health recordssensor-guided screeningevidence-grounded generation
Implicit Semantic-Aware Communication Based on Hypergraph Reasoning
The paper introduces HISR, a hypergraph-based implicit semantic reasoning framework for semantic-aware communication systems. Addressing limitations of pairwise graph representations, HISR models higher-order multi-entity relationships through dedicated semantic subspaces, mitigating over-smoothing in traditional graph embeddings and enhancing robustness to information loss. Evaluations demonstrate a 36.6% accuracy improvement in implicit semantic interpretation over state-of-the-art benchmarks.
hypergraph reasoningsemantic-aware communicationmulti-entity relationshipssemantic subspacesover-smoothing
Modularity-Free Conflict-Averse Training for Generalized PINNs
The paper introduces Modular-Sparsity Synchronization (ModSync), a framework addressing capacity-induced failures in Physics-Informed Neural Networks (PINNs). It identifies that overparameterized networks develop functional modularity, partitioning into task-exclusive modules that hinder Pareto-stationary convergence. ModSync integrates structural optimization into conflict-averse training by penalizing task-exclusive connections while preserving interaction-promoting pathways. Experiments across diverse PDE benchmarks show ModSync prevents capacity-driven failures, maintains robust cross-objective coupling, and achieves state-of-the-art accuracy.
physics-informed neural networksconflict-averse optimizationfunctional modularitypareto-stationary pointsstructural optimization
From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models
This work reveals how large language models (LLMs) internally represent essay quality through systematic analysis of hidden representations across eight LLMs and three datasets (ASAP++, CSEE, ENEM). Using linear probing, cross-prompt generalization, and neuron-level analyses, the study demonstrates that essay quality information emerges progressively across layers, is linearly decodable, and partially transfers across prompts. Key findings include identification of 'essay scoring neurons' whose activations correlate with scores and exhibit length-dependent layer distribution, with nonlinear probes providing only marginal improvements over linear methods.
automated essay scoringlinear probinghidden representationsneuron-level analysiscross-prompt generalization
Hybrid ANN-SNN Pipeline with Local Plasticity
The paper introduces a hybrid ANN-SNN pipeline combining pretrained ANN embeddings with spiking neural networks. The method employs an EfficientNet encoder converted to spike trains via rate-coding, coupled with a CoLaNET spiking classifier trained using local plasticity rules instead of backpropagation. This biologically inspired approach achieves 99.09% accuracy on a 64-class ImageNet variant, matching conventional deep networks' performance while maintaining computational efficiency.
hybrid ann-snnrate-codinglocal plasticityefficientnetcolanet
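The rate-coding step mentioned above can be sketched as follows: each embedding value is mapped to a firing probability, and a spike is sampled independently at every time step. The min-max normalization, the Bernoulli sampling, and the function name are assumptions for illustration, not details confirmed by the summary.

```python
import random

def rate_code(values, steps, seed=0):
    """Turn a real-valued embedding into binary spike trains.

    Returns a list of `steps` time slices, each holding one 0/1 spike per
    embedding dimension. Larger values fire more often on average.
    """
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant embeddings
    probs = [(v - lo) / span for v in values]
    return [[1 if rng.random() < p else 0 for p in probs] for _ in range(steps)]

# Hypothetical 3-dimensional embedding encoded over 200 time steps.
trains = rate_code([0.0, 0.5, 1.0], steps=200)
firing_rates = [sum(col) / len(trains) for col in zip(*trains)]
print(firing_rates)
```

The observed firing rate per dimension approximates the normalized embedding value, which is what a downstream spiking classifier would consume.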
BIM-Edit: Benchmarking Large Language Models for IFC-Based Building Information Modeling
The paper introduces BIM-Edit, a benchmark evaluating LLMs on natural-language editing of Building Information Models (BIM) in IFC format. It contains 324 editing tasks across 11 real-world and 36 synthetic building models, assessing geometric accuracy, semantic validity, and topological consistency. Results show the best-performing LLM achieves only a 49.5% average score across metrics, and no model solves more than 3.4% of tasks, revealing significant gaps in LLM capabilities for structured engineering design workflows.
bim-editindustry foundation classesllm evaluationbuilding information modelingtopological consistency
RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning
The paper introduces RACL (Reasoning-Agent Control Layer), a method for metaheuristic optimization where a reasoning agent controls an existing optimizer's search behavior without modifying constraints. The agent observes operational memory, reasons over past behavior, formulates hypotheses, tests interventions, and consolidates policies while providing explanations. Evaluated on vehicle routing, RACL improved or tied existing policies in 21/21 cases versus Operational Memory Policy and 18/21 versus Stagnation-Triggered Policy (STP), with an average cost change of -0.641%. In Sevilla-9/10 tests, it changed costs by -8.337% versus Fixed and -1.605% versus STP, with negligible overhead. Codex served as the initial reasoning agent.
metaheuristic optimizationreasoning agentoperational memoryalgorithmic controlvehicle routing
Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring
The paper introduces an adaptive LLM-based tutoring system that employs subject-aware prompting to improve high-school student engagement. The system extracts 14 pedagogical features from raw transcripts and trains a prompt routing model in a simulation environment, later deploying it for online adaptation. Simulation benchmarks show the router outperforms static baselines (0.694 vs. 0.647 and 0.64, p<0.001). A/B testing (N=656 conversations) demonstrates sim-to-real transfer, with the model switching between analytical and scaffolding strategies, reducing interactions by ~3 turns (p=0.007). While a greedy router matches baseline exercise conversion rates (19.1% vs. 19.6%), a stochastic router achieves higher rates (28.1%).
llm-based tutoringsubject-aware promptingprompt routingpedagogical featuressim-to-real transfer
Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation
Frequency-Aware Flow Matching (FAFM) improves robotic action generation by producing continuous, temporally consistent actions from heterogeneous frequency demonstrations. The method transforms discrete action sequences into frequency-domain coefficients via discrete cosine transform (DCT), performs flow matching over these coefficients, and reconstructs continuous actions using cosine basis expansion. A Sobolev-type regularization on the first-order temporal derivative ensures smoothness. Evaluated on synthetic benchmarks, LapGym, LIBERO, and real-world Franka robot tasks, FAFM enhances success rates, motion smoothness, convergence speed, and robustness to mechanical bias without additional parameters.
flow matchingdiscrete cosine transformsobolev regularizationrobotic manipulationaction generation
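The frequency-domain round trip described above can be sketched with a plain DCT-II and its cosine-basis inverse evaluated at real-valued times. This covers only the transform and the continuous reconstruction; the flow matching over coefficients and the Sobolev-type regularizer are omitted, and the unnormalized DCT convention and function names are assumptions for illustration.

```python
import math

def dct_coeffs(actions):
    """Unnormalized DCT-II of a 1-D action sequence."""
    n = len(actions)
    return [
        sum(a * math.cos(math.pi / n * (i + 0.5) * k) for i, a in enumerate(actions))
        for k in range(n)
    ]

def reconstruct(coeffs, t):
    """Evaluate the cosine-basis expansion at a real-valued time index t.

    At integer t in [0, n-1] this is the exact inverse of dct_coeffs;
    at fractional t it interpolates smoothly between samples.
    """
    n = len(coeffs)
    value = coeffs[0] / n
    value += (2.0 / n) * sum(
        c * math.cos(math.pi / n * (t + 0.5) * k)
        for k, c in enumerate(coeffs)
        if k > 0
    )
    return value

# Hypothetical single-joint action trajectory sampled at 5 steps.
actions = [0.0, 0.4, 0.9, 0.7, 0.2]
coeffs = dct_coeffs(actions)
print([round(reconstruct(coeffs, i), 6) for i in range(len(actions))])
print(reconstruct(coeffs, 1.5))  # a value between the original samples
```

Because the reconstruction is defined for any real t, demonstrations recorded at different control frequencies can in principle share one coefficient space.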
ScaffoldAgent: Utility-Guided Dynamic Outline Optimization for Open-Ended Deep Research
ScaffoldAgent introduces a utility-guided dynamic outline optimization framework for open-ended deep research (OEDR), addressing scaffold drift and delayed feedback in long-form report generation. The method models outline evolution as a structured decision process with Expansion, Contraction, and Revision operations, guided by a utility signal derived from retrieval gain, structural coherence, and trial-generation quality. Experiments on DeepResearch Bench and DeepResearch Gym demonstrate ScaffoldAgent's consistent improvements in report generation and factual grounding compared to existing deep research agents.
scaffoldagentoutline optimizationopen-ended deep researchutility-guided feedbackstructured decision process
Dual-Agent Framework for Cross-Model Verified Translation of Natural-Language Protocols into Robotic Laboratory Platform
The study introduces a dual-agent framework for translating natural-language biological protocols into executable robotic commands, addressing the semantic gap in microplate-based automation. A Parser Agent formalizes protocols into structured representations, while a rule-based mapping engine generates device-level commands; a Validation Agent verifies completeness and triggers self-correction. Evaluated on ELISA protocols with 7 Parsers and 3 Validators, the framework demonstrates improved translation accuracy and pass rates under cross-model verification. End-to-end validation via Bradford assay confirms autonomous execution. The approach combines deterministic rule-based mapping with LLM-based validation for robust protocol translation.
microplate automationprotocol translationdual-agent frameworkllm validationself-correction loop
Sensorimotor World Models: Perception for Action via Inverse Dynamics
The authors propose Sensorimotor World Models (SMWM), a latent world model trained end-to-end with inverse dynamics regularization to address representation collapse and induce action-aligned representations. SMWM forces latent states to preserve information about actions underlying transitions, focusing on controllable degrees of freedom while discarding uncontrollable distractors. This approach enables stable training from offline, reward-free trajectories without frozen encoders, exponential moving averages, or complex latent regularizers. Empirically, SMWM learns compact, interpretable latent spaces and achieves competitive planning performance across simple 2D and 3D control tasks.
latent world modelsinverse dynamics regularizationrepresentation collapseaction-aligned representationsoffline trajectories
Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow
The paper introduces a hybrid two-stage diffusion transformer architecture for instruction-guided audio editing, leveraging rectified flow matching. The method combines joint attention over audio and text tokens for coarse semantic alignment at low resolution, followed by alternating joint-attention and cross-attention blocks for high-resolution refinement. Experiments demonstrate significant performance improvements on complex editing tasks involving overlapping audio events and intricate instructions, while maintaining efficiency with a compact model.
diffusion transformerrectified flowinstruction-guided editingsemantic alignmentcross-attention
MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer
MakeupMirror introduces a diffusion-based approach to makeup transfer that significantly improves facial attribute preservation over existing methods like Stable-Makeup. The method integrates facial geometry conditioning via ControlNets, enables region-specific makeup transfer control, modulates transfer based on skin tone to prevent unwanted alterations, and employs a Levenberg-Marquardt Langevin sampler for faster inference. Evaluated on CPM-Real, Makeup Wild, and MakeupSelfies datasets, MakeupMirror achieves a 60% improvement in facial recognition similarity, reduces skin tone difference by 50%, and attains a 94% expert acceptance rate for identity preservation, with a latency of 0.7s.
diffusion modelsmakeup transfercontrolnetsfacial geometrylevenberg-marquardt
IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources
IHUBERT introduces a Persian RoBERTa-base model (125M parameters) pretrained on a 45GB corpus (7-8B tokens) with novel semantic deduplication and domain balancing. The method employs vector-database-based semantic deduplication, BPE tokenization (139k vocab), and multi-stage preprocessing (normalization, anonymization, near-duplicate removal). Evaluated on seven Persian NLU benchmarks, IHUBERT achieves state-of-the-art performance on extractive QA (PQuAD F1 88.3542, ParsiNLU-RC F1 49.0987) and NLI (FarsTail Macro-F1 0.8350), while remaining competitive on NER (ParsTwiNER F1 0.8308) and topic classification (DigiMag Macro-F1 0.7953).
semantic deduplicationdomain balancingbpe tokenizationpersian nluroberta-base
Multi-Head Attention-Based Feature Extractor Integration with Soft Actor-Critic for Porosity Prediction and Process Parameter Optimization in Additive Manufacturing
A novel reinforcement learning architecture combining multi-head attention with Soft Actor-Critic (SAC) is proposed for additive manufacturing optimization. The attention-based feature extractor enhances low-dimensional feature capture, enabling improved exploration-exploitation balance in continuous action spaces. This approach addresses limitations of traditional RL methods in porosity prediction and process parameter optimization for laser powder bed fusion, demonstrating faster convergence and higher rewards compared to DQN, PPO, TD3, and vanilla SAC. The method achieves a convergence value of 322.79 within 14 episodes while maintaining training stability.
multi-head attentionsoft actor-criticadditive manufacturingporosity predictioncontinuous action space
Residual-Space Evolutionary Optimization via Flow-based Generative Models
The authors propose residual-space evolutionary optimization, a model-agnostic framework combining flow-based generative editing with evolutionary algorithms to address non-differentiable or black-box objectives. The method leverages conditional flow matching (CFM) to disentangle condition-controlled factors from instance-specific residuals, enabling two search regimes: self-pollination for local exploitation via residual refinement and cross-pollination for broader exploration through residual recombination. Validation on MorphoMNIST and crystal datasets demonstrates the framework's effectiveness in balancing target alignment, instance preservation, and diversity, extending its applicability beyond images to scientific domains.
conditional flow matchingevolutionary algorithmsresidual-space optimizationself-pollinationcross-pollination
The Hidden Evolution of Disguised Visual Context inside the VLM
This work provides a systematic comparison of visual-language model (VLM) integration paradigms, specifically in-context prompting versus layer-wise injection, under identical training conditions. The study evaluates these approaches across single image, multi-image, and video benchmarks, analyzing how visual tokens evolve into meaningful representations within large language models (LLMs). Results reveal that visual tokens enter LLMs as raw, disguised context lacking linguistic structure, but undergo distinct transformations depending on the integration paradigm, capturing different frequency characteristics of the visual signal. The research demonstrates that attention allocation alone is insufficient, and performance is driven by the quality of visual representations at each layer, affecting feature utilization and alignment with the language space.
visual-language modelin-context promptinglayer-wise injectionattention allocationlinguistic structure
Variable-Length Tokenization via Learnable Global Merging for Diffusion Transformers
The paper introduces a novel variable-length tokenizer (VLT) for diffusion transformers that modulates token length via learnable global merging, addressing cross-length representation misalignment in conventional VLTs. By encouraging semantically similar tokens to merge, the method ensures consistent latent distributions across varying token counts without data-dependent merging patterns. Integrated with a diffusion transformer on ImageNet 256×256 generation, this approach demonstrates superior gFID-compute trade-offs compared to prior VLT techniques.
variable-length tokenizerdiffusion transformerslatent distributionslearnable merginggfid-compute trade-off
Evaluation of EEG Foundation Models for Event-Based Burst-Suppression Detection in ICU
This work evaluates EEG Foundation Models (FMs) for event-based burst-suppression detection in ICU EEG, addressing variability in patterns and scarce annotations. The study compares REVE-base, LUNA-large, and LuMamba-Tiny against EEGNet and adaptive thresholding baselines, introducing event-based evaluation to assess clinical relevance. REVE-base achieved the highest event-based F1-score (0.868 ± 0.167), reducing burst-per-minute error by 52.1% versus EEGNet and by 36.2% versus thresholding. Full fine-tuning proved most effective, with pretrained REVE-base outperforming random initialization by +0.723 F1 at 25% cohort size, demonstrating FM utility for limited-data scenarios.
eeg foundation modelsburst-suppression detectionevent-based evaluationintensive care unitfine-tuning
Process-Verified Reinforcement Learning for Theorem Proving via Lean
The paper introduces process-verified reinforcement learning (RL) for theorem proving using the Lean proof assistant, which provides dense, fine-grained feedback at both outcome and tactic levels. The method parses proof attempts into tactic sequences, using Lean's elaboration to mark sound steps and first failures, then incorporates these structured rewards into a GRPO-style RL objective with first-error propagation. Experiments on STP-Lean and DeepSeek-Prover-V1.5 show tactic-level supervision outperforms outcome-only baselines on MiniF2F and ProofNet benchmarks, demonstrating the potential of symbolic proof assistants as process-level reward oracles during training.
reinforcement learningtheorem provinglean proof assistantstructured feedbackprocess-verified rl
Autonomous Event-Driven Multi-Agent Orchestration for Enterprise AI at Scale
The paper introduces a Task Manager for continuous event-driven orchestration in enterprise-scale multi-agent AI systems, addressing limitations of discrete request-response workflows. It evaluates DAG Plan and Execute versus ReAct architectures across 208 production scenarios at three scales (Persona, Department, Enterprise), measuring performance degradation due to agent discovery noise. Results show DAG excels in precision and parallelization at small scales but suffers from overhead at enterprise scale, while ReAct handles failures more robustly. The Task Manager reduces high-priority latency by 14-75% and improves related-event correctness by >20 percentage points at enterprise scale.
multi-agent systemsenterprise aievent-driven orchestrationdag plan and executereact architecture
See-and-Reach: Precise Vision-Language Navigation for UAVs within the Field of View
The paper introduces UAV-VLN-FOV, a target-visible navigation task isolating the see-and-reach stage for precise evaluation of UAV terminal reaching ability, and proposes 3DG-VLN, a vision-language waypoint prediction framework. 3DG-VLN processes high-resolution front-view and downward-view observations adaptively for fine-grained visual grounding, while dynamically updating target-relative direction to reduce spatial drift. A dedicated benchmark with 2,717 trajectories is constructed, featuring target-oriented instructions and continuous 3D waypoints. Experiments show 3DG-VLN achieves a 13.82% success rate improvement over baselines, with real-world trials validating practical applicability.
uav-vln-fov3dg-vlnvision-language navigationvisual groundingwaypoint prediction
AI Economist Agent: An Agentic Framework for Model-Grounded Economic Analysis with RAG, Knowledge Graphs, and Large Language Models
The paper introduces an AI economist agent framework integrating retrieval-augmented generation (RAG), knowledge graphs, and large language models (LLMs) for economic scenario analysis. The framework employs LLM-based agents to plan analyses, retrieve evidence, select models, and generate reports grounded in explicit model-based computations and retrieved evidence, avoiding direct quantitative claims from LLMs alone. Evaluations on U.S. inflation persistence, Federal Reserve policy, and commercial real estate refinancing stress demonstrate improved economic coherence and traceability in generated reports through grounding.
retrieval-augmented generationknowledge graphslarge language modelseconomic scenario analysismodel-based computations
A Neuromorphic Reinforcement Learning Framework for Efficient Pathfinding in Robotic Mobile Fulfillment Systems
The paper introduces SDQN-RMFS, a neuromorphic reinforcement learning framework for energy-efficient pathfinding in Robotic Mobile Fulfillment Systems (RMFS). The method involves training an ANN policy via collision-allowing trajectory densification, followed by conversion to an SNN through hard-label knowledge distillation to maintain policy fidelity. Hardware experiments demonstrate 11,281× energy savings and 2× latency reduction compared to GPU baselines, while preserving decision quality.
neuromorphic computingreinforcement learningrobotic pathfindingspiking neural networksknowledge distillation
When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents
The authors introduce ToolPrivBench to evaluate over-privileged tool selection in LLM agents, where agents unnecessarily choose higher-privilege tools despite sufficient lower-privilege alternatives. They assess both initial tool selection and escalation behavior after transient failures across eight domains and five recurring risk patterns. Results show that over-privileged tool selection is prevalent among mainstream LLM agents and exacerbated by transient failures, with general safety alignment and prompt-level controls proving insufficient. The authors propose a privilege-aware post-training defense that significantly reduces unnecessary high-privilege tool use while maintaining general capabilities.
tool selectionllm agentsprivilege escalationtransient failurespost-training defense
Hierarchical Control in Multi-Agent Games: LLM-based Planning and RL Execution
The paper introduces a hierarchical architecture combining LLM-based strategic planning with RL execution for multi-agent coordination. A pretrained LLM serves as a centralized controller that selects among specialized RL skill policies, while RL handles low-level execution. Evaluated in 2v2 King of the Hill, LLM+RL achieves win rates comparable to hand-crafted behavior trees (46.4% vs 51.5%, p=0.103) and outperforms flat RL. User studies (n=15) show 60% of participants perceive LLM+RL as the most human-like (p=0.027) due to its adaptability and tactical variability, demonstrating that LLMs can effectively orchestrate RL skills without manual engineering.
hierarchical controlmulti-agent reinforcement learningllm-based planningskill decompositionbehavior trees
StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation
StreamKL introduces a fused GPU primitive for efficient Kullback-Leibler (KL) divergence computation in attention distillation, eliminating the $O(N_QN_K)$ memory overhead of materializing full attention distributions. The method employs an online formulation for KL reduction, enabling a one-pass forward kernel with tile-wise streaming through on-chip SRAM, and recomputes attention probabilities tile-by-tile during backward passes. Experiments demonstrate speedups of up to $43\times$ (forward) and $14\times$ (backward) over baselines, while reducing high-bandwidth memory (HBM) overhead from quadratic to constant ($O(1)$), facilitating long-context distillation on single GPUs.
attention distillation · kl divergence · gpu optimization · online computation · memory efficiency
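The online KL reduction mentioned in the StreamKL summary can be illustrated on a single attention row. This is a minimal CPU sketch, not the paper's fused GPU kernel: the function names, the tile size, and the choice of KL(teacher || student) as the direction are all assumptions made for illustration. It keeps only a few running scalars per row, so the full probability vectors are never materialized.

```python
import math

def streaming_kl(teacher_logits, student_logits, tile=4):
    """KL(softmax(teacher) || softmax(student)), visiting each tile once."""
    m_t, l_t, a_t = -math.inf, 0.0, 0.0  # teacher: running max, exp-sum, weighted (t - s) sum
    m_s, l_s = -math.inf, 0.0            # student: running max, exp-sum
    for start in range(0, len(teacher_logits), tile):
        t_tile = teacher_logits[start:start + tile]
        s_tile = student_logits[start:start + tile]
        # Rescale the teacher accumulators to the new running max.
        new_m_t = max(m_t, max(t_tile))
        rescale = math.exp(m_t - new_m_t)  # equals 0.0 on the first tile
        l_t *= rescale
        a_t *= rescale
        for t, s in zip(t_tile, s_tile):
            w = math.exp(t - new_m_t)
            l_t += w
            a_t += w * (t - s)
        m_t = new_m_t
        # Same online log-sum-exp update for the student logits.
        new_m_s = max(m_s, max(s_tile))
        l_s = l_s * math.exp(m_s - new_m_s) + sum(math.exp(s - new_m_s) for s in s_tile)
        m_s = new_m_s
    lse_t = m_t + math.log(l_t)
    lse_s = m_s + math.log(l_s)
    # KL = E_p[t - s] - logsumexp(t) + logsumexp(s)
    return a_t / l_t - lse_t + lse_s

def reference_kl(teacher_logits, student_logits):
    """Two-pass reference that materializes both log-probability vectors."""
    def log_softmax(x):
        m = max(x)
        lse = m + math.log(sum(math.exp(v - m) for v in x))
        return [v - lse for v in x]
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The identity used in the last line of `streaming_kl` is what makes a one-pass reduction possible: the expectation term and both log-sum-exp terms can each be accumulated tile by tile.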
Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning
The paper introduces Connect the Dots (CoD), a framework for training large language models (LLMs) as long-lifecycle agents capable of iterative self-updating and cross-domain generalization. The method combines end-to-end reinforcement learning (RL) with long rollout sequences alternating between task-solving and context-updating episodes, employing a GRPO-style RL algorithm with fine-grained credit assignment. Empirical results demonstrate the framework's efficacy in eliciting meta-capabilities, enabling out-of-distribution generalization within and across domains, and extending to Ralph-loop settings. The authors release implementations to facilitate further research.
long-lifecycle agents · reinforcement learning · cross-domain generalization · meta-capability · ralph-loop
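For readers unfamiliar with the "GRPO-style" label in the CoD summary, the group-relative advantage at the heart of GRPO as commonly described can be sketched as below. This is a generic illustration only: the function name is hypothetical, the use of the population standard deviation is an assumption, and the paper's fine-grained credit assignment is not modeled.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Rollouts that beat their group's average receive positive advantages and the rest receive negative ones, which removes the need for a separately learned value baseline.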
Tri-Info: Generalizable, Interpretable Failure Prediction for VLA Models via Information Theory
The paper introduces Tri-Info, a generalizable and interpretable failure prediction method for Vision-Language-Action (VLA) models based on information theory. By formalizing VLA control as a closed-loop information pipeline, the authors derive three information-theoretic signals that capture action diversity, temporal consistency, and state-action coupling. Evaluated across six VLA models and three benchmark environments, Tri-Info matches top baselines in-domain (83% accuracy on real-world tasks) while demonstrating strong cross-domain generalization without retraining, outperforming prior detectors that collapse to chance.
vision-language-action models · failure prediction · information theory · cross-domain generalization · interpretable diagnostics
Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services
ToolPro introduces executable tool programs as a flexible interface for LLM-based agents invoking web services, addressing limitations of static endpoints in expressing complex workflows with loops, conditionals, and state modifications. The method combines constraint-guided program construction, effect-aware replay for idempotent state changes, and a profile-driven execution policy. Evaluated on MCP-style services with WebAssembly sandboxing, ToolPro achieves up to 53.4% latency reduction and 96.1% client-side traffic savings, with greater benefits under high network latency and workflow complexity.
tool programs · effect-aware replay · webassembly sandboxing · constraint-guided construction · profile-driven policy
Reward as An Agent for Embodied World Models
The paper proposes a novel RL framework for embodied world models that combines robust reward verification with diversified exploration. Methodologically, it introduces 'Reward as an Agent' for active behavior evaluation to mitigate reward hacking, and 'DynDiff-GRPO' for dynamic-aware rollout diversification to expand state-action coverage. Experiments demonstrate accuracy improvements across multiple open-source world models, showing that broader exploration scales successfully when paired with reliable verification.
reinforcement learning · world models · reward hacking · rollout diversification · embodied agents
ENPIRE: Agentic Robot Policy Self-Improvement in the Real World
ENPIRE introduces a framework for autonomous robot policy self-improvement in real-world manipulation tasks, addressing the bottleneck of human supervision. The system implements a closed-loop feedback routine with four modules: Environment (scene reset/verification), Policy Improvement (code refinement), Rollout (multi-robot evaluation), and Evolution (failure mode analysis via coding agents). Results demonstrate 99% success rates on dexterous tasks like pin box organization and tool use, with accelerated improvement when scaling to robot fleets, suggesting a viable path for autonomous robotics advancement.
robot policy improvement · coding agents · dexterous manipulation · closed-loop system · real-world robotics
The Algorithmic-Human Manager: AI, Apps, and Workers in the Indian Gig Economy
The study proposes an Algorithmic-Human Manager framework to address fairness and transparency issues in AI-driven management systems within India's blue-collar gig economy. Employing a mixed-methods approach, including interviews with 16 gig workers and 21 stakeholders, the research highlights the dual impact of AI: operational efficiency gains versus inequitable outcomes and opaque decision-making. Findings indicate that algorithmic systems fail to proportionally reward additional labor and are designed opaquely, undermining worker dignity. The hybrid governance model advocates for integrating technological efficiency with human accountability, offering policy implications for equitable AI governance in the Global South.
algorithmic management · gig economy · hybrid governance · social justice framework · operational efficiency
ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models
The paper introduces ROSE, a benchmark that evaluates the ability of multimodal large language models (MLLMs) to convert visual evidence into context-specific actions. ROSE fixes visual scenes while varying region constraints and symbolic outputs, testing models through counting and coordinate-action tasks to infer implicit majority references and act on fine-grained visual evidence. Across nine MLLMs, performance drops by up to 44.5 percentage points from counting to region-conditioned action tasks, despite 98.8% human accuracy. Analysis reveals coordinate grounding explains only part of this gap, indicating a model-dependent bottleneck in transforming shared visual evidence into context-specific actions.
multimodal large language models · visual evidence · context-specific actions · region constraints · symbolic outputs
Confidence Calibration for Multimodal LLMs: An Empirical Study through Medical VQA
This study introduces a novel method combining Multi-Strategy Fusion-Based Interrogation (MS-FBI) with auxiliary expert LLM assessment to improve confidence calibration in Multimodal Large Language Models (MLLMs) for Medical Visual Question Answering (VQA). The approach reduces Expected Calibration Error (ECE) by 40% on average across three Medical VQA datasets, enhancing model reliability for AI-assisted diagnosis. Results underscore the necessity of domain-specific calibration in healthcare applications of MLLMs.
multimodal large language models · confidence calibration · medical vqa · expected calibration error · multi-strategy fusion-based interrogation
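The Expected Calibration Error (ECE) cited in the Medical VQA summary is usually computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin. The sketch below assumes ten equal-width bins with left-closed edges; the function name and the binning details are illustrative choices, not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than overflowing.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that answers with 0.9 confidence but is right only three times out of four contributes a gap of 0.15 in that bin; lower ECE means stated confidence tracks observed accuracy more closely.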
Advancing DialNav through Automatic Embodied Dialog Augmentation
The authors propose an automatic generation pipeline to address data scarcity in DialNav, a framework for evaluating dialog-execution loops in embodied indoor navigation. They construct the RAINbow dataset, scaling training data from 2K to 238K episodes by converting existing Vision-and-Language Navigation (VLN) datasets into multi-turn dialog. Two complementary advances are introduced: Dual-Strategy Training to align navigation with dynamic dialog-navigation loops, and a localization model leveraging VLN knowledge. The combined approach raises success rate by 89% on Val Seen (58.24) and by 100% on Val Unseen (29.05), setting a new state of the art.
dialnav · vln · multi-turn dialog · dual-strategy training · localization model
SIMBA: A Bidirectional Retrieval Forward Simulation Framework for Modeling FY-4A GIIRS Hyperspectral Infrared Radiances Toward NWP Applications
The study introduces SIMBA, a bidirectional retrieval-forward simulation framework for FY-4A GIIRS hyperspectral infrared radiance modeling in NWP applications. The framework jointly performs atmospheric profile retrieval and radiance reconstruction, incorporating a cycle-consistency constraint and a bidirectional Mamba state-space module to capture long-range dependencies. Evaluated on collocated FY-4A GIIRS and ERA5 data, SIMBA outperforms deep learning baselines in temperature/humidity retrieval and radiance reconstruction, with ablations confirming the bidirectional design's efficacy. Results suggest potential for Jacobian analysis and NWP extensions.
hyperspectral infrared · numerical weather prediction · cycle-consistency · state-space module · radiance reconstruction
Triangular Consistency as a Universal Constraint for Learning Optical Flow
The paper introduces triangular consistency as a universal geometric constraint for optical flow learning, applicable across network architectures, supervision types (supervised/unsupervised), and datasets. The method enforces consistency among three flows: two composed flows (from image pairs, multi-frame sequences, or synthetic transformations) and their induced third flow. This architecture-agnostic approach requires no additional annotations and adds minimal computational overhead. Experiments demonstrate consistent performance gains in supervised, unsupervised, and transfer learning scenarios on optical flow tasks.
optical flow · triangular consistency · geometric constraint · temporal chaining · transfer learning
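The triangular constraint in the optical-flow summary relates a directly predicted flow to the composition of two others. A minimal sketch follows, assuming flows are given as functions from a pixel position to a displacement and that the penalty is a mean endpoint error; both the representation and the loss are illustrative assumptions, not the paper's formulation.

```python
import math

def compose(flow_ab, flow_bc):
    """Chain two flows: move by A->B, then look up B->C at the landed position."""
    def flow_ac(x, y):
        dx1, dy1 = flow_ab(x, y)
        dx2, dy2 = flow_bc(x + dx1, y + dy1)
        return dx1 + dx2, dy1 + dy2
    return flow_ac

def triangular_loss(flow_ab, flow_bc, flow_ac, points):
    """Mean distance between the direct A->C flow and the composed one."""
    composed = compose(flow_ab, flow_bc)
    total = 0.0
    for x, y in points:
        cx, cy = composed(x, y)
        px, py = flow_ac(x, y)
        total += math.hypot(cx - px, cy - py)
    return total / len(points)
```

With dense flow fields the lookup at the landed position would need interpolation, which this sketch sidesteps by using analytic flow functions.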
PhysDrift: Bridging the Embodiment Gap in Humanoid Co-Speech Motion Generation
The paper introduces PhysDrift, a framework for generating physically executable co-speech motions in humanoid robots, addressing the embodiment gap in traditional human-centric pipelines. It proposes IK-EER for prosody-preserving motion curation and PhysDrift for direct prediction of robot joint trajectories from speech, bypassing intermediate human-body representations. Experiments show improvements in speech-motion alignment, physical plausibility, motion smoothness, and real-time interaction capabilities compared to retargeting-based methods.
physdrift · ik-eer · embodiment gap · co-speech motion · humanoid robots
Speeding up the annotation process in semantic segmentation industrial applications
This study introduces unsupervised algorithms to accelerate semantic segmentation annotation in industrial materials science, demonstrating a 78% reduction in labeling time (from 170 to 37 hours) for high-resolution microstructure images (1280x959, 960x703). The work presents the largest public steel microstructure dataset (MIT License, DOI) and compares manual vs. algorithm-assisted labeling. A validated deep learning model is provided as a benchmark for the dataset.
semantic segmentation · unsupervised algorithms · microstructure characterization · annotation efficiency · industrial materials science
Spatial-Aware Reduction Framework: Towards Efficient and Faithful Visual State Space Models
The paper introduces STORM, a spatial-aware token reduction framework that addresses performance degradation in structurally enhanced Mamba variants during token reduction. STORM reformulates reduction as a structured operation on spatial units, preserving grid topology and neighborhood coherence through localized constraints. As a plug-and-play module requiring no training, STORM achieves state-of-the-art pruning accuracy, recovering up to 63.3% top-1 accuracy on VMamba and maintaining near-ViT performance (1.0% drop) on PlainMamba.
token reduction · spatial awareness · mamba variants · selective scanning · grid topology
The Tao of Agency: Autotelic AI, Embedded Agency and Dissolution of the Self
The paper introduces autotelic AI, where agents generate their own goals rather than relying on exogenous specifications. It examines intrinsic motivation, resource-driven priors, and causal-interventional learning, identifying embeddedness as a necessary but insufficient condition for autotelic agency. The study highlights the non-unique individuation of agents, emphasizing the challenge of self-boundary generation and relativization. The framework extends to quantum formulations, philosophical non-dual traditions, and LLM-based agentic implementations.
autotelic ai · embedded agency · intrinsic motivation · causal-interventional learning · quantum formulation
eCNNTO: A Highly Generalizable ConvNet for Accelerating Topology Optimization
The paper proposes eCNNTO, an element-based Convolutional Neural Network with residual connections to accelerate density-based Topology Optimization (TO) by predicting near-optimal element densities from early iteration histories. The method improves upon prior Deep Belief Network approaches by incorporating spatial correlations through CNN architecture and a novel training strategy using final-stage density histories, reducing required training data. Experiments demonstrate 90-97% iteration reduction across 2D/3D problems while maintaining generalization to diverse boundary conditions, loading cases, and mesh resolutions.
topology optimization · convolutional neural network · finite element analysis · residual connections · mesh resolution
Co-policy: Responsive Human-Robot Co-Creation for Musical Performances
Co-policy introduces a human-robot musical co-creation framework combining semantic intent grounding with real-time visuomotor execution. The system employs a fine-tuned Qwen-vl planner (F-Qwen) for semantic anchoring and a Gaussian-Mixture Visuomotor Policy (GMP) for low-latency action generation. Evaluations demonstrate superior intent alignment, execution accuracy, and response frequency compared to diffusion-policy baselines, validating its efficacy in embodied co-creation scenarios.
co-policy · qwen-vl planner · gaussian-mixture visuomotor policy · embodied ai · human-robot co-creation
Multi-Agent Transactive Memory
The paper introduces Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories in heterogeneous LLM agent ecosystems. MATM extends retrieval-augmented generation by organizing reusable procedural knowledge from agent trajectories (e.g., in ALFWorld and WebArena) into a shared repository, enabling consumer agents to retrieve solutions without rediscovery. Experiments show MATM improves task performance and reduces interaction steps by 15-30% without requiring coordination or joint training, establishing it as a design pattern for open agent ecosystems.
multi-agent systems · retrieval-augmented generation · procedural knowledge · agent trajectories · transactive memory
Measuring Biological Capabilities and Risks of AI Agents
The paper introduces biological agentic evaluations as a framework for assessing AI systems' autonomous biological research capabilities and associated risks. It synthesizes current evidence on AI-enabled biological risks and presents practical evaluation design considerations derived from empirical work. Key findings demonstrate how methodological choices in defining, designing, running, scoring, and documenting evaluations critically impact risk assessment interpretations. The analysis targets policymakers, funders, and biosecurity practitioners, with secondary relevance for AI researchers conducting agentic evaluations in labs and scientific institutions.
agentic ai · biological risks · evaluation design · biosecurity · autonomous research
MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments
MetaResearcher introduces a novel framework for scaling deep research agent training across four dimensions: (1) an Evolving Virtual World with adversarial dynamics, (2) Discovery-Oriented Tasks beyond fact retrieval, (3) a Self-Reflective Meta-Reward mechanism within GRPO to optimize multiple objectives, and (4) a Heterogeneous Multi-Agent Swarm architecture. The method combines temporal environment dynamics, advanced task design, and coordinated reinforcement learning to improve epistemic robustness and benchmark performance (GAIA, Xbench-DS) while maintaining zero marginal API cost.
evolving virtual world · discovery-oriented tasks · meta-reward · heterogeneous multi-agent swarm · grpo framework
SL-S4Wave: Self-Supervised Learning of Physiological Waveforms with Structured State Space Models
SL-S4Wave introduces a self-supervised learning framework combining contrastive learning with a structured state space model (S4) encoder tailored for physiological waveforms. The encoder employs multi-layer global convolution with multiscale subkernels to capture fine-grained local patterns and long-range temporal dependencies in noisy, high-resolution multichannel data. Evaluated on arrhythmia detection and EEG tasks, SL-S4Wave outperforms state-of-the-art supervised and self-supervised baselines, demonstrating strong label efficiency, robust performance on long sequences, and effective cross-domain generalization to unseen arrhythmia types.
structured state space models · contrastive learning · multichannel physiological waveforms · long-range dependencies · label efficiency
FinRED: An Expert-Guided Benchmark Generation and Evaluation Framework for Financial LLM Red-Teaming
The paper introduces FinRED, an expert-guided framework for evaluating financial LLM safety through red-teaming. It addresses finance-specific risks like regulatory violations and fraud using a two-level taxonomy aligned with global standards (FATF, EU DORA). The method converts financial documents into context-rich adversarial prompts via an expert-defined schema, validated for plausibility. Results show the framework reduces critical false negatives from 28 to 12 and is deployed in South Korea's FSI regulatory sandbox. Resources are gated for qualified researchers.
financial llm · red-teaming · regulatory compliance · fraud facilitation · iso/iec 27001
A Systematic Evaluation of Black-Box Uncertainty Estimation Methods for Large Language Models
The study presents a systematic evaluation of black-box uncertainty estimation methods for large language models (LLMs), addressing the fragmentation in existing methodologies. It organizes 24 representative methods into five categories: verbalization-based, sampling-based, explanation-based, multi-agent, and hybrid methods, and benchmarks them across 4 models and 4 dataset settings. Results indicate no single method dominates across all settings, though methods reasoning over answer space candidates and hybrid methods combining multiple uncertainty signals perform well. The release of benchmark data and a unified evaluation framework aims to support reproducible comparisons and future research.
uncertainty estimation · large language models · black-box methods · hybrid methods · benchmark framework
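To make the sampling-based category in the uncertainty-estimation summary concrete, one common black-box estimator draws several answers to the same question and treats the majority's vote share as confidence. The sketch below is a generic illustration with a hypothetical function name and a simple lowercase-and-strip normalization; it is not claimed to be one of the 24 benchmarked methods.

```python
from collections import Counter

def agreement_confidence(sampled_answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    # Assumption: answers are short strings that match after trivial normalization.
    counts = Counter(answer.strip().lower() for answer in sampled_answers)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(sampled_answers)
```

The estimator needs only generated text, with no access to logits or weights, which is what makes it applicable to black-box models.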
PSCT-Net: Geometry-Aware Pediatric Skull CT Reconstruction via Differentiable Back-Projection and Attention-Guided Refinement
PSCT-Net introduces a geometry-aware framework for pediatric skull CT reconstruction from sparse bi-planar X-rays, addressing depth ambiguity and osseous boundary degradation in existing methods. The approach combines differentiable back-projection for spatially faithful volumetric priors, an Attention-Guided Projection (AGP-3D) module for voxel-wise 2D-3D correspondence learning, and a Bidirectional Mamba (BiM-3D) module for efficient long-range volumetric dependency modeling. The authors also curate PedSkull-CT, a pediatric skull CT dataset with normal and pathological cases, filling a gap in adult-centric datasets.
differentiable back-projection · attention-guided projection · bidirectional mamba · pediatric skull ct · volumetric prior
Large Language Models Do Not Always Need Readable Language
The paper introduces BabelTele, a model-centric textual representation that sacrifices human readability while preserving semantic recoverability by LLMs. Through readability diagnostics, likelihood measures, human questionnaires, and task evaluations, the study demonstrates that BabelTele achieves 99.5% semantic fidelity at 27.9% text compression, enabling efficient cross-model transfer, agent memory, and multi-agent communication. Results indicate partial decoupling of human readability from model-side semantic processing, suggesting potential for model-native representations in LLM systems.
babeltele · semantic fidelity · model-centric representation · instruction-tuned llms · information density
Neural Additive and Basis Models with Feature Selection and Interactions
The paper proposes feature selection mechanisms for neural additive models (NAM) and neural basis models (NBM) to address computational bottlenecks in high-dimensional settings. By introducing trainable feature selection layers, the method reduces computational costs and model sizes while enabling explicit modeling of feature interactions via two-input neural networks. Experiments demonstrate improved efficiency over vanilla NAM/NBM and competitive performance with state-of-the-art generalized additive models (GAMs), maintaining interpretability through GAM-based architectures.
neural additive models · feature selection · generalized additive models · interpretability · computational efficiency
When, Where, and How: Adaptive Binning for Tabular Self-Supervised Learning
The paper introduces Adaptive Binning, a self-supervised learning method for tabular medical data that dynamically adjusts feature discretization during training via a coarse-to-fine curriculum. The approach combines feature-wise adaptive quantization with representation-aware split selection, leveraging spectral bias and curriculum learning principles. It unifies categorical reconstruction and ordinal supervision through a heterogeneity-aware objective. Evaluated on public medical datasets under standardized protocols, the method demonstrates consistent improvements in linear probing and fine-tuning performance without requiring dataset-specific tuning. The work also contributes a benchmark for reproducible research in medical tabular SSL.
adaptive binning · tabular ssl · spectral bias · curriculum learning · heterogeneity-aware objective
CSWinUNETR: Segmentation of Thin Anatomical Structures in Medical Images
The paper introduces CSWinUNETR, a novel architecture for segmenting thin anatomical structures in medical images. The method combines cross-shaped stripe self-attention for long-range context modeling with cyclic shifts for inter-stripe communication, augmented by a detail-enhanced multi-scale self-attention module. It further employs sparse-control dynamic snake convolution to reconstruct dense curvilinear kernels from sparse control points. Evaluated on four benchmarks spanning ophthalmology, neurovascular imaging, and dermatology, CSWinUNETR outperforms state-of-the-art methods without task-specific post-processing, demonstrating superior performance in preserving fine structures.
thin-structure segmentation · cross-shaped attention · dynamic snake convolution · multi-scale self-attention · cyclic shifts
TelcoAgent: A Scalable 5G Multi-KPM Forecasting With 3GPP-Grounded Explainability
TelcoAgent introduces a scalable foundation model-based framework for multi-KPM forecasting in 5G networks, addressing limitations in scalability and explainability. The framework integrates three components: an automated three-agent pipeline constructing a 3GPP knowledge graph from specifications, a time-series foundation model (TSFM) enabling zero-shot forecasting, and a reasoning pipeline providing domain-grounded diagnostics. Evaluated on a 3-month city-scale 5G KPM dataset from a U.S. operator, TelcoAgent achieves high forecasting accuracy across 200 cells for all 7 KPMs while delivering actionable insights for network degradation.
kpm forecasting · 3gpp knowledge graph · time-series foundation model · zero-shot forecasting · domain-grounded diagnostics
CREDENCE: Claim Reduction for Decomposition & Enhanced Credibility -- Semantic Metrics and Convergence Analysis
The paper introduces CREDENCE, a framework for decomposing compound sentences into verifiable claims, addressing limitations in prior work. It proposes Semantic-F1, a BGE-large cosine similarity metric replacing token-overlap measures, improving fact-checking accuracy by 15-32 pp. The work provides convergence theorems for rule-based (monotone, finite termination) and LLM-based (non-monotone) repair pipelines, validated on three benchmarks (SocialClaimSplit, WikiSplitBench, ClaimDecompBench). Experiments show rule-repair reduces Atomicity Violation Rate by 47-100% without fidelity loss, with EPR ranging from 0.824 to 1.00 across domains.
claim decomposition · semantic-f1 · convergence analysis · atomicity violation rate · rule-based repair
Uncertainty-Aware Reward Modeling for Stable RLHF
The paper proposes Uncertainty-Aware Reward Modeling (UARM) to address two key challenges in RLHF: reward models' inability to signal prediction uncertainty and group-based policy optimization's uniform treatment of rewards. UARM integrates calibrated uncertainty estimation via quantile-based conformal prediction and reweights GRPO advantages through heteroscedastic variance decomposition. Experiments on HelpSteer, UltraFeedback, and PKU-SafeRLHF demonstrate that UARM significantly improves reward model calibration, reduces reward hacking, and enhances downstream alignment quality compared to standard GRPO and uncertainty-agnostic baselines.
rlhf · reward modeling · conformal prediction · heteroscedastic variance · reward hacking
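The conformal prediction step in the UARM summary can be illustrated with plain split conformal prediction on absolute residuals. This is a simpler stand-in for the quantile-based variant the paper names, and the function names and the symmetric interval are assumptions made for the sketch.

```python
import math

def conformal_radius(calibration_residuals, alpha=0.1):
    """Finite-sample-corrected (1 - alpha) quantile of |true - predicted| residuals."""
    n = len(calibration_residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to guarantee coverage at this alpha.
        return math.inf
    return sorted(calibration_residuals)[rank - 1]

def reward_interval(predicted_reward, radius):
    """Symmetric interval around a point prediction of the reward."""
    return predicted_reward - radius, predicted_reward + radius
```

A wide interval flags a reward the model is unsure about, which is the kind of signal a policy optimizer could use to down-weight that sample.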
Human-on-the-Loop Orchestration for AI-Assisted Legal Discovery
The paper introduces a Human-on-the-Loop (HOTL) framework to mitigate trajectory collapse in LLM-assisted legal discovery, where early errors propagate through multi-step reasoning. It proposes a taxonomy of agentic failures, a four-layer verification architecture (planning, reasoning, execution, uncertainty quantification), and demonstrates via simulation on synthetic e-discovery data that calibrated uncertainty thresholds reduce privilege-waiver risk by 61% versus autonomous baselines while routing <25% of documents for attorney review.
trajectory collapse · privilege-waiver risk · multi-step reasoning · uncertainty quantification · electronic discovery
Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning
The paper introduces SEVRA (Selective Verification for Reasoning Allocation), a serving-layer controller that optimizes compute allocation by deciding whether to preserve a frozen solver's initial answer or invoke active verification. Using Qwen3-4B as the frozen solver, the method trains recoverability-aware gates from serving-visible attempt state to selectively verify responses. Results show SEVRA achieves 76.3% accuracy on MATH5 (vs. 75.5% for always verifying) with 26.8% fewer post-generation tokens and reduces harmful flips from 2.2% to 1.0%. Transfer to GSM8K yields 94.5% accuracy (vs. 93.4%) while verifying only 3.0% of examples, demonstrating compute-efficient trade-offs between initial solve budgets and selective recovery.
selective verification · reasoning allocation · recoverability-aware gates · compute efficiency · frozen solver
ParaScale: Scale-Calibrated Camera-Motion Transfer via a Gauge-Invariant Parallax Number
ParaScale introduces a scale-calibrated camera-motion transfer method using the gauge-invariant Parallax Number (Pi) to preserve motion perception across scenes of differing scales. The method computes Pi from a reference video and applies it to a target scene's depth, maintaining rotational components unchanged. This plug-and-play module operates between pose extraction and injection without retraining. Evaluated across four orders of magnitude, ParaScale reduces Parallax Consistency Error (PCE) by over 3x compared to uncalibrated transfer while preserving visual fidelity.
parallax number · scale-calibrated transfer · gauge-invariant · parallax consistency error · camera-motion transfer
Policy-aware Vector Search: A Vision for Fine Grained Access Control in Vector Databases
The paper proposes Policy-aware Vector Search, addressing the lack of Fine-grained Access Control (FGAC) in vector databases used for Retrieval Augmented Generation and organizational AI pipelines. It formalizes the FGAC policy model and enforcement problem, highlighting the tension between policy correctness, Approximate Nearest Neighbor (ANN) search recall, and query latency. The authors compare enforcement strategies, present preliminary findings, and identify open challenges. This work aims to enhance security in vector databases by integrating structured and unstructured attributes while maintaining semantic query accuracy.
fine-grained access control · vector databases · retrieval augmented generation · approximate nearest neighbor · policy-aware search
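The tension between policy correctness and recall described in the Policy-aware Vector Search summary shows up even with exact search. The sketch below contrasts two generic enforcement strategies, filtering after versus before the top-k selection; the record layout, function names, and brute-force cosine ranking are illustrative assumptions and do not reproduce the paper's policy model or any ANN index.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def post_filter_search(query, docs, allowed, k):
    """Rank everything, keep top-k, then drop results the caller may not see."""
    ranked = sorted(docs, key=lambda d: cosine(query, d["vec"]), reverse=True)[:k]
    return [d["id"] for d in ranked if allowed(d)]

def pre_filter_search(query, docs, allowed, k):
    """Restrict to visible documents first, then take the top-k among them."""
    visible = [d for d in docs if allowed(d)]
    ranked = sorted(visible, key=lambda d: cosine(query, d["vec"]), reverse=True)[:k]
    return [d["id"] for d in ranked]

# Hypothetical corpus: one HR document sits closest to the query.
DOCS = [
    {"id": "a", "vec": (1.0, 0.0), "dept": "hr"},
    {"id": "b", "vec": (0.9, 0.1), "dept": "eng"},
    {"id": "c", "vec": (0.5, 0.5), "dept": "eng"},
    {"id": "d", "vec": (0.0, 1.0), "dept": "eng"},
]

def eng_only(doc):
    return doc["dept"] == "eng"
```

Both strategies return only permitted documents, but post-filtering can return fewer than k results because forbidden documents occupy slots in the initial ranking; pre-filtering avoids that at the cost of evaluating the policy over the whole corpus.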
Improving End-to-End Speech Recognition for Dysarthric Speech through In-Domain Data Augmentation
The study improves dysarthric speech recognition by investigating in-domain data augmentation techniques for fine-tuning the Wav2Vec2 model, focusing on severity-specific adaptations. Four methods were evaluated: Speaking-Rate Modification (SRM), Pitch Modification (PM), Formant Modification (FM), and Vocal Tract Length Perturbation (VTLP), each tailored to dysarthric speech characteristics. Severity-specific fine-tuning with augmented data achieved best word error rates (WERs) of 9.02% (low severity, SRM), 38.11% (medium severity, SRM), and 55.15% (high severity, PM), yielding relative improvements of 30.02%, 16.64%, and 15.47%, respectively.
dysarthric speech · wav2vec2 · data augmentation · severity-specific · word error rate
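The dysarthric-speech summary pairs each best WER with a relative improvement, so the corresponding baselines can be back-solved from the standard definition of relative improvement. The helper names below are hypothetical, and any baseline obtained this way is derived arithmetic rather than a figure reported in the summary.

```python
def relative_improvement(baseline_wer, new_wer):
    """Fraction of the baseline error that was removed."""
    return (baseline_wer - new_wer) / baseline_wer

def implied_baseline(new_wer, rel_improvement):
    """Invert the definition: baseline = new / (1 - relative improvement)."""
    return new_wer / (1.0 - rel_improvement)

# Reported pairs from the summary: (best WER in %, relative improvement).
REPORTED = {"low": (9.02, 0.3002), "medium": (38.11, 0.1664), "high": (55.15, 0.1547)}
IMPLIED = {name: implied_baseline(wer, rel) for name, (wer, rel) in REPORTED.items()}
```

Reading the numbers this way makes clear that the largest relative gain occurs at low severity, where the absolute error was already smallest.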
Agentic Electronic Design Automation: A Handoff Perspective
The paper introduces handoff validity as a framework for analyzing LLM-based agents in electronic design automation (EDA), where multi-stage workflows require reliable transfer of design artifacts across tool and organizational boundaries. The authors survey 82 systems, classifying them into Stage-Bound, Flow-Bound, and Organization-Bound categories based on their handoff characteristics, and analyze contracts, objects, and coordination mechanisms. This analysis informs a proposed five-layer EDA Agent Communication Protocol (EACP) addressing discovery, messaging, tool invocation, workflow orchestration, and security/IP concerns.
electronic design automation · handoff validity · llm-based agents · workflow orchestration · provenance tracking
Systematic Study of Dysarthric Speech Recognition: Spectral Features and Acoustic Models
This study systematically investigates acoustic feature combinations and models for dysarthric speech recognition, demonstrating improved performance through optimized feature selection and model architecture. The authors evaluate various acoustic features, particularly highlighting the benefits of incorporating pitch features, and implement these with the Factorized Time Delay Neural Network (F-TDNN) on the TORGO database. Their approach achieves relative improvements of 4.65% in isolated word recognition and 4.63% in sentence recognition compared to prior work, attributed to strategic frame overlap selection during training. The results effectively address acoustic variability challenges in dysarthric speech.
dysarthric speech · acoustic features · factorized time delay neural network · torgo database · pitch features
Cross-Dataset, Age, and Gender Generalization: A Comprehensive Analysis of Fine-Tuning Strategies for Low-Resource Children's ASR
The study improves dysarthric speech recognition by optimizing acoustic feature selection and training strategies for Factorized Time Delay Neural Networks (F-TDNN). It systematically evaluates combinations of acoustic features, with Pitch features proving particularly beneficial for sentence recognition. Using the TORGO database, the method achieves 4.65% and 4.63% relative improvements in isolated word and sentence recognition respectively, addressing articulatory variability through optimized frame overlap in training chunks.
dysarthric speech · f-tdnn · acoustic features · torgo database · sequence discriminative training
CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models
The paper introduces CombEval, a dynamic benchmark for assessing combinatorial counting capabilities in large language models (LLMs). The framework uses typed Cofola specifications to generate natural-language counting problems with solver-verified answers, enabling systematic variation of object types, entity scales, constraint counts, and reasoning depths. Evaluation of 11 LLMs reveals brittleness in handling ordered objects, indistinguishable elements, positional constraints, and nested dependencies, with error analysis identifying failures in constraint interpretation and counting principles. CombEval serves as a diagnostic tool for studying LLM limitations in combinatorial reasoning.
combinatorial counting · large language models · dynamic benchmark · constraint interpretation · reasoning depth
ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?
We introduce ORAgentBench, an execution-grounded benchmark for evaluating autonomous agents on end-to-end operations research (OR) tasks, addressing limitations of existing OR evaluations. The benchmark comprises 107 human-reviewed tasks across diverse operational scenarios, each packaged with natural-language briefs, multi-file data, configuration artifacts, and submission schemas. Agents must write and run solution code, evaluated by hidden validators for schema validity, feasibility, and objective quality. Experiments with fourteen frontier agent-model configurations reveal that current agents remain unreliable, with the best agent passing only 35.51% of tasks and 20.59% of hard tasks, highlighting strategic weaknesses in operational rule adherence, formulation robustness, and solution quality.
operations research · autonomous agents · execution-grounded benchmark · feasibility · solution quality
AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA
AgentFinVQA introduces a deployable multi-agent pipeline for auditable financial chart question answering, addressing regulatory requirements for traceability and data residency. The system decomposes queries into planning, OCR, legend grounding, visual inspection, and verification steps, recording each in a Model Evaluation Packet (MEP). On FinMME, it achieves 71.24% accuracy with Gemini-3 Flash (+7.68 pp over baseline) and 68.08% with locally served Qwen3.6-27B-FP8 (+4.84 pp), while providing verifiable confidence signals (68.2% vs. 55.6% accuracy on confirmed vs. revised answers). Error analysis reveals question misunderstanding, legend confusion, and extraction errors as primary failure modes.
financial chart qa · multi-agent pipeline · model evaluation packet · data residency · verification confidence
Towards Engineering Scaling Laws with Pretraining Data Composition
The study demonstrates that pretraining data composition can engineer neural scaling laws in particle physics applications, favoring data scaling over model scaling. Using high-fidelity simulators to generate synthetic data, the authors analyze hadronic jet classification in high-energy particle collisions. Results show that diverse, task-aligned pretraining data shifts the scaling regime toward requiring more data rather than larger models, contrasting with typical natural language or image domains.
neural scaling laws · pretraining data · hadronic jets · synthetic data · high-energy physics
Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning
The paper introduces Independent Combinatorial Tokens (ICT), a framework addressing optimization instability in Reinforcement Learning with Verifiable Rewards (RLVR) for LLM reasoning. ICT shifts focus from scalar uncertainty to token logits distributions, using Jensen-Shannon divergence to identify critical branching points for exploration. Theoretical analysis shows ICT regulates policy concentration via dual entropy control (Shannon and second-order Rényi), preventing entropy collapse/explosion. Empirical results on Qwen2.5 models (0.5B-7B) demonstrate 4.58% average pass@4 improvement (max 14.9%) over baselines across seven reasoning benchmarks.
reinforcement learning · token logits · jensen-shannon divergence · rényi entropy · policy concentration
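The three quantities named in the ICT summary above (Jensen-Shannon divergence, Shannon entropy, and second-order Rényi entropy) can be computed from token logits as in the minimal sketch below. The summary does not say which pair of distributions ICT compares or how the entropies enter the objective, so the function names and the choice of inputs here are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(logits):
    """Turn raw token logits into a probability distribution."""
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shannon_entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def renyi2_entropy(p):
    """Second-order Renyi (collision) entropy: -log(sum_i p_i^2)."""
    return -math.log(sum(x * x for x in p))

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by log(2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical example: two next-token distributions over a 3-token vocabulary.
p = softmax([2.0, 1.0, 0.0])
q = softmax([0.0, 1.0, 2.0])
divergence = js_divergence(p, q)
```

The second-order Rényi entropy never exceeds the Shannon entropy, and the two coincide for a uniform distribution, which is one reason tracking both gives a finer picture of how concentrated a policy is.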
Data Standards for Humanoid Robotics: The Missing Infrastructure for Physical AI
The article proposes data standards as critical infrastructure for scaling humanoid robotics and Physical AI, based on the authors' development of ISO/WD 26264-1. It identifies three key insights: humanoid robot data must preserve embodied interaction context, require physical coherence across multimodal streams, and address non-cumulative data challenges. The authors argue that standardized datasets enable interpretable, shareable, traceable, and reusable embodied experience. The proposed framework includes horizontal infrastructure for lifecycle management and metadata, alongside domain-specific standards for manipulation, locomotion, and cognition. This standardization facilitates the transition from digital to physical AI systems.
humanoid robotics · physical ai · data standards · embodied interaction · multimodal streams
Optimal Scheduling in a Question-Answering Forum of Knowledge Workers
The paper proposes an optimal scheduling framework for question-answering forums employing knowledge workers, focusing on maximizing system capacity while maintaining stability. The authors model the request-answer process as a queuing system, where schedulers assign questions to experts based on their topic-specific expertise levels. They derive the system's capacity for handling requests and design schedulers that achieve this capacity. Additionally, the study explores how collaboration among experts can enhance the system's capacity. The results provide theoretical insights into optimizing QA forums with expert-driven workflows.
question-answering · queuing system · knowledge workers · optimal scheduling · system capacity
SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling
SafeSpec introduces a safety-aware speculative inference framework that integrates risk estimation into the verification process, addressing the incompatibility between speculative decoding and existing safety methods. The method employs a lightweight latent safety head to jointly assess semantic validity and safety during verification, enabling rollback and safety-guided reflective multi-sampling for unsafe generations. Evaluated on Qwen3-32B and adversarial benchmarks, SafeSpec reduces attack success rates by 15% while maintaining a 2.06x speedup on benign workloads, demonstrating optimized safety-efficiency trade-offs.
speculative inference · safety-aware decoding · latent safety head · jailbreak attacks · risk-aware trajectory recovery
Grounded Inference: Principles for Deterministically Encapsulated Generative Models
The paper introduces a foundational framework for deterministic encapsulation of probabilistic generative models in traditional computational systems. It defines four architectural primitives for AI-blended systems and identifies two prevalent anti-patterns in industry practice. The framework aims to mitigate risks associated with integrating generative AI into conventional systems while providing a basis for future generative model interfaces. This approach seeks to enable safer and more reliable incorporation of AI technologies.
generative models · deterministic encapsulation · architectural primitives · anti-patterns · ai integration
Temporal Self-Imitation Learning
We introduce Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that leverages temporally efficient successful trajectories as self-supervision for policy improvement. TSIL employs configuration-conditioned adaptive temporal targets derived from fast successful trajectories and preserves efficient behaviors through efficiency-weighted self-imitation learning. Evaluated across 15 distinct long-horizon manipulation tasks, TSIL demonstrates improved learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions. The results highlight that temporal structure in successful behavior provides a scalable self-supervisory signal beyond manual reward shaping.
temporal self-imitation learning · reinforcement learning · self-supervision · reward shaping · long-horizon manipulation
Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models
The paper introduces Bayesian Manifold Curriculum (BMC), a framework for adaptive curriculum learning in LLMs that treats problem sampling as a manifold-structured bandit problem. BMC organizes tasks hierarchically based on latent representations and uses Bayesian learning to guide sampling, addressing non-stationarity and structural dependencies. Experiments reveal tradeoffs between productivity (learning signal), diversity (manifold coverage), and utility (evaluation performance), demonstrating that difficulty-based sampling alone is insufficient for optimal downstream results.
manifold bandits · bayesian curriculum learning · latent geometry · non-stationarity · task hierarchy
Benchmarking Agentic Review Systems
The paper benchmarks agentic review systems for evaluating AI-assisted research, comparing two open-source systems (OpenAIReview, coarse) and one proprietary system (Reviewer3) across six LLMs. Using ICLR/NeurIPS papers and a perturbation benchmark with injected errors, the study measures alignment with human quality judgments (83.0% pairwise accuracy for OpenAIReview + GPT-5.5) and error detection recall (71.6% for the same configuration). Results show that model diversity improves recall (83.3% for the union of models), and real-user feedback indicates positive reception (1.44:1 vote ratio) despite false-positive concerns.
agentic review systems · perturbation benchmark · pairwise accuracy · error detection recall · in-context learning
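The two metrics reported in the agentic review entry above, pairwise accuracy against human judgments and recall on injected errors (per model and in union), can be sketched as follows. The data layout and all names are hypothetical; the summary does not describe how the benchmark stores its judgments.

```python
def pairwise_accuracy(pairs):
    """Fraction of paper pairs where the system prefers the same paper as humans.

    `pairs` is a list of (system_prefers_a, human_prefers_a) booleans
    (an assumed encoding, chosen for illustration).
    """
    if not pairs:
        raise ValueError("need at least one pair")
    agree = sum(1 for system, human in pairs if system == human)
    return agree / len(pairs)

def recall(detected, injected):
    """Share of injected errors that appear in the detected set."""
    injected = set(injected)
    if not injected:
        raise ValueError("need at least one injected error")
    return len(set(detected) & injected) / len(injected)

def union_recall(detections_by_model, injected):
    """Recall when an error counts as found if any model flags it."""
    found = set()
    for detected in detections_by_model.values():
        found |= set(detected)
    return recall(found, injected)

# Hypothetical example: two reviewers that catch different injected errors.
injected_errors = {1, 2, 3, 4}
detections = {"model_a": {1, 2}, "model_b": {2, 3}}
per_model = {name: recall(d, injected_errors) for name, d in detections.items()}
combined = union_recall(detections, injected_errors)
```

In this toy case each model alone recalls half of the errors while the union recalls three quarters, which mirrors the qualitative finding that diverse models catch complementary errors.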
A Comparative Study of Pretrained Transformer Models for Quranic ASR: Speech Representations, Label Formats, and Dataset Composition
The study systematically evaluates Transformer-based models for Quranic Automatic Speech Recognition (ASR), achieving a 5-percentage-point reduction in Word Error Rate (WER) over baselines. It compares self-supervised speech feature extractors (Wav2Vec2.0, HuBERT, XLS-R) fine-tuned on 870+ hours of Quranic recitations, analyzing impacts of label formats, training strategies, and clip durations. The optimal configuration yields WERs of 0.08 (EveryAyah) and 0.11 (EveryAyah+Tarteel), with Wav2Vec2-XLSR-53 outperforming others. Arabic text without diacritics proved most effective for fine-tuning. Future directions include phoneme-aware models for Tajweed-sensitive applications.
automatic speech recognition · transformer models · word error rate · self-supervised learning · fine-tuning
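The WER figures in the Quranic ASR entry above are fractions (0.08 means 8 word errors per 100 reference words). A minimal sketch of the standard word-level edit-distance definition follows; whitespace tokenization is an assumption, and the paper's exact text normalization is not given in the summary.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)

# One substituted word out of four reference words.
score = wer("a b c d", "a x c d")
```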
Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings
This study investigates sequential Direct Preference Optimization (DPO) across four preference settings—distributional conflict, multi-attribute interaction, strong safety signal, and compatible response-quality objectives—to determine whether later training uniformly degrades earlier preferences. Using Llama-3.1-8B-Instruct with LoRA adapters, the authors evaluate objectives after each stage with a fixed base-model reference. Results show heterogeneous preference changes, ranging from partial degradation to stability, pair-level redistribution, or positive transfer, depending on objective relationship, signal strength, and training order. Mechanistic diagnostics reveal that Stage 2 gradients and adapter updates are near-orthogonal to previous objectives, suggesting gradient opposition is not the primary driver. Findings emphasize the need to account for objective compatibility and signal strength in sequential alignment pipelines.
sequential direct preference optimization · lora adapters · policy margins · gradient opposition · objective compatibility
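Two ingredients of the sequential DPO entry above can be made concrete: the per-pair DPO loss against a fixed reference model, and a cosine-similarity check of whether two update directions are near-orthogonal. This is a sketch of the standard DPO objective on precomputed sequence log-probabilities; the `beta` value and the flattened-gradient representation are assumptions, not details from the paper.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy and the fixed
    reference model. Returns -log(sigmoid(margin)) in a numerically stable form.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def cosine_similarity(u, v):
    """Cosine of the angle between two flattened gradient (or update) vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical example: a policy identical to the reference has zero margin,
# and two axis-aligned gradients are exactly orthogonal.
untrained = dpo_loss(-5.0, -5.0, -5.0, -5.0)
stage_overlap = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

A cosine near 0 between Stage 2 gradients and earlier-objective gradients means the later update neither directly opposes nor reinforces the earlier objective, which is the sense in which the summary rules out gradient opposition as the primary driver.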
Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks
The authors introduce Evolving Programmatic Bottlenecks (EPB), a novel framework for interpreting Neural Combinatorial Optimization (NCO) policies by distilling black-box NCO models into human-readable program portfolios. EPB employs an LLM to autonomously evolve a bank of programs, using a hybrid textual-numerical gradient descent scheme for program revision and dynamic bank capacity adaptation. Experiments demonstrate that EPB-distilled portfolios match original NCO performance while revealing behavioral shifts across optimization stages, approximating NCO as a composition of classic heuristic variants. This work advances interpretable NCO and establishes EPB as a tool for sequential decision-making model interpretation.
neural combinatorial optimization · evolving programmatic bottlenecks · concept bottleneck models · hybrid gradient descent · sequential decision-making
GLARE: A Natural Language Interface for Querying Global Explanations
We introduce GLARE, a natural language interface leveraging LLMs to query global explanations for black-box image classifiers. The system employs an LLM mediator to translate natural language questions into structured SQL queries over local explanation data, enabling flexible aggregation while abstracting low-level representations. It generates statistics-augmented natural language responses and intent-aligned visualizations for each query. Evaluations focus on intent interpretation, query mapping accuracy, generalization to novel queries and datasets, and robustness to linguistic errors. Results demonstrate that LLM-mediated querying significantly enhances the accessibility and usability of global explanations in human-centered explainable AI (XAI).
global explanations · natural language interface · llm mediator · sql queries · human-centered xai
QueryGaussian: Scalable and Training-Free Open-Vocabulary 3D Instance Retrieval
QueryGaussian introduces a training-free framework for scalable open-vocabulary 3D instance retrieval, addressing memory and computational bottlenecks in existing scene-level embedding approaches. The method employs an instance-level query mechanism, leveraging pre-trained 2D vision models to interpret natural language prompts and lift segmentation masks into 3D via maximum-weight association, enhanced by a temporal fusion module with multi-stage adaptive density clustering. Results show 70% reduced GPU memory usage, 180x faster inference, and city-scale retrieval capability on tens of millions of Gaussians using consumer hardware.
3d instance retrieval · open-vocabulary · maximum-weight association · temporal fusion · adaptive density clustering
VOiLA: Vectorized Online Planning with Learned Diffusion Model for POMDP Agents
VOiLA introduces a framework for vectorized online POMDP planning using learned diffusion models, addressing the challenge of obtaining accurate POMDP models in real-world applications. The method combines conditional diffusion models for transition/observation sampling with distilled feedforward generators, integrated with the GPU-parallelized Vectorized Online POMDP Planner (VOPP). Results show a 1000x reduction in sampling cost via distillation; the method outperforms Recurrent Soft Actor Critic while using less than 10% of the training data and generalizes better to unseen environments. Physical robot tests achieved a 100% success rate using simulation-trained models.
pomdp · diffusion models · online planning · gpu parallelization · belief updates
Bidirectional Tutoring for Developmental Motor Learning in Robots: Co-Developed Interaction Dynamics Support Stable Learning
The study demonstrates that bidirectional tutoring enhances developmental motor learning in robots by fostering consistent behaviors and stage-wise generalization. Using a free-energy-principle-based neural network with generative replay, the authors conducted experiments with a physical humanoid robot performing object manipulation, comparing human-robot interaction and AI-tutor conditions. Results showed bidirectional interaction reduced tutor guidance over time while maintaining behavioral coherence, unlike unidirectional methods.
developmental motor learning · bidirectional tutoring · free-energy principle · generative replay · humanoid robot
NRITYAM: Language Models Meet Art and Heritage of Dance
NRITYAM introduces a benchmark for evaluating language models' cultural comprehension in global dance traditions, featuring 9,260 question-answer pairs across 12 languages. Developed with native dance artists and speakers, it assesses models including large, small, and multimodal variants. Results demonstrate NRITYAM's effectiveness as a multilingual, multicultural standard for AI understanding of traditional performing arts.
nrityam · benchmark · multilingual · cultural comprehension · dance traditions
Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware
The study presents an automated unit test (UT) generation workflow for AMD's Open-Source Silicon Initialization Library (openSIL) firmware, combining LLM-guided multi-agent scaffolding, library-aware stub/mock creation, and iterative repair using build logs and coverage feedback. The method achieves 73/76 compilable UTs, with mean line coverage reaching 73.9% without coverage guidance, 98.8% with coverage guidance, and 94.7% when augmented by vector-database retrieval. Results demonstrate significant efficiency gains in UT creation for constrained firmware environments.
unit test generation · firmware validation · iterative repair · library-aware stubs · coverage-guided testing
OnDeFog: Online Decision Transformer under Frame Dropping
OnDeFog improves reinforcement learning performance under frame dropping by integrating Decision Transformer mechanisms with online learning. The method combines DeFog's frame-dropping resilience with Online Decision Transformer's (ODT) ability to adapt through environmental interaction, addressing DeFog's offline generalization limitations. Experiments show OnDeFog outperforms ODT in high frame-dropping scenarios and surpasses DeFog on datasets with abundant low-reward trajectories.
decision transformer · frame dropping · online reinforcement learning · offline generalization · performance degradation
AURA: Adaptive Uncertainty-aware Refinement for LLM-as-a-Judge Auditing
AURA introduces an adaptive uncertainty-aware refinement framework for auditing pairwise LLM-as-a-judge decisions, addressing limitations in existing pipelines that assume reliable supervision signals. The method iteratively learns human-consistency signals, propagates reliable evidence, and prioritizes uncertain comparisons for human review by treating judge trust as a latent quantity refined through evidence accumulation. Evaluations on synthetic and real pairwise LLM-answer data demonstrate the framework's effectiveness in improving audit reliability under scarce human verification.
llm-as-a-judge · uncertainty-aware · human-consistency · pairwise comparison · adaptive refinement
FineREX: Fine-Tuned NER-RE for Human Smuggling Knowledge Graphs
FineREX introduces a domain-specific fine-tuned LLM pipeline for constructing knowledge graphs from human smuggling court documents, addressing limitations of general-purpose models. The method combines named entity recognition and relationship extraction (NER-RE) on a manually annotated dataset of 512 text chunks. Results show 15.50% and 31.46% absolute F1-score improvements for entities and relationships respectively, with 50.0% faster processing and reduced node duplication (17.78% to 11.17%) compared to general-purpose baselines.
knowledge graph construction · named entity recognition · relationship extraction · domain-specific fine-tuning · illicit network analysis
Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents
The paper critiques static leaderboards in LLM agent evaluation, demonstrating their poor predictive validity for out-of-distribution deployment. Through fourteen parallel implementation studies and consolidation of seven prior benchmarks, it reveals rank instability when comparing public and hidden test sets. The authors propose evaluating configurations by predictive validity (in-sample vs out-of-sample rank correlation) rather than aggregate scores, introducing a twelve-tier measurement apparatus and three falsifiable OOD criteria. Evidence partially supports the approach but remains insufficient for confirmation, prompting a pre-registered pilot design for next-generation agentic benchmarks.
predictive validity · out-of-distribution · agent benchmarks · rank instability · leaderboards
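The predictive-validity idea in the entry above (in-sample vs out-of-sample rank correlation) can be illustrated with a rank correlation between public and hidden test scores. The summary does not name the coefficient, so Spearman's rho is an assumption here, and the scores are made-up illustration values.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical scores for four agent configurations.
public_scores = [0.9, 0.8, 0.7, 0.6]
hidden_scores = [0.5, 0.7, 0.6, 0.4]
rho = spearman(public_scores, hidden_scores)
```

A rho well below 1 means the public leaderboard order is a weak predictor of the hidden-set order, which is the kind of rank instability the entry describes.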
Efficiently Representing Algorithms With Chain-of-Thought Transformers
(No summary returned.)
Exit-and-Join Dynamics for Decentralized Coalition Formation
The paper introduces a decentralized coalition formation model where agents make unilateral exit-and-join decisions based on the Aumann-Dreze value, evaluating payoffs within their current coalition rather than through global negotiation. The framework connects cooperative payoff allocation with noncooperative best-response dynamics, defining terminal partitions as coalition structures without individually profitable deviations. Theoretical results include equilibrium characterizations, conditions for Lyapunov or exact-potential representations, and analysis of switching/acceptance costs. Numerical experiments examine finite-time stabilization, cost sensitivity, and a convex-game benchmark.
coalition formation · aumann-dreze value · lyapunov representation · noncooperative dynamics · decentralized decision-making
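For the coalition formation entry above, the Aumann-Dreze value assigns each agent its Shapley value in the game restricted to its own coalition. The brute-force sketch below computes it for small games and checks whether a unilateral exit-and-join move pays off. The characteristic function `v` and the partition are toy assumptions, and switching and acceptance costs from the paper are left out.

```python
from itertools import permutations

def shapley(players, v):
    """Shapley value by enumerating all orderings (only feasible for small sets)."""
    players = sorted(players)
    totals = {p: 0.0 for p in players}
    count = 0
    for order in permutations(players):
        seen = frozenset()
        for p in order:
            joined = seen | {p}
            totals[p] += v(joined) - v(seen)  # marginal contribution of p
            seen = joined
        count += 1
    return {p: total / count for p, total in totals.items()}

def aumann_dreze(partition, v):
    """Each agent receives its Shapley value within its own coalition."""
    payoff = {}
    for block in partition:
        payoff.update(shapley(block, v))
    return payoff

def gain_from_joining(agent, target, partition, v):
    """Payoff change if `agent` leaves its coalition and joins `target`."""
    current = aumann_dreze(partition, v)[agent]
    after = shapley(set(target) | {agent}, v)[agent]
    return after - current

# Toy convex game (an assumption): a coalition's worth is its size squared.
def v(coalition):
    return len(coalition) ** 2

partition = [{1, 2}, {3}]
payoffs = aumann_dreze(partition, v)
gain = gain_from_joining(3, {1, 2}, partition, v)
```

Here agent 3 gains by joining the pair, so this partition is not terminal in the entry's sense; a terminal partition is one where no such individually profitable deviation exists.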
LOKI: Memory-Free Null-Space Constrained Lifelong Knowledge Editing
LOKI introduces a memory-free lifelong knowledge editing method for language models, addressing inflexible layer modification and catastrophic forgetting in prior approaches. It dynamically selects layers via the Hilbert-Schmidt Independence Criterion and projects gradient updates onto weight null-spaces, eliminating the need to access previously edited knowledge. Experiments demonstrate up to 14% average accuracy improvement over existing methods across diverse benchmarks.
lifelong knowledge editing · null-space projection · hilbert-schmidt independence criterion · catastrophic forgetting · dynamic layer selection
TeleMorpher: Toward Robust Simultaneous Motion-Location Editing
TeleMorpher introduces a one-shot framework for simultaneous motion-location editing in videos, addressing a previously unexplored task. The method leverages motion priors from an off-the-shelf model, disentangles protagonist and background via pre-trained segmentation and inpainting models, and employs training-free pose warping for precise motion editing. Edited motion is injected into a baseline motion editor during inference, preserving source video appearance while aligning with target motion. Two novel LPIPS-based metrics evaluate background consistency and motion fidelity. Experiments on in-the-wild videos and the TaiChi dataset demonstrate TeleMorpher's superior performance in both quantitative metrics and real-human evaluations.
motion-location editing · pose warping · motion priors · lpips-based metrics · training-free
Denoising Implicit Feedback for Cold-start Recommendation
The paper introduces DIF, a model-agnostic method for denoising implicit feedback in cold-start recommendation scenarios. It addresses noise in user-item interactions (e.g., clickbait) by leveraging stable user preferences to infer pseudo-labels via content-similar warm items. Confidence in pseudo-labels is modeled using content similarity, and label uncertainty is estimated via relative entropy and cold-start status. DIF outperforms heuristic-based denoising methods, as demonstrated by theoretical analysis and experiments on real-world datasets, including deployment on Kuaishou's billion-user platform with significant metric improvements.
implicit feedback · cold-start recommendation · pseudo-labels · content similarity · relative entropy
BrainG3N: A Dual-Purpose Tokenizer for Controllable 3D Brain MRI Generation
The paper introduces BrainG3N, a dual-purpose tokenizer for 3D brain MRI generation that decouples clinical information retention from anatomical reconstruction. The method employs a frozen 3D masked-autoencoder (MAE) encoder to produce clinically informative embeddings and a dedicated CNN decoder for voxel reconstruction, pretrained on 35,309 volumes from 18 public cohorts. Results show the encoder outperforms SOTA models on 21 of 23 linear-probing tasks, and a conditional diffusion transformer (DiT) trained on these embeddings enables controllable generation and patient-specific longitudinal forecasting.
3d brain mri · masked-autoencoder · latent diffusion · clinical embeddings · conditional generation
Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language
The study exposes limitations in generating multilingual mental health datasets via persona-based localization, demonstrating that merely modifying nationality and language parameters introduces clinical inconsistencies. Researchers created synthetic clinical dialogues in Mandarin, Bengali, and Hindi by adapting English-centric personas, then evaluated depression severity assessments using multiple LLMs. Results reveal significant inaccuracies in non-English evaluations, with performance variability across models, underscoring systemic biases in English-centric persona methods and advocating for culturally responsive data generation.
large language models · clinical personas · multilingual datasets · depression severity · cultural responsiveness
Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text
The study critiques the operationalization of suicidality detection in clinical NLP, emphasizing how dataset construction shapes label interpretation. Focusing on the ScAN dataset derived from MIMIC-III clinical notes, it examines governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation. These factors produce labels reflecting clinician-documented judgments, treating suicidality as bounded episodes and assuming reliable intent inference. A linguistic analysis reveals that identical labels encompass heterogeneous clinical framings varying in temporality, negation, and uncertainty. The authors advocate for critical examination of dataset assumptions before treating labels as ground truth in clinical NLP.
clinical nlp · suicidality detection · dataset construction · linguistic analysis · electronic health record
Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation
The paper identifies a sampling blind spot in math-reasoning difficulty estimation, demonstrating that pass@k metrics underestimate the hardest stratum of problems. Using activation grafting on residual streams, the authors introduce a deterministic regime involving greedy decoding and five distinct perturbations. On GSM8K and MATH benchmarks across four open-weight models, 10.3-22.9% of problems unsolved by six sampling seeds were solved by this regime, while greedy decoding alone solved ≤6%. Mechanistic distinctness of perturbations was verified via cross-kind fix-set Jaccard scores ≤0.47, confirming the structural identifiability of these problems in residual streams.
pass@k · activation grafting · residual stream · greedy decoding · mechanistic distinctness
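Two measurements mentioned in the entry above can be sketched briefly: pass@k over a fixed number of samples, and the Jaccard score between the sets of problems fixed by two perturbation kinds. The pass@k function uses the widely used unbiased combinatorial estimator; whether the paper uses this estimator or a plain "any of six seeds solved it" check is not stated in the summary, and the problem IDs are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def jaccard(a, b):
    """Jaccard score |A & B| / |A | B|; 0.0 for two empty sets (a convention)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# A problem that no sample out of six solves has pass@6 = 0 ...
unsolved = pass_at_k(n=6, c=0, k=6)
# ... and two perturbation kinds that fix partly different problems overlap only in part.
overlap = jaccard({"p1", "p2", "p3"}, {"p2", "p3", "p4"})
```

A low Jaccard score between fix-sets indicates that different perturbations rescue different problems, which is how the entry argues the perturbations are mechanistically distinct.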
Token Factory: Efficiently Integrating Diverse Signals into Large Recommendation Models
The paper introduces Token Factory, a framework for efficiently integrating traditional signals into Large Recommendation Models (LRMs) by transforming them into 'soft tokens'. This method addresses limitations of conventional approaches that textualize signals or use discrete item representations, which often result in long prompts, high memory usage, and computational overhead. The proposed architecture compresses heterogeneous input features, preventing prompt length explosion while improving performance. Experimental results demonstrate its effectiveness in a production-scale recommendation environment.
large recommendation models · soft tokens · transformer-based architectures · heterogeneous input features · prompt length explosion
CTS-MoE: Implicit Terrain Adaptation via Mixture-of-Experts for Perceptive Locomotion
CTS-MoE introduces a mixture-of-experts architecture for perceptive legged locomotion, combining a dense MoE actor with perception-based gating and multi-critic value heads to balance behavior sharing and task specialization. The method employs end-to-end training in a concurrent teacher-student setup, eliminating the need for terrain classifiers or hierarchical selectors at deployment. Experiments on a Unitree Go1 robot demonstrate lower tracking error and higher success rates across seen and unseen terrains compared to monolithic baselines.
mixture-of-experts · perceptive locomotion · multi-task reinforcement learning · concurrent teacher-student · terrain adaptation
Formal Verification of Learned Multi-Agent Communication Policies via Decision Tree Distillation
The authors introduce the first end-to-end framework for formal safety verification of learned multi-agent communication policies, addressing the lack of safety guarantees in neural policies for robotic deployments. Their method distills neural policies into interpretable decision trees (97.9% fidelity), translates them into PRISM specifications, and verifies Probabilistic Computation Tree Logic properties via compositional verification. Evaluated on Vector-Quantized Variational Information Bottleneck policies for multi-drone coordination, the framework verifies 18 temporal logic properties, achieving 88.9% satisfaction and collision probabilities below safety thresholds (0.3% vs. 1%). Monte Carlo validation confirms property transfer with ≤0.6 percentage-point deviation.
multi-agent reinforcement learning · decision tree distillation · probabilistic computation tree logic · formal verification · vector-quantized variational information bottleneck
AI4SE and SE4AI Exploration: A Decade Looking Back and Forward
The article analyzes the intersection of AI and Systems Engineering (SE) through three phases: foundational, applied, and LLM inflection, based on a review of core papers. A human-AI agreement literature review assessed the relevance of 1,712 INCOSE INSIGHT articles and 889 SERC publications using human expertise and six AI models. The study identifies five critical research gaps and provides guidance for AI adoption, assurance, and workforce transformation in SE. The authors share agreement data and the AI4SE/SE4AI Explorer web application for comparative relevance judgments.
systems engineering · llm inflection · human-ai agreement · ai adoption · workforce transformation
RIVET: Robust Idempotent Voice Attribute Editing
RIVET introduces an idempotency-based training framework to improve robustness in voice attribute editing models under noisy label conditions. The method enforces idempotency (f(f(x)) = f(x)) as an implicit regularizer, reducing sensitivity to mislabeled examples. Evaluated on controlled noise and the GLOBE dataset, RIVET outperforms standard training in editing success (preserving target attributes) and speaker identity retention, demonstrating idempotency's effectiveness for label noise robustness.
idempotency · voice attribute editing · label noise robustness · implicit regularization · generative models
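The idempotency condition f(f(x)) = f(x) from the RIVET entry above can be turned into a penalty by measuring how far a second application of the editor moves its own output. The sketch below uses plain Python lists and toy editors; the mean-squared-error form and the function names are assumptions, since the summary does not give RIVET's actual loss.

```python
def idempotency_gap(f, x):
    """Mean squared difference between f(f(x)) and f(x); 0 for an idempotent f."""
    once = f(x)
    twice = f(once)
    return sum((a - b) ** 2 for a, b in zip(twice, once)) / len(once)

def clamp_unit(x):
    """Toy idempotent editor: clamping to [0, 1] twice changes nothing more."""
    return [min(1.0, max(0.0, v)) for v in x]

def halve(x):
    """Toy non-idempotent editor: each application shrinks the signal again."""
    return [v / 2 for v in x]

signal = [2.0, 4.0]
gap_idempotent = idempotency_gap(clamp_unit, signal)
gap_drifting = idempotency_gap(halve, signal)
```

In training, a term like this would be added to the usual editing loss so that the model is pushed toward outputs it would leave unchanged if asked to edit them again.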
VCG: A Multimodal Retrieval Framework for E-Commerce Video Feeds under Extreme Cold-Start Conditions
The Video Candidate Generation (VCG) system addresses extreme cold-start challenges in e-commerce video feeds by introducing a multimodal retrieval framework. It employs a domain-adapted CLIP-based vision-language model to map users and videos into a shared semantic space, enabling zero-shot retrieval without behavioral history. Comparative evaluations show generative LLM embeddings suffer from collapse in retrieval tasks, while VCG's discriminative approach achieves a 50% uplift in deep video completion and mitigates engagement biases in online A/B tests.
multimodal retrieval · cold-start problem · vision-language model · zero-shot retrieval · embedding space collapse
Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese
The paper introduces TOTEN, a knowledge-based ontological tokenization framework for Brazilian Portuguese that replaces statistical BPE with declarative classification grounded in an engineering ontology (OEE). The system integrates three external oracles (Pint, Unicode Character Database, RSLP) to preserve dimensional, typographic, and morphological invariants. Evaluated on EngQuant (N=800) and four Brazilian Portuguese corpora (N=1771), TOTEN achieves 0.775-0.904 numerical reconstruction accuracy versus 0.627-0.703 for the best baseline (Quantulum3), with statistically significant improvements in ontological atomicity and dimensional equivalence.
tokenization · ontology · dimensional · reconstruction · portuguese
Before the Pull Request: Mining Multi-Agent Coordination
The paper introduces 'grite', a decentralized coordination substrate for autonomous coding agents that records pre-pull-request interactions within git's append-only event log. This method eliminates duplicate work (reducing redundant tasks from 78% to 0%) and triples useful throughput while ensuring log consistency across agents. The system enables mining of coordination failures (e.g., conflicting edits, lock starvation) with full provenance, revealing patterns invisible in pull-request telemetry. The authors release the dataset, harness, and toolkit for further analysis.
autonomous agents · git coordination · event log · conflict resolution · pull requests
StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns
StaminaBench introduces a novel benchmark for evaluating coding agents' stamina across 100 interaction turns, simulating real-world extended coding sessions. The benchmark tests agents on implementing and modifying a REST API server through procedurally generated change requests, with codebases reaching 6,000 lines. Using an isolated, black-box environment, six agent harnesses paired with seven LLMs were evaluated across 20 scenarios. Key findings include: (1) all models fail within 5-6 turns without testing; (2) test feedback improves performance up to 12x; (3) harness quality significantly impacts results, with a 6x performance gap observed. The benchmark is released for further research.
staminabench · coding agents · rest api · interaction turns · llm harness
Latent Confounded Causal Discovery via Lie Bracket Geometry
The paper introduces two novel causal discovery algorithms, BRIDGE and SKFM, for latent confounded settings by leveraging Lie bracket geometry from Kan-Do-Calculus (KDC). BRIDGE identifies latent confounders via Frobenius residuals from non-closing causal vector fields, while SKFM spectrally factors latent curvature using amortized intervention fields. Experiments demonstrate both methods significantly reduce the super-exponential DAG search space while accurately recovering causal structures with latent variables, establishing a geometric paradigm for intervention-based causal discovery.
kan-do-calculus · lie bracket geometry · frobenius residuals · latent confounding · causal discovery
Which Pairs to Compare for LLM Post-Training?
The paper addresses the optimization of comparison pair selection in preference-based post-training for language models, focusing on Direct Preference Optimization (DPO). It formulates comparison curation as a sampling-design problem, analyzing how pair selection impacts policy performance through a design-dependent information matrix. Theoretical bounds on the DPO-trained policy's optimality gap are derived, linking label allocation to parameter estimation error. Proposed sampling designs improve sample efficiency over heuristics in synthetic and benchmark settings.
preference-based post-training · direct preference optimization · sampling-design problem · information matrix · optimality gap
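For readers unfamiliar with the objective that comparison pairs feed into, here is a minimal sketch of the standard per-pair DPO loss, -log σ(β[(log π(y_w) - log π_ref(y_w)) - (log π(y_l) - log π_ref(y_l))]). The function name, argument names, and the default `beta=0.1` are illustrative assumptions; the paper's sampling designs themselves are not reproduced here.

```python
import math


def dpo_pair_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) comparison pair."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the reference model does, scaled by beta.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

A pair on which the policy and the reference agree exactly gives a margin of zero and a loss of log 2. The loss shrinks as the policy shifts probability toward the chosen response.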
FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines
FAPO introduces a fully autonomous framework for optimizing multi-step LLM pipelines by evaluating, diagnosing, and iteratively improving prompts and structural components. It employs Claude Code to propose scoped changes, prioritizing prompt edits before structural modifications when necessary. Benchmarked across six tasks and three models, FAPO outperforms GEPA in 15 of 18 comparisons, with mean gains of +14.1 pp overall and +33.8 pp in cases requiring structural changes. It also enhances security task performance, notably improving CVE-to-CWE accuracy by up to +7.1 pp on specialized models.
fapo · llm pipelines · prompt optimization · claude code · cve-to-cwe
Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why
The paper introduces ACIE (Agentic Clinical Information Extraction), an on-premise agentic RAG pipeline addressing clinical information extraction challenges in heterogeneous patient documents. The system handles temporal reasoning, cross-document dependencies, and missing metadata by reasoning over complete patient contexts and grounding answers in source passages. Evaluated against a lymphoma registry study with 7,326 clinician judgments, ACIE achieved 96.5% acceptance, with per-type acceptance ranging from 80% to 99%.
agentic rag · clinical information extraction · temporal reasoning · cross-document dependencies · metadata gap
PrefSQA: Pairwise Preference Prediction for Speech Quality Assessment and the Critical Role of High Quality Datasets
PrefSQA introduces a pairwise preference prediction method for speech quality assessment, addressing label noise in mean opinion scores (MOS) through uncertainty-aware logits, an impairment attention head, and non-matching-reference comparisons. The study evaluates five datasets, including MOS-derived and low-noise simulated sets, demonstrating improved reliability over baselines, particularly with high-quality preference data. Results show modest gains on MOS-derived data but significant improvements on other sets, validating the method's effectiveness in reducing rater variability.
preference prediction · speech quality assessment · mean opinion scores · uncertainty-aware logits · impairment attention
IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows
IHBench introduces a benchmark for evaluating post-interruption recovery in voice agents executing structured workflows across 10 enterprise domains. It assesses recovery on two axes—task fulfillment and recovery quality—by injecting six interruption types mid-utterance and generating per-interruption evaluation rubrics. The study evaluates 27 audio-language model configurations from OpenAI, Google, and open-weight communities, finding closed-weight models consistently more robust: they outperform on task fulfillment, degrade 3.3x slower in longer conversations, and exhibit no audio-text modality gap. Human validation confirms LLM judge reliability, and cross-benchmark analysis shows recovery quality as a distinct capability.
voice agents · interruption recovery · structured workflows · task fulfillment · audio-language models
A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization
The authors propose a BART-based hierarchical approach for Vietnamese abstractive multi-document summarization, introducing a golden summary-driven document condensation strategy to improve inter-stage correlation. Their method employs a two-phase process: individual document compression followed by aggregated summarization, augmented with external data sources to address Vietnamese data scarcity. The system achieves a ROUGE2-F1 score of 0.2468 on the VLSP 2022 benchmark while generating fluent outputs, with released supplementary training data for community use.
abstractive summarization · multi-document summarization · hierarchical approach · rouge score · vietnamese nlp
Analyzing the Narration Gap in LLM-Solver Loops
The paper identifies and analyzes the narration gap in LLM-solver loops, where formal guarantees from solvers can be lost during output narration. It models the loop as a verified decision procedure and evaluates five open-source models under prompt injection attacks. Results show certificate gating preserves solver soundness, but adversaries can invert verified conclusions across phrasings and channels, with hardened prompts reducing but not eliminating vulnerabilities.
llm-solver loop · narration gap · prompt injection · certificate gating · verified decision procedure
FlowFake: Liquid Networks for Audio Deepfake Detection
FlowFake introduces a Liquid Time-Constant (LTC) network for audio deepfake detection, addressing cross-dataset generalization failures caused by multi-timescale synthetic speech artifacts. The architecture employs learned ODE dynamics with per-neuron adaptive time constants (10ms-2s) to capture spectral and prosodic anomalies, achieving BIBO stability and O(dt^4) integration error with 34K parameters. Evaluated on ASVspoof2019-LA, FakeOrReal, InTheWild, and MLAAD, FlowFake achieves 75.29-79.97% accuracy in cross-domain tests, outperforming RawGAT-ST and Whisper-DF and matching Wav2vec2 at 0.01% of its parameter count.
audio deepfake detection · liquid time-constant networks · cross-dataset generalization · adaptive time constants · ode-based learning
Exploring Feature Extraction Technique Parameters for Acoustic Gunshot Classification
This paper systematically investigates feature extraction techniques for acoustic gunshot classification, addressing the generalization gap in current literature. Using a dataset of 23,000 gunshot recordings across 85 firearms and 21 calibers, the authors benchmark three feature extraction techniques with 12 unique parameter sets, evaluated via ResNet-18. Results show that selecting the optimal feature extraction technique improves top-1 accuracy by up to 20%, while parameter optimization within a technique yields an additional 4.7% accuracy gain. This work highlights the critical role of feature extraction in enhancing gunshot classification performance.
acoustic gunshot classification · feature extraction · resnet-18 · top-1 accuracy · parameter optimization
GDGU: A Gradient Difference-based Graph Unlearning Method for Cyberattack Localization in Electric Vehicle Charging Networks
The paper proposes GDGU, a gradient difference-based graph unlearning method for cyberattack localization in EV charging networks. It addresses data deletion requests by performing feature-level unlearning through first-order parameter correction, followed by batch-normalization recalibration and fine-tuning. Evaluated on IEEE 34-bus, 123-bus, and 8500-node networks with three GNN backbones, GDGU matches baseline localization utility while achieving 10-12× faster unlearning than retraining and superior memory efficiency.
graph unlearning · cyberattack localization · gradient difference · ev charging networks · parameter correction
Uncertainty Decomposition for Clarification Seeking in LLM Agents
The paper introduces a prompt-based uncertainty decomposition method for LLM agents, separating action confidence from request uncertainty to enable proactive clarification-seeking in underspecified tasks. The approach is evaluated on two new benchmarks (WebShop-Clarification and ALFWorld-Clarification) with 50% underspecified tasks, comparing against ReAct+UE and Uncertainty-Aware Memory across five LLM backbones. Results show 73% and 36% F1 improvements over baselines on ALFWorld-Clarification, with consistent gains across most backbones, demonstrating generalization beyond single-model performance.
uncertainty decomposition · clarification seeking · llm agents · underspecified tasks · prompt-based estimation
Review of Machine Learning Models for Solar Energetic Particle Prediction
The manuscript reviews machine learning models for solar energetic particle (SEP) prediction, addressing both scientific understanding and space technology protection. It systematically compares ML architectures, input features, and output formats across existing studies, while identifying datasets used for training. The analysis yields recommendations for future SEP prediction research, emphasizing methodological improvements and data standardization. Traditional physics-based simulations and empirical methods are contrasted with emerging ML approaches.
solar energetic particles · space weather prediction · machine learning architectures · particle acceleration · heliospheric physics
ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence
We introduce Integral Transform Network (ITNet), a unified architecture that subsumes convolutional networks, recurrent networks, and transformers through a learnable integral transform with a kernel implemented as an MLP. The kernel jointly depends on positions and features, enabling adaptive pairwise interactions and universal approximation of continuous operators. Practical innovations include tiled kernel fusion, importance-weighted Monte Carlo integration, and learned low-rank factorization for efficient computation. ITNet matches or exceeds specialized baselines on ImageNet-1K, GLUE, ModelNet40, VQA v2, and NLVR2, demonstrating that a single learned interaction mechanism can recover behaviors of all three architectural families.
integral transform · kernel fusion · monte carlo integration · low-rank factorization · universal approximator
PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models
PerceptionDLM introduces a multimodal diffusion language model for parallel region perception, addressing efficiency limitations of autoregressive MLLMs in multi-region captioning. The method combines PerceptionDLM-Base (a diffusion-based MLLM) with efficient prompting and structured attention masking to enable simultaneous processing of multiple masked regions at sequence and token levels. On the newly constructed ParaDLC-Bench, PerceptionDLM maintains competitive caption quality while achieving 3.2× speedup over sequential approaches, demonstrating diffusion models' potential for parallel visual perception.
multimodal diffusion · parallel region perception · structured attention masking · detailed localized captioning · inference efficiency
A Tool for the Synthesis of Adaptive Probabilistic Processors Based on the Ising Model
The authors present an automated tool for synthesizing probabilistic processors that solve combinatorial optimization problems via Ising model mapping. The system constructs Ising Hamiltonians, determines required p-bit counts based on problem characteristics, and adaptively selects between Gibbs Sampling, Simulated Annealing, Simulated Quantum Annealing, and cluster-based update algorithms. Benchmark experiments demonstrate improved convergence and flexibility over fixed-method approaches, while enabling systematic evaluation of probabilistic computing strategies for future MTJ-based hardware implementations.
ising model · probabilistic computing · combinatorial optimization · p-bits · quantum annealing
Techniques for Peak Memory Reduction for LoRA Fine-tuning of LLMs on Edge Devices
The paper introduces four techniques to reduce peak memory during LoRA fine-tuning of LLMs on edge devices: (1) quantized base models with runtime dequantization, (2) selective activation caching and disk offloading, (3) softmax approximation via token subset selection, and (4) logits masking. Evaluated on Llama-3.2 3B and Qwen-2.5 3B, these methods collectively achieve 26×–28× memory reduction while preserving model quality, enabling fine-tuning on resource-constrained hardware.
low-rank adaptation · quantization · activation checkpointing · softmax approximation · logits masking
Emergent Alignment
The paper introduces Emergent Alignment, a method enabling Large Language Models (LLMs) to self-assess and correct ethical misalignments in their outputs. By incorporating a conscience step for self-review and extending training loss with Direct Preference Optimization (DPO), the approach steers models away from unethical outputs without external judges. Empirical results demonstrate successful alignment in scenarios like fine-tuning and adversarial prompting, contrasting with prior Emergent Misalignment findings where models exhibited unethical behaviors under similar conditions.
emergent alignment · direct preference optimization · conscience step · ethical misalignment · self-correction
REVEAL++: Differentiable Phenotypic Grouping for Vision-Language Retinal Modeling of Alzheimer's Disease Risk
The paper introduces REVEAL++, a vision-language framework for Alzheimer's disease (AD) risk prediction from retinal images that reformulates phenotypic grouping as a continuous, differentiable process. Unlike prior discrete clustering approaches, it models inter-subject similarity via intra-modality embedding similarities in both retinal images and clinical risk profiles, enabling soft multi-positive contrastive learning through a continuous aggregation operator. Evaluated on UK Biobank data, REVEAL++ outperforms discrete group-based contrastive learning and standard vision-language baselines in incident AD prediction, demonstrating the benefits of learnable phenotypic structure.
contrastive learning · phenotypic grouping · vision-language alignment · retinal imaging · alzheimer's disease
LLM Doesn't Know What It Doesn't Know: Detecting Epistemic Blind Spots via Cross-Model Attribution Divergence on Clinical Tabular Data
The study investigates epistemic uncertainty in LLMs applied to clinical tabular data through cross-model attribution divergence analysis between Qwen 2.5 7B and XGBoost. Key findings include: (1) LLM verbalized confidence is uninformative (0.856-0.937 range regardless of 49-75.3% accuracy), (2) an inverse difficulty effect where LLM accuracy drops when XGBoost is highly certain, (3) few-shot examples and SHAP-derived features reduce Attribution Disagreement Score from 1.54 to 0.38 and improve accuracy from 49% to 75.3%, and (4) a cross-model calibrator reduces expected calibration error from 0.254 to 0.080. The work proposes a path toward epistemic self-awareness for LLMs on structured data.
epistemic uncertainty · attribution divergence · structured clinical data · verbalized confidence · cross-model calibration
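The expected calibration error (ECE) figures quoted above can be read against the usual binned definition: the weighted average, over confidence bins, of the gap between mean confidence and accuracy. The sketch below assumes equal-width bins and a default of 10 bins; the function name and binning choices are illustrative assumptions, since the summary does not state how the study computed ECE.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A model that reports 0.9 confidence on every example but is right only half the time scores an ECE of about 0.4, which is the kind of overconfidence the verbalized-confidence finding describes.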
DeXposure-Claw: An Agentic System for DeFi Risk Supervision
DeXposure-Claw introduces a forecast-grounded agentic system for decentralized finance (DeFi) risk supervision, addressing limitations of general-purpose LLM agents in handling networked credit risks. The system integrates DeXposure-FM, a graph time-series foundation model for exposure network forecasting, deterministic monitors for generating alerts and scenario evidence, and data-health gates to constrain escalation. It emits auditable supervisory tickets with rationales. DeXposure-Bench, a six-axis evaluation harness, assesses tickets against regulator-aligned absolute-loss ground truth and false-intervention rates. Experiments on five years of weekly real data validate the system's effectiveness.
decentralized finance · graph time-series · foundation model · deterministic monitors · false-intervention rate
Hidden Anchors in Multi-Agent LLM Deliberation
The paper introduces a dynamical systems model for multi-agent LLM deliberation that incorporates hidden internal beliefs (anchors) alongside social influence. It demonstrates that these anchors can be reconstructed from deliberation traces and explain non-convex behavior where agents' confidence exceeds initial belief bounds. The method validates anchor generalization across three open-weight model families, revealing a spectrum of anchor influence strength and position. Results show escape from convex hulls occurs only when anchors are sufficiently distant from initial opinions, necessitating closed-loop modeling.
multi-agent deliberation · hidden anchors · dynamical systems · opinion dynamics · convex hull
Concept Flow Models: Anchoring Concept-Based Reasoning with Hierarchical Bottlenecks
Concept Flow Models (CFMs) introduce a hierarchical bottleneck architecture to address information leakage in Concept Bottleneck Models (CBMs). CFMs replace flat bottlenecks with concept-driven decision trees, where each internal node focuses on localized subsets of discriminative concepts, progressively narrowing prediction scope. The framework constructs hierarchies from visual embeddings, distributes semantic concepts across hierarchy levels, and trains differentiable concept weights via probabilistic tree traversal. Experiments on diverse benchmarks show CFMs match flat CBMs' predictive performance while reducing effective concept usage, mitigating information leakage. CFMs provide stepwise decision flows, enabling transparent and auditable reasoning with hierarchical class structures.
concept bottleneck models · information leakage · decision tree · visual embeddings · probabilistic tree traversal
Can In-Context Learning Support Intrinsic Curiosity?
The paper investigates whether in-context learning (ICL) in sequence models can enable intrinsic curiosity by serving as update-free world models for exploration policies. The authors prove impossibility in general Markov decision processes due to biased reward estimation, but establish theoretical guarantees for non-temporal settings like active learning, where ICL-derived rewards asymptotically converge to true learning progress. Experiments in continuous and symbolic environments demonstrate successful training of curiosity-driven policies using ICL-based rewards.
in-context learning · intrinsic curiosity · markov decision processes · active learning · bayesian experimental design
Diffusion Language Models: An Experimental Analysis
This work presents a systematic evaluation of Diffusion Language Models (DLMs), comparing eight state-of-the-art architectures across eight benchmarks in reasoning, coding, translation, and knowledge tasks. The study analyzes generation quality and computational efficiency while controlling for inference-time factors like denoising steps, context length, block size, and parallel unmasking strategies. Results demonstrate that DLMs exhibit distinct performance-efficiency trade-offs highly sensitive to generation-time hyperparameters, with controlled experiments revealing architectural strengths and limitations.
diffusion language models · iterative denoising · parallel refinement · inference-time factors · performance-efficiency tradeoffs
Secure Coding Drift in LLM-Assisted Post-Quantum Cryptography Development: A Gamified Fix
The paper introduces Secure Coding Drift in Post-Quantum Cryptography (PQC), a novel socio-technical vulnerability model capturing the gradual degradation of secure coding practices due to prolonged reliance on LLM-generated code. To address this, the authors propose a gamified, LLM-augmented secure coding framework integrating adversarial evaluation, behavioural feedback, and security scoring into development workflows. This approach repositions LLMs as active security co-pilots, aiming to mitigate risks in AI-assisted PQC implementation.
post-quantum cryptography · secure coding drift · large language models · adversarial evaluation · behavioural feedback
Measuring Curriculum Alignment across Topical Coverage, Competency, and Cognitive Depth: A Longitudinal Framework Applied to CS2013 and CS2023
The authors present a human-in-the-loop pipeline for measuring curriculum alignment across successive computer science guidelines (CS2013 and CS2023), evaluating topical coverage, competency articulation, and cognitive depth. The method combines semantic retrieval (benchmarking seven retrievers, with reciprocal-rank-fusion performing best) with human validation (inter-rater κ=0.64-0.69), applied longitudinally to an accredited BSc program. Results show stable coverage (~50% of knowledge units) but a cognitive depth compliance drop (76% in CS2023 vs 95% in CS2013), revealing persistent gaps in parallel computing and systems fundamentals alongside guideline-driven changes.
curriculum alignment · semantic retrieval · knowledge units · cognitive depth · longitudinal analysis
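Reciprocal rank fusion, the best-performing retriever combination in the curriculum-alignment study, merges several ranked lists by summing 1/(k + rank) for each item across lists. The sketch below uses the commonly cited constant k = 60; the function name, the tie-breaking rule, and the example identifiers in the usage note are assumptions for illustration, not details taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Fuse ranked lists: each item scores the sum of 1 / (k + rank) over lists."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):  # ranks are 1-based
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by the string form of the item
    # so that the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], str(item)))
```

For example, fusing the three hypothetical retriever outputs `["a", "b", "c"]`, `["b", "a", "d"]`, and `["b", "c", "a"]` ranks `b` first, because it sits at or near the top of every list.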
Deontic Policies for Runtime Governance of Agentic AI Systems
The paper introduces AgenticRei, a deontic policy framework for runtime governance of LLM-driven autonomous agents, addressing limitations in current policy engines (XACML, Rego, Cedar) that lack obligation lifecycle management, meta-policy conflict resolution, and domain-specific reasoning. The method employs an OWL-based deontic language derived from the Rei framework, executed by an external logic engine to evaluate both tool invocations and inter-agent communications. Results demonstrate expressiveness for security/privacy constraints unmet by existing engines, and compatibility with industry standards such as A2AS.
deontic policies · obligation lifecycle · meta-policy conflict resolution · owl ontology · runtime governance
Scaling Generative Foundation Models for Chest Radiography with Rectified Flow Transformers
The authors introduce the first billion-parameter generative foundation model for chest radiograph synthesis, addressing generalization limitations in existing radiographic AI models. Their 1.3B-parameter Rectified Flow Transformer, trained on 1.6T tokens from a curated dataset of 1.2M radiographs with expert-guided metadata, enables controllable generation across demographics, acquisition views, and pathologies. The model achieves state-of-the-art synthesis fidelity, producing radiographs indistinguishable from real images according to clinical experts.
generative foundation model · rectified flow transformer · radiograph synthesis · controllable generation · clinical metadata
Playful Agentic Robot Learning
The paper introduces Playful Agentic Robot Learning, where Robotics Agent Teams (RATs) acquire reusable skills through self-directed play before downstream tasks. RATs autonomously propose exploratory tasks, execute Code-as-Policy programs, verify progress, diagnose failures, and distill successful executions into a persistent skill library. Experiments on LIBERO-PRO and MolmoSpaces show 20.6 and 17.0 percentage-point gains over CaP-Agent0, respectively. Learned skills improve RoboSuite and real-world transfer by 8.9 and 8.8 points without finetuning, demonstrating plug-and-play utility via in-context retrieval.
robotics agent teams · code-as-policy · skill acquisition · self-directed play · in-context retrieval
UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning
The paper introduces UNIEGO, a unified egocentric video encoder trained via hierarchical multi-teacher distillation to integrate knowledge from diverse viewpoints (ego-exo), modalities (RGB, depth, skeleton), and foundation models. The method employs representation-specific Proxy models to homogenize incompatible teacher features, followed by Selective Proxy Distillation (SPD) that dynamically selects reliable supervision per sample. Initializing UNIEGO as a convex combination of proxy parameters stabilizes training. Evaluated on three egocentric tasks (action recognition, retrieval, segmentation) across three benchmarks, UNIEGO outperforms naive multi-teacher baselines, demonstrating the efficacy of proxy-mediated knowledge transfer.
egocentric video · multi-teacher distillation · proxy models · selective distillation · representation learning
Optimal Deterministic Multicalibration and Omniprediction
The authors resolve an open problem in multicalibration by presenting the first deterministic predictor achieving minimax-optimal sample complexity of $\widetilde O(\varepsilon^{-3})$ for $\varepsilon$-multicalibration, eliminating the need for randomization previously required for optimal rates. Their method extends to produce deterministic predictors satisfying outcome indistinguishability (OI) for finite or finitely covered test collections. As applications, this yields optimal deterministic omnipredictors and panpredictors, addressing open problems from prior work on trustworthy machine learning.
multicalibration · deterministic predictor · outcome indistinguishability · omnipredictors · sample complexity
The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups
The paper introduces Lie-Algebra Attention, a novel attention mechanism where tokens are elements of a matrix Lie group $G$ rather than feature vectors. The method computes attention scores using the closed-form algebra norm of relative poses $\log(g_i^{-1} g_j)$, eliminating the need for learned kernels or representation-theoretic constructs like irreducible representations. This approach naturally handles non-compact, non-abelian affine groups (e.g., SE(2), SO(3), Aff(2)) that traditional vector-token methods cannot. Experiments show the method matches or outperforms learned MLP kernels while using 50-80x fewer parameters and maintaining strict invariance, unlike vector-token baselines which violate invariance by 5-12 orders of magnitude.
lie-algebra attention · matrix lie groups · relative pose · invariant kernels · non-abelian affine groups
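To make the relative-pose idea concrete, the sketch below treats tokens as SO(3) rotation matrices. For a relative rotation g_i^{-1} g_j, the norm of the logarithm in axis-angle coordinates is the rotation angle, which has the closed form arccos((tr - 1) / 2). Mapping that norm to attention weights through a softmax over negative squared angles, and the `temperature` parameter, are assumptions made for illustration; the summary does not state the paper's exact scoring rule, and all function names here are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def rot_z(t: float) -> Matrix:
    """Rotation by angle t about the z axis."""
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rot_x(t: float) -> Matrix:
    """Rotation by angle t about the x axis."""
    c, s = math.cos(t), math.sin(t)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def relative_angle(gi: Matrix, gj: Matrix) -> float:
    """Rotation angle of gi^{-1} gj, i.e. the axis-angle norm of its logarithm."""
    # For rotations gi^{-1} = gi^T, and trace(gi^T gj) is the elementwise dot product.
    trace = sum(gi[a][b] * gj[a][b] for a in range(3) for b in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding error
    return math.acos(cos_theta)


def attention_weights(tokens: List[Matrix], temperature: float = 1.0) -> List[List[float]]:
    """Row-wise softmax over negative squared relative angles (assumed scoring rule)."""
    weights = []
    for gi in tokens:
        logits = [-relative_angle(gi, gj) ** 2 / temperature for gj in tokens]
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

Because the score depends only on g_i^{-1} g_j, multiplying every token on the left by the same rotation leaves the weights unchanged up to floating-point error. This is the invariance property the summary highlights.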
Predictability as a Fine-Grained Measure for Privacy
The paper introduces privacy via predictability, a fine-grained framework measuring privacy leakage as an attacker's incremental predictive gain about sensitive information after observing algorithm outputs, given compromised data. The method incorporates attacker knowledge, stochastic data generation, and query families, using generalized method of moments (GMM) for asymptotic analysis under stationary, ergodic, mixing processes. Results show predictability and differential privacy (DP) are generally incomparable, but predictability implies mutual-information DP in worst-case scenarios, enabling finer-grained privacy control complementary to DP.
predictability · differential privacy · generalized method of moments · privacy leakage · stochastic process
Multi-Task Bayesian In-Context Learning
We propose a multi-task in-context learning framework for amortized hierarchical Bayesian predictive inference that explicitly encodes prior information as dataset prefixes. A transformer model is trained on sequences of prior and target tasks to adapt predictions across prior families, addressing limitations of existing approaches that lack explicit mechanisms for prior adaptation. Evaluations demonstrate that our method matches oracle Bayesian predictors in accuracy while being computationally efficient, particularly on out-of-meta-distribution priors and high-dimensional latent structures. The framework's practical utility is validated on a spatiotemporal temperature prediction benchmark.
bayesian inference · in-context learning · amortized inference · transformer · distribution shift
Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving
The paper introduces execution-state capsules, a graph-bound checkpoint and restore mechanism for complete LLM execution states in low-latency, small-batch, on-device physical-AI serving. The proposed FlashRT runtime manages contiguous static buffers without block-table indirection, enabling snapshot/restore of KV cache, recurrent state, and other execution metadata at graph boundaries. Evaluations on RTX 5090 show byte-exact restores with sub-millisecond latency, achieving 3.9x-27x TTFT speedup over cold prefill for 2k-16k contexts, while maintaining token-identical outputs under greedy decode.
execution-state capsules · kv-cache · graph-bound execution · low-latency serving · on-device ai
Probe-and-Refine Tuning of Repository Guidance for Coding Agents
The paper introduces probe-and-refine tuning, a method to iteratively improve repository guidance files (AGENTS.md) for LLM-based coding agents using synthetic bug-fix probes and single-shot LLM calls. This approach increases the resolve rate on SWE-bench Verified by 4.7 percentage points (33.0% vs. 28.3%) over static guidance and 7.5 points over unguided baselines, primarily by expanding coverage (14.5 pp more evaluable patches) rather than improving per-patch precision (~59%). Results show guidance enables productive use of larger step budgets, though tuning efficacy depends on the LLM's diagnostic capability.
probe-and-refine tuning · repository guidance · swe-bench · llm-based agents · resolve rate
Entropy Estimation in Multi-Qutrit Systems via Variational and Classical Neural Networks
This work systematically compares variational quantum algorithms (VQAs) and classical convolutional neural networks (CNNs) for von Neumann entropy estimation in multi-qutrit systems. For small systems (≤3 qutrits), 11 SU(3)-inspired ansatzes demonstrate that accuracy depends primarily on parameter count (optimized at ~120) with diminishing returns from additional entangling gates. For larger systems (2-5 qutrits), a CNN trained on 12.5% of tomography measurements achieves 90th-percentile errors of 0.13-0.16 nats, showing improved performance with system size and robustness to shot noise. Results indicate VQAs are preferable for small systems while CNNs scale better.
von neumann entropyvariational quantum algorithmsconvolutional neural networksqutrit systemsstate tomography
Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users
We propose leveraging implicit human feedback, specifically mouse trajectories and eye-gazing points, to improve LLM alignment. Existing methods rely on costly explicit feedback and ignore implicit signals. We introduce IFLLM, a dataset of 1336 multi-turn questions from 59 Mechanical Turk workers, capturing their mouse movements and eye gaze on LLM responses. Training a reward model with this implicit feedback boosts text-based reward model accuracy from 55% to 64% and triples response quality improvements when applied with Direct Preference Optimization (DPO) across eight LLMs.
llm alignmentimplicit feedbackmouse trajectorieseye-gazing pointsdirect preference optimization
Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology
The authors introduce RefRad2D, a bilingual (German/English) dataset of 1.2M CT/MR image-text pairs with automated spatial annotations and VQA subsets, enabling scalable training of radiology VLMs without manual labeling. Their model RadGrounder jointly performs report generation, VQA, and spatial grounding (bounding-box/segmentation) via LLM-based curation and segmentation. On the Slake and VQA-RAD benchmarks, RadGrounder matches specialized medical VLMs. Clinical data improves open-ended VQA and transferability, and grounding supervision preserves language quality with no VQA degradation.
vision-language modelsspatial groundingautomated segmentationradiology vqamultitask learning
Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution
The paper proposes Marginal Advantage Accumulation (MAA), a post-processing architecture for memory-driven agent self-evolution that addresses contradictory feedback in batch-style trace distillation. MAA enforces alignability and comparability conditions, constructs cross-batch differential signals via exponential moving average (EMA), and ensures traceability through semantic identity merging. Evaluated across 4 benchmarks and 4 target models, MAA achieves top performance in 14/16 settings, matches or surpasses online alternatives, and reduces optimization-phase token consumption by ~75% compared to batch-level baselines.
trace distillationmemory-driven agentsexponential moving averagesemantic identity mergingbatch optimization
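A minimal sketch for the MAA entry above, illustrating how an exponential moving average can turn per-batch scores into cross-batch differential signals. The function name, the scalar-score assumption, and the `alpha` value are hypothetical and not taken from the paper; this is not the authors' implementation.

```python
def ema_differentials(batch_scores, alpha=0.3):
    """Return (differences, final_baseline) for a sequence of per-batch scores.

    Each difference is the batch score minus the EMA baseline *before* that
    batch is folded in, so it measures the batch's marginal change.
    All names and the scalar-score setup are illustrative assumptions.
    """
    baseline = None
    diffs = []
    for score in batch_scores:
        if baseline is None:
            # First batch: nothing to compare against yet.
            diffs.append(0.0)
            baseline = float(score)
        else:
            diffs.append(score - baseline)
            # Fold the new batch into the running baseline.
            baseline = alpha * score + (1.0 - alpha) * baseline
    return diffs, baseline
```

A positive difference marks a batch that beat the running baseline, which is one way contradictory batch-level feedback could be compared on a common scale.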
Fisher-Geometric Sharpness and the Implicit Bias of SGD toward Flat Minima
The study introduces Fisher-geometric sharpness, a Riemannian measure of flatness grounded in the Fisher Information Matrix (FIM), which is invariant under function-preserving reparametrizations, addressing critiques of Euclidean sharpness measures. By formalizing mini-batch SGD's gradient noise as FIM-proportional and deriving the stationary distribution of the resulting stochastic differential equation, the authors prove exponential concentration of probability mass at Riemannian-flat minima. A PAC-Bayes generalization bound links this geometric bias to test performance. Experiments on MNIST and CIFAR-10 confirm that Fisher-geometric sharpness reliably tracks generalization, aligning with theoretical predictions.
fisher-geometric sharpnessriemannian geometrypac-bayesstochastic gradient descentreparametrization invariance
Agentic Symbolic Search: Characterizing PDEs Beyond Hand-crafted Expressions, Meshes, and Neural Networks
The paper introduces Agentic Symbolic Search (ASYS), a framework combining evolutionary search and gradient-based optimization to derive interpretable symbolic representations of PDE solutions. ASYS integrates mathematical theory, problem constraints, and search experience to generate testable differentiable programs, automating inductive-bias injection. It successfully recovers known analytical forms and constructs novel approximations for complex PDEs, including Allen-Cahn 2D dynamics and Keller-Segel chemotactic blow-up, where closed-form solutions were previously unavailable. The method demonstrates a new paradigm for PDE characterization, bridging gaps between analytical, numerical, and neural network approaches.
pdesymbolic regressionevolutionary searchgradient-based optimizationallen-cahn
Data Bias Mitigation under Coverage Constraints & The Price of Fairness
The paper extends bias mitigation frameworks by incorporating coverage constraints for intersectional subgroups, trading small approximation errors for improved data efficiency. It formulates bias mitigation as an integer linear program, optimizing over strategies while characterizing the price of fairness as a function of fairness tolerance. Evaluations on public datasets show that the framework preserves predictive accuracy across classifiers and highlights the necessity of coverage constraints for maintaining ML performance.
bias mitigationcoverage constraintsintersectional fairnessinteger linear programprice of fairness
SSH-Net: A Deep Neural Network for Predicting Failure Time Distribution Functions under Competing Risks with Application to GPU Data
The paper introduces SSH-Net, a Structured Segmented Hazard Deep Neural Network for predicting failure time distributions under competing risks. The method addresses challenges in complex engineered systems by structuring neural networks to align with data hierarchies, using separate sub-networks for different covariate groups. It outputs cause-specific hazard functions and employs a penalized log-likelihood loss function. Validation via simulation studies demonstrates improved accuracy, measured by Brier score, AUC, and RMSE of predicted cumulative incidence functions. The model's efficacy is further shown on Titan GPU failure data.
competing risksdeep neural networkhazard functionbrier scorecumulative incident function
Topological Data Analysis for High-Dimensional Dynamic Process Monitoring
A novel topological data analysis (TDA) approach for high-dimensional dynamic process monitoring is proposed, combining manifold representations of multivariate time-series data with neural ordinary differential equations (NODEs) to learn system evolution. The method extracts topological descriptors to summarize data structure and employs trajectory-based event detection, contrasting with reconstruction-based methods like principal component analysis and autoencoders, as well as Koopman autoencoders. Experimental results on industrial process data demonstrate effectiveness in detecting diverse event types.
topological data analysisneural ordinary differential equationsmanifold representationtrajectory-based detectionkoopman autoencoders
Evolutionary Two-Stage Hyperparameter Optimization Strategies for Physics-Informed Neural Networks
The authors propose a two-stage evolutionary hyperparameter optimization strategy for Physics-Informed Neural Networks (PINNs) to address unstable convergence and sensitivity to hyperparameters. Their method first performs low-fidelity training runs with truncated epochs to rapidly screen candidate configurations, followed by full training of the most promising candidates using gradient-based optimizers. Evaluated on Advection, Klein-Gordon, and Helmholtz equations, this approach achieves significantly lower mean error compared to standard training while maintaining computational efficiency. The evolutionary algorithm's population-based exploration effectively handles the heterogeneous, non-differentiable search space inherent in PINNs' hyperparameter optimization.
physics-informed neural networksevolutionary algorithmshyperparameter optimizationpartial differential equationsgradient-based optimizers
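A rough sketch of the screen-then-train pattern described in the entry above. Two assumptions are swapped in and should be read as such: a toy quadratic objective stands in for PINN training, and a randomly sampled population stands in for the evolutionary search. All function names and defaults are hypothetical.

```python
import random

def train(lr, epochs):
    """Toy stand-in for PINN training: gradient descent on f(w) = w**2."""
    w = 1.0
    for _ in range(epochs):
        w -= lr * 2.0 * w
    return w * w  # final loss

def two_stage_search(num_candidates=20, screen_epochs=5, full_epochs=200,
                     keep=3, seed=0):
    """Return (best_loss, best_lr) after low-fidelity screening and full training."""
    rng = random.Random(seed)
    # Sample a population of learning rates (log-uniform); an evolutionary
    # algorithm would instead mutate and recombine candidates over generations.
    population = [10 ** rng.uniform(-4, -0.5) for _ in range(num_candidates)]
    # Stage 1: cheap low-fidelity screening with truncated epochs.
    screened = sorted(population, key=lambda lr: train(lr, screen_epochs))
    finalists = screened[:keep]
    # Stage 2: full training only for the most promising candidates.
    results = [(train(lr, full_epochs), lr) for lr in finalists]
    return min(results)
```

The cost saving comes from stage 1: every candidate gets only `screen_epochs` of training, and the full budget is spent on `keep` finalists.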
HEPTv2: End-to-End Efficient Point Transformer for Charged Particle Reconstruction
HEPTv2 introduces an end-to-end point-transformer architecture for charged-particle tracking, addressing inefficiencies in graph-based and transformer approaches. The method combines a locality-aware point encoder using locality-sensitive hashing with a sectorized track decoder, enabling direct hit-to-track prediction without intermediate graph construction or filtering. On the TrackML benchmark, HEPTv2 achieves 98.6% double-majority tracking efficiency at a 0.8% fake rate, with ∼15 ms inference time and 0.4 GB peak memory per event on an NVIDIA A100 GPU. It improves efficiency by 4.5% over prior transformers and 1.1–2.2% over graph-based pipelines while reducing latency by factors of 7 and 38–52, respectively, demonstrating suitability for HL-LHC reconstruction.
point-transformercharged-particle trackinglocality-sensitive hashingsectorized decodingend-to-end optimization
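For readers unfamiliar with locality-sensitive hashing, the sketch below shows the random-hyperplane variant: nearby points tend to share a hash code and land in the same bucket. The summary above does not say which hash family HEPTv2 uses, so the hyperplane scheme and all names here are assumptions for illustration only.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random hyperplane normals with Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def lsh_code(point, hyperplanes):
    """One bit per hyperplane: which side of the plane the point falls on."""
    return tuple(
        1 if sum(p * h for p, h in zip(point, plane)) >= 0.0 else 0
        for plane in hyperplanes
    )

def bucket_points(points, hyperplanes):
    """Group point indices by hash code; attention could then run per bucket."""
    buckets = defaultdict(list)
    for idx, point in enumerate(points):
        buckets[lsh_code(point, hyperplanes)].append(idx)
    return dict(buckets)
```

Restricting pairwise comparisons to points within a bucket is what avoids building an explicit neighbor graph over all hits.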
Sparsity, Superposition, and Forgetting: A Mechanistic Study of Representation Retention in Continual Learning
The study introduces a controlled framework to analyze representation retention in continual learning (CL), isolating mechanisms of forgetting through synthetic tasks with tunable sparsity and feature overlap. Using a generator-separator pipeline, the authors measure representation strength and superposition, fitting sparse dynamical relations (via SINDy) to model retention dynamics. Key findings include: (1) superposition increases over time with transient dips at task boundaries, (2) higher sparsity induces superposition but does not always cause forgetting if representations remain strong, and (3) task-level effective rank grows with sparsity, indicating broader capacity usage. The work provides falsifiable hypotheses and diagnostic tools for CL.
continual learningrepresentation retentionfeature sparsitysuperpositioneffective rank
Neural network surrogates with uncertainty quantification for inverse problems in partial differential equations
We introduce DeepGaLA, a neural-network surrogate for differential equation solvers that provides uncertainty-aware predictions to address computationally expensive Bayesian inverse problems. The method incorporates differential-equation constraints, including nonlinear settings, and employs delayed-acceptance Markov chain Monte Carlo for posterior approximation diagnostics. Experimental results demonstrate that DeepGaLA achieves forward-model approximation accuracy comparable to Gaussian-process surrogates while maintaining better efficiency as parameter dimension grows. This enables scalable and reliable Bayesian inference for inverse problems in complex systems with limited training data.
neural-network surrogateuncertainty quantificationinverse problemsbayesian inferencedifferential equations
On the Redundancy of Timestep Embeddings in Diffusion Models
The work challenges the necessity of explicit timestep embeddings in diffusion models, demonstrating theoretically and empirically that global optima can be achieved without temporal conditioning. Analyzing U-Net and Diffusion Transformer architectures on CelebA and CIFAR-10, the authors show time-agnostic models maintain structural fidelity and sometimes outperform conditioned counterparts in FID, precision, and recall. Results suggest architectures can implicitly infer noise scales from corrupted inputs, rendering explicit conditioning redundant under specific assumptions.
diffusion modelstimestep embeddingsu-netdiffusion transformerfid
Pseudo-Feature Padding: A Lightweight Defense Against False Data Injection in Power Grids
The paper introduces Pseudo-Feature Padding, a lightweight defense framework against False Data Injection Attacks (FDIA) in Cyber-Physical Systems (CPS). The method enhances Deep Neural Networks (DNNs) by adding an input layer that pads input samples with pseudo-feature values derived from the statistical distribution of inputs, increasing dimensionality in a randomized, data-aware manner. This approach makes adversarial attacks computationally infeasible due to non-transferable perturbations and unpredictable padding structure. Evaluated on IEEE 14-bus, 30-bus, 118-bus, and 300-bus power grid systems, the framework significantly improves model robustness with negligible performance impact, effectively mitigating attacks that bypass conventional defenses.
false data injection attacksdeep neural networkscyber-physical systemspseudo-feature paddingstate estimation
Direct Advantage Estimation for Scalable and Sample-efficient Deep Reinforcement Learning
The work extends Direct Advantage Estimation (DAE) to partially observable domains and reduces its computational overhead via discrete latent dynamics models. The method modifies DAE's theoretical framework for partial observability while approximating transition probabilities efficiently. Evaluations on the Arcade Learning Environment demonstrate DAE's scalability with function approximator capacity and maintained sample efficiency.
direct advantage estimationpartial observabilitylatent dynamics modelssample efficiencyarcade learning environment
The Significance of Style Diversity in Annotation-Free Synthetic Data Generation
We propose an annotation-free framework for synthetic dialogue generation that relies solely on intent definitions, eliminating the need for human-annotated seed data. The method incorporates topic and style attributes to enhance diversity, introduces two novel post-hoc stylization models (Univ and Exam) for human-like linguistic variation, and employs LLM-as-a-judge filtering for quality control. Experiments on industrial and public datasets show the approach achieves up to 93.3% of human-annotated data performance, with style diversity proving more critical than topic diversity for preventing spurious stylistic correlations. Results indicate that incorporating style attributes during generation outperforms post-hoc adaptation.
synthetic data generationintent classificationstylization modelsllm-as-a-judgestyle diversity
Towards Modality-imbalanced Federated Graph Learning: A Data Synthesis-based Approach
The paper proposes FedMGS, a data synthesis approach for modality-imbalanced federated graph learning (MM-FGL) addressing both client-level and node-level modality imbalances. The method employs an availability-aware graph encoder to prevent contamination from missing modalities, a prototype-guided latent semantic synthesizer for cross-client semantic alignment, and a reliability-calibrated fusion mechanism for representation integration. Experiments across four tasks demonstrate FedMGS outperforms baselines by up to 17.41% while maintaining efficient performance tradeoffs.
federated graph learningmodality imbalancelatent semantic synthesisgraph encoderrepresentation fusion
Judging to Improve: A De-biased VLM-as-3D-Judge Protocol for Single-Image 3D Generation
This paper introduces a de-biased VLM-as-3D-judge protocol for optimizing single-image 3D generation, focusing on furniture assets without human labels. The method employs distinct training (Qwen2.5-VL-7B) and evaluation (InternVL3-8B) judges to prevent circularity, incorporates position-bias correction, and addresses failure modes like image overload and geometry-hiding splat renders. Results show that lightweight parameter-efficient adaptation matches but does not exceed the performance of a strong baseline, with win-rates ranging from 0.50 to 0.94 across six adaptation methods. The study highlights that exceeding baseline performance requires more than lightweight PEFT on public data, and the protocol remains reusable for future optimization tasks.
vlm-as-3d-judgeparameter-efficient adaptationposition-bias correctiongeometry-hiding splat rendersquality-contrastive construction
Train, Retrieve, or Both? A Four-Arm Head-to-Head for Correct Statutory Citation on the Ontario Residential Tenancies Act
This work evaluates methods for generating correct statutory citations from the Ontario Residential Tenancies Act, comparing fine-tuning, retrieval, and hybrid approaches on Qwen2.5-7B-Instruct. A four-arm experiment tests base zero-shot, LoRA SFT-only, RAG-only, and SFT+RAG hybrid configurations, measuring exact-match citation accuracy on a small real-world dataset. Results show retrieval is essential to eliminate hallucinations, with the SFT+RAG hybrid achieving 0.481 exact-match accuracy and zero hallucinations. The hybrid's advantage stems from SFT improving provision selection robustness against high-recall candidate sets. Notably, this approach outperforms larger specialized retrieval pipelines without requiring additional data or specialized models, though it falls short of the 0.70 target.
statutory citationfine-tuningretrievalhallucinationexact-match
On the Variance of Temporal Difference Learning and its Reduction Using Control Variates
The paper analyzes variance reduction mechanisms in temporal difference (TD) learning, showing it aggregates over more independent trajectories than Monte Carlo (MC) methods. Using phased tabular settings, it proves TD's asymptotic variance is upper-bounded by MC estimators and shorter-horizon updates yield lower variance. It frames Direct Advantage Estimation (DAE) as a regression-adjusted control variate, achieving tighter variance bounds than TD in large-sample regimes. Numerical experiments validate these theoretical findings in controlled environments.
temporal difference learningcontrol variatesvariance reductiondirect advantage estimationtabular representation
Critical Percolation as a Synthetic Data Model for Interpretability
The authors propose critical mean-field percolation clusters as a synthetic data model for interpretability research, addressing limitations of existing toy models by incorporating hierarchical, multi-scale structure. The method generates sparse, fractal clusters with power-law size distributions, using latent variables to model taxonomic hierarchies, and provides an almost linear-time sampling algorithm via mappings to random trees and additive coalescence. Experiments demonstrate that ground-truth latent variables are linearly decodable from neural network activations, establishing percolation data as an analytically tractable benchmark with sparsity and self-similarity.
critical percolationinterpretabilitysynthetic datafractal clusterslatent variables
Quantum ring all-reduce: communication and privacy advantages for distributed learning
The paper introduces a quantum-enhanced ring all-reduce protocol for distributed learning, achieving both communication efficiency and information-theoretic privacy. By leveraging pre-shared entanglement and superdense coding, the method reduces per-link communication by a factor of two while enabling composable ε-secure aggregation via verified GHZ states. Results show quadratic and exponential quantum advantages in gradient conflict detection tasks: for GapIP_τ, communication scales as Õ(τ⁻¹ log P) qubits vs. Õ(min(τ⁻², P)) bits; for TieAudit_ε, Ω(√P) bits are required classically versus O(ε⁻² log P) qubits quantumly.
quantum communicationring all-reducesuperdense codinggradient conflict detectionghz states
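The factor-of-two saving mentioned above rests on superdense coding, which sends two classical bits by transmitting one qubit of a pre-shared Bell pair. The sketch below simulates the textbook two-qubit protocol with a plain state vector; it is not the paper's ring protocol, and the function names are illustrative.

```python
import math

S = 1.0 / math.sqrt(2.0)

# State vector index is 2*q0 + q1, where q0 is the sender's qubit.

def apply_x_q0(state):
    """Pauli X on q0: swap amplitudes of indices i and i ^ 2."""
    return [state[i ^ 2] for i in range(4)]

def apply_z_q0(state):
    """Pauli Z on q0: flip the sign where q0 == 1 (indices 2 and 3)."""
    return [amp if i < 2 else -amp for i, amp in enumerate(state)]

def apply_cnot(state):
    """CNOT with control q0 and target q1: swap |10> and |11>."""
    return [state[0], state[1], state[3], state[2]]

def apply_h_q0(state):
    """Hadamard on q0."""
    a00, a01, a10, a11 = state
    return [S * (a00 + a10), S * (a01 + a11), S * (a00 - a10), S * (a01 - a11)]

def superdense_send(a, b):
    """Encode bits (a, b) on one qubit of a Bell pair and decode them."""
    state = [S, 0.0, 0.0, S]  # shared Bell pair (|00> + |11>) / sqrt(2)
    if b:
        state = apply_x_q0(state)
    if a:
        state = apply_z_q0(state)
    # The sender transmits q0; the receiver holds both qubits and decodes.
    state = apply_h_q0(apply_cnot(state))
    # After decoding, the state is a single basis state; read it off.
    idx = max(range(4), key=lambda i: abs(state[i]) ** 2)
    return idx >> 1, idx & 1
```

Each call moves a single qubit from sender to receiver yet recovers both input bits, which is where the halving of per-link traffic comes from.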
Constrained hybrid modelling to predict microbial dynamics and organic matter turnover in soil systems
The authors present a hybrid modeling framework that predicts biokinetic parameters for soil organic matter turnover models from metagenomic functional traits, integrating neural networks with ecological constraints. The method combines genomic data with process-based modeling, using neural networks to map traits to parameters while enforcing theoretical constraints for unobserved variables. Evaluated on synthetic and real datasets, the approach outperforms baselines and effectively learns dynamics of unmeasurable components, even with limited training data.
hybrid modelingbiokinetic parametersmetagenomic traitssoil organic matterneural networks
Quantum-classical physics-informed Kolmogorov-Arnold networks for PDEs
The authors introduce QCPIKAN, the first quantum-classical hybrid physics-informed Kolmogorov-Arnold network for solving partial differential equations (PDEs). The method combines Chebyshev-polynomial KAN layers with parameterized quantum circuits, embedding physical constraints into the training loss for consistency. Theoretical analysis demonstrates exponential convergence for high-frequency errors and effective numerical dispersion mitigation. Evaluated on three porous media seepage scenarios (single-phase flow, component transport, two-phase flow), QCPIKAN outperforms existing quantum-classical physics-informed neural networks in global accuracy, local error control, dynamic tracking, and displacement front localization.
kolmogorov-arnold networkquantum-classical hybridphysics-informed learningpartial differential equationschebyshev polynomial
Recurrent neural networks approximate continuous functions
The paper demonstrates that a single ReLU recurrent neural network (RNN) with fixed weights and hidden dimension can uniformly approximate any continuous function on [-1,1] by adjusting runtime alone. The key innovation is the Turing machine with neural units (TMNU), an intermediate model enabling polynomial approximation schemes while maintaining compatibility with RNN constraints. Theoretical results include explicit bounds on hidden dimension and weight magnitude, with convergence rates mirroring polynomial approximation rates. Minimax lower bounds establish runtime as a fundamental resource in this fixed-network paradigm.
recurrent neural networksfunction approximationturing machine with neural unitsminimax boundspolynomial convergence rates
A Model-Driven Approach for Developing Families of Reinforcement Learning Environments
The paper proposes a model-driven approach for generating families of reinforcement learning (RL) environments to address the labor-intensive manual development process. The method employs a hybrid genetic algorithm combining population-based global search and heuristic local search, with mutations and constraints expressed as model transformations operationalized by a model transformation engine. The approach is validated in a wildfire mitigation scenario and curriculum learning paradigm, demonstrating its effectiveness for scalable environment family generation.
reinforcement learninggenetic algorithmmodel transformationcurriculum learningenvironment variants
Statistical Properties of Training & Generalization
The article analyzes deep learning's divergence from classical statistical intuitions through a physics-informed lens, focusing on neural scaling laws and their interaction with domain-specific constraints. It systematically examines model construction choices and inductive biases relevant to physics applications. Key findings highlight how scaling behaviors emerge despite statistical expectations, with particular attention to mechanisms enabling superior real-world performance.
neural scaling lawsinductive biasesphysics-informedstatistical intuitionsdeep learning
Shifting-based Optimizable Linear Relaxations for General Activation Functions
The paper introduces SLiR (Shifting-based Linear Relaxations), a method for generating optimizable linear relaxations of neural network activation functions without requiring hand-crafted relaxations for each function. SLiR parameterizes relaxations by slope and computes offsets via a shifting procedure, ensuring sound bounds over the input domain while only requiring a Lipschitz constant or set of critical points. Experiments demonstrate SLiR produces tight relaxations across diverse activation functions and verifies up to 7.8x more properties than state-of-the-art methods.
linear relaxationsactivation functionsneural network verificationlipschitz constantformal guarantees
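One plausible reading of "parameterize by slope, compute offsets by shifting" is sketched below for tanh, using its critical points. For a chosen slope a, the gap g(x) = tanh(x) - a·x attains its extremes on [l, u] at the endpoints or where tanh'(x) = a, so shifting the line a·x up and down by those extremes gives sound bounds. This is an illustrative assumption, not SLiR's actual procedure, and the function name is hypothetical.

```python
import math

def shifted_tanh_relaxation(lower, upper, slope):
    """Return (lo_off, hi_off) with
    slope*x + lo_off <= tanh(x) <= slope*x + hi_off for all x in [lower, upper].
    """
    def gap(x):
        return math.tanh(x) - slope * x

    # Extremes of a differentiable function on a closed interval occur at
    # the endpoints or at interior critical points.
    candidates = [lower, upper]
    if 0.0 < slope <= 1.0:
        # tanh'(x) = 1 - tanh(x)**2 = slope  =>  tanh(x) = +/- sqrt(1 - slope)
        root = math.sqrt(1.0 - slope)
        if root < 1.0:
            c = math.atanh(root)
            candidates += [x for x in (-c, c) if lower < x < upper]
    values = [gap(x) for x in candidates]
    return min(values), max(values)
```

Because the slope stays a free parameter, a verifier could optimize it per neuron while the offsets are recomputed to keep the bounds sound.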
Integrating national forest inventory, airborne lidar, and satellite imagery for wall-to-wall mapping of forest structure with computer vision
The VibrantForests framework introduces a satellite-based forest structure model for wall-to-wall mapping of forest attributes across the contiguous United States at 10-meter resolution. The model integrates national forest inventory data, airborne lidar samples, and satellite imagery to concurrently estimate canopy cover, canopy height, aboveground live tree biomass, basal area, and quadratic mean diameter. Results demonstrate improved predictive capability across diverse forest conditions, reducing saturation and regression-to-mean behaviors common in passive-sensor models. This approach addresses key limitations in large-area forest and wildfire planning by providing coherent, annually updated estimates of management-relevant attributes.
forest structure modellidar-derived sampleswall-to-wall mappingregression-to-meanmanagement-relevant attributes
Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random
The paper introduces a missingness-aware off-policy evaluation (OPE) method for finite-horizon Markov decision processes where rewards are missing not at random (MNAR). By formalizing a reward-dependent propensity model and using future states as shadow variables, the authors identify full-data conditional mean rewards via a bridge function estimated through min-max optimization. Their Fitted-Q-Evaluation-style estimator propagates recovered rewards while accommodating target policies dependent on missingness indicators. Theoretical analysis shows consistency and finite-sample error bounds, with experiments demonstrating superior performance over baselines on simulated and MIMIC-III Sepsis datasets.
off-policy evaluationmissing not at randommarkov decision processesshadow variablesbridge function
Effective Dimension Governs Generalization in Quantum Kernel Vision Models
The paper demonstrates that effective dimension (d_eff) of quantum feature kernels governs generalization in quantum vision models, unifying two empirical phenomena: (i) entanglement-enhanced generalization and (ii) noise-induced accuracy improvements. Through spectral analysis of depolarized kernels and amplitude damping, the authors show d_eff contraction acts as ridge-like regularization in overfitting regimes. Key results include exact kernel decomposition under depolarization (d_eff→1), a 13% test accuracy boost via noise injection, and a capacity/alignment risk framework. Experiments validate monotonic d_eff contraction in entangled systems up to 12 qubits.
quantum kerneleffective dimensiondepolarizing noisespectral decompositionentanglement structure
Computational Methods and Challenges in Cell-Free DNA Analysis for Multi-Cancer Early Detection
The article reviews computational methods (2022-2025) for cell-free DNA (cfDNA) analysis in multi-cancer early detection (MCED), emphasizing fragmentomics and epigenetic feature extraction. It compares classical statistical methods, machine learning approaches, and deep learning frameworks (including autoencoder-based models), evaluating biological interpretability, validation strategies, and clinical readiness. Findings indicate multimodal ensemble approaches show highest clinical promise, though standardization of evaluation protocols remains critical for comparative assessment. Technical, computational, and methodological challenges are systematically categorized, with open problems identified for future research.
cell-free dnamulticancer early detectionfragmentomicsepigenetic featuresautoencoder-based models
Predicting gestational age at birth in the context of preterm birth from multi-modal fetal MRI
The study presents a machine learning pipeline for predicting gestational age (GA) at birth from multi-modal fetal MRI data, addressing preterm birth as a regression problem for the first time. Using data from 333 control and 93 preterm cases, the pipeline incorporates imputation, feature selection, and regression models, achieving an R² score of 0.13 and MAE of 2.74 weeks. Key features include cervical length and placental T2* values. Performance metrics show 0.77 accuracy, 0.59 sensitivity, and 0.82 specificity in classifying GA predictions. The work demonstrates feasibility through 10-fold cross-validation and identifies critical biomarkers for preterm birth risk assessment.
gestational agepreterm birthmulti-modal mrifeature selectionregression models
Multi-Modal Contrastive Learning for Implicit Earth Embeddings via Location Tying
The authors propose two multimodal contrastive learning architectures, Multimodal Embedding via Location Tying (MELT) and Sequential Alternating Location Training (SALT), to address spatial prediction tasks limited by scarce labeled data. These methods extend contrastive learning beyond two modalities using unpaired geospatial data. Both architectures match the performance of the strongest two-modality baseline (SATCLIP) across four downstream tasks, though increasing modality count does not consistently improve results, indicating limitations in the location encoder. MELT demonstrates more stable training than SALT, offering a stronger basis for future scaling.
contrastive learningmultimodal embeddinglocation tyinggeospatial dataself-supervised pre-training
PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors
The paper introduces PASQA, a pitch-accent-focused speech quality assessment model addressing limitations of conventional mean opinion score (MOS) predictors in detecting localized pitch-accent errors. The method constructs a synthetic Japanese accent-error dataset using an accent-controllable TTS system, computes pseudo accent-quality scores, and employs self-supervised representations with mora-conditioned fusion, ranking loss, accent-error localization, and speaker-invariant training. Results show PASQA outperforms baseline models in preserving accent-error severity ordering (achieving high accuracy on seen/unseen speakers) and aligns better with human judgments of accent correctness.
pitch-accentspeech quality assessmentmean opinion scoretext-to-speechself-supervised learning
The Correctness Illusion in LLM-Generated GPU Kernels
The study exposes a correctness illusion in LLM-generated GPU kernels, where benchmarks relying on fixed-shape, small-sample allclose-style checks fail to detect transcription errors. Using a controlled corpus of 24 Triton and CPU kernels (15 correct, 9 buggy), the authors employ op-schema-aware seeded fuzzing with a high-precision CPU reference and per-op absolute tolerances. The method correctly identifies all 9 buggy kernels and validates all 15 correct controls without precision loss. Extending the corpus to 26 ops and testing across five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL) yields consistent results: 10 illusions detected and 16 controls verified.
gpu kernelsseeded fuzzingtranscription errorsabsolute tolerancescontrolled corpus
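The gap between a fixed-shape check and seeded fuzzing can be shown with a small pure-Python stand-in. The "kernel" below is a hypothetical blocked reduction with a dropped-tail bug, `math.fsum` plays the role of the high-precision reference, and the tolerance and shape range are assumed values; none of this is the authors' harness or corpus.

```python
import math
import random

def buggy_sum(values, block=4):
    """Blocked reduction with a transcription-style bug: the tail is dropped."""
    total = 0.0
    full = len(values) - len(values) % block
    for start in range(0, full, block):
        total += sum(values[start:start + block])
    return total

def fixed_shape_check(kernel, n=8, atol=1e-9):
    """Benchmark-style check: one fixed shape that happens to divide evenly."""
    data = [float(i) for i in range(n)]
    return abs(kernel(data) - math.fsum(data)) <= atol

def seeded_fuzz_check(kernel, trials=50, atol=1e-9, seed=0):
    """Vary shape and contents under a fixed seed; compare to math.fsum."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, 33)
        data = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if abs(kernel(data) - math.fsum(data)) > atol:
            return False
    return True
```

Here `fixed_shape_check(buggy_sum)` passes because 8 is a multiple of the block size, while the fuzzed shapes expose the dropped tail; the built-in `sum` passes both checks.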
Pose6DAug: Physically Plausible Multi-view Object Swapping for Robot Data Augmentation
Pose6DAug introduces a failure-driven data augmentation framework for vision-language-action (VLA) policies, enabling targeted demonstrations for failure modes without new data collection. The method leverages successful episodes, swapping manipulated objects while preserving physically valid action trajectories and calibrated multi-view observations. It operates in 3D, using temporally coherent 6D pose trajectories to ensure geometrically consistent renderings across all camera views. Fine-tuning VLA policies with Pose6DAug-augmented data improves success rates by 16.5% on novel objects compared to state-of-the-art baselines, while maintaining in-distribution performance. This demonstrates the efficacy of multi-view, physically consistent augmentation for scalable VLA generalization.
vision-language-actiondata augmentation6d posemulti-view consistencyphysically plausible
When Calibration Fails the Vulnerable Hospital: Federated Conformal Risk Control via Risk-Curve Shrinkage
The paper introduces a shrinkage-based federated conformal risk control (CRC) protocol to address coverage violations in multi-institutional medical imaging. Standard pooled CRC fails at 40% of institutions (worst-case +7.8pp false-negative rate), while local CRC inflates prediction sets 83x. The proposed method transmits empirical risk curves (G scalars) to compute shrinkage-regularized thresholds, with hyperparameter n0=19 balancing coverage (2.7/20 violations) and efficiency (2.0x stretch). Lagrangian optimization proves ineffective, and finite-sample correction is critical (3x violation increase if omitted). Validated on FeTS-2022 brain tumor data (1,251 subjects, 20 sites) with four coverage targets.
federated conformal risk controlrisk-curve shrinkagemulti-institutional calibrationfalse-negative rateempirical risk curves
EFIQA: Explainable Fundus Image Quality Assessment via Anatomical Priors
The paper introduces EFIQA, an explainable fundus image quality assessment framework that eliminates the need for quality-labeled training data by leveraging anatomical priors. The method employs a two-stage approach: (1) unsupervised anomaly detection via masked anatomical inpainting to identify missing vasculature regions, followed by (2) knowledge distillation into a shallow adapter that maps features from a frozen foundation model to spatial quality maps. Evaluations across external datasets show superior performance and explainability compared to supervised methods, achieving generalization across varying quality criteria without dataset-specific labels.
fundus imagingquality assessmentanatomical priorsunsupervised anomaly detectionknowledge distillation
Quantile of Means: A Bonus-Free Ensemble Method for Minimax Optimal Reinforcement Learning
The paper proposes a quantile-based ensemble method for finite-horizon Markov Decision Processes (MDPs), providing theoretical justification for ensemble-based exploration in Reinforcement Learning (RL). The method extends a recent ensemble-based approach for Multi-Armed Bandits to RL, eliminating the need for count-based uncertainty estimates. It achieves minimax optimal variance-dependent regret bounds, offering a practical and theoretically grounded alternative to traditional count-based exploration methods.
reinforcement learningmarkov decision processesensemble methodsquantile estimationregret bounds
Beyond Averaging in John Ellipsoid Approximation: High-Accuracy Algorithms in the Leverage-Score Model
The paper presents a refined analysis of leverage-score algorithms for approximating the John ellipsoid of symmetric polytopes, separating the complexity into certification, identification, and accuracy costs. By reformulating the problem as D-optimal-design and using a Frank-Wolfe gap framework, the authors show that the historical ε⁻¹ dependence is an artifact of certification via uniform averaging. They demonstrate that warm-started accelerated methods achieve (1+ε)-approximation in C(A) + O(√κ log(1/ε)) queries, with subsequent damped Newton steps requiring only O(log log(1/ε)) iterations. The key open problem remains condition-free identification of the optimal face.
john ellipsoidleverage-score algorithmsd-optimal-designfrank-wolfe gapself-concordant minimization
What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis
The paper presents an information-theoretic analysis of supervision in Latent Chain-of-Thought (CoT), identifying dual collapse (gradient attenuation and representational drift) as key failure modes. It decomposes process supervision into Trajectory Supervision (dense stepwise signals) and Space Supervision (manifold preservation), showing generative reconstruction outperforms geometric compression in maintaining information capacity. The proposed Unified Latent Probe (ULP) quantifies mutual information between latent trajectories and explicit reasoning steps, revealing Information-Performance Binding where reasoning accuracy correlates with preserved information fidelity. Findings advocate shifting supervision strategies from geometric imitation to mutual information maximization.
latent chain-of-thoughtinformation-theoretic analysistrajectory supervisionmutual informationrepresentational drift
Optimal Coarse Correlated Equilibria in Mean Field Games: Linear Programming and No-Regret Learning
The paper introduces optimal coarse correlated equilibria (CCE) for continuous-time mean field games, where a moderator selects CCEs optimizing a system-level criterion distinct from individual players' objectives. The authors develop a linear programming formulation to characterize these equilibria, proving existence and establishing connections to the probabilistic setting. A no-regret primal-dual learning algorithm is proposed, leveraging a Lagrangian reformulation of external-regret constraints, with explicit convergence rates provided. Numerical experiments validate the approach.
mean field gamescoarse correlated equilibrialinear programmingno-regret learningprimal-dual algorithm
PaAno+: Multiscale Encoding and Cross-Variable Attention for Time Series Anomaly Detection
The study introduces PaAno+, a lightweight time-series anomaly detection model addressing computational overhead and feature extraction limitations in existing approaches. It employs multiscale convolutional encoding with differentiated receptive fields, cross-variable fusion attention for inter-variable dependency modeling, and a temporal patch-window sorting pretext task optimized via triplet loss. Experiments on TSB-AD show state-of-the-art performance in both univariate and multivariate settings, with significant improvements in VUS-PR metrics, while maintaining computational efficiency for real-time deployment.
time-series anomaly detectionmultiscale encodingcross-variable attentionpatch-oriented learningtriplet loss
Comparative Study of Neural Surrogate Architectures for Autoregressive Prediction of Internal Battery States
This study systematically compares four neural network architectures (MLP, ResNet, U-Net, FNO) as autoregressive state-transition operators for predicting internal states in lithium-ion batteries using the Doyle-Fuller-Newman model. The architectures are trained under a unified framework with multi-step unrolling and current-conditioning to isolate spatial inductive bias effects. Results show that U-Net achieves a mean final-step nRMSE of 3% across all internal state variables after 300-step autoregressive rollouts, outperforming other architectures while providing a 5.38x speed-up over numerical solvers. The findings emphasize the importance of spatial inductive bias in developing efficient surrogates for battery management systems and digital twins.
autoregressive predictionspatial inductive biasdoyle-fuller-newman modelneural surrogatebattery management systems
Alzheimer's Disease Diagnosis using a Multimodal Approach with 3D MRI and PET
The study introduces a multimodal approach for Alzheimer's disease diagnosis using 3D MRI and PET scans, combining three fusion strategies (concatenation, Gated Multimodal Unit (GMU), and gated self-attention) with a sparsely gated Mixture-of-Experts (MoE) classifier for input-adaptive routing. The method leverages 3D convolutional feature extractors and Grad-CAM for interpretability. Evaluated on binary classification tasks (NC vs. MCI, MCI vs. AD, NC vs. AD), GMU achieves 80.46% (NC vs. MCI) and 95.47% (NC vs. AD) accuracy, while gated self-attention reaches 82.08% (MCI vs. AD). Ablations confirm MoE's necessity, highlighting the benefits of input-adaptive multimodal modeling.
multimodal fusionmixture-of-experts3d convolutional networksgated self-attentiongrad-cam
PU-UNet: Stable Multiplicative Interactions for Medical Image Segmentation
PU-UNet introduces stable product-unit residual blocks for medical image segmentation, enabling explicit multiplicative feature interactions while maintaining numerical stability through log-domain clipping and smooth positivity mapping. The method integrates these blocks into low-resolution stages of a residual U-Net, adding negligible computational overhead. Evaluated on ISIC 2018, Kvasir-SEG, and BUSI, PU-UNet achieves Dice scores of 0.942, 0.959, and up to 0.925 respectively, outperforming a Residual U-Net baseline in Dice and IoU while eliminating false positives on normal BUSI cases. Ablations confirm the benefits of product-unit interactions and stabilization design.
multiplicative interactionsmedical image segmentationproduct-unit residual blocksnumerical stabilityu-net
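A stabilized product unit of the kind described above might look like the following sketch. The exact formulation in PU-UNet is not given in the summary, so the softplus positivity map, the `eps` offset, the clip range, and the function names are all assumptions for illustration.

```python
import math


def softplus(x):
    # Numerically stable softplus: smooth map from the reals to positive values.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def product_unit(inputs, exponents, clip=10.0, eps=1e-6):
    # Computes prod_i softplus(x_i) ** w_i in the log domain,
    # clipping the log-sum so the final exp cannot overflow.
    log_sum = sum(w * math.log(softplus(x) + eps)
                  for x, w in zip(inputs, exponents))
    log_sum = max(-clip, min(clip, log_sum))
    return math.exp(log_sum)
```

With unit exponents and zero inputs the output is roughly `log(2) ** 2`; with large inputs and exponents the clip bounds the result at `exp(clip)`.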
Exploring the potential of AlphaEarth and TESSERA embeddings for Fine-scale Local Climate Zone Mapping: A case study across five cities in Switzerland
The study evaluates TESSERA and AlphaEarth embeddings against Sentinel-1/2 composites for fine-scale Local Climate Zone (LCZ) mapping at 10-m resolution in five Swiss cities using an attention-based U-Net. Experiments assess multi-city transferability, resolution impact, and temporal robustness. Results show strong performance (IoU 0.59-0.82), with TESSERA outperforming both S1S2 and AlphaEarth. Embeddings reduce preprocessing and feature engineering, though temporal transfer remains challenging. The method enhances regional scalability for global urban climate applications, with reference data quality as the key accuracy lever.
local climate zoneattention-based u-netmulti-city transferabilityfoundation modelsfine-scale mapping
Stochastic Linear Contextual Bandits with Bounded Noise: A Set-Membership Approach
The paper introduces SME-OFU, a novel algorithm for stochastic linear contextual bandits (SLCB) with bounded reward noise, achieving an improved regret bound of $O(\log T)$. Leveraging set-membership estimation (SME) and optimism in the face of uncertainty (OFU), the method explicitly exploits bounded noise, a stronger condition than sub-Gaussian assumptions. Empirical results demonstrate SME-OFU's superiority over sub-Gaussian-optimized benchmarks when noise is bounded, without contradicting the $\tilde{O}(\sqrt{T})$ lower bound for sub-Gaussian settings.
stochastic linear contextual banditsset-membership estimationregret boundbounded noiseoptimism in the face of uncertainty
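To see why bounded noise is a stronger assumption than sub-Gaussian noise, consider set-membership estimation in one dimension: every observation with |y - x·θ| ≤ ε cuts the feasible interval for θ. The sketch below is a scalar toy case with positive contexts, not the SME-OFU algorithm; the function name and setup are assumptions.

```python
import random


def sme_interval(observations, eps):
    """Intersect the constraints |y - x * theta| <= eps for scalar theta, x > 0."""
    lo, hi = float("-inf"), float("inf")
    for x, y in observations:
        lo = max(lo, (y - eps) / x)
        hi = min(hi, (y + eps) / x)
    return lo, hi


def simulate(theta=2.0, eps=0.1, rounds=200, seed=0):
    # Rewards y = x * theta + noise, with noise bounded in [-eps, eps].
    rng = random.Random(seed)
    obs = []
    for _ in range(rounds):
        x = rng.uniform(0.5, 2.0)
        obs.append((x, x * theta + rng.uniform(-eps, eps)))
    return sme_interval(obs, eps)
```

The true parameter always stays inside the returned interval, and the interval can only shrink as observations accumulate, which is the property an optimistic (OFU) action rule would exploit.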
Adaptive Distance-Aware Trunk Deep Operator Learning for Long-Span Roadway Bridges
The study proposes an adaptive-trunk DeepONet (AD-DeepONet) for localized structural response prediction in long-span bridges, addressing the computational inefficiency of the finite element method (FEM). The method dynamically constructs load-dependent learning domains via KNN, incorporates distance-aware trunk features, and enables full-field reconstruction through a stiffness-informed Schur complement formulation. Validated on benchmark and real-world bridges (Mussafah Bridge), it achieves FEM-level accuracy (<5% error) with 60× faster response evaluation (4 orders of magnitude faster inference excluding reconstruction). The framework supports rapid generation of full-field responses, influence lines, and surfaces for digital twin applications.
deeponetstructural digital twinschur complementreduced-order modelinginfluence surface
Self-Adaptive Scale Handling for Forecasting Time Series with Scale Heterogeneity
The paper introduces a self-Adaptive Scale-handling (AS) module to address scale heterogeneity in time series forecasting (TSF), where different series vary by orders of magnitude. AS comprises Scale Calibrating (SC), which adjusts prior mean scaling factors via neural networks, and Scaling Selection (SS), which determines whether to apply calibration or retain the original factor to prevent over-calibration. This approach preserves semantic discriminability and reduces inverse-scaling errors. Experiments on real-world fund sales datasets from Ant Fortune and Alipay demonstrate that AS enhances the performance of popular TSF models. The code and dataset are publicly available.
scale heterogeneitytime series forecastingadaptive scale-handlingscale calibratingscaling selection
VIMPO: Value-Implicit Policy Optimization for LLMs
VIMPO introduces a critic-free policy optimization method for large language models that derives a policy-implied value function from KL-regularized reinforcement learning optimality conditions. The approach formulates a value recurrence using policy-reference log-ratios, anchored by a terminal condition of zero future reward, enabling outcome-level verifiable rewards without training a critic. VIMPO separates reward incorporation via a value loss from policy improvement through a PPO-style actor update. Evaluations on MATH-500, AIME 2024, AIME 2025, and OlympiadBench demonstrate VIMPO's superiority over GRPO, particularly in competition-style tasks and noisy reward scenarios, indicating finer credit assignment while maintaining training simplicity.
policy optimizationkl-regularized reinforcementvalue recurrenceppo-style actorcredit assignment
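The value recurrence mentioned above can be written out from the standard optimality condition of KL-regularized RL. This is a sketch under stated assumptions (deterministic token-level transitions, regularization weight β, per-step reward r_t); the symbols are illustrative and VIMPO's exact losses may differ.

```latex
% Optimality condition of KL-regularized RL (deterministic transitions):
\beta \log \frac{\pi(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
  = r_t + V(s_{t+1}) - V(s_t)

% Rearranged into a backward recurrence, anchored at zero future reward:
V(s_t) = r_t + V(s_{t+1})
  - \beta \log \frac{\pi(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)},
\qquad V(s_{T+1}) = 0

% Unrolling gives a value implied entirely by rewards and log-ratios:
V(s_t) = \sum_{k=t}^{T} \left( r_k
  - \beta \log \frac{\pi(a_k \mid s_k)}{\pi_{\mathrm{ref}}(a_k \mid s_k)} \right)
```

With an outcome-level reward, every r_k is zero except the last, so the implied value at each prefix depends only on the final verifiable reward and the accumulated log-ratios, which is why no separate critic network is required.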
Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs
The paper introduces Activation- and Influence-Aware Ranks (AIR), a function-preserving SVD compression framework for LLMs that integrates backward-signal influence metrics into low-rank weight matrix approximations. AIR employs a single closed-form alternating least squares sweep, guided by activation-aware SVD-LLM(W) initialization, with guaranteed monotone descent. The layer-local method combines orthogonally with end-to-end techniques: standalone AIR surpasses ACIP, while AIR+LoRA achieves further improvements. Results show >18% perplexity reduction over SVD-LLM(W) at ≤60% parameter retention, equivalent quality with ~90% less calibration data, and FLOPs/memory/latency improvements from parameter savings.
svd compressionlow-rank approximationalternating least squaresllm compressionperplexity reduction
Online Dynamic Batching with Formal Guarantees for LLM Training
Online Dynamic Batching (ODB) introduces a DataLoader-side system for LLM training that optimizes batch formation after observing true sample costs (post-preprocessing/tokenization), addressing the Distributed Group Alignment Problem with deadlock-free guarantees. The method requires no model/optimizer modifications, achieving 1.58-4.43x throughput gains over fixed-batch baselines on 2B/8B Qwen3-VL models across UltraChat/LLaVA/ShareGPT4o datasets, while maintaining comparable quality. ODB outperforms offline token-budget oracles by 2.24-3.69x in high-variability scenarios.
online dynamic batchingdistributed group alignment problemllm trainingthroughput optimizationdeadlock-free guarantees
Kolmogorov-Arnold Reservoir Computing
The paper introduces Kolmogorov-Arnold Reservoir Computing (KARC), a lightweight framework that replaces conventional trainable reservoirs with explicit basis-function expansions inspired by the Kolmogorov-Arnold representation theorem. KARC preserves the expressive capacity of Kolmogorov-Arnold networks (KANs) while enabling efficient closed-form training. Evaluated on challenging benchmarks including partial differential equations, KARC outperforms existing reservoir computing methods at comparable cost. Additionally, it demonstrates compatibility with generative diffusion models for text-to-image generation, establishing a principled connection between reservoir computing and KANs for high-fidelity dynamical system forecasting.
reservoir computingkolmogorov-arnold networksbasis-function expansionsdynamical systemsdiffusion models
Semantic-Anchored Evidential Fusion for Domain-Robust Whole-Slide Survival Analysis
The paper introduces SAEFS, a domain-robust survival analysis framework for whole-slide images that anchors representations in pathology semantics rather than pixel-level features. The method combines Visual Question Answering (VQA) to extract semantic anchors, a dual-stream evidence extraction architecture, Dirichlet-based uncertainty modeling via Subjective Logic, and cautious evidence fusion to handle correlated sources. Evaluated zero-shot across four unseen clinical domains, SAEFS improves average C-index by 10.2% over baselines while demonstrating that VQA-derived features exhibit significantly lower cross-center divergence than pixel-based alternatives.
whole-slide imagesdomain robustnesssubjective logicvisual question answeringzero-shot evaluation
Low-Energy Reduced RISC-V Instruction Subset Processor for Tsetlin Machine Inference at the Edge
The authors present a domain-specific RISC-V microprocessor optimized for Tsetlin Machine (TM) inference at the edge, addressing limitations of existing co-processor designs through instruction subset reduction and architectural simplification. Methodologically, they profile TM workloads to guide instruction set pruning, then streamline datapath and control logic while maintaining programmability. Evaluations against Binarized Neural Networks (BNNs) demonstrate TM's superior accuracy (88.18% vs 60.0% on CIFAR-2) alongside 98% faster execution and 29.7× lower energy consumption, validating the design's efficiency for edge deployment.
tsetlin machinerisc-vedge aibinarized neural networksinstruction profiling
Towards Graph-Based Deep Learning for Map Generalization: Insights from Building Footprints Simplification and Aggregation
This study introduces the first graph-based deep learning framework for map generalization, addressing building footprint simplification as node movement prediction and aggregation as link prediction. The authors evaluate GCN, GAT, and GraphSAGE architectures on multi-scale building datasets, finding GraphSAGE superior for link prediction (aggregation) while revealing persistent challenges in node movement precision. Results demonstrate that aggregation is more complex than simplification and that current methods struggle to capture high-level spatial relationships, while still pointing to methodological directions for automated map generalization.
graph neural networksmap generalizationbuilding footprintslink predictionnode movement prediction
Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations
The study reveals systematic discrepancies between human listeners and MOS prediction models in speech quality assessment. Through controlled perturbations of acoustic degradation, prosodic errors, and speaker characteristics (pitch, speaking rate), the authors compare human and model ratings. Results indicate that models accurately track acoustic degradation but are insensitive to prosodic errors that cause significant drops in human scores. The models also exhibit F0 biases absent in human ratings, while missing the speaking-rate and F0-variability cues to which human listeners are sensitive.
mean opinion scoreprosodic errorsacoustic fidelityfundamental frequencyspeech quality assessment
QMaxCal: Path-Space Regularization for Open Quantum Control via Girsanov's Theorem
The paper introduces QMaxCal, a path-space regularization framework for open quantum control that mitigates decoherence effects via two novel regularizers derived from Girsanov's theorem. The Wiener KL (KL_W) and drift-variance regularizer (R_DV) penalize observable decoherence channel impacts rather than control amplitude, differing from conventional fluence/smoothness penalties. Evaluated on single/multi-qubit systems and an IBM Kingston processor calibration, the method improves final-state fidelity by up to 50% (+16% on IBM chain) and shows robustness to noise model mismatch (+17-27 pp gains under 2.5x noise variation).
quantum controldecoherencegirsanov's theorempath-space regularizationwiener kl
GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs
The paper introduces GEMS, a training-free method for multi-directional activation steering in LLMs that addresses collapse through geometric constraints. It identifies two failure modes—distributional deviation and directional interference—and proposes norm-preserving superposition and real-time orthogonalization as solutions. On GSM8K, GEMS maintains 98% accuracy with three concurrent injections versus 4% collapse in unconstrained cases, while Wikitext-2 shows only 2.2% PPL increase. Layer probes confirm preserved semantic specificity across 3B-31B models.
activation steeringgeometric constraintsnorm-preserving superpositiondirectional interferenceorthogonalization
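The two geometric constraints named above can be illustrated on plain vectors. This is a minimal sketch assuming Gram-Schmidt orthogonalization and a rescale back to the original activation norm; the function names, the `alpha` strength, and the choice of Gram-Schmidt are assumptions, not details confirmed by the summary.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def orthogonalize(directions):
    # Gram-Schmidt: remove overlap between steering directions
    # so that concurrent injections do not interfere.
    basis = []
    for d in directions:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        n = norm(r)
        if n > 1e-12:  # skip directions already spanned by the basis
            basis.append([ri / n for ri in r])
    return basis


def steer(hidden, directions, alpha=1.0):
    # Add every orthogonalized direction, then rescale to the original norm
    # so the activation stays on its usual magnitude shell.
    target = norm(hidden)
    out = list(hidden)
    for b in orthogonalize(directions):
        out = [o + alpha * bi for o, bi in zip(out, b)]
    n = norm(out)
    return out if n == 0.0 else [o * target / n for o in out]
```

For example, `steer([3.0, 4.0, 0.0], [[1, 0, 0], [1, 1, 0], [0, 0, 1]])` applies three mutually orthogonal pushes and returns a vector whose norm is still 5.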
Compositionality Emerges in a Narrow Depth-Connectivity Regime: Architecture Constraints and Solution Manifolds
The study demonstrates that compositional internal structure in neural networks emerges within a narrow depth-connectivity regime, requiring specific sparse connectivity patterns and target-dependent depths. The authors propose (i) similarity-based pruning (SP) to identify compositional connectivity and (ii) a depth predictor heuristic to locate optimal compositional depth. Empirical results show compositionality peaks at certain depths and connectivity patterns, with gradient descent converging to fractured solutions outside this regime. Theoretical analysis links these findings to compositional sparsity, volume-ratio arguments, and feature-interference bounds.
compositionalitygradient descentsparse networkssimilarity-based pruningfeature-interference
Deep-Unfolded Coordination
The paper introduces Deep Coordinator, a deep-unfolding framework that dynamically adjusts hyperparameters of ADMM-DDP, a distributed solver for multi-agent robotics tasks. The method unrolls ADMM-DDP iterations into a neural network with learnable functions mapping optimizer state to hyperparameters, addressing degeneracy in supervised training via an unsupervised scheme. Evaluations on car and quadrotor fleets show 6.18-9.44x speedup over conventional solvers while maintaining trajectory quality, with scalability demonstrated on systems 8x larger than training data.
deep-unfoldingdistributed optimizationadmm-ddphyperparameter adaptationmulti-agent robotics
ADaPT: Token-Level Decoupling for Efficient Large Reasoning Models
ADaPT introduces token-level decoupling for efficient large reasoning models by separating efficiency and correctness signals during training. The method employs a mode-selection token to control fast (efficiency-oriented) and slow (correctness-oriented) reasoning, applying efficiency rewards exclusively to this token to avoid penalizing correct long reasoning. Experiments show ADaPT reduces inference cost while maintaining strong performance across benchmarks, enabling continuous trade-off control via mode-selection token probability adjustment.
token-level decouplingmode-selection tokenefficiency-performance trade-offlarge reasoning modelsadaptive dual-process thinking
Structure-Oriented Randomized Neural Networks for Poisson-Nernst-Planck and Poisson-Nernst-Planck-Navier-Stokes Systems
The authors propose Structure-Oriented Randomized Neural Networks (SO-RaNN) for solving Poisson-Nernst-Planck (PNP) and PNP-Navier-Stokes (PNP-NS) systems. The method employs decoupled linearized subproblems solved iteratively via randomized neural networks in a space-time framework, incorporating pointwise positivity enforcement, mass-correction via discrete scaling factors, and SAV-type post-processing for dissipation control. For PNP-NS, a structure-preserving variant (SP-RaNN) ensures incompressibility. Theoretical contributions include residual-based error estimates, conditional convergence proofs, and approximation results for SP-RaNN. Numerical experiments validate the approach, demonstrating positivity preservation, mass matching, free-energy computation, and divergence-free velocity fields.
randomized neural networkspoisson-nernst-planckstructure-preservingmass-correctionsav-type correction
A fast direct solver based neural network for solving PDEs
The authors propose a neural network architecture that learns inverse operations for HODLR (Hierarchical Off-Diagonal Low-Rank) matrices, extending it to nonlinear PDE solution operators by incorporating deep sub-networks. The method builds on Ambikasaran and Darve's (2013) fast direct solver, demonstrating effectiveness through experiments on linear problems (Fredholm integral equations) and nonlinear PDEs (Schrödinger, Burgers', Darcy flow). Results show generalization across parameters, competitive inference times versus classical solvers, and comparisons with existing neural operators.
hodlr matricesneural operatorsfast direct solverpde learningnonlinear solution operators
Score Approximation for Diffusion Models on Arbitrary Low-Dimensional Structures
This work establishes a universal score approximation theorem for diffusion models, removing restrictive assumptions about data structure by supporting any compact set of upper Minkowski dimension $d$. The authors employ a discrete-mixture formulation to demonstrate that the score function can be approximated by a ReLU network with complexity scaling exponentially only in $d$, circumventing the curse of ambient dimensionality. Combined with existing backward diffusion SDE theory, this explains diffusion models' effectiveness on irregular, non-smooth real-world data.
score approximationdiffusion modelsminkowski dimensionrelu networkbackward sde
Adversarial Bandit Optimization with Globally Bounded Perturbations to Convex Losses
The paper extends adversarial bandit optimization to non-convex, non-smooth loss functions by introducing globally bounded perturbations to underlying convex and β-smooth components. The authors modify a standard bandit optimization algorithm to handle post-action adversarial perturbations, which are constrained by a cumulative budget over time. Theoretical analysis provides expected regret guarantees that explicitly account for the perturbation budget's impact. In the absence of perturbations, the results align with standard regret bounds for bandit convex optimization with β-smooth losses. This framework generalizes previous work from linear to convex and smooth loss functions.
adversarial banditnon-convex optimizationperturbation budgetregret guaranteesβ-smooth losses
Matching Markets meet Cumulative Prospect Theory: Towards Optimal and Adversarially Robust Learning
(No summary returned.)
Multimodal Concept Bottleneck Models
The paper introduces Multimodal Concept Bottleneck Models (MM-CBM), extending Concept Bottleneck Models (CBMs) to CLIP for improved interpretability and generalization. MM-CBM employs dual Concept Bottleneck Layers (CBLs) to align image and text embeddings into interpretable features, enabling tasks like zero-shot classification and image retrieval. The method addresses limitations of traditional CBMs, such as fixed class constraints and non-concept information leakage. Evaluated on four benchmarks, MM-CBM achieves up to 51.26% accuracy improvement while maintaining within ~5% of black-box performance, balancing accuracy and interpretability.
multimodalconcept bottleneckinterpretabilityzero-shotembeddings
On the Oracle Complexity of Interpolation-Based Gradient Descent
The paper introduces piecewise polynomial interpolation-based gradient descent (PPI-GD), an inexact gradient method that approximates full gradients by querying first-order oracles at equidistant points and constructing polynomial interpolants. The method leverages smoothness in empirical risk minimization (ERM) loss functions to improve oracle complexity. Theoretical analysis demonstrates PPI-GD's superiority over gradient descent variants for strongly convex and non-convex losses when data space dimension is polylogarithmic in sample size, extending bicubic spline interpolation techniques to $d$-variate tensor product polynomial interpolants.
gradient descentoracle complexitypolynomial interpolationempirical risk minimizationstrongly convex
Global Convergence of Gradient Descent for Score Matching in Gaussian Mixtures via Reverse Fisher Divergence
The paper establishes global convergence guarantees for gradient descent in Gaussian mixture models using reverse Fisher divergence, an alternative to the standard forward Fisher divergence in score matching. By analyzing gradient descent dynamics through a Lyapunov-based approach, the authors prove global convergence from arbitrary initializations when fitting a Gaussian mixture model (student) to a single Gaussian (teacher) with fixed weights and identity covariances. They extend these results to cases where the teacher is also a Gaussian mixture model, showing convergence under random initialization and mean separation assumptions. The reverse Fisher divergence exhibits a more favorable optimization landscape than the forward variant.
score matchingreverse fisher divergencegaussian mixture modelslyapunov analysisgradient descent
Doeblin Curves
The paper introduces Doeblin curves, a nonlinear generalization of Doeblin coefficients, to characterize multi-way contraction in Markov kernels without requiring strict positivity conditions. The method develops a variational characterization of Doeblin coefficients, analyzes properties of Doeblin curves, and defines power-constrained variants with derived bounds. Results demonstrate applications in generalization bounds for noisy optimization, error bounds for noisy circuits, and differential privacy guarantees, extending prior work to broader domains through finer-grained contraction analysis.
doeblin curvesmarkov kernelsnonlinear contractionvariational characterizationdifferential privacy
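For context, the classical Doeblin coefficient that these curves generalize is simple to compute for a finite Markov kernel: it is the total mass shared by all rows. The sketch below covers only that classical quantity, together with the Dobrushin (total-variation) contraction coefficient it bounds; it does not implement the paper's Doeblin curves, and the function names are illustrative.

```python
def doeblin_coefficient(kernel):
    # Sum over outputs y of the minimum over inputs x of K(y | x).
    n_out = len(kernel[0])
    return sum(min(row[y] for row in kernel) for y in range(n_out))


def dobrushin_coefficient(kernel):
    # Largest total-variation distance between any two rows of the kernel.
    worst = 0.0
    for p in kernel:
        for q in kernel:
            tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
            worst = max(worst, tv)
    return worst
```

The Dobrushin coefficient is always at most one minus the Doeblin coefficient, so a kernel whose rows overlap heavily contracts total variation strongly; an identity kernel has Doeblin coefficient zero and does not contract at all.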
Physics-Informed Neural Network with Squeeze-Excitation-like Attention
The authors propose SEA-PINN, a physics-informed neural network incorporating Squeeze-Excitation-like attention to dynamically recalibrate neuron importance across layers. The architecture demonstrates highly stable initialization, showing negligible variance and reduced initial loss on 17/20 benchmark problems. Without Fourier features or periodic activations, SEA-PINN achieves 83% accuracy improvement versus FNN-PINN on high-frequency problems, comparable to specialized TSA-PINN (90%). Integration with TSA-PINN yields a 42.49% performance boost, demonstrating its efficacy as a lightweight plug-in module for enhanced nonlinear representation and robust convergence in physics-informed learning.
physics-informed neural networkssqueeze-excitation attentiondynamic recalibrationhigh-frequency problemsnonlinear representation
Prompt, Plan, Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives
The study introduces a zero-shot agentic workflow using five open-source LLMs for extracting 13 lung pathology fields from clinical narratives, eliminating the need for task-specific training. The method compares these against GatorTron NER-RE, a supervised baseline, using registry-aligned evaluation. GPT-OSS-20B achieved a Micro-F1 of 0.893 (recall: 0.949) versus the baseline's 0.960, demonstrating competitive performance on complex relations like Pathologic Stage without manual annotation.
zero-shot learningnamed entity recognitionrelation extractionclinical narrativeslarge language models
Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models
The paper introduces a budget-normalized control window framework for coherent single-neuron steering in language models, addressing when interventions collapse outputs versus control behaviors like refusal and language routing. The method leverages a universal saturation curve driven by the alignment between the residual stream and write direction, normalized by a coherence budget derived from residual and write norms. Results show a mean absolute error of 0.14 for predicted collapse ceilings across 15 neurons, with coherent control verified in 11 cases. The framework explains why gradient attribution fails to identify true controllers and enables precise recovery of controllers via forward-only contrastive screening.
single-neuron steeringcontrol windowresidual streamcoherence budgetgradient attribution
Enhancing Graph Neural Networks Using Proximity Graphs for Dust Source Emission Forecasting
The paper proposes enhancing Graph Neural Networks (GNNs) with proximity graphs for dust source emission forecasting, addressing limitations of traditional methods in capturing spatiotemporal dynamics. The method integrates proximity graphs (Delaunay triangulation, Gabriel graph, k-Nearest Neighbor graph, Yao graph) as input structures for GNNs (GraphSAGE, GCN, GAT) to improve message passing. Experimental results demonstrate that GNNs with proximity graphs significantly outperform both random-graph GNNs and Long Short-Term Memory (LSTM) models in forecasting accuracy.
graph neural networksproximity graphsdust source forecastingspatiotemporal modelingmessage passing
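Two of the proximity graphs listed above are easy to construct from 2D station coordinates. The sketch below is a brute-force illustration with hypothetical function names, not the paper's pipeline: a Gabriel edge is kept when no third point lies strictly inside the circle whose diameter is that edge, and the k-NN graph links each point to its k closest neighbours.

```python
def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def gabriel_edges(points):
    # Keep (i, j) unless some k satisfies d(i,k)^2 + d(j,k)^2 < d(i,j)^2,
    # which means k lies strictly inside the circle with diameter ij.
    n = len(points)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = sq_dist(points[i], points[j])
            blocked = any(
                sq_dist(points[i], points[k]) + sq_dist(points[j], points[k]) < d_ij
                for k in range(n)
                if k != i and k != j
            )
            if not blocked:
                edges.append((i, j))
    return edges


def knn_edges(points, k):
    # Directed edges from each point to its k nearest neighbours.
    edges = set()
    for i, p in enumerate(points):
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: sq_dist(p, points[j]),
        )
        for j in order[:k]:
            edges.add((i, j))
    return sorted(edges)
```

Either edge list could then serve as the adjacency structure for a message-passing model, in place of a random graph.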
Low-Burden Data Augmentation for Dysarthric ASR via Zero-Shot Voice Cloning
This work proposes zero-shot voice cloning as a low-burden data augmentation strategy for dysarthric automatic speech recognition (ASR), addressing data scarcity and inter-speaker variability. Using Higgs Audio V2 to clone speakers from the TORGO dataset, the authors fine-tune Whisper-medium on cloned, real, and hybrid data, evaluating on held-out real speech. Clone fine-tuning achieves 26.00% WER, competitive with real (24.44%) and hybrid (25.12%) fine-tuning, while outperforming real fine-tuning for moderate-severe speakers. Cross-corpus evaluation on SAP-1102 shows Clone fine-tuning achieves the best results (11.45% relative improvement), demonstrating the scalability of zero-shot cloning for dysarthric ASR.
zero-shot voice cloningdysarthric asrdata augmentationwhisper-mediumwer
Flow Map Denoisers: Traversing the Distortion-Perception Plane for Inverse Problems
The paper introduces flow map denoisers, a method that continuously traverses the distortion-perception (DP) frontier in image restoration via a single parameter controlling the tradeoff between minimum mean squared error (MMSE) and perceptual quality. By extending flow matching to few-step sampling through learned average fields, the approach eliminates the need for paired-data supervision, auxiliary models, or sampler hyperparameter tuning. Theoretical analysis proves optimal DP frontier recovery for Gaussian targets, while experiments on CelebA (128×128) and AFHQ (256×256) demonstrate competitive performance across linear/nonlinear inverse problems compared to specialized baselines.
flow matchingdistortion-perception tradeoffinverse problemsimage restorationplug-and-play
The Hidden Environmental Cost of Poor Coding Practices in TensorFlow and Keras Applications: A Study on Resource Leaks and Carbon Emissions
This study quantifies the environmental impact of two resource-leak smells in TensorFlow/Keras applications: Improper Model Reuse (IMR) and Unreleased Tensor References (UTR). Through controlled experiments comparing smell-affected workloads against clean baselines, the authors demonstrate systematic increases in energy consumption (32% for IMR, 46% for UTR) and proportional CO2 emissions. Statistical analysis confirms these effects are significant, establishing empirical evidence that poor coding practices degrade ML sustainability.
resource-leak smellscarbon emissionstensorflowkerasenergy efficiency
An Information Theoretic Framework for Graph Novelty Generation via Latent Mixture Modeling
The paper introduces an information-theoretic framework for generating novel graph structures that deviate from existing patterns while maintaining global consistency. The method embeds graphs into a latent space, models their distribution via finite mixture models, and enforces novelty (samples poorly explained by existing components) and reliability (minimal impact on mixture structure) through Minimum Description Length constraints. Theoretical analysis proves convergence of misclassification probabilities to zero with explicit rates under proper thresholds. Experiments on synthetic and benchmark datasets demonstrate principled novelty generation with quantifiable risk.
graph novelty generationminimum description lengthlatent mixture modelinginformation-theoretic frameworkstructural consistency
Learning universal approximations for partial differential equations with Physics-Informed Broad Learning System
The Physics-Informed Broad Learning System (PIBLS) is introduced as a backpropagation-free framework for solving partial differential equations (PDEs) via direct least-squares optimization, addressing limitations of Physics-Informed Neural Networks (PINNs) in convergence and computational efficiency. PIBLS incorporates an improved algorithm for nonlinear PDEs and establishes universal approximation properties through rigorous mathematical proof. Experimental results demonstrate that PIBLS achieves one to three orders of magnitude faster computation than PINNs while significantly improving solution accuracy, offering a practical alternative for real-time simulation and design optimization in scientific machine learning.
partial differential equationsphysics-informed neural networksbroad learning systemleast-squares optimizationuniversal approximation
Federated Bilevel Performative Prediction
The paper introduces federated bilevel performative prediction (FBPP), addressing decision-dependent distribution shifts in federated bilevel optimization. The authors formalize the federated bilevel performatively stable (FBPS) point, proving its existence and uniqueness under specific conditions. They propose two methods: FBi-RRM, achieving linear convergence under contraction, and FBi-SGD, a communication-efficient stochastic approach with convergence guarantees. Experiments on strategic regression, meta strategic classification, and CNN-based tasks demonstrate improved meta-generalization and validate stability thresholds compared to non-performative baselines.
federated bilevel optimizationperformative predictiondistribution shifthypergradient estimationmeta-generalization
Closing the Calibration Gap in Semantic Caching
The paper introduces Precision-Cache Hit Ratio (P-CHR) AUC and Calibration Retention Rate (CRR) to address the calibration gap in semantic caching for LLM inference. Current evaluation metrics like PR-AUC focus on ranking rather than threshold usability, leading to suboptimal deployment choices. The authors demonstrate that the calibration gap stems from the training objective, not data scale, and propose these new metrics to better align offline and operational performance. Experiments show post-hoc calibration only partially mitigates the gap, emphasizing the need for calibration-aware model selection.
semantic cachingpr-aucp-chr auccalibration retention ratellm inference
Efficient Neural Network Model Selection for Few-Class Application Datasets
The paper introduces a data-driven metric for efficient neural network model selection in few-class applications (typically <10 classes), addressing a gap in traditional approaches optimized for thousand-class datasets. The proposed 'few-class distinctiveness' measure quantifies classification difficulty using dataset properties, enabling 6-29× faster model comparisons than repeated training. The authors demonstrate practical utility by extending scaled model families below published minima, achieving up to 42% size reduction versus YOLOv5-nano in mobile robotics while maintaining accuracy. Results span mobile robots, drones, and IoT scenarios, showing resource savings without performance loss.
few-class distinctivenessmodel selectionclassification difficultyscaled model familiesresource-constrained applications
A Differentiable Composite Approximation Framework for Autonomous Underwater Vehicle Maneuvering Modeling from Sea-Trial Data
The paper proposes a differentiable composite-approximation framework for autonomous underwater vehicle (AUV) maneuvering modeling, combining polynomial-basis and data-adaptive neural components in a jointly calibrated predictor. The method employs gradient-based co-calibration with sensitivity-aware polynomial updates and neural residuals, incorporating current compensation via turning-motion estimation. Evaluated on sea-trial data from a 7-meter AUV, the approach outperforms polynomial-only, neural-only, and frozen-prior hybrid baselines in recursive trajectory and velocity prediction.
autonomous underwater vehicledifferentiable approximationgradient-based calibrationmaneuvering modelingcurrent compensation
Comparative Study on Agility, Efficiency, and Impact Absorption of Bipedal Robots with Active Toes
The study introduces a 14-DOF bipedal robot with active toes to emulate human leg efficiency, agility, and impact absorption. A high-fidelity simulation environment was developed to quantitatively assess these capabilities, incorporating coupled transmissions and accurate power consumption. A minimal RL reward function ensured fair comparison between configurations with and without active toes. Results at 1.33 m/s walking speed demonstrated a 17.5% reduction in cost of transport (CoT) and a 5.0% decrease in heel-strike ground reaction force (GRF) for the toe-equipped robot. Agility tests showed reductions in average and maximum path deviation by 25.0% and 34.0%, respectively.
bipedal robotactive toescost of transportground reaction forcereinforcement learning
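The cost of transport quoted in the bipedal robot item above is conventionally the dimensionless ratio of power to weight times speed. A small sketch of that standard definition follows; the power and mass values are placeholders, since the summary reports only the walking speed and the relative reductions.

```python
def cost_of_transport(power_w, mass_kg, speed_m_s, g=9.81):
    """Dimensionless CoT = P / (m * g * v)."""
    return power_w / (mass_kg * g * speed_m_s)


def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Placeholder power and mass; only the 1.33 m/s speed comes from the summary.
cot_without_toes = cost_of_transport(power_w=400.0, mass_kg=30.0, speed_m_s=1.33)
cot_with_toes = 0.825 * cot_without_toes  # a 17.5% reduction, as reported
```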
Multi-Granular Attention-Driven Reinforcement Learning Framework for Web Intelligent Enhancement Systems
The paper proposes MGAR-WIES, a Multi-Granular Attention-based Reinforcement Learning framework for web intelligent enhancement systems. The method integrates semantic graph modeling with attention mechanisms to process heterogeneous web data (structured, semi-structured, unstructured) into unified feature representations, then employs adaptive multi-agent RL for personalized web actions. The system achieves 80% accuracy by dynamically updating graph representations and policies via online feedback, outperforming existing approaches in semantic understanding and adaptability.
semantic graph modelingattention mechanismsmulti-agent reinforcement learningweb intelligent enhancementdynamic representation learning
DF-ExpEnse: Diffusion Filtered Exploration for Sample Efficient Finetuning
DF-ExpEnse introduces a diffusion-filtered exploration technique for sample-efficient finetuning of pretrained generative control policies in robotic decision-making. The method leverages multimodal policy outputs to generate diverse action candidates, then uses critic ensembles to select actions balancing quality and exploration. It supports cross-agent communication in fleet settings for collaborative exploration. Experiments demonstrate consistent sample-efficiency improvements over baseline finetuning methods across manipulation and locomotion tasks, with integration into existing RL-based policy adaptation frameworks.
diffusion modelssample efficiencygenerative controlreinforcement learningmultimodal exploration
Convex training of Lipschitz-regularized shallow neural networks
The authors propose a convex training procedure for shallow neural networks that enhances robustness against adversarial attacks. They introduce a convex restriction to solve a non-convex Lipschitz-regularized training program efficiently to global optimality, applicable as a post-processing step using pre-trained networks. Experimental results on real-world datasets for regression tasks demonstrate that their method yields networks with lower objective values on the Lipschitz-regularized program compared to existing methods. Additionally, the networks trained with this convex program exhibit improved accuracy and robustness against adversarial attacks on certain datasets.
convex optimizationlipschitz regularizationadversarial robustnessshallow neural networksglobal optimality
Variational Consensus Monte Carlo for Bayesian Mixture
The authors extend variational Consensus Monte Carlo (CMC) for federated Bayesian mixture inference, enabling cluster count estimation and parameter learning without conjugacy assumptions. Their method introduces novel cluster-matching algorithms for cross-silo settings, multiple aggregation strategies for federated constraints, and practical selection guidelines. Simulations demonstrate superior small-cluster recovery accuracy compared to pooled-data MCMC when local datasets reflect underlying cluster structure. The framework is validated on large-scale electronic health records, identifying multi-morbidity patterns in a British geriatric population.
variational consensus monte carlobayesian mixture modelsfederated learningcluster-matching algorithmselectronic health records
Where Does Social Reasoning Come From? Capability Provenance in Language Models
The study introduces training-data attribution as an interpretable method for capability discovery in language models, specifically analyzing social versus STEM reasoning in OLMo3-7B. Using gradient-based attribution (TrackStar via Bergson) on the Dolma3 corpus, aggregated via WebOrganizer's 24x24 taxonomy, the authors contrast benchmark pairs (SocialIQA, MMLU Social Sciences vs. ARC-Challenge, MMLU STEM). Results show distinct corpus regions support social and STEM reasoning, with sharper contrasts at the reasoning level than at the knowledge level. Partial causal validation via targeted unlearning supports the findings. Code and data are open-sourced.
training-data attributioncapability discoverygradient-based attributionmachine unlearningsocial reasoning
MassSpecGym in the Wild: Uncovering and Correcting Evaluation Pitfalls in AI-Driven Molecule Discovery
The study identifies and rectifies evaluation pitfalls in AI-driven molecule discovery using tandem mass spectrometry (MS/MS), focusing on the MassSpecGym benchmark suite. Through systematic review of 26 papers, the authors uncover three failure classes: data leakage, shortcut learning, and implementation bugs/metric divergence, affecting 17 studies. Experimental validation quantifies these issues' impact on benchmark reliability. The work contributes MassSpecGym v1.5, an updated benchmark suite implementing corrective recommendations, publicly available to improve evaluation standards in MS/MS-based ML research.
tandem mass spectrometrybenchmark evaluationdata leakageshortcut learningmolecule discovery
SEAGAN: domain-Specific and Edge-Aware Graph Attention Network for Dynamic Plant Processes
The paper introduces SEAGAN, a domain-specific edge-aware graph attention network for identifying biochemical limitation states in plant A-Ci curves. The method formulates the task as node classification on kNN and auxiliary-signal-guided graphs, incorporating process-aware node features, edge attributes, and weighted cross-entropy loss. Evaluated on synthetic data, SEAGAN achieves 0.857 F1-score and 0.882 accuracy, outperforming conventional baselines by effectively modeling local neighborhoods through graph attention.
graph neural networksa-ci curvenode classificationedge attributesphotosynthetic parameters
GB-LSR: A Fast Local Spectral Image Representation with a Single Global Bandwidth for Continuous Reconstruction and Super-Resolution
GB-LSR introduces a fixed-grid local spectral representation for continuous image reconstruction using non-overlapping patches with truncated Fourier basis coefficients predicted from shared convolutional features. A single trainable global bandwidth scalar enables efficient reconstruction at any coordinate, outperforming LIIF/LTE/WIRE baselines by 2.8-3.6 dB PSNR and 0.11-0.15 LPIPS while reducing inference cost by 4x. In arbitrary-scale super-resolution, GB-LSR achieves competitive PSNR-Y with 1.44-3.25x speedup over LIIF-RDN/LTE-SwinIR, and further optimizations yield additional speedups (1.77x) and memory reductions (35%) with minimal quality loss.
local spectral representationcontinuous reconstructionglobal bandwidthfourier basissuper-resolution
Comparing Linear Probes with Mahalanobis Cosine Similarity
The paper establishes Mahalanobis cosine similarity (MCS) as a theoretically grounded alternative to Euclidean cosine similarity for comparing linear probes. It extends prior empirical findings showing that MCS between a probe and an out-of-distribution (OOD) reference probe linearly predicts the probe's OOD AUROC (R^2 = 0.98) across models, layers, and concept domains. The authors prove this relationship in closed form for balanced Gaussian-distributed classes, demonstrating that both OOD AUROC and MCS are sigmoid-shaped functions of the probe's signal-to-noise ratio (SNR). Empirical validation confirms theoretical predictions of when this linearity fails.
mahalanobis cosine similaritylinear probesout-of-distributionaurocsignal-to-noise ratio
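To make the Mahalanobis cosine similarity item above concrete, the sketch below computes a cosine similarity under the inner product induced by a positive-definite matrix `M`. Whether the paper takes `M` to be the activation covariance or its inverse is not stated in the summary, so `M` is left as a parameter; with `M` equal to the identity the function reduces to the ordinary Euclidean cosine.

```python
import numpy as np


def mahalanobis_cosine(u, v, M):
    """Cosine similarity under the inner product <a, b>_M = a^T M b."""
    u, v, M = (np.asarray(x, dtype=float) for x in (u, v, M))
    numerator = u @ M @ v
    denominator = np.sqrt((u @ M @ u) * (v @ M @ v))
    return float(numerator / denominator)


# Two hypothetical probe directions in a 2-D activation space.
probe = [1.0, 0.0]
reference = [1.0, 1.0]

euclidean = mahalanobis_cosine(probe, reference, np.eye(2))            # 1/sqrt(2)
reweighted = mahalanobis_cosine(probe, reference, np.diag([1.0, 4.0]))  # 1/sqrt(5)
```

The same pair of probes looks less similar once the second coordinate is weighted more heavily, which is the kind of geometry-dependent difference a plain cosine cannot see.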
Unsupervised Causal Abstractions Discovery
The paper introduces an unsupervised method for discovering causal abstractions from low-level measurements, departing from the traditional hypothesis-testing paradigm. Leveraging low-rank causal discovery principles, the authors demonstrate that observations from a low-rank graph induce latents forming a valid causal abstraction. They provide theoretical identifiability guarantees for these latent variables and propose a practical objective function for learning the corresponding high-level structural causal model (SCM). The approach bridges the gap between abstract causal representations and empirical data without requiring expert-specified candidate models.
causal abstractionsstructural causal modellow-rank discoveryidentifiabilityunsupervised learning
A Solver-Free Training Method for Predict-then-Optimize
The paper introduces a solver-free training method for predict-then-optimize tasks, where ML model outputs serve as coefficients in linear optimization. By leveraging a measure transformation principle, the proposed method avoids computationally expensive solver calls during training, addressing scalability issues in existing approaches. Theoretical guarantees include Fisher consistency and excess risk bounds. Empirical results demonstrate competitive decision quality with state-of-the-art methods while reducing training time by orders of magnitude.
predict-then-optimizedecision-focused learningmeasure transformationsurrogate lossfisher consistency
On the QUEST for Uncertainty Quantification via Highest Density Regions
The paper introduces QUEST, a novel framework for uncertainty quantification (UQ) in probabilistic machine learning that characterizes uncertainty via the volume of highest density regions in a distribution's support. Unlike scalar UQ methods based on proper scoring rules, QUEST measures concentration using Lebesgue measure at distribution peaks, with tunable robustness parameter α. The method satisfies key UQ axioms (monotonicity, shift invariance) and demonstrates superior performance to variance and differential entropy in selective prediction benchmarks, while connecting to classical information-theoretic and economic statistics.
uncertainty quantificationhighest density regionsprobabilistic machine learningproper scoring rulesselective prediction
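The idea in the QUEST item above, measuring uncertainty by the volume of a highest density region, can be illustrated in one dimension on a grid. The sketch assumes the region is the smallest set holding probability mass at least 1 - α; the paper's exact role for α may differ, and the function name is hypothetical.

```python
import numpy as np


def hdr_volume(density, dx, alpha):
    """Length of the smallest grid region holding mass >= 1 - alpha."""
    mass = np.sort(np.asarray(density, dtype=float))[::-1] * dx  # densest cells first
    mass = mass / mass.sum()            # absorb discretisation error
    cumulative = np.cumsum(mass)
    n_cells = int(np.searchsorted(cumulative, 1.0 - alpha)) + 1
    return min(n_cells, len(mass)) * dx


dx = 0.001
x = np.arange(-8.0, 8.0, dx)
wide = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)                  # N(0, 1)
narrow = np.exp(-x ** 2 / (2 * 0.25)) / np.sqrt(2 * np.pi * 0.25)  # N(0, 0.5^2)

vol_wide = hdr_volume(wide, dx, alpha=0.05)
vol_narrow = hdr_volume(narrow, dx, alpha=0.05)
```

For a standard normal the 95% region is the familiar interval of width about 3.92, and halving the standard deviation halves the volume, so a more concentrated prediction scores as less uncertain.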
Advances in Scientific Machine Learning for Coupled Fluid Flow and Transport
The chapter advances Scientific Machine Learning (SciML) for coupled fluid flow and transport systems governed by incompressible Navier-Stokes and scalar transport equations. It combines linear reduced-order techniques (e.g., Dynamic Mode Decomposition) with nonlinear neural network approaches like Physics-Informed Neural Networks (PINNs) and $β$-Variational Autoencoders ($β$-VAEs), integrated with High Performance Computing strategies such as Adaptive Mesh Refinement/Coarsening and scientific floating-point data compression. Two novel contributions are presented: PINN-based surrogate modeling of turbidity currents and $β$-VAE-driven extraction of disentangled nonlinear modes from thermal flows. Benchmarks like lock-exchange flows and Rayleigh-Bénard convection demonstrate SciML’s ability to achieve fast, accurate approximations while reducing computational costs relative to full-order simulations.
scientific machine learningphysics-informed neural networksdynamic mode decompositionadaptive mesh refinementvariational autoencoders
Understanding Key Features of Time Series Foundation Models from Epidemic Forecasting
We systematically evaluate time series forecasting architectures for regional influenza prediction, comparing classical neural networks, transformer-based models, pretrained time series foundation models, and LLM-based approaches. Using influenza-like illness surveillance and hospitalization data, we assess 1-4-week-ahead predictions under temporal and spatial generalization settings. Results show that a mixture-of-experts model combining multiple pretrained forecasters achieves the strongest performance, indicating complementary predictive information from heterogeneous representations. Pretraining yields the largest gains at longer horizons when aligned with influenza dynamics, while LLM-based methods underperform. Hospitalization signals enhance forecasting robustness in selected settings, providing guidance on model selection, pretraining strategy, and auxiliary-signal use for epidemic preparedness.
time series forecastingpretrained foundation modelsmixture-of-expertstemporal generalizationauxiliary covariate
Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment
The study evaluates the fidelity of KL divergence (KLD) as a proxy for benchmark quality in quantized LLMs, testing 28-quant Qwen3.6-35B-A3B and 41-quant Devstral-Small-2-24B cohorts across downstream benchmarks. While KLD shows strong correlation with benchmark scores overall (ρ=-0.72 to -0.86, p<0.001), this relationship collapses in the near-baseline silent zone (ρ=-0.24 to +0.00, p=0.36). Analysis reveals KLD primarily measures disagreement volume (ρ=+0.94 on Qwen, p<0.001) rather than direction, with weak failure-prediction power (42.3%-49.4% routing accuracy) and limited code-task utility (geometric-mean ratios 1.08-1.22).
kl divergencequantized llmsfidelity metricsbenchmark correlationsilent zone
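The title claim of the quantization item above, that KLD measures displacement rather than direction, can be seen in a toy case. The sketch computes the mean per-position KL divergence between baseline and quantized next-token distributions, then builds two hypothetical quantized variants that move probability mass by the same amount in opposite directions. All logits are invented; this is not the paper's experimental setup.

```python
import numpy as np


def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def mean_token_kld(base_logits, quant_logits):
    """Mean over positions of KL(P_base || P_quant), in nats."""
    lp = log_softmax(np.asarray(base_logits, dtype=float))
    lq = log_softmax(np.asarray(quant_logits, dtype=float))
    return float((np.exp(lp) * (lp - lq)).sum(axis=-1).mean())


# One position, three tokens; the baseline is undecided between tokens 0 and 1.
base = np.array([[0.0, 0.0, -10.0]])
quant_a = np.array([[0.5, -0.5, -10.0]])   # nudged toward token 0
quant_b = np.array([[-0.5, 0.5, -10.0]])   # nudged toward token 1

kld_a = mean_token_kld(base, quant_a)
kld_b = mean_token_kld(base, quant_b)
picks = (int(quant_a.argmax()), int(quant_b.argmax()))
```

Both variants have the same KLD, yet they pick different tokens; if token 0 is the correct answer, one variant improves on the baseline and the other degrades it.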
Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates
The paper introduces MergeProbe, a method for predicting the mergeability of Low-Rank Adaptation (LoRA) fine-tuning updates early in training. MergeProbe uses signals from initial training phases, including low-rank update alignment, gradient coherence, and shared representation disturbance, to forecast adapter mergeability. It provides decision outputs: merge directly, reweight, prune, or route. Evaluated on MERGE-PEFT, a multi-domain benchmark covering math, code, science, instruction following, and safety, MergeProbe achieves superior average and worst-case retention compared to interference-aware baselines, with minimal deployment overhead. This transforms LoRA merging from a post-hoc process into a predictive measurement task.
low-rank adaptationmergeabilitytask routinggradient alignmentshared representations
Tracking Representation Dynamics in Large Language Models with Persistent Homology
The study analyzes representation dynamics during supervised fine-tuning of large language models using persistent homology, revealing topological changes in activation spaces. Examining four transformer models (1B-7B parameters) across three alignment objectives (helpful, harmless, mixed data), the authors find most topological reorganization occurs early in training, with a transient peak followed by rapid stabilization. Different alignment objectives produce distinct topological trajectories, while instruction-tuned and pretrained models show divergent evolutionary patterns. Persistent homology offers insights into representation-level changes not captured by behavioral metrics alone.
persistent homologyalignment dynamicsactivation spacessupervised fine-tuningrepresentation learning
FloatDoor: Platform-Triggered Backdoors in LLMs
FloatDoor introduces the first platform-triggered backdoor attack for LLMs, exploiting floating-point arithmetic divergence across deployment platforms to induce adversary-chosen behavior without input modification. The method employs two LoRA adapters: one amplifies platform-specific numerical divergence, while another binds this signature to malicious outputs, preserving overall model utility. Demonstrated on Qwen3-4B across NVIDIA GPUs, Google TPUs, AWS Graviton, and Alibaba Yitian-710, FloatDoor reliably induces exploitable code vulnerabilities on target platforms, exposing a critical weakness in LLM supply chains.
platform-triggered backdoorfloating-point divergencelora adaptersmodel supply chaintime-of-check
Interactive Pareto navigation for deep multi-task learning
The paper introduces Preference Pareto Exploration (PPE), a novel framework for interactive multi-task learning that incorporates decision-maker preferences while respecting Pareto front geometry. PPE employs a predictor-corrector method: predictor steps follow the Pareto manifold tangentially based on preferences, while corrector steps generate new trade-offs. To avoid costly Hessian computations, tangent space characterization uses a Krylov subspace method with matrix-vector products computed via automatic differentiation. Experiments demonstrate PPE's effectiveness on both synthetic problems and deep learning applications.
multi-task learningpareto optimizationpredictor-corrector methodkrylov subspaceautomatic differentiation
Calibrating Generative Models to Feature Distributions with MMD Finetuning
The paper introduces kernel Calibrating Generative Models (kCGM), a method for aligning generative models' output distributions with target feature distributions while preserving sample quality. kCGM minimizes maximum mean discrepancy (MMD) between generated and target features using an unbiased score-function estimator, with KL regularization to maintain proximity to the pretrained model. Evaluated on antibiotic molecule generation (n=174), kCGM improved feature matching while increasing validity compared to direct finetuning. The method demonstrated versatility across autoregressive, continuous-space diffusion, and discrete diffusion models in protein and DNA generation tasks using only feature-level supervision.
generative modelsmaximum mean discrepancyfeature distributionkl regularizationscore-function estimator
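As background for the kCGM item above, the quantity being minimized is a maximum mean discrepancy between two sets of feature vectors. The sketch below shows the standard unbiased U-statistic estimate of squared MMD with a Gaussian kernel on fixed samples. It does not implement the paper's score-function gradient estimator or its KL regularization, and the kernel choice and bandwidth are assumptions.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between rows of a and rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))


def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of squared MMD (diagonal terms excluded)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, n = len(x), len(y)
    kxx = rbf_kernel(x, x, bandwidth)
    kyy = rbf_kernel(y, y, bandwidth)
    kxy = rbf_kernel(x, y, bandwidth)
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return float(term_x + term_y - 2.0 * kxy.mean())


rng = np.random.default_rng(0)
target = rng.normal(size=(200, 2))               # stand-in for target features
matched = rng.normal(size=(200, 2))              # same distribution
shifted = rng.normal(loc=3.0, size=(200, 2))     # clearly different distribution

mmd_matched = mmd2_unbiased(target, matched)
mmd_shifted = mmd2_unbiased(target, shifted)
```

The estimate stays near zero for samples from the same distribution and grows when the feature distributions differ, which is what makes it usable as a calibration loss with only feature-level supervision.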
Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale
The paper identifies an exact algebraic dead direction in LayerNorm transformers, computable solely from the LayerNorm scale parameter without forward/backward passes or eigendecomposition. The method leverages the inverse-scale direction of the LayerNorm affine as a kernel of the post-final-norm activation covariance, applicable to any input distribution. Empirical validation on 14 pretrained transformers (160M-35B parameters) shows the predicted direction matches the measured bottom singular direction with high precision (4 decimal places) in 9/9 LayerNorm models and is absent in 5/5 RMSNorm models. The approach also reveals singular structure evolution from random initialization to trained checkpoints and enables transformer normalization classification from parameters alone.
layernormdead directionactivation covariancesingular directionnormalisation
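The dead direction described in the LayerNorm item above follows from a short identity: LayerNorm centres each vector before applying scale γ and bias β, so the elementwise quotient (y - β)/γ always sums to zero, and the projection of y onto 1/γ is constant across inputs. The sketch below checks this numerically on synthetic inputs; the dimensions, scale values, and input distribution are arbitrary choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 4096
gamma = rng.uniform(0.5, 2.0, size=d)   # hypothetical LayerNorm scale
beta = rng.normal(size=d)               # hypothetical LayerNorm bias


def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta


# Arbitrary inputs with per-coordinate scales and offsets.
x = rng.normal(size=(n, d)) * rng.uniform(0.5, 3.0, size=d) + rng.normal(size=d)
y = layer_norm(x, gamma, beta)

# Direction predicted from the scale parameter alone.
predicted = (1.0 / gamma) / np.linalg.norm(1.0 / gamma)

# Direction measured from the output covariance (eigh sorts eigenvalues ascending).
eigvals, eigvecs = np.linalg.eigh(np.cov(y, rowvar=False))
measured = eigvecs[:, 0]
alignment = abs(float(predicted @ measured))
```

The smallest covariance eigenvalue is zero up to rounding and its eigenvector coincides with the normalised 1/γ. RMSNorm skips the centring step, so this argument does not apply to it, which is consistent with the direction being absent in the RMSNorm models the summary mentions.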
Optimal Ansatz-free Hamiltonian Learning In Situ
The authors propose a computationally efficient, control-free algorithm for ansatz-free Hamiltonian learning, achieving optimal Heisenberg-limited scaling with total evolution time Θ(Λ/ε² log(Λ/ε)). The method combines randomized sampling of band-limited kernels with a displacement sieve for structure learning, requiring only Pauli product state preparation and measurement. Key results include: (1) evolution time optimality (proven via Ω(Λ/ε² log(Λ/ε)) lower bound), (2) Λ-dependent time resolution advantageous for high-precision sensing, and (3) robustness to SPAM noise for local Hamiltonians. This provides a practical framework for in situ quantum platform characterization.
hamiltonian learningheisenberg-limited scalingpauli measurementband-limited kerneldisplacement sieve
Insulin4RL: Real-Time Insulin Management in the Intensive Care Unit for Offline Reinforcement Learning
Insulin4RL introduces a novel offline reinforcement learning (ORL) dataset for insulin management in ICUs, addressing limitations of temporally discretized EHR data. Derived from MIMIC-IV, it contains 375,000+ labeled decisions across 12,209 patients with irregular sampling intervals. The authors provide dataset specifications, baseline ORL performance metrics using model-free methods, and a standardized evaluation protocol via fitted Q-evaluation. This resource enables research on ORL under realistic clinical conditions.
offline reinforcement learningelectronic health recordsintensive care unitfitted q-evaluationinsulin titration
3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning
3D-DLP introduces a self-supervised object-centric model for 3D scene representation, decomposing RGB-D/voxel inputs into interpretable 3D latent particles. Each particle encodes disentangled attributes (3D keypoints, bounding boxes, appearance) via the Deep Latent Particles framework, learned through reconstruction objectives. Experiments on simulated and real-world datasets demonstrate controllable scene generation via particle manipulation and improved robotic manipulation performance over non-object-centric or dense 3D baselines.
3d-dlpobject-centricself-supervisedlatent particlesdisentangled attributes
MortarBench: Evaluating Mortgage Loan Origination Agents
The authors introduce MortarBench, the first public benchmark for evaluating mortgage loan origination agents, addressing a critical gap in financial AI systems. They develop a financial data synthesis and mutation pipeline to generate realistic test cases with broad edge case coverage. Testing state-of-the-art LLMs reveals poor performance (max 77.1% exact match accuracy) and systematic biases against non-English names. The proposed CRIT framework improves accuracy to 80.5% while mitigating bias and enhancing risk management capabilities.
mortgage loan originationfinancial data synthesisedge case coverageconfidence calibrationsystematic bias
TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology
TherapeuticsBench Preclinical Pharmacology (TxBench-PP) introduces a verifiable benchmark for evaluating AI agents in small-molecule preclinical pharmacology, addressing the need for trusted deployment in drug discovery. The benchmark comprises 100 evaluations across program stages, assay types, and task structures, testing agents' ability to derive accurate conclusions from real-world assay data rather than memorized literature. Agents process workflow snapshots in a coding environment and return structured answers graded deterministically. Testing 16 model-harness configurations (11 models, 4,800 trajectories) revealed no system reliably recovered preclinical pharmacology decisions, with Claude Opus 4.8 / Pi achieving the highest pass rate at 59.3%.
preclinical pharmacologyassay datamechanism-of-actionpharmacodynamic reasoningstructured answers
Does Text Actually Help? Uncovering and Resolving Text Collapse in Multimodal Time Series Forecasting
The paper identifies text collapse, a failure mode in multimodal time series forecasting where textual inputs become content-independent due to numerical dominance, and proposes REST-TS to address it. REST-TS leverages residual-exclusive supervision, where the numerical backbone generates independent forecasts, and the text branch predicts structured residuals that numerical pathways cannot explain. This forces the text branch to extract genuine content from input descriptions. Evaluated across diverse domains and architectures, REST-TS achieves state-of-the-art performance and demonstrates improved text-branch utilization, validating its effectiveness in resolving text collapse.
text collapsemultimodal forecastingresidual-exclusive supervisiontime seriesnumerical backbone
Spectral Retrieval-Augmented Time-Series Forecasting
The paper introduces SpecReTF, a spectral retrieval-augmented method for time-series forecasting that addresses spectral blindness and temporal recency in existing approaches. By converting time series to windowed frequency representations and using a combined amplitude-phase similarity metric, it captures periodic structures while applying exponential moving average weights to prioritize recent patterns. Evaluations on benchmark datasets show SpecReTF outperforms time-domain retrieval methods, particularly for non-stationary time series.
spectral retrievalnon-stationary time seriesfrequency-domain characteristicsexponential moving averagewindowed frequency representations
Spectral DPPs via NEPv: A Scalable Continuous Relaxation of Determinantal MAP for Diversity-Aware Data Selection
The paper introduces a continuous relaxation of the NP-hard Determinantal Point Process (DPP) MAP problem for diverse subset selection, reformulating it as a nonlinear eigenvalue problem (NEPv) on the Stiefel manifold. The proposed method, Spectral DPPs via NEPv, employs a self-consistent field iteration with spectral-gap-based convergence guarantees, requiring only matrix-vector products and scaling near-linearly in the ground-set size $n$. This approach achieves $O((ndk + nk^2)t)$ complexity for $t$ iterations, enabling efficient diversity-aware selection from large candidate pools.
determinantal point processes · nonlinear eigenvalue problem · stiefel manifold · diversity-aware selection · scalable optimization
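For readers unfamiliar with the underlying objective, the sketch below shows the discrete DPP MAP problem that the paper relaxes: choose a subset whose kernel submatrix has the largest determinant. It uses a plain greedy baseline, which is explicitly not the paper's NEPv method or its self-consistent field iteration. The kernel, the item vectors, and the function names are illustrative assumptions.

```python
def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    d = 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[p][i]) < 1e-12:
            return 0.0
        if p != i:
            a[i], a[p] = a[p], a[i]
            d = -d
        d *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d

def greedy_map(L, k):
    """Greedy baseline: repeatedly add the item that maximizes det(L_S)."""
    chosen = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(len(L)):
            if i in chosen:
                continue
            idx = chosen + [i]
            d = det([[L[r][c] for c in idx] for r in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:   # no remaining item adds volume
            break
        chosen.append(best)
    return chosen

# Hypothetical items: 0 and 1 are near-duplicates, 2 points in another direction.
items = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
L = [[sum(x * y for x, y in zip(u, v)) for v in items] for u in items]
selected = greedy_map(L, 2)
```

The determinant shrinks toward zero when two selected items are nearly parallel, so the greedy pass skips the near-duplicate and picks items 1 and 2. Each greedy step here recomputes determinants from scratch, which is the kind of cost the near-linear relaxation in the summary is meant to avoid.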
Learning to Annotate Delayed and False AEB Events: A Practical System for Extreme Class Imbalance and Asymmetric Label Noise
The paper presents an automated annotation framework, described by the authors as the first, for identifying rare delayed and false Autonomous Emergency Braking (AEB) triggers under extreme class imbalance (<5% minority samples) and asymmetric label noise. The approach combines targeted data augmentation (manipulating focal target attributes, transplanting ego-vehicle dynamics, and masking non-focal agents) with noise suppression via stable hardness estimation and probe-guided adaptive thresholding. Deployed as a full-stack annotation system, it achieves an 80% recall improvement for delayed/false triggers and reduces manual workload by 50%. The accumulated high-quality annotations enable continuous self-improvement, supporting AEB optimization.
autonomous emergency braking · class imbalance · asymmetric label noise · data augmentation · hardness estimation
📰 Industry Media (5)
A startup claims it broke through a bottleneck that’s holding back LLMs
Subquadratic claims to have addressed the quadratic attention bottleneck in large language models (LLMs) by introducing SubQ, a sparse-attention-based LLM. Unlike traditional dense-attention transformers, SubQ dynamically selects token relationships, reducing computational overhead. Independent evaluations by Appen show SubQ achieves a 56x speedup over FlashAttention, scores 89.7% on LiveCodeBench, and sustains 98% accuracy in long-context retrieval tasks with a 12M token window. SubQ also reportedly reduces operational costs significantly, for example $8 versus $2,600 for Anthropic’s Opus 4.6 on RULER 128. However, skepticism persists due to limited public access and reliance on pre-trained weights from Qwen.
sparse-attention · quadratic bottleneck · long-context retrieval · flashattention · livecodebench
Liquid AI Introduces LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M: Dense Bi-Encoder and Late-Interaction Models for Fast Multilingual Search Across 11 Languages
Liquid AI introduced LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M, bidirectional retrieval models for multilingual search across 11 languages. Both 350M-parameter models adapt the LFM2.5-350M-Base backbone via bidirectional attention masks and non-causal convolutions, enabling full-context representations. The dense bi-encoder (Embedding) produces single-vector document representations, while the late-interaction model (ColBERT) generates per-token embeddings for higher accuracy. Evaluated on NanoBEIR (NDCG@10) and MKQA-11 (Recall@20), ColBERT achieved 0.605 and 0.694, outperforming larger models like Qwen3-Embedding-0.6B. Both models support edge deployment via GGUF variants, with sub-10ms p50 query latency on cached documents.
bidirectional retrieval · late-interaction model · multilingual search · contrastive pretraining · cross-lingual distillation
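The difference between the two scoring styles in the summary above can be illustrated with toy vectors. The late-interaction score below is the standard ColBERT-style MaxSim (for each query token, take its best-matching document token, then sum). The mean pooling used for the single-vector score is an assumption for illustration only and says nothing about how the LFM2.5 models actually pool; all embeddings are made up.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def mean_pool(tokens):
    """Collapse per-token vectors into one vector (assumed pooling)."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def dense_score(q_tokens, d_tokens):
    """Bi-encoder style: one vector per side, one dot product."""
    return dot(mean_pool(q_tokens), mean_pool(d_tokens))

def maxsim_score(q_tokens, d_tokens):
    """Late interaction: best document match per query token, summed."""
    return sum(max(dot(q, d) for d in d_tokens) for q in q_tokens)

query = [[1.0, 0.0], [0.0, 1.0]]   # two query tokens, two different "topics"
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # covers both topics
doc_b = [[1.0, 0.0], [1.0, 0.0]]   # covers only the first topic

dense_a, dense_b = dense_score(query, doc_a), dense_score(query, doc_b)
late_a, late_b = maxsim_score(query, doc_a), maxsim_score(query, doc_b)
```

In this contrived case the pooled vectors score both documents identically, while MaxSim prefers the document that matches every query token. That is the usual accuracy argument for late interaction, paid for by storing one vector per token instead of one per document.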
Salesforce CodeGen Tutorial: Generate, Validate, and Rerank Python Functions With Unit Tests and Safety Checks
The tutorial presents a structured pipeline for generating, validating, and reranking Python functions using Salesforce CodeGen models. It demonstrates loading CodeGen variants (350M to 7B parameters) via Hugging Face, then implements function extraction, syntax validation, static safety checks, and unit-test execution within restricted environments. The workflow includes best-of-N candidate generation (temperature=0.25-0.35) with multi-criteria scoring (syntax correctness, safety, test coverage, cyclomatic complexity). Benchmark tasks show successful generation of mathematical functions (factorial, Fibonacci) and text processing (palindrome detection) with 100% test pass rates for top-ranked candidates.
code generation · unit test validation · static analysis · restricted execution · candidate reranking
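The validate-and-rerank stage described above might look roughly like the following. This is a sketch under stated assumptions rather than the tutorial's code: the candidates are hard-coded strings standing in for model output, the banned-call list and scoring rule are illustrative, and the trimmed `__builtins__` dictionary is not a real sandbox.

```python
import ast

BANNED_CALLS = {"eval", "exec", "open", "__import__"}  # illustrative list

def is_safe(tree):
    """Static check: reject imports and calls to a few risky builtins."""
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return False
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_CALLS):
            return False
    return True

def score_candidate(source, func_name, tests):
    """Fraction of unit tests passed; 0.0 if syntax or safety checks fail."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 0.0
    if not is_safe(tree):
        return 0.0
    namespace = {"__builtins__": {"range": range, "len": len}}  # not a sandbox
    try:
        exec(compile(tree, "<candidate>", "exec"), namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(tests)

# Hypothetical best-of-N candidates for a factorial task.
candidates = [
    "def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)\n",
    "def factorial(n)\n    return n\n",                   # syntax error
    "import os\ndef factorial(n):\n    return 1\n",       # fails safety check
    "def factorial(n):\n    return n * n\n",              # wrong logic
]
tests = [((0,), 1), ((1,), 1), ((5,), 120)]

scores = [score_candidate(c, "factorial", tests) for c in candidates]
best_index = max(range(len(candidates)), key=scores.__getitem__)
```

A fuller pipeline would add the complexity term mentioned in the summary and run candidates in a separate process with time and memory limits, since in-process `exec` offers no real isolation.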
SAP and Google Cloud deploy agentic commerce architecture
SAP and Google Cloud introduced an agentic commerce architecture to automate enterprise-scale retail operations, addressing data fragmentation in customer experience platforms. The system integrates SAP Commerce Cloud with Google Gemini models via the Universal Commerce Protocol, enabling autonomous agents to manage end-to-end retail sequences while synchronizing inventory and marketing data in real time. Reported results include reduced integration costs, improved inventory accuracy, and dynamic campaign generation through bidirectional data flows between SAP Business Data Cloud and Google BigQuery.
agentic commerce architecture · universal commerce protocol · bidirectional data flows · google gemini models · inventory synchronisation
e2e-assure introduces Cumulo, the U.K.’s only sovereign, AI-driven, zero-day SOC platform to secure IT and OT environments
e2e-assure introduces Cumulo, a sovereign AI-driven Security Operations Center (SOC) platform for UK IT/OT environments, addressing autonomous AI threats through digital twin technology and customer-dedicated LLMs. The platform integrates AI natively with SIEM systems, enabling millisecond threat detection via predictive modeling and local inference while maintaining human oversight. Cumulo's layered architecture separates sensitive operational reasoning from broader intelligence, validated by an anti-hallucination layer, and offers multi-tier deployment for varying security maturity levels.
digital twin · sovereign ai · zero-day soc · anti-hallucination layer · it/ot security
Generated automatically at 2026-06-19 20:50 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.
