Daily Digest — 2026-06-25
240 items · 3 research labs, 236 arXiv papers, 1 industry media
MarkTechPost: all feed URLs failed (last tried: https://www.marktechpost.com/feed/)
AI News: all feed URLs failed (last tried: https://artificialintelligence-news.com/feed/)
🏛️ Research Labs (3)
OpenAI and Broadcom unveil LLM-optimized inference chip
OpenAI and Broadcom introduce Jalapeño, a custom AI accelerator optimized for LLM inference, achieving a 9-month design-to-production cycle. The chip architecture minimizes data movement and balances compute, memory, and networking resources, targeting near-theoretical peak performance. Early tests indicate superior performance per watt compared to state-of-the-art solutions, with engineering samples running GPT-5.3-Codex-Spark workloads at target frequency and power. Designed for scalability, Jalapeño integrates Broadcom’s Tomahawk networking silicon and Celestica’s system expertise, enabling gigawatt-scale deployment by 2026. This initiative aligns with OpenAI’s full-stack strategy to enhance AI accessibility, reliability, and cost efficiency.
llm inference · performance per watt · tomahawk networking · gigawatt scale · full-stack strategy
Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel
NVIDIA NeMo AutoModel accelerates Transformer fine-tuning by integrating Expert Parallelism, DeepEP fused all-to-all dispatch, and TransformerEngine kernels atop Hugging Face Transformers v5. The method maintains API compatibility while optimizing MoE model training through distributed weight sharding (ep_size=8 reduces per-GPU expert memory by 8x) and fused communication-computation kernels. Benchmarks show 3.4-3.7x higher throughput and 29-32% lower GPU memory versus Transformers v5 on models like Qwen3-30B-A3B and Nemotron 3 Nano 30B, enabling full fine-tuning of 550B-parameter models across 128 GPUs.
expert parallelism · deepep · transformers v5 · moe models · gpu memory optimization
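The 8x figure quoted for ep_size=8 follows from sharding experts evenly across ranks. A minimal sketch of that arithmetic is below; the expert count, per-expert parameter count, and bytes per parameter are hypothetical placeholders rather than values from the post, and the helper names are illustrative, not part of NeMo AutoModel.

```python
def experts_per_gpu(num_experts: int, ep_size: int) -> int:
    """Experts held by each GPU when experts are sharded evenly over ep_size ranks."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    return num_experts // ep_size


def expert_memory_gib(num_experts: int, params_per_expert: int,
                      bytes_per_param: int, ep_size: int) -> float:
    """Per-GPU memory (GiB) for expert weights only, ignoring all other tensors."""
    local_experts = experts_per_gpu(num_experts, ep_size)
    return local_experts * params_per_expert * bytes_per_param / 2**30


# Hypothetical MoE configuration, chosen only to illustrate the ratio.
NUM_EXPERTS = 128
PARAMS_PER_EXPERT = 50_000_000
BYTES_PER_PARAM = 2  # e.g. bf16 weights

replicated = expert_memory_gib(NUM_EXPERTS, PARAMS_PER_EXPERT, BYTES_PER_PARAM, ep_size=1)
sharded = expert_memory_gib(NUM_EXPERTS, PARAMS_PER_EXPERT, BYTES_PER_PARAM, ep_size=8)
reduction = replicated / sharded  # ratio of replicated to sharded expert memory
```

The ratio depends only on ep_size, so it holds for any expert size. It covers expert weights only; attention weights, activations, and optimizer state are not modeled here.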
Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World
The FFASR Leaderboard introduces a standardized benchmark for far-field automatic speech recognition (ASR), addressing the performance gap between near-field and far-field conditions. Using Treble Technologies' hybrid wave-based simulation engine, it evaluates models across nine acoustic conditions, including reverberation, noise, and moving sources, with sim-to-real validation. Initial results show far-field word error rates (WER) at low SNR are consistently several times higher than near-field WER, highlighting the challenge of real-world deployment. The benchmark also reports RTFx (inverse real-time factor, where higher is faster) to analyze accuracy-latency tradeoffs, supporting submissions via Hugging Face model IDs or custom evaluators.
far-field asr · word error rate · hybrid simulation · real-time factor · acoustic robustness
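For readers unfamiliar with the two reported metrics, here is a minimal sketch. WER is the standard word-level edit distance divided by the reference length. The RTFx helper assumes the inverse real-time factor (audio seconds per second of compute), which is how comparable ASR leaderboards define it. The leaderboard's own text normalization and scoring code are not shown in the post and are not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor (assumed definition): higher means faster."""
    return audio_seconds / processing_seconds


# One substitution ("on" -> "off") plus one deletion ("the") over 5 reference words.
example_wer = word_error_rate("turn on the kitchen light", "turn off kitchen light")
example_rtfx = rtfx(audio_seconds=60.0, processing_seconds=2.0)
```

Because WER divides by the reference length, insertions can push it above 1.0, which matters when far-field noise causes a model to hallucinate extra words.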
📜 arXiv Papers (236)
InSight: Self-Guided Skill Acquisition via Steerable VLAs
The paper introduces InSight, a framework enabling autonomous skill acquisition in vision-language-action (VLA) models through primitive-action steerability. The method involves (1) automated segmentation of demonstrations into labeled primitives via VLM plan decomposition and end-effector poses, and (2) a VLM-guided data flywheel that identifies missing primitives, autonomously attempts demonstrations, and integrates successful attempts into the VLA training set. Evaluations on simulation and real-world tasks (block flipping, drawer closing, etc.) show successful skill acquisition without human demonstrations, enabling composition for novel long-horizon tasks.
vision-language-action models · primitive-action steerability · vlm plan decomposition · data flywheel · end-effector poses
FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation
FLUX3D introduces a scalable image-to-3D Gaussian Splatting (3DGS) framework addressing two structural bottlenecks in sparse voxel representation: representation learning and cross-modal alignment. The method employs Diffusion-Aligned Structured Latents (DA-SLAT) coupled with a decoder-only architecture to enhance 3DGS reconstruction fidelity. Additionally, it integrates a Sparse-structure Multimodal Diffusion Transformer (SMDiT) with Modal-Aware Rotary Positional Embedding (MARoPE) for geometry-agnostic 2D-3D alignment. Benchmark experiments show FLUX3D significantly improves appearance fidelity and outperforms state-of-the-art methods in generating high-quality 3DGS assets.
3d gaussian splatting · sparse voxel representation · diffusion-aligned structured latents · multimodal diffusion transformer · rotary positional embedding
OpenThoughts-Agent: Data Recipes for Agentic Models
The OpenThoughts-Agent (OT-Agent) project introduces a data curation pipeline for training agentic models, addressing the gap in existing single-benchmark approaches. Through 100+ ablation experiments, the study systematically evaluates pipeline stages, emphasizing task sources and diversity. A 100K-example training set was assembled, fine-tuning Qwen3-32B to achieve 44.8% accuracy across seven benchmarks, a 3.9-point improvement over Nemotron-Terminal-32B (40.9%). The dataset demonstrates superior scaling properties in compute-controlled comparisons. All resources are publicly released at openthoughts.ai.
agentic models · data curation · ablation experiments · qwen3-32b · scaling properties
It's Complicated: On the Design and Evaluation of AI-Powered AAC Interfaces
The paper proposes improved evaluation methods for AI-powered augmentative and alternative communication (AAC) systems to better capture intersectional user needs. Through analysis of six AAC problem spaces, the authors identify limitations in current metrics and suggest more nuanced assessment approaches. The work highlights how AI could enhance AAC interfaces while addressing the multifaceted requirements of diverse users through tailored evaluation frameworks.
augmentative communication · ai interfaces · intersectional evaluation · assistive technology · human-computer interaction
IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation
The paper proposes Implicit Visual Chain-of-Thought (IV-CoT), a latent visual reasoning framework that improves structure-aware text-to-image generation by decomposing visual conditioning queries into structural and semantic components. The method employs a structural-to-semantic cascade, where structural queries form a latent visual plan before semantic queries render appearance, guided by training-only sketch supervision without requiring sketch extraction at inference. IV-CoT achieves state-of-the-art results on GenEval and T2I-CompBench, with visualizations confirming complementary roles of structural and semantic queries in preserving object counts, spatial relations, and layouts.
multimodal llms · latent visual reasoning · structure-aware generation · sketch supervision · text-to-image
World Models in Pieces: Structural Certification for General Agents
(No summary returned.)
Matching Tasks to Objectives: Fine-Tuning and Prompt-Tuning Strategies for Encoder-Decoder Pre-trained Language Models
The Match Task to Objective (MTO) framework improves task-specific adaptation of encoder-decoder pre-trained language models by aligning pre-training and fine-tuning objectives. It introduces automated methods for task-related data preparation and novel template designs for fine-tuning, achieving over 120% performance gain in few-shot settings compared to conventional methods. The framework also extends to prompt-tuning, enhancing soft prompt engineering and optimization. Evaluations on generation and question answering tasks, particularly commonsense knowledge retrieval and completion, demonstrate significant improvements over baselines in both few-shot and full-dataset scenarios.
encoder-decoder models · fine-tuning · prompt-tuning · commonsense knowledge · few-shot learning
Grading the Grader: Lessons from Evaluating an Agentic Data Analysis System
The study evaluates grading strategies for agentic data analysis systems, focusing on LAMBDA's performance on 153 DSGym QRData tasks. It introduces a three-layer human-AI cascade: strict regex matching, LLM-based lenient grading, and human inspection. Results show 100% precision for automated graders, with 97% recall for the lenient grader. A keyword-anchored extraction pipeline improves strict grader recall by 60 percentage points, while an iterative nudge mechanism increases grading success from 36% to 97%. Variable type emerges as the most influential task metadata field.
agentic data analysis · lenient grading · regex matching · iterative nudge · task metadata
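The three-layer cascade can be sketched as a function that tries the cheapest grader first and escalates only when it fails. The paper's lenient layer is an LLM; the numeric-tolerance grader below is a hypothetical stand-in so the example runs offline, and all function names and the 1% tolerance are assumptions for illustration.

```python
import math
import re
from typing import Callable


def strict_grade(response: str, expected: str) -> bool:
    """Layer 1: the expected answer must appear verbatim as a standalone token."""
    pattern = r"(?<!\w)" + re.escape(expected) + r"(?!\w)"
    return re.search(pattern, response) is not None


def numeric_lenient_grade(response: str, expected: str, rel_tol: float = 0.01) -> bool:
    """Layer 2 stand-in (hypothetical): accept any number within rel_tol of the target."""
    try:
        target = float(expected)
    except ValueError:
        return False
    for token in re.findall(r"-?\d+(?:\.\d+)?", response):
        if math.isclose(float(token), target, rel_tol=rel_tol):
            return True
    return False


def cascade(response: str, expected: str,
            lenient: Callable[[str, str], bool] = numeric_lenient_grade) -> str:
    """Return which layer resolved the item: 'strict', 'lenient', or 'human'."""
    if strict_grade(response, expected):
        return "strict"
    if lenient(response, expected):
        return "lenient"
    return "human"  # Layer 3: queue for manual inspection


verdicts = [
    cascade("The mean is 0.42.", "0.42"),
    cascade("Mean is roughly 0.4201", "0.42"),
    cascade("I could not compute it", "0.42"),
]
```

Passing the lenient grader in as a callable keeps the cascade logic independent of whichever model or heuristic fills that layer.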
Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment
This paper evaluates the accuracy and quality of multi-turn LLM dialogues for assessing Non-Functional Requirements (NFRs) in software development, focusing on HIPAA compliance. Using GitHub Copilot, 49 developers assessed 148 HIPAA-derived NFRs against the iTrust codebase across requirement satisfaction, reasoning, and code localization. Results show high developer agreement with LLM outputs but low accuracy against expert ground truth. User satisfaction modeling reveals that longer responses and information-heavy turns decrease satisfaction, while proactive interactions increase it. The study offers design insights for LLM-based dialogue systems in NFR assessment.
non-functional requirements · multi-turn dialogue · hipaa compliance · github copilot · user satisfaction
Difference-Making without Making a Difference
The paper critiques Andreas & Günther's seven definitions of actual causation, categorized into three types (factual, counterfactual, and regularity-based difference-making). It demonstrates that their latest factual difference-making definition incorporates elements from all three types, rendering the distinctions meaningless. Through comparative analysis of seven accounts on key examples, the author shows this undermines all proposed definitions. The analysis reveals internal inconsistencies in their theoretical framework.
actual causation · difference-making · factual · counterfactual · regularity-based
Solving Inverse Problems of Chaotic Systems with Bidirectional Conditional Flow Matching
The paper introduces Bidirectional Conditional Flow Matching (Bi-CFM) for solving inverse problems in chaotic systems, where initial conditions are inferred from final states despite ill-posedness and instability. The method learns bidirectional mappings between initial and final state distributions, mitigating exponential error accumulation, and extends to Conservation-constrained Bi-CFM (CBi-CFM) for systems with conservation laws. Evaluated on Lorenz, Circuit, Lorenz 96, and planetary dynamics systems, Bi-CFM improves five distribution-level metrics and achieves >100x speedup over baselines, while CBi-CFM reduces conservation errors to ground-truth levels. The method also advances accuracy on real-world globular cluster observations after 10 Gyr of evolution.
inverse problems · chaotic systems · flow matching · conservation laws · lorenz 96
Large-Language-Model Discovery of Quantum LDPC Codes through Structured Concept Evolution
The authors introduce structured concept evolution (SCE), a framework combining a large language model (LLM) with structured algebraic mutation grammar to discover quantum low-density parity-check (qLDPC) codes. SCE evolves algebraic specifications paired with executable programs through hierarchical mutations, modifying group algebra, protograph geometry, or base space. Using lightweight LLMs (GPT-5.4-mini and GPT-5.4-nano), SCE discovers diverse code families, including abelian and non-abelian constructions beyond standard designs like bivariate-bicycle codes, characterized under code-capacity depolarizing noise with BP+OSD decoding.
structured concept evolution · quantum ldpc codes · algebraic mutation grammar · group algebra · protograph geometry
OrbitForge: Text-to-3D Scene Generation via Reconstruction-Anchored Video Synthesis
OrbitForge introduces a text-to-3D scene generation framework leveraging frozen video priors and Gaussian Splatting reconstruction optimization. The method generates a preliminary 3D reconstruction from a text-to-video output using Deformable Gaussian Splatting with a MedianGS proxy, identifies missing viewpoints via prescribed orbit rendering, and completes these views using the video model. This approach avoids task-specific fine-tuning, score-distillation optimization, and progressive view generation. Evaluated on a 300-prompt T3Bench-derived audit, OrbitForge achieves a 359.0-degree median span, improves Q10 ImageReward from 8.07 to 16.36, and maintains competitive coverage-quality with VideoMV.
gaussian splatting · text-to-video · 3d reconstruction · mediangs · coverage-aware evaluation
EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence
The paper introduces EG-VQA, a benchmark for verifiable video question answering with 2,067 videos and 11,838 QA pairs annotated with temporal evidence. It proposes Evidence-Grounded F1 (EG-F1) to evaluate both answer correctness and evidence localization. Experiments show proprietary models struggle with evidence grounding, prompting the development of EG-Reasoner, which achieves state-of-the-art performance among open-source models, particularly on reasoning-intensive tasks like counterfactual questions.
video question answering · temporal evidence · evidence-grounded f1 · video-llms · counterfactual reasoning
Grad Detect: Gradient-Based Hallucination Detection in LLMs
Grad Detect introduces a gradient-based method for detecting hallucinations in Large Language Models (LLMs) by analyzing layer-wise gradient patterns during inference. The approach leverages internal gradient structures, inaccessible through output-level signals, to predict hallucinations and model abstention. Evaluated on multiple Q&A benchmarks, Grad Detect outperforms confidence-based and sampling-based baselines. Layer ablation studies across eleven models from four architectural families reveal that the final five layers concentrate over 97% of the discriminative gradient signal, enabling efficient deployment with minimal performance loss. The framework provides interpretable insights into model failures and enhances LLM reliability.
gradient-based · hallucination detection · layer-wise · ablation studies · interpretable insights
Paying to Know: Micro-Transaction Markets for Verified Product Information in Agentic E-Commerce
The paper proposes a paradigm shift in agentic e-commerce, where autonomous buyer agents transact micropayments for verified product information (e.g., service histories, third-party tests) rather than relying on traditional recommendation systems. It introduces an architecture for a freemium micro-transaction market with reputation-based trust scoring, arguing this incentivizes genuine product quality over ranking manipulation. The work identifies key NLP challenges including cost-optimal information acquisition, data pricing negotiation, and privacy-preserving persona modeling as critical research directions for this emerging domain.
agentic e-commerce · micro-transaction markets · verified information · reputation-based trust · privacy-preserving persona modeling
Assessing Distribution Shift in Human Activity Recognition for Domain Generalization
This paper systematically evaluates domain generalization in Human Activity Recognition (HAR) by analyzing four types of distribution shifts: device type, sensor placement, sampling rate, and user behavior. The study introduces a uniform benchmark for HAR-based distribution shifts and evaluates 28 domain generalization methods, revealing their limited effectiveness, with performance only marginally surpassing empirical risk minimization baselines. Results indicate that diversity shifts dominate across all shift types, highlighting unique features per domain and underscoring the challenge of model generalizability in real-world HAR applications.
human activity recognition · domain generalization · distribution shift · empirical risk minimization · sensor heterogeneity
BluTrain: A C++/CUDA Framework for AI Systems
BluTrain introduces a C++/CUDA framework for AI systems training, designed to optimize hardware expression while abstracting systems complexity. The framework implements native layers including typed tensors with autograd, linear algebra, caching allocators, distributed execution, and an MLIR-based compiler. Evaluations on an 8-GPU system show BluTrain outperforms PyTorch in throughput (407K vs. 395K tokens/s) and memory efficiency (22% reduction) for a 124M-parameter GPT-2 model, while maintaining numerical fidelity and achieving marginally better validation loss.
cuda · autograd · mlir · distributed training · gpt-2
DeepBD: A Grounded Agentic Workflow for Variant Prioritization and Diagnosis of Genetic Birth Defects
DeepBD introduces a grounded agentic workflow for genetic variant prioritization in birth defect diagnosis, combining LLM-assisted case structuring, a pretrained evidence engine, and specialist modules. The system integrates structured rule evidence, sequence representations, phenotype-conditioned context, and tool-based refinement to rank candidate variants. Evaluated on 18,622 fetal/infant cases, DeepBD achieved Recall@1/3/5/10 of 0.658/0.882/0.912/0.929, outperforming Exomiser, DeepRare, and LLM baselines, with ablation studies confirming complementary contributions from rule evidence, mechanistic context, and specialist refinement.
variant prioritization · agentic workflow · phenotype-conditioned context · evidence engine · recall@k
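Recall@k, the metric reported for DeepBD, is the fraction of cases whose true causal variant appears among the top k ranked candidates. A minimal sketch follows; the variant identifiers and rankings are made-up toy data, not results from the paper.

```python
from typing import Sequence


def recall_at_k(rankings: Sequence[Sequence[str]], truths: Sequence[str], k: int) -> float:
    """Fraction of cases where the true variant is within the top-k ranked candidates."""
    if len(rankings) != len(truths):
        raise ValueError("rankings and truths must have the same length")
    if not rankings:
        raise ValueError("at least one case is required")
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(rankings)


# Toy data (hypothetical): four cases, each with a ranked candidate list.
rankings = [
    ["v1", "v7", "v3"],        # truth at rank 1
    ["v4", "v9", "v2"],        # truth at rank 3
    ["v8", "v5", "v6"],        # truth at rank 2
    ["v10", "v11", "v12"],     # truth not ranked at all
]
truths = ["v1", "v2", "v5", "v99"]

scores = {k: recall_at_k(rankings, truths, k) for k in (1, 3, 5)}
```

Recall@k is non-decreasing in k, which is why the four reported values rise from Recall@1 to Recall@10.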
UniDrive: A Unified Vision-Language and Grounding Framework for Interpretable Risk Understanding in Autonomous Driving
UniDrive introduces a unified vision-language framework for interpretable risk understanding in autonomous driving, addressing the trade-off between temporal reasoning and spatial precision. The method combines a temporal reasoning branch for multi-frame dynamics with a high-resolution perception branch, integrated via gated cross-attention fusion to align dynamic context with spatial evidence. Evaluated on DRAMA-Reasoning, UniDrive outperforms image- and video-based baselines in captioning and risk-object grounding, achieving superior small-object localization, zero-shot generalization to NuScenes and BDD100K, and human-rated interpretability. Code is publicly available.
multimodal large language models · temporal reasoning · spatial precision · gated cross-attention · risk-object grounding
Can Scale Save Us From Plasticity Loss in Large Language Models?
The study investigates plasticity loss in transformer-based LLMs, demonstrating its persistence across models from 5M to 314M parameters in multilingual continual learning. Using GPT-style architectures and a Vietnamese probing task, the authors find that plasticity loss follows a sublinear scaling law with model size, suggesting larger models delay but do not prevent the phenomenon. Results also reveal plasticity loss occurs under stationary multilingual training, contradicting assumptions of its exclusivity to abrupt task changes. The work establishes that even large language models eventually lose adaptive capacity in both continual and stationary natural-language settings.
plasticity loss · transformer · continual learning · scaling law · multilingual training
Scaling Laws for Task-Specific LLM Distillation
This paper establishes empirical scaling laws for task-specific LLM distillation, analyzing performance tradeoffs between in-domain and general knowledge under compression. The study compares logit-based and LoRA-based distillation with iterative structural pruning, introducing a blended chain-of-thought supervision loss to stabilize KL-divergence distillation over reasoning traces. Results show predictable in-domain quality degradation under compression, with general-knowledge benchmarks collapsing earlier; chain-of-thought supervision mitigates this by recovering pruned knowledge. The work contributes the FinHeadlineMix dataset and practical compression guidelines.
llm distillation · scaling laws · iterative pruning · chain-of-thought supervision · lora-based distillation
Beyond U-Net: A Latent-Representation-Aligned Skip-Free Backbone for Flow-Matching Speech Enhancement
The paper proposes a skip-free encoder-decoder backbone for flow-matching speech enhancement, replacing U-Net skip connections with Latent Representation Alignment (LRA) to a frozen Descript Audio Codec. This alignment promotes clean-speech representations by matching bottleneck and decoder features to quantization-free latent features from the pretrained codec. Evaluated on WSJ0-CHiME3 and VoiceBank-DEMAND, the method achieves improved PESQ and perceptual quality with only five function evaluations, demonstrating efficient inference while avoiding noise transfer from skip connections.
flow matching · latent representation alignment · speech enhancement · descript audio codec · skip-free backbone
Task Decomposition for Efficient Annotation
The paper proposes task decomposition to optimize annotation workflows by reducing inferential load. It introduces a formal model based on centering theory, where sub-tasks focus on salient anchor entities to constrain output space complexity. The method demonstrates cost-efficiency improvements through strategic allocation of sub-tasks across heterogeneous annotators (models and humans). Results show that decomposition lowers aggregate inferential load while maintaining annotation quality under fixed budgets.
task decomposition · inferential load · centering theory · annotation workflows · heterogeneous annotators
Decentralised AI Training and Inference with BlockTrain
The paper introduces BlockTrain, a decentralized protocol for AI training and inference that partitions models into independently trainable blocks. Each block is optimized locally while contributing to a global objective, enabling distributed training without full-model optimizer states. On WikiText, BlockTrain achieves cross-entropy 1.359 (perplexity 3.89), close to a centralized Transformer baseline (ΔCE≈0.04). A six-worker configuration reaches CE 1.385 via block-averaging, and public-IP experiments demonstrate scalability, improving CE from 5.580 to 1.811 while transferring 15.22 GB. Inference supports 75.80B-parameter models over TCP with sequence-level efficiency gains.
blocktrain · decentralized training · cross-entropy · transformer · tcp inference
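The cross-entropy and perplexity figures quoted for BlockTrain are related by perplexity = exp(cross-entropy), assuming the loss is measured in nats. A short check of that conversion:

```python
import math


def perplexity(cross_entropy_nats: float) -> float:
    """Convert a mean cross-entropy in nats to perplexity."""
    return math.exp(cross_entropy_nats)


single_ppl = perplexity(1.359)      # cross-entropy quoted alongside perplexity 3.89
six_worker_ppl = perplexity(1.385)  # six-worker block-averaging configuration
print(f"{single_ppl:.2f} {six_worker_ppl:.2f}")
```

Because the mapping is exponential, small cross-entropy gaps widen in perplexity terms as the loss grows.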
Evaluating the Interpretability of Sparse Autoencoders with Concept Annotations
The paper introduces a human-grounded evaluation framework for sparse autoencoders (SAEs) that quantifies semantic alignment between SAE latents and human-annotated concepts. The method employs Fully-Binary Matching Pursuit (FBMP) for many-to-one concept matching and Targeted Attribute Perturbation Alignment Score (TAPAScore) for functional validation through attribute perturbations. Evaluated on synthetic benchmarks synCUB and synCOCO, FBMP outperforms one-to-one baselines, and TAPAScore reliably distinguishes trained SAEs from untrained ones. Experiments on CLIP and DINOv2 embeddings reveal that increased overcompleteness reduces interpretability, with moderate dictionary sizes offering the best trade-off. Code and datasets are publicly available.
sparse autoencoders · fully-binary matching pursuit · tapascore · synth benchmarks · overcompleteness
TACTFUL: Tactile-Driven Exploration For Object Localization and Identification in Confined Environments
TACTFUL introduces a vision-free tactile exploration framework enabling multi-fingered robots to autonomously explore confined workspaces, discover objects through contact, and identify them via tactile reconstruction. The system learns a single policy balancing global exploration with local surface refinement using a dynamic reward schedule, trained entirely on real hardware without simulation. Results demonstrate tactile sensing paired with structured learning achieves 77% success in object identification with 0.015 m average reconstruction error, outperforming baseline approaches on real-world objects.
tactile exploration · object identification · dynamic reward schedule · tactile reconstruction · confined workspaces
FlowPipe: LLM-Enhanced Conditional Generative Flow Networks for Data Preparation Pipeline Construction
FlowPipe introduces a Conditional Generative Flow Network (C-GFlowNet) framework for automated data preparation pipeline construction, addressing limitations in Multi-DQN methods through trajectory balance objectives and LLM-enhanced semantic modulation. The method employs Feature-wise Linear Modulation (FiLM) to inject dataset-specific logical priors into policy decisions and incorporates failure awareness to optimize search efficiency. Evaluations on 74 real-world datasets demonstrate 11.96% accuracy improvement and 12.5× faster convergence compared to state-of-the-art baselines.
conditional generative flow networks · data preparation pipelines · feature-wise linear modulation · trajectory balance · semantic modulation
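Feature-wise Linear Modulation itself is a simple operation: a conditioning vector is projected to a per-feature scale (gamma) and shift (beta), which are then applied to the features. The sketch below shows only that generic operation in plain Python. How FlowPipe wires it into the policy network is not described in the summary, and the weights and the "dataset embedding" here are hypothetical toy values.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def linear(weights: Matrix, bias: Vector, vec: Vector) -> List[float]:
    """Affine map: weights @ vec + bias, written out with plain lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def film(features: Vector, condition: Vector,
         w_gamma: Matrix, b_gamma: Vector,
         w_beta: Matrix, b_beta: Vector) -> List[float]:
    """FiLM: scale and shift each feature using parameters predicted from the condition."""
    gamma = linear(w_gamma, b_gamma, condition)  # per-feature scale
    beta = linear(w_beta, b_beta, condition)     # per-feature shift
    return [g * x + b for g, x, b in zip(gamma, features, beta)]


# Toy values (hypothetical): two policy features, a two-dimensional dataset embedding.
features = [1.0, 2.0]
dataset_embedding = [1.0, 0.0]
modulated = film(
    features, dataset_embedding,
    w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 1.0],
    w_beta=[[0.0, 0.0], [1.0, 0.0]], b_beta=[0.5, 0.0],
)
```

With these values gamma is [2.0, 1.0] and beta is [0.5, 1.0], so the same features would be modulated differently under a different dataset embedding.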
Cost-Optimal Decision Diagrams for Stochastic Boolean Function Evaluation
The paper introduces the first practical exact algorithm for cost-optimal stochastic Boolean function evaluation, employing branch-and-bound with variable-selection heuristics, pruning, and caching. The method addresses scenarios where information acquisition incurs variable costs under probabilistic truth assignments. Experimental results demonstrate scalability on random instances and analyze efficiency-quality trade-offs via a greedy beam-search variant, including application to heart-disease diagnosis. Theoretical contributions establish the problem's #P-hardness and containment in PSPACE.
stochastic · boolean · branch-and-bound · pspace · beam-search
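To make the problem setting concrete, consider the simplest special case: evaluating a conjunction of independent variables, where testing stops at the first false result. The sketch below computes the expected cost of a test order and finds the cheapest order by brute force. This is not the paper's branch-and-bound algorithm, and the variable names, costs, and probabilities are hypothetical.

```python
from itertools import permutations
from typing import Dict, Sequence, Tuple


def expected_cost_and(order: Sequence[str], cost: Dict[str, float],
                      p_true: Dict[str, float]) -> float:
    """Expected cost of testing variables in order, stopping at the first false one."""
    total = 0.0
    reach = 1.0  # probability that every earlier variable was true
    for var in order:
        total += reach * cost[var]
        reach *= p_true[var]
    return total


def best_order_and(cost: Dict[str, float], p_true: Dict[str, float]) -> Tuple[str, ...]:
    """Exhaustively search all orders; feasible only for a handful of variables."""
    return min(permutations(cost), key=lambda o: expected_cost_and(o, cost, p_true))


# Toy instance (hypothetical): test cost and probability of being true per variable.
cost = {"a": 1.0, "b": 4.0, "c": 2.0}
p_true = {"a": 0.9, "b": 0.5, "c": 0.2}

order = best_order_and(cost, p_true)
order_cost = expected_cost_and(order, cost, p_true)
```

Here the cheapest order starts with "c": it is not the cheapest test, but it is the most likely to be false and end the evaluation early. General Boolean functions lack such a simple ordering rule, which is what motivates an exact search.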
LaGO: Latent Action Guidance for Online Reinforcement Learning
The paper introduces LaGO (Latent Action Guidance for Online Reinforcement Learning), a framework that leverages pretrained LLMs as latent action priors to guide online policy optimization, rather than using them as direct controllers. The method employs soft guidance from LLMs during Proximal Policy Optimization (PPO), avoiding precise action generation requirements. Experiments on CLEVR-Robot (discrete control) and Meta-World (continuous control) show LaGO improves average success rates from 15.1% to 27.2% and 2.7% to 15.2%, respectively, outperforming Vanilla PPO. Results indicate stronger pretrained LLMs yield better guidance, demonstrating LLMs' potential for enhancing online decision-making.
latent action guidance · online reinforcement learning · proximal policy optimization · large language models · decision-making
AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach
The paper introduces AI-PAVE-Br, a system leveraging Large Language Models (LLMs) for Product Attribute Value Extraction (PAVE) in Brazilian e-commerce, and the Golden Set, a manually annotated Portuguese dataset for PAVE benchmarking. The method employs targeted prompt engineering with LLMs to address linguistic nuances in Portuguese product descriptions. Experiments demonstrate that AI-PAVE-Br significantly outperforms traditional Named Entity Recognition (NER) baselines, providing a scalable solution for non-English markets and a public resource for future research.
large language models · product attribute value extraction · named entity recognition · prompt engineering · portuguese nlp
CineCap: Structured Reasoning with Spatio-Temporal Anchors for Cinematographic Video Captioning
CineCap introduces a structured reasoning framework with spatio-temporal anchors and reinforcement learning for cinematographic video captioning, addressing the challenges of inferring professional film-language concepts and generating comprehensive, accurate descriptions. The method employs atomic reasoning for supervised fine-tuning and optimizes caption quality through comprehensiveness, accuracy, and gated coverage rewards. Evaluated on CineCap Bench (472 annotated video-caption pairs), the framework outperforms proprietary and open-source baselines, setting a new state of the art.
cinematographic captioning · spatio-temporal anchors · reinforcement learning · atomic reasoning · gated coverage rewards
Visualizing "We the People": Bridging the Perception Gap through Pluralistic Data Storytelling
The paper proposes pluralistic data storytelling as an AI-enabled alternative to binary visualizations that exacerbate political polarization. It introduces an interactive opinion mapping approach using deliberative technologies to represent high-dimensional opinion spaces, emphasizing both consensus and dissensus. Results from the 'We the People' deliberation (N=2,400) demonstrate how AI-synthesized opinion landscapes can humanize diverse viewpoints and reveal hidden consensus, suggesting this method as a scalable intervention for democratic culture.
pluralistic data storytelling · opinion landscapes · deliberative technologies · high-dimensional opinion spaces · ai-synthesized visualization
SAFARI: Scaling Long Horizon Agentic Fault Attribution via Active Investigation
SAFARI introduces a tool-augmented diagnostic loop for scalable fault attribution in long-horizon multi-agent tasks, addressing context window limitations of traditional LLM-based methods. The framework combines a specialized toolbox for trajectory segment retrieval with a persistent Short-Term Memory (STM) to enable cross-turn reasoning without full context loading. Experiments show SAFARI outperforms state-of-the-art methods by 20% on Who&When and 19% on TRAIL GAIA, maintaining 0.58 precision for faults 5x beyond native context windows where baseline methods fail.
long-horizon tasks · fault attribution · tool-augmented llm · short-term memory · context window
Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity
The paper proposes a multi-agent framework for privacy-preserving Retrieval-Augmented Generation (RAG) that sanitizes retrieved content through semantic rewriting. Three specialized agents handle privacy extraction, semantic analysis, and reconstruction to remove sensitive identifiers while maintaining semantic fidelity. Evaluated on ChatDoctor and Wiki-PII datasets across six LLMs, the method reduces targeted information leakage in LLaMA-3-8B from 144 to 1 instance and achieves a BLEU-1 score of 0.122, outperforming SAGE (0.117). The framework operates asynchronously without adding inference latency.
retrieval-augmented generation · privacy leakage · semantic rewriting · multi-agent framework · contextual fidelity
Themis: An explainable AI-enabled framework for Reinforcement Learning with Human Feedback
The paper introduces Themis, an explainable AI framework combining transparency and human feedback for Reinforcement Learning from Human Feedback (RLHF). The framework supports over 200 RL environments and includes a cloud-based platform for scalable human feedback collection. Results demonstrate that Themis-trained reward models match or exceed environment-provided rewards using human preferences. The system supports 1,000 users in back-to-back experiments on modest hardware.
reinforcement learning · human feedback · explainable ai · reward modeling · scalability
Infinitesimal Causality
The paper introduces a categorical framework for infinitesimal causality within Frobenius Markov categories, leveraging tangent-bundle semantics. It defines infinitesimal causality through the interaction of two Frobenius structures: a categorical Frobenius algebra for classical variable operations and a geometric Frobenius integrability condition for intervention distributions. Causal sufficiency is characterized by the compatibility of these structures. Interventions are modeled as tangent vectors deforming copy/discard operations, with Lie brackets assessing information-flow preservation. The framework is applied to structural causal models, demonstrating its alignment with Pearl's do-calculus principles, including counit invariance, coproduct compatibility, and involutive bracket closure.
frobenius markov categories · infinitesimal causality · tangent-bundle semantics · lie brackets · do-calculus
When CQs Go Wrong: Challenges in CQ Verification with OE-Assist
The paper identifies challenges in Competency Question (CQ) verification for ontology evaluation, proposing solutions to improve user performance. Using data from 19 participants performing CQ-verification on 20 tasks with LLM assistance, the study highlights ambiguities and complexities in CQs that lead to inconsistent modeling. Results demonstrate the need for tools to refine CQs pre-publication, mitigating ambiguity and complexity in ontology engineering.
competency questions · ontology evaluation · llm assistant · ambiguity resolution · ontology engineering
Abstractions of Queries in Ontology-Based Data Access
The paper investigates query abstraction in ontology-based data access (OBDA) using existential rules and certain answer semantics. It addresses the challenge of translating data queries to the ontology layer by introducing minimally complete and maximally sound abstractions. The study extends UCQs with limited inequality and a special predicate for database constants, enabling expression of minimally complete abstractions without increasing complexity. Results include characterizations of maximally sound abstractions through connections with data exchange's maximum recovery concept.
ontology-based data access · existential rules · query abstraction · ucq extension · certain answer semantics
ScaleToT: Generalizing Structured LLM Reasoning for Billion-Scale Low-Activity User Modeling
ScaleToT proposes a method for generalizing structured LLM reasoning to billion-scale low-activity user modeling, addressing the challenges of sparse profiles and high computational costs. The approach combines entropy-guided Tree-of-Thought (ToT) refinement for reliable reasoning with supervised fine-tuning (SFT) and Outcome-Driven Segment-Aware Implicit Reward Policy Optimization (OSIPO) to transfer reasoning to a lightweight profile encoder. Evaluated on lifetime value (LTV) prediction in advertising, ScaleToT achieved a 6.738% increase in LT30 while covering only 7.32% of the population, significantly reducing compute costs.
tree-of-thought · supervised fine-tuning · outcome-driven optimization · user modeling · lightweight encoder
Uncertainty-Aware Longitudinal Forecasting of Alzheimer's Disease Progression Using Deep Learning
The paper proposes a probabilistic deep learning framework for longitudinal Alzheimer's disease progression forecasting with uncertainty quantification. The method combines a Temporal Fusion Transformer encoder with CORAL ordinal output, asymmetric loss weighting, and oversampling to handle disease-stage ordering, followed by an autoregressive Mixture Density Network for multi-horizon trajectory generation. Evaluated on ADNI, the model outperforms baselines in next-visit diagnosis prediction (particularly MCI-to-dementia discrimination), achieves 90% credible interval coverage, and decomposes uncertainty into aleatoric and epistemic components, with higher epistemic uncertainty for rare progressions and external OASIS-3 data.
temporal fusion transformer · mixture density network · ordinal regression · uncertainty quantification · alzheimer's disease progression
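As a rough illustration of the CORAL ordinal output mentioned in the summary above, the sketch below decodes a disease stage from K-1 threshold probabilities that share one score and differ only in their biases. The stage labels, bias values, and scores are hypothetical; this shows the generic CORAL decoding rule, not the paper's trained model.

```python
from math import exp

# Hypothetical ordered stages (assumption): cognitively normal < MCI < dementia.
STAGES = ["CN", "MCI", "Dementia"]


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def coral_decode(shared_score, biases):
    """Return (rank, threshold_probs) for one sample.

    Each threshold k models P(stage > k) = sigmoid(shared_score + biases[k]).
    The predicted rank is the number of thresholds passed with probability > 0.5.
    """
    probs = [sigmoid(shared_score + b) for b in biases]
    rank = sum(p > 0.5 for p in probs)
    return rank, probs


# Hypothetical biases, ordered so that higher thresholds are harder to pass.
BIASES = [1.0, -1.0]

for score in (-2.0, 0.5, 2.0):
    rank, probs = coral_decode(score, BIASES)
    print(score, STAGES[rank], [round(p, 3) for p in probs])
```

Because all thresholds share one score, the decoded stages respect the disease ordering by construction, which is the property an ordinal head is chosen for.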
ASALT: Adaptive State Alignment for Lateral Transfer in Multi-agent Reinforcement Learning
The paper introduces ASALT, an adaptive state alignment method for lateral transfer in multi-agent reinforcement learning (MARL) that handles mismatched state-space dimensionalities between source and target domains. The approach employs observation-level and state-level adapters to map heterogeneous observations and global states into a shared embedding space, facilitating knowledge transfer across actors and critics. Experiments on standard benchmarks show ASALT improves sample efficiency and global return in cooperative settings compared to baselines, while mitigating negative transfer, though performance depends on the degree of domain mismatch.
multi-agent reinforcement learning · transfer learning · state-space alignment · embedding space · negative transfer
AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability
AdversaBench introduces an automated red-teaming pipeline for LLMs, combining prompt mutation via five structured operators with failure confirmation through a three-judge panel and meta-judge tiebreaker. The method evaluated 45 seeds across reasoning, instruction-following, and tool-use categories, achieving 100% confirmed failure generation. Key findings include: operator effectiveness varies by category (e.g., inject_distractor scores 0.00 vs. 0.80-0.83), instruction-following requires 2.4x more iterations than other categories, judge agreement shows 80-87% pairwise consensus despite low Cohen's kappa, and adversarial prompts transfer zero-shot between Llama 3.1 8B and 3.3 70B.
red-teaming · prompt mutation · multi-judge confirmation · cross-model transferability · cohen's kappa
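The combination of high pairwise agreement and low Cohen's kappa reported above can look contradictory. The minimal sketch below, using made-up verdicts from two hypothetical judges, shows how it arises when almost every item receives the same label: chance agreement is then already very high, so kappa stays near zero.

```python
from collections import Counter


def pairwise_agreement(a, b):
    """Fraction of items on which two judges give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = pairwise_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance from each judge's label frequencies.
    p_e = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 0.0  # kappa is undefined here; this sketch reports 0.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical verdicts on 20 mutated prompts (1 = failure confirmed).
judge_a = [1] * 17 + [0, 1, 1]
judge_b = [1] * 17 + [1, 0, 0]

print(pairwise_agreement(judge_a, judge_b))  # 0.85
print(round(cohens_kappa(judge_a, judge_b), 3))
```

With these toy labels the judges agree on 85% of items, yet kappa is slightly negative because chance alone would produce about 86% agreement.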
LLMs Prompted for Legal Context Object More: Overrefusal from Small On-Premises LLMs in Criminal Legal Context
The study reveals that small on-premises LLMs exhibit systematic overrefusal in legal contexts when prompted with authority-style prefixes, increasing refusal rates by 2-20x compared to no-prefix baselines. Researchers evaluated modern deployable LLMs using legal prompts with varying contextual framings (e.g., "assistant of the national supreme court") and observed unstable refusal behaviors, particularly under institutional-style prompts. Results indicate that role-play jailbreak prefixes yield mixed effects, exacerbating refusals in some models while leaving others unaffected, highlighting potential biases in legal applications.
overrefusal · on-premises llms · legal context · role-play jailbreak · contextual framing
Quant Convergence: Bridging Classical Value Investing and Modern Factor Models for Systematic Equity Selection
The study demonstrates that integrating Benjamin Graham's classical value investing principles with modern machine learning models mitigates excessive risk-taking in equity selection. Using S&P 500 data (2002-2022), the authors compare three feature sets (Graham rules, modern factors, and a hybrid) with XGBoost and AutoGluon under a four-year buy-and-hold strategy (2022-2026). Results show the pure Graham Random Forest achieved the highest return (232.13%) with lower risk (Calmar Ratio: 1.38), while the hybrid model balanced momentum and value (202.91% return, 34.53% max drawdown), outperforming complex models prone to overfitting.
value investing · factor models · xgboost · autogluon · calmar ratio
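For readers unfamiliar with the two risk figures quoted above, the sketch below computes maximum drawdown and the Calmar ratio (annualized return divided by maximum drawdown) for a made-up portfolio value series. The series and holding period are hypothetical and are not taken from the study.

```python
def max_drawdown(values):
    """Largest peak-to-trough loss, as a fraction of the running peak."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


def calmar_ratio(values, years):
    """Compound annual growth rate divided by maximum drawdown."""
    cagr = (values[-1] / values[0]) ** (1.0 / years) - 1.0
    mdd = max_drawdown(values)
    return cagr / mdd if mdd > 0 else float("inf")


# Hypothetical year-end portfolio values over a four-year holding period.
portfolio = [100.0, 120.0, 90.0, 150.0, 200.0]

print(max_drawdown(portfolio))            # 0.25 (the drop from 120 to 90)
print(round(calmar_ratio(portfolio, 4), 3))
```

A higher Calmar ratio means more annualized return per unit of worst-case loss, which is why it is read alongside the raw return.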
Governed Shared Memory for Multi-Agent LLM Systems
The paper introduces governed shared memory primitives for multi-agent LLM systems, addressing four failure modes: unauthorized leakage, stale propagation, contradiction persistence, and provenance collapse. It presents MemClaw, a production multi-tenant memory service implementing scoped retrieval, temporal supersession, provenance tracking, and policy-governed propagation, evaluated via the ArgusFleet harness. Key results include 100% provenance reconstruction accuracy for depth-four chains, zero cross-fleet leakage, and sub-second write-to-visible latency, while revealing asymmetric scope enforcement and pipeline ordering conflicts in production.
multi-agent llm · governed shared memory · provenance tracking · temporal supersession · scoped retrieval
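To make three of the primitives named above more concrete (scoped retrieval, temporal supersession, provenance tracking), here is a toy in-memory sketch. Every class, field, and scope name is hypothetical; it is not MemClaw's API, only one way such primitives could fit together.

```python
from dataclasses import dataclass


@dataclass
class Record:
    key: str
    value: str
    scope: str          # which fleet or tenant may see this record
    written_by: str     # provenance: the agent that wrote it
    version: int
    superseded: bool = False


class SharedMemory:
    """Toy governed memory: append-only log with scope checks on read."""

    def __init__(self):
        self._log = []

    def write(self, key, value, scope, agent):
        version = 1
        for rec in self._log:
            if rec.key == key and rec.scope == scope:
                version = max(version, rec.version + 1)
                rec.superseded = True  # temporal supersession: old value goes stale
        self._log.append(Record(key, value, scope, agent, version))

    def read(self, key, allowed_scopes):
        # Scoped retrieval: only live records in scopes the caller may access.
        return [r for r in self._log
                if r.key == key and r.scope in allowed_scopes and not r.superseded]

    def history(self, key, scope):
        # Provenance: the full chain of writes is kept, never overwritten.
        return [(r.version, r.written_by, r.value)
                for r in self._log if r.key == key and r.scope == scope]


memory = SharedMemory()
memory.write("deploy_target", "staging", scope="fleet_a", agent="planner")
memory.write("deploy_target", "production", scope="fleet_a", agent="reviewer")

print([r.value for r in memory.read("deploy_target", {"fleet_a"})])
print(memory.read("deploy_target", {"fleet_b"}))
print(memory.history("deploy_target", "fleet_a"))
```

Keeping superseded records in the log rather than deleting them is what lets the provenance chain be reconstructed later.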
Poster: Exploring the Limits of Audio-Based Detection of Turkish Phone Call Scams
This work introduces the first public multi-modal dataset of 100 aligned audio-transcript pairs for Turkish scam call detection, addressing a gap in low-resource language fraud prevention research. Seven large language models (Gemini 2.5, GPT-4o, Qwen) were evaluated across three input modalities: raw audio, automatic speech-to-text transcripts, and human-refined transcripts. Results demonstrate that transcript-based inputs consistently outperform direct audio processing, with minimal performance difference between human-corrected and uncorrected transcripts. The study highlights the need for culturally inclusive AI safety research and robust multi-modal systems in fraud detection.
multi-modal dataset · low-resource language · speech-to-text · fraud prevention · large language models
Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation
The paper proposes a reinforcement learning framework for Computer-Use Agents (CUAs) that uses autonomous vision-language evaluation as a scalable reward signal for GUI interaction tasks. The method employs a Vision-Language Model to assess task completion from final screenshots and instructions, modeling evaluator feedback as noisy binary rewards with a noise-corrected estimator for Proximal Policy Optimization. Experiments on macOSWorld, Windows Agent Arena, and OSWorld demonstrate 12.6 and 5.1 percentage point improvements over zero-shot baselines and raw evaluator rewards respectively, validating the approach for GUI-based RL with imperfect evaluators.
computer-use agents · vision-language models · proximal policy optimization · gui interaction · noisy rewards
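One common way to correct noisy binary rewards, sketched below, assumes the evaluator's false-positive and false-negative rates are known and rescales the observed reward so that its expectation equals the true reward. This is a standard unbiased correction for class-conditional noise; whether the paper uses exactly this estimator is an assumption, and the rates below are hypothetical.

```python
def corrected_reward(observed, fpr, fnr):
    """Unbiased estimate of the true binary reward from a noisy verdict.

    fpr: P(evaluator says success | task failed)
    fnr: P(evaluator says failure | task succeeded)
    """
    denom = 1.0 - fpr - fnr
    if denom <= 0:
        raise ValueError("evaluator must be better than chance (fpr + fnr < 1)")
    return (observed - fpr) / denom


def expected_corrected(true_success, fpr, fnr):
    """Expectation of the corrected reward over the evaluator's noise."""
    p_pass = (1.0 - fnr) if true_success else fpr
    return (p_pass * corrected_reward(1, fpr, fnr)
            + (1.0 - p_pass) * corrected_reward(0, fpr, fnr))


# Hypothetical error rates for a vision-language evaluator.
FPR, FNR = 0.1, 0.2

print(round(corrected_reward(1, FPR, FNR), 3))  # above 1 for a "success" verdict
print(round(corrected_reward(0, FPR, FNR), 3))  # below 0 for a "failure" verdict
print(round(expected_corrected(True, FPR, FNR), 6))
print(round(expected_corrected(False, FPR, FNR), 6))
```

Individual corrected rewards fall outside [0, 1], but their average over the noise matches the true outcome, which is what a policy-gradient method such as PPO needs.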
A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial
The study introduces RaDaR, a 32B-parameter open-source reasoning LLM specialized for rare disease diagnosis, addressing data scarcity through 49,170 real and 104,666 synthetic cases with reasoning-enhanced training. Evaluation across public benchmarks and four external centers showed RaDaR outperformed larger models like DeepSeek-R1 (671B), achieving 61.06% diagnostic prioritization before clinical suspicion (1.87-month lead time) and improving physician accuracy by 21.44 percentage points in a randomized trial. Phenotype-anchored synthetic data demonstrated monotonic scaling benefits for long-tail rare diseases.
large language model · rare disease diagnosis · synthetic data augmentation · reasoning-enhanced training · phenotype-anchored narratives
A Fair Evaluation of Graph Foundation Models for Node Property Prediction
This work conducts a rigorous evaluation of 9 Graph Foundation Models (GFMs) for node property prediction, comparing them against strong Graph Neural Network baselines under a unified evaluation framework. The study focuses on GFMs designed for tasks such as fraud detection and recommendation systems, where inconsistent evaluation methodologies have hindered reliable comparisons. Results indicate that only the most recent GFMs, based on the Prior-data Fitted Networks paradigm, outperform well-tuned GNNs in predictive accuracy, albeit with increased inference costs.
graph foundation models · node property prediction · prior-data fitted networks · graph neural networks · inference cost
CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation
CrossPool introduces an efficient serving engine for cold MoE models by disaggregating weights and KV-cache into separate GPU memory pools. The system employs a KV-cache planner, layer-wise pipeline scheduler, and persistent kernels to optimize memory utilization and reduce control overhead. By dynamically sharing KV-cache capacity across models and consolidating FFN weights, CrossPool handles bursty long-context requests while minimizing GPU memory waste. Experimental results demonstrate a 10.4× reduction in P99 tail burst time compared to state-of-the-art KV-cache-based serving systems.
crosspool · kv-cache · moe · gpu · ffn
On the Smallness of the Large Language Models Scaling Exponents
The paper analyzes the unsustainability of current Large Language Models (LLMs) scaling exponents in terms of energy consumption, demonstrating that the observed small exponents persist even after accounting for the 'pedestal effect' (a numerical bias from neglecting non-zero loss at infinite data). Methodologically, it draws analogies to phenomenological fluid turbulence models to examine how data smoothness/roughness impacts scaling behavior. Results indicate that neither the pedestal effect correction nor data texture considerations resolve the fundamental energy inefficiency implied by these scaling laws.
scaling exponents · pedestal effect · energy efficiency · large language models · fluid turbulence analogy
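The pedestal effect described above can be illustrated numerically. The sketch assumes a loss curve of the form L(D) = L_inf + A * D^(-alpha) with made-up constants, and compares the log-log slope measured with and without subtracting the irreducible loss L_inf. It shows only the bias itself, not the paper's finding that small exponents persist after the correction.

```python
from math import log

# Hypothetical constants for L(D) = PEDESTAL + A * D ** (-TRUE_ALPHA).
TRUE_ALPHA, PEDESTAL, A = 0.3, 1.5, 10.0


def loss(d):
    return PEDESTAL + A * d ** (-TRUE_ALPHA)


def loglog_slope(d1, d2, l1, l2):
    """Apparent scaling exponent between two dataset sizes."""
    return -(log(l2) - log(l1)) / (log(d2) - log(d1))


d1, d2 = 1e6, 1e8
l1, l2 = loss(d1), loss(d2)

naive = loglog_slope(d1, d2, l1, l2)                            # ignores the pedestal
corrected = loglog_slope(d1, d2, l1 - PEDESTAL, l2 - PEDESTAL)  # subtracts it first

print(round(naive, 4), round(corrected, 4))
```

In this toy setting the naive fit reports an exponent far below the true value, because the constant floor flattens the curve on a log-log plot.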
Red-Teaming the Agentic Red-Team
The paper presents the first security analysis of widely used agentic systems for offensive security operations, identifying common design flaws that enable adversaries to exfiltrate API keys, establish persistence, and compromise operator machines. The authors introduce a full cyber kill chain for agentic systems, detailing attack progression from LLM manipulation to sandbox escape. Based on their analysis, they propose a robust architecture and design principles to mitigate these vulnerabilities at the architectural level.
agentic systems · cyber kill chain · llm manipulation · sandbox escape · api key exfiltration
RetiSEM: Generalising Causal Models for Fragmented Biomedical Data
RetiSEM introduces a domain-constrained structural equation modeling framework for causal graph recovery in fragmented biomedical data, addressing challenges of incomplete clinical, molecular, and imaging variables. The method organizes variables into biologically informed blocks, applies forbidden-edge constraints, and decomposes pathway-level effects into TE, NDE, and NIE components. Evaluated across ten synthetic benchmarks and a real-world NHANES-retinal dataset, RetiSEM achieves lower structural error and higher causal accuracy than unconstrained baselines, demonstrating efficacy in limited-resource settings.
structural equation modeling · causal graph recovery · mediation analysis · biomedical ai · forbidden-edge constraints
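The TE, NDE, and NIE components mentioned above are the total, natural direct, and natural indirect effects from mediation analysis. The sketch below computes them for a noise-free linear model with one exposure X, one mediator M, and one outcome Y. The coefficients are hypothetical, and the linear form is an assumption made for illustration rather than RetiSEM's actual model.

```python
def mediator(x, a):
    """M = a * X (noise omitted for clarity)."""
    return a * x


def outcome(x, m, c, b):
    """Y = c * X + b * M (noise omitted for clarity)."""
    return c * x + b * m


def effects(a, b, c, x0=0.0, x1=1.0):
    """Return (TE, NDE, NIE) for moving X from x0 to x1."""
    m0, m1 = mediator(x0, a), mediator(x1, a)
    baseline = outcome(x0, m0, c, b)
    te = outcome(x1, m1, c, b) - baseline   # change X, let M follow
    nde = outcome(x1, m0, c, b) - baseline  # change X, hold M at its x0 value
    nie = outcome(x0, m1, c, b) - baseline  # hold X, move M as if X had changed
    return te, nde, nie


# Hypothetical path coefficients: X -> M (a), M -> Y (b), X -> Y (c).
te, nde, nie = effects(a=0.5, b=0.8, c=0.3)
print(round(te, 3), round(nde, 3), round(nie, 3))
```

In a linear model without interaction the total effect splits exactly into direct plus indirect parts, here c and a * b.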
Adaptive Machine Learning Framework for UAV Trajectory Optimization in O-RAN
The paper introduces an adaptive UAV trajectory optimization framework for O-RAN that reduces retraining overhead through continual transfer learning. The method maintains a library of pre-trained models, selects the most relevant one via a similarity-based mechanism, and falls back to continuous refinement when no suitable model exists, using ray-traced city maps for reliability. Simulations show 44-56% faster convergence versus scratch training and up to 40% improvement over traditional transfer learning without model selection.
uav trajectory optimization · o-ran · continual transfer learning · model selection · ray tracing
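The select-or-fall-back logic described above might look roughly like the sketch below. It assumes each pre-trained model is summarized by a feature vector and uses cosine similarity with a fixed threshold; the feature vectors, model names, and threshold are all hypothetical, and the paper's similarity mechanism may differ.

```python
from math import sqrt


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_model(library, env_features, threshold=0.9):
    """Return (model_name, score); model_name is None when nothing is close enough."""
    name, score = max(
        ((n, cosine(f, env_features)) for n, f in library.items()),
        key=lambda item: item[1],
    )
    return (name, score) if score >= threshold else (None, score)


# Hypothetical library: environment descriptors for two pre-trained policies.
library = {
    "dense_urban": [0.9, 0.8, 0.1],
    "suburban": [0.3, 0.2, 0.7],
}

print(select_model(library, [0.85, 0.75, 0.15]))  # close match: reuse dense_urban
print(select_model(library, [0.1, 0.9, 0.1]))     # no match: fall back to refinement
```

A `None` result corresponds to the fallback branch, where training continues by refining rather than reusing a library model.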
video-SALMONN-R$^3$: Learning to ReWatch, ReAsk, and ReAnswer for Efficient Video Understanding
The paper introduces video-SALMONN-R$^3$, the first end-to-end video-LLM that enables efficient video understanding through a re-watch mechanism without chain-of-thought (CoT) cold-start. The method employs reinforcement learning to localize relevant segments, then revisits them at higher fidelity, while avoiding CoT-based supervised fine-tuning (SFT) to preserve pretrained capabilities. It incorporates a re-answer strategy (initial answer followed by refinement) and re-ask mechanism (query re-injection) to improve question adherence. Experiments show consistent improvements over base models and QA-SFT baselines, with lower computational cost than prior re-watch approaches.
video-llm · reinforcement learning · supervised fine-tuning · question answering · temporal localization
G$^3$VLA: Geometric inductive bias for Vision-Language-Action Models
G$^3$VLA introduces a geometric inductive bias for Vision-Language-Action (VLA) models by incorporating calibrated camera geometry into visual tokens without modifying the action space or imitation objective. The method employs intrinsic-conditioned ray embeddings, projective positional encoding (PRoPE), and bidirectional cross-view fusion, supervised by ground-truth point maps or confidence-gated $π^3$X teacher predictions. Evaluated on $π_0$, G$^3$VLA demonstrates consistent performance improvements across LIBERO suites, RoboCasa24, RoboTwin2.0, and real-robot settings, particularly in spatially and object-sensitive tasks. Further validation on $π_{0.5}$ and GR00T 1.5 suggests that geometric transfer is most effective when geometry-aware tokens directly influence action generation.
geometric inductive bias · vision-language-action models · projective positional encoding · bidirectional cross-view fusion · confidence-gated predictions
The Latent Bridge: A Continuous Slow-Fast Channel for Real-Time Game Agents
The paper introduces a Latent Bridge (L) for coupling frozen reactive (9B) and reasoning (8B) VLMs in real-time game agents, projecting slow-model residuals into the fast model's input-embedding space without text round-trip. Compared to Text Bridge (T) and Fast-Only (F) baselines, L matches or exceeds T performance across 7 Atari games and MetaDrive, with significant gains in MsPacman (+57%) and RoadRunner (+28%). Bridge utility is predictable (r=0.93 correlation between T and L gains) and mutually exclusive with T. Results include reproducible pipelines and replay recordings.
latent bridge · real-time agents · residual projection · frozen models · vlm coupling
CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference
CompressKV introduces a semantic-retrieval-guided KV-cache compression framework for GQA-based LLMs, addressing memory and decoding cost constraints in long-context inference. The method identifies Semantic Retrieval Heads (SRHs) to retain critical tokens and allocates cache budgets layer-wise based on eviction error estimates. Evaluations on LongBench and Needle-in-a-Haystack show CompressKV preserves 97% of full-cache performance with 3% KV cache and achieves 90% accuracy with 0.7% storage, outperforming existing eviction methods.
kv-cache compression · semantic retrieval heads · long-context inference · gqa-based llms · cache budget allocation
The African Language Tax: Quantifying the Cost, Latency, and Context Penalty of Tokenizing African Languages in Frontier LLMs
The study quantifies the tokenization penalty for African languages in frontier LLMs, revealing systematic disadvantages in cost, latency, and context capacity. Using parallel corpora from FLORES-200+ and MAFAND-MT across 20 African languages (spanning 5 families and 3 scripts), the authors measure token-fertility ratios against English for 11 tokenizers. Results show median premiums of 1.88x (GPT-5/o200k_base) up to 8.92x (N'Ko), with Ethiopic and N'Ko scripts facing 7-9x penalties. Gemma 4 reduces the mean premium from 3.31x to 2.38x but fails to eliminate disparities. The work provides measurement tools (afri-fertility), a leaderboard, and mitigation guidance.
token-fertility · subword penalty · parallel corpora · inference cost · context window
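A token-fertility ratio of the kind measured above can be sketched as the token count of the target-language side of a parallel corpus divided by the token count of the English side. The example uses a stand-in tokenizer that emits one token per UTF-8 byte, which is an assumption for illustration and not one of the 11 tokenizers in the study; the single sentence pair is a toy example, not drawn from FLORES-200+ or MAFAND-MT.

```python
def fertility_ratio(pairs, tokenize):
    """Tokens needed for the target side per token needed for the English side."""
    english_tokens = sum(len(tokenize(en)) for en, _ in pairs)
    target_tokens = sum(len(tokenize(tgt)) for _, tgt in pairs)
    return target_tokens / english_tokens


def byte_tokenizer(text):
    # Stand-in tokenizer (assumption): one token per UTF-8 byte.
    return list(text.encode("utf-8"))


# Toy parallel pair: English and a greeting in Ethiopic script.
pairs = [("hello", "ሰላም")]

ratio = fertility_ratio(pairs, byte_tokenizer)
print(ratio)  # 1.8 with the byte-level stand-in
```

Any callable that maps a string to a token list can be passed as `tokenize`, so a real tokenizer can be swapped in without changing the ratio computation.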
Bayesian control for coding agents
We propose Bayesian control as a cost-sensitive sequential hypothesis testing framework for orchestrating tool use in coding agents. The method maintains a belief state over candidate correctness and dynamically decides between evidence gathering, refinement, verification, or termination based on uncertainty. Evaluated across six LLM generators and nine coding benchmarks, Bayesian control demonstrates superior performance when verification costs are high and critics are informative yet imperfect. Additionally, the belief state provides an interpretable correctness score that outperforms token-probability and raw tool-success baselines for uncertainty quantification.
bayesian control · coding agents · sequential hypothesis testing · uncertainty quantification · cost-sensitive
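The belief-state idea above can be sketched with a single Bayes update per critic verdict and a threshold rule for choosing the next action. The critic's pass rates, the thresholds, and the action names are hypothetical, and the fixed thresholds are a simplification of the cost-sensitive decision rule the summary describes.

```python
def update_belief(prior, passed, tpr, fpr):
    """Posterior P(candidate is correct) after one critic verdict.

    tpr: P(critic passes | candidate correct)
    fpr: P(critic passes | candidate incorrect)
    """
    like_correct = tpr if passed else 1.0 - tpr
    like_wrong = fpr if passed else 1.0 - fpr
    numerator = prior * like_correct
    return numerator / (numerator + (1.0 - prior) * like_wrong)


def choose_action(belief, accept_at=0.95, refine_at=0.2):
    """Simplified policy: stop when confident, refine when doubtful, else probe."""
    if belief >= accept_at:
        return "terminate"
    if belief <= refine_at:
        return "refine"
    return "gather_evidence"


# Three passing verdicts from an informative but imperfect critic.
belief = 0.5
for passed in [True, True, True]:
    belief = update_belief(belief, passed, tpr=0.9, fpr=0.3)
    print(round(belief, 4), choose_action(belief))

action = choose_action(belief)
```

The running belief doubles as the interpretable correctness score mentioned in the summary, since it is a calibrated probability under the assumed critic rates.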
NoContactNoWorries: Estimating Contact through Vision and Proprioception for In-Hand Dexterous Manipulation
The paper introduces NoContactNoWorries, a transformer-based multimodal framework for binary contact estimation in dexterous manipulation. By fusing RGB-D vision with proprioceptive data, the method eliminates reliance on tactile sensors, addressing cost and fragility limitations. The model predicts contact states during hand-object interactions, enabling downstream reinforcement learning for in-hand object reorientation. Validation across simulation and real-world experiments demonstrates generalization to novel objects. Project page: https://soham2560.github.io/no-contact-no-worries/
dexterous manipulation · contact estimation · multimodal fusion · transformer · proprioception
ReM-MoA: Reasoning Memory Sustains Mixture-of-Agents Scaling
ReM-MoA introduces a memory-augmented Mixture-of-Agents (MoA) framework that sustains scaling through Ranked Reasoning Memory and Curated Diversified Memory Routing. The Ranked Reasoning Memory persistently stores and ranks reasoning traces using a comparative Reviewer Agent, while the routing scheme exposes agents to distinct combinations of successful and failed traces, preserving exploration diversity. An optional multi-domain Reviewer distillation pipeline enhances ranking quality via frontier-model supervision. Evaluated across five reasoning benchmarks in math, formal logic, code, knowledge, and commonsense, ReM-MoA consistently outperforms prior MoA variants, with advantages widening with depth, establishing structured cross-layer reasoning memory as crucial for scalable multi-agent inference.
mixture-of-agents · ranked reasoning memory · curated diversified memory routing · reviewer distillation pipeline · scalable multi-agent inference
MedPCFM: Improving Medical Point Cloud Completion by Integrating Point Transformers and Flow Matching
The paper introduces MedPCFM, a medical point cloud completion method combining Point Transformer v3 (PTv3) with flow matching. The approach addresses generative modeling for anatomical reconstruction, evaluated on SkullFix, SkullBreak, and Mandibular Defect datasets. Compared to deterministic PTv3 baselines and diffusion-based completion (PCDiff), MedPCFM achieves state-of-the-art generative performance with fewer sampling steps, offering 7× speed-up over PVCNN backbones. Scaling analysis reveals consistent improvements with higher point resolution and model size trade-offs.
point cloud completion · flow matching · point transformer · diffusion models · medical imaging
Transformation Behavior of Images in Latent Space
The study analyzes the transformation behavior of histopathology images in latent space, comparing embeddings from original and transformed images to assess encoder network robustness. Using encoder networks from Lunit Inc., Bioptimus, and Meta Research Team, the authors evaluate embeddings generated from hematoxylin/eosin-stained colorectal tissue and TCGA datasets. Results indicate that embeddings of transformed images remain closer to original embeddings than to random embeddings, demonstrating partial robustness to transformations. However, encoder networks do not fully neutralize transformation effects, explaining the utility of transformation-mediated data augmentation. Significant performance differences were observed between general and histopathology-specific encoder networks.
latent space · encoder networks · histopathology · data augmentation · embeddings
Detecting AI Coding Agents in Open Source: A Validated Multi-Method Census of 180 Million Repositories
The study presents a multi-method framework for detecting AI coding agents in open-source repositories, combining configuration-file scanning, commit-message analysis, author-identity matching, and bot-signature lookup across 180M+ Git repositories. Validated on 495 hand-labeled samples, the method reveals significant undercounting by single-signal approaches (e.g., bot-account lookup captures only 3.3% of Claude Code commits). Analysis of snapshots from December 2024 to April 2026 identifies 850,157 Claude Code commits, with distinct adoption patterns: PR-deployed agents (Codex, Cursor) focus on features, while commit-deployed agents (Claude Code, OpenHands) handle maintenance. Multi-channel detection shows 79% of Claude Code adopters are missed by PR-based censuses.
git repositories · commit-message analysis · bot-signature lookup · claude code · configuration-file scanning
Can Aggregate Invariants Accelerate Continuous Subgraph Matching? Limits, Laws, and a Dynamic Spectral Index
The paper investigates whether spectral filtering techniques, effective for static subgraph matching, can accelerate continuous subgraph matching (CSM) in dynamic graphs. It establishes three key findings: (1) lazy spectral bound maintenance loses pruning power after four updates, (2) selective exact maintenance of small-neighborhood spectra is computationally feasible (microseconds per update), and (3) integrated spectral tests reduce candidates by up to 51% and skip 47% of update enumerations without altering intermediate results. The study introduces an intermediate-invariance methodology for evaluating CSM filters and releases a dynamic local-spectra index.
spectral filtering · continuous subgraph matching · dynamic graphs · laplacian interlacing · local-spectra index
Agentic AI for Bilevel Long-Term Optimization of Policy-Driven Physical Layer Systems
The paper introduces Agentic long-term performance optimization (Agentic-LTPO), a bilevel optimization framework for adaptive physical layer problem configuration. It employs agentic AI to generate upper-level configurations, translating evolving operator policies, environment summaries, and historical experiences into structured lower-level optimization problem configurations. The lower level solves these problems for real-time physical-layer decisions. Using cell-free MIMO beamforming as a case study, Agentic-LTPO incorporates a multi-agent decision process with retrieval-augmented experience-based verification in the upper level and a closed-form beamformer in the lower level. Experiments show a 57.2% improvement in long-term system performance compared to traditional methods.
bilevel optimization · agentic ai · physical layer · cell-free mimo · beamforming
Cycle-Consistent Neural Explanation of Formal Verification Certificates
The authors propose a cycle-consistent neural architecture for generating natural language explanations of formal verification certificates, addressing their opacity to non-specialists. The method employs a forward network (NN1) to map certificates to explanations and an inverse network (NN2) to reconstruct certificates, with a symbolic verifier ensuring cycle-consistent faithfulness via differentiable verification. A pointer-generator mechanism lexically grounds explanations by copying state names. Evaluated on 420 certificates across six verification methods in a financial compliance domain (207 states), the system achieves 90.0% cycle-verified soundness, outperforming multi-LLM few-shot baselines (76.1%) by 13.9 percentage points while offering 860x faster inference (185 ms vs. 160 s) and deterministic outputs.
cycle-consistency · formal verification · pointer-generator · symbolic verifier · lexical grounding
Entity Resolution via Batched Oracle Queries
The authors propose an optimal batched entity resolution method for datasets exceeding oracle batch processing capacity, where no batch contains all records of any entity. They formalize batched entity resolution, prove optimal batch selection is NP-hard, and provide an optimal solution under a natural entity size condition. The approach enables pay-as-you-go cost control while maximizing recall at each step. Evaluation on six datasets demonstrates superiority over state-of-the-art baselines.
batched entity resolution · oracle queries · pay-as-you-go · np-hard · recall maximization
Average Rankings Mask Per-Subject Optimality: A Friedman-Nemenyi Benchmark of EEG Motor-Imagery BCI Decoders
The study challenges claims of universal superiority for specific EEG motor-imagery decoding pipelines by demonstrating substantial inter-individual variability in optimal methods. Using the MOABB framework, researchers evaluated 1,056 configurations (feature extractor × scaler × classifier) across three datasets (PhysionetMI, Cho2017, Zhou2016) with >340,000 subject-level fits, employing Friedman-Nemenyi statistics. Results show covariance tangent-space projection and Common Spatial Patterns as top-performing families, but their ranking varies by dataset (p = 0.27 for PhysionetMI). Only 35% of participants benefited from the single best pipeline, with personalized selection yielding +7% accuracy gains, underscoring the necessity of participant-aware model selection.
eeg decoding · motor-imagery bci · friedman-nemenyi test · covariance tangent-space · common spatial patterns
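The gap between "best on average" and "best per subject" described above can be shown with a small accuracy table. The pipeline names and per-subject accuracies below are invented for illustration and are not results from the benchmark.

```python
def best_on_average(acc):
    """Pipeline with the highest mean accuracy across subjects."""
    return max(acc, key=lambda p: sum(acc[p]) / len(acc[p]))


def share_best_served(acc):
    """Fraction of subjects for whom the average-best pipeline is also their best."""
    top = best_on_average(acc)
    n = len(acc[top])
    wins = sum(all(acc[top][s] >= acc[p][s] for p in acc) for s in range(n))
    return wins / n


def personalized_gain(acc):
    """Mean accuracy gained by picking each subject's own best pipeline."""
    top = best_on_average(acc)
    n = len(acc[top])
    per_subject_best = [max(acc[p][s] for p in acc) for s in range(n)]
    return sum(per_subject_best) / n - sum(acc[top]) / n


# Hypothetical accuracies: acc[pipeline][subject_index].
acc = {
    "tangent_space": [0.80, 0.60, 0.85, 0.65],
    "csp": [0.70, 0.75, 0.70, 0.70],
}

print(best_on_average(acc))
print(share_best_served(acc))
print(round(personalized_gain(acc), 3))
```

Here the average-best pipeline is optimal for only half of the toy subjects, and per-subject selection adds five accuracy points, mirroring the kind of effect the study reports.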
ATRIA: Adaptive Traceable ECG Reporting with Iterative Agents
The authors present ATRIA, a multi-agent system for adaptive ECG reporting that decouples interpretation and reporting while enabling iterative refinement. The system binds report claims to supporting evidence, flags unsupported statements, incorporates mid-session context, and allows clinician verification of individual findings. ATRIA leverages clinically validated ECG analysis models and operates as a cloud-based web service. Four interaction cases demonstrate its functionality, with a live demo available.
multi-agent system · ecg reporting · iterative refinement · evidence tracing · clinical decision support
Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War
The paper introduces Age of LLM, a 1v1 benchmark testing LLMs' strategic reasoning under fog of war, diplomacy, and strict action constraints. The private engine uses randomized maps to avoid data contamination, with models receiving minimal rule-based prompts. Results from 54 matches (5,258 actions) show nuclear rushes dominate (78-85% win rates), military conquest is faster (12.3 vs. 18.9 turns), and diplomacy rarely succeeds. Illegal actions (58% fog/state errors) serve as belief-tracking metrics. The corpus provides turn-by-turn behavioral traces for analyzing LLM reasoning under uncertainty. Replay data and engine details are released.
fog of war · json schema · nuclear rush · belief-tracking · data contamination
Female-RHINO: A Real-Time Scanner-Integrated Framework for Automated Quantitative Uterine MRI Analysis and Structured Reporting
Female-RHINO introduces a real-time AI-assisted framework for automated quantitative uterine MRI analysis and structured reporting during image acquisition. The end-to-end system integrates inline MRI scanner communication with deep learning-based analysis, combining segmentation and anatomical landmark detection models trained on 500+ multi-center datasets. It performs uterine volumetry, detects incidental findings (fibroids, Nabothian cysts), and extracts six anatomical landmarks, compiling results into structured clinician-oriented reports. Evaluation demonstrated robust performance: mean Dice coefficients of 0.82 (uterus) and 0.80 (fibroids), 3.7mm mean radial error for landmarks, and end-to-end processing in <70 seconds. Prospective deployment yielded standardized, reproducible analyses with inter-observer agreement.
uterine mri · deep learning · segmentation · anatomical landmarks · structured reporting
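The Dice coefficients quoted above measure overlap between a predicted and a reference segmentation mask. A minimal sketch on flattened binary masks follows; the masks are made up and far smaller than a real MRI volume.

```python
def dice(pred, truth):
    """Dice coefficient: 2 * |intersection| / (|pred| + |truth|) on binary masks."""
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 2 * intersection / total if total else 1.0  # two empty masks agree fully


# Hypothetical flattened masks (1 = voxel labeled as the structure).
predicted = [1, 1, 1, 0, 0, 0]
reference = [0, 1, 1, 1, 0, 0]

print(round(dice(predicted, reference), 3))
```

A value of 1.0 means the masks coincide exactly and 0.0 means they share no voxels, so the reported 0.82 and 0.80 indicate substantial but imperfect overlap.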
PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models
We introduce PHANTOM, a large-scale open-source dataset of 47,524 pre-generated adversarial attacks for vision-language models (VLMs), designed to enhance robustness and safety evaluations. The dataset extends existing benchmarks by covering 10 high-level categories and 55 subcategories of harmful intents, consolidating 7,826 intents from multiple sources and introducing an additional category for broader coverage. Adversarial samples were generated using state-of-the-art attack strategies from recent literature. PHANTOM aims to lower the computational barrier for adversarial research, enabling systematic evaluation of VLM robustness, fine-tuning of attack-generation models, and stress-testing of defensive guardrails under diverse adversarial conditions.
vision-language models · adversarial attacks · robustness evaluation · harmful intents · defensive guardrails
On the Stability of Prompt Ranking in Large Language Model Evaluation
This paper introduces a stability-aware selection strategy for prompt ranking in large language models (LLMs) to address reliability issues in prompt selection. The authors systematically evaluate prompt ranking stability under common sources of variability, such as random seeds and limited evaluation subsets, across three open-weight LLMs and two benchmark tasks. They find that while rank correlations are often moderate to high, the top-performing prompt frequently changes, leading to unreliable decisions. The proposed method, based on a lower confidence bound, improves robustness in unstable settings while remaining competitive in stable regimes, highlighting the importance of accounting for evaluation uncertainty in LLM benchmarking.
prompt ranking · large language models · evaluation uncertainty · stability-aware selection · lower confidence bound
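The idea of selecting by a lower confidence bound rather than by the raw mean can be sketched as follows. This is an illustration only: the digest does not give the paper's exact bound, so the form `mean - z * standard_error`, the function name `select_prompt_lcb`, and the sample scores are all assumptions.

```python
import math
from statistics import mean, stdev


def select_prompt_lcb(scores_by_prompt, z=1.0):
    """Return (prompt_name, lcb) for the prompt with the highest lower bound.

    `scores_by_prompt` maps a prompt name to scores from repeated runs
    (e.g. different seeds or evaluation subsets). The bound used here,
    mean - z * stdev / sqrt(n), is an assumed form for illustration.
    """
    best = None
    for name, scores in scores_by_prompt.items():
        if len(scores) < 2:
            raise ValueError("need at least two runs per prompt")
        lcb = mean(scores) - z * stdev(scores) / math.sqrt(len(scores))
        if best is None or lcb > best[1]:
            best = (name, lcb)
    return best


# Hypothetical scores: A has the higher mean but is much noisier than B.
scores = {
    "A": [0.80, 0.60, 0.90, 0.50],
    "B": [0.68, 0.69, 0.67, 0.68],
}
print(select_prompt_lcb(scores))
```

With these made-up numbers, ranking by mean picks prompt A, while the lower-bound rule picks the more stable prompt B.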
Structural Kolmogorov-Arnold Convolutions: Learnable Function on the Values or the Filter Shape as Parameter-Efficient Alternative to Per-Edge Convolutional KANs
The paper introduces Structural Kolmogorov-Arnold Convolutions as parameter-efficient alternatives to per-edge convolutional Kolmogorov-Arnold Networks (KANs) by placing learnable functions in the convolution's structure rather than on each edge. Three variants are proposed: SV-KAN (shared value function), AG-KAN (adaptive Gaussian gate), and RF-KAN (ridge profiles in Morlet wavelet basis). Evaluated on CIFAR-10/100, RF-KAN and SV-KAN achieve 88.47%/64.40% and 88.20%/64.57% accuracy respectively at ~0.4M parameters, outperforming per-edge KANs and plain convolutions. Ablations show RF-KAN's gains stem from localized oscillatory bases and content adaptivity, with learned shape being critical.
kolmogorov-arnold networks · parameter-efficient · content adaptivity · morlet wavelet · ridge profiles
When Helpfulness Overrides Causal Caution: Context-Dependent Suppression and Recovery in LLMs
The study identifies a systematic suppression of Causal Caution (refraining from causal judgment with insufficient evidence) in LLMs when transitioning from academic to practical advisory contexts. Using Pearl's Causal Hierarchy (PCH score), experiments on Claude Sonnet 4.6, Claude Opus 4.7, GPT 5.5, and Gemini 3.1 Pro (480 trials) showed Causal Caution maintenance rates dropping from 91.7–100.0% to 6.7–18.3% (p < .001). A self-correction prompt restored rates to 71.4–100.0%, suggesting context-dependent suppression rather than capability limits. Results highlight risks in organizational governance and propose multi-agent architectures for mitigation.
causal caution · pearl's causal hierarchy · llm suppression · multi-agent architectures · pch score
Accelerating Disaggregated RL for Visual Generative LLMs with Diffusion-Based Parallelism and Trainer-Assisted Generation
The paper introduces DigenRL, a disaggregated RL framework for diffusion-based generative LLMs that enables flexible resource allocation and heterogeneous GPU deployment. Key innovations include generation-axis pipeline (GAP), time-step parallelism (TSP), and trainer-assisted generation (TAG) to optimize pipeline efficiency. Evaluated on HunyuanVideo-13B, Wan2.1-14B, FLUX.1-12B, and QwenImage-20B models across 16-32 GPU testbeds, DigenRL demonstrates 1.56-2.10x throughput improvements over state-of-the-art systems like veRL-Omni and GenRL.
disaggregated rl · diffusion-based generation · generation-axis pipeline · time-step parallelism · trainer-assisted generation
MVG-KAN: Multi-View Geo-Wind Guided KAN for PM$_{2.5}$ Forecasting
The paper introduces MVG-KAN, a multi-view geo-wind guided model for PM$_{2.5}$ forecasting, addressing limitations of existing spatio-temporal methods in capturing heterogeneous factors like wind-direction-dependent pollutant transport. The method combines three views: local periodic regularity, station-wise residual dynamics, and meteorology-guided spatial dispersion, using a Geo-Wind Graph for spatial prior and a TKAN residual head for autoregressive correction. Results demonstrate improved accuracy in modeling PM$_{2.5}$ evolution through periodic-residual separation and wind-aware spatial propagation.
pm$_{2.5}$ forecasting · geo-wind graph · kolmogorov-arnold network · spatio-temporal modeling · pollutant dispersion
What Does ODRL Mean? A Cross-Level Ontological Grounding of Permissions, Prohibitions, and Duties in UFO-L
The paper contributes a formal ontological grounding of ODRL (Open Digital Rights Language) by introducing the Cross-Level Design Principle, which requires normative languages to distinguish between conduct-level (Permission, Duty, Right, No right) and competence-level (Power, Subjection, Immunity, Disability) positions. Using UFO-L (Unified Foundational Ontology-Legal), the authors map ODRL rules to legal relators, extending coverage from two to eight legal positions and making violation-declaration authority explicit. The formalization is mechanically verified in Isabelle/HOL and tested across a 39-problem benchmark using Vampire, E, and Z3 solvers.
odrl · ufo-l · normative positions · formal semantics · isabelle/hol
ZONOS2 Technical Report
ZONOS2 8B is a state-of-the-art text-to-speech (TTS) model with 8B parameters (900M active), built on a novel MoE backbone that improves latency and throughput. The model scales training data from 200K to 6M hours via an enhanced pipeline and simplifies post-training for better naturalness and voice cloning. Evaluated on quality, speaker similarity, WER, and ZTTS1-Eval, it matches top systems while maintaining low latency. Weights and code are released under Apache 2.0.
tts · mixture-of-experts · voice cloning · naturalness · ztts1-eval
Prob-BBDM: a Probabilistic Brownian Bridge Diffusion Model for MRI sequence image-to-image translation
We introduce Prob-BBDM, a probabilistic Brownian Bridge Diffusion Model for MRI sequence image-to-image translation, addressing the resource-intensive nature of multi-modal medical imaging acquisition. The model employs a variational encoder-guided diffusion mechanism leveraging probabilistic image distributions, achieving high-quality synthesis with only 4 diffusion steps. Evaluated on BraTS 2021, Prob-BBDM attains 88.46% SSIM and 26.09 dB PSNR, outperforming baselines. External dataset validation confirms domain generalizability. Clinical utility assessment via tumor segmentation yields 88.71% Dice score and 3.49 mm HD95, demonstrating preservation of diagnostic information in synthesized slices.
brownian bridge · diffusion model · mri synthesis · image-to-image translation · probabilistic distributions
LemonHarness Technical Report
LemonHarness introduces an execution framework for long-horizon LLM agents, addressing state management challenges by establishing explicit workspace boundaries. The system integrates model invocation, tool execution, and rule knowledge within a controlled environment, structuring state-changing operations (e.g., file writes, dependency installation) through tool interfaces with recorded feedback. It includes a reusable rule knowledge base and time-aware execution mechanism to optimize effort allocation under time constraints. Evaluated on Terminal-Bench 2.0, LemonHarness_GPT-5.3-CodeX achieved 84.49% accuracy (445 trials), while the GPT-5.5 variant reached 86.52% across five jobs, demonstrating improved execution stability.
workspace boundary · state-changing operations · rule knowledge base · time-aware execution · long-horizon agents
Real-Time Interactive Music Generation via Data-Free Streaming Consistency Distillation
The paper introduces a framework for real-time interactive music generation by distilling autoregressive models into low-latency streaming instruments. Key innovations include streaming consistency distillation in latent space without paired training data, using prompt-only inputs to synthesize teacher-guided trajectories, and music-aware consistency objectives combining latent, spectral, and temporal-difference losses. The method achieves single-step generation with preserved acoustic fidelity (timbre, transients, rhythm) through parameter-efficient adaptation, enabling dynamic human steering without audio interruption. This transforms text-to-music models into responsive instruments for live co-creation.
streaming consistency distillation · autoregressive latent space · music-aware objectives · parameter-efficient adaptation · real-time factor
CALIBER: Calibrating Confidence Before and After Reasoning in Language Models
The paper introduces CALIBER, a method for calibrating confidence estimates in reasoning language models both before and after generating answers. It argues that pre-reasoning confidence should predict prompt-level success, while post-reasoning confidence should predict answer correctness, and aligns supervision targets accordingly. Evaluated on BigMathDigits with a 7B model, CALIBER reduces Expected Calibration Error by 52.5% versus baselines while maintaining competitive accuracy (±2.1 points). The 30B variant achieves best ECE on BigMathDigits and strong out-of-distribution performance on GPQA, TriviaQA, and SimpleQA, with ablations confirming robustness under distribution shift.
confidence calibration · reasoning models · expected calibration error · distribution shift · supervision alignment
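Expected Calibration Error (ECE), the metric cited above, is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. A minimal sketch of that standard definition, assuming equal-width bins of the form [a, b) with confidence 1.0 placed in the last bin (the paper's binning scheme is not stated in the digest):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (|bin| / n) * |accuracy(bin) - mean_conf(bin)|.

    `confidences` are floats in [0, 1]; `correct` are 0/1 outcomes.
    Equal-width binning is an assumption made for this illustration.
    """
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Four answers, each given at 0.9 confidence, of which three are correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

In the example the model claims 90% confidence but is right 75% of the time, so the ECE is about 0.15.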
Tractable Reasoning and Conjunctive Query Answering for Defeasible DL-Lite under Rational Closure
The paper introduces a plug-in architecture for efficient reasoning and conjunctive query (CQ) answering under Rational Closure (RC) in the DL-Lite family of lightweight description logics. The method leverages existing classical reasoners to handle defeasible knowledge, focusing on the Core and Horn variants of DL-Lite. Results demonstrate that both entailment (instance checking) and CQ answering can be performed with minimal computational overhead, maintaining tractability. This approach provides a practical solution for non-monotonic reasoning in DL-Lite, extending its applicability to defeasible scenarios.
rational closure · dl-lite · conjunctive query · defeasible reasoning · instance checking
Pigeonholing: Bad prompts hurt models to collapse and make mistakes
The paper introduces 'pigeonholing,' a phenomenon where unintentionally bad contexts degrade Large Language Model (LLM) performance, causing mode collapse and errors. The authors investigate pigeonholing in two scenarios: user-suggested solutions and contexts containing the assistant's previous incorrect responses. Experiments across 10 tasks with 10 models reveal that pigeonholing leads to repeating incorrect answers (38-40% performance drop), narrowing answer diversity in coding and text generation, and flipping stances on controversial topics. Performance degradation worsens with conversation turns (additional 14+% drop over 5 turns). The proposed RLVR with synthetic errors mitigates this, improving models by 43-60% under bad contexts.
pigeonholing · mode collapse · in-context learning · rlvr · large language models
Neural Network-Based Parametric Model Reduction for Predicting Turbulent Flow for Different Vehicle Geometries
The study extends neural-network-based parametric model reduction for turbulent flow prediction by incorporating a variational autoencoder to handle high-Reynolds-number flows around multiple vehicle geometries. The method combines nonlinear subspace projection with a time-evolution technique, implemented via distributed parallel training to process high-resolution flow field data. Evaluation focuses on reconstruction accuracy of vortex generation across spatial and temporal scales, particularly near vehicle rear ends, using compact latent representations.
parametric model reduction · variational autoencoder · high-reynolds-number flows · distributed parallel training · latent representation
SURGELLM: Rethinking Multi-Task Evaluation through Task-Aware Feature Gating with Class-Balanced Normalization
The paper introduces SURGELLM, a transformer framework addressing three challenges in multi-task NLP: mismatched inductive biases, class-imbalance corruption, and lack of lexical knowledge conditioning. The method combines surgical feature gating (learned sigmoid over lexical indicators), task-conditioned prefix tokens (quantized features prepended to inputs), and Instance-Weighted Normalization (IWN) to remove class-prior bias. Evaluated on SST-2, multi-hop retrieval, LLM-prompt attribution, and authorship detection (17,830 examples, 11 model variants), IWN achieves 0.940 macro-F1 (+0.036 over non-IWN baselines), with lexical gains confirmed by a random-vocabulary control (-0.028 F1).
surgical feature gating · instance-weighted normalization · task-conditioned prefix tokens · multi-task evaluation · lexical knowledge
Social Structure Matters in 3D Human-Human Interaction Generation
The paper introduces a planner-executor framework for text-driven 3D human-human interaction (HHI) generation, addressing the challenge of modeling social structure (phase progression, roles, coordination). It first analyzes LLMs' capability boundaries, showing they can infer interaction phases and roles but fail at motion generation. The proposed Solo-to-Social framework uses an LLM planner to decompose interactions into phases and assign roles, then a motion executor (adapted from a pretrained solo model via LoRA, self-conditioning, and partner conditioning) to generate physically plausible, coordinated motion. Results demonstrate improved phase consistency, role alignment, and partner-awareness in generated HHIs.
human-human interaction · social structure · phase decomposition · planner-executor · motion generation
Probing the Misaligned Thinking Process of Language Models
The study introduces a method for detecting misaligned behaviors in large language models by decomposing them into 18 fine-grained cognitive processes (misalignment indicators) and using linear probes on internal activations. An automated meta-plan-guided pipeline generates multi-turn training conversations, evaluated on an out-of-distribution suite combining behavioral elicitation, benchmarks, and benign conversations. Probes achieve 0.935 AUROC on misalignment detection while maintaining low false positives, with in-depth analysis of internal representations.
linear probes · misalignment indicators · meta-plan-guided pipeline · out-of-distribution evaluation · internal activations
AutoSpec: Safety Rule Evolution for LLM Agents via Inductive Logic Programming
AutoSpec introduces a framework for evolving safety rules in LLM agents via inductive logic programming (ILP) and counterexample-guided inductive synthesis (CEGIS). The method iteratively evaluates expert-designed rules, mines false-positive/negative counterexamples, uses ILP to identify discriminating predicates, generates candidate rule edits, and verifies revisions until convergence. Evaluated on 291 execution traces across code execution and embodied agent domains, AutoSpec achieves F1 scores of 0.98 and 0.93, reduces false positives by up to 94%, and converges in 4-5 iterations. ILP-guided CEGIS outperforms heuristic CEGIS by up to 4.8x in F1, producing interpretable, generalizable rules.
inductive logic programming · counterexample-guided synthesis · safety rules · large language model · false positives
Towards Federated Long-Tailed Graph Learning: An Energy-Guided Dual Decoupling Approach
The paper proposes FedEPD, a federated graph learning framework addressing long-tailed distributions through dual decoupling of topological purification and semantic recalibration. The method employs distribution-aware Dirichlet energy pruning to filter heterophilic edges and extracts global prototypes from topologically central nodes, injecting them via spatial low-pass prototype injection. A two-stage optimization strategy preserves majority boundaries while improving minority accuracy. Experiments show state-of-the-art performance with absolute gains up to 4.97% (Accuracy) and 5.48% (Macro-F1) across benchmarks.
federated graph learning · long-tailed distribution · dirichlet energy · heterophilic edges · prototype injection
SP-Mind: An Autonomous Reasoning Agent for Spatial Proteomics Analysis
SP-Mind introduces the first autonomous AI agent for unified spatial proteomics analysis, addressing fragmentation in current workflows by enabling end-to-end processing from raw multiplexed tissue imaging to phenotype discovery. The agent leverages expert-curated biological analysis skills and computational tools, converting natural-language queries into analytical workflows without task-specific fine-tuning. Evaluated on SP-Bench, a comprehensive benchmark with 102 tasks across 18 categories, SP-Mind demonstrates state-of-the-art performance compared to existing open-source biomedical agent baselines, enhancing scalability and reproducibility in spatial proteomics research.
spatial proteomics · autonomous agent · multiplexed tissue imaging · phenotype discovery · natural-language queries
FlowR2A: Learning Reward-to-Action Distribution for Multimodal Driving Planning
FlowR2A introduces a novel approach to multimodal driving planning by learning reward-to-action distributions, resolving the tension between scoring-based and anchor-based methods. The method employs a flow-matching decoder to generate reward-conditioned action distributions from dense trajectory-reward pairs, integrating dense supervision with dynamic proposal generation. It incorporates fine-grained per-timestep reward conditioning and reward noise augmentation to balance safety and progress objectives. FlowR2A supports controllable test-time sampling through reward guidance and anchored sampling, producing high-quality proposals. The model achieves state-of-the-art results on the NAVSIM v1 and v2 benchmarks, demonstrating superior multimodal proposal quality compared to prior methods.
multimodal driving planning · reward-to-action distribution · flow-matching decoder · reward conditioning · anchored sampling
Exploring the relationship between human-centric AI and firm idiosyncratic risks
This study examines the underexplored relationship between human-centric AI (HCAI) adoption and firm idiosyncratic risk (IR), proposing that HCAI reduces IR by mitigating ethical risks and enhancing AI-human synergies. The authors integrate situated AI theory with socio-technical systems theory, analyzing a multi-source panel dataset of Chinese listed firms (2015-2023) while considering moderating factors like digitalization and executive shareholding. Results show HCAI correlates with lower IR (β=-0.12, p<0.01), with digitalization and executive shareholding strengthening this effect, while operational efficiency and CEO IT background unexpectedly weaken it.
human-centric ai · idiosyncratic risk · situated ai theory · socio-technical systems · ethical risk mitigation
Inclusive Interactive Collisions for Multi-View Consistent Compositional 3D Generation
I2C-3D introduces an optimization-based method for generating multi-view consistent compositional 3D assets with physically plausible interactions. The approach combines Inclusive Interactive Collisions to guide Gaussian primitives into coherent interaction regions and Multi-View Adaptive Score Distillation Sampling to enhance cross-view consistency by modulating attention maps across viewpoints. Experiments show I2C-3D outperforms existing methods in generation quality and multi-view consistency while supporting flexible 3D editing.
3d generation · gaussian primitives · score distillation sampling · multi-view consistency · interactive collisions
MMed-Bench-IR: A Heterogeneous Benchmark for Multilingual Medical Information Retrieval
The paper introduces MMed-Bench-IR, a heterogeneous benchmark for evaluating multilingual medical information retrieval across three distinct tasks: cross-lingual medical QA retrieval (6,127 UMLS-grounded queries), concept discrimination (4,975 confusion sets), and multilingual evidence retrieval for RAG (2,040 queries). Designed to measure interactions between biomedical expertise and multilingual coverage, the benchmark spans 6 languages with zero concept/query overlap between tasks. Evaluation of 10 systems reveals severe cross-lingual performance gaps, with biomedical encoders dropping from 0.818 nDCG@10 (English) to 0.056 (Japanese).
retrieval-augmented generation · cross-lingual alignment · unified medical language system · biomedical encoders · ndcg@10
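The nDCG@10 figures quoted above come from a standard ranking metric: discounted cumulative gain over the top 10 results, divided by the gain of an ideal ordering. A minimal sketch using the linear-gain form, assuming the ideal ordering is built from the same list of judged relevance values (benchmark toolkits usually build it from all judged documents for the query, so this is a simplification):

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), ranks from 1."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the best reordering.

    `relevances` lists the judged gain of each returned document in ranked
    order. Returns 0.0 when no document is relevant.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# The single relevant document is ranked second instead of first.
print(ndcg_at_k([0, 1, 0]))
```

A score near 0.8 therefore means relevant documents mostly sit at the top of the ranking, while a score near 0.06 means they rarely appear in the top 10 at all.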
Navigating User Behavior toward Personalized Multimodal Generation
NaviGen introduces a personalized multimodal generation framework that transforms user interaction history into executable instructions for downstream synthesis. The method employs dual identifiers (collaborative and textual codes) as behavioral substrates and semantic bridges, enabling language reasoning and instruction writing. A two-stage SFT+RL pipeline first distills preference reasoning from evolutionarily searched supervision, then aligns generation with user intent through hierarchical and self-consistent rewards. Experiments across product, game, and short-video domains demonstrate NaviGen's improvements in personalized image/video generation, next-item prediction, and instruction specificity/relevance. Code is anonymously released.
personalized generation · dual identifier · sft+rl pipeline · evolutionary supervision · hierarchical rewards
Co-occurring associated retained concepts in Diffusion Unlearning
The authors introduce ReCARE, a framework for robust concept erasure in diffusion models that preserves Co-occurring Associated REtained concepts (CARE). ReCARE addresses the limitation of existing unlearning methods, which often suppress benign co-occurring concepts alongside target concepts. The method automatically constructs a CARE-set, a curated vocabulary of benign tokens extracted from target images, and leverages this during training for stable unlearning. Extensive experiments on target concepts including nudity, Van Gogh style, and Tench objects demonstrate that ReCARE achieves state-of-the-art performance in balancing robust concept erasure, overall utility, and CARE preservation.
diffusion models · unlearning · care preservation · concept erasure · vocabulary construction
Deep Learning Approaches for 3D Medical Scene Completion: From Geometric Modeling to Generative Paradigms
This systematic review analyzes advancements in 3D medical scene completion from 2016 to 2026, highlighting the evolution from voxel semantic completion (e.g., SSCNet) to generative paradigms integrating diffusion priors and Gaussian splatting. The study categorizes representation paradigms, including voxel grids, point learning, implicit neural fields, transformer networks, and rendering-aware 3D Gaussian primitives. It provides a taxonomy of contributions, identifies unresolved challenges, and proposes a research agenda for next-generation systems. The review emphasizes the shift towards generative techniques and real-time rendering, offering a comprehensive framework for future developments in autonomous navigation and augmented reality applications.
voxel semantic completion · diffusion priors · gaussian splatting · implicit neural fields · transformer networks
Zero-Shot Test-Time Canonicalization using Out-of-Distribution Scoring
The paper introduces a zero-shot test-time canonicalization method that reframes input transformation as out-of-distribution (OOD) detection, enabling any OOD score to guide the search for canonical forms. The approach evaluates 20 OOD scores and 9 search algorithms, finding distance-based scores with random search and local refinement optimal. A gating mechanism preserves in-distribution accuracy by selectively transforming inputs based on OOD scores. Experiments across handwritten characters, sketches, natural images, and 3D point clouds demonstrate robustness to affine transformations without model retraining.
test-time canonicalization · out-of-distribution detection · affine transformations · zero-shot learning · energy minimization
Agon: An Autonomous Large-Scale Omnidisciplinary Research System Built on Prompt Economy
The paper introduces Agon, an autonomous research orchestrator that addresses the scalability bottleneck in research production by validating claims within automated workflows while deferring remaining judgments to human scientists. Agon implements six design principles: Prompt Economy, Future-Facing, Minimal Prompts, OmniDisciplinary, Massive Parallelism, and Zero-Code. The system was evaluated through 444 iterations of Prompt Economy loops across domains, demonstrating scalability while revealing failure modes categorized by severity, fixability, visibility, and capability locus. Results suggest a paradigm shift toward machine-scaled research with human oversight.
prompt economy · omnidisciplinary · massive parallelism · zero-code · failure taxonomy
Lightweight Transformer Models for On-Device Fault Detection: A Benchmark Study on Resource-Constrained Deployment
The study benchmarks lightweight transformer architectures against traditional ML methods for on-device fault detection across three datasets (NASA C-MAPSS, SECOM, UCI AI4I 2020). Evaluating F1-score, AUC, model size, and CPU latency, it finds that transformers match traditional ML performance (87.8% F1) on well-separated data but with 100x larger models and 9000x higher latency. TinyBERT-4L emerges as optimal (55 MB, 18 ms), while INT8 quantization reduces size by 25% with minimal accuracy loss. An adaptive pipeline achieves 87.6% F1 at 19.5 ms. Both methods struggle with severe class imbalance.
lightweight transformers · on-device deployment · dynamic quantization · adaptive inference · class imbalance
A Pāninian Foundation for Indic Language Processing
The paper proposes leveraging Pānini's Astādhyāyī grammar as a unifying framework for Indic language processing, addressing current fragmentation. By formalizing shared morphosyntactic architecture across 1B+ speakers, the authors argue for improved accuracy, data efficiency, and transferability. They introduce a four-part benchmark suite to operationalize this framework and examine its implications for neural model interpretability regarding Pāninian categories. The approach consolidates sparse resources into a metalanguage bedrock, transcending genealogical boundaries.
pāninian grammar · morphosyntactic architecture · indic languages · metalanguage bedrock · interpretability
Data Scale, Not Latency, Shapes Cross-Lingual Encoder Transfer in Streaming ASR
The study investigates initialization choices for streaming ASR adaptation, comparing multilingual (ML) versus English-only (EN) encoder warm starts across data scales, latency tiers, and quantization. Using a 0.6B-parameter FastConformer transducer, experiments span eight European languages, 100-2500h of target data, and three streaming tiers. Results show ML initialization's advantage decays with data scale (e.g., +4.21pp to +0.20pp WER gap at 160ms latency from 100h to 2500h), following a power-law trend, while latency and quantization (4-bit weight-only, ~3x footprint reduction) have minimal impact. Guidelines recommend ML for low-data regimes and independent latency/quantization decisions.
streaming asr · multilingual encoder · fastconformer · weight quantization · word error rate
Breaking Shortcut Learning for Cross-Trial EEG-Guided Target Speech Extraction via Two-Stage Training
We propose TRUST-TSE, a two-stage framework to address shortcut learning in EEG-guided target speech extraction. The method employs contrastive pretraining with attended-speaker negative sampling to suppress trial-identity cues and capture fine-grained EEG-speech alignment, followed by confidence-weighted extraction based on EEG-source similarity. Evaluated on KUL and DTU datasets under strict cross-trial protocols, TRUST-TSE outperforms end-to-end baselines, demonstrating improved generalization and addressing reliability bottlenecks in neuro-steered hearing technologies.
eeg-guided · shortcut learning · contrastive pretraining · confidence-weighted extraction · cross-trial protocols
An Introduction to Causal Reinforcement Learning
The paper introduces causal reinforcement learning (CRL), a framework unifying causal inference and reinforcement learning through structural causal models. It demonstrates how environments can be decomposed into autonomous mechanisms with causal invariances, enabling novel learning modalities like generalized policy learning and counterfactual learning. The approach provides a unifying treatment for online, off-policy, and causal calculus learning, revealing previously unexplored dimensions in RL.
causal inference · reinforcement learning · structural causal models · counterfactual learning · policy optimization
The Geometry Behind Diffusion and Flow Matching: Gradient Flows and Geodesics in Wasserstein Space
This work unifies diffusion models and Flow Matching under a single geometric framework in Wasserstein space, demonstrating that both approaches share the same underlying manifold structure. By analyzing the quadratic Wasserstein distance W_2 and its Riemannian geometry, the authors show that diffusion models follow free-energy gradient flows (Fokker-Planck equation) while Flow Matching follows Wasserstein geodesics (Benamou-Brenier formula). The key insight is that both methods reach the same endpoints through distinct paths: diffusion via an initial-value problem and Flow Matching via a boundary-value problem. This geometric perspective explains why Flow Matching requires fewer sampling steps, as it follows deterministic ODEs along geodesics.
wasserstein space · gradient flows · wasserstein geodesics · flow matching · diffusion models
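For reference, the objects named in the summary above have standard textbook forms, written out here in LaTeX. The notation is an assumption for illustration and may differ from the paper's: \mu_0 and \mu_1 are the endpoint distributions, \rho_t is the density at time t, v_t is a velocity field, and V is a potential in the free energy.

```latex
% Quadratic Wasserstein distance (static, coupling-based form)
W_2^2(\mu_0, \mu_1) = \inf_{\pi \in \Pi(\mu_0, \mu_1)} \int \|x - y\|^2 \, d\pi(x, y)

% Benamou-Brenier formula (dynamic form): geodesics minimize kinetic energy
W_2^2(\mu_0, \mu_1) = \inf_{(\rho_t, v_t)} \int_0^1 \int \rho_t(x) \, \|v_t(x)\|^2 \, dx \, dt
\quad \text{s.t.} \quad
\partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0, \quad \rho_0 = \mu_0, \quad \rho_1 = \mu_1

% Fokker-Planck equation as the W_2 gradient flow of the free energy
% F(\rho) = \int V \rho \, dx + \int \rho \log \rho \, dx
\partial_t \rho_t = \nabla \cdot (\rho_t \nabla V) + \Delta \rho_t
```

The Benamou-Brenier problem fixes both endpoints, which is the boundary-value view attributed to Flow Matching, whereas the Fokker-Planck equation is evolved forward from \rho_0 alone, which is the initial-value view attributed to diffusion.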
Metis: Bridging Text and Code Memory for Self-Evolving Agents
The paper introduces Metis, a self-evolving agent system that bridges text and code memory representations for improved performance. Metis employs a hierarchical dual-representation memory, organizing textual experience into execution plans, environment facts, and common pitfalls, while selectively converting recurring plans into callable tools. Evaluated on the AppWorld benchmark, Metis achieves up to 20.6% higher task accuracy than ReAct and reduces execution cost by up to 22.8%, demonstrating superior balance between accuracy, efficiency, and memory-construction cost.
metis · self-evolving agents · text memory · code memory · appworld
T2D-Bench: Evidence-Gated Evaluation of LLM Outputs for Type 2 Diabetes Using a Multi-Layer Clinical-Lifestyle Knowledge Graph
The paper introduces T2D-Bench, a benchmark for evaluating LLM outputs on type 2 diabetes using an evidence-gated framework. It combines a multi-layer knowledge graph (UMLS, DrugBank, SIDER) with ADA Standards of Care rules and lifestyle-glycemic linkages. Testing 100 vignettes, GPT-4o-mini and GPT-4o failed evidence-path checks in 35% and 33% of cases respectively. The framework detects unsupported claims and enables constrained revision for compliance, demonstrating measurable improvement in clinical output verifiability.
knowledge graph · evidence-gated · glycemic · ada standards · vignettes
OmniPath: A Multi-Modal Agentic Framework for Auditing Wheelchair Accessibility
OmniPath introduces a multi-modal agentic framework for auditing wheelchair accessibility by fusing OpenStreetMap network topology with high-density aerial LiDAR (USGS 3DEP) to construct high-fidelity 3D pedestrian environment models. The system proactively traverses the network in 0.5-meter increments, analyzing surface features such as running slope, cross slope, and vertical discontinuities against ADA compliance standards, assigning weighted severity scores from 'Mild' to 'Critical'. Validated against 200 physical ground truth surveys on the National Mall, OmniPath achieved F1-scores of 0.60 for Severe and 0.58 for Critical hazards, demonstrating reliable diagnostic capability for high-severity accessibility barriers.
lidar · ada compliance · severity score · network topology · f1-score
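The fixed-increment slope check described above can be sketched in a few lines. This is an illustration under stated assumptions, not the OmniPath implementation: the function names are hypothetical, elevation samples are assumed to be spaced 0.5 m apart horizontally, and the 1:20 limit is the ADA running-slope limit for walking surfaces that are not treated as ramps. The paper's severity weighting is not given in the digest and is not reproduced here.

```python
# Assumed threshold: ADA running-slope limit of 1:20 (5%) for walking surfaces.
MAX_RUNNING_SLOPE = 1 / 20


def segment_slopes(elevations_m, step_m=0.5):
    """Rise over run between consecutive elevation samples along a path."""
    return [(b - a) / step_m for a, b in zip(elevations_m, elevations_m[1:])]


def flag_steep_segments(elevations_m, step_m=0.5, limit=MAX_RUNNING_SLOPE):
    """Return indices of segments whose absolute slope exceeds `limit`."""
    slopes = segment_slopes(elevations_m, step_m)
    return [i for i, slope in enumerate(slopes) if abs(slope) > limit]


# Hypothetical elevation profile in meters, sampled every 0.5 m.
profile = [0.00, 0.01, 0.06, 0.07]
print(flag_steep_segments(profile))
```

In the example only the middle segment, which rises 5 cm over 0.5 m (a 10% grade), is flagged.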
DTT-BSR+: A Generative-Regression Cascade for Music Source Restoration
DTT-BSR+ introduces a two-stage cascade for music source restoration (MSR), decoupling distribution fitting from signal reconstruction. The first stage employs a generative DTT-BSR separator to produce stems matching clean source priors, while the second stage uses a modified Demucs network with time-domain and multi-resolution spectral losses for enhancement. The system improves multi-mel signal-to-noise ratio (MMSNR) over single-stage DTT-BSR across all stems and outperforms X-LANCE MSR on five stems. Fréchet Audio Distance (FAD) decomposition reveals a trade-off between signal reconstruction accuracy and semantic distribution fitting.
music source restoration · generative-regression cascade · multi-mel signal-to-noise ratio · fréchet audio distance · time-domain spectral losses
VeryTrace: Verifying Reasoning Traces through Compilable Formalism and Structured Verification
VeryTrace introduces a zero-shot verification-and-repair framework for Chain-of-Thought reasoning by formalizing natural-language traces into a compilable representation. The method employs a Domain-Specific Language (DSL) to explicitly model step dependencies, mechanize quantitative content, and structure semantic inferences, combining deterministic checks with targeted LLM audits for error localization and repair. Evaluations on competition mathematics (AIME 2025), robotics planning (LLM-BabyBench), and kinship reasoning (CLUTRR) demonstrate improved accuracy over zero-shot baselines without domain-specific training or in-context examples.
chain-of-thoughtdomain-specific languagezero-shot verificationerror localizationdeduction schemas
A Benchmark for Hallucination Detection in VLMs for Gastrointestinal Endoscopy
The paper benchmarks hallucination detection methods for vision-language models (VLMs) in gastrointestinal endoscopy, addressing a gap in clinical VLM evaluation. Nine methods (RadFlag, SelfCheckGPT-NLI, AvgProb, AvgEnt, MaxProb, MaxEnt, Semantic Entropy, VASE, ReXTrust) are evaluated on the Gut-VLM dataset (4,392 VQA pairs) across five VLMs (MedGemma-4B to Lingshu-32B). ReXTrust, a white-box method, achieves peak AUC of 93.0 (MedGemma-4B), outperforming alternatives by 19.5 AUC points on average (p<0.001). Token-level gray-box methods (MaxEnt, MaxProb) outperform clustering-based and black-box approaches, while confident confabulation emerges as a systemic failure mode.
vision-language modelshallucination detectiongastrointestinal endoscopyvisual question answeringauc performance
ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection
The paper introduces ReMMD, a framework for multilingual multi-image misinformation detection, addressing limitations of existing benchmarks that isolate single modalities. It presents ReMMDBench, a comprehensive benchmark with 500 samples, 2,756 images, multilingual support, and detailed veracity labels. The ReMMD-Agent component employs persistent-memory verification, decomposing posts into atomic points and building reusable evidence sets. Evaluations show ReMMD-Agent achieves 41.80% accuracy and 39.12% macro-F1 using GPT-5.2, while reducing costs by 17.5% compared to MMD-Agent and 79.9% versus T2-Agent.
multimodalmisinformationverificationbenchmarkagentic
DramaDirector: Geometry-Guided Short Drama Generation
The paper introduces DramaDirector, a geometry-guided framework for plot-to-short-drama generation that transforms global plots and local context into visually grounded multi-shot videos. The method decouples shots into static visual and dynamic narrative conditions, employs schema-constrained SFT and GRPO training with a learned text-visual alignment reward, and retrieves depth-pose references to guide first-frame generation and image-to-video synthesis. Evaluated on DramaBoard (35 dramas, 2.8K episodes, 81K shots), DramaDirector outperforms baselines in faithfulness, consistency, and controllability.
geometry-groundedplot-to-short-dramaschema-constrained sftdepth-pose retrievalimage-to-video synthesis
The impact of generative artificial intelligence on academic development of Chinese students in humanities and social sciences
This study examines generative AI's impact on Chinese humanities and social sciences students through a large-scale survey, analyzing usage patterns, learning effects, challenges, and curricular integration preferences. Findings indicate over 50% reported enhanced motivation and creativity, though performance gains may reflect assessment limitations. Variations emerged by experience duration, discipline, and gender. Key concerns included accuracy and overreliance, while most valued ethics but were less satisfied with privacy. Students preferred practice-oriented, optional integration, recognizing GenAI's professional relevance.
generative aihumanities educationacademic performanceethical considerationscurricular integration
Exploring Academic Influence of Algorithms by Co-occurrence Network Based on Full-text of Academic Papers
This study introduces a network-based approach to analyze algorithm influence in NLP research by constructing large-scale co-occurrence networks from full-text academic papers. Using deep learning for entity extraction, the authors build cumulative and annual networks spanning four decades, analyzing structural characteristics through multiple centrality measures. Results reveal complex-network properties: classic and high-performing algorithms keep core positions, while declining influence shows up first as a loss of network position and then as weakened associations. The work provides the first temporal and structural analysis of algorithm co-occurrence networks.
algorithm co-occurrence networksnetwork centralityentity extractionnatural language processingcomplex networks
Beyond Bayer: Task-Optimal Sensor Co-Design for Robust Autonomous-Driving Segmentation
The paper proposes task-optimal sensor co-design for autonomous-driving segmentation by optimizing camera sensor parameters through a differentiable RAW-to-task pipeline. Key findings show spectral color-filter-array (CFA) weight learning improves mIoU by +0.017 (KITTI-360) and +0.023 (ACDC), while point-spread-function optimization degrades performance (-0.020 mIoU). Noise optimization yields marginal gains, and larger CFA tiles beyond 2x2 harm performance due to sRGB input constraints. The method is model-agnostic and validated across adverse conditions (fog, night, rain, snow), recommending learned 2x2 CFA weights with an identity PSF.
color-filter-arraypoint-spread-functionraw-to-task pipelineautonomous-driving segmentationsensor co-design
Predicting Poets' Origins from Verse: A Computational Analysis of Regional Linguistic Fingerprints in the Complete Tang Poems
The study demonstrates that Tang-dynasty poets' geographic origins leave detectable linguistic traces in their verse, achieving 0.69 accuracy in South/North classification (vs. 0.53 baseline) using character n-gram TF-IDF and domain features. Analyzing 357 poets from the Complete Tang Poems, the work reveals: (i) linguistic-geographic distance correlation (Mantel r = 0.40), (ii) temporal variation in regional separability, and (iii) historically interpretable model errors. Notably, GuwenBERT matches but does not surpass TF-IDF performance, suggesting n-grams sufficiently capture regional signals. The approach establishes interpretable ML as a tool for literary historical analysis.
linguistic fingerprinttf-idfdistance-decay effectguwenbertinterpretable machine learning
DynaWM: Dynamics-Aware Distillation with World Model and Momentum Targets for Smooth Locomotion over Continuous Stairs
DynaWM introduces a dynamics-aware representation learning framework for bipedal-wheeled robot locomotion over continuous stairs, addressing limitations in terrain geometry encoding and dynamics awareness. The method combines a world model as a forward-dynamics regularizer with a momentum target encoder to stabilize knowledge transfer, preventing dimensional collapse. PCA visualization and quantitative metrics demonstrate hierarchical terrain encoding and improved adaptability, enabling smooth traversal of diverse staircases in both simulation and real-world experiments.
dynamics-aware learningworld modelmomentum target encoderterrain encodingbipedal-wheeled robots
Blockwise Policy-Drift Gating for On-Policy Distillation
The paper introduces blockwise policy-drift gating, a student-only controller for on-policy distillation (OPD) that improves robustness in long-horizon reasoning tasks. The method aggregates log-probability shifts between behavior and current student policies over fixed token blocks, using mean-normalized gates to reweight OPD position losses without altering teacher targets or rollout policies. Evaluated on a six-variant Qwen3 math reasoning benchmark with 200-step training, 64-token block gating increased mean pass@8 from 0.4978 to 0.5160 across four datasets (AIME24, AIME25, MATH500, AMC23), demonstrating its effectiveness as a drift control signal for reused rollouts.
on-policy distillationpolicy-drift gatinglog-probability shiftsrollout reuselong-horizon reasoning
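The gating idea in the entry above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's controller: the summary says log-probability shifts are aggregated over fixed token blocks and turned into mean-normalized gates, but it does not give the gate function, so the `exp(-drift)` mapping, the use of absolute shifts, and the names `blockwise_gates` and `gated_loss` are all hypothetical.

```python
import math


def blockwise_gates(logp_behavior, logp_current, block_size=64):
    """Per-token gates that are constant within each block.

    Assumption: a block's drift is the mean absolute log-prob shift between
    the behavior policy and the current student, and its raw gate is
    exp(-drift). Raw gates are divided by their mean over blocks.
    """
    if len(logp_behavior) != len(logp_current):
        raise ValueError("log-prob sequences must have equal length")
    n = len(logp_behavior)
    if n == 0:
        return []

    raw, lengths = [], []
    for start in range(0, n, block_size):
        old = logp_behavior[start:start + block_size]
        new = logp_current[start:start + block_size]
        drift = sum(abs(c - b) for b, c in zip(old, new)) / len(old)
        raw.append(math.exp(-drift))  # hypothetical gate function
        lengths.append(len(old))

    mean_gate = sum(raw) / len(raw)
    gates = []
    for gate, length in zip(raw, lengths):
        gates.extend([gate / mean_gate] * length)
    return gates


def gated_loss(position_losses, gates):
    """Reweight per-position losses; the losses themselves are unchanged."""
    return sum(l * g for l, g in zip(position_losses, gates)) / len(position_losses)


# Toy rollout: the second block has drifted, so it is down-weighted.
gates = blockwise_gates([0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0], block_size=2)
```

Because only the weights on position losses change, the teacher targets and the rollout policy stay as they were, which matches the "student-only" description in the summary.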
CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression
The paper introduces CAVEWOMAN, a two-channel evaluation protocol assessing linguistic compression's impact on LLMs across input and output channels. The method evaluates task accuracy, realized cost, and reference-text agreement for eight models on five datasets at varying compression levels. Results show output compression reduces costs (1.4-2.4x) while input compression increases costs (~1.15x mean) and degrades accuracy, with 50% of non-reasoning model outputs diverging from unconstrained references despite correctness.
linguistic compressionrealized costreference-text agreementtwo-channel evaluationinference efficiency
PixJail: Self-Evolving Paper-to-Pipeline Reproduction for Text-to-Image Jailbreak Evaluation
PixJail introduces a self-evolving paper-to-pipeline agent framework for reproducible Text-to-Image (T2I) jailbreak evaluation, addressing challenges in pipeline-level reproduction across prompt transformation, image generation, safety filtering, and multimodal judging stages. The framework constructs paper-specific attack modules and runnable evaluation pipelines under a unified contract, leveraging a memory bank for storing paper digests, attack patterns, and reusable artifacts. Evaluated on eleven T2I jailbreak methods, PixJail achieves 2.1% average and 0% median error in reproducing original results, demonstrating high fidelity.
text-to-image jailbreakpipeline reproductionself-evolving frameworkmultimodal evaluationprompt transformation
End-to-End Radar and Communication Modulation Recognition with Neuromorphic Computing
EMRFormer introduces a spiking neural network (SNN) architecture for automatic modulation recognition (AMR) that balances accuracy and energy efficiency. The model combines an adaptive spike encoder, Integer Leaky Integrate-and-Fire neurons, and spike-separable CNNs within a spike-driven transformer framework to process raw IQ waveforms. Evaluations show state-of-the-art accuracy across datasets, robustness in low-SNR conditions, 90% lower theoretical energy consumption, and 5× power reduction on a KA200 neuromorphic chip versus GPUs.
spiking neural networkmodulation recognitionneuromorphic computingiq waveformsenergy efficiency
Token Complexity of Certifying Stochastic-Oracle Reliability
The paper introduces certification token complexity, defined as the minimum expected token cost to distinguish stochastic oracles meeting a target reliability level from those below a lower threshold. An SPRT-based certification Stochastic-Oracle Turing Machine (SOTM) is constructed, which queries the oracle, computes binary correctness scores, and halts when log-likelihood evidence crosses a decision threshold. The SOTM achieves two-sided error guarantees and provides an explicit upper bound on certification token complexity. A matching information-theoretic lower bound demonstrates that any error-bounded certification SOTM must incur the same leading-order expected token cost as the SPRT-based construction in the small-error regime.
stochastic-oracle turing machinetoken complexitysprtlog-likelihood evidenceerror guarantee
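The stopping rule in the entry above follows the shape of Wald's classical sequential probability ratio test, which can be sketched as below. This is an illustration of the textbook SPRT on binary correctness scores, not the paper's SOTM construction: the function name `sprt_certify`, the parameters `p_hi`, `p_lo`, `alpha`, `beta`, the fixed `tokens_per_query` cost, and the `max_queries` cap are assumptions made for the example.

```python
import math


def sprt_certify(query, p_hi, p_lo, alpha=0.05, beta=0.05,
                 tokens_per_query=1, max_queries=100_000):
    """Sequentially test 'reliability is p_hi' against 'reliability is p_lo'.

    query() returns True when the oracle's answer is scored correct.
    alpha bounds the chance of certifying an oracle at p_lo;
    beta bounds the chance of rejecting an oracle at p_hi.
    Returns (decision, tokens_spent).
    """
    upper = math.log((1 - beta) / alpha)   # cross this: certify as reliable
    lower = math.log(beta / (1 - alpha))   # cross this: reject as unreliable
    step_correct = math.log(p_hi / p_lo)
    step_wrong = math.log((1 - p_hi) / (1 - p_lo))

    llr, tokens = 0.0, 0
    for _ in range(max_queries):
        correct = query()
        tokens += tokens_per_query
        llr += step_correct if correct else step_wrong
        if llr >= upper:
            return "reliable", tokens
        if llr <= lower:
            return "unreliable", tokens
    return "undecided", tokens


# An oracle that is always right is certified after 8 queries at these settings;
# one that is always wrong is rejected after 3.
always_right = sprt_certify(lambda: True, p_hi=0.9, p_lo=0.6)
always_wrong = sprt_certify(lambda: False, p_hi=0.9, p_lo=0.6)
```

The expected number of queries before stopping, multiplied by the per-query token cost, is the quantity the summary calls certification token complexity; the asymmetry between 8 and 3 queries reflects how much more evidence a single wrong answer carries when the target reliability is high.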
Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning
Strategy-Guided Policy Optimization (SGPO) enhances reasoning transfer in language models by distilling reusable strategies rather than imitating instance-specific trajectories. SGPO extracts structured strategy descriptions from strong models, constructs both autonomous and strategy-guided trajectories, and employs a token-level forward-KL objective with proximal constraints for selective distillation. Adaptive instance-level weighting dynamically adjusts guidance based on model competence. Evaluated on four mathematical benchmarks across two model families, SGPO outperforms SFT, on-policy RL, and hybrid-policy baselines, achieving a 2.2-point average score improvement on Qwen2.5-7B-Instruct. Strategy distillation also scales complementarily with base model capability.
strategy-guided policy optimizationforward-kl objectiveadaptive instance-level weightingtoken-level distillationproximal constraints
Selective Capability Unlearning in End-to-End Spoken Language Understanding
The paper introduces Binding Subspace (BSU), a representation-level framework addressing capability persistence in spoken language understanding (SLU) systems, where autoregressive models retain slot-generation behavior despite intent suppression. BSU isolates and attenuates intent-conditioned mapping directions through subspace manipulation. Evaluations on SLU benchmarks demonstrate BSU's effectiveness in reducing forced-prefix recoverability by 47-63% while maintaining 98.2% retained performance on non-targeted intents.
capability persistenceautoregressive modelsintent suppressionrepresentation-level frameworkslot-generation
RAVEN: A Regime-Aware Variable-context Expert Network for Financial Time Series Forecasting
The paper introduces RAVEN, a Regime-Aware Variable-context Expert Network for financial time series forecasting, addressing the mismatch between fixed context windows and non-stationary price processes. RAVEN employs a Mixture-of-Experts framework with nested contiguous windows determined by learned patch importance, using Cumulative Importance Thresholding for routing and Correlation-Aware Weighting for expert output alignment. Experiments show RAVEN achieves state-of-the-art performance, improving Pearson correlation by 9.2% on HS300 and 20.2% on S&P500, reducing MSE by 18.2% on fund sales, and outperforming in 14 of 16 metrics on PEMS benchmarks.
mixture-of-expertsnon-stationarycontext windowtime series forecastingregime-aware
Ensemble Feature Selection and Harris Hawks Optimization for Explainable Mental Health Risk Prediction in Female Sex Workers
The paper proposes a hybrid predictive model combining ensemble feature selection (ANOVA and mutual information) with Harris Hawks Optimization-tuned logistic regression for mental health risk prediction in female sex workers. The method addresses limitations of conventional ML models in capturing high-dimensional risk patterns, incorporating explainable AI techniques to identify key predictive factors. Evaluation on 3,005 subjects demonstrates superior performance (95.78% accuracy, 95.77% F1, 0.96 AUC), with post-traumatic stress, client-related violence, and occupational factors identified as primary depression contributors through XAI interpretation.
ensemble feature selectionharris hawks optimizationexplainable aimental health predictionswarm intelligence
Rapid FinFET Modelling Using an Autoencoder
The authors propose an autoencoder-based framework for efficient FinFET modeling, compressing current-voltage (ID-VG) characteristics into a low-dimensional latent space while preserving device physics. The method incorporates drain-to-source voltage (VDS) as an input feature to capture bias-dependent variations, trained on BSIM-CMG-generated data. Results show accurate reconstruction of full I-V curves and direct extraction of key metrics (threshold voltage, subthreshold slope, peak transconductance), demonstrating data-driven compact models achieve high accuracy with minimal training data for rapid device characterization.
autoencoderfinfetbsim-cmgcurrent-voltage characteristicscompact modeling
Breaking the Filter Bubble: A Semantic Pareto-DQN Framework for Multi-Objective Recommendation
The authors propose a multi-objective reinforcement learning framework, Semantic Pareto-DQN, to mitigate filter bubbles and semantic homogenization in recommender systems. The method formalizes recommendation as a semantic multi-objective Markov decision process, integrating high-fidelity semantic embeddings with a Pareto-DQN agent to treat engagement, diversity, and fairness as distinct reward signals. Empirical evaluations on the MovieLens small dataset demonstrate that hypervolume-based action selection disrupts feedback loops causing semantic collapse, sustaining high state-trajectory variance and mapping the Pareto frontier. The framework achieves gains in societal objectives with marginal impacts on engagement, offering a path toward responsible recommender systems.
multi-objective reinforcement learningsemantic embeddingspareto frontiermarkov decision processhypervolume-based action selection
Towards Version-aware Operations and Transaction Memories for Multi-layer MeMo
The paper introduces version-aware operations and transaction memories for MeMo, a multi-layer language model with explicit correlation matrix memories (CMMs). It proposes a framework where knowledge updates are achieved through memory edits rather than full retraining, compiling high-level operations like replace, obsolete, and rollback into MeMo-native primitive calls. Two auxiliary CMMs are introduced: Version CMM (V-CMM) for version transitions and Transaction CMM (T-CMM) for reusable change contents and inverse programs. The framework supports sequence-level edits and structured diff-level inputs, with evaluation metrics focusing on update success, rollback, traceability, locality, and transaction reuse.
correlation matrix memoriesversion-aware operationstransaction memoriessequence-level editsstructured diff-level inputs
Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?
The paper introduces AgenticInterpBench, a benchmark with 84 semi-synthetic transformer circuits and 163 component-level annotations, to evaluate language model (LM) agents as mechanistic interpretability aids. It proposes HyVE, an agentic explainer that iteratively generates hypotheses, validates them causally, and produces component- and circuit-level explanations. Testing four LM backbones reveals HyVE recovers useful explanations, though no backbone dominates; validation failures (e.g., incomplete plans, execution errors) are the primary bottleneck. A case study on Llama-3-8B demonstrates applicability to naturally trained models.
mechanistic interpretabilitytransformer circuitsagentic explainercausal validationsemi-synthetic benchmarks
Reinforcement Learning Towards Broadly and Persistently Beneficial Models
The paper demonstrates that reinforcement learning (RL) for beneficial behavior in diverse domains yields broad alignment generalization beyond training distributions. Using a novel dataset spanning health, science, and education, the authors train models to exhibit traits like truthfulness and fairness, then evaluate on 50+ alignment benchmarks. Beneficial-trait RL improves performance on 80% of out-of-distribution tasks, showing cross-domain transfer (e.g., health-trained models reduce deception in non-health contexts) and persistence against adversarial prompting and harmful finetuning. Results suggest RL can enhance robust alignment with human values.
reinforcement learningalignment generalizationreward hackingout-of-distribution transferadversarial prompting
Safe and Generalizable Hierarchical Multi-Agent RL via Constraint Manifold Control
The paper proposes a hierarchical multi-agent reinforcement learning framework that guarantees safety while maintaining coordination efficiency. The method enforces hard safety constraints at the low level via constraint manifold control, while learning high-level coordination policies, providing theoretical safety guarantees and stationary learning dynamics. Empirical results demonstrate competitive performance with near-perfect safety rates (the safety metric is not specified) and generalization across varying agent counts and obstacle configurations.
multi-agent reinforcement learningconstraint manifoldsafety guaranteeshierarchical controlstationary learning dynamics
Fast and Slow Variational Continual Learning
The paper introduces Continual IVON (CoVON), a variational continual learning optimizer that balances fast and slow adaptation via posterior merging. CoVON leverages the VCL framework by using merged past posteriors as priors to mitigate knowledge drift, implemented efficiently within the IVON optimizer with computational costs comparable to Adam. Experiments demonstrate CoVON's superiority over existing VCL optimizers and weight-regularization methods in domain-incremental learning, continual pre-training, and LLM fine-tuning.
variational continual learningposterior mergingivon optimizerfast-slow adaptationdomain-incremental learning
Towards Spec Learning: Inference-Time Alignment from Preference Pairs
The paper introduces spec learning, an inference-time alignment framework that compiles user instructions and preference pairs into natural-language prompts for LLMs without parameter updates. The method transforms preference judgments into human-readable specifications that condition model outputs, outperforming direct preference optimization (DPO) on specialized domains with dense preference signals. Results demonstrate improved response quality while maintaining interpretability through transparent, editable prompt specifications.
inference-time alignmentpreference learningnatural-language specificationsdirect preference optimizationinterpretable prompting
EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games
The paper introduces EMAgnet, a policy gradient self-play method that regularizes toward an exponential moving average (EMA) of previous policy parameters instead of a uniform distribution. This adaptive regularization target evolves with strategy improvement, addressing limitations of uniform regularization in games with dominated strategies. Evaluated on zero-sum imperfect-information games, including modified benchmarks with exploration challenges, EMAgnet outperforms PPO with uniform-magnet regularization in exploitability reduction, particularly in environments containing strictly dominated strategies.
policy gradientself-playexponential moving averageimperfect-information gamesexploitability
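To make the contrast with a uniform magnet concrete, the sketch below keeps an exponential moving average of policy parameters and measures how far the current policy sits from the policy those averaged parameters induce. It is a toy sketch under assumptions, not EMAgnet itself: the policy is a single softmax over logits, the penalty is a KL divergence, and `ema_update`, `magnet_penalty`, and the decay value are hypothetical names and choices that the summary does not specify.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ema_update(ema_params, params, decay=0.99):
    """Move the averaged parameters a small step toward the current ones."""
    return [decay * e + (1 - decay) * p for e, p in zip(ema_params, params)]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def magnet_penalty(params, ema_params):
    """Assumed regularizer: KL from the current policy to the EMA policy."""
    return kl_divergence(softmax(params), softmax(ema_params))


# Toy usage: three actions, the third of which training has learned to avoid.
params = [2.0, 1.0, -3.0]
ema = [0.0, 0.0, 0.0]          # all-equal logits give a uniform magnet
for _ in range(200):
    ema = ema_update(ema, params, decay=0.9)
penalty_now = magnet_penalty(params, ema)
penalty_uniform = magnet_penalty(params, [0.0, 0.0, 0.0])
```

With all-equal averaged logits the magnet is the uniform policy, so the uniform regularizer is a special case. Once the average has tracked the learned logits, the penalty for avoiding the third action shrinks, which is the behavior the summary attributes to games with dominated strategies.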
Learning to Trigger: Reinforcement Learning at the Large Hadron Collider
The paper presents a reinforcement learning approach for dynamic trigger threshold tuning at the Large Hadron Collider, addressing suboptimal static configurations. The authors adapt Group-Filtered Policy Optimization (GFPO) to streaming control, introducing GFPO-F and GFPO-FR variants that enforce background rate feasibility during training. Evaluated on Monte Carlo streams and real CMS Run 283408 collision data, the method improves in-tolerance time intervals by 48% (H_T trigger) and 28% (anomaly-detection trigger), with cumulative signal efficiency gains up to 2% on in-tolerance intervals, demonstrating the first RL-based trigger control on real LHC data.
reinforcement learningtrigger tuninggroup-filtered policy optimizationlarge hadron collidersignal efficiency
RASC+: Retrieval-Constrained LLM Adjudication for Clinical Value Set Authoring
The paper introduces RASC+, a retrieval-constrained LLM adjudication method for clinical value set authoring, addressing limitations of direct zero-shot LLM generation. The approach combines Qwen3-based retrieval with vocabulary-aware expansion and code-display rescue retrieval to improve candidate-pool recall from 0.553 to 0.730, followed by GPT-5 adjudication for candidate selection. Results show macro F1 improvements to 0.549 (full-test) and 0.533 (held-out-publisher), demonstrating enhanced performance while maintaining auditable candidate pools.
retrieval-augmentedclinical value setsllm adjudicationvocabulary-aware expansionmacro f1
Critique of Agent Model
The paper critiques contemporary AI agent models by distinguishing between 'agentic' (externally scaffolded) and 'agentive' (endogenously capable) systems. Drawing on Cartesian philosophy and sci-fi tropes, it analyzes agency through five dimensions: goal, identity, decision-making, self-regulation, and learning. The authors propose a Goal-Identity-Configurator (GIC) architecture for general-purpose agents, featuring hierarchical goal decomposition, identity evolution, simulative reasoning via world models, and self-directed learning. The work also examines auditability and safety implications of autonomous agentive systems under human oversight.
agenticagentivegoal-identity-configuratorself-regulationworld-model
Faithful by Construction: Claim-Anchored Attribution for Multi-Document Summarization
The paper introduces CAMS, a Claim-Anchored Multi-document Summarization framework that improves faithfulness and attribution in multi-document summaries by modularizing the process into claim extraction, clustering, selection, and rewriting. CAMS extracts atomic claims with token-level provenance, clusters equivalent claims while flagging conflicts, selects a support-aware subset, and rewrites with fine-grained source traceability. Evaluated on MultiNews, DiverseSumm, and WCEP, CAMS matches end-to-end baselines in summary quality while improving faithfulness and citation precision by roughly two-thirds, exposing a controllable faithfulness-coverage trade-off.
multi-document summarizationfaithfulnessattributionclaim extractiontoken-level provenance
Maestro Order: A Model-Agnostic Orchestration Harness
Maestro Order introduces a model-agnostic orchestration harness that enhances unreliable base solvers (e.g., language models prone to hallucinations) via four structural primitives (decompose, ensemble, verify, recurse) and a budget-aware controller. The system employs black-box solvers, layered verifier ensembles, and dynamic compute allocation to maximize reliability per unit cost. Monte Carlo simulations demonstrate geometric reliability improvements (0.55→0.98 with two gates, →0.999 with four), with budget-aware control achieving target reliability at reduced cost. Key findings include verification's superiority over voting and critical failure modes (verifier gaming, correlated errors).
orchestration harnessverifier ensemblebudget-aware controllermonte carlo simulationreliability amplification
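One simple way to see why stacked verification gates can raise reliability geometrically is Bayes' rule with independent gates, sketched below. This is not the paper's Monte Carlo setup, which the summary does not describe: the independence assumption, the function name `reliability_after_gates`, and the verifier rates (`tpr` = chance a gate passes a correct answer, `fpr` = chance it passes a wrong one) are hypothetical values picked for illustration.

```python
def reliability_after_gates(prior, k, tpr=0.95, fpr=0.15):
    """Probability an answer is correct given that it passed k independent gates.

    prior: base solver's chance of being correct before any verification.
    Each gate multiplies the odds of correctness by tpr / fpr.
    """
    pass_and_correct = prior * tpr ** k
    pass_and_wrong = (1 - prior) * fpr ** k
    return pass_and_correct / (pass_and_correct + pass_and_wrong)


def acceptance_rate(prior, k, tpr=0.95, fpr=0.15):
    """Fraction of answers that survive all k gates (the rest need a retry)."""
    return prior * tpr ** k + (1 - prior) * fpr ** k


base = 0.55
two_gates = reliability_after_gates(base, 2)    # about 0.980
four_gates = reliability_after_gates(base, 4)   # about 0.9995
```

With these illustrative rates the outputs land close to the figures quoted in the summary, but that reflects the chosen parameters rather than anything known about the paper's simulation. The geometric gain depends on the gates failing independently; the correlated verifier errors the summary lists as a failure mode would break that assumption and flatten the curve. The falling `acceptance_rate` is the cost side that a budget-aware controller would have to weigh.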
Offline Reinforcement Learning for Warehouse SLAM Throughput Control
The authors propose an offline reinforcement learning framework for optimizing SLAM (Scan/Label/Apply/Manifest) throughput control in warehouse fulfillment systems. The method features a history-informed state representation, action space abstraction for delayed-impact control, and a multi-metric reward function, evaluated using three offline RL algorithms on historical operational data. Empirical results show that Conservative Q-Learning (CQL) improves system health by 22.97% and reduces throttling duration by 3.18% compared to alternatives, demonstrating effective warehouse throughput optimization.
offline reinforcement learningthroughput controlwarehouse fulfillmentconservative q-learningslam optimization
Neuro-Symbolic Drive: Rule-Grounded Faithful Reasoning for Driving VLAs
Neuro-Symbolic Drive introduces a neuro-symbolic framework for driving VLAs that grounds reasoning in rule-based planner traces. The method instruments classical planners to capture decision traces, serializes them into structured reasoning steps, and uses these to fine-tune Qwen3.5-4B. This ensures causal coupling between reasoning and motion generation. Results show improved performance: ADE@3s reduced from 0.47 to 0.26 (3-camera) and 0.54 to 0.26 (8-camera), with miss rates dropping from 8.30% to 6.40% and 10.13% to 5.99% respectively.
neuro-symbolicrule-groundedvlachain-of-thoughttrajectory
When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents
The study challenges the use of exact-match retrieval recall as a proxy for policy context utility in long-horizon tool-use agents. Using Qwen2.5-3B/7B classifiers on tau-bench, structured state representations improved macro-F1 by 0.13-0.17 over raw trajectories. Despite low exact-match recall (7% rank-1 retrieval), retrieved clauses achieved comparable macro-F1 (0.58 vs. 0.60 gold) to gold clauses, with mismatched/no-policy controls scoring 0.32/0.21. Results held across retrievers and model sizes, suggesting recall metrics may underestimate downstream utility in policy classification tasks.
retrieval metricspolicy classificationmacro-f1qwen2.5tau-bench
RIFT-Bench: Dynamic Red-teaming For Agentic AI Systems
The paper introduces RIFT-Bench, a graph representation-based methodology for dynamic red-teaming of agentic AI systems, addressing limitations of existing security evaluations tied to specific implementations. The approach employs a hierarchical representation with two automated phases: Discovery (extracting system structure) and Scanning (deploying adaptive adversarial attacks), producing comprehensive evaluations across diverse attack vectors. Evaluations on 45 heterogeneous agentic systems demonstrate the method's generalization capability, while also supporting direct assessment of mitigation strategies. The framework provides a scalable foundation for security evaluation of autonomous AI systems.
red-teamingagentic aigraph representationadversarial attackssecurity evaluation
Catastrophic Compositional Generation: Why Vanilla Diffusion Models Fail to Extrapolate
The paper demonstrates that vanilla conditional diffusion models fundamentally fail at compositional generation tasks, where target distributions are geometric combinations of source distributions seen during training. Through theoretical analysis and experiments on synthetic and real data, the authors show that inference-time techniques cannot efficiently produce correct samples when targets are out-of-distribution, with score estimation error being particularly catastrophic. While methods like Feynman-Kac correction reduce approximation error, the results indicate a need for alternative approaches to compositional generation.
compositional generationdiffusion modelsscore estimationout-of-distributionfeynman-kac correction
ARIA: Adaptive Region-Based Importance Allocation for Conditional Diffusion Distillation
ARIA introduces adaptive region-based importance allocation for conditional diffusion distillation, addressing the challenge of training effort allocation across a large conditioning space. The method maintains online estimates of teacher-student discrepancy at coarse region levels, focusing updates on misaligned regions while preserving the original distillation objective. Empirical results show ARIA outperforms RC across architectures, particularly in unseen and underrepresented regimes, with theoretical guarantees under bounded variance and drift assumptions.
conditional diffusionknowledge distillationadaptive trainingregion-based allocationteacher-student discrepancy
The Professor: Multi-Teacher Unsupervised Prompt Distillation for Vision-Language Models
TheProfessor introduces multi-teacher unsupervised prompt distillation for vision-language models, extending PromptKD by distilling from a fixed ensemble of two teachers: a domain-finetuned PromptSRC ViT-L/14 and a zero-shot EVA-CLIP-L/14. The method evaluates confidence-weighted and equal-probability ensembling on four datasets (Caltech-101, DTD, UCF101, EuroSAT), showing average HM improvements of +1.77 and +1.37 points respectively. Results indicate domain-shifted datasets (e.g., EuroSAT, +5.78 HM) benefit most from complementary teacher supervision.
prompt distillationvision-language modelsmulti-teacher ensembledomain shiftharmonic mean
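The two ensembling modes compared in the entry above can be sketched as follows. The exact weighting rule is not given in the summary, so this sketch assumes each teacher's confidence is its maximum class probability and that weights are proportional to confidence; it also assumes HM means the harmonic mean of base-class and novel-class accuracy, as is common in prompt-learning evaluations. The names `ensemble_teachers` and `harmonic_mean` are hypothetical.

```python
def ensemble_teachers(probs_a, probs_b, mode="confidence"):
    """Blend two teachers' class-probability vectors into one soft target.

    mode="equal": each teacher gets weight 0.5.
    mode="confidence": weights proportional to each teacher's max probability
    (an assumed definition of confidence).
    """
    if mode == "equal":
        weight_a = 0.5
    elif mode == "confidence":
        conf_a, conf_b = max(probs_a), max(probs_b)
        weight_a = conf_a / (conf_a + conf_b)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(probs_a, probs_b)]


def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of two accuracies; low if either one is low."""
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# A confident teacher and an unsure one, two classes.
confident = [0.9, 0.1]
unsure = [0.5, 0.5]
equal_target = ensemble_teachers(confident, unsure, mode="equal")       # [0.7, 0.3] up to rounding
weighted_target = ensemble_teachers(confident, unsure, mode="confidence")
```

Under this assumed rule the confident teacher pulls the blended target further toward its own prediction than equal weighting does, which is one plausible reading of why the two modes can produce different HM gains.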
E-MRL: Cross-view Aligned Evidence-driven Multimodal Reinforcement Learning for Reliable 3D Tumor Analysis
The paper proposes Evidence-driven Multimodal Reinforcement Learning (E-MRL), a framework for reliable 3D tumor analysis that addresses visual hallucinations in Vision-Language Models by grounding diagnostic reports in verifiable visual evidence. E-MRL formulates report generation as a Markov Decision Process comprising diagnosis, localization, and verification stages, explicitly identifying a 'key evidence slice' alongside global diagnoses. A cross-view consistency reward ensures semantic alignment between gold-standard reports and localized visual re-queries. Experiments on large-scale 3D CT tumor datasets demonstrate that E-MRL significantly reduces hallucinations and improves diagnostic accuracy compared to Supervised Fine-Tuning and Reinforcement Learning baselines.
vision-language modelsmarkov decision processcross-view consistencysupervised fine-tuning3d ct tumor analysis
Mind the Heads: Topological Representation Alignment for Multimodal LLMs
We propose Head-Wise Representation Alignment (HeRA), a method for improving Multimodal Large Language Models (MLLMs) by aligning individual attention heads with an external vision encoder. HeRA enforces cross-modal alignment at the head level, preserving topological structure via a contrastive objective based on the Mutual K-Nearest Neighbor (MKNN) metric. Counterintuitively, aligning the least aligned heads yields the largest gains. Evaluations across 18 benchmarks demonstrate that HeRA consistently improves performance on vision-centric tasks and reduces visual hallucinations by curbing over-reliance on linguistic priors.
multimodal large language modelsattention headsrepresentation alignmentmutual k-nearest neighborvisual hallucinations
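As a rough illustration of the Mutual K-Nearest Neighbor (MKNN) metric mentioned in the HeRA summary above, the sketch below scores how well two feature spaces agree on each sample's neighborhood. It is a plain, non-differentiable measurement in pure Python; the contrastive training objective built on it is not reproduced, and the function names are hypothetical.

```python
import math


def knn_indices(points, k):
    """For each point, return the index set of its k nearest neighbours (Euclidean)."""
    neighbours = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours.append({j for _, j in dists[:k]})
    return neighbours


def mutual_knn(feats_a, feats_b, k):
    """Average fraction of shared k-nearest neighbours between two feature spaces.

    feats_a[i] and feats_b[i] must describe the same sample, e.g. one attention
    head's representation and a vision encoder's representation (an assumption
    about how the metric would be applied here).
    """
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    shared = sum(len(a & b) for a, b in zip(nn_a, nn_b))
    return shared / (k * len(feats_a))


# Toy example: the second space is a rescaled copy, so neighbourhoods coincide.
head_feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
vision_feats = [[3.0 * v for v in row] for row in head_feats]
score = mutual_knn(head_feats, vision_feats, k=2)
```

A score near 1 means the two spaces preserve the same local structure; a score near 0 means the neighborhoods are unrelated.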
One Year Later...The Harms Persist, But So Do We!
This study introduces an eight-dimension harm taxonomy and multi-dimensional evaluation framework to assess the safety of general-purpose large language models (LLMs) in mental health contexts. Six proprietary LLMs were evaluated across 16 DSM-5 conditions using four adversarial attack variants. Results indicate that safeguards are reliable only for suicide and self-harm, while failure rates reach up to 100% for conditions like eating disorders, substance use disorder, and major depressive disorder. The findings highlight the urgent need for ethical design and deployment of LLMs, emphasizing the necessity of clearly defined harm categories and robust safeguards to mitigate risks to vulnerable populations.
large language modelsdsm-5 conditionsadversarial attackharm taxonomysafeguards
Promise and challenges of heart chamber segmentation from non-contrast CT scans using contrastive unpaired image translation: a feasibility study
The study introduces ChameleonNet, a framework for heart chamber segmentation from non-contrast CT scans using contrastive unpaired image translation. The method combines a Contrastive Unpaired Translation (CUT) network with decoupled contrastive learning (DCL) loss to synthesize non-contrast CT from contrast-enhanced scans, followed by a Hausdorff distance loss-enhanced nnU-Net for segmentation. Evaluated on 36 synthesized and 36 real non-contrast CT scans, the model achieved Dice scores of 0.91-0.94 and Hausdorff distances of 3.63-5.74 mm, with Pearson correlations of 0.82-0.93 for volume agreement. While the approach is feasible, volume errors (9.22-20.79% MAPE) indicate a need for refinement before clinical deployment.
contrastive unpaired translationhausdorff distancennunetheart chamber segmentationnon-contrast ct
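For readers less familiar with the two headline metrics in the ChameleonNet summary above, here is a minimal Python sketch of the Dice score and the mean absolute percentage error (MAPE). The masks and volumes are made-up toy values, not data from the study.

```python
def dice(pred, truth):
    """Dice score for two flat binary masks: 2|A ∩ B| / (|A| + |B|)."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(pred) + sum(truth)
    return 2.0 * intersection / total if total else 1.0


def mape(pred_volumes, true_volumes):
    """Mean absolute percentage error between predicted and reference volumes."""
    errors = [abs(p - t) / t for p, t in zip(pred_volumes, true_volumes)]
    return 100.0 * sum(errors) / len(errors)


# Toy values for illustration only.
overlap = dice([1, 1, 0, 0], [1, 0, 0, 0])
volume_error = mape([110.0, 90.0], [100.0, 100.0])
```

The two metrics can diverge: a mask may overlap well (high Dice) while small boundary errors still add up to a sizeable volume error, which is the pattern the summary reports.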
JupOtter: Cell-Level Bug Detection in Jupyter Notebooks
JupOtter introduces a bug detection system for Jupyter Notebooks, addressing the rise of buggy notebooks in data science workflows. The system employs notebook-specific tokenization preserving cell structure, cell-level bug prediction, and a new labeled dataset (OtterDataset) with 21,000+ annotated notebooks. Evaluations show JupOtter outperforms static analyzers and large language models in cell-level bug detection F1 scores across two of three benchmark datasets.
jupyter notebooksbug detectiontokenizationcell-level predictionotterdataset
MGI: Member vs Generated Inference
The paper introduces Member vs Generated Inference (MGI), a framework to distinguish between training data and generated samples from a target generative model. Existing methods fail due to overlapping likelihood signals between members and generated outputs. The proposed Data Circuit Breaker (DCB) combines autoencoder and latent generator signals to address this, demonstrating robustness across autoregressive and diffusion models, including cases of near-duplicate reproduction and model derivatives trained on generated data.
member vs generated inferencedata circuit breakergenerative modelsmembership inferencediffusion models
Are Safety Guarantees in Neural Networks Safe? How to Compute Trustworthy Robustness Certifications
The paper introduces apothem-optimal certifications for neural network robustness, computed via linear calls to a verifier oracle, addressing the intractability of volume-optimal certifications. It proves the impossibility of volume-optimal oracle-based algorithms and proposes dual certifications for class-wide robustness bounds. The ParallelepipedoNN system demonstrates a two-fold improvement in minimum edge length on MNIST and Fashion MNIST benchmarks compared to prior work.
adversarial examplesrobustness certificationsapothem measureneural network verifierdual certifications
The Measurable Majority
The paper introduces social decision frames to model strict majority reasoning in finite electorates, characterizing representability via finitely additive measures. It develops a minimal logic for strict majorities, proving soundness and completeness, and examines combinatorial aspects of incoherence in set families. Applications include correcting Suppes' representation theorem and establishing a May-type characterization for majority rule.
social decision framesstrict majorityfinitely additive measurescoherence criterionmay-type characterization
Decentralized Coordination of Autonomous Traffic Through Advanced Air Mobility Corridors
This paper demonstrates that autonomous aircraft can self-organize into Advanced Air Mobility (AAM) corridors without centralized control, achieving >94% corridor boundary compliance in decentralized settings. Using fixed-wing aircraft scenarios—single-corridor traversal, sequential corridors, and corridor bifurcations—the study shows efficient goal attainment with minimal tactical interventions for separation violations in low/medium traffic densities. Intervention frequency rises only in high-density conditions.
advanced air mobilitydecentralized coordinationautonomous trafficcorridor complianceseparation minimum
Deciphering Fingerprints of 3D Molecular Surfaces for Accurate Epitope Prediction
SurfBind introduces a surface-centric learning framework for epitope prediction, leveraging molecular surface representations to capture geometric and physicochemical patterns crucial for antibody-antigen recognition. The method employs a Transformer-based architecture with patch-level surface modeling, binder-aware cross-attention, and a hierarchical coarse-to-fine prediction paradigm. Evaluated on benchmarks SAbDab and DB5.5, SurfBind achieves state-of-the-art performance and demonstrates strong generalization across unseen antibodies and conformational states, underscoring the efficacy of interaction-aware surface modeling in protein-protein interaction analysis.
epitope predictionmolecular surfacetransformer-based architecturecross-attentionprotein-protein interaction
From Spatial to Spectral: An Efficient, Frequency-Guided Feature Representation Learner for Small Object Detection
The paper introduces a frequency-guided feature representation framework for small object detection, addressing feature scarcity by shifting from spatial to spectral processing. The method employs a Decompose-Enhance-Reconstruct (DER) operator with three modules (Wavelet-Difference Gate, Log-Gabor Enhancer, and Frequency-Driven Head) to inject frequency-aware modulation into backbone, neck, and head components. Evaluated on VisDrone2019, UAVDT, TinyPerson, and DOTAv1, DERNet outperforms YOLOv11 models with 1/6 the parameters, demonstrating robust performance across diverse detector architectures.
frequency-guidedsmall object detectionspectral processingwavelet-difference gatelog-gabor enhancer
Ten Digits on a Train: AI-Assisted Verification of Two Eigenvalue Problems
The paper presents a human-AI collaborative approach for certifying eigenvalues in singular and non-normal operators, achieving 10-digit precision. For a singular self-adjoint Schrödinger operator, Dirichlet-Neumann bracketing and zero counting verified the negative spectrum. A non-normal atom-molecule benchmark resolved a resonance pair via global matching of projective solution lines, with tail uncertainty encoded in terminal data and validated using componentwise Krawczyk-Brouwer inclusion. AI generated candidate solutions and proof strategies but failed in some cases, highlighting the need for human oversight in verified computation. The work underscores the importance of proof objects and adapts validation methods for ill-conditioned systems.
eigenvalue certificationnon-normal operatorskrawczyk-brouwer inclusionverified computationprojective solution lines
From Task-Guided Conversational Graphs to Goal-Oriented Dialogue Runtimes
The paper introduces Goal-Oriented Dialogue Runtime (GODR), a framework-neutral design pattern for managing complex, interruptible multi-goal conversations in LLM workflows. GODR treats goals, task frames, lifecycle states, invalidation rules, and resumption contracts as first-class runtime objects, delegating bounded execution to existing graph runtimes or agents. The work formalizes the problem of conversational continuity in multi-domain settings and proposes architectural criteria, positioning evaluation as future work rather than presenting empirical results.
goal-oriented dialogueconversational continuityruntime objectsinvalidation rulesmulti-agent orchestration
Integrated Sensing and Communications for Real-time Avatar Control in XR over 5G
The paper proposes a multimodal sensing architecture for XR avatar control, combining 5G mmWave ISAC for coarse body-level gesture recognition and sEMG for fine finger-level gestures. The 5G-based approach uses power-per-beam-pair (PPBP) derived from standard beam management, achieving 82.2±5.9% accuracy on unseen users. sEMG sensors capture discriminative forearm muscle activity for precise interactions. Evaluations demonstrate complementary performance: 5G handles macro-movements while sEMG enables micro-gestures, forming a complete real-time control framework without line-of-sight constraints.
5g mmwave isacpower-per-beam-pairsurface electromyographyintegrated sensingavatar control
Polycepta: Object-Centric Appearance Estimation for Multi-Object Tracking
Polycepta introduces an object-centric appearance estimation framework for multi-object tracking (MOT), reformulating appearance modeling as a recursive estimation problem. Unlike static descriptors, Polycepta constructs and updates independent appearance states for each tracked object, refining estimates as observations accumulate. The method learns object-specific representations without memorization, enabling generalization to unseen classes. Integrated into tracking-by-detection pipelines, Polycepta reduces identity switches and improves performance, achieving 92.27% MOTA on KITTI and operating at 90.57 Hz. Experiments on KITTI, Waymo Open Dataset, and MOT17 validate its effectiveness.
polyceptamulti-object trackingappearance estimationrecursive estimationkitti benchmark
GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents
The study introduces a matched execution-layer benchmark to evaluate GUI and CLI agents under identical conditions, controlling for task, state, and verification variables. Using 440 desktop tasks across 18 applications and 12 workflow categories, it compares screen-only GUI agents with skill-mediated CLI agents. Results show GUI agents achieve 59.1% full pass rate, surpassing CLI agents (48.2%), but CLI performance rises to 69.3% with verifier-guided skill augmentation, revealing modality-specific bottlenecks: GUI agents struggle with long-horizon interaction grounding, while CLI agents face skill coverage limitations.
execution-layer benchmarkgui agentscli agentsskill augmentationgrounded interaction
Cryptographic certificates of validity for trustworthy AI
The authors propose cryptographic certificates of validity for agentic AI systems, enabling formal verification of policy compliance without re-execution or trust in the agent. The method transforms correctness conditions into logical predicates, compiles them into polynomial constraint witness-checking problems, and employs succinct cryptographic proofs (optionally zero-knowledge) to certify policy adherence. This approach bridges formal verification and cryptographic authentication, allowing independent verification of agent actions. The paper outlines the mathematical translation, relates it to proof-carrying code, zkVMs, formal methods, and agent governance, and identifies key implementation challenges.
cryptographic certificatesagentic aipolynomial constraintszero-knowledge proofsformal verification
New Bounds for the Last Iterate of the Stochastic subGradient Method
The paper establishes tight bounds for the last iterate of the stochastic subgradient method (SsGM) in one-dimensional convex Lipschitz optimization. For fixed horizon $n$ and stepsize $\eta=\Theta(1/\sqrt{n})$, it proves an $O(1/\sqrt{n})$ optimization error under i.i.d. subgradient noise with uniformly bounded variance, eliminating the $\log n$ factor from prior generic bounds. Conversely, without the i.i.d. assumption, the error becomes $\Omega((\log n)/\sqrt{n})$, demonstrating SsGM's suboptimality under uniform variance alone and resolving an open problem from Koren and Segal (COLT 2020).
stochastic subgradient methodlast iterate convergenceconvex optimizationlipschitz objectivesoptimization error
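The setting in the summary above (one dimension, fixed horizon, constant stepsize of order 1/sqrt(n), i.i.d. noise) can be simulated with a short Python sketch. The objective f(x) = |x|, the Gaussian noise model, and the function name are illustrative assumptions; the code shows the iteration itself and does not verify the paper's bounds.

```python
import math
import random


def ssgm_last_iterate(n, x0=1.0, noise_std=1.0, seed=0):
    """Run n steps of the stochastic subgradient method on f(x) = |x|.

    Uses the fixed-horizon stepsize eta = 1 / sqrt(n) and i.i.d. zero-mean
    Gaussian noise added to the subgradient. Returns the last iterate x_n.
    """
    rng = random.Random(seed)
    eta = 1.0 / math.sqrt(n)
    x = x0
    for _ in range(n):
        # A subgradient of |x|: sign(x), with 0 chosen at x = 0.
        g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
        x -= eta * (g + rng.gauss(0.0, noise_std))
    return x


n = 10_000
x_last = ssgm_last_iterate(n)
error = abs(x_last)  # f(x_n) - min f, since min f = 0
```

Repeating the run over many seeds and several values of n would give an empirical picture of how the last-iterate error scales with the horizon.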
Real vs. Complex Spectral Bases for Neural Operators: The Role of Green's Function Alignment
The Hartley Neural Operator (HNO) is introduced as a real-valued alternative to Fourier Neural Operators (FNO) for learning solution operators of partial differential equations (PDEs). HNO replaces the complex FFT with the Discrete Hartley Transform, retaining twice as many frequency corners while using real-valued multipliers. The choice between HNO and FNO is shown to depend on the operator's Green's function symmetry: HNO excels with self-adjoint elliptic operators due to exact diagonalization of symmetric Green's functions, while FNO is preferred for time-dependent operators with phase content. Empirical benchmarks across PDE classes confirm this elliptic-versus-time-dependent split, providing a predictive rule for spectral basis selection.
hartley neural operatorfourier neural operatorgreen's functiondiscrete hartley transformspectral basis
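The Discrete Hartley Transform at the core of the HNO item above is easy to state in code. The sketch below is a direct O(N^2) pure-Python version using the standard kernel cas(t) = cos(t) + sin(t); it illustrates the transform only and is not the operator layer from the paper.

```python
import math


def dht(x):
    """Discrete Hartley Transform: H[k] = sum_t x[t] * (cos(a) + sin(a)), a = 2*pi*t*k/N.

    Real input gives real output, so no complex arithmetic is needed.
    """
    n = len(x)
    out = []
    for k in range(n):
        acc = 0.0
        for t in range(n):
            angle = 2.0 * math.pi * t * k / n
            acc += x[t] * (math.cos(angle) + math.sin(angle))
        out.append(acc)
    return out


def inverse_dht(h):
    """The DHT is its own inverse up to a factor of 1/N."""
    n = len(h)
    return [v / n for v in dht(h)]


signal = [1.0, 2.0, 0.5, -1.0]
recovered = inverse_dht(dht(signal))
```

For real input the DHT equals the real part minus the imaginary part of the usual DFT, which is why it carries the same spectral information while staying real-valued.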
L3Cube-MahaPOS: A Marathi Part-of-Speech Tagging Dataset and BERT Models
The paper introduces L3Cube-MahaPOS, a manually annotated Marathi POS tagging dataset with 32,354 sentences following Universal Dependencies guidelines, addressing the language's under-resourced status despite its 83 million speakers. The dataset features Devanagari-specific preprocessing and evaluates six model families, including MahaBERT-v2 and MuRIL, with the best system achieving 88.67% token-level accuracy and 81.67% macro-F1. The release includes annotation guidelines and model checkpoints to advance Marathi NLP research.
part-of-speech tagginguniversal dependenciesdevanagari tokenisationmahabert-v2morphological richness
Dirac-Frenkel dynamics with inertia for nonlinearly parametrized solutions of evolution problems
The paper introduces inertial Dirac-Frenkel dynamics to address non-uniqueness and ill-conditioning in parameter dynamics for nonlinearly parametrized systems like neural networks. By incorporating inertia, the method preserves useful historical velocity information in weakly informed parameter directions while maintaining Dirac-Frenkel dynamics in well-informed directions. Theoretical analysis establishes well-posedness and provides a posteriori error bounds. Discretized implementation solves regularized linear least-squares problems with velocity anchoring, demonstrating improved robustness in numerical experiments.
dirac-frenkel dynamicsnonlinear parametrizationinertial regularizationa posteriori errorleast-squares problem
Model selection with proper scoring rules on data sets of time series
The study investigates model selection among probabilistic models for time series data using proper scoring rules, focusing on score aggregation methods. It demonstrates that conflicting model selection decisions arise from skewed score distributions and that selection criteria converge as test set size increases. Empirical analysis on intermittent time series, including the M5 competition dataset, reveals that mean score reliably identifies the true model for short test sets, while mean rank remains invariant under scaling. Results emphasize the importance of large test sets for consistent model selection.
probabilistic modelsproper scoring rulestime seriesmodel selectionscore aggregation
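The tension between aggregation rules described in the summary above shows up even on a four-series toy example. The Python sketch below compares mean score with mean rank for two hypothetical models (lower scores are better, ties are not handled); the numbers are invented for illustration.

```python
def mean_score(scores):
    """Average each model's score over all series."""
    return {m: sum(s) / len(s) for m, s in scores.items()}


def mean_rank(scores):
    """Rank models within each series (1 = best, lowest score), then average ranks."""
    models = list(scores)
    n_series = len(scores[models[0]])
    totals = {m: 0.0 for m in models}
    for i in range(n_series):
        ordered = sorted(models, key=lambda m: scores[m][i])
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: t / n_series for m, t in totals.items()}


# Model A wins on three series but scores very badly on the fourth (skewed scores).
scores = {"A": [1.0, 1.0, 1.0, 10.0], "B": [1.2, 1.2, 1.2, 2.0]}
by_score = mean_score(scores)
by_rank = mean_rank(scores)

# Rescaling one series (e.g. a change of units) leaves the ranks untouched.
rescaled = {m: s[:3] + [s[3] * 0.01] for m, s in scores.items()}
by_rank_rescaled = mean_rank(rescaled)
by_score_rescaled = mean_score(rescaled)
```

Here mean score prefers B and mean rank prefers A, and rescaling the fourth series flips the mean-score decision while the mean-rank decision stays the same.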
A Physics-Informed Fourier-Wavelet Transformer for Multiscale Computational Fluid Dynamics Surrogate Modeling
The study introduces a physics-informed Fourier-wavelet transformer for multiscale computational fluid dynamics surrogate modeling, combining hybrid Fourier-wavelet spectral encoding with physics-biased self-attention based on PDE residual diagnostics. The model employs self-supervised pretraining via Masked Physics Prediction and Equation Consistency Prediction. Evaluated on cylinder-wake flow and fluid-structure interaction benchmarks, it achieves superior performance with normalized mean-squared errors of 0.05875 and 2.70×10⁻⁴, respectively, outperforming spectral, transformer-based, and physics-informed neural-network baselines in recovering localized wake structures.
physics-informedfourier-wavelet transformercomputational fluid dynamicssurrogate modelingself-attention
Extended pseudo-spectral physics-informed neural networks for phase-field models
The study introduces an extended pseudo-spectral physics-informed neural network (ESPINN) framework for inverse identification of phase-field models from transient snapshot data. ESPINN simultaneously recovers bulk chemical potential and gradient coefficients, demonstrating accurate reconstruction of the Cahn-Hilliard equation in noiseless conditions. The method shows robustness under noise, with improved stability from additional snapshots. Results highlight ESPINN's data efficiency and physical consistency in learning free-energy structures for phase separation models.
phase-field modelspseudo-spectralphysics-informed neural networkscahn-hilliard equationfree-energy structure
QC-SMOTE: Quality-Controlled SMOTE for Imbalanced Classification
QC-SMOTE introduces a quality-controlled oversampling framework for imbalanced classification, addressing low-quality synthetic sample generation in noisy or overlapping regions. The method estimates minority sample reliability via a composite neighbourhood trustworthiness score, combining local density, safe-level, and isolation from the majority class. Synthetic candidates are generated using an IPQ-guided best-of-K strategy, evaluating midpoint purity and majority clearance, with allocation based on sample reliability and boundary informativeness. The framework adapts across overlap-imbalance regimes, adjusting interpolation range and selection criteria. Experiments on 30 imbalanced datasets demonstrate QC-SMOTE's superior average AUC-ROC and Macro F1, especially under moderate and severe imbalance.
oversamplingimbalanced classificationsynthetic samplesneighbourhood trustworthinessauc-roc
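The best-of-K idea in the QC-SMOTE summary above can be sketched on top of ordinary SMOTE interpolation. In the Python below, the quality score is a stand-in (distance from the candidate to the nearest majority sample); the paper's IPQ score, midpoint purity check, and allocation rule are not specified in the summary and are not reproduced. All names are hypothetical.

```python
import math
import random


def best_of_k_synthetic(x, neighbour, majority, k=5, rng=None):
    """Generate k SMOTE-style candidates between x and a minority neighbour,
    then keep the one with the largest clearance from the majority class.

    Assumption: 'clearance' = distance to the nearest majority sample.
    """
    rng = rng or random.Random(0)
    candidates = []
    for _ in range(k):
        lam = rng.random()  # interpolation coefficient in [0, 1)
        cand = [a + lam * (b - a) for a, b in zip(x, neighbour)]
        clearance = min(math.dist(cand, m) for m in majority)
        candidates.append((clearance, cand))
    return max(candidates)[1]


minority_point = [0.0, 0.0]
minority_neighbour = [1.0, 0.0]
majority_points = [[0.0, 1.0]]
synthetic = best_of_k_synthetic(minority_point, minority_neighbour, majority_points)
```

Plain SMOTE corresponds to k=1; raising k trades extra computation for candidates that sit further from the majority class under the chosen score.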
EERLoss: A Novel Loss Function for Training Deep Biometric Models. A Case Study in Keystroke Dynamics
The paper introduces EERLoss, a subdifferentiable approximation of Equal Error Rate (EER) for training deep biometric models, directly optimizing the primary evaluation metric. The method is validated on keystroke dynamics verification using the KVC-onGoing benchmark (185,000+ subjects), demonstrating superior performance to state-of-the-art losses. Results show a 30% relative EER reduction and faster convergence compared to existing approaches, particularly in high-variance biometric scenarios.
eerlossbiometric verificationkeystroke dynamicssubdifferentiable approximationkvc-ongoing
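For context on the metric that EERLoss approximates, the sketch below computes the Equal Error Rate directly from verification scores by sweeping thresholds. This is the ordinary, non-differentiable metric in plain Python, not the subdifferentiable approximation from the paper; the scores are toy values.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping thresholds over the observed scores.

    A trial is accepted when score >= threshold. Returns the average of the
    false accept rate (FAR) and false reject rate (FRR) at the threshold
    where the two are closest.
    """
    best_gap, best_eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


# Toy similarity scores: higher means "more likely the same user".
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.1, 0.2, 0.3, 0.6]
eer = equal_error_rate(genuine_scores, impostor_scores)
```

Because the counts change in jumps as the threshold moves, this quantity has no useful gradient, which is the obstacle a smooth surrogate is meant to remove.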
Reasoning as Attractor Dynamics: Latent Memory Retrieval via Gibbs-Weighted Energy Minimization
The paper reinterprets Large Language Models (LLMs) as high-dimensional Dense Associative Memories, where correct reasoning chains correspond to stable attractor basins in the energy landscape. It introduces a Gibbs-weighted retrieval mechanism that samples multiple reasoning paths and weights each by its Gibbs factor (P ∝ e^{-βE}), so lower-energy paths count more, to approximate the equilibrium distribution. This method improves Microsoft Phi-3.5's GSM8K accuracy by 5.38 percentage points (84.7% → 90.1%), demonstrating that inference benefits from being modeled as attractor dynamics rather than autoregressive generation.
attractor dynamicsgibbs measuredense associative memoryenergy landscapespectral entropy
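The weighting rule P ∝ e^{-βE} from the summary above can be sketched as a weighted vote over sampled reasoning paths. In the Python below, each path is a hypothetical (answer, energy) pair; how the paper actually assigns an energy to a path is not stated in the summary, so the energies here are made-up inputs.

```python
import math
from collections import defaultdict


def gibbs_weighted_answer(paths, beta=1.0):
    """Aggregate sampled paths with weights proportional to exp(-beta * energy).

    paths: list of (answer, energy) pairs. Returns the highest-probability
    answer and the normalised probability of every answer.
    """
    e_min = min(e for _, e in paths)  # shift energies for numerical stability
    weights = defaultdict(float)
    for answer, energy in paths:
        weights[answer] += math.exp(-beta * (energy - e_min))
    z = sum(weights.values())
    probs = {a: w / z for a, w in weights.items()}
    return max(probs, key=probs.get), probs


# "41" has more votes, but "42" sits in the lower-energy paths.
sampled = [("42", 1.0), ("41", 3.0), ("42", 1.5), ("41", 2.5), ("41", 3.5)]
answer, probabilities = gibbs_weighted_answer(sampled, beta=1.0)
```

With beta = 0 this reduces to plain majority voting; larger beta concentrates the vote on the lowest-energy paths.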
An LLM-based Two-Stage Transformer Framework for Cross-Domain Bearing Fault Diagnosis with Limited Data
The paper proposes a two-stage Transformer framework for cross-domain bearing fault diagnosis under limited labeled data. The method employs a lightweight GPT-2-style architecture with causal self-attention for hierarchical feature extraction, combining pre-trained encoder weights and fault prototype embeddings as knowledge carriers. It addresses domain shifts through multi-source learning, prototype-based modulation, and taxonomy-adaptive classification. Experiments on four real-world datasets show 92.61% average accuracy with only 10% labeled target data, outperforming SOTA by 17.24 percentage points.
transfer learningcausal self-attentionfault prototypedomain adaptationpredictive maintenance
An Agnostic Machine Learning Model of Photosynthetic Habitability
The study introduces an agnostic Photosynthetic Habitable Zone (PHZ) model, independent of Earth-centric biases, by optimizing a generalized photosynthesis model based on thermodynamics and redox chemistry. A genetic algorithm simulates evolution, optimizing photochemical reactions against stellar irradiance spectra for exoplanets orbiting main-sequence stars. Results indicate photosynthetic organisms compensate for reduced flux by evolving larger light-harvesting structures, enabling viability to decline linearly with orbital distance despite quadratic flux reduction. The agnostic PHZ extends beyond Earth-based estimates, supporting oxygenic and anoxygenic photosynthesis across M, K, and G star habitable zones, with M-dwarf exoplanets potentially exhibiting NIR-driven oxygenic photosynthesis distinct from Earth.
photosynthetic habitable zoneredox chemistrygenetic algorithmstellar irradiancelight-harvesting structures
Data Augmentation: A Fourier Analysis Perspective
The paper establishes that partial data augmentation with randomly sampled group elements achieves the same minimax rates as full augmentation for classical learning problems, up to a vanishing approximation error. Using Fourier analysis and finite group representation theory, the authors demonstrate that statistical benefits persist despite approximate symmetry enforcement. They also prove an impossibility result: exact invariance requires full-group averaging when hypothesis spaces are sufficiently expressive, contrasting with partial augmentation's efficacy for approximate cases.
data augmentationfourier analysisminimax ratesgroup invariancerepresentation theory
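The contrast between full-group averaging and partial augmentation in the summary above can be made concrete with the cyclic shift group acting on a short vector. The Python sketch below is an assumption-laden toy: the group, the test function, and the number of sampled shifts are all chosen for illustration.

```python
import random


def cyclic_shift(x, s):
    """Shift a list left by s positions (an element of the cyclic group acting on x)."""
    s %= len(x)
    return x[s:] + x[:s]


def group_average(f, x, shifts):
    """Average f over the given group elements applied to x."""
    shifts = list(shifts)
    return sum(f(cyclic_shift(x, s)) for s in shifts) / len(shifts)


def f(v):
    """A test function that is not shift-invariant on its own."""
    return v[0] ** 2 + 2.0 * v[1]


x = [1.0, -2.0, 0.5, 3.0]
n = len(x)

# Full augmentation: averaging over every shift gives an exactly invariant value.
full = group_average(f, x, range(n))
full_on_shifted_input = group_average(f, cyclic_shift(x, 1), range(n))

# Partial augmentation: a few randomly sampled shifts give only an approximation.
rng = random.Random(0)
partial = group_average(f, x, [rng.randrange(n) for _ in range(2)])
```

The full average returns the same value for any shifted copy of the input, while the partial average is a random estimate of it that tightens as more group elements are sampled.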
Natural Identifiers for Privacy and Data Audits in Large Language Models
This work introduces natural identifiers (NIDs) as a novel solution for scalable post-hoc privacy and data audits in large language models (LLMs). NIDs are structured random strings, such as cryptographic hashes and shortened URLs, naturally occurring in LLM training datasets. Their format enables the generation of unlimited additional random strings from the same distribution, serving as alternative canaries for differential privacy audits and as same-distribution held-out data for dataset inference. Evaluations demonstrate that NIDs facilitate post-hoc differential privacy auditing without retraining and enable dataset inference for suspect datasets containing NIDs, eliminating the need for private non-member held-out datasets.
natural identifiersdifferential privacydataset inferencelarge language modelspost-hoc audits
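The property the summary above relies on, that fresh strings can be drawn from the same distribution as identifiers already present in training data, is simple to illustrate for hash-like strings. The Python sketch below generates 64-character hexadecimal strings in the format of SHA-256 digests; treating uniformly random hex strings as stand-ins for real digests is an assumption, and the function names are hypothetical.

```python
import secrets

HEX_CHARS = set("0123456789abcdef")


def fresh_hash_like_identifier(num_bytes=32):
    """Return a random lowercase hex string; 32 bytes gives 64 hex characters,
    the same format as a SHA-256 digest."""
    return secrets.token_hex(num_bytes)


def looks_like_sha256_hex(s):
    """Format check only: 64 lowercase hexadecimal characters."""
    return len(s) == 64 and set(s) <= HEX_CHARS


# Held-out "non-member" strings that were never in any training set.
held_out = [fresh_hash_like_identifier() for _ in range(5)]
```

Strings generated this way are, with overwhelming probability, absent from any corpus, which is what makes them usable as held-out comparisons for identifiers that do appear in the data.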
RE4: Transformation-aware Imitation of Object Interactions Using Manipulation Modes
The RE4 framework introduces a transformation-aware imitation learning approach for object interaction tasks, combining interpretability with performance. It employs self-supervised pose estimation from demonstration data, followed by manipulation mode-aware retrieval, transformation, replanning, and rollout. Evaluated on Push-T and Robomimic benchmarks, RE4 demonstrates robustness in sparse data regions and low-data regimes, outperforming end-to-end methods like diffusion and flow-based variants. The framework leverages principled manipulation theories, preserving interpretability while achieving competitive results in both state-based and image-based settings.
imitation learningpose estimationmanipulation modesself-supervisionbenchmark evaluation
Parallel Manifold Steering: Efficient Adaptation of Large Associative Memories via Residual Energy Shaping
The paper introduces Hierarchical Residual Steering (H-Res), a method for efficient adaptation of large Transformer-based Dense Associative Memories (DAMs) without weight modification or sequence expansion. H-Res formulates adaptation as control on the activation manifold, learning a state-dependent vector field that steers token trajectories into task-specific attractor basins while preserving attention entropy and facilitating Neural Collapse. Experiments show 26% improvement over weight-modification methods in associative retrieval tasks and computational efficiency gains over prompt-based approaches, with scalability demonstrated on structured domains.
dense associative memoriesmanifold steeringattention entropyneural collapseresidual energy shaping
Open-Vocabulary BEV Segmentation with 3D-Aware Geometric Constraints
The paper introduces Open-Vocabulary BEV Segmentation (OVBS), a framework for bird's-eye-view perception that leverages vision-language models to recognize novel categories beyond the training set while maintaining real-time efficiency. The proposed OVBEVSeg method addresses 3D geometric inconsistencies in 2D-to-BEV projection through three stages: 2D-to-BEV pseudo-labeling, joint 2D-BEV optimization with structural constraints, and 3D geometric distillation. On nuScenes, OVBEVSeg achieves a 15.3 mIoU improvement on unseen categories over closed-set methods, matches semi-supervised baselines without novel-class labels, and reduces inference time by 2.5x with 0.22x memory usage compared to projection-based approaches.
open-vocabulary bev segmentation3d geometric constraintsgaussian splattingvision-language modelsnuscenes dataset
Managing Task Execution for Unknown Workloads in Batteryless IoT: A Hardware-Agnostic Evaluation
The paper introduces two hardware-agnostic dynamic scheduling strategies for batteryless IoT devices facing unpredictable workloads: a model-free Reinforcement Learning (RL) agent and an Approximated Prediction (AP) method. These are evaluated against adaptive task rate (AsTAR) and static thresholds using a physically accurate simulation framework with real-world solar data and LoRa transmission profiles. Results show AP achieves near-oracle task throughput, RL balances survival-execution trade-offs, and AsTAR excels in long energy gaps, while simpler static policies suffice for devices with larger energy buffers.
batteryless iotenergy-harvestingdynamic schedulingreinforcement learningtask throughput
PROTECT-90: A Fault Dataset for Power System Protection
The PROTECT-90 dataset introduces a standardized benchmark for high-voltage fault studies, addressing the lack of publicly available datasets for power system protection. Generated via electromagnetic transient (EMT) simulation, it comprises 9,022 short-circuit episodes on a 90 kV double-line topology, with domain randomization of grid operating points, line parameters, and fault conditions. Each episode includes synchronized three-phase voltage and current waveforms from eight measurement locations, accompanied by structured metadata detailing fault characteristics. The dataset ensures reproducibility through explicit documentation of modeling assumptions and generation procedures, enabling transparent evaluation of protection-oriented signal processing and learning-based methods.
electromagnetic transientshort-circuitdomain randomizationsignal processinghigh-voltage
Deep numerical schemes for systems of Ergodic BSDEs with applications to regime-switching forward utilities
The paper introduces two neural-network-based numerical schemes for solving systems of coupled ergodic Backward Stochastic Differential Equations (eBSDEs), focusing on optimal strategies in regime-switching forward utilities. The first method links eBSDE solutions to a multidimensional BSDE with random terminal time, using a locally additive deep learning scheme to minimize aggregated local errors. The second employs a Deep Galerkin Method (DGM) to minimize the residual of the associated ergodic PDE system. Numerical experiments validate the methods, highlighting the impact of regime switches on forward preferences.
ergodic bsdesforward utilitiesdeep galerkin methodregime-switchingstochastic factor model
MotifGen: Spatiotemporal interpolation of misaligned satellite images via multi-source generative modeling, in an application to tropical cyclones
MotifGen introduces a generative model for spatiotemporal interpolation of misaligned satellite images from multiple geospatial sources, addressing challenges in tropical cyclone monitoring. The model handles heterogeneous microwave data from varying instruments, irregular time intervals, and geographic misalignment, combining microwave and infrared observations. Training employs a self-supervised task where a random source is masked and reconstructed, achieving a significant reduction in Continuous Ranked Probability Score compared to supervised methods. The generative model produces an ensemble mean comparable to deterministic models, with a power spectrum closer to true observations, demonstrating improved interpolation accuracy.
generative modelspatiotemporal interpolationmicrowave imageryself-supervised learningtropical cyclones
Automated Residual Plot Assessment With the R Package autovi and the Shiny Application autovi.web
The authors introduce autovi, an R package with a Shiny application (autovi.web) for automated residual plot assessment in linear models. The system employs computer vision to replace manual evaluation, addressing scalability and consistency issues in visual diagnostics. Using the lineup protocol framework, it computes visual signal strength (VSS) metrics from residual samples to quantify model fit quality. The tool provides auxiliary diagnostic information while reducing human effort compared to traditional visual inspection methods.
residual plotslinear modelscomputer visionvisual signal strengthshiny application
Project Ariadne: Prompt-Conditioned Route Generation for Synthesis Planning
Project Ariadne introduces a decoder-only route generator for retrosynthetic planning, unifying target molecules, optional constraints, and routes into prompt-completion sequences. The method eliminates the need for separate models for different planning specifications. Evaluated on the RetroCast/PaRoutes mkt-cnv-160 benchmark, a 24-layer Ariadne checkpoint improves Solv-0 by 13.7 points for depth constraints and 31.2 points for required-leaf constraints compared to baseline methods. It outperforms DESP in required-leaf Top-10 and Solv-0 metrics with significantly reduced GPU-time (24 minutes vs. 6.8 hours). While comparable to DMS Explorer XL in standard reconstruction, Ariadne excels in route-holdout reconstruction, though AiZynthFinder MCTS remains stronger in some Solv-0 comparisons.
retrosynthetic planningdecoder-onlyprompt-completionroute generationsolv-0
BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks
BehaviorBench is a comprehensive benchmark that evaluates foundation models across four behavioral science capabilities: behavior prediction/simulation, strategic decision-making, subject-trait inference, and behavioral knowledge application. The benchmark assesses both individual-level accuracy and population-level alignment, which is crucial for behavioral validity. The authors also develop Be.FM-1.5, extending the Be.FM family of behavioral foundation models fine-tuned on behavioral data. Results show that proprietary general-purpose models excel at individual-level prediction, while Be.FM-1.5 achieves superior distributional alignment and remains competitive on individual metrics, demonstrating the effectiveness of behavioral adaptation. BehaviorBench establishes a foundation for developing behaviorally aligned AI systems.
behavioral science · foundation models · distributional alignment · subject-trait inference · strategic decision-making
Autonomous Video Generation with Counterfactual Controllability for Self-Evolving World Models
The paper critiques the assumption that video generation alone constitutes world modeling, arguing instead for counterfactual controllability as a key criterion. It proposes autonomous video generation with this capability to enable self-evolving world models, where generated futures can be tested against embodiment constraints and fed back into the system. The approach emphasizes not just predictive realism but the ability to reason about interventions and controllable variables in dynamic scenes.
counterfactual controllability · self-evolving world models · autonomous video generation · spatiotemporal modeling · embodiment constraints
AsyncOPD: How Stale Can On-Policy Distillation Be?
The paper presents AsyncOPD, the first systematic study of staleness in asynchronous on-policy distillation (OPD) for large language models, where teacher feedback uses local KL losses and finite teacher-score caches. It demonstrates that teacher-weighted forward KL is robust to stale rollouts, while student-weighted reverse KL is vulnerable, and proposes recomputing the reverse-KL signal as an effective surrogate. The analysis of finite teacher-score caches reveals a bias-variance tradeoff, addressed via multi-sample Monte Carlo estimation. AsyncOPD achieves 1.6× to 3.8× throughput gains over synchronous training while maintaining accuracy.
on-policy distillation · asynchronous training · kl divergence · teacher-score caches · monte carlo estimation
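To make the forward/reverse distinction in the AsyncOPD summary concrete, here is a minimal Python sketch. The two three-token distributions are made-up values, not taken from the paper, and mapping "forward KL" to KL(teacher || student) follows the usual distillation convention, which is assumed here.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a 3-token vocabulary.
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]

# Forward KL: the expectation is weighted by teacher probabilities.
forward_kl = kl(teacher, student)
# Reverse KL: the expectation is weighted by student probabilities,
# so it depends on how current the student's own distribution is.
reverse_kl = kl(student, teacher)

print(f"forward KL = {forward_kl:.4f}, reverse KL = {reverse_kl:.4f}")
```

The weighting is the only structural difference between the two terms, which is one way to read the summary's claim that the student-weighted variant is the one sensitive to stale rollouts.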
A Time-Reparameterized Cumulative Intensity Extrapolation Sampler for Discrete Flow Matching
The paper introduces the Time-Reparameterized Cumulative Intensity Extrapolation (TR-CIE) sampler for discrete flow matching (DFM), improving sampling quality when the number of function evaluations (NFE) is limited. TR-CIE combines schedule-based time reparameterization, which mitigates stiffness near terminal stages, with cumulative-intensity extrapolation, which uses cached model outputs for better stepwise approximations. Theoretical analysis bounds local approximation errors and establishes convergence. Experiments on synthetic tasks, text generation, and text-to-image benchmarks show enhanced sampling quality with one NFE per step, matching the computational cost of standard τ-leaping.
discrete flow matching · time reparameterization · cumulative intensity · sampling efficiency · markov chain
Uniform Sampling from High-dimensional Spectral Norm Balls
The paper introduces a method for uniform sampling of matrices from high-dimensional unit spectral norm balls, motivated by optimization challenges in machine learning. It proves that singular values of sampled matrices converge to 1 almost surely as dimensions increase, providing theoretical justification for a computationally efficient sampling approach. The method is particularly relevant for large-scale matrices akin to those in modern large language models. Experimental validation confirms both the convergence of singular values and the accuracy of the proposed sampling technique compared to exact methods.
spectral norm · singular values · uniform sampling · high-dimensional · optimization
Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning
The Holistic Data Scheduler (HDS) introduces a multi-objective reinforcement learning framework for adaptive data mixing during LLM pre-training. HDS formulates data scheduling as a continuous control problem using Soft Actor-Critic (SAC), with a novel reward function integrating data quality (data-driven), inter-domain influence (loss-driven), and model dynamics (weight norms). Evaluated on The Pile benchmark, HDS achieves target validation perplexity 44% faster than prior methods while improving MMLU 0-shot accuracy by 7.2%, demonstrating simultaneous gains in training efficiency and model capability.
online data mixing · soft actor-critic · multi-objective reinforcement learning · llm pre-training · data scheduler
When Top-1 Fails: Calibrating LoRA Monitors for Masked Diffusion LMs
The study demonstrates that top-1 argmax concentration fails as a collapse warning for LoRA fine-tuning of masked diffusion language models (DLMs), showing zero precision across 816 configurations. Instead, the authors propose using max LoRA gradient norm as a parameter-side signal, which achieves precision 0.68 and F1=0.79 on a held-out LLaDA-family split, outperforming the top-1 baseline. The method involves sampling gradient routing rather than token concentration and calibrating thresholds per DLM family. Results are bounded to short-horizon DLM-LoRA inspection, highlighting the need for family-specific calibration.
lora · diffusion language models · top-1 argmax · gradient norm · peft
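The LoRA-monitor summary reports precision and F1 but not recall. Since F1 is the harmonic mean of precision and recall, the recall can be backed out, as in the short Python sketch below. The inputs are the rounded figures from the summary, so the result is approximate.

```python
def implied_recall(precision, f1):
    """Invert F1 = 2PR / (P + R) to get R = F1 * P / (2P - F1)."""
    return f1 * precision / (2 * precision - f1)

# Rounded values quoted in the summary above.
recall = implied_recall(precision=0.68, f1=0.79)
print(round(recall, 2))
```

A recall this far above the precision would suggest the gradient-norm monitor flags most collapses at the cost of some false alarms, though that reading rests on rounded inputs.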
FedUP: One-Shot Federated Unlearning via Centroid-Guided Plug-in Filters
FedUP introduces a one-shot federated unlearning framework using centroid-guided plug-in filters to address the trade-off between non-target knowledge loss and request latency in federated systems. The method employs differentially private class centroids to train lightweight server-side filters while freezing original model parameters, eliminating multi-round communication and enabling sub-second unlearning. Experiments on image and text tasks demonstrate reduced non-target knowledge loss, improved unlearning precision, and inherent reversibility through filter removal.
federated unlearning · differential privacy · class centroids · pluggable filters · knowledge preservation
PORTER: Language-Grounded Event Representations for Portable Structured EHR Foundation Models
PORTER introduces a language-grounded EHR foundation model that decouples event representation from fixed vocabularies by encoding events via text descriptions (using a frozen text encoder) and numeric attributes (via a dedicated pathway), with clinical dynamics learned through autoregressive pretraining. Evaluated on 74 pediatric hospital tasks, PORTER matched fixed-vocabulary models (mean AUROC parity) while enabling zero-shot transfer to unseen descriptions (97.1% AUROC retention) and outperformed them on MIMIC (69% event drop avoided). Analyses revealed transfer relied on preserved patient-level geometry and numeric-aware representations, achieving 329× compute efficiency over text serialization baselines.
electronic health records · foundation models · language-grounded · zero-shot transfer · autoregressive pretraining
NeuroSonic: Conditional Flow Matching for EEG-to-Speech Reconstruction
NeuroSonic introduces a conditional flow-matching framework for EEG-to-speech reconstruction, addressing the challenge of mapping diffuse EEG signals to coherent acoustic trajectories. The method learns a deterministic probability-flow velocity field that transports noise-corrupted acoustic states toward clean speech under EEG conditioning, using a time-conditioned gated Transformer to parameterize the transport ODE. Evaluated on CineBrain and EAV benchmarks, NeuroSonic outperforms GAN-, diffusion-, and mean-flow baselines by up to 26.3% in perceptual quality, particularly in artifact-heavy segments, demonstrating stable EEG-driven reconstruction.
eeg-to-speech · conditional flow matching · probability-flow velocity field · gated transformer · deterministic transport
RoPE-Aware Bit Allocation for KV-Cache Quantization
Block-GTQ introduces a RoPE-aware bit allocation method for KV-cache quantization, addressing the position-dependent sensitivity of key-cache quantization under Rotary Position Embedding (RoPE). The method computes energy scores for each RoPE block and greedily allocates integer bit widths based on marginal gain, optimizing quantization fidelity. Evaluated on a ten-model diagnostic panel, Block-GTQ reduces per-layer MAE by 32-80% at 2 and 3 bits/dimension K-only quantization, outperforming uniform TurboQuant-MSE in all 367 layer comparisons. Downstream tasks show significant improvements: on Llama-3.1-8B-Instruct, Block-GTQ raises NIAH and LongBench-EN averages by 26.8 and 16.44 points respectively. Packed-cache implementation achieves 3.24x KV-cache compression with fp16-comparable quality and reduces peak memory usage by 64.7%.
rope-aware · kv-cache · bit allocation · quantization · turboquant-mse
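The greedy allocation step described for Block-GTQ can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the distortion model (each extra bit cuts a block's error by 4x), the `max_bits` cap, and the example energies are all hypothetical.

```python
import heapq

def greedy_bit_allocation(energies, total_bits, max_bits=8):
    """Give out bits one at a time to the block with the largest marginal gain.

    Assumed distortion model: block i with b bits has error energies[i] * 4**(-b),
    so the gain from its next bit is 0.75 * energies[i] * 4**(-b).
    """
    bits = [0] * len(energies)
    # heapq is a min-heap, so gains are stored negated.
    heap = [(-0.75 * e, i) for i, e in enumerate(energies)]
    heapq.heapify(heap)
    for _ in range(total_bits):
        if not heap:
            break
        neg_gain, i = heapq.heappop(heap)
        bits[i] += 1
        if bits[i] < max_bits:
            # The next bit for this block is worth a quarter of the last one.
            heapq.heappush(heap, (neg_gain / 4.0, i))
    return bits

# Hypothetical per-block energy scores; budget of 2 bits per block on average.
energies = [16.0, 4.0, 1.0, 1.0]
bits = greedy_bit_allocation(energies, total_bits=8)
print(bits)  # [4, 2, 1, 1]
```

High-energy blocks end up with more bits than a uniform 2-bit split would give them, which is the behavior the summary attributes to RoPE-aware allocation.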
Information-Theoretic Classifier-Free Guidance with Adaptive Schedule Optimization
The paper introduces an information-theoretic framework for optimizing classifier-free guidance (CFG) schedules in diffusion models, addressing the consistency-coverage trade-off during reverse trajectory generation. The method leverages a clean endpoint reference to specify desired trade-offs and optimizes the guided sampler's induced distribution toward this reference using trajectory-level formulas that avoid explicit density estimation. Experiments on ImageNet-512 (EDM-XXL) and COCO (SD-XL) demonstrate that learned schedules outperform constant guidance, achieving improved or competitive trade-offs by selectively allocating guidance across noise levels.
diffusion models · classifier-free guidance · information-theoretic optimization · reverse trajectory · consistency-coverage trade-off
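For readers unfamiliar with guidance schedules, the Python sketch below shows the standard classifier-free guidance combination with a weight that varies by noise level. The band-shaped schedule and its thresholds are hypothetical and hand-set; the paper learns its schedules, and its parameterization is not given in this summary.

```python
def guided_prediction(uncond, cond, w):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for w > 1, past) the conditional one."""
    return [u + w * (c - u) for u, c in zip(uncond, cond)]

def guidance_weight(sigma, sigma_lo=0.3, sigma_hi=2.0, w_on=3.0):
    """Hypothetical schedule: apply guidance only inside a band of noise levels;
    elsewhere fall back to the plain conditional prediction (w = 1)."""
    return w_on if sigma_lo <= sigma <= sigma_hi else 1.0

# Toy 2-dimensional predictions at one sampling step.
uncond, cond = [0.0, 1.0], [1.0, 1.0]
for sigma in (5.0, 1.0, 0.1):
    w = guidance_weight(sigma)
    print(sigma, w, guided_prediction(uncond, cond, w))
```

A constant schedule is the special case where `guidance_weight` returns the same value for every `sigma`.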
You Don't Need to Run Every Eval
The authors demonstrate that model evaluation across multiple benchmarks can be significantly reduced by leveraging low-rank structure in performance matrices. They compile a public score matrix of 84 frontier models across 133 benchmarks (2,604 of the 11,172 possible cells are filled, or 23.3%) and find it is approximately rank-2, with two components explaining over 90% of performance variation. They introduce BenchPress, a logit-space rank-2 matrix completion method that predicts held-out scores within 4.6 points, with a confidence layer indicating prediction reliability. BenchPress identifies two five-benchmark subsets: {GPQA-D, HLE, Codeforces, MMLU-Pro, ARC-AGI-1} predicts full scorecards within 3.93 points, and {GPQA-D, MMLU-Pro, Aider Polyglot, MATH-500, AIME 2026} within 4.55 points.
benchmark · matrix completion · logit-space · low-rank · evaluation
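The idea of completing a sparse score matrix in logit space can be sketched with a small alternating-least-squares routine in plain Python. This is a generic low-rank completion sketch, not BenchPress itself: the ridge penalty `lam`, the iteration count, the random initialization, and the toy scores are all assumptions.

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ridge_solve(features, targets, rank, lam):
    """Solve (F^T F + lam*I) w = F^T t by Gaussian elimination (rank is tiny)."""
    a = [[lam if r == c else 0.0 for c in range(rank)] for r in range(rank)]
    b = [0.0] * rank
    for f, t in zip(features, targets):
        for r in range(rank):
            b[r] += f[r] * t
            for c in range(rank):
                a[r][c] += f[r] * f[c]
    for col in range(rank):                      # forward elimination
        for r in range(col + 1, rank):
            factor = a[r][col] / a[col][col]
            for c in range(col, rank):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    w = [0.0] * rank
    for r in reversed(range(rank)):              # back substitution
        tail = sum(a[r][c] * w[c] for c in range(r + 1, rank))
        w[r] = (b[r] - tail) / a[r][r]
    return w

def fit_low_rank(scores, rank=2, lam=0.1, iters=50, seed=0):
    """scores: {(model, benchmark): fraction in (0, 1)} for observed cells only."""
    rng = random.Random(seed)
    z = {key: logit(value) for key, value in scores.items()}
    models = sorted({m for m, _ in z})
    benches = sorted({b for _, b in z})
    u = {m: [rng.gauss(0.0, 0.1) for _ in range(rank)] for m in models}
    v = {b: [rng.gauss(0.0, 0.1) for _ in range(rank)] for b in benches}
    for _ in range(iters):                       # alternating least squares
        for m in models:
            cells = [(v[b], t) for (mm, b), t in z.items() if mm == m]
            u[m] = ridge_solve([f for f, _ in cells], [t for _, t in cells], rank, lam)
        for b in benches:
            cells = [(u[m], t) for (m, bb), t in z.items() if bb == b]
            v[b] = ridge_solve([f for f, _ in cells], [t for _, t in cells], rank, lam)

    def predict(model, bench):
        return sigmoid(sum(x * y for x, y in zip(u[model], v[bench])))
    return predict

# Toy data with hypothetical names; the (model_c, bench_z) cell is held out.
observed = {
    ("model_a", "bench_x"): 0.80, ("model_a", "bench_y"): 0.60, ("model_a", "bench_z"): 0.70,
    ("model_b", "bench_x"): 0.60, ("model_b", "bench_y"): 0.40, ("model_b", "bench_z"): 0.50,
    ("model_c", "bench_x"): 0.90, ("model_c", "bench_y"): 0.75,
}
predict = fit_low_rank(observed)
estimate = predict("model_c", "bench_z")
print(f"predicted held-out score: {estimate:.2f}")
```

Working in logit space keeps every prediction inside (0, 1) after the sigmoid, which a factorization of raw percentages would not guarantee.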
Low-rank Updates in Slowly Time-varying Graphs for Spatial-Temporal Signal Interpolation
The paper proposes a joint optimization method for spatial-temporal signal interpolation in slowly time-varying graphs, modeling graph changes between consecutive time steps as low-rank matrix updates. Given an initial adjacency matrix, the approach alternates between signal interpolation via linear system solving and graph update via proximal gradient descent with a fast OMP-based rank approximation. The algorithm is unrolled into a lightweight neural network for parameter tuning. Experiments demonstrate superior interpolation accuracy compared to existing time-varying graph models.
graph signal processing · low-rank approximation · proximal gradient descent · orthogonal matching pursuit · spatial-temporal interpolation
Cyclic Denoising Reveals Ultrastable Memories in Diffusion Models
The paper introduces cyclic denoising, a physics-inspired extraction attack that reveals ultrastable memorized training images in diffusion models through repeated forward and reverse diffusion at controlled noise amplitudes. The method requires only sampler-level control, operating without gradients, weight inspection, or prompts, and demonstrates consistent behavior in both Stable Diffusion v1.4 and pixel-space DDPM. Results show hierarchical attractor basins containing memorized content (e.g., stock photos, watermarks), with ultrastable attractors persisting through thousands of cycles, revealing a yielding-like transition between trivial fixed points and structured memorized basins.
cyclic denoising · diffusion models · memorization auditing · attractor basins · extraction attack
A Comparative Study of Bayesian Contextual Bandits for Real-Time Warehouse Sorter Optimization
This study compares three hybrid machine learning frameworks for real-time sorter diversion optimization in warehouse automation: Linear Regression with Gradient Descent Optimization (LR+GDO), XGBoost with Bayesian Optimization (XGB+BO), and Bayesian Contextual Bandits (BCB). Using a physics-aware emulator, the authors evaluate predictive accuracy, sensitivity, and reward uplift. BCB outperforms the other two with a 2.03% reward uplift over baselines, exhibiting decisive time-optimal policies, continuous online learning, and low inference latency. Results highlight BCB's suitability for dynamic warehouse environments.
bayesian contextual bandits · gradient descent optimization · xgboost · real-time control · warehouse automation
3D Masked Autoencoders are Robust Learners of Volumetric and Multimodal Cellular Representations for Microscopy
The study demonstrates that 3D masked autoencoders (MAE-3D) outperform 2D counterparts in volumetric microscopy data analysis, achieving superior performance on single-cell tasks. The method employs matched architectures and training protocols, incorporating channel cross-attention and frequency-domain regularization to leverage 3D spatial context. Cross-modal alignment with ESM2, a protein language model, further enhances volumetric representations. Results show MAE-3D achieves ROC-AUC of 0.865 on protein-protein interaction tasks (+0.025 over prior methods) and state-of-the-art AUC$_{\text{micro}}$ (0.952) and F1$_{\text{micro}}$ (0.742) for protein localization.
masked autoencoders · volumetric microscopy · protein language model · cross-modal supervision · frequency-domain regularization
Forget Without Compromise: Nexus Sampling for Streaming KV-Cache Eviction Under Fixed Budgets
The paper introduces Nexus Sampling, a training-free KV-cache eviction method for streaming LLM inference under fixed memory budgets. The method combines Nexus scoring, an iterative attention walk that identifies bridge tokens, with weighted reservoir sampling, which retains tokens probabilistically instead of by deterministic top-K selection. Theoretical analysis shows superior long-run survival of subtly important tokens compared to top-K baselines. Experiments show accuracy within 1% of dense attention on LongBench at 80% eviction, while outperforming baselines by up to 10x in memory efficiency on retrieval tasks.
kv-cache eviction · nexus sampling · weighted reservoir sampling · long-context llms · streaming inference
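Weighted reservoir sampling, the probabilistic retention step named in the Nexus Sampling summary, has a well-known one-pass form (the Efraimidis-Spirakis key trick). The Python sketch below shows that generic algorithm; the token weights stand in for Nexus scores and are hypothetical, and the paper's exact variant may differ.

```python
import heapq
import random

def weighted_reservoir(items, k, rng):
    """Keep k items from a stream of (token_id, weight > 0) pairs.

    Each item draws key = u ** (1 / weight) with u uniform in [0, 1);
    the k largest keys are retained, so heavier items survive more often
    but lighter items are never excluded outright.
    """
    heap = []  # min-heap of (key, token_id); heap[0] is the weakest survivor
    for token_id, weight in items:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, token_id))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, token_id))
    return [token_id for _, token_id in heap]

# Hypothetical stream: token 0 has a very large score, the rest are equal.
stream = [(0, 1e6)] + [(i, 1.0) for i in range(1, 10)]
kept = weighted_reservoir(stream, k=3, rng=random.Random(0))
print(sorted(kept))
```

Unlike top-K, two runs with different seeds can keep different low-weight tokens, which is the property the summary credits for the survival of subtly important tokens.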
Does My Embedding Reflect That $A = B$? Evaluating Mathematical Equivalence in Embedding Models
The authors propose a contrastive learning approach to improve embedding models' ability to capture mathematical equivalence across diverse formulations, addressing the limitation that current state-of-the-art models group statements by terminology rather than underlying mathematical content. The method focuses on aligning informal mathematical statements with their formal counterparts, leveraging the Mathematically Equivalent but Lexically Different Pairs (MELD) dataset, which contains natural language statements expressing identical mathematical concepts in different terminology. Experiments demonstrate that the approach improves performance on both informal-formal retrieval tasks and the MELD benchmark, indicating enhanced capability in recognizing mathematical equivalence across varied formulations.
mathematical equivalence · embedding models · contrastive learning · meld dataset · informal-formal retrieval
Learning the Koopman Operator using Attention Free Transformers
The paper introduces two components to enhance the robustness of Koopman operator learning for long-horizon predictions. First, an attention-free latent memory (AFT) block aggregates past latents to correct predictions, operating in linear time with ≈30k parameters. Second, dynamic re-encoding uses lightweight change-point triggers to detect latent drift and project predictions back onto the autoencoder manifold. Evaluated on the Duffing oscillator, Repressilator, and IRMA systems, the Koopman+AFT model reduces error accumulation compared to Koopman autoencoders, GRUs, and Transformers, achieving lower long-horizon error and inference latency over up to 1000 steps.
koopman operator · attention-free transformer · dynamic re-encoding · latent drift · error accumulation
Stochastic Expectation Maximization for Robust State-Space Radio Interferometric Imaging
The authors propose a robust state-space estimation method for radio interferometric imaging under compound-Gaussian noise, addressing limitations of Gaussian assumptions in interference-prone environments. Their approach employs a Stochastic Approximation Expectation–Maximization (SAEM) algorithm with Monte Carlo Gibbs sampling for latent states and noise texture, enabling tractable inference despite heavy-tailed likelihoods. Experiments demonstrate superior reconstruction fidelity and RFI robustness compared to Gaussian EM and oracle RTS smoothers, validating the benefits of heavy-tailed modeling and SAEM inference in interference-dominated scenarios.
state-space models · compound-gaussian noise · stochastic approximation em · gibbs sampling · radio interferometry
DREG: A Layer-Wise Jacobian Regularization as a General-Purpose Penalty
The paper introduces Derivative Regularization (DREG), a layer-wise Jacobian regularization penalty, and evaluates its efficacy through a large-scale empirical study. The study conducts 960 experiments across 4 activations, 6 regularizers, 8 datasets, and 5 random seeds, analyzing DREG's performance in accuracy, noise robustness, and data scarcity. Results show DREG achieves the highest overall and clean-regime accuracy, particularly excelling with GELU activation in messy vision and NLP benchmarks. It performs best under data scarcity, acting as a geometric inductive bias. DREG requires no per-dataset tuning, using a fixed hyperparameter λ=10^-2.5, demonstrating its plug-and-play applicability for neural networks with significant Jacobian structure.
jacobian regularization · gelu activation · geometric inductive bias · noise robustness · data scarcity
Prediction of Viscoelastic Droplet Impact Dynamics Using a Vision Transformer-Based Approach
The paper proposes a Video Vision Transformer (ViViT) to predict viscoelastic droplet impact dynamics from the initial 10-20% of a Volume of Fluid (VOF) simulation, reducing computational cost by 80-90% versus running the full simulation. The method extends traditional Newtonian parameters (Reynolds number Re, Weber number We) with viscoelastic terms (solvent viscosity ratio β, Weissenberg number Wi), capturing both spreading and bouncing regimes while preserving geometric features. Results demonstrate physically consistent predictions across parameters and time horizons, with potential for experimental data integration via volume fraction field extraction from videos.
viscoelastic fluids · vision transformer · volume of fluid · droplet impact · weissenberg number
Constrained Variable Projection for Structured Problems
The authors propose a constrained variable-projection framework for structured data-science problems, extending classical variable projection to handle convex constraints and lower-level least-squares variables. By interpreting variable projection as a collapsed bilevel optimization problem, they derive exact reduced-gradient formulas compatible with automatic differentiation and introduce a conditional-gradient algorithm for the constrained reduced problem. Convergence is guaranteed under standard smoothness and compactness assumptions. Numerical experiments on sparse autoencoding, dictionary learning, blind deconvolution, and few-shot learning demonstrate improved wall-clock efficiency and data efficiency compared to joint-optimization baselines.
variable projection · bilevel optimization · automatic differentiation · conditional-gradient algorithm · least-squares
Flow-Corrected Thompson Sampling for Non-Stationary Contextual Bandits
Flow-Corrected Thompson Sampling (fcTS) is introduced for non-stationary linear contextual bandits, addressing reward model drift by transporting past rewards to the present using an explicit drift model and incorporating them with confidence weights. The method specializes in linear parameter drift, periodic variation, and recurring regime switches, maintaining closed-form posterior updates under a linear Gaussian model. Evaluated across five case studies and a semi-synthetic portfolio-selection benchmark, fcTS outperforms forgetting-based baselines, particularly in settings with recurring temporal structure, demonstrating superior sample efficiency by correcting and reweighting historical observations.
non-stationary bandits · reward drift · bayesian method · posterior updates · sample efficiency
KLip-PPO: A per-sample KL perspective on PPO-Clip
The paper demonstrates that Proximal Policy Optimization's (PPO) clipped surrogate gradient is mathematically equivalent to a Kullback-Leibler (KL) penalty with per-sample coefficients, derived from the importance ratio and advantage. This reformulation holds throughout minibatch updates and inner-loop optimization, revealing PPO-Clip's implicit step-function penalty at trust region boundaries. Empirical validation on five MuJoCo continuous-control benchmarks shows identical training curves between the two formulations. The analysis suggests new algorithmic design directions by exposing the structural properties hidden by PPO-Clip's min notation.
proximal policy optimization · kullback-leibler divergence · importance ratio · trust region · continuous-control
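The "step-function" behavior mentioned in the KLip-PPO summary is visible in the standard PPO-Clip objective itself. The Python sketch below computes the per-sample clipped surrogate and its derivative with respect to the importance ratio; it illustrates only the well-known clipping rule, not the paper's per-sample KL coefficients, whose exact form is not given here.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO-Clip per-sample surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

def ppo_clip_grad_wrt_ratio(ratio, advantage, eps=0.2):
    """Derivative of the surrogate with respect to the ratio.

    It is either the full advantage or exactly zero: the gradient switches off
    once the ratio has moved past the clip boundary in the direction the
    advantage favors.
    """
    if advantage > 0 and ratio > 1.0 + eps:
        return 0.0
    if advantage < 0 and ratio < 1.0 - eps:
        return 0.0
    return advantage

for ratio, adv in [(1.1, 1.0), (1.3, 1.0), (0.7, -1.0), (0.7, 1.0)]:
    print(ratio, adv, ppo_clip_objective(ratio, adv), ppo_clip_grad_wrt_ratio(ratio, adv))
```

Note the asymmetry in the last case: a ratio below the lower boundary with a positive advantage is not clipped, because the `min` selects the unclipped term.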
Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs
The study audits eight automatic attribution metrics for retrieval-augmented LLM generation across three evaluation constructs and finds that no scorer transfers reliably between datasets. The metrics include lexical, embedding, and BERTScore baselines, plus entailment/grounding-trained models, tested on multi-dataset human-labeled benchmarks (AttributionBench, HAGRID). Metric rankings invert between datasets (Kendall tau = -0.64, p = 0.031), with NLI scorers collapsing from AUROC 0.90 to 0.53 on long-form question answering (LFQA), where BERTScore scores highest (0.91). LLM judges avoid collapses but are costly and non-deterministic, shifting validation burdens.
retrieval-augmented generation · attribution metrics · bertscore · natural language inference · auroc
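To give a feel for what a Kendall tau of -0.64 means for eight metrics, the Python sketch below computes tau (without tie correction) for two rankings. The rank vectors are invented for illustration; they are chosen so that 5 of the 28 metric pairs agree in order and 23 disagree, which is one tie-free configuration consistent with the reported value.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs, no tie correction."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical ranks of eight metrics on two datasets (1 = best).
rank_on_dataset_a = [1, 2, 3, 4, 5, 6, 7, 8]
rank_on_dataset_b = [6, 7, 8, 4, 5, 3, 1, 2]

tau = kendall_tau(rank_on_dataset_a, rank_on_dataset_b)
print(round(tau, 2))  # -0.64
```

In this toy case the three best metrics on one dataset are the three worst on the other, which is the kind of inversion the summary describes.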
Closing the Loop: Formally Verified Law as a Reward Signal for Self-Improving Legal AI
The article introduces a novel architecture for training legal AI systems using formally verifiable reward signals, adapting the 'LLM proposes, verifier disposes' paradigm from mathematical AI to legal domains. The architecture integrates LLM-driven autoformalization into a formal legal calculus extending Catala, a verification kernel, and explanation generation based on formal proof traces. It ensures provable correctness for computational legal components and structural guarantees for open-textured legal analysis. The system was validated on procedural deadline calculations in German law, Commerce Clause analysis in U.S. constitutional law, and cross-jurisdictional sanction proportionality. The architecture also provides a deterministic external verifier, closing the reinforcement-learning loop gap in legal AI training.
autoformalization · formal legal calculus · verification kernel · proof traces · reinforcement-learning loop
GRACE: Gated Refinement for Accurate Causal Edge Discovery in High-Dimensional Time Series
GRACE introduces gated refinement for causal edge discovery in high-dimensional time series, combining constraint-based methods with Hard Concrete gates and $L_0$ regularization to robustly binarize edge scores. The method first generates high-recall candidate edges via fast linear CI tests, then prunes false positives using a gated model that adapts regularization to problem dimensions. Evaluated on synthetic benchmarks (up to $d=100$) and a river flow dataset, GRACE improves F1 over base CI methods (75× faster than nonlinear CI tests) and reduces false positives by 99%, achieving $F_1 = 0.86$ and AUROC $= 0.99$ on real-world data.
causal discovery · time series · l0 regularization · hard concrete gates · constraint-based methods
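The Hard Concrete gate that GRACE uses to binarize edge scores has a compact standard form: a stretched, clipped sigmoid of logistic noise. The Python sketch below follows that generic formulation with commonly used default constants; how GRACE sets or adapts these values is not stated in the summary, so treat the constants as assumptions.

```python
import math
import random

# Commonly used Hard Concrete constants (assumed here, not taken from GRACE).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_gate(log_alpha, rng):
    """Draw one gate value in [0, 1]; exact 0 and 1 occur with positive probability."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA   # stretch to (GAMMA, ZETA)
    return min(1.0, max(0.0, stretched))     # clip back to [0, 1]

def prob_nonzero(log_alpha):
    """Probability that the gate is open; summing this over edges gives the
    differentiable expected-L0 penalty."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))

rng = random.Random(0)
for log_alpha in (-5.0, 0.0, 5.0):
    gates = [sample_gate(log_alpha, rng) for _ in range(1000)]
    closed = sum(g == 0.0 for g in gates) / len(gates)
    print(log_alpha, round(prob_nonzero(log_alpha), 3), closed)
```

Edges with strongly negative `log_alpha` are pruned to exactly zero most of the time, which is how the gate turns soft scores into a sparse graph.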
Federated Survival Analysis in Healthcare: A Multi-Model Evaluation on Cross-Institutional Heterogeneous Breast Cancer Data
This paper evaluates federated learning (FL) for survival analysis on heterogeneous breast cancer data across institutions, comparing three models (Cox Proportional Hazards, DeepSurv, Random Survival Forest) and three FL optimization strategies (FedAvg, FedProx, FedAdam). FL outperforms local training and approaches centralized performance, with Random Survival Forest showing the best balance of discrimination, calibration, and robustness. Performance depends on client distribution diversity, with FedAvg and FedProx proving more stable than FedAdam. The study provides guidelines for model and training paradigm selection in federated survival modeling based on data, privacy, and resource constraints.
federated learning · survival analysis · heterogeneous data · random survival forest · fedavg
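Of the three federated strategies compared in the survival-analysis study, FedAvg is the simplest: the server averages client parameters weighted by local dataset size. A minimal Python sketch of that aggregation step follows; the parameter vectors and client sizes are hypothetical, and local training, FedProx's proximal term, and FedAdam's server optimizer are omitted.

```python
def fedavg(client_params, client_sizes):
    """Average parameter vectors, weighting each client by its sample count."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[d] * n for params, n in zip(client_params, client_sizes)) / total
        for d in range(dim)
    ]

# Two hypothetical institutions with 1 and 3 local samples respectively.
global_params = fedavg([[1.0, 0.0], [3.0, 2.0]], client_sizes=[1, 3])
print(global_params)  # [2.5, 1.5]
```

Because the weights follow client size, a large institution with an unusual patient population can dominate the average, which is one reason heterogeneity matters in the comparison above.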
Exact Schur-Sylvester Dimensionality Reductions for Non-Smooth Stochastic Complexity and Manifold Sampling
(No summary returned.)
Sesame: Structure-Aware Molecular Generation via Spatial Density-Map Conditioning
The paper introduces Sesame, a diffusion-based molecular generation model that employs a spatial pairformer module to condition on partial molecular structures and protein pockets via continuous spatial density maps. This approach enables both de novo generation and fragment-conditioned lead optimization, allowing medicinal chemists to grow molecular scaffolds. The model features a novel diffusion framework for joint denoising of atom types, bond types, and positions, along with a trajectory finetuning scheme to enhance generation quality. Sesame is trained on extensive ligand-only and protein-ligand datasets, demonstrating its versatility in computational drug design.
sesame · spatial pairformer · diffusion framework · lead optimization · trajectory finetuning
Machine Learning Modeling for Real-Time Melt Pool Monitoring in Laser Powder Bed Fusion Additive Manufacturing: A Hybrid Approach
The study presents a hybrid machine learning approach for real-time melt pool monitoring in laser powder bed fusion additive manufacturing, combining transfer learning with ensemble methods. Using a balanced dataset of 1,200 images of nickel superalloy 625 from the NIST AMMT platform, the authors benchmark three CNN architectures (ResNet50, EfficientNetB0, MobileNetV2) against Random Forest variants, evaluating performance metrics and computational efficiency. The hybrid EfficientNetB0-plus-Random Forest model achieves 0.9458 accuracy, 0.9451 F1 score, and 0.9904 AUC with 1.15 ms inference latency, outperforming pure deep learning approaches in both accuracy and speed for this data-limited industrial application.
melt pool monitoring · transfer learning · random forest · laser powder bed fusion · real-time inference
The Degeneracy Distillery
The Degeneracy Distillery introduces a method for (1) detecting and (2) resolving degenerate parameter combinations automatically and symbolically from parameter-data pairs, using Fisher information matrix estimation and flattening. By analyzing the information geometry of the likelihood, it identifies degeneracies as intrinsic model properties without requiring observed data. The approach discovers symbolic coordinate transformations that isolate independent parameter effects, globally flattening Fisher information (unlike point-specific posterior methods). Experiments show up to 10× fewer simulations needed for neural posterior estimation while improving validation calibration and providing physical insights.
degeneracy · fisher information · information geometry · neural posterior estimation · symbolic transformation
Reconstructing GRACE Terrestrial Water Storage with Spatio-Temporal Graph Neural Networks: An Application to South America
The study presents a spatio-temporal graph neural network (MTGNN) for reconstructing GRACE terrestrial water storage anomalies (TWSA) from 1940–2002 using ERA5 meteorological data, addressing the short observational record of GRACE/GRACE-FO missions. The model employs a hybrid adjacency matrix combining geodesic proximity and lagged climatic correlations to capture local and teleconnected hydrological processes. Results show 0.69 grid-cell and 0.94 basin-mean Pearson correlation, outperforming baselines (GTWS-MLrec, RM-REC, GRAiCE) in efficiency (50–90% fewer predictors) while matching basin-scale accuracy (±0.025 correlation). The model successfully reproduces El Niño/La Niña signatures but reveals arid-region weaknesses common across approaches.
terrestrial water storage · graph neural network · grace-fo · era5 · teleconnections
Hessian-augmented Supervised Learning for Hamilton-Jacobi-Bellman PDEs
A Hessian-augmented supervised learning method is proposed for approximating value functions in deterministic optimal control problems with nonlinear control-affine dynamics. The approach solves the Pontryagin Maximum Principle optimality system from multiple initial conditions to generate training data comprising values, gradients, and Hessians of the value function, with Hessians obtained via a matrix Riccati equation. These quantities augment a weighted least-squares regression over sparse polynomial bases on hyperbolic cross index sets, reducing sample complexity significantly compared to value-only regression. Validation on high-dimensional problems demonstrates improved approximation accuracy and closed-loop performance, with up to an order-of-magnitude reduction in required training samples relative to lower-order methods.
hessian-augmented · pontryagin maximum principle · matrix riccati equation · hyperbolic cross index sets · nonlinear control-affine dynamics
One Ruler: A Same-Hands Re-Evaluation of Bivariate Causal Direction on Tuebingen, with a Parameter-Free Compression Baseline
The paper introduces a standardized re-evaluation framework for causal discovery methods on the Tuebingen benchmark, enforcing uniform protocols (102 pairs, no tuning, forced decisions) to address inconsistent reporting practices. It proposes a parameter-free baseline using bz2 compression on quantized, sorted, first-differenced data. Results show the baseline achieves 74.7% weighted accuracy (p = 3.7e-7), statistically tied with SLOPE (77.2%) under identical conditions, while exposing inflated literature claims due to model selection and significance-gated abstention. Additional findings include compression scores as confounding detectors (p = 2.8e-68) and a failed falsification test bounding method interpretability.
causal discovery · benchmark standardization · parameter-free baseline · compression-based inference · significance-gated abstention
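The compression baseline in the One Ruler summary is described only as bz2 applied to quantized, sorted, first-differenced data. The Python sketch below is one plausible reading of that pipeline, not the paper's procedure: the 256-level quantization, the choice to sort by the candidate cause, and the toy data are all assumptions, and the paper's scoring and decision rule are not given here.

```python
import bz2

LEVELS = 256  # assumed quantization depth, chosen so each value fits in one byte

def quantize(values):
    """Map values linearly onto integers 0..LEVELS-1."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [int((v - lo) / span * (LEVELS - 1)) for v in values]

def compressed_size(cause, effect):
    """bz2 size in bytes of the effect series after sorting by the candidate
    cause, quantizing, and taking first differences (mod 256)."""
    order = sorted(range(len(cause)), key=lambda i: cause[i])
    q = quantize([effect[i] for i in order])
    diffs = [(b - a) % 256 for a, b in zip(q, q[1:])]
    return len(bz2.compress(bytes(diffs)))

# Toy pair with hypothetical values: y is a smooth function of x.
x = [((i * 37) % 200) / 10.0 for i in range(200)]
y = [v * v for v in x]

size_x_to_y = compressed_size(x, y)
size_y_to_x = compressed_size(y, x)
print(size_x_to_y, size_y_to_x)
```

Under this reading, a forced decision could compare the two sizes and pick the direction that compresses better, but that rule is an assumption rather than something the summary states.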
Machine Learning and Deep Learning for Exoplanet Detection and Atmospheric Characterization with JWST and the Upcoming Ariel Mission
The review synthesizes ML/DL advances in exoplanet detection and atmospheric characterization for the JWST and Ariel missions. It covers classical algorithms (Random Forests, CNNs), modern architectures (Transformers, RNNs), and simulation-based inference (Neural Posterior Estimation, Flow Matching). Benchmarking includes the Ariel ML Challenges and JWST case studies such as WASP-39b. Results show DL outperforming traditional pipelines in speed and accuracy, with ML-driven retrievals reducing inference time by 3-8x. Open challenges include interpretability, uncertainty calibration, and cross-instrument generalization, and the review lays out a roadmap extending to Ariel's 2029 launch.
exoplanet detection · atmospheric characterization · neural posterior estimation · normalizing flows · transformer architectures
SkyJEPA: Learning Long-Horizon World Models for Zero-Shot Sim-to-Real Control of Quadrotors
The paper introduces SkyJEPA, a Joint Embedding Predictive Architecture (JEPA) for long-horizon dynamics modeling in quadrotor control. The method combines latent-space dynamics with a physics-inspired prober for interpretable state prediction, enabling robust sim-to-real transfer. A sampling-based optimal control framework leverages the model for real-time embedded deployment. Automated dataset generation reduces reliance on real-world data. Experiments demonstrate accurate long-horizon prediction, zero-shot transfer, and generalization across diverse conditions.
joint embedding predictive architecture · latent-space dynamics · sim-to-real transfer · sampling-based optimal control · quadrotor
Exploring Dualistic Meta-Learning to Enhance Domain Generalization in Open Set Scenarios
The paper proposes MEDIC, a dualistic meta-learning strategy for open set domain generalization that jointly optimizes inter-domain and inter-class task splits. The method addresses label mismatch between source and target domains by balancing gradient matching across both domain and class boundaries, mitigating over-rejection of out-of-distribution data. Experiments demonstrate MEDIC's superiority in open set scenarios while maintaining competitive closed set generalization performance.
domain generalization · open set recognition · meta-learning · gradient matching · out-of-distribution detection
Synergizing Physically Constrained MCMC and Chemical-Informed Gaussian Processes for Reaction Network Discovery
The paper introduces PC-MCMC-CIGP, a gray-box workflow combining spike-and-slab MCMC topology sampling with Chemical-Informed Gaussian Processes (CIGP) for reaction network discovery. The method enforces physical constraints (conservation laws, thermodynamics) and uses uncertainty-aware acquisition for parameter calibration. On the H2 + Br2 benchmark, it identifies elementary radical pathways, while CIGP optimization improves styrene epoxidation yield by 12.5% over GP-BO baselines. Acquisition studies show that PC-EI reduces low-yield suggestions, with EI criteria achieving the best final-yield performance.
spike-and-slab · gaussian processes · mcmc · reaction network · uncertainty-aware
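For readers unfamiliar with the EI criterion named above, the following sketch computes the textbook expected improvement for maximization under a Gaussian predictive distribution. It is the standard formula, not the paper's physically constrained PC-EI variant, and the candidate conditions and their predicted yields are hypothetical numbers.

```python
from math import erf, exp, pi, sqrt


def expected_improvement(mean: float, std: float, best: float) -> float:
    """Standard EI for maximization: E[max(Y - best, 0)] with Y ~ N(mean, std^2)."""
    if std <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(mean - best, 0.0)
    z = (mean - best) / std
    pdf = exp(-0.5 * z * z) / sqrt(2.0 * pi)   # standard normal density at z
    cdf = 0.5 * (1.0 + erf(z / sqrt(2.0)))     # standard normal CDF at z
    return (mean - best) * cdf + std * pdf


if __name__ == "__main__":
    best_yield = 0.60  # hypothetical best yield observed so far
    # Hypothetical candidates: (label, predicted mean yield, predictive std).
    candidates = [("A", 0.58, 0.10), ("B", 0.62, 0.01), ("C", 0.50, 0.20)]
    scored = [(expected_improvement(m, s, best_yield), name) for name, m, s in candidates]
    print(max(scored))
```

The second term rewards uncertain candidates, which is what makes the acquisition uncertainty-aware rather than purely greedy.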
EnerInfer: Energy-Aware On-Device LLM Inference
EnerInfer introduces an energy-aware on-device LLM inference framework that jointly optimizes energy efficiency, throughput, and thermal comfort. It replaces per-model profiling with disaggregated, model-structure-aware prediction and ranking-driven online feedback, predicting throughput and power for unseen LLMs across NPU/DDR frequency settings. The framework dynamically selects QoE-satisfying configurations under runtime interference and uses lightweight limited-horizon thermal prediction to switch between energy-optimized and thermally constrained inference. Evaluations demonstrate energy efficiency improvements of up to 65%, 12%, and 24% on phones, laptops, and development boards, respectively, without QoE violation.
llm inference · energy efficiency · thermal comfort · npu frequency · qoe
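The configuration-selection step can be pictured roughly as below: among candidate NPU/DDR frequency settings, keep those whose predicted throughput meets the QoE target and whose predicted temperature stays under a limit, then pick the one with the lowest energy per token. The `Config` fields, the thresholds, and all numbers are hypothetical. The summary does not describe the framework's actual predictor or selection rule, so this is only a sketch of the idea.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Config:
    npu_mhz: int
    ddr_mhz: int
    tokens_per_s: float   # predicted throughput (hypothetical values)
    watts: float          # predicted power draw
    peak_temp_c: float    # predicted peak temperature over a short horizon


def joules_per_token(c: Config) -> float:
    """Energy per generated token: power divided by throughput."""
    return c.watts / c.tokens_per_s


def select(configs: Sequence[Config], min_tokens_per_s: float,
           max_temp_c: float) -> Optional[Config]:
    """Lowest energy-per-token config that meets the QoE and thermal limits."""
    feasible = [
        c for c in configs
        if c.tokens_per_s >= min_tokens_per_s and c.peak_temp_c <= max_temp_c
    ]
    return min(feasible, key=joules_per_token) if feasible else None


CANDIDATES = [
    Config(600, 1600, tokens_per_s=8.0, watts=2.0, peak_temp_c=38.0),
    Config(900, 2133, tokens_per_s=14.0, watts=4.2, peak_temp_c=41.0),
    Config(1200, 3200, tokens_per_s=20.0, watts=8.0, peak_temp_c=47.0),
]

if __name__ == "__main__":
    print(select(CANDIDATES, min_tokens_per_s=10.0, max_temp_c=45.0))
```

Raising the throughput target or lowering the temperature limit shrinks the feasible set, which is the trade-off between energy-optimized and thermally constrained operation described above.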
Scalable Physics-Inspired Transformers for Spin Glasses
The authors introduce a physics-inspired transformer with sparse attention and spin-specific positional embeddings to address scaling and efficiency challenges in sampling Boltzmann distributions for frustrated spin glasses. The method leverages FlashAttention for parallel ancestral sampling, achieving a 100x speedup over vanilla variational autoregressive networks while enabling single-GPU simulations at unprecedented scales. Results demonstrate accurate resolution of probability distributions, free energies, and overlap statistics across temperatures for Sherrington-Kirkpatrick and Edwards-Anderson models, overcoming limitations of prior machine-learning approaches.
spin glasses · boltzmann sampling · sparse attention · positional embeddings · flashattention
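To make the terms Boltzmann distribution, free energy, and ancestral sampling concrete, the toy sketch below enumerates every state of a tiny Ising-type system and samples spins one at a time from exact conditionals. Exact enumeration stands in for the learned transformer here and only works for a handful of spins. The couplings, function names, and system size are illustrative assumptions.

```python
import random
from itertools import product
from math import exp, log


def energy(spins, couplings):
    """E(s) = -sum over pairs (i, j) of J_ij * s_i * s_j."""
    return -sum(j * spins[a] * spins[b] for (a, b), j in couplings.items())


def boltzmann(n, couplings, temperature):
    """Exact state probabilities and free energy F = -T ln Z by enumeration."""
    states = list(product((-1, 1), repeat=n))
    weights = [exp(-energy(s, couplings) / temperature) for s in states]
    z = sum(weights)
    probs = {s: w / z for s, w in zip(states, weights)}
    return probs, -temperature * log(z)


def conditional_up(probs, prefix):
    """p(s_k = +1 | earlier spins = prefix): the factor an autoregressive model learns."""
    k = len(prefix)
    total = sum(p for s, p in probs.items() if s[:k] == prefix)
    up = sum(p for s, p in probs.items() if s[:k] == prefix and s[k] == 1)
    return up / total


def ancestral_sample(n, probs, rng):
    """Draw spins one by one, each conditioned on the spins already drawn."""
    spins = ()
    for _ in range(n):
        spins += (1 if rng.random() < conditional_up(probs, spins) else -1,)
    return spins


if __name__ == "__main__":
    # Three spins with mixed-sign (frustrated) couplings, chosen arbitrarily.
    couplings = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): -1.0}
    probs, free_energy = boltzmann(3, couplings, temperature=1.0)
    print(free_energy, ancestral_sample(3, probs, random.Random(0)))
```

The product of the conditionals along a sampled configuration equals its exact Boltzmann probability, which is why an autoregressive model that learns those conditionals can produce independent samples without a Markov chain.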
📰 Industry Media (1)
The emergence of the web data infrastructure layer for AI
The article identifies a critical bottleneck in AI systems: the lack of real-time, structured web data infrastructure to support dynamic retrieval and contextual grounding. It proposes a new infrastructure layer capable of large-scale, low-latency data collection (80B requests/day) while mimicking human browsing behavior to bypass anti-bot measures. Survey data indicates 56% of practitioners require real-time web data to improve output trustworthiness, while 97% of AI organizations depend on such infrastructure despite 90% facing access restrictions. The solution emphasizes GDPR/CCPA-compliant protocols and structured data feeds to reduce hallucinations and enable applications like dynamic pricing.
retrieval-augmented generation · web data infrastructure · real-time retrieval · anti-bot circumvention · data governance
Generated automatically at 2026-06-24 21:12 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.
