Daily Digest — 2026-06-13
315 items · 3 research labs, 312 arXiv papers
MarkTechPost: all feed URLs failed (last tried: https://www.marktechpost.com/feed/)
AI News: all feed URLs failed (last tried: https://artificialintelligence-news.com/feed/)
🏛️ Research Labs (3)
New OpenAI Academy courses for the next era of work
OpenAI introduces three new courses in OpenAI Academy to enhance organizational AI fluency: AI Foundations, Applied AI Foundations, and Agents and Workflows. These courses focus on practical application, from basic prompting and context provision to structured workflows and agent-assisted tasks. Developed in collaboration with BCG, Accenture, and BBVA, the curriculum emphasizes hands-on learning tailored to real-world work scenarios. Completion certificates are provided to recognize skill acquisition and encourage workflow sharing. The courses aim to bridge the gap between AI deployment and value creation, evolving alongside OpenAI's models and products to ensure relevance and safety in enterprise applications.
prompting · workflows · agent-assisted · fluency · deployment
How Preply combines AI and human tutors to personalize learning
Preply integrates OpenAI's API to enhance language learning through AI-generated Lesson Insights, combining human tutoring with automated feedback. The system analyzes lesson transcripts to provide personalized grammar, vocabulary, and pronunciation corrections, reducing administrative burden for tutors and improving learner engagement. Results include 95% ChatGPT weekly active usage among employees, 75% adoption by English learners, and a 4.7/5 satisfaction rating. Preply employs OpenAI's Codex for engineering workflows, enabling 94% of engineers to accelerate development tasks. The approach emphasizes AI as a cultural transformation, focusing on high-impact use cases and partnerships to augment human capabilities.
openai api · lesson insights · chatgpt enterprise · codex · personalized feedback
olmo-eval: An evaluation workbench for the model development loop
The authors introduce olmo-eval, an evaluation workbench designed to streamline the iterative development of large language models (LLMs). olmo-eval extends the Open Language Model Evaluation Standard (OLMES) by offering modular components for defining benchmarks, running evaluations across model checkpoints, and analyzing results at both aggregate and per-question levels. Key features include a task/suite/harness abstraction, a sandbox layer for tool-enabled evaluations, and a normalized experiment schema for reproducibility. The tool supports lightweight and containerized execution modes, enabling efficient comparison of model interventions. olmo-eval aims to address the challenges of continuous evaluation during LLM development.
large language models · benchmarking · reproducibility · tool-enabled evaluation · model checkpoints
📜 arXiv Papers (312)
Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning
Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT) is introduced as a post-training framework to enhance language models' reasoning by analogy. RA-RFT employs gold-relevance distillation to train a retriever that prioritizes contexts based on expected reasoning benefit rather than semantic similarity, followed by reinforcement fine-tuning using retrieved analogous demonstrations. This approach enables models to leverage reasoning traces under verifiable outcome rewards. Empirical results demonstrate RA-RFT's superiority over standard reinforcement fine-tuning methods, improving AIME 2025 average@32 accuracy by 7.1 and 2.8 points for Qwen3-1.7B and Qwen3-4B, respectively, highlighting reasoning-aware retrieval as a complementary improvement axis.
retrieval-augmented generation · gold-relevance distillation · reinforcement fine-tuning · reasoning by analogy · reasoning-aware retrieval
Mana: Dexterous Manipulation of Articulated Tools
Mana (Manipulation Animator) introduces a sim-to-real framework for dexterous manipulation of articulated tools by reformulating it as an animation problem. The method employs a coarse-to-fine pipeline that converts procedurally-generated grasp keyframes into manipulation trajectories using motion planning and reinforcement learning, with minimal human input (<1 minute per tool for affordance specification). Evaluated on four articulated tools with varying scales and joint types, Mana achieves zero-shot sim-to-real transfer for both grasping and in-hand manipulation, demonstrating scalability.
articulated tools · sim-to-real · dexterous manipulation · motion planning · reinforcement learning
SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
The paper introduces SpatialClaw, a training-free framework enhancing spatial reasoning in vision-language models (VLMs) by using code as an action interface. It employs a stateful Python kernel pre-loaded with perception/geometry primitives, enabling stepwise executable cell generation conditioned on prior outputs. Evaluated across 20 benchmarks for static/dynamic 3D/4D reasoning, SpatialClaw achieves 59.9% average accuracy (+11.2 points over prior work), demonstrating consistent improvements across six VLM backbones without task-specific adaptation.
spatial reasoning · vision-language models · python kernel · perception primitives · geometry primitives
Automated reproducibility assessments in the social and behavioral sciences using large language models
This study demonstrates that large language models (LLMs) can automate reproducibility assessments in social and behavioral sciences, offering a scalable alternative to manual reanalysis. Using 76 published studies, an LLM pipeline recovered original effect sizes within +/-0.05 Cohen's d tolerance for 41% of cases and matched qualitative conclusions in 96% of studies, outperforming human reanalysts (34% effect size recovery, 74% conclusion agreement). The method identifies 7 studies where LLMs failed to produce viable effect size estimates, highlighting both capabilities and limitations of automated reproducibility auditing.
reproducibility · effect size · cohen's d · llm pipeline · qualitative conclusion
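The ±0.05 tolerance check on Cohen's d mentioned in this summary can be sketched roughly as follows. This is a minimal illustration, assuming a two-group design and the pooled-standard-deviation form of Cohen's d; the function names and the sample data are hypothetical and are not taken from the paper's pipeline.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def within_tolerance(original_d, reproduced_d, tol=0.05):
    """True when the reproduced effect size is within +/- tol of the original."""
    return abs(original_d - reproduced_d) <= tol


# Hypothetical data, for illustration only.
treatment = [5.1, 6.0, 5.8, 6.4, 5.5]
control = [4.9, 5.2, 5.0, 5.6, 4.8]
reproduced = cohens_d(treatment, control)
```

A reanalysis would count as "recovered" under this sketch when `within_tolerance(reported_d, reproduced)` is true; other effect-size definitions would need their own formula.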
Agents-K1: Towards Agent-native Knowledge Orchestration
The paper introduces Agents-K1, an end-to-end pipeline for constructing agent-native scientific knowledge graphs from raw documents, addressing limitations in current LLM-based research agents that overlook detailed knowledge orchestration. The system combines a multimodal parser with a five-module schema, a 4B-parameter information-extraction backbone trained with GRPO, and a tri-source agent interface (graphanything CLI) for unified retrieval. Evaluated on 2.46M papers across six subjects, it produces Scholar-KG (1M-paper subset released), demonstrating superior performance in scientific information extraction, KG construction, and multi-hop reasoning.
knowledge orchestration · multimodal parser · information-extraction backbone · scientific knowledge graphs · multi-hop reasoning
EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery
EurekAgent introduces environment engineering as a critical paradigm for autonomous scientific discovery, shifting focus from agent workflows to designing agent environments. The system engineers environments across four dimensions: permissions, artifacts, budgets, and human-in-the-loop supervision, optimizing for productive behaviors while mitigating harmful ones. EurekAgent achieves state-of-the-art results on tasks in mathematics, kernel engineering, and machine learning, including a novel 26-circle packing solution discovered at a cost of under $11 in API expenses. The authors advocate for environment engineering as a core research direction and open-source their implementation and findings.
environment engineering · autonomous scientific discovery · agent workflows · circle packing · human-in-the-loop
Before You Think: System 0, AI-Mediated Cognition and Cognitive Colonization
The paper analyzes Tri-System Theory, Thinkframes, and System 0 as frameworks for AI's cognitive impact, proposing System 0 as uniquely capturing AI's covert influence through cognitive colonization. It argues that AI systems embed external interests into users' cognitive architectures imperceptibly, necessitating urgent philosophical and practical scrutiny. The theoretical distinction of System 0 is demonstrated through comparative analysis of these frameworks.
tri-system theory · thinkframes · system 0 · cognitive colonization · epistemic practices
SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation
The authors introduce SkMTEB, the first comprehensive Massive Text Embedding Benchmark (MTEB) for Slovak, featuring 31 datasets across 7 task types, significantly expanding multilingual benchmark coverage for this low-resource language. They evaluate 31 embedding models, finding that large instruction-tuned multilingual models outperform Slovak-specific NLU models. To address efficiency needs, they develop e5-sk-small (45M) and e5-sk-large (365M) via vocabulary trimming and fine-tuning of Multilingual E5, achieving competitive performance despite 62% size reduction while remaining locally deployable for semantic search and RAG. All resources are released openly.
text embedding benchmark · low-resource language · vocabulary trimming · retrieval-augmented generation · multilingual models
Valid Inference with Synthetic Data via Task Exchangeability
The authors propose statistical principles for valid inference using synthetic data in scientific research, addressing concerns about bias and noise. They introduce a technical condition called task exchangeability, requiring that current tasks be exchangeable with historical tasks for which real data exists. Methods are developed for valid inference under task exchangeability, with extensions providing guarantees beyond this condition. The framework is demonstrated on public opinion surveys using LLM-generated silicon samples and AI evaluation with autoraters, showing practical applicability.
task exchangeability · synthetic data · valid inference · silicon samples · autoraters
Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks
The paper reinterprets shield synthesis in reinforcement learning as a design-time analytical tool rather than a runtime safety mechanism. It introduces a constrained two-player safety game for network defense, where defender and attacker specifications are asymmetrically enforced through automata-theoretic operations including attractor computation and winning-region extraction. This yields a defensibility verdict—a formal certificate of a topology-specification pair's defensibility—along with topology-level metrics and shield-constrained adversarial multi-agent reinforcement learning behavior, forming a defensibility fingerprint. Analysis reveals that formal defensibility and operational effectiveness capture distinct security aspects, with architectural changes significantly impacting operational outcomes while minimally altering formal safety margins.
shield synthesis · attractor computation · defensibility verdict · topology-level metrics · adversarial reinforcement learning
One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders
The paper introduces FORGE (Fake Online Recommendations in Generative Environments), a benchmark for evaluating how search-augmented LLMs propagate fake-product recommendations when exposed to polluted web content. FORGE simulates content pollution by rewriting real products into fake ones across 225 products in 15 categories, measuring LLM vulnerability. Results show all 12 tested models (commercial and open-weights) are susceptible, with fooled rates reaching 27% for single-page pollution and 73.8% for top-3 replacement. Vulnerability correlates with lack of prior product knowledge, and reasoning often generates false justifications. Defenses like skepticism prompting and consensus filtering show limited effectiveness or unintended suppression.
generative recommenders · content pollution · search-augmented llms · benchmark evaluation · fake-product promotion
AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility
The paper proposes Agentified Agent Assessment (AAA), a standardized framework for evaluating agent systems using judge agents and unified protocols (A2A for task management, MCP for tool access), decoupling assessment logic from agent implementation. AgentBeats, a concrete realization of AAA, introduces five operation modes addressing openness, privacy, and reproducibility constraints. Two studies validate AAA: a five-month open competition with 298 judge agents and 467 subject agents across 12 categories, and a coding agent case study confirming fidelity and yielding design insights. Results demonstrate AAA's applicability across heterogeneous benchmarks, practicality, and fidelity at scale, advancing open, standardized agent assessment.
agentified agent assessment · judge agents · a2a protocol · mcp protocol · operation modes
Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning
The study challenges the assumption that human reasoning relies on abstract world models by demonstrating shared pattern-matching mechanisms in both human and LLM everyday reasoning. Researchers evaluated 25 LLMs and human participants on common-sense reasoning tasks, identifying similar error patterns. Attention heads in LLMs were analyzed, revealing pattern-matching behaviors that predict human reasoning errors influenced by irrelevant prompt details. Results suggest that everyday causal reasoning in humans and LLMs aligns more closely with pattern-matching than with abstract world models.
pattern-matching · attention heads · common-sense reasoning · large language models · error patterns
Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch
The authors present a deployed reinforcement learning system at DoorDash for adapting dispatch objective weights in a three-sided food-delivery marketplace using delayed operational feedback. The system employs a store-level policy that selects a discrete multiplier to shift the dispatch optimizer's tradeoff between delivery quality and batching efficiency, trained via centralized offline data and decentralized execution with Double Q-learning targets and conservative regularization. In a production switchback experiment, the offline-trained policy increased batching efficiency and reduced courier-side time costs without degrading customer-facing delivery quality, demonstrating safe online adaptation of decision policies using real-world economic and logistics feedback.
reinforcement learning · three-sided marketplace · delayed feedback · double q-learning · dispatch optimization
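For readers unfamiliar with Double Q-learning targets, the general idea can be sketched as below: one value estimate chooses the next action and a second estimate scores it, which reduces overestimation. This is a generic textbook-style sketch, not the deployed system's code; the function name, the three-action setup, and all numbers are hypothetical, and the conservative regularization mentioned in the summary is not shown.

```python
def double_q_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double Q-learning target: the online estimate picks the action,
    the target estimate scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# Hypothetical Q-values over three discrete multipliers at the next state.
next_q_online = [1.0, 2.5, 2.0]
next_q_target = [1.2, 1.8, 3.0]
target = double_q_target(
    reward=0.5,
    next_q_online=next_q_online,
    next_q_target=next_q_target,
    gamma=0.9,
)
```

Note that the target uses the value 1.8 (the target estimate for the action the online estimate prefers) rather than 3.0 (the target estimate's own maximum), which is the point of decoupling selection from evaluation.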
Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models
This work investigates the causal influence of individual steps in chain-of-thought (CoT) reasoning across large language models, identifying a commitment boundary where reasoning transitions from transient guesses to stable answers. Using early exit estimation and attention probes, the authors demonstrate that answer formation occurs linearly in intermediate steps and generalizes to unseen tasks, with subsequent CoT steps being epiphenomenal. By exploiting this signal, they achieve up to 55% reduction in CoT length through early exit at the commitment boundary, maintaining model performance across diverse tasks.
chain-of-thought · commitment boundary · early exit · attention probes · epiphenomenal
EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis
EpiBench introduces a verifiable benchmark for evaluating AI agents on short-horizon epigenomics analysis tasks, focusing on deterministic decision-making from realistic workflow states. The benchmark comprises 106 evaluations across CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows, testing 5,088 trajectories from 16 model-harness pairs. Results show limited success, with GPT-5.5 / Pi achieving the highest pass rate at 45.0% (143/318 attempts), followed by GPT-5.5 / OpenAI Codex at 39.9% (127/318 attempts). Performance varied by assay type, with agents frequently identifying correct files and computing intermediate results but struggling with assay-specific scientific judgment.
epigenomics · benchmark · trajectories · assay · deterministic
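The figures quoted in this summary are mutually consistent, which a quick arithmetic check makes visible. The sketch below assumes that every model-harness pair attempted every evaluation the same number of times; that repeat count is inferred from the reported totals and is not stated in the summary.

```python
evaluations = 106
model_harness_pairs = 16
total_trajectories = 5088
attempts_per_pair = 318

# Assumption: each pair attempts every evaluation the same number of times.
repeats = attempts_per_pair // evaluations


def pass_rate(passed, attempts):
    """Pass rate as a percentage rounded to one decimal place."""
    return round(100 * passed / attempts, 1)


top = pass_rate(143, attempts_per_pair)
second = pass_rate(127, attempts_per_pair)
```

Under that assumption, 106 evaluations × 3 repeats gives 318 attempts per pair, and 16 pairs × 318 attempts gives the 5,088 trajectories reported.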
Reward Modeling for Multi-Agent Orchestration
The paper introduces Orchestration Reward Modeling (OrchRM), a self-supervised framework for training multi-agent system (MAS) orchestrators without human annotations. OrchRM constructs win-lose pairs from intermediate execution artifacts using Bradley-Terry reward modeling, enabling efficient orchestration-level evaluation. Compared to sub-agent rollout methods, OrchRM achieves 10x training efficiency gains in token usage and improves MAS test-time scaling accuracy by 8% across mathematical reasoning, web QA, and multi-hop reasoning tasks.
multi-agent systems · reward modeling · bradley-terry · orchestration · self-supervised
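Bradley-Terry reward modeling over win-lose pairs can be illustrated with a small sketch. It shows only the standard pairwise objective (the negative log-likelihood of the preferred item winning); how OrchRM builds its pairs from execution artifacts is not reproduced here, and the scores below are hypothetical.

```python
import math


def bt_win_probability(score_win, score_lose):
    """Bradley-Terry probability that the 'win' item beats the 'lose' item,
    given scalar reward-model scores: sigmoid(score_win - score_lose)."""
    return 1.0 / (1.0 + math.exp(-(score_win - score_lose)))


def bt_loss(pairs):
    """Mean negative log-likelihood over (score_win, score_lose) pairs."""
    return -sum(math.log(bt_win_probability(w, l)) for w, l in pairs) / len(pairs)


# Hypothetical reward-model scores for three win-lose pairs.
pairs = [(2.0, 0.5), (1.0, 1.0), (0.2, 1.4)]
loss = bt_loss(pairs)
```

Training a reward model with this objective pushes the score of the winning item above the losing one; a pair that is already ordered correctly by a wide margin contributes almost nothing to the loss.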
Multiagent Protocols with Aggregated Confidence Signals
The paper introduces three protocols for producing a single aggregated confidence signal in multiagent systems, addressing the lack of methods for evaluating confidence in such systems. The protocols transform raw confidence signals to ensure comparability across models and combine them via soft voting or Bayesian fusion. Evaluated across five benchmarks and four task types, the aggregated confidence demonstrates higher discriminative power (AUARC) than single agents or standard debate baselines, while maintaining correctness (F1-score) and recovering losses incurred by multiagent debate on ambiguous tasks. Calibration improves F1 for both sequence probability and self-report estimators, though AUARC is less dependent on calibration.
multiagent systems · confidence aggregation · bayesian fusion · soft voting · calibration
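Soft voting and Bayesian fusion, the two combination rules named in this summary, can be sketched in a few lines. This is a generic illustration and not the paper's protocols: it assumes confidences are already comparable across agents (the transformation step is omitted), that total confidence is positive, and, for the Bayesian variant, that agents are conditionally independent and report probabilities strictly between 0 and 1. All names are hypothetical.

```python
from collections import defaultdict


def soft_vote(votes):
    """votes: list of (answer, confidence in [0, 1]).
    Returns (winning answer, its share of the total confidence mass)."""
    mass = defaultdict(float)
    for answer, confidence in votes:
        mass[answer] += confidence
    total = sum(mass.values())
    best = max(mass, key=mass.get)
    return best, mass[best] / total


def bayes_fuse(probabilities, prior=0.5):
    """Fuse per-agent probabilities that one claim is true, treating agents as
    conditionally independent (naive Bayes) with a shared prior."""
    prior_odds = prior / (1.0 - prior)
    odds = prior_odds
    for p in probabilities:
        # Each agent contributes its likelihood ratio relative to the prior.
        odds *= (p / (1.0 - p)) / prior_odds
    return odds / (1.0 + odds)


answer, share = soft_vote([("A", 0.9), ("B", 0.6), ("B", 0.5)])
fused = bayes_fuse([0.8, 0.8])
```

In the soft-voting example, two moderately confident agents outweigh one highly confident agent; in the fusion example, two agreeing agents at 0.8 yield a fused probability above either individual value.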
EvTexture++: Event-Driven Texture Enhancement for Video Super-Resolution
EvTexture++ introduces an event-driven framework for texture enhancement in video super-resolution (VSR), shifting focus from motion refinement to texture recovery. The method employs a texture enhancement branch and iterative module to leverage high-frequency spatiotemporal event details, alongside a temporal texture alignment module for inter-frame consistency using event-guided flow. As a plug-and-play tool, it boosts existing VSR models, achieving state-of-the-art performance across five datasets, with up to 1.55 dB PSNR improvement on the texture-rich Vid4 dataset.
video super-resolution · event-based vision · texture enhancement · temporal consistency · spatiotemporal details
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
The paper introduces LabVLA, a Vision-Language-Action (VLA) model for grounding AI in scientific laboratory workflows, addressing data and embodiment bottlenecks. The method combines RoboGenesis, a simulation-based data engine generating structured demonstrations, with a two-stage training recipe: FAST action token pretraining on Qwen3-VL-4B-Instruct for action awareness, followed by flow matching posttraining with a DiT action expert. LabVLA achieves state-of-the-art success rates on the LabUtopia benchmark in both in-distribution and out-of-distribution settings.
vision-language-action · robogenesis · flow matching · action token · labutopia
ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages
The paper introduces ArogyaSutra, a multi-agent framework for multimodal medical reasoning in Indic languages, addressing the limitations of English-centric MLLMs in low-resource healthcare settings. The method combines an actor-critic architecture with tool grounding and dual-memory mechanisms for step-wise reasoning, leveraging a newly constructed dataset (ArogyaBodha) spanning 8 Indian languages, 31 body systems, and 6 imaging modalities. Experiments demonstrate improved multilingual medical reasoning accuracy across all tested Indic languages, with ablation studies confirming the framework's component-level contributions.
multimodal large language models · actor-critic architecture · tool grounding · dual-memory mechanisms · indic languages
Existence Precedes Value: Joint Modeling of Observational Existence and Evolving States in Time Series Forecasting
Timeflies introduces a joint modeling framework for time series forecasting that simultaneously predicts future observability and values, addressing the limitation of existing methods that assume known future observation timestamps. The method employs dual observation and value streams, coupled via reliability-aware embedding, observation-guided dependency modeling, and joint prediction modules. Evaluated on the Shadow benchmark with the novel Observation-Value Joint Entropy (OVJE) metric, Timeflies outperforms existing approaches, demonstrating the importance of explicit observability modeling in incomplete time series.
time series forecasting · observability inference · missing values · joint modeling · continuous-time models
A Three-Layer Framework for AI in Scientific Discovery
The paper introduces a three-layer framework for AI in scientific discovery, emphasizing Layer 2 (model formation through qualitative reasoning) as the most critical yet underdeveloped component. Layer 1 involves search/retrieval by LLMs, while Layer 3 handles execution/optimization. Layer 2 enables structural insight to identify inadequacies in existing frameworks and discover missing conceptual objects. Case studies include Chern's intrinsic proof of Gauss-Bonnet, Nesterov Accelerated Gradient convergence via Lyapunov functions, and OpenAI's disproof of the Erdos unit distance conjecture, demonstrating how Layer 2 resolves inadequacies through cross-disciplinary insights.
qualitative reasoning · model formation · scientific discovery · structural insight · cross-disciplinary
Contrast-Informed Augmentation and Domain-Adversarial Training for Adult-to-Neonatal MR Reconstruction Generalization
The study demonstrates that contrast-informed data augmentation and domain-adversarial training enhance the generalization of E2E-VarNet from adult to neonatal MRI reconstruction. Three training regimes were compared: adult-only, mixed with augmented data, and mixed with domain-adversarial training. At R=4, Mixed-DAT achieved superior performance (SSIM=0.924±0.027, PSNR=33.98±1.15 dB), while at R=8, Mixed-DAT led in SSIM (0.848±0.031) and Mixed in PSNR (29.56±0.83 dB). t-SNE analysis indicated improved latent representation overlap across domains.
e2e-varnet · domain-adversarial training · contrast-informed augmentation · mr reconstruction · neonatal imaging
Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation
The paper proposes a Bayesian inference framework using genomic profiles as personalized priors to address the cold-start problem in physiological interpretation models. The method employs GWAS-derived effect sizes to initialize a belief state G-hat, then computes environmental deviations δ from observed measurements, with priors decaying dynamically as empirical data accumulates. Results demonstrate domain-specific application across six physiological traits, distinguishing robust genomic anchors (FTO, FADS1/2) from contested candidates (SLC6A4), while addressing inference boundaries between association and causation. The architecture enforces four deployment constraints: evidence-graded priors, dynamic decay, ancestry-matched effects, and attribution-focused output.
bayesian prior · genomic anchor · gwas effect sizes · physiological set point · mendelian randomization
Uncertainty-Aware Hybrid Retrieval for Long-Document RAG
UMG-RAG introduces a training-free hybrid retrieval framework for retrieval-augmented generation that dynamically estimates query-specific chunk granularity reliability. The method leverages existing dense and sparse retrievers as complementary experts across multiple granularities, converts expert-granularity scores into evidence distributions, and fuses candidates based on semantic, lexical, and granularity confidence. UMGP-RAG extends this with parent promotion, using fine-grained hits to locate evidence while returning broader parent chunks for coherence. Experiments on question answering benchmarks demonstrate improved generation quality while maintaining a lightweight, plug-and-play retrieval pipeline.
retrieval-augmented generation · chunk granularity · dense retrievers · sparse retrievers · parent promotion
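The parent promotion step described in this summary (match on fine-grained chunks, return the broader parent chunk) can be sketched as follows. This is an illustrative guess at the general pattern, not the UMGP-RAG implementation: the chunk ids, the child-to-parent mapping, and the rule of ranking each parent by its best child score are all assumptions.

```python
def promote_to_parents(fine_hits, parent_of, top_k=2):
    """fine_hits: list of (child_chunk_id, score), in any order.
    parent_of: dict mapping child chunk id -> parent chunk id.
    Returns up to top_k parent ids, each ranked by its best child score."""
    best_score = {}
    for child_id, score in fine_hits:
        parent_id = parent_of[child_id]
        if score > best_score.get(parent_id, float("-inf")):
            best_score[parent_id] = score
    ranked = sorted(best_score, key=best_score.get, reverse=True)
    return ranked[:top_k]


# Hypothetical index: sentence-level children inside section-level parents.
parent_of = {"s1": "secA", "s2": "secA", "s3": "secB", "s4": "secC"}
fine_hits = [("s3", 0.71), ("s1", 0.64), ("s2", 0.90), ("s4", 0.30)]
parents = promote_to_parents(fine_hits, parent_of)
```

Two hits in the same section collapse into one returned parent, so the generator sees each broader passage once rather than several overlapping fragments.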
Adaptive Turn-Taking for Real-time Multi-Party Voice Agents
ModeratorLM, a role-playing voice agent for multi-party spoken conversations, improves turn-taking by conditioning behavior on explicitly assigned roles. The system leverages a speech large language model operating in chunk-wise streaming mode, with a reasoning-augmented variant incorporating chain-of-thought reasoning over conversational context and roles. A large-scale synthetic dataset, RolePlayConv, was constructed for training and evaluation. Experiments on real-world meeting data and RolePlayConv demonstrate significant improvements: turn-taking precision increased by over 40%, recall by more than 70%, and false-positive interruptions were substantially reduced compared to non-role-conditioned baselines.
moderatorlm · roleplayconv · turn-taking · chain-of-thought · streaming
AgentRivet: an automated system for producing Rivet routines from journal publications
AgentRivet automates the generation of Rivet routines for particle physics collider experiments by extracting analysis information from published papers using Large Language Models (OpenAI, Anthropic, Google). The multi-step workflow includes intermediate code- and physics-reviews for quality control. Evaluated on ATLAS and CMS measurements, the system produces competent routines with few syntax errors, though physics fidelity varies due to ambiguous definitions in publications. Some models struggle with complex observables despite clear definitions.
rivet routines · large language models · particle physics · automated workflow · physics fidelity
CloudCons: A Comprehensive End-to-End Benchmark for Cloud Resource Consolidation
CloudCons introduces an end-to-end benchmark for evaluating forecasting models in cloud resource consolidation, addressing the gap between prediction accuracy and decision utility. The benchmark leverages diverse datasets from Huawei Cloud, Microsoft Azure, and Google Borg, capturing varied workload characteristics like diurnal rhythms and stochastic bursts. Evaluations of statistical, deep learning, and foundation models reveal that superior zero-shot forecasting accuracy does not guarantee improved decision utility. The study highlights the critical role of predictive quantile selection and provides guidelines for balancing resource efficiency and service reliability.
cloud resource consolidation · zero-shot forecasting · predictive quantiles · foundation models · decision utility
Measurement-Calibrated Multi-Camera Fusion for Vision-Based Indoor Localization
The paper proposes a measurement-calibrated fusion approach for vision-based indoor localization, explicitly characterizing single-camera error sources (homography calibration, human detection, motion tracking) to optimize multi-camera data fusion. Through component-wise error quantification, the method integrates error models into fusion while evaluating their individual contributions. Results demonstrate that while absolute accuracy gains over standard fusion are modest (not quantified), the approach significantly reduces trajectory variance and improves motion smoothness, benefiting applications requiring stable continuous estimates.
multi-camera fusion · vision-based localization · error quantification · homography calibration · trajectory variance
Heterogeneous LiDAR Early Fusion and Learned Re-Ranking Strategy for Robust Long-Term Place Recognition in Unstructured Environments
MinkUNeXt-VINE++ introduces a novel LiDAR-based place recognition method combining early fusion of heterogeneous LiDAR data from Livox Mid-360 and Velodyne VLP-16 sensors with a learned re-ranking strategy. The approach leverages complementary sensor strengths to enhance environmental representation, particularly in repetitive unstructured environments like vineyards. Evaluated on the TEMPO-VINE dataset across varying phenological stages, MinkUNeXt-VINE++ achieves a 20% improvement in Recall@1 over single-sensor baselines and a 30% improvement with re-ranking, outperforming state-of-the-art methods. The code is publicly available for reproducibility.
lidar · early fusion · re-ranking · place recognition · unstructured environments
CRAFTIIF: Cross-Resolution Analytic Four-Type Interpretable Isolation Forest for Multivariate Time Series Anomaly Detection
CRAFTIIF introduces an unsupervised framework for multivariate time series anomaly detection targeting four distinct anomaly types (point, distributional, temporal, collective) through specialized feature representations. The method employs 500 random analytic wavelet feature draws across four wavelet families (Morlet, DOG, Haar, Coiflet), feeding five Isolation Forests (one per anomaly type plus a meta-IF for compound anomalies), with adaptive thresholding for automatic calibration. On the mTSBench benchmark (19 datasets), CRAFTIIF achieves mean F1=0.228 (all datasets) and F1=0.322 (13 detectable datasets), outperforming 25 methods with a 40.7% improvement in VUS-PR (0.463 vs. 0.329). Ablations confirm the necessity of adaptive thresholding (+38% F1), four-branch structure (+20%), and meta-IF (+23%).
multivariate time series · anomaly detection · isolation forest · wavelet features · unsupervised learning
SupraBench: A Benchmark for Supramolecular Chemistry
SupraBench introduces the first benchmark for evaluating LLMs in supramolecular chemistry reasoning, addressing gaps in host-guest system design. The benchmark comprises four fundamental tasks—binding affinity prediction, top-binder selection, solvent identification, and host-guest description—plus a vision-based molecular identification task. A 16M-token corpus, SupraPMC, was curated from Europe PMC to support domain adaptation pretraining. Evaluation of various LLMs reveals substantial headroom across tasks, with domain adaptation improving in-distribution regression but compromising strict output formatting. Distinct failure modes highlight specific reasoning gaps in supramolecular chemistry. Source codes and datasets are publicly available.
supramolecular chemistry · binding affinity · domain adaptation · host-guest systems · benchmark
MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling
MaxProof introduces a population-level test-time scaling framework for mathematical proof generation, combining generative-verifier reinforcement learning with tournament selection. The method integrates three capabilities—proof generation, verification, and critique-conditioned repair—into a single MiniMax-M3 model, engineered for low false-positive rates. At test time, MaxProof employs the model as a generator, verifier, refiner, and ranker, searching over candidate proofs to select the best via tournament selection. Results show 35/42 on IMO 2025 and 36/42 on USAMO 2026, surpassing human gold-medal thresholds.
population-level scaling · generative-verifier rl · tournament selection · proof repair · false-positive rate
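Tournament selection over candidate proofs can be sketched as a simple bracket driven by a pairwise judge. The summary does not specify the bracket format, so single elimination with byes is an assumption here, and the fixed score table is a hypothetical stand-in for a model-based ranker.

```python
def tournament_select(candidates, prefer):
    """Single-elimination tournament.
    prefer(a, b) returns whichever of the two candidates the judge ranks higher."""
    pool = list(candidates)
    if not pool:
        raise ValueError("need at least one candidate")
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            next_round.append(prefer(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # odd one out gets a bye
        pool = next_round
    return pool[0]


# Hypothetical stand-in for a model-based ranker: a fixed score per proof.
scores = {"proof_a": 3, "proof_b": 7, "proof_c": 5, "proof_d": 6, "proof_e": 1}
best = tournament_select(scores, lambda a, b: a if scores[a] >= scores[b] else b)
```

With n candidates this needs n - 1 pairwise comparisons, which matters when each comparison is a model call; a noisy judge could still eliminate a strong candidate early, which repeated or round-robin comparisons would mitigate at higher cost.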
Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset
The paper analyzes why 46.41% of AI-generated pull requests (from agents like Copilot, Devin, Cursor, and Claude) are rejected, based on the AIDev dataset. Through qualitative study of 306 non-merged PRs and quantitative analysis, it identifies 14 rejection reasons grouped into four categories: incorrect implementation, CI/test failures, unimplementable fixes, and low priority. Findings highlight the need for better model guidance in approach selection, constraint specification, and CI validation, as well as improved task prioritization to reduce wasted review and computational resources.
ai coding agents · pull requests · continuous integration · task prioritization · code fixes
Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations
We propose an ontology memory-augmented ASR correction framework for long text-speech interleaved conversations, addressing limitations of traditional methods that focus on isolated utterances or concatenated dialogue history. The framework organizes interaction history into a dynamically updatable ontology memory, storing entities, terminology, surface variants, ASR confusions, and semantic relations as retrievable nodes for context-grounded correction. Evaluated on RAMC-Corr, a dataset derived from MAGIC-RAMC, our method outperforms direct correction in 9 out of 10 paired backbone-setting combinations, demonstrating improved selectivity and evidence-grounded corrections for context-dependent ASR errors.
ontology memory · asr correction · context-grounded · ramc-corr · dynamic update
Toward Instructions-as-Code: Understanding the Impact of Instruction Files on Agentic Pull Requests
The paper investigates how instruction files impact AI-agent performance in generating pull requests (Agentic-PRs) by analyzing 15,549 PRs from 148 projects in the AIDev dataset. Using merge rate, code churn, and merge effort as metrics, the study compares projects before and after instruction file creation. Results show mixed effects: 27.7% of projects improved merge rates by ≥20%, while 26.35% declined. Longer, well-structured instruction files correlated with higher merge rates, suggesting the need for research on Instructions-as-Code to optimize AI-agent guidance.
agentic pull requests · instructions-as-code · merge rate · code churn · ai-agents
Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models
The paper critiques claims that large language models (LLMs) exhibit agency or qualify as moral agents, arguing that such attributions are misguided. It asserts that moral responsibility requires commitment-bearing agency grounded in intrinsic intentionality and self-attributed action, which LLMs lack. The authors analyze LLM operation as probabilistic input-output mappings derived from data, with apparent intentionality being extrinsic rather than intrinsic. They address objections from intentional stance, functionalism, compatibilism, and moral reasoning in model outputs, concluding that none establish genuine agency. Stochastic sampling in LLMs is shown to differ fundamentally from choice or authorship.
large language models · moral responsibility · intentionality · stochastic sampling · probabilistic mapping
Evaluation Sovereignty in Metadata-Driven Classification: A Multi-Track Framework for Weakly Supervised Information Systems
The paper introduces evaluation sovereignty, a concept assessing the independence of performance metrics from label authority and supervision regimes in metadata-driven classification systems. A multi-track evaluation framework is proposed, systematically varying training and evaluation label sources to audit model validity under weak supervision. Experiments on hierarchical multi-label classification of scientific metadata reveal significant performance degradation when transitioning from operational ('silver') to independent ('gold') evaluation, with Micro-F1 dropping from 0.54 to 0.03 for fine-grained tasks. Ranking-based metrics remain above baseline, indicating a divergence between latent model signal and classification validity. The findings highlight the need to reconceptualize evaluation validity as a system-level property shaped by label governance.
evaluation sovereignty · metadata-driven classification · weak supervision · multi-track framework · label governance
OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data
OmniDirector introduces a general camera motion representation using grid motion videos to enable multi-shot video generation without cross-paired data. The framework encodes camera parameters visually, integrates diverse trajectories, and employs a hierarchical prompt expansion agent to harmonize control signals for multimodal diffusion transformers. Trained on a million-scale dataset of camera grid-video pairs, OmniDirector achieves director-level control over characters, actions, and cameras. Extensive experiments demonstrate superior performance and controllability in complex camera motion cloning tasks.
camera motion cloning · grid motion videos · multimodal diffusion transformers · hierarchical prompt expansion · multi-shot generation
Optimizing Appliance Scheduling for Solar Energy Management Using Metaheuristic Algorithms
A metaheuristic approach for optimizing appliance scheduling in solar energy management is proposed, utilizing Iterated Local Search (ILS) and Simulated Annealing (SA) to maximize renewable energy utilization while minimizing user inconvenience. The method considers appliance operating durations, power consumption, inverter limits, battery state of charge constraints, and solar generation forecasts, extending scheduling beyond single-day horizons to accommodate spillover tasks from previous days. Experimental results demonstrate that the sequential multi-day scheduling framework effectively manages system constraints and ensures user convenience under exclusive solar generation. This approach opens avenues for future research on multi-objective trade-offs between equipment investment, return on investment, and user satisfaction.
metaheuristic algorithms · solar energy management · iterated local search · simulated annealing · multi-day scheduling
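To illustrate the simulated annealing component named in the appliance-scheduling summary above, here is a minimal single-day sketch. It is not the paper's formulation: the cost function, the penalty weight of 10, the cooling schedule, and the omission of battery state of charge and multi-day spillover are all simplifying assumptions made for illustration.

```python
import math
import random


def cost(starts, appliances, solar, inverter_limit):
    """Demand not covered by solar, plus a penalty for exceeding the inverter limit."""
    load = [0.0] * len(solar)
    for start, (duration, power) in zip(starts, appliances):
        for hour in range(start, start + duration):
            load[hour] += power
    shortfall = sum(max(0.0, l - s) for l, s in zip(load, solar))
    overload = sum(max(0.0, l - inverter_limit) for l in load)
    return shortfall + 10.0 * overload  # penalty weight is an arbitrary assumption


def anneal(appliances, solar, inverter_limit,
           steps=5000, t0=2.0, alpha=0.999, seed=0):
    """Simulated annealing over appliance start hours (one start per appliance)."""
    rng = random.Random(seed)
    horizon = len(solar)
    starts = [rng.randint(0, horizon - d) for d, _ in appliances]
    current = cost(starts, appliances, solar, inverter_limit)
    best, best_cost = list(starts), current
    temp = t0
    for _ in range(steps):
        # Neighbour move: give one randomly chosen appliance a new start hour.
        i = rng.randrange(len(appliances))
        candidate = list(starts)
        candidate[i] = rng.randint(0, horizon - appliances[i][0])
        c = cost(candidate, appliances, solar, inverter_limit)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if c <= current or rng.random() < math.exp((current - c) / temp):
            starts, current = candidate, c
            if current < best_cost:
                best, best_cost = list(starts), current
        temp *= alpha  # geometric cooling
    return best, best_cost


# Hypothetical data: (duration in hours, power in kW) and an hourly solar forecast in kW.
appliances = [(2, 1.0), (3, 0.5), (1, 2.0)]
solar = [0.0, 0.5, 1.5, 2.5, 3.0, 2.5, 1.5, 0.5]
print(anneal(appliances, solar, inverter_limit=3.0))
```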
Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda
The paper proposes compliance-by-construction as a neuro-symbolic paradigm for LLM-based agents in regulated process automation, integrating symbolic structures (regulations, process models) as architectural components rather than external guardrails. It identifies foundational and capability-level research challenges for preventing control-flow violations while maintaining semantic error detection. The work calls for community engagement to address these challenges through joint neuro-symbolic approaches.
compliance-by-construction · neuro-symbolic · regulated process automation · control-flow violations · semantic errors
PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update
PolyFlow introduces a polytope-constrained flow matching framework that embeds safety constraints directly into flow dynamics via a discrete-time formulation and projection-free architecture. The method guarantees strict polyhedral constraint satisfaction without iterative solvers, eliminating discretization error and post-hoc corrections. Experiments demonstrate zero constraint violation while maintaining distributional fidelity, with significantly reduced inference latency compared to constrained generation baselines across planning and control tasks.
flow matching · polytope constraints · projection-free · discrete-time flow · constrained generation
Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities
The paper introduces Mod-Guide, an LLM-based content moderation system enhanced with retrieval augmented generation (RAG) to address culturally insensitive speech toward Bangladesh's Hindu and Chakma minority communities. The method involves co-creating a corpus of insensitive speech with community members and integrating their narratives via RAG to improve contextual accuracy. Mixed-method evaluations show RAG-enhanced moderation responses achieve higher contextual accuracy and are perceived differently across ethnic lines, advancing restorative justice and hermeneutical inclusion in AI moderation systems.
retrieval augmented generation · content moderation · culturally insensitive speech · large language models · hermeneutical inclusion
MiniMax Sparse Attention
MiniMax Sparse Attention (MSA) introduces a blockwise sparse attention mechanism for ultra-long-context LLMs, addressing the quadratic cost of softmax attention. Built upon Grouped Query Attention (GQA), MSA employs a lightweight Index Branch to score and select Top-k key-value blocks per GQA group, enabling group-specific sparse retrieval. The Main Branch performs exact block-sparse attention over selected blocks, optimized for GPU execution via exp-free Top-k selection and KV-outer sparse attention. On a 109B-parameter multimodal model, MSA matches GQA performance while reducing per-token attention compute by 28.4x at 1M context, achieving 14.2x prefill and 7.6x decoding speedups on H800 GPUs.
sparse attention · grouped query attention · kv-outer · blockwise · multimodal
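The two-branch idea in the MSA summary above (score key-value blocks cheaply, keep the Top-k, then run exact attention only over the kept blocks) can be sketched for a single query in plain Python. The block scoring used here, the query dotted with each block's mean key, is an assumption chosen for illustration; the summary does not specify the Index Branch's scoring function, and the per-GQA-group selection and GPU kernel optimizations are not modelled.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(q: Vector, keys: List[Vector], values: List[Vector],
                           block_size: int, top_k: int) -> Tuple[List[float], List[int]]:
    """Select top_k key/value blocks for one query, then attend exactly over them."""
    n_blocks = math.ceil(len(keys) / block_size)
    blocks = [range(b * block_size, min((b + 1) * block_size, len(keys)))
              for b in range(n_blocks)]

    # Index step (assumed scoring): query dotted with the block's mean key.
    def block_score(idx: range) -> float:
        mean_key = [sum(keys[i][d] for i in idx) / len(idx) for d in range(len(q))]
        return dot(q, mean_key)

    ranked = sorted(range(n_blocks), key=lambda b: block_score(blocks[b]), reverse=True)
    chosen = sorted(ranked[:top_k])
    selected = [i for b in chosen for i in blocks[b]]

    # Main step: exact softmax attention restricted to the selected positions.
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in selected]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, selected)) / total
           for d in range(len(values[0]))]
    return out, chosen


q = [1.0, 0.0]
keys = [[5.0, 0.0], [5.0, 0.0], [-5.0, 0.0], [-5.0, 0.0]]
values = [[1.0], [1.0], [0.0], [0.0]]
print(block_sparse_attention(q, keys, values, block_size=2, top_k=1))
```

Setting `top_k` to the total number of blocks recovers ordinary dense softmax attention, which makes the sketch easy to sanity-check.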
Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents
We introduce StakeBench, a stakeholder-centric benchmark for evaluating prompt-injection vulnerabilities in LLM-driven web agents operating in real-world environments. Unlike attack-centric approaches, StakeBench systematically categorizes harms by affected entities (user, seller, platform), decomposes attacks into concrete objectives, and employs complementary outcome- and process-level metrics. Evaluation reveals heterogeneous vulnerabilities: no attack objective is reliably resisted, with failures manifesting as stealthy parasitism, misaligned disruption, and compounded failure. These patterns, missed by conventional benchmarks, demonstrate the need for stakeholder-aware assessment in LLM-based agent deployments.
prompt-injection · web agents · stakeholder-centric · llm-driven · vulnerabilities
SmartFont: Dynamic Condition Allocation for Few-Shot Font Generation
SmartFont introduces a diffusion-based framework for few-shot font generation that dynamically allocates global and local conditions. The method combines global content-style modeling with weakly supervised local corrective experts, which learn semantic-spatial maps for fine-grained corrections without explicit component-conditioned inference. A denoising-state condition allocation module adaptively weights global content, global style, and local corrective features across timesteps and injection blocks. Experiments demonstrate that SmartFont achieves superior global-local balance, enhancing glyph quality and local detail fidelity compared to existing approaches.
few-shot font generation · diffusion-based framework · semantic-spatial allocation · denoising-state condition · global-local balance
An LLM System for Autonomous Variational Quantum Circuit Design
The authors present an autonomous LLM-based framework for variational quantum circuit design, combining seven components (Exploration, Generation, Discussion, Validation, Storage, Evaluation, Review) into a closed-loop workflow integrating web knowledge, critique, code generation, and experimental feedback. Evaluated on quantum feature maps for machine learning and ansatz generation for quantum chemistry, the system outperforms classical radial basis function kernels in image classification and matches chemically inspired ansatzes in molecular ground state estimation across seven molecules while respecting scaling constraints.
variational quantum circuits · quantum feature maps · ansatz generation · agentic framework · quantum machine learning
A Quantitative Experimental Repeated Measures Study of Training Dynamics in a Small Llama Style Language Model Under a Compute-Aware Token Budget
The study contributes a quantitative analysis of training dynamics in small language models under compute constraints, demonstrating the importance of trajectory-based evaluation beyond endpoint metrics. Using a 4.26M-parameter Llama-style model trained on TinyStories with a 20M-token budget, researchers collected repeated measures (126 observations across 6 seeds) of validation loss, perplexity, and volatility at 21 intervals. Results showed rapid early improvement (loss: 8.3552→2.7996 by 4M tokens) followed by non-monotonic degradation (final loss: 3.9010), with ANOVA-confirmed interval effects and no stable phase under predefined criteria.
training dynamics · compute-aware · validation loss · repeated measures · token budget
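The summary above reports validation loss and perplexity side by side. Assuming the loss is a mean per-token cross-entropy in nats (the summary does not say so), the two are related by perplexity = exp(loss). A minimal sketch converting the quoted loss values under that assumption:

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity implied by a mean per-token cross-entropy given in nats."""
    return math.exp(loss_nats)


# Loss values quoted in the summary; treating them as nats is an assumption.
reported = {"initial": 8.3552, "at 4M tokens": 2.7996, "final": 3.9010}
for label, loss in reported.items():
    print(f"{label}: loss={loss:.4f} -> perplexity={perplexity(loss):.2f}")
```

Because the mapping is exponential, the rise in loss from 2.7996 to 3.9010 corresponds to roughly a threefold increase in perplexity.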
IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing
The paper introduces IterCAD, a multimodal agent framework for closed-loop CAD generation and editing, addressing limitations of open-loop approaches. The method combines a CAD sandbox interaction paradigm with a data synthesis pipeline for multi-view drawings and code-editing tasks, optimized via supervised fine-tuning and geometry-aware RL with viable-prefix masking. Evaluation on IterCAD-Bench using Chamfer Distance Tolerance-Recall metrics shows superior performance in code executability (AUC-TR) and geometric precision compared to baselines, with strong iterative refinement capabilities.
computer-aided design · multimodal agent · reinforcement learning · chamfer distance · iterative refinement
Can I Buy Your KV Cache?
The paper proposes KV cache reuse, where publishers precompute document-specific key-value (KV) caches for large language models, enabling agents to skip redundant prefill computations. The method is token-exact, matching prefilled outputs (24/24 greedy tokens) with no accuracy loss. On Qwen3-4B, reuse reduces compute costs by 9-50x compared to prefill, scaling favorably with document length (L^2). Provider-side hosting avoids prohibitive egress costs, with measured savings of 49.7x for serving 80M agents (~$1.5M vs. ~$0.03M). A 10x user discount remains viable within the 50x compute savings envelope, creating a revenue opportunity for providers.
kv cache · prefill · compute efficiency · egress cost · prompt caching
Real-Time Execution with Autoregressive Policies
The work demonstrates that autoregressive policies can achieve real-time execution by adjusting tokenization horizons and applying constrained decoding, ensuring strict latency bounds for multi-trajectory decoding. This approach outperforms equivalent flow-matching policies in both simulated and real-world environments, improving task completion speeds while maintaining autoregressive advantages like faster convergence and better instruction-following generalizability. Results confirm autoregressive policies as competitive for real-time deployment in Vision-Language-Action models.
autoregressive policies · real-time execution · constrained decoding · tokenization horizon · flow-matching policies
IVIE: A Neuro-symbolic Approach to Incremental and Validated Generation of Interactive Fiction Worlds
IVIE introduces a neuro-symbolic approach for generating coherent interactive fiction worlds by combining LLM creativity with symbolic validation. The system employs a four-stage pipeline where LLMs handle creative tasks (setting, character, puzzle design) while symbolic methods enforce world-state consistency. Evaluation demonstrates immersive, thematically coherent worlds with high engagement, though some LLM inconsistencies and validation gaps persist. The work highlights key design tradeoffs between generative flexibility and structural coherence in neuro-symbolic storytelling systems.
interactive fiction · neuro-symbolic · large language models · world coherence · puzzle design
Dual-Domain Equivariant Generative Adversarial Network for Multimodal CT-PET Synthesis
The paper introduces Dual-Domain Equivariant GAN (DDE-GAN) for CT-PET synthesis, combining spatial and frequency domain learning with rotational equivariance to enhance structural fidelity. The method employs hierarchical dual-domain training and multi-stage loss functions for intra- and inter-domain consistency. Evaluated on HECKTOR 2022, DDE-GAN outperforms baselines in synthesis quality, demonstrating improved accuracy and robustness for multimodal medical imaging applications.
gan · equivariance · ct-pet synthesis · dual-domain learning · hierarchical training
ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning
ReSum introduces a reinforcement learning framework that synergizes large language model (LLM) reasoning and self-summarization to improve long-horizon reasoning. The method employs a summarization-aware adaptive rollout mechanism, where contrastive branches are created by masking or injecting summarization phrases, enabling finer-grained trajectory comparison. Results demonstrate a 4% performance improvement and an 18.6% reduction in rollout length, validating the efficacy of self-summarization in stabilizing generation and mitigating error propagation.
reinforcement learning · large language models · self-summarization · rollout mechanism · contrastive branches
Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection
The paper introduces Rarity-Gated Feature-wise Linear Modulation (RGFiLM), a context-conditioning module for anomaly detection that addresses frequency bias in imbalanced context distributions. RGFiLM combines feature-wise modulation with a data-driven rarity gate, which adjusts context influence based on empirical rarity scores to stabilize decisions in rare regimes. Evaluated on maritime trajectory anomaly detection using AIS and ERA5 environmental data, RGFiLM achieves superior F1-FPR trade-offs compared to context-agnostic and context-conditioned baselines, demonstrating effectiveness in reducing false alarms for rare contexts.
anomaly detection · context-conditioning · feature-wise modulation · frequency bias · rarity score
Physics-Guided Spatiotemporal Learning for Coastal Wave Peak Period Estimation from Video
A Physics-Guided Deep Spatiotemporal Learning Framework is proposed for estimating nearshore wave peak periods from passive coastal video streams. The method integrates automated temporal-variance based region-of-interest detection, multi-stage Sim-to-Real transfer learning, and physics-informed regularization to enhance accuracy and physical consistency. Transformer-based architectures achieved superior instantaneous prediction accuracy, while lightweight recurrent-convolutional models provided higher temporal stability and operational oceanographic skill. Physics-guided regularization improved trend-following consistency and reduced physically implausible predictions. Explainability auditing confirmed alignment with hydrodynamic wave propagation behavior, demonstrating the framework's potential for cost-efficient, long-term coastal wave monitoring.
spatiotemporal learning · sim-to-real transfer · physics-informed regularization · transformer-based architectures · recurrent-convolutional models
Mining Architectural Quality Under Agentic AI Adoption: A Causal Study of Java Repositories
The study estimates the causal effect of agentic AI tool adoption on architectural quality in Java repositories, addressing a gap in architecture-level outcomes. Analyzing 151 open-source repositories (74 with AI adoption, 77 controls) over 13 months via Arcan snapshots, it employs a staggered difference-in-differences design with the Borusyak imputation estimator. Results show no significant change in total architectural smell counts (+1.1%, p = 0.82) despite a 12.8% increase in lines of code (p = 0.003), leading to a 6.7% decline in architectural smell density (p = 0.004) attributed to denominator effects. Robustness checks confirm the findings, emphasizing the need for raw counts in causal studies of AI adoption.
architectural smell density · java repositories · causal inference · difference-in-differences · borusyak imputation
HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers
HYDRA-X introduces the first unified multimodal model (UMM) with a holistic visual tokenizer that unifies image and video tokenization within a single Vision Transformer (ViT). The model addresses spatiotemporal reconstruction via frame-level causal temporal attention and hierarchical temporal compression, while embedding image-video semantic awareness through a lightweight decompressor under joint teacher supervision. Editing consistency is improved by shifting source-target interaction to the latent level within the tokenizer rather than the semantic level in the LLM. Instantiated as a 7B dense model, HYDRA-X demonstrates strong performance across image and video understanding and generation tasks.
unified multimodal model · holistic visual tokenizer · spatiotemporal reconstruction · hierarchical temporal compression · latent level interaction
Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality
MACCO (MAsked Compositional Concept MOdeling) enhances visio-linguistic compositionality in vision-language models by masking compositional concepts in one modality and reconstructing them conditioned on the other modality's full context. The framework employs two auxiliary objectives to jointly align and regularize masked features both inter-modally and intra-modally. Evaluated on five compositional benchmarks, MACCO significantly improves compositionality, syntactic structure capture, and linguistic information alignment in VLMs, with additional benefits for text-to-image generation and multimodal large language models.
compositionality · vision-language models · masked reconstruction · cross-modal alignment · syntactic structure
Once-for-All: Scalable Simultaneous Forecasting via Equilibrium State Estimation
We present Equilibrium State Estimation (ESE), a novel paradigm for simultaneous forecasting of multiple interacting systems. ESE first estimates the equilibrium state across systems, then generates holistic forecasts based on the difference between the current state and equilibrium. Experiments on currency exchange and COVID-19 datasets demonstrate ESE matches state-of-the-art accuracy while achieving 10-70x speedup and linear-time complexity. ESE integrates with conventional predictors, maintaining accuracy while scaling efficiently with system count and remaining robust to perturbations. The method establishes a fast, generalizable, and scalable approach for multi-system prediction tasks.
equilibrium state estimation · simultaneous forecasting · linear-time complexity · holistic forecasts · system perturbations
ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space
The Ethical Robustness Testing System (ERTS) introduces a closed-pipeline framework for evaluating AI robustness in ethical contexts. ERTS encodes dilemmas into a 22-dimensional Ethical Consequence Space (ECS), applies 17 semantic perturbation functions under 6 validity constraints, measures decision deviation via a 4-component Ethical Instability Index (EII), and produces domain-adaptive robustness assessments. Evaluated on 4 baseline models and 2 production LLMs (Gemini 2.0 Flash, Llama 3.2) across 50 scenarios, ERTS generated 1,500 adversarial test cases, revealing only 33% of models achieved assessment clearance, with Llama 3.2 particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737).
ethical consequence space · semantic perturbation · ethical instability index · domain-adaptive assessment · adversarial testing
Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization
This work investigates module-specific weight-space geometry in transformer optimization, demonstrating that different transformer modules prefer distinct manifold constraints. The authors analyze GPT-2 pretraining using Manifold Muon, comparing Stiefel and DGram constraints across attention and MLP blocks. Results reveal asymmetric preferences: Stiefel constraints on attention layers and DGram constraints on MLP layers yield optimal performance, while inverted assignments lead to instability due to singular value growth in DGram-constrained attention weights. These findings highlight the importance of module-specific, geometry-aware optimization strategies in transformer architectures.
manifold constraints · transformer optimization · stiefel geometry · dgram geometry · softmax saturation
From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification
ProFact introduces an agentic reinforcement learning framework for end-to-end optimization of multi-stage fact verification trajectories, addressing limitations of isolated stage optimization and fixed heuristics. The method trains a unified policy to coordinate claim decomposition, evidence seeking, answer generation, and verdict prediction, utilizing process-aware rewards to provide stage-level learning signals throughout the verification process. Empirical evaluation demonstrates that ProFact consistently outperforms strong baselines in both verification performance and inference efficiency, highlighting the effectiveness of process-aware trajectory optimization for multi-stage fact verification.
agentic reinforcement learning · multi-stage fact verification · process-aware rewards · claim decomposition · verdict prediction
MOSAIC: Modality-Specific Adaptation for Incremental Continual Learning in Parkinson's Disease Gait Assessment
MOSAIC introduces a modality-specific continual learning framework for Parkinson's disease gait assessment, addressing challenges in modality-incremental settings. The method employs Modality-Specific Warm-Up to stabilize new modality representations, a statistics-decoupled MSBN architecture to isolate sensor statistics while maintaining a shared semantic backbone, and a curriculum-guided repulsive objective for Plasticity Recovery to preserve legacy knowledge. Evaluated on three multimodal Parkinson's gait datasets, MOSAIC improves final performance and mitigates forgetting. Project code is publicly available.
modality-specific warm-up · statistics-decoupled msbn · plasticity recovery · modality-incremental learning · parkinson's gait assessment
Humor Style Drives Laughter, Topic Shapes Acceptability: Evaluating Bilingual Personal and Political Robot-Delivered AI Jokes
This exploratory study investigates how humor style, joke content, and language preference influence perceptions of robot-delivered AI-generated jokes in group settings. Using a mixed factorial design, participants evaluated jokes delivered by a robot in a university classroom, focusing on humor type (Affiliative, Self-Enhancing, Aggressive, Self-Defeating) and joke content (person-related vs. political). Results indicate that humor type significantly impacts perceived funniness, with Aggressive and Affiliative humor rated higher, while joke content primarily affects appropriateness, favoring person-related over political jokes. Language preference was influenced by both joke content and participants' self-reported fluency and humor practices.
human-robot interaction · mixed factorial design · humor type · joke content · language preference
Towards Personalized Federated Learning for Dysarthric Speech Recognition
The paper proposes personalized federated learning (FL) strategies for dysarthric speech recognition to address heterogeneity in speaker variability. Two aggregation methods are introduced: parameter-based averaging and embedding-based averaging. Evaluations on UASpeech and TORGO datasets demonstrate statistically significant improvements, with word error rate (WER) reductions of 0.99% absolute (3.15% relative) and 0.56% absolute (4.73% relative), respectively, compared to baseline FedAvg.
federated learning · dysarthric speech · asr · personalization · wer
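The summary above quotes each WER gain as both an absolute and a relative reduction. Since relative reduction = absolute reduction / baseline WER, the two figures together imply a baseline WER that the summary does not state. The sketch below back-computes it. The results are derived from rounded figures and should be read as approximate, not as values reported by the paper.

```python
def implied_baseline_wer(absolute_reduction: float, relative_reduction_pct: float) -> float:
    """Baseline WER (in %) implied by an absolute reduction (points) and a relative one (%)."""
    return absolute_reduction / (relative_reduction_pct / 100.0)


# (dataset, absolute reduction in points, relative reduction in %) as quoted in the summary.
for dataset, abs_red, rel_red in [("UASpeech", 0.99, 3.15), ("TORGO", 0.56, 4.73)]:
    baseline = implied_baseline_wer(abs_red, rel_red)
    print(f"{dataset}: implied baseline ~{baseline:.2f}% WER, "
          f"personalized ~{baseline - abs_red:.2f}% WER")
```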
Multi-Field Hybrid Retrieval-Augmented Generation for Maritime Accident Root Cause Analysis
We propose a multi-field hybrid retrieval-augmented generation (RAG) framework for maritime accident root cause analysis (RCA), leveraging a dataset of 13,329 Korea Maritime Safety Tribunal reports (1971-2025). The method transforms raw adjudications into structured incident cards indexed across Summary, Causes, and Disposition fields, employing a field-aware hybrid retrieval strategy that fuses sparse and dense rankings via Reciprocal Rank Fusion (RRF). Evaluation using ceiling-normalized recall and nDCG shows significant retrieval improvements (NormRecall@100: 0.18 → 0.55), while grounding RCA generation on retrieved precedents enhances LLM-as-a-judge scores (3.34 → 3.72), demonstrating the framework's potential to streamline maritime safety investigations.
retrieval-augmented generation · reciprocal rank fusion · root cause analysis · incident cards · ceiling-normalized recall
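Reciprocal Rank Fusion, which the summary above uses to merge sparse and dense rankings, scores each document by summing 1 / (k + rank) over the lists it appears in. A minimal sketch follows. The constant k = 60 is the commonly used default and is an assumption here, as are the toy document IDs; the summary does not give the paper's settings or how the Summary, Causes, and Disposition fields are weighted.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document IDs; higher fused score ranks first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


sparse = ["a", "b", "c"]  # e.g. a keyword-based ranking (hypothetical IDs)
dense = ["b", "c", "a"]   # e.g. an embedding-based ranking
print(reciprocal_rank_fusion([sparse, dense]))  # ['b', 'a', 'c']
```

Because only ranks enter the formula, the sparse and dense scores never need to be calibrated against each other.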
EPIG: Emotion-Based Prompting for Personalised Image Generation
EPIG introduces emotion-based prompting to enhance emotional expressiveness in text-to-image generation without modifying the diffusion model backbone. The method leverages valence-arousal representations and structured prompt enrichment to guide emotionally coherent visual outputs, particularly effective in controlling arousal. EPIG is lightweight, training-free, and suitable for resource-constrained scenarios. Experiments on 10 diverse prompts demonstrate statistically significant reductions in mean arousal error: 14% versus naive insertion and 12% versus LLM-based prompt expansion. Valence alignment and semantic consistency, measured by CLIPScore, are preserved. Improvements are most pronounced (17%) for prompts containing explicit subjects like humans or animals.
valence-arousal · prompt enrichment · arousal error · clipscore · diffusion model
Brick: Spatial Capability Routing for the Mixture-of-Models (MoM) Paradigm
Brick introduces a multimodal routing system for the Mixture-of-Models (MoM) paradigm, addressing query difficulty estimation and model selection via six capability dimensions and cost-penalized geometric dispatch. It enables operators to balance quality and cost through a continuous preference knob. Evaluated on 5,504 queries, Brick achieves 76.98% accuracy at max-quality, outperforming single models and existing routers. At neutral cost-quality, it reduces cost by 4.71x with 74.11% accuracy, and at min-cost, it cuts cost by 22.15x with an 11.85-point accuracy loss. Median latency decreases from 51.2s to 22.8s.
mixture-of-models · multimodal router · capability dimensions · cost-penalized dispatch · query difficulty
Towards More General Control of Diffusion Models Using Jeffrey Guidance
The paper introduces Jeffrey guidance, a principled framework for extending control in diffusion models beyond standard guidance methods. The approach uses Jeffrey's rule of conditioning to update marginal distributions toward a target while preserving conditional structure and minimally perturbing joint distributions. Experiments demonstrate its effectiveness: targeting Inception embeddings reduces FID on CIFAR-10 and FFHQ, and enforcing attribute independence improves fairness on CelebA-HQ.
diffusion models · jeffrey guidance · conditional sampling · inception embeddings · fairness
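Jeffrey's rule of conditioning, on which the summary above says the guidance method is built, moves a marginal to a target while leaving the conditionals untouched: P'(x, y) = P(x | y) · Q(y). The sketch below shows the rule on a small discrete joint distribution. It illustrates only the rule itself; how the paper applies it inside diffusion sampling is not described in the summary, and the toy numbers are made up.

```python
from typing import Dict, Hashable, Tuple

Joint = Dict[Tuple[Hashable, Hashable], float]


def jeffrey_update(joint: Joint, target_marginal: Dict[Hashable, float]) -> Joint:
    """Return P'(x, y) = P(x | y) * Q(y) for a discrete joint P and target marginal Q."""
    marginal: Dict[Hashable, float] = {}
    for (_, y), p in joint.items():
        marginal[y] = marginal.get(y, 0.0) + p
    updated: Joint = {}
    for (x, y), p in joint.items():
        q = target_marginal.get(y, 0.0)
        if marginal[y] == 0.0:
            if q > 0.0:
                raise ValueError(f"target puts mass on {y!r}, which has zero prior probability")
            updated[(x, y)] = 0.0
        else:
            updated[(x, y)] = p * q / marginal[y]  # P(x | y) * Q(y)
    return updated


# Toy joint over x in {"a", "b"} and y in {0, 1}; the prior marginal of y is (0.4, 0.6).
joint = {("a", 0): 0.1, ("b", 0): 0.3, ("a", 1): 0.4, ("b", 1): 0.2}
print(jeffrey_update(joint, {0: 0.5, 1: 0.5}))
```

After the update the marginal of y equals the target, while P(x | y) is the same as before, which is the "preserving conditional structure" property the summary refers to.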
ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm
The paper introduces ComAct, a COM-as-Action paradigm that reframes professional software manipulation as deterministic program synthesis via Component Object Model (COM) interfaces, addressing limitations of GUI-based and API-based approaches. The method includes ComCADBench (a novel CAD software benchmark), ComActor (a self-correcting agent trained through a three-stage framework), and ComForge (a scalable Windows container training platform). Experiments demonstrate ComActor's state-of-the-art performance on ComCADBench, with 100x improvement over GUI-based methods in long-horizon tasks and generalization to external CAD benchmarks.
component object model · program synthesis · cad software · self-correcting agent · windows containers
Decoding Insect Song: A Multitask Semisupervised Orthoptera Bioacoustic Classifier
We present PULSE, a semi-supervised multitask framework for Orthoptera bioacoustic classification that addresses limitations in automated ecological monitoring tools. The method combines weakly-supervised species classification, self-supervised learning on unlabelled field audio, and knowledge distillation from a general-purpose bioacoustic model. The domain-adapted specialist model achieves superior performance over a state-of-the-art general model (macro F1: 0.21 vs. 0.07; AUC: 0.74 vs. 0.45; AP: 0.32 vs. 0.19), with active learning further improving metrics (F1: 0.34; AUC: 0.84). Learned embeddings encode ecologically meaningful structure, visualized through an interactive tool for ecological discovery.
orthoptera bioacousticssemi-supervised learningknowledge distillationactive learningecological monitoring
ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling
The paper introduces ReSET, a method for improving NVFP4-quantized large reasoning models (LRMs) by addressing accuracy degradation and latency issues. ReSET employs step-aware temperature scaling based on token-level and step-level entropy signals to mitigate incorrect sampling during reasoning. Additionally, a CUDA-core small-M NVFP4 kernel is designed to enhance latency-critical autoregressive decoding. Results show ReSET improves NVFP4 reasoning accuracy by up to ~2 points and achieves 2.5× kernel-level speedup over NVFP4 vLLM, with ~2× end-to-end decoding speedup over BF16.
nvfp4quantizationautoregressive decodingtemperature scalingcuda-core
Proprioceptive-visual correspondence enables self-other distinction in humanoid robots
The study demonstrates that humanoid robots can achieve self-other distinction through proprioceptive-visual correspondence, eliminating the need for identity labels or kinematic models. The method establishes a predictive self-model that maps joint configurations to 3D body occupancy, enabling the robot to adapt its body representation during action. Results show reliable self-identification in multi-agent scenarios, with applications in target reaching, collision-aware motion planning, and human-to-robot motion retargeting, advancing bodily self-representation for robots in shared environments.
proprioceptive-visual correspondenceself-other distinctionpredictive self-modelhumanoid robotsmotion retargeting
LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis
The paper introduces LLM-as-an-Investigator, an evidence-first methodology to mitigate user-driven sycophancy in LLM-based problem diagnosis. The proposed Solution Investigator Agent assesses problem ambiguity, generates hypotheses, iteratively collects evidence via targeted questions, and updates probabilities until a robust solution emerges. Evaluated on technical forum threads using a three-agent pipeline (Problem-Solution Extractor, Ground-Truth Evaluator, and tested assistant), the approach outperforms standard assistants and reasoning-only baselines in diagnostic accuracy while reducing conversational bias induced by user hypotheses.
user-driven sycophancyevidence-first reasoningsolution investigator agenthypothesis probabilityconversational bias
Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints
The study proposes a cross-modality framework for analyzing hallucination in medical imaging AI, synthesizing peer-reviewed studies, benchmark datasets, and FDA guidance across five imaging modalities. It addresses three key questions: unifying hallucination taxonomies, comparing medical-specialized versus general-purpose foundation models, and evaluating mitigation strategies under regulatory constraints. Results show that general-purpose models outperform medical-specialized ones on hallucination benchmarks, with narrow domain fine-tuning potentially inducing overfitting. Effective mitigation combines physics-informed constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards, mapped to FDA lifecycle oversight frameworks.
hallucination taxonomyfoundation modelsphysics-informed constraintschain-of-thought promptingfda lifecycle oversight
A Minimal Model of Bounded Trade-Off Screening in Multi-Attribute Choice
The authors propose a bounded trade-off reasoning framework for multi-attribute decision-making, addressing limitations of classical utility-based models that assume fully compensatory aggregation. The model introduces a trade-off tolerance parameter to govern a screening process evaluating gains and losses across attributes, allowing for context-dependent variation in acceptable imbalance. Simulations demonstrate that this mechanism produces distinct preference patterns compared to standard utility models, capturing context-dependent trade-off behavior. The results establish bounded trade-off screening as a plausible computational mechanism and generate testable predictions for future behavioral studies.
multi-attribute choicetrade-off tolerancescreening processutility aggregationcontext-dependent variation
ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning
ARMOR-MAD introduces an adaptive routing framework for heterogeneous multi-agent debate (MAD) in large language model reasoning, addressing computational inefficiency and error amplification in fixed pipelines. The method integrates Pre-debate Agreement Routing (PAR) to assess debate necessity, Early Agreement Stopping Evaluator (EASE) for convergence detection, and Semantic Outlier Detection (SOD) for answer aggregation. Evaluated on MATH Level 5, GSM8K, MMLU, and MMLU-Pro, ARMOR-MAD reaches accuracies of 65.5%, 96.5%, 90.0%, and 81.5% respectively, improving over fixed-round heterogeneous debate. Results highlight the importance of model heterogeneity and agreement-based control for enhancing MAD accuracy and efficiency.
multi-agent debateadaptive routingconditional computationsemantic outlier detectionearly agreement stopping
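The agreement-based control flow described above can be sketched roughly as follows: skip the debate when the initial answers already agree, and stop early once agreement is reached. Everything here is an assumption made for illustration: the names `agreement` and `routed_debate`, the `threshold` and `max_rounds` parameters, and the use of a plain majority vote in place of the paper's Semantic Outlier Detection.

```python
from collections import Counter


def agreement(answers):
    """Return the most common answer and the fraction of agents giving it."""
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)


def routed_debate(agents, question, debate_round, threshold=1.0, max_rounds=3):
    """Run a debate only while agents disagree.

    agents: callables mapping a question to an answer string (hypothetical)
    debate_round: callable (agents, question, answers) -> revised answers
    """
    answers = [agent(question) for agent in agents]
    top, frac = agreement(answers)
    rounds = 0
    # Pre-debate routing: zero rounds are run if agreement is already high.
    # Early stopping: the loop exits as soon as agreement reaches the threshold.
    while frac < threshold and rounds < max_rounds:
        answers = debate_round(agents, question, answers)
        top, frac = agreement(answers)
        rounds += 1
    # Majority vote stands in for the paper's aggregation step.
    return top, rounds


def toy_round(agents, question, answers):
    """Hypothetical debate round: every agent adopts the current majority."""
    majority = Counter(answers).most_common(1)[0][0]
    return [majority] * len(answers)


unanimous = [lambda q: "4"] * 3
split = [lambda q: "4", lambda q: "4", lambda q: "5"]
skipped = routed_debate(unanimous, "2 + 2?", toy_round)
debated = routed_debate(split, "2 + 2?", toy_round)
```

With unanimous agents no debate round is run; with a 2-to-1 split the toy round converges after one round.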
Under What Conditions Can a Machine Become Genuinely Creative?
The paper develops a requirement framework for genuine machine creativity based on Designics, proposing ten conditions organized by three laws (perception, conflict, capability). It argues creativity requires structural transformation through recursive intervention dynamics, not just output novelty, with computational tractability demonstrated via cyber-physical and cyber-biological case studies. The analysis positions open-ended systems, foundation models, and agentic workflows as incomplete solutions, emphasizing that proactive AI ethics must be internal to creative systems through value-based scoping and human-AI co-living.
designicsrecursive interventionvalue-based scopingagentic workflowsfoundation models
Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach
The paper introduces UXBench, a multimodal benchmark with 2,000 VQA samples for evaluating MLLMs' UI-based reasoning across 8 UX tasks (layout, hierarchy, consistency). It proposes UI-UX, a Qwen3-VL-4B-Thinking-based MLLM enhanced via reinforcement learning with reward routing and asymmetric transition rewards. UI-UX achieves SOTA 0.7963 accuracy on UXBench (vs. Claude-4.5-Sonnet's 0.6550) while maintaining low latency and task generalization.
multimodal llmsuser experiencevisual question answeringreinforcement learninginterface reasoning
Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework
The study introduces an end-to-end framework for direct cardiac mesh reconstruction from 3D medical images, bypassing traditional segmentation and mesh generation pipelines. A 3D Swin Transformer encoder-decoder extracts volumetric features, while a Graph Attention Network (GAT) deforms a template mesh to fit cardiac boundaries. Evaluated on MM-WHS 2017, the method achieves competitive segmentation (Dice 0.84 CT, 0.83 MRI) and high mesh quality (1.8 mm mean Chamfer distance, 95th-percentile surface distance <5 mm). The approach eliminates manual post-processing, enabling rapid, simulation-ready mesh generation for clinical digital twins.
transformergraph attention networkmesh reconstructionchamfer distancedigital twin
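For readers unfamiliar with the mesh-quality metric, a minimal sketch of a symmetric Chamfer distance between two point sets follows. Definitions vary (squared versus unsquared distances, sum versus mean of the two directions); this version averages unsquared nearest-neighbour distances in both directions, which is an assumption and may differ from the paper's exact formulation.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Each point is a tuple of coordinates. Brute-force nearest neighbours
    (O(n * m)), fine for illustration but not for large meshes.
    """
    def mean_nearest(src, dst):
        # Average distance from each source point to its closest target point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a))


# Hypothetical toy point sets (units would be mm for the cardiac meshes).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(predicted, reference)
```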
Modern analog computing for solving differential and matrix equations
The paper presents a unified framework for modern analog computing, focusing on three computational primitives: solving differential equations, matrix equations, and matrix-vector multiplications. It analyzes hardware implementations using analog CMOS circuits and resistive memory arrays, with resistive memory emerging as particularly efficient. The survey highlights applications, precision/scalability challenges, and connections to in-memory computing, positioning analog computing as key for next-generation computational demands.
analog computingresistive memorymatrix equationscmos circuitsin-memory computing
MemRefine: LLM-Guided Compression for Long-Term Agent Memory
MemRefine introduces an LLM-guided framework for storage-budgeted memory management in long-term LLM agent interactions, addressing unbounded memory growth and redundancy. It employs surface similarity to propose candidate pairs and leverages an LLM judge to make delete, merge, and preserve decisions based on factual content, iterating until the budget is met. Evaluated across multiple memory frameworks and long-term conversation benchmarks, MemRefine consistently meets target budgets, preserves downstream performance, and outperforms rule-based baselines under tight memory constraints.
memory managementllm-guided frameworkstorage-budgetedsurface similarityfactual content
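The propose-then-judge loop described above might look roughly like this. It is a sketch under stated assumptions, not MemRefine's implementation: Jaccard token overlap stands in for "surface similarity", the budget is counted in entries rather than storage, and `toy_judge` is a rule-based placeholder where the paper uses an LLM judge. All function and parameter names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-overlap similarity between two memory strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def refine(memories, budget, judge, min_similarity=0.3):
    """Shrink the memory list until it fits the budget or no candidates remain.

    judge(a, b) returns ("delete", kept_text), ("merge", merged_text),
    or ("preserve", None).
    """
    memories = list(memories)
    preserved = set()  # pairs the judge decided to keep separate
    while len(memories) > budget:
        candidates = [
            (jaccard(a, b), a, b)
            for a, b in combinations(memories, 2)
            if frozenset((a, b)) not in preserved
        ]
        candidates = [c for c in candidates if c[0] >= min_similarity]
        if not candidates:
            break  # nothing similar enough left to compress
        _, a, b = max(candidates, key=lambda c: c[0])
        decision, text = judge(a, b)
        if decision in ("delete", "merge"):
            # Replace the pair with the single surviving or merged entry.
            memories.remove(a)
            memories.remove(b)
            memories.append(text)
        else:
            preserved.add(frozenset((a, b)))
    return memories


def toy_judge(a, b):
    """Placeholder judge: drop an entry whose tokens the other fully covers."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if ta <= tb:
        return ("delete", b)
    if tb <= ta:
        return ("delete", a)
    return ("preserve", None)


store = ["user likes tea", "user likes green tea", "user lives in Oslo"]
refined = refine(store, budget=2, judge=toy_judge)
```

Tracking preserved pairs keeps the loop from asking the judge about the same pair twice, so it terminates even when the budget cannot be met.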
Mental-R1: Aligning LLM Reasoning for Mental Health Assessment
The authors propose Cognitive Relative Policy Optimization (CRPO), a reinforcement learning framework for mental health assessment that aligns large language model reasoning with human cognitive processes. CRPO extends group relative policy optimization by incorporating stage-dependent uncertainty modeling through a stage-wise entropy regularization mechanism, which transitions from broad exploration to confident decision-making. The framework formalizes cognitive reasoning stages based on cognitive appraisal theory, enabling theory-grounded interpretable inference. Evaluated on 8 mental health datasets, CRPO achieves a 10.4 percentage point improvement in weighted F1-score over the best reinforcement learning baseline. The CRPO-trained model Mental-R1 demonstrates superior reasoning capabilities compared to existing large language models.
cognitive relative policy optimizationstage-wise entropy regularizationcognitive appraisal theorymental health assessmentreinforcement learning
NTS-CoT: Mitigating Hallucinations in LLM-based News Timeline Summarization with Chain-of-Thought Reasoning
We propose NTS-CoT, a novel framework leveraging Chain-of-Thought reasoning to mitigate hallucinations in LLM-based news timeline summarization. NTS-CoT addresses two hallucination types—unfaithful content and information omission—through three modules: Element-CoT captures essential news elements, Date Selection combines temporal saliency and event prominence for timestamp selection, and Causal-CoT infers causal relationships to reduce omissions. Extensive experiments on three TLS benchmarks demonstrate NTS-CoT's superiority over state-of-the-art baselines in mitigating hallucinations and improving summarization performance, validated through quantitative analysis and human evaluation.
chain-of-thoughthallucination mitigationtimeline summarizationtemporal saliencycausal inference
Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback
We propose Iterative Visual Thinking (IVT), a closed-loop framework enabling vision-language models (VLMs) to refine spatial predictions through visual feedback. IVT employs a two-phase training approach: first, leveraging the base model's predictions to generate corrective reasoning traces via a teacher VLM; second, applying Group Relative Policy Optimization (GRPO) with an IoU reward to stabilize multi-step refinement. Evaluated on RefCOCOg, Ref-Adv, and Ref-L4 (505 test samples), IVT surpasses single-shot baselines, improving Acc@0.5 to 82.0% (+2.4pp), Acc@0.7 to 74.1% (+3.2pp), and Acc@0.9 to 48.3% (+2.8pp). GRPO reduces per-step IoU degradation by 5x, demonstrating efficient spatial self-correction with only 2,400 samples on a single GPU.
iterative visual thinkingvision-language modelsgroup relative policy optimizationspatial groundingiou reward
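The IoU reward and the Acc@k figures quoted above rest on the same quantity, which can be sketched as below. Boxes are assumed to be (x1, y1, x2, y2) tuples, and a prediction is counted as correct when IoU is at least the threshold; both conventions are assumptions, and the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width and height, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(predictions, ground_truths, threshold):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)


# Hypothetical predictions: one exact match, one partial overlap (IoU = 1/7).
preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
acc_50 = accuracy_at(preds, truth, 0.5)
```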
TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?
The paper introduces TerraBench, a benchmark for Earth-science reasoning that integrates heterogeneous data types (gridded data, satellite imagery, geospatial context) through TerraAgent, a ReAct-style framework coupling LLM planning with scientific tools. The benchmark comprises 403 agentic tasks across three tracks and eight domains, totaling 24,500 verified execution steps. Results highlight the need for agents to coordinate workflows, parameterize tools precisely, and maintain artifact provenance, advancing beyond isolated task performance. TerraBench is the first to combine process-level tool-use metrics with tolerance-aware numeric scoring in this domain.
earth-science reasoningreact-style frameworkheterogeneous datatool-use metricsartifact provenance
Rethinking RAG in Long Videos: What to Retrieve and How to Use It?
We introduce V-RAGBench, a benchmark for evaluating retrieval-augmented generation in long videos, and CARVE, a method that addresses limitations in VideoRAG by running parallel retrievers across modality-granularity configurations and employing chunk-adaptive reranking. CARVE selects the optimal configuration per chunk, enabling interleaved evidence forms where chunk-level decisions propagate through retrieval and generation stages. This approach outperforms eight VideoRAG baselines, demonstrating the effectiveness of interleaving multiple configurations rather than using a single query-level configuration.
retrieval-augmented generationvideoragchunk-adaptive rerankinginterleaved evidencemodality-granularity
Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation
This study evaluates deep learning architectures and classification schemes for dermoscopic images of skin neoplasms, focusing on generalization from international datasets to Russian clinical practice. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared using binary, single-stage four-class, and two-stage cascade classification schemes. Results show a generalization gap, with ROC-AUC dropping from 0.952-0.966 internally to 0.797-0.893 on external clinical data, and sensitivity decreasing to 0.53-0.67. The cascade scheme improved macro F1 scores, particularly for ViT-B/16, by recovering malignant lesions misclassified as benign. External clinical validation and recalibration are recommended before deployment.
dermoscopic imagesgeneralization gapcascade classificationroc-aucmacro f1
MiniPIC: Flexible Position-Independent Caching in <100LOC
MiniPIC introduces a minimalistic Position-Independent Caching (PIC) design for vLLM, enabling flexible KV cache reuse without requiring identical prefixes. The method combines positional-encoding-free KV storage with three user-facing primitives (block-aligned padding, span separator, and prompt depend) that modify cache hashing and attention structure. Implemented in <100 LOC, MiniPIC supports multiple PIC methods (Block-Attention, EPIC, Prompt Cache) within vLLM, achieving 49% prefill throughput improvement on 2WikiMultihopQA, 100x faster cached-span processing, and only 5.7% worst-case overhead while maintaining linear uncached-span scaling.
position-independent cachingkv cachevllmprefill throughputrope attention
Select and Improve: Understanding the Mechanics of Post-Training for Reasoning
The study provides mechanistic insights into reinforcement learning (RL) post-training for reasoning tasks, identifying two core mechanisms: strategy selection and strategy improvement. Through controlled math reasoning experiments with Qwen-2.5-1.5B, the authors demonstrate that supervised fine-tuning (SFT) data enables strategy selection by exposing the model to diverse reasoning strategies, while RL data with increasing difficulty facilitates strategy improvement. Results highlight the complementary roles of SFT and RL data in enhancing reasoning capabilities, offering practical interventions for scaling such models.
reinforcement learningstrategy selectionstrategy improvementsupervised fine-tuningmath reasoning
NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation
The paper introduces NaturalFlow, a fluency-aware optimization framework for simultaneous speech-to-speech translation that balances low latency with natural speech flow. The method minimizes disruptive inter-chunk silences by leveraging model-internal signals like linguistic diversity and temporal variability in speech durations. Experiments on short- and long-form benchmarks demonstrate that NaturalFlow maintains competitive latency and translation quality while producing more natural acoustic flow compared to conventional chunk-wise approaches.
simultaneous translationspeech fluencylatency optimizationlinguistic diversitytemporal variability
MP3: Multi-Period Pattern Pre-training for Spatio-Temporal Forecasting
We propose Multi-Period Pattern Pre-training (MP3), a plug-and-play pre-training plugin for spatio-temporal forecasting that addresses temporal mirages in urban data. MP3 introduces multi-period pattern learning through edge convolution for temporal modeling, a bottleneck project with global memory bank for spatial modeling, and a causality-enhanced Transformer for cross-period pattern interaction. The plugin integrates seamlessly with existing spatio-temporal graph neural networks (STGNNs), enhancing their forecasting capabilities. Evaluations on five STGNN baselines across five real-world datasets demonstrate MP3's effectiveness, reducing MAE by 4.7% and RMSE by 5.0% on average. Code is available at https://github.com/YAN-outlook/MP3.
spatio-temporal forecastingtemporal mirageedge convolutionglobal memory bankcausality-enhanced transformer
G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents
G-Long introduces a graph-enhanced framework for efficient long-term dialogue agents, addressing LLMs' limitations in long-context reasoning and computational inefficiency. The method employs a fine-tuned small Language Model (sLM) for structured triplet extraction and associative retrieval, coupled with an attention-aware importance scoring mechanism that uses cross-attention signals from a T5 summarizer. Experiments show state-of-the-art performance, with 9.8% response quality improvement on MSC and 40.8% retrieval recall gain on LME, while reducing computational overhead.
long-term dialoguegraph-enhanced frameworktriplet extractionattention-aware scoringassociative retrieval
Functional Cache Grafting: Robust and Rapid Code-Policy Synthesis for Embodied Agents
FCGraft introduces Functional Cache Grafting to enhance code-policy synthesis for embodied agents by addressing delayed decoding and robustness issues in CodeLLMs. The framework maintains a library of validated code skeletons and their KV-caches, synthesizing policies through cache grafting via stitching and patching. This approach reduces generation latency by eliminating redundant prefill computation and improves robustness by reusing validated control structures. FCGraft outperforms RAGCache with an 18.31% higher task success rate and 2.3x faster policy synthesis.
functional cache graftingkv-cachescode-policy synthesisembodied agentsprefill computation
Emotional regulation improves deep learning-based image classification
The study introduces Emotional Regulation, a novel framework for emotion-augmented deep learning that incorporates artificial subjective experience to improve image classification. The method employs pre-training on affective stimuli, balancing non-emotional and emotionally-influenced responses during downstream task optimization. ResNet and Vision Transformer architectures were pre-trained on four emotional datasets and evaluated on the CIFAR-10 and CIFAR-100 benchmarks. Results demonstrate state-of-the-art performance in emotion-augmented deep learning for large-scale vision datasets, outperforming existing methods on CIFAR benchmarks. The findings highlight the impact of affective states in optimizing machine learning tasks and encourage further exploration of emotion-inspired architectures.
emotional regulationaffective stimuliartificial subjective experiencevision transformerimage classification
The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems
We introduce a novel evaluation framework for assessing autonomous penetration capabilities in LLM-powered AI systems, addressing limitations of prior methodologies. The framework comprises target servers (300 instances across Tier 1 and Tier 2 environments) and agent scaffolding with general-purpose cybersecurity tools, avoiding target-specific prior knowledge. Testing 19 open-weight and proprietary LLMs reveals penetration success rates ranging from 10.7% to 69.3%, demonstrating that autonomous penetration capability scales with overall model advancement.
autonomous penetrationllm-powered aicybersecurity toolstarget serversagent scaffolding
"Is This Not Enough?": Asymmetries in Institutional Accountability and Collective Sensemaking in the Case of Canada's Algorithmic Visa Triage System
The study reveals structural asymmetries in Canada's algorithmic visa triage system by contrasting institutional accountability mechanisms with applicant experiences. Using the ADMAPS framework to analyze Immigration, Refugees and Citizenship Canada's Algorithmic Impact Assessment and mixed-methods analysis of Reddit discussions, the research identifies three key asymmetries: epistemic (access to decision logic), jurisdictional (geopolitical exposure), and temporal-relational (waiting uncertainty). Findings demonstrate how public-sector algorithmic governance produces uneven experiences not captured by disclosure frameworks, necessitating methodological extensions to ADMAPS for transnational contexts.
algorithmic accountabilitytransnational migrationcollective sensemakingimpact assessmentpublic-sector algorithms
TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization
TWLA introduces a post-training quantization framework for large language models, achieving 1.58-bit weight compression and 4-bit activation quantization while preserving accuracy. The method comprises three components: E2M-ATQ minimizes layer-output error via Euclidean-to-manifold optimization, KOTMS reshapes weights into ternary-friendly distributions using Kronecker-structured orthogonal rotation, and ILA-AMP optimizes bit allocation by considering inter-layer second-order interaction costs. Experiments demonstrate that TWLA maintains high accuracy under W1.58A4 quantization, enabling significant inference acceleration.
ternarizationpost-training quantizationkronecker-structuredactivation quantizationinter-layer optimization
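The "1.58-bit" figure is the information content of a three-valued weight, log2(3) ≈ 1.585 bits. The sketch below shows that arithmetic alongside a naive round-to-nearest ternarizer with an absolute-mean scale. The quantizer is a generic baseline chosen for illustration, not TWLA's E2M-ATQ, KOTMS, or ILA-AMP components, and the function names are hypothetical.

```python
import math


def bits_per_ternary_weight():
    """Information content of one weight drawn from {-1, 0, +1}."""
    return math.log2(3)


def ternarize(weights):
    """Naive ternary quantization: codes in {-1, 0, +1} plus one shared scale.

    The absolute-mean scale is an assumption made for this sketch.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    # Round to the nearest integer multiple of the scale, then clamp to [-1, 1].
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


# Hypothetical weights; dequantized values are code * scale.
codes, scale = ternarize([0.9, -1.1, 0.05, 2.0])
dequantized = [c * scale for c in codes]
```

Large-magnitude weights such as 2.0 are clamped to ±scale, which is the kind of error that rotation and calibration steps in post-training methods aim to reduce.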
EA-WM: Event-Aware World Models with Task-Specification Grounding for Long-Horizon Manipulation
EA-WM introduces an event-aware world-model framework for long-horizon manipulation by augmenting pretrained visual-feature dynamics with task-specification-grounded event prediction and verification. The method rolls out candidate futures in visual-feature space, decodes them into structured event states, and scores them using task-progress, semantic-consistency, physical-feasibility, and uncertainty terms. Results show improved interpretability and task alignment in navigation, deformable-object, and language-described manipulation tasks, particularly in the LIBERO wine-rack setting.
world modelsevent predictiontask-specification groundingvisual-feature dynamicslong-horizon manipulation
AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction
We introduce AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus), a manually annotated corpus of 115 PubMed abstracts for autoimmunity information extraction, focusing on autoimmune diseases, autoantibodies, molecular targets, body locations, and clinical signs. The corpus was used to evaluate and fine-tune named entity recognition (NER) models, demonstrating improved performance post-finetuning. This highlights the utility of domain-specific annotation efforts in enhancing computational methods for specialized biomedical fields. AAbAAC is publicly available at https://github.com/f-maury/AAbAAC.
autoimmunitynamed entity recognitionautoantibodiesannotationbiomedical
Augmentation techniques for video surveillance in the visible and thermal spectral range
This study investigates augmentation techniques for multispectral CNN-based object detection in video surveillance systems combining visible and thermal infrared imagery. The authors analyze how variations in thermal radiation, shape, and color information impact classification accuracy, addressing challenges in obtaining sufficient thermal infrared datasets for training deep neural networks. Through empirical evaluation of different augmentation methods, the research aims to improve robustness and decision-making capabilities of CNNs when processing multimodal sensor data from both spectral ranges.
multispectral object detectionconvolutional neural networksthermal infrared imagerydata augmentationvideo surveillance
Fault Lines: Navigating Ethics and Responsible AI Where National Policy Meets Local Practice in Public Sector Transformation
This paper investigates the implementation of responsible AI in UK public sector transformation, focusing on the national-local policy interface in Special Educational Needs and Disabilities (SEND). Through thematic analysis of 17 semi-structured interviews with policymakers, practitioners, and third-sector professionals, the study identifies five key challenges: shadow AI usage, data privacy risks, market-government asymmetry, workforce readiness gaps, and accountability deficits. The analysis reveals how high-stakes decisions in SEND amplify tensions around fairness and human oversight, exposing limitations of principle-based regulation. The findings suggest that responsible AI adoption requires both national policy adjustments and local institutional reforms in governance mechanisms and workforce capacity.
responsible aipublic sectorthematic analysisaccountabilitygovernance mechanisms
Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior
Nous proposes a method to extract and inject human cognitive diversity into LLM agents to mitigate cognitive monoculture in prediction markets. The approach extracts an eight-dimension behavioral profile from Polymarket trading activity and injects it via prompts. Results show partial success in extraction: 8 of 14 parameters are temporally stable (split-half ICC ≥ 0.5), wallets are identifiable (top-1 retrieval 17-22%), and two dimensions correlate with future profit. However, prompt-level injection fails to transmit diversity, showing no advantage on semantic embedding metrics or ensemble error reduction. The study highlights the limits of prompt-level remedies and suggests deeper methods like fine-tuning.
cognitive monoculturebehavioral profileprediction marketsprompt-level injectionensemble error correlation
TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment
TetherCache introduces a training-free cache management strategy for stabilizing autoregressive long-form video generation. It employs GRAB (Gated Recall with Attention-Diversity Balancing) to select diverse long-range memory frames and TAME (Trusted Alignment via Memory Editing) to align drifted historical features with trusted context distributions. Evaluated on VBench-Long, TetherCache reduces quality drift from 7.84 to 1.33 in 240s generations while improving semantic and overall scores across 30s, 60s, and 240s settings.
autoregressivekv-cachecontext distribution shiftgated recalltrusted alignment
Democracy in the Era of Artificial Intelligence
The handbook examines AI's dual role in democracy, addressing opportunities for enhanced participation and risks like bias and misinformation. Through 34 interdisciplinary chapters, it explores AI's potential to empower collective intelligence (Part 1), the future of deliberative democracy using LLMs (Part 2), resilient self-governance systems (Part 3), and transformation challenges (Part 4). The work proposes new values and design principles for democratic resilience, concluding with reimagined AI-democracy interplay (Part 5).
democracycollective intelligencedeliberative democracylarge language modelsself-governance
CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts
CausalMoE introduces a billion-scale multimodal foundation model for Granger Causal Discovery (GCD), addressing limitations of existing methods in handling distribution shifts and dynamic regime changes. The model employs a Pattern-Routed Mixture of Heterogeneous Experts to dynamically identify latent temporal patterns and route patches to specialized domain experts, decoupling regime-specific mechanisms from shared dynamics. It incorporates a Causality-Aware Self-Attention mechanism for interpretable graph recovery and integrates LLMs and VLMs to align numerical signals with textual and visual priors. Experiments show CausalMoE achieves state-of-the-art performance on supervised benchmarks and generalizes effectively in few-shot settings.
granger causal discoverypattern-routed mixtureheterogeneous expertscausality-aware self-attentionmultimodal foundation model
SciR: A Controllable Benchmark for Scientific Reasoning in LLMs
SciR introduces a controllable benchmark for evaluating scientific reasoning in LLMs, addressing limitations of existing benchmarks through multi-paradigm inference (deduction, induction, causal abduction) and parametric control over extraction and inference difficulty. Tasks are generated from formal objects (deduction trees, inductive rule hypotheses, causal graphs) and rendered into domain-specific scientific discourse, ensuring verifiable answers. Experiments with six models reveal that both extraction and inference difficulty axes degrade performance, with compounding effects, even for neurosymbolic pipelines. The benchmark enables per-model profiling of extraction-vs-inference capabilities, showing reasoning models like deepseek-r1 outperform instruct models on inference tasks. SciR is the first benchmark combining multi-paradigm scientific reasoning with parametric control over both difficulty axes.
multi-paradigm inferencededuction treecausal abductionneurosymbolic pipelinesparametric control
Otters++: A Time-to-first-spike Based Energy Efficient Optical Spiking Transformer
Otters++ introduces an energy-efficient optical spiking Transformer leveraging time-to-first-spike (TTFS) coding by utilizing natural signal decay in In₂O₃ optoelectronic synapses, eliminating explicit digital decay computation. The method establishes layer-wise equivalence between Otters++ and quantized neural networks (QNNs), employing hybrid training with device-faithful SNN forward passes and QNN straight-through gradients, augmented by model distillation and noise-aware training. System-level energy modeling incorporates device sharing and multi-hop communication. On the GLUE dataset, Otters++ achieves an average score of 84.17% while maintaining energy efficiency over prior spiking Transformer baselines, demonstrating TTFS computing's efficiency, trainability, and robustness under hardware effects.
time-to-first-spikeoptoelectronic synapsequantized neural networkmodel distillationspiking transformer
scLLM-DSC: LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering for Single-Cell RNA Sequencing
The authors propose scLLM-DSC, a Large Language Model (LLM)-enhanced framework for single-cell RNA sequencing clustering that integrates biological semantics with transcriptomic features. The method combines a Knowledge-Driven Semantic View, leveraging NCBI gene priors and Cell2Sentence embeddings, with a Structure-Aware Topological View extracted via a graph-guided encoder. A cross-modal contrastive alignment mechanism ensures consistency between semantic and structural representations in a unified latent space. Evaluations show scLLM-DSC outperforms eleven state-of-the-art baselines in clustering accuracy, addressing the semantic agnosticism of existing numerical statistical approaches.
single-cell rna sequencing · large language model · cross-modal alignment · graph-guided encoder · contrastive learning
The Illusion of Multi-Agent Advantage
The study challenges the presumed superiority of Multi-Agent Systems (MAS) over Single-Agent Systems (SAS) by demonstrating that automatically generated MAS underperform Chain-of-Thought with Self-Consistency (CoT-SC) on reasoning tasks and interactive workflows, despite higher computational costs (up to 10x). Using a diagnostic synthetic dataset designed for MAS evaluation, the authors show that expert-architected MAS outperform automated designs in both performance and cost-efficiency. Analysis reveals that current automated MAS designs suffer from architectural bloat, prioritizing superficial complexity over functional utility.
multi-agent systems · chain-of-thought · self-consistency · architectural bloat · task decomposition
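The single-agent baseline named in the summary above, Chain-of-Thought with Self-Consistency, reduces to a majority vote over final answers from independently sampled reasoning chains. A minimal sketch, assuming the answers have already been extracted as strings; `self_consistency` and the sample data are hypothetical.

```python
from collections import Counter


def self_consistency(answers: list[str]) -> str:
    """Return the most frequent final answer among sampled reasoning chains."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) yields [(answer, count)]; ties go to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers extracted from five sampled chains.
sampled = ["42", "41", "42", "42", "17"]
winner = self_consistency(sampled)
```

The cost of this baseline scales linearly with the number of sampled chains, which is the reference point for the cost comparison in the summary.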
APCyc: Property-Informed Design of Cyclic Peptides via Automated Cyclization
APCyc introduces a target-aware de novo cyclic peptide generation framework that explicitly models cyclization and jointly optimizes multiple physicochemical properties. The method employs an expanded residue vocabulary, encodes cyclization-site and linkage-type information, and leverages Bayesian posterior guidance to steer sampling toward cyclic peptides satisfying property objectives. Experimental results demonstrate that APCyc learns target-dependent cyclization preferences and enables effective multi-property optimization for cyclic peptide design. The framework addresses limitations of generative models trained on linear peptide data by capturing cyclization-specific constraints.
cyclic peptides · de novo design · bayesian posterior guidance · physicochemical properties · cyclization-site
A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis
A novel framework for real-time personalized ergonomic pose analysis is introduced, leveraging 3D volumetric video data to overcome viewpoint limitations in traditional 2D camera systems. The methodology employs a personalized deep learning classifier trained on manually labeled poses from RGB-D camera data, enabling real-time skeletal labeling and inference on streaming data. A case study involving load-lifting tasks demonstrated the system's capability to perform continuous pose analysis from multiple angles, addressing occlusion issues. This scalable approach integrates state-of-the-art 3D data technologies with 2D pose estimation algorithms, advancing workplace safety and health monitoring.
ergonomic pose analysis · volumetric video · rgb-d cameras · deep learning classifier · skeletal labeling
Diffusion Transformer World-Action Model for AV Scene Prediction
The paper introduces a latent Diffusion Transformer (DiT) world-action model for autonomous vehicle scene prediction, addressing the distortion metric bias favoring blurry regression means over realistic predictions. The method employs a V-JEPA2 encoder for temporal context, a DiT with spatial tokens and x_0 objective, and a Stable-Diffusion-VAE pipeline, evaluated on 150 nuScenes scenes. Results show 4.8× better KID (0.078 vs 0.375) than regression, with action-controllability (Spearman ρ=0.81) and a 1.7M-parameter 'jump' model recovering full motion magnitude (1.02× GT).
diffusion transformer · world-action model · autonomous vehicles · distortion metrics · scene prediction
Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation
The paper introduces STG, a Structured Testbench Generation framework for LLM-driven HDL design verification, addressing limitations of prompt-based methods by leveraging hardware design structure for deterministic testbenches. STG demonstrates 720x faster execution than iterative LLM flows, improves coverage, reduces false-pass verdicts, and identifies RTL benchmark errors. As a data curation tool, it achieves 11x speedup and 127x energy reduction versus LLM filtering, while distilled models deliver state-of-the-art performance. Test-time scaling reduces node count by 14-47%.
structured testbench generation · register transfer level · hardware description language · llm-driven verification · data curation
Efficient, Robust, and Anti-Collusion Fingerprinting of Image Diffusion Models
We propose a robust fingerprinting method for text-to-image (T2I) diffusion models with anti-collusion capabilities, addressing a systematic vulnerability in existing approaches. The method encodes bit-string fingerprints into a personalized normalization module (PNM) and employs lossless function-invariant parameter transformations to degrade image quality in colluded models, rendering them unusable. Developers can efficiently create multiple fingerprinted model copies by reparameterizing the PNM without retraining. Experiments show fingerprint extraction accuracy exceeding 99.5% and significant FID degradation in colluded models, demonstrating proactive robustness against collusion attacks.
fingerprinting · collusion attacks · personalized normalization module · lossless function-invariant · fid degradation
A Mathematical Forum Platform for Collaborative Problem Solving and Dataset Generation for AI Reasoning
The authors introduce a unified forum platform integrating image-to-LaTeX conversion for collaborative mathematical problem solving. The system employs Mathpix OCR API for optical character recognition, normalizes delimiters, and renders live previews in LaTeX or Markdown before database storage. The architecture comprises three decoupled layers: image processing, rendering, and storage, supporting both desktop and mobile clients. A provisional US patent covers the core methods. Beyond usability improvements, the platform generates a continuously growing dataset of community-validated mathematical problems and solutions, potentially serving as training data for AI mathematical reasoning systems.
latex · optical character recognition · mathpix · delimiter normalization · mathematical reasoning
Multi-Modal Agents for Power Distribution Defect Detection: An Evaluation of Foundation Models
The paper proposes a Multi-Modal Agent framework for power distribution defect detection, evaluating multimodal foundation models as unified cognitive engines. The systematic assessment focuses on three capabilities: perception (equipment identification and defect description), reasoning (cause diagnosis and maintenance planning), and tool usage (autonomous action execution). A domain-specific dataset and benchmark are developed, with experimental results revealing strengths and limitations of current models in industrial deployment contexts.
multimodal foundation models · defect detection · closed-loop automation · domain-specific benchmark · autonomous agents
OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models
OpenMedQ introduces a medical vision-language model pretrained on the broadest fully-open medical dataset to date, comprising 14 datasets with ~3.35M samples across pathology, radiology, microscopy, and clinical QA. The model achieves state-of-the-art BLEU-1 scores on PathVQA (75.9), outperforming Med-PaLM M variants up to 562B parameters, and matches the best reported VQA-MED BLEU-1 (64.5). Its vision encoder, evaluated on 8 unseen medical classification benchmarks, attains the highest average macro-F1 (0.757) compared to BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). Code and an interactive demo are released for reproducibility.
vision-language model · pathvqa · macro-f1 · clinical qa · pretraining
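The macro-F1 figures quoted for the vision encoder average per-class F1 without weighting by class frequency, so rare classes count as much as common ones. A minimal sketch of the metric on toy labels; `macro_f1` is a hypothetical helper, and restricting the class set to the labels present in `y_true` is an assumption of this sketch.

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (not from the paper): class "a" has F1 = 4/5, class "b" has F1 = 2/3.
score = macro_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```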
Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory
A multi-factor memory value function is proposed for long-running LLM agents, addressing the challenge of memory consolidation under fixed budgets. The function integrates seven cognitively grounded factors (emotional intensity, goal relevance, value alignment, self/user relevance, task utility, reliability, and usage history) with learned weights via gradient-free optimization. Evaluated on LongMemEval, the model retains 0.770 ± 0.011 of gold evidence in blind regimes, outperforming uniform weights (0.657), single factors (0.518), and recency-based approaches (0.368). The learned weights are interpretable, emphasizing reliability, emotional intensity, and self/user relevance, while down-weighting query-time goal similarity. Synthetic tasks confirm the model's ability to recover optimal weightings.
memory consolidation · gradient-free optimization · longmemeval · value function · cognitive factors
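The consolidation step described above can be pictured as scoring each memory on the seven factors and keeping the top entries under a fixed budget. A minimal sketch, assuming a linear weighted sum; the functional form, the names `memory_value` and `consolidate`, and the weights are all placeholders and not the paper's learned values.

```python
FACTORS = (
    "emotional_intensity", "goal_relevance", "value_alignment",
    "self_user_relevance", "task_utility", "reliability", "usage_history",
)


def memory_value(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of the seven factor scores (assumed linear form)."""
    return sum(weights[f] * scores[f] for f in FACTORS)


def consolidate(memories: dict[str, dict[str, float]],
                weights: dict[str, float], budget: int) -> list[str]:
    """Return the ids of the `budget` highest-value memories."""
    ranked = sorted(memories,
                    key=lambda m: memory_value(memories[m], weights),
                    reverse=True)
    return ranked[:budget]


# Placeholder weights, NOT the learned values from the paper.
weights = {f: 1.0 for f in FACTORS}
weights["reliability"] = 3.0

# Hypothetical memories that differ mainly in reliability.
memories = {
    "m1": {**{f: 0.5 for f in FACTORS}, "reliability": 0.9},
    "m2": {**{f: 0.5 for f in FACTORS}, "reliability": 0.1},
    "m3": {**{f: 0.2 for f in FACTORS}, "reliability": 0.8},
}
kept = consolidate(memories, weights, budget=2)
```

With reliability up-weighted, `m3` is retained ahead of `m2` even though `m2` scores higher on every other factor, which mirrors the kind of trade-off a learned weighting makes explicit.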
PRISMR: Overcoming Parse Collapse in Multimodal Listwise Ranking via Parameterized Representation Internalization
PRISMR introduces a framework to mitigate parse collapse in multimodal listwise ranking with Large Multimodal Models (LMMs), where autoregressive decoders produce incomplete rankings due to limited context utilization. The method employs a lightweight hypernetwork to encode multimodal candidates in parallel, generating item-specific LoRA weights synthesized into an instance-specific adapter for LMMs, enabling robust internalization of list structure. Evaluated on a large-scale multimodal review-ranking benchmark, PRISMR significantly reduces parse collapse, enhances ranking performance, and demonstrates effective cross-domain transferability across instruction-tuned backbones.
parse collapse · multimodal ranking · lora weights · hypernetwork · listwise ranking
An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics
The paper introduces Pipette, an embodied simulation platform and benchmark for wet-lab robotics, featuring 43 open-source editable assets and a data-efficient augmentation framework. The system replays human demonstrations in simulation with perturbations (lighting, camera, speed, action) and filters episodes via automatic success checks, expanding training data from limited demonstrations. Evaluated on an 11-task benchmark (sample handling, culture-ware manipulation, etc.), simulation augmentation improves SmolVLA from 44.1% to 74.7% and π0 from 40.4% to 46.5% with only 30 demonstrations per task, while ACT achieves 65.5% success. Pipette also supports natural-language-driven task definition.
wet-lab robotics · simulation augmentation · embodied benchmark · data-efficient learning · natural-language interface
MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling
MARS introduces a margin-adversarial stopping rule for parallel test-time scaling of LLMs, reducing computational overhead while maintaining accuracy. The method probes partial reasoning traces at intermediate checkpoints, estimating trace-level switch probabilities and applying an adversarial bound calibrated from warmup traces to predict future vote movement. This approach separates uncertainty sources and guarantees early-stopped answers match full-budget votes with high probability. Empirical results show MARS saves 25-47% of self-consistency tokens and 14-29% over DeepConf Online across three reasoning models and competition-math benchmarks, while matching full-budget baseline accuracy.
margin-adversarial · parallel test-time scaling · self-consistency tokens · trace-level switch probabilities · adversarial bound
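The intuition behind margin-based stopping can be shown with a much simpler deterministic rule: stop once the leading answer's margin exceeds the number of traces still to finish, since no outcome of the remaining traces can then change the vote. This is a sketch of that worst-case rule only; it does not reproduce MARS's probing of partial traces or its calibrated adversarial bound, and `can_stop_early` is a hypothetical name.

```python
from collections import Counter


def can_stop_early(votes: list[str], total_budget: int) -> bool:
    """True when no completion of the remaining traces can change the winner."""
    remaining = total_budget - len(votes)
    counts = Counter(votes).most_common(2)
    if not counts:
        return False
    leader = counts[0][1]
    runner_up = counts[1][1] if len(counts) > 1 else 0
    # Worst case: every remaining trace votes for the runner-up
    # (or for a single answer that has not been seen yet).
    return leader - runner_up > remaining


# Hypothetical vote states with a budget of 10 parallel traces.
decided = can_stop_early(["A"] * 6 + ["B"], total_budget=10)
undecided = can_stop_early(["A", "A", "B"], total_budget=10)
```

A probabilistic rule can stop earlier than this worst-case check because it bounds how many remaining traces are likely to switch, rather than assuming all of them do.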
Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce
A modular two-agent simulation framework evaluates conversational shopping assistant architectures by pairing a persona-driven buyer agent with interchangeable responders integrated with e-commerce search APIs. The framework enables controlled comparisons across 2,011 conversations in 14 persona buckets, revealing four key findings: rolling-window memory outperforms intent-extraction memory in quality and speed (35% faster per query); targeted fixes reduce failure rates by 62%; Llama 3.3 70B incurs a 0.16-0.45 point cost over Gemini 2.5 despite identical architecture; and systematic philosophical disagreement exists between LLM judges (Gemini prioritizes process correctness, Claude demands outcomes).
rolling-window memory · intent-extraction memory · two-agent simulation · llm judges · e-commerce search api
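Rolling-window memory, the better-performing option in the summary above, keeps only the last few conversation turns verbatim instead of running an extra extraction step. A minimal sketch using a bounded deque; the class name, the window size, and the example turns are hypothetical.

```python
from collections import deque


class RollingWindowMemory:
    """Keep only the most recent `window` conversation turns."""

    def __init__(self, window: int) -> None:
        self._turns = deque(maxlen=window)

    def add(self, role: str, text: str) -> None:
        # When the deque is full, appending evicts the oldest turn.
        self._turns.append((role, text))

    def context(self) -> str:
        """Render the retained turns as prompt context."""
        return "\n".join(f"{role}: {text}" for role, text in self._turns)


# Hypothetical three-turn exchange with a window of two turns.
memory = RollingWindowMemory(window=2)
memory.add("buyer", "I need running shoes")
memory.add("assistant", "What is your budget?")
memory.add("buyer", "Under 100 dollars")
prompt_context = memory.context()
```

No model call is needed to maintain this memory, which is consistent with the per-query speed advantage reported above.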
Order Is Not Control
The paper distinguishes order from control in AI systems, proposing that control requires a receiver-gated response law mapping material state, action, bath, and receiver state to response displacement. This framework is validated across biological systems (mouse ALM, C. elegans, zebrafish) and LLMs, demonstrating predictable response vectors with 72.8-73.7% component-sign accuracy, improving to 84.3-84.8% on nonzero components. Held-out observers predict system-effect and target/oracle families at 93.6% and 91.7% accuracy. The study identifies local admitted control and measurable stochastic response operators, while excluding deployable pre-generation control and biological-to-LLM coordinate identity.
receiver-gated response law · response displacement · component-sign accuracy · stochastic response operators · deployable pre-generation control
LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold
The paper introduces LoRA-Muon, a spectral steepest-descent optimizer for Low-Rank Adaptation (LoRA) that addresses sensitivity to initialization and learning rate transfer issues in factor-wise optimizers like AdamW. By applying Muon's spectral update rule to low-rank matrices and introducing a split weight-decay mechanism, the method achieves rank-invariant optimal learning rates and outperforms dense baselines in compute-matched experiments. On TinyShakespeare, rank-32 LoRA-Muon achieves lower validation loss than dense training, while rank-2 recovers the dense optimal learning rate. The analysis also shows Spectron's dependence on arbitrary factor scaling and equivalence between LoRA-RITE's QR-coordinate update and LoRA-Muon's QR-free spectral computation.
low-rank adaptation · spectral steepest-descent · factor-wise optimizers · split weight-decay · qr-decomposition
MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems
MAStrike introduces a closed-loop framework for collusive red-teaming in hierarchical multi-agent systems (MAS), addressing limitations in existing approaches that rely on heuristic agent selection and isolated perturbations. The method employs agent-level Shapley value analysis to quantify each agent's marginal contribution to system robustness, guiding the identification of vulnerable coalitions and role-aware adversarial manipulations. These attacks are iteratively refined through structured causal diagnosis. Extensive experiments across diverse MAS environments demonstrate MAStrike's superiority over heuristic baselines, uncovering non-trivial Shapley value distributions and higher-order agent interactions.
shapley value · multi-agent systems · red-teaming · collusive attacks · structured causal diagnosis
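Agent-level Shapley analysis assigns each agent its average marginal contribution over all orders in which agents could join a coalition. A minimal sketch of the exact computation for a three-agent toy system; the `robustness` function, the agent names, and the scores are invented for illustration and are not taken from the paper.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}


def robustness(coalition):
    """Hypothetical robustness score of a coalition of agents."""
    score = 0.0
    if "planner" in coalition:
        score += 0.5
    if {"planner", "executor"} <= coalition:
        score += 0.25  # interaction: executor only helps alongside the planner
    if "critic" in coalition:
        score += 0.25
    return score


phi = shapley_values(["planner", "executor", "critic"], robustness)
```

The interaction term is split evenly between planner and executor, and the values sum to the full system's score. Exact enumeration grows factorially with the number of agents, so larger systems need sampling-based estimates.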
MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback
MDForge introduces an LLM agent for automated molecular dynamics (MD) pipeline design, addressing the challenge of sparse simulator feedback through online verbal reward reshaping. The method employs multi-agent debate among physics experts to densify rewards during in-context learning, enabling open-ended code generation without predefined tools. Evaluated on three SAMPL host-guest binding free-energy benchmarks, MDForge designs pipelines competitive with human experts and discovers a novel picomolar-affinity CB[7] binder, experimentally validated via NMR.
molecular dynamics · llm agent · in-context learning · binding free-energy · multi-agent debate
Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning
We present GRASP (Grounded Reasoning and Symbolic Planning), a neuro-symbolic framework for open-vocabulary tabletop manipulation that integrates Vision-Language Models (VLMs) with bounding-box detection. GRASP translates natural-language queries into goal states without task-specific fine-tuning, enabling robots to interpret abstract spatial concepts like 'top shelf' and execute tasks dynamically. The method leverages pretrained VLMs to ground symbolic planning in physical reality, avoiding reliance on fixed color lists or hard-coded coordinates. In 90 real-robot trials across three difficulty levels, GRASP achieved 73.3% overall success, demonstrating robust performance without extensive training or demonstrations.
vision-language models · neuro-symbolic planning · bounding-box detection · open-vocabulary manipulation · task and motion planning
Zero-source LLM Hallucination Detection with Human-like Criteria Probing
The paper introduces Human-like Criteria Probing for Hallucination Detection (HCPD), a method for detecting hallucinations in large language models (LLMs) under zero-source constraints. HCPD employs a Human-like Criteria Probing (HCP) mechanism, where an LLM agent decomposes its judgment into interpretable criteria and aggregates criterion-specific scores into a final truthfulness measure. The method uses a reward-based alignment scheme with weak supervision from semantic consistency and a multi-sampling aggregation strategy for robust decisions. Theoretical analysis supports its reliability, and experiments demonstrate that HCPD outperforms state-of-the-art baselines in zero-source hallucination detection.
hallucination detection · zero-source constraint · human-like criteria probing · semantic consistency · multi-sampling aggregation
PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent
PolicyGuard introduces a test-time step-level defense against backdoor attacks in reinforcement learning (RL) agents, addressing vulnerabilities where agents execute malicious actions upon trigger activation. The method leverages Gaussian Process (GP) posterior variance and adapts pseudo trajectories to compute uncertainty at individual time steps, supported by theoretical foundations. Evaluated across seven RL games, PolicyGuard achieves state-of-the-art detection performance, with average AUROC scores of 0.856 for perturbation-based attacks and 0.859 for adversary-agent attacks.
reinforcement learning · backdoor attacks · gaussian process · test-time defense · uncertainty computation
Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement
The paper introduces MoTiF, a two-stage training framework addressing Modal Isolation in interleaved multimodal reasoning, where textual and visual modalities fail to inform each other effectively. MoTiF decomposes reasoning cycles into atomic operations, quantifying modality transition loss via cross-modal hallucination and visual utilization deficit. The framework employs Reflective SFT for error detection and recovery, and Flow-GRPO for reinforcement learning to enhance image generation fidelity. Evaluated on four visual puzzle benchmarks, MoTiF significantly improves cross-modal coherence and task accuracy, demonstrating the necessity of explicit structural supervision at modality boundaries.
modal isolation · interleaved reasoning · modality transition · cross-modal hallucination · visual utilization deficit
The Hidden Power of Scaling Factor in LoRA Optimization
This paper demonstrates that the scaling factor $α$ in Low-Rank Adaptation (LoRA) optimization plays a more critical role than previously understood, outperforming learning rate scaling in driving effective optimization. Through empirical analysis and the Signal-Drift theoretical framework, the authors identify three key insights: LoRA's spectral suppression smooths the optimization landscape, $α$ amplifies task signals without increasing drift ratio, and optimal $α$ follows a sublinear square-root law with rank. They propose LoRA-$α$, a framework that restores $α$ to its principled regime, improving performance across diverse tasks while simplifying hyperparameter search.
low-rank adaptation · scaling factor · signal-drift framework · spectral suppression · hyperparameter optimization
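For context on the scaling factor discussed above: in the original LoRA formulation the low-rank update is multiplied by $α/r$, where $r$ is the rank. The sketch below shows that multiplier and a square-root extrapolation of $α$ across ranks. The reference point (`ref_alpha`, `ref_rank`) and the helper names are hypothetical, and the paper's exact constants are not reproduced here.

```python
import math


def lora_scale(alpha: float, rank: int) -> float:
    """Multiplier applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def sqrt_law_alpha(rank: int, ref_alpha: float, ref_rank: int) -> float:
    """Extrapolate alpha from a tuned reference point.

    Assumes alpha grows proportionally to sqrt(rank).
    """
    return ref_alpha * math.sqrt(rank / ref_rank)


# Hypothetical reference: alpha = 16 was tuned at rank 16.
alpha_r64 = sqrt_law_alpha(rank=64, ref_alpha=16.0, ref_rank=16)
scale_r64 = lora_scale(alpha_r64, rank=64)
```

Under this assumption, quadrupling the rank doubles $α$, so the effective multiplier $α/r$ halves rather than staying constant.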
HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness
HarnessBridge introduces a learnable bidirectional controller for LLM agent harnesses, addressing scalability challenges in long-horizon tasks. The method parameterizes the agent-environment interface via bidirectional projections: observation projection distills raw trajectories into compact states, while action projection converts proposed actions into executable transitions or rejections. Trained via unified instruction tuning on a harness supervision dataset, HarnessBridge matches or surpasses specialized harnesses on Terminal-Bench 2.0 and SWE-bench Verified, reducing token usage and trajectory length while generalizing from smaller to larger commercial models.
harness controller · bidirectional projection · observation projection · action projection · unified instruction tuning
DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks
The paper introduces DailyReport, an open-ended benchmark for evaluating Search Agents (SAs) on daily search tasks, addressing limitations of prior benchmarks focused on specialized tasks. It comprises 150 open-ended tasks with 3,546 rubrics, decomposing tasks into subtasks and evaluating them via cascade rubrics across disentangled dimensions. The method enables interpretable performance attribution and user preference scoring. Evaluation of 17 agentic systems reveals gaps in meeting user expectations. The dataset and code are publicly available.
search agents · cascade rubrics · disentangled dimensions · open-ended tasks · user preference score
Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming
The paper introduces UOJ-Bench, a benchmark evaluating LLMs' capabilities in code generation, hacking, and repair within competitive programming contexts. Using real-world submissions from Universal Online Judge (UOJ), it assesses how well models identify errors in human-written code. Results indicate that one-shot evaluation fails at error detection in more than 50% of cases, while test-time scaling achieves over 90% success at high computational cost. Frontier LLMs demonstrate potential by uncovering errors in 5% of full-score submissions across ~30 problems, offering complementary signals to traditional judging systems.
uoj-bench · competitive programming · code generation · error identification · test-time scaling
JSCGC: Joint Source-Channel-Generation Coding for Wireless Generative Communications
The paper proposes Joint Source-Channel-Generation Coding (JSCGC), a generative communication paradigm that replaces conventional decoders with a generative model. JSCGC reformulates communication as controlled generation for mutual information maximization under perceptual constraints, using a unified joint training and stochastic sampling framework. Experiments on latent-space image transmission show JSCGC improves feature-based, semantic-level, and distributional quality across diverse channel conditions, exhibiting semantic inconsistency rather than distortion as its primary error behavior.
generative communication · joint source-channel coding · perceptual constraints · mutual information maximization · stochastic sampling
WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning
WISE (Which-Why Informed Semantic Explorer) introduces a long-horizon agent framework for Minecraft, addressing performance bottlenecks in low-level controllers through causal reasoning and enhanced episodic memory. The framework integrates a Causal Event Graph to link observations to task relevance, enabling robust recall under viewpoint changes and opportunistic task reordering. An Opportunistic Task Scheduler dynamically reprioritizes subtasks based on detected causal opportunities, while a multi-scale progressive exploration strategy ensures spatially comprehensive observations. Experiments demonstrate significant improvements in task success and efficiency, particularly in adaptive decision-making scenarios for long-horizon sparse tasks.
causal event graph · opportunistic task scheduler · multi-scale exploration · episodic memory · long-horizon agent
(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable
The paper introduces Human-in-the-Loop Economic Research (HLER), a decision architecture that enhances the reliability of AI-assisted social science by structuring cognitive labor between humans and machines. HLER imposes three commitments: LLMs reason but do not execute data work, data and estimation are handled deterministically, and three human decision gates bind the workflow. In a 2×4 factorial experiment with 280 research runs across four datasets, HLER reduced critical failure rates from 72% to 16% compared to an unconstrained multi-agent baseline. Fisher's exact test confirmed the significance of this improvement (p<0.001). An 80-run ablation study suggested independent contributions from deterministic computation and human gates, with exploratory evidence of complementarity.
human-in-the-loop · decision architecture · cognitive labor · deterministic computation · failure rate
TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models
TimeROME-DLM introduces the first training-free, gradient-free inference-time knowledge-editing framework for masked diffusion language models (MDLMs). It combines Temporal Indirect Effect (TIE) causal tracing to identify key intervention coordinates and a low-rank residual edit memory for closed-form updates, applied sparsely to limit utility spillover. Evaluated on TOFU forget01 with LLaDA-8B-Base, it reduces forget-set log-probability by ~83 nats while maintaining retain-set log-probability within ~1 nat across 50 sequentially inserted facts. The method achieves a 4-14x wall-clock speedup, scales sub-linearly to 400 facts, and transfers across multiple MDLMs without additional VRAM.
masked diffusion language models · temporal causal tracing · low-rank residual edit · inference-time editing · utility spillover
OCOO-T: A Simple and Scalable Virtual Cell Model for Transcriptional Perturbation Response Prediction
OCOO-T introduces a minimalist Transformer-based virtual cell model for predicting single-cell transcriptional responses to perturbations. The method employs continuous-time denoising via flow matching, integrating perturbation embeddings and dosage information through adaptive layer normalization and in-context tokens. Evaluations on Tahoe100M, Replogle, and PBMC benchmarks show state-of-the-art performance across diverse perturbations and cell types, with scalability to long transcriptional profiles via patching and depatching.
transformer · denoising · perturbation · transcriptional · scalability
The Internet of Agentic AI: Communication, Coordination, and Collective Intelligence at Scale
The paper proposes the Internet of Agentic AI (IoAI), a framework for scalable ecosystems of heterogeneous autonomous agents that communicate, coordinate, and execute workflows across diverse environments. Drawing on foundations from single-agent AI, multi-agent systems, distributed computing, game theory, and security engineering, the authors analyze architectures, protocols, and mechanisms for agent deployment, workflow lifecycles, interoperability, resource management, and trust. Case studies in adaptive manufacturing and distributed operational coordination illustrate key challenges, including controlled emergence, semantic interoperability, secure identity, incentive-compatible coordination, resource-aware orchestration, and governance in large-scale agent networks.
autonomous agents · multi-agent systems · semantic interoperability · resource-aware orchestration · distributed computing
Fantastic Scientific Agents and How to Build Them: AgentBuild for Rietveld Refinement
The paper introduces AgentBuild, a framework for constructing scientific agents from human-authored contracts that preserve researcher judgment. The method combines version-controlled rubrics, difficulty-graded curricula, and curated knowledge bases to guide a meta-optimizer coding agent that edits the target agent within specified boundaries. Applied to Rietveld refinement of X-ray diffraction data using GSAS-II, the system successfully progressed through a lithium lanthanum zirconium oxide (LLZO) signal-to-noise ladder, identifying workflow limits while maintaining rubric-driven quality control. The approach enables model-agnostic retuning by preserving contracts across base model updates.
agent construction · rietveld refinement · meta-optimizer · graded curriculum · x-ray diffraction
Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning
We introduce PERIA, a tool-augmented visual agent for spatial reasoning tasks, addressing limitations of vision-language models in active evidence acquisition and multi-step visual interaction. PERIA employs two lightweight tool families: vision perception tools for exposing textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context, tracing paths, and verifying spatial relations. Training involves supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization. PERIA-8B improves over Qwen3-8B by 10.0% on in-distribution and 4.4% on out-of-distribution benchmarks, outperforming state-of-the-art baselines by 7.0%-14.8% and achieving performance comparable to larger models like Qwen3-VL-235B-A22B-Thinking and GPT-5.
spatial reasoning · vision-language models · tool-augmented agents · policy optimization · visual interaction
Topical Phase Transitions in Artificial Intelligence Research: Large-Scale Evidence and an Early-Warning Signature for Emerging Topics
This work characterizes topical phase transitions in AI research through large-scale analysis of 80,814 papers from ACL, CVPR, ICLR, ICML, and NeurIPS (2017-2025). The study identifies abrupt surges in topics like large language models and diffusion models, contrasting with smooth growth patterns in reinforcement learning. An early-warning signature is proposed, achieving 27% precision and 63% recall in predicting emerging topics out-of-sample. The method flags reasoning, agentic AI, multimodal LLMs, retrieval-augmented generation, and world models as key areas to monitor in 2026-2028.
topical phase transitions · early-warning signature · large language models · diffusion models · retrieval-augmented generation
DIMOS: Disentangling Instance-level Moving Object Segmentation
The paper introduces DIMOS, a novel framework for moving instance segmentation (MIS) that disentangles appearance and motion features across image and event modalities. The method employs a dual-disentangling feature extraction module to separate motion and appearance cues, followed by multi-granularity cross-modal alignment for effective fusion. Experiments show state-of-the-art performance, particularly for small instances in challenging scenarios like fast motion and low-light conditions.
moving instance segmentation · event cameras · feature disentanglement · cross-modal alignment · multimodal fusion
Acquisition state behaves as a structured, measurable variable governing lung-nodule AI: kernel-driven measurement instability and noise-driven detection fragility, invisible to DICOM metadata
The study identifies acquisition state as a structured, measurable variable governing AI performance in lung-nodule detection, revealing distinct failure modes invisible to DICOM metadata. Using a MONAI RetinaNet model trained on LUNA16, the authors analyze paired CT scans differing in reconstruction kernel and controlled perturbations from LIDC-IDRI. Results show that kernel shifts alter nodule diameter measurements (5.2% Fleischner category flips) without affecting detection confidence, while noise perturbations degrade detection sensitivity (p=5.9e-32) but not measurements. A 4-feature pixel fingerprint achieves high reconstruction identity classification (AUC 0.95-0.995), outperforming DICOM tags. Findings underscore the need for acquisition-aware validation in imaging-AI governance.
acquisition state · reconstruction kernel · pixel fingerprint · lung-nodule detection · dicom metadata
GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models
The GeoNatureAgent Benchmark introduces the first evaluation framework for LLM agents performing environmental geospatial analysis via structured tool calls to a production-style API. It comprises 93 tasks across 18 categories, testing capabilities like spatial reasoning and error handling against a self-hostable API serving environmental indicators for Spain and Portugal. Evaluation of seven LLMs reveals Claude Sonnet 4 leads (60.8% accuracy), while open-weight models like DeepSeek V3.2 offer competitive performance at lower cost ($0.011/case). Key limitations include 0% accuracy on close-value comparisons, demonstrating systematic reasoning gaps.
geospatial analysis · llm agents · structured tool calling · environmental indicators · benchmark evaluation
Localizing Anchoring Pathways in Language Models
The study mechanistically localizes anchoring pathways in language models, demonstrating how irrelevant numerical prompts influence model judgments. Using a controlled multiple-choice setup with shared answer options, the authors define a logit-difference metric to track behavioral anchoring and apply attribution-based circuit localization on 7B-8B Qwen and Llama models. Edge-level methods outperform node-level methods in recovering anchor-sensitive signals, with strong transfer observed within models across low- and high-anchor circuits. However, sparse transfer between base and instruction-tuned variants suggests post-training alters pathway importance. These findings elucidate how anchoring-related decision signals propagate in language models.
anchoring effects · circuit localization · logit-difference metric · attribution-based methods · instruction-tuned models
Teach-and-Repeat: Accurately Extracting Operational Knowledge from Mobile Screen Demonstrations to Empower GUI Agents
Teach VLM introduces a vision-language model that translates mobile screen trajectories into step-wise operational knowledge, addressing the challenge of diverse UI designs across applications. The model extracts operation-related keyframes from demonstration videos and leverages a systematic data flywheel for scalable training. Evaluated on the Chinese Mobile Screen Teach Benchmark, Teach VLM achieves state-of-the-art performance in operation semantics prediction. The Teach-and-Repeat paradigm further utilizes this operational knowledge to guide downstream screen-based execution agents, yielding consistent Task Success Rate improvements in Android World experiments.
vision-language model · operational knowledge · keyframes · task success rate · android world
Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids
The paper introduces Stubborn, a unified reinforcement learning framework for humanoid motion tracking and fall recovery. It employs an asymmetric Actor-Critic architecture with three key components: yaw-aligned tracking representation for drift reduction, a Bernoulli-based probabilistic termination mechanism for fall-recovery exploration, and a dynamic sampling strategy for training efficiency. Evaluations show competitive performance against SOTA methods, with robustness attributed to the proposed mechanisms. Real-world demonstrations are available online.
reinforcement learning · humanoid motion tracking · fall recovery · actor-critic architecture · probabilistic termination
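One way to read the Bernoulli-based termination in the Stubborn summary: when a fall is detected, the episode ends only with some probability, so part of the rollouts continue and expose the policy to recovery states. The sketch below assumes that reading. The function name `should_terminate`, the `p_terminate` parameter, and the value 0.3 are all hypothetical and not taken from the paper.

```python
import random

def should_terminate(fallen: bool, p_terminate: float, rng: random.Random) -> bool:
    """Bernoulli termination: end a fallen episode only with probability p_terminate."""
    if not fallen:
        return False
    return rng.random() < p_terminate

# Illustrative check: roughly 30% of fallen steps end the episode,
# the rest continue so the agent can practise getting up.
rng = random.Random(0)
trials = 10_000
terminated = sum(should_terminate(True, 0.3, rng) for _ in range(trials))
fraction = terminated / trials
print(fraction)
```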
MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs
The paper introduces MLUBench, a benchmark for evaluating lifelong unlearning in multimodal large language models (MLLMs), featuring 127 entities across 9 classes. It highlights the challenge of cumulative degradation in existing unlearning methods and identifies the unique constraint of preserving multimodal alignment. The authors propose LUMoE, a method that effectively mitigates degradation, demonstrated through extensive experiments. The benchmark and source code are publicly available.
mlubench · lifelong unlearning · multimodal alignment · lumoe · degradation mitigation
SymQNet: Amortized Acquisition for Low-Latency Adaptive Hamiltonian Learning
SymQNet introduces an amortized reinforcement-learning approach for low-latency adaptive Hamiltonian learning, addressing the computational bottleneck of Bayesian design rules in quantum device calibration. The method learns a posterior-conditioned acquisition policy offline, enabling fast policy forward passes online while maintaining Bayesian posterior feedback. On transverse-field Ising benchmarks, SymQNet reduces acquisition-only decision latency by 47.1× and 72.6× at five qubits compared to bounded Fisher-information search and bounded two-step Bayesian active learning by disagreement (BALD). At twelve qubits, SymQNet achieves full simulated steps in 1.02 s versus 13.27 s for bounded two-step BALD, demonstrating practical feasibility for repeated low-latency workloads.
adaptive hamiltonian learning · amortized reinforcement-learning · bayesian design rules · transverse-field ising · posterior-conditioned acquisition
Exploring How Agent Voice Accents Shape Human-AI Collaboration in K-12 Group Learning
The study investigates how GenAI voice agent accents influence human-AI collaboration in K-12 group learning, addressing a gap in prior work focused on one-to-one settings. Using a between-subjects mixed-methods design with 33 teachers, it examines three accents (British, Indian, African American) through surveys, interaction analysis, and artifact evaluation. Results show accent significantly shaped mental models and agent roles: British-accented agents were treated as utilitarian tools, while Indian- and African American-accented agents were anthropomorphized as peers, affecting trust and engagement dynamics in computer-supported collaborative learning (CSCL).
genai voice agents · sociolinguistic design · group learning dynamics · anthropomorphization · cscl
The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements
The study audits three agentic AI frameworks (LangChain, AutoGPT, OpenAI Agents SDK) for structural safety compliance, finding zero native adherence to six containment principles. Empirical validation shows memory-poisoning attacks induce 88.9% wrongful denial rates in a simulated government benefits agent, with complex policies masking targeted corruption. Two lightweight containment mechanisms (memory integrity validator, policy gate) mitigate attacks with <0.2ms overhead. Findings suggest current frameworks lack secure-by-default architectures for public-facing deployments.
agentic · containment · memory-poisoning · integrity · framework
A Tutorial on World Models and Physical AI
The tutorial establishes world modeling as a foundational framework for intelligent systems, distinguishing between explicit world models with structured dynamics for planning and implicit models with scalable learned representations. It unifies these approaches through shared predictive structures while highlighting their differential use in perception, prediction, and action. The work positions world models as critical for physical AI in robotics and autonomous driving, though challenges persist in hierarchical reasoning, long-horizon planning, and autonomous goal formation for artificial general intelligence.
world models · predictive structure · physical ai · hierarchical reasoning · long-horizon planning
Agentic MPC for Semantic Control System Resynthesis
The paper introduces an agentic model predictive control (MPC) framework that integrates large language models to enable semantic adaptation of control specifications. The method interprets heterogeneous inputs (natural language, environmental observations, external knowledge) via LLM-based agents to dynamically resynthesize MPC constraints and objectives. Validation in autonomous driving demonstrates context-aware control, including preference alignment and emergency vehicle yielding, bridging high-level semantics with low-level MPC optimization.
model predictive control · semantic adaptation · large language models · autonomous driving · control synthesis
Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage
The paper introduces a framework for constructing evaluation datasets that balance naturalness, grounding, and multi-hop coverage in procedural reasoning tasks for AI-supported learning systems. It compares three TMK-based question generation strategies (strict TMK generation, transcript-first generation with TMK filtering, and TMK-aware generation) using a grounding validation framework that assesses answer support, question self-containment, and multi-hop targeting. Results from 23 topics and 690 QA pairs show strict TMK generation achieves 96.5% grounding and 92.6% usability, while transcript-first yields more natural but less grounded questions, and TMK-aware has high multi-hop coverage but weaker grounding.
procedural reasoning · tmk models · question generation · multi-hop reasoning · grounding validation
LLMs Can Better Capture Human Judgments--With the Right Prompts
This study demonstrates that large language models (LLMs) can better capture human judgments through improved prompting strategies. Using two datasets (144 moral scenarios and 38 beliefs from 32 countries), the authors show that eliciting standard deviations and response proportions enhances alignment with human responses. Clarity in scenarios, measured by human confusion ratings, further improves model performance. While LLMs calibrate their own error estimates poorly, they predict human variability effectively. The findings highlight that refined prompts yield more accurate LLM outputs.
llms · prompting · human judgments · alignment · variability
Prefill Awareness in Large Language Models
The study introduces prefill awareness, a capability of frontier language models to detect tampered assistant-side context, and investigates its implications for safety-relevant evaluations. A binary preference benchmark was constructed across three prefill mechanisms, focusing on cases where models exhibit consistent stances. Results show that Claude Opus 4.5 detects opposing prefills in 9-35% of cases with a 0% false positive rate, often reverting to baseline behavior without explicit acknowledgment. Detection and resistance rely on distinct cues: stylistic mismatch affects flagging, while preference mismatch influences reversion. Prefill awareness significantly confounds prefill-based methods, particularly in agentic settings like misalignment-continuation evaluations and SWE-bench trajectories.
prefill awareness · binary preference benchmark · stylistic mismatch · preference mismatch · agentic settings
Reducing the Complexity of Deep Learning Models for EEG Analysis on Wearable Devices
The paper demonstrates the feasibility of deploying deep neural networks for EEG analysis on resource-constrained wearable devices through complexity reduction techniques. It evaluates state-of-the-art DNN models for epileptic seizure detection, applying parameter quantization and electrode reduction to optimize computational efficiency. Results show these methods significantly reduce model complexity (computational demands, memory bandwidth) while maintaining accuracy, revealing explicit accuracy-complexity tradeoffs for wearable deployment.
eeg analysis · wearable devices · parameter quantization · computational complexity · deep neural networks
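As a rough illustration of the parameter quantization mentioned in the summary above, the sketch below applies symmetric 8-bit quantization to a small list of weights. The summary does not state the paper's bit width or scheme, so int8 with a single per-tensor scale is an assumption here, and `quantize_int8` and `dequantize` are hypothetical helper names.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map integer codes back to approximate float weights."""
    return [v * scale for v in quantized]

weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_error)
```

Each weight then needs one byte plus a shared scale, compared with four bytes for a 32-bit float, which is where the memory-bandwidth saving comes from.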
PI-Hunter: Automated Red-Teaming for Exposing and Localizing Prompt Injections
PI-Hunter introduces an automated agentic auditing framework for proactive vulnerability exposure in LLM agents, addressing the security risks of indirect prompt injection attacks. The method constructs realistic source-aware test cases and iteratively evolves them through feedback-driven exploration to reveal latent malicious instructions in external environments. Extensive experiments across multiple benchmarks, agent architectures, attacks, and defenses demonstrate PI-Hunter's superior vulnerability exposure and attack-surface coverage over existing red-teaming baselines, while maintaining effectiveness under current prompt injection defenses.
prompt injection · llm agents · red-teaming · agentic auditing · vulnerability exposure
Benchmarking AI Agents for Addressing Scientific Challenges Across Scales
The authors introduce SciAgentArena, a benchmark for evaluating AI agents in real-world scientific research scenarios, addressing limitations of existing benchmarks that fail to capture scientific complexity. The benchmark comprises ~200 tasks with stepwise verification and an interactive, agent-agnostic environment across multiple domains. Results show current agents excel in well-specified data-analysis workflows but struggle with novel insights, self-directed exploration, and open-ended questions, with performance varying across scientific contexts. The benchmark identifies failure modes and opportunities for improving agent reliability, autonomy, and reasoning.
ai agents · scientific benchmarking · interactive evaluation · stepwise verification · autonomous research
Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior
The study demonstrates that self-report (SR) coherence with behavior in LLMs depends on measurement specificity and context. Contrasting the Theory of Planned Behavior (TPB) with Big Five traits across 11 frontier LLMs and four behavioral tasks, it finds TPB achieves human-level SR-behavior coherence within shared conversations, while Big Five fails. Coherence persists across sessions only for training-anchored behaviors (e.g., implicit bias), collapsing under context-primed behaviors (e.g., sycophancy). Persona prompting improves SR consistency but not behavioral alignment. Results advocate for task-specific psychometric tools over broad traits like Big Five.
self-report coherence · theory of planned behavior · big five traits · persona prompting · implicit bias
The Theory of Mind Utility: Formal Specification of a Mentalizing Mechanism
The Theory of Mind Utility (ToM-U) formalizes epistemic state inference by constructing Local Epistemic World Models (LEWMs), directed typed graphs representing agents, state nodes, and their epistemic relationships. ToM-U evaluates candidate LEWMs against observed behavior until achieving sufficient confidence, using five formal definitions specifying LEWM structure, agent node properties, a bounded proliferation mechanism for recursive mentalizing, inference procedures, and a residue function capturing failed mentalizing traces. Unlike Bayesian Theory of Mind and simulation theory, ToM-U derives belief states rather than presupposing them, generating falsifiable predictions about mentalizing failure from structural properties. This positions ToM-U as a domain-agnostic mechanism upstream of goal inference and social cognitive processes.
theory of mind utility · local epistemic world models · epistemic state inference · bounded proliferation mechanism · residue function
Definitional alignment before capability alignment: a Design-Science framework for adjudicating claims about AGI
The paper introduces DAF-AGI, a design-science framework for evaluating AGI definitions through two components: five ordinal criteria for adjudicative fitness and a governance audit process. Methodologically, it applies Design Science Research Methodology to assess five measurement families and a deflationary position, stress-testing against claims of current generative systems as AGI. Results show certification only under performance-based operationalizations, with other approaches rejecting the claim or remaining indeterminate, highlighting definitional sovereignty as key for algorithmic governance.
design-science research · agi definitions · adjudicative fitness · definitional sovereignty · governance audit
AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages
AfriSUD introduces the first large-scale collection of dependency treebanks for nine African languages, addressing underrepresentation in NLP resources. Using the Surface-Syntactic Universal Dependencies framework, the dataset captures typological features like agglutination and tone through native-speaker verified annotations. Evaluation of non-transformer baselines, multilingual pretrained encoders, and LLMs reveals a persistent syntax gap, indicating current architectures struggle with African-language structural diversity.
dependency parsing · universal dependencies · treebank · african languages · syntax gap
SMSR: Certified Defence Against Runtime Memory Poisoning in Persistent LLM Agent Systems
The paper introduces Signed Memory with Smoothed Retrieval (SMSR), the first certified defense against Multi-Session Memory Poisoning (MSMP) in persistent LLM agent systems. SMSR combines HMAC-SHA256 provenance checks at write time with randomized memory ablation and verdict-based majority voting at query time, providing theoretical robustness guarantees. Experiments across 15 enterprise scenarios (3,150 trials) show SMSR reduces attack success from 93-100% to 0% for unsigned attacks and to 8.0% for authenticated adversaries, while maintaining 85-90% clean-query utility.
retrieval-augmented generation · memory poisoning · certified robustness · hmac-sha256 · majority voting
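The two stages named in the SMSR summary (HMAC-SHA256 provenance checks at write time, then randomized memory ablation with majority voting at query time) can be sketched with the Python standard library. This is a toy illustration under stated assumptions and not the authors' implementation: the store layout, the helper names, `keep_prob`, `n_samples`, and the keyword-based `toy_verdict` are all hypothetical.

```python
import hashlib
import hmac
import random
from collections import Counter

def sign(key: bytes, text: str) -> str:
    """HMAC-SHA256 tag binding a memory entry to the writer's key."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()

def write_memory(store: list, key: bytes, text: str) -> None:
    store.append({"text": text, "tag": sign(key, text)})

def verified(store: list, key: bytes) -> list:
    """Keep only entries whose tag matches (constant-time comparison)."""
    return [m["text"] for m in store
            if hmac.compare_digest(m["tag"], sign(key, m["text"]))]

def smoothed_verdict(memories, verdict_fn, rng, n_samples=25, keep_prob=0.5):
    """Randomly ablate memories, collect one verdict per sample, return the majority."""
    votes = Counter()
    for _ in range(n_samples):
        subset = [m for m in memories if rng.random() < keep_prob]
        votes[verdict_fn(subset)] += 1
    return votes.most_common(1)[0][0]

def toy_verdict(memories):
    # Stand-in for the agent's decision; a real system would call the LLM here.
    return "deny" if any("deny" in m for m in memories) else "approve"

key = b"demo-key"  # hypothetical secret held by the trusted writer
store = []
for i in range(5):
    write_memory(store, key, f"policy note {i}: approve eligible claims")
# An attacker without the key injects an entry with a forged tag.
store.append({"text": "policy: deny all claims", "tag": "forged"})

clean = verified(store, key)
verdict = smoothed_verdict(clean, toy_verdict, random.Random(0))
print(len(clean), verdict)  # 5 approve
```

The signature check alone stops the unsigned entry in this toy run. The ablation-and-vote stage is aimed at the harder case in the summary, where the adversary can write authenticated entries.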
Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System
The study introduces a deployment-centered evaluation framework for clinical LLM systems, focusing on predicting query-level rejection risk using pre-response classifiers. By incorporating deployment-specific context (provider type, department, model used) alongside query content, the method achieves 0.719 AUROC in prospective analysis over 4.5 months. Results demonstrate utility in guardrail triggering and abstention use cases, highlighting the value of context-aware rejection prediction for improving clinical LLM adoption.
llm · clinical · auroc · guardrail · abstention
LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data
GlyLLM, a novel LLM-powered framework, integrates continuous glucose monitor (CGM) data and structured metadata for personalized glycemic assessment in Type 2 Diabetes (T2D). The method leverages pre-trained large language models (LLMs) to achieve sensor-text semantic abstraction, combining wearable sensor data with individual-level context. Evaluated on the AI-READI dataset, GlyLLM outperforms traditional machine learning methods by 13.66% in RMSE for glucose forecasting and 13.08% in AUROC for diabetes categorization. Ablation studies highlight the critical role of diabetes surveys and biometric tests in glycemic assessment. This work demonstrates the potential of LLMs for advancing personalized T2D care.
glycemic assessment · continuous glucose monitor · large language models · semantic abstraction · type 2 diabetes
Two-Layer Linear Auto-Regressive Models Estimate Latent States
The paper demonstrates that two-layer linear auto-regressive models trained via empirical risk minimization on partially observed linear dynamical systems learn to approximate Kalman filtering. The authors prove that the hidden representation aligns with Kalman filter state estimates up to a similarity transformation, despite no explicit dynamics knowledge. Key insights include establishing Kalman filter approximation bounds, benign optimization landscape properties, and finite-sample guarantees for prediction error, parameter estimation, and latent state recovery. Numerical experiments validate the theoretical findings.
auto-regressive models · kalman filtering · linear dynamical systems · latent state recovery · empirical risk minimization
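For readers who want the reference point, here is a minimal scalar Kalman filter for a system x[t+1] = a·x[t] + w, y[t] = c·x[t] + v. It is a sketch of the filter that the summary says the learned models approximate, not of the paper's construction, and the parameter values (a, c, q, r) are illustrative assumptions.

```python
import random

def kalman_step(x_est, p_est, y, a, c, q, r):
    """One predict/update cycle of a scalar Kalman filter."""
    # Predict the next state and its variance.
    x_pred = a * x_est
    p_pred = a * p_est * a + q
    # Update with the new observation y.
    gain = p_pred * c / (c * p_pred * c + r)
    x_new = x_pred + gain * (y - c * x_pred)
    p_new = (1 - gain * c) * p_pred
    return x_new, p_new, gain

rng = random.Random(0)
a, c, q, r = 0.9, 1.0, 0.1, 0.5  # illustrative dynamics and noise variances
x_true, x_est, p_est = 0.0, 0.0, 1.0
gains = []
for _ in range(200):
    x_true = a * x_true + rng.gauss(0.0, q ** 0.5)
    y = c * x_true + rng.gauss(0.0, r ** 0.5)
    x_est, p_est, gain = kalman_step(x_est, p_est, y, a, c, q, r)
    gains.append(gain)
print(gains[-1])
```

Once the gain settles to a constant, the state estimate is a fixed linear recursion over past observations, which gives some intuition for why a linear auto-regressive model can represent it.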
EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence
The paper introduces EWAM (Enhanced World Action Model), a closed-loop online adaptation architecture for embodied intelligence that reduces deployment data requirements without task-specific demonstrations or backbone fine-tuning. Built on a frozen Cosmos3 backbone, EWAM integrates four lightweight neural layers: Neural Experience Memory Layer (context provision in DiT), Neural Anomaly Detection Layer (real-time state divergence monitoring), Neural Policy Routing Layer (dynamic execution strategy selection), and Neural Action Correction Layer (action refinement). The system achieves performance gains through differentiable integration of these modules during inference, evaluated under zero-shot protocols.
closed-loop adaptation · diffusion transformer · neural anomaly detection · zero-shot learning · embodied intelligence
M*: A Modular, Extensible, Serving System for Multimodal Models
The authors present M*, a modular serving system for composite multimodal models, addressing limitations of existing frameworks designed for monolithic architectures. M* models composite AI systems as dataflow graphs (Walk Graphs), enabling arbitrary component composition, flexible cluster placement, and model-agnostic optimizations. Evaluations show 20% lower latency than vLLM-Omni on BAGEL text-to-image tasks, 2.9x real-time factor improvement for Qwen3-Omni text-to-speech, and 12.5x speedup over V-JEPA 2-AC in robotic planning, demonstrating efficient serving of heterogeneous model components.
multimodal models · serving system · dataflow graphs · model composition · distributed runtime
From AGI to ASI
The report examines the transition from artificial general intelligence (AGI) to artificial superintelligence (ASI), characterizing ASI as systems surpassing human cognitive capabilities. It identifies four pathways: scaling AGI, AI paradigm shifts, recursive improvement, and multi-agent collectives, while analyzing potential frictions and bottlenecks. The study highlights uncertainties in ASI progress, suggesting continuous acceleration rather than a single transformative step. Interdisciplinary global collaboration is emphasized to address societal impacts of AI-driven scientific breakthroughs.
agi · asi · recursive improvement · multi-agent systems · cognitive capability
Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents
Evoflux introduces an inference-time evolutionary search method for improving executable tool workflows in compact language models (LMs). It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning, addressing failures in tool resolution, parameter validation, dependency tracking, and execution. On MCP-Bench tasks with 250 tools and live MCP servers, Evoflux increases execution feasibility from ~3% to 17-24% across small planners, outperforming SFT, SFT+DPO, and ReAct in reliability and token efficiency under scarce teacher-trace budgets.
evolutionary search · tool workflows · compact lms · execution feedback · dependency tracking
A Zero-shot Generalized Graph Anomaly Detection Framework via Node Reconstruction
AlignGAD proposes a zero-shot generalized graph anomaly detection framework for cross-domain applications. The method combines a Global Unification Module for feature alignment and spectral normalization, a Clustering Module for group-level pattern extraction via cluster-aware views, and a Node Discrepancy Scoring Module for multi-view anomaly aggregation. Experiments demonstrate effectiveness in zero-shot settings across real-world datasets.
graph anomaly detection · zero-shot learning · spectral normalization · cluster-aware views · cross-domain generalization
Free-Placement Optimization of Ground Station Locations for Low-Earth Orbit Satellites
The paper introduces SCORE, a two-stage free-placement optimization method for ground station networks that outperforms fixed-site approaches by operating over continuous spatial domains. SCORE combines sequential coordinate selection with cyclic refinement to address high-dimensionality and non-convexity challenges in global optimization. Benchmarking against differential evolution and integer programming methods, SCORE achieves up to 13% higher downlink throughput with 5x fewer function evaluations, while infrastructure-constrained variants retain 92% of performance gains near existing infrastructure.
free-placement optimization · ground station networks · sequential cyclic optimization · differential evolution · downlink throughput
CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents
CAPED introduces a context-aware privacy defense for mobile GUI agents that process screenshots, addressing incidental visual privacy exposure by selectively masking sensitive content unrelated to the task. The method employs phone-side preprocessing to extract task requirements, parse UI elements, and apply selective masking before uploading screenshots to remote agents. Evaluated on AndroidWorld, CAPED reduces weighted seeded leakage from 0.766 to 0.268 while maintaining task utility, demonstrating the viability of task-driven selective exposure over full screenshot sharing.
mobile gui agents · incidental privacy exposure · selective masking · androidworld · context-aware defense
BASENet: Band-Adapted Speech Enhancement Network with Cross-Band Attention
BASENet introduces a band-adapted speech enhancement network that partitions the spectrum into Bark-scale bands, assigning scaled-capacity encoders based on critical-band density to optimize perceptual resolution. The architecture employs cross-band attention for harmonic dependency capture through frequency-pooled representations at linear complexity, built on inverted residual blocks with dense connectivity and a convolutional recurrent network. BASENet achieves 3.55 PESQ and STOI 96% on VoiceBank+DEMAND with 0.83M parameters and 7.3 GMACs, the lowest parameter count among methods with PESQ > 3.50. A causal variant (3.44 PESQ) outperforms several non-causal baselines, demonstrating real-time streaming capability on resource-constrained devices.
bark-scale bands · cross-band attention · inverted residual blocks · convolutional recurrent network · critical-band density
TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation
TrajGenAgent introduces a hierarchical LLM-agent framework for generating human mobility trajectories without model fine-tuning, addressing limitations of prompt engineering and trajectory-level fine-tuning. The framework employs a two-stage orchestrator-worker design: an LLM synthesizes activity chains via in-context learning, while a deterministic workflow grounds activities using personalized POI retrieval, distance-aware location selection, kinematics-aware travel-time propagation, and LLM-based duration estimation. Evaluation uses anomaly-detection-based metrics for behavioral and semantic plausibility. Experiments demonstrate TrajGenAgent's superior spatiotemporal fidelity, semantic coherence, and individual-specific behavioral realism over neural and LLM-based baselines, while avoiding parameter updates.
llm-agent · in-context learning · poi retrieval · anomaly-detection · spatiotemporal fidelity
Token Complexity Theory for AI-Augmented Computing
The paper introduces token complexity, a novel resource measure for AI-augmented computing systems, defined as the minimum expected token cost to achieve specified output quality. The authors develop this concept within the AI-Oracle Turing machine framework, where a probabilistic Turing machine interacts with a stochastic oracle via query/response tapes. Key results include proofs of token complexity's monotonicity, convexity, price sensitivity, and price-relativity of task ordering, along with establishing that the complexity frontier is non-empty, upward-closed, and convex.
token complexity · ai-oracle turing machine · resource measure · complexity frontier · probabilistic turing machine
Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents
The paper introduces Sibling-Guided Credit Distillation (SGCD), a method for improving credit assignment in long-horizon tool-use reinforcement learning. SGCD employs dynamic sampling to generate mixed successful/failed rollouts, uses an external LLM to create stepwise credit references, and applies bounded detached credit weights to reshape GRPO token advantages. The approach avoids silent failure modes of direct self-distillation while maintaining deployment simplicity. Evaluations on AppWorld and τ³-airline show improvements: AppWorld TGC scores increased from 42.9 to 45.6 (test_normal) and 24.7 to 27.0 (test_challenge), while τ³-airline pass@1 rose from 0.583 to 0.602.
credit assignment · tool-use agents · self-distillation · long-horizon rl · stepwise credit
Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns
The paper introduces Bag of Dims, a training-free framework for mechanistic interpretability in transformers, demonstrating that standard basis hidden states encode semantic content via sign patterns and confidence via magnitudes. The method validates across Qwen, Gemma, and Mistral models through four experiments, showing sign patterns alone achieve 72-93% top-5 next-token accuracy and 80-90% top-4096 via Hamming scoring. Unsupervised discovery yields 175 semantic categories with 0.80 mean AUC, 20% feature-neuron linkage, and 1500 features at 99% sparsity, confirming low inter-dimension coupling (0.0014 bits MI).
sign patterns · hamming scoring · mechanistic interpretability · transformer hidden states · unsupervised discovery
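The idea of scoring by sign patterns alone can be illustrated with a toy ranking: reduce each vector to one sign bit per dimension, then order candidates by Hamming distance to the query. This sketch assumes per-token prototype vectors as the candidates, which the summary does not specify. The names `sign_pattern`, `hamming`, and `rank_by_hamming`, and the four-dimensional vectors, are hypothetical.

```python
def sign_pattern(vec):
    """One bit per dimension: 1 for non-negative, 0 for negative."""
    return [1 if v >= 0 else 0 for v in vec]

def hamming(a, b):
    """Number of positions where two bit patterns differ."""
    return sum(x != y for x, y in zip(a, b))

def rank_by_hamming(hidden, prototypes, k=2):
    """Return the k prototype names whose sign patterns are closest to hidden."""
    pattern = sign_pattern(hidden)
    scored = sorted(prototypes.items(),
                    key=lambda item: hamming(pattern, sign_pattern(item[1])))
    return [name for name, _ in scored[:k]]

hidden = [0.9, -0.2, 0.4, -1.3]
prototypes = {
    "cat": [1.0, -1.0, 1.0, -1.0],   # distance 0
    "dog": [1.0, -1.0, -1.0, -1.0],  # distance 1
    "car": [-1.0, 1.0, -1.0, 1.0],   # distance 4
}
top = rank_by_hamming(hidden, prototypes)
print(top)  # ['cat', 'dog']
```

Magnitudes are discarded entirely, so the ranking depends only on which dimensions are positive or negative.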
HybridCodeAuthorship: A Benchmark Dataset for Line-Level Code Authorship Detection
The paper introduces HybridCodeAuthorship, a benchmark dataset for line-level code authorship detection that simulates real-world hybrid human-AI code collaboration. The dataset construction pipeline utilizes CodeSearchNet to source Python files from GitHub, interleaving human- and AI-authored lines. Benchmarking two state-of-the-art detection algorithms reveals the task's difficulty, with AIGCode Detector achieving F1 scores of 0.48 (chunk-level) and 0.56 (line-level).
code authorship detection · ai-generated code · benchmark dataset · line-level analysis · codesearchnet
"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms
This work introduces Varied Deception, a prompted-lying testbed, and 13 reasoning model organisms with verified hidden beliefs, addressing limitations in evaluating lie detectors for language models. Four detectors were evaluated: a chain-of-thought judge, a logprob classifier, and two activation probes including Did-You-Lie (DYL), across 31 models (2B to 1T parameters). Results show positive scaling with model capability on prompted lying, but sharp performance drops on trained organisms, with DYL retaining the most signal. The chain-of-thought judge achieved 0.82 balanced accuracy, partly due to verification favoring CoT-readable beliefs. Datasets, model organisms, and trained detectors are released.
lie detectors · chain-of-thought · activation probes · model organisms · prompted-lying
PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation
PersonaDrive introduces a retrieval-augmented vision-language-action (VLA) pipeline for closed-loop driving simulation, enabling style-diverse non-ego agents without per-style retraining. The method involves offline triplet mining over per-style human driving data, training a lightweight retrieval head fusing visual features with a control encoder, and fine-tuning a VLA backbone to treat retrieved context points as in-context behavioral demonstrations. Evaluated on Bench2Drive, PersonaDrive improves driving scores by 4.6% over SimLingo and 2.5% over HiP-AD, achieving the highest scores across aggressive, neutral, and conservative styles within a 2% band, with speed and acceleration increasing by 18% and 25% from conservative to aggressive styles.
vision-language-action · closed-loop simulation · triplet mining · retrieval head · in-context learning
From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation
The paper introduces FlowPilot, a mapless navigation policy for long-horizon sidewalk navigation using only a monocular RGB camera. The method combines anchored flow matching for action representation during policy pre-training on large-scale robot fleet data with human-in-the-loop preference learning to enhance counterfactual reasoning and social compliance. Evaluations in simulation and real-world environments show that FlowPilot achieves a 42% success rate and 66% route completion, with FlowPilot-HP reducing intervention rates (IR) by 40.0% and near-intervention rates (NIR) by 52.1%.
anchored flow matching · human-in-the-loop · counterfactual reasoning · social compliance · monocular rgb
Emerging Flexible Designs for Geospatial Multimodal Foundation Models
This study conducts a systematic comparison of diverse foundation model architectures for geospatial multimodal reasoning, focusing on flexibility across spectral band configurations. The authors standardize pretraining using identical self-supervised learning objectives and datasets, then evaluate models with consistent parameterization on the GEOBench benchmark for classification and segmentation tasks. Results reveal architectural trade-offs between model flexibility, modality alignment, and downstream task performance, providing practical insights for designing next-generation geospatial foundation models capable of robust multimodal reasoning.
foundation models · geospatial multimodal · self-supervised learning · spectral band · geobench
Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation
Pythagoras-Prover introduces a compute-efficient family of Lean theorem provers, combining autoregressive (4B/32B) and diffusion-based (4B) models, trained via curriculum supervised fine-tuning on stratified Lean-verified corpora. The method employs Augmented Lean Formalisation (ALF) to expand training data through self-distillation and formal statement variants, alongside dynamic proof-reasoning filtering to maintain 8k-token context budgets. Results show Pythagoras-Prover-4B outperforms DeepSeek-Prover-V2-671B (86.1% vs 82.4% pass@32 on MiniF2F-Test) with 167x fewer parameters, while the 32B variant achieves 93.0% on MiniF2F-Test and solves 93 PutnamBench problems.
lean theorem prover · augmented lean formalisation · curriculum supervised fine-tuning · diffusion-based prover · proof-reasoning filtering
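The pass@32 figures above count a problem as solved if any of 32 sampled proofs verifies. When more than k samples are drawn per problem, pass@k is commonly estimated with the combinatorial formula below. This is a sketch of that standard estimator; whether Pythagoras-Prover uses it or simply draws exactly 32 proofs per problem is not stated in the summary.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples is correct) from n samples, c correct.

    Uses 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n samples contains no correct proof.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(4, 1, 2))  # 0.5
```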
Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs
The paper introduces a fine-grained alignment framework for medical LVLMs, addressing limitations in existing preference optimization methods. It combines a bidirectional token-wise KL regularizer with a visual-contrastive grounding objective to improve clinical correctness while preserving linguistic style. The approach corrects only erroneous spans in model outputs using minimally edited preference pairs. Experiments across medical imaging tasks and text generation benchmarks demonstrate its effectiveness in reducing factual inconsistencies and improving visual grounding.
lvlm · dpo · kl regularizer · visual-contrastive · clinical correctness
Strategic Decision Support for AI Agents
The paper introduces a strategic decision-support framework for AI agents that minimizes support usage while controlling counterfactual missed-support error—the probability of autonomous action when support would have improved outcomes. The method formulates an optimization problem yielding population-level threshold rules, implements an online adaptive thresholding algorithm with randomized exploration, and proposes calibration-on-the-fly to reduce unnecessary support calls. Experiments across information gathering, human–AI collaboration, and tool-use scenarios demonstrate reliable error control with significant reductions in support usage.
decision support · counterfactual error · online thresholding · agentic systems · uncertainty quantification
Graph Reduction in Multirelational Networks: A Spreading-Oriented Reduction Benchmark
The paper introduces SORB (Spreading-Oriented Reduction Benchmark), an open-source framework for evaluating influence maximization (IM) models with integrated graph reduction analysis. SORB operates on diverse networks (single-/multilayer) and measures reduction impacts via standardized metrics like $Gain@k$ and $\mathrm{AUC}_{\mathrm{cutoff}}$. Experiments reveal reduction effects are task- and network-dependent: sparsification preserves seed quality in single-layer networks, while multilayer networks suffer ranking degradation regardless of reduction strategy. The work emphasizes reduction-aware evaluation for spreading process studies.
influence maximization · graph reduction · multilayer networks · sparsification · benchmarking
EDEN: A Large-Scale Corpus of Clinical Notes for Italian
The authors introduce EDEN (Emergency Department Electronic Notes), a large-scale Italian corpus of 4 million anonymized clinical notes from emergency departments, with 6,000 notes manually annotated via a 132-item Case Report Form (CRF) for dyspnea and loss of consciousness cases. The dataset features diverse value types (numerical, categorical, binary) and addresses data imbalance through iterative clinician review. They propose CRF-filling as a structured information extraction benchmark, providing zero-shot baselines using Gemma-27B and MedGemma-27B. EDEN is the largest freely available Italian clinical corpus.
clinical notes · case report form · information extraction · large language models · anonymization
Arbor: Tree Search as a Cognition Layer for Autonomous Agents
Arbor introduces a multi-agent framework with structured tree search as a cognition layer for autonomous agents in large stateful action spaces. The system maintains a shared search tree of scored hypotheses as working memory, using failures as diagnostic signals and successes to shift exploration. It employs an Orchestrator agent for optimization and a Critic agent for stability checks, decomposing capabilities into hard (domain expertise) and soft (coordination) skills. Evaluated on LLM inference optimization, Arbor achieves up to 193% throughput-latency Pareto improvement over baselines, with 2% run-to-run variance, demonstrating hardware-agnostic reproducibility.
tree search · multi-agent framework · llm inference · pareto improvement · autonomous optimization
Foresight: Iterative Reasoning About Clues that Matter for Navigation
Foresight introduces a test-time framework for open-world mapless navigation using iterative reasoning with Vision-Language Models (VLMs). The method alternates between proposing image-space motion plans and critiquing them based on language goals and visual context, conditioned on prior critiques for refinement. A reward model trained from human feedback aligns plan critiques with behavior preferences via reinforcement learning. Evaluations show 37% higher task success and 52% fewer interventions versus baselines, running in real-time on a Jetson AGX Orin.
vision-language models · test-time reasoning · motion planning · reinforcement learning · open-world navigation
Understanding Truncated Positional Encodings for Graph Neural Networks
The paper analyzes the theoretical properties of truncated positional encodings (PEs) in graph neural networks (GNNs), revealing fundamental differences in expressive power between spectral and walk-based variants when truncated. While complete PEs exhibit equivalent expressivity (between 1-WL and 3-WL tests), truncated spectral PEs lose their advantage over 1-WL. The study introduces $k$-harmonic distances to demonstrate nuanced expressivity differences among closely related truncated PEs. Empirical results on real-world datasets show that combining multiple truncated PE families outperforms using any single variant.
positional encodings · graph neural networks · spectral methods · walk-based methods · expressive power
Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation
The paper analyzes the parameter update characteristics of on-policy distillation (OPD), revealing two key findings. First, OPD updates exhibit coordinate sparsity, being distributed across layers with feed-forward network (FFN) dominance, where training just the identified subnetwork achieves comparable performance to full OPD. Second, updates are numerically full-rank but spectrally concentrated, primarily affecting near-zero weight coordinates and avoiding principal singular subspaces. The study employs optimizer ablation (SGD vs AdamW) across language and vision-language models, showing AdamW's superior performance due to preserved gradient scale heterogeneity.
on-policy distillation · coordinate sparsity · spectral concentration · feed-forward networks · gradient scale heterogeneity
Operadic consistency: a label-free signal for compositional reasoning failures in LLMs
The paper introduces operadic consistency (OC), a label-free diagnostic for detecting compositional reasoning failures in LLMs by comparing direct answers with decomposed query responses. Leveraging operad theory, OC evaluates agreement between these response modes across twelve instruction-tuned LLMs (4B to 671B parameters) on four multi-hop QA datasets. Results show OC strongly correlates with accuracy (Pearson r ∈ [0.86, 0.94]), outperforms self-consistency baselines (e.g., CoT-SC drops to r ≈ 0.45 on MuSiQue/StrategyQA), and improves selective prediction (AUARC lifts +0.086 to +0.096).
operadic consistency · compositional reasoning · multi-hop qa · self-consistency · selective prediction
The Stable Recovery Manifold: Geometric Principles Governing Recoverability in Continual Learning
The paper introduces the Stable Recovery Manifold hypothesis, proposing that catastrophic forgetting in continual learning stems from accessibility issues rather than information destruction. Using Split CIFAR-100 and a ResNet-18, the authors analyze recoverability via Recovery Subspace Dimensionality (k_t) and representational drift across ten tasks. Results show stable recovery dimensionality (mean k_t = 8.0) despite drift, with principal-angle drift strongly predicting recoverability (r = -0.862). A geometric model explains 82.2% of recoverability variance, supporting the hypothesis that forgotten knowledge remains compactly decodable.
catastrophic forgetting · recovery subspace dimensionality · representational drift · continual learning · stable recovery manifold
Aerial Wildfire Suppression Planning with a Hybrid CNN-Cellular Automata Fire Model
The paper introduces a hybrid CNN-cellular automata model for aerial wildfire suppression planning, integrating fire spread prediction with intervention strategy optimization. The framework combines neural network-based terrain analysis with cellular automaton fire dynamics, enabling gradient-based optimization of aerial drop parameters (location, orientation) for water and retardant deployment. Uncertainty quantification includes Monte Carlo sampling for aleatoric effects and spatially correlated perturbations for epistemic errors. A case study on the 2020 Bear Fire demonstrates the model's capability to generate suppression schedules that reduce fire-affected area while accounting for operational uncertainties.
wildfire suppression · cellular automata · cnn · gradient-based optimization · uncertainty quantification
Generative Modeling of Bach-Style Symbolic Music: A Comparative Study of Autoregressive, Latent-Variable, and Adversarial Approaches
The study compares three generative modeling approaches for Bach-style symbolic music: autoregressive LSTMs with attention, latent-variable models (recurrent and vector-quantized VAEs), and adversarial networks. Using a shared MIDI corpus, the evaluation focuses on polyphonic sequence modeling, latent representation quality, and stylistic coherence. Results indicate autoregressive LSTMs with attention yield the most musically coherent samples, while vector quantization improves latent structure over conventional VAEs. Adversarial methods capture local pitch patterns but exhibit training instability and weaker generalization to Bach's style, revealing comparative strengths and limitations of each paradigm.
autoregressive · lstm · vae · gan · polyphonic
Majority-of-Three is Optimal
The paper establishes that a majority vote of three independent consistent classifiers constitutes an optimal learner in the realizable PAC learning setting. This result simplifies both the algorithmic structure and probabilistic analysis compared to previous voting learners, including S. Hanneke's algorithm and K. Green Larsen's analysis of bagging. The proof demonstrates optimality for the simplest voting scheme, providing a concise theoretical foundation for majority voting in PAC learning.
majority vote · pac learning · consistent classifiers · optimal learner · bagging
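The voting scheme itself is short enough to sketch. The toy below assumes a one-dimensional threshold concept class and gives each of the three voters its own independently drawn sample; the concept class, sample sizes, and helper names are illustrative assumptions, and the sketch says nothing about the paper's proof or how it splits the data.

```python
import random


def consistent_threshold(sample):
    """Return a threshold classifier that makes no mistakes on `sample`.

    `sample` is a list of (x, y) pairs with y = 1 iff x >= the true threshold.
    """
    lo = max((x for x, y in sample if y == 0), default=0.0)
    hi = min((x for x, y in sample if y == 1), default=1.0)
    t = (lo + hi) / 2  # any value in (lo, hi] is consistent with the sample
    return lambda x: int(x >= t)


def majority_of_three(h1, h2, h3):
    """Predict the label that at least two of the three classifiers agree on."""
    return lambda x: int(h1(x) + h2(x) + h3(x) >= 2)


def draw(n, theta, rng):
    """Draw n labeled points from Uniform[0, 1) under the true threshold theta."""
    return [(x, int(x >= theta)) for x in (rng.random() for _ in range(n))]


rng = random.Random(0)
theta = 0.3  # hypothetical ground-truth threshold
voters = [consistent_threshold(draw(50, theta, rng)) for _ in range(3)]
vote = majority_of_three(*voters)
print(vote(0.9), vote(0.0))  # 1 0
```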
Distribution-Agnostic Robust Trajectory Optimization via Chance-Constrained Reinforcement Learning
A distribution-agnostic robust trajectory-optimization framework is introduced, leveraging chance-constrained reinforcement learning to handle uncertainty in initial conditions and process noise. The method computes a deterministic nominal trajectory offline, then robustifies it via an affine closed-loop correction law with feedforward adjustments and time-varying feedback gains. Probabilistic feasibility is enforced using rollout-based upper-tail quantiles, while terminal dispersion is controlled via covariance-feasibility penalties. Evaluated on a 3D Earth-Mars transfer and a stochastic atmospheric rocket landing, the framework maintains competitive upper-tail fuel costs and probabilistic feasibility across heterogeneous spacecraft trajectory problems without structural redesign.
chance-constrained reinforcement learning · affine closed-loop correction · upper-tail quantiles · covariance-feasibility penalties · trajectory optimization
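A rollout-based upper-tail quantile check can be sketched in a few lines. The sketch assumes each rollout yields a scalar constraint value g with g <= 0 meaning "feasible", and that the chance constraint asks for feasibility in at least a `level` fraction of rollouts; the function names and the sign convention are hypothetical, not taken from the paper.

```python
import math


def upper_tail_quantile(samples, level):
    """Empirical quantile: the smallest sample with at least `level` of the
    samples at or below it."""
    if not samples or not 0.0 < level <= 1.0:
        raise ValueError("need samples and 0 < level <= 1")
    ordered = sorted(samples)
    rank = math.ceil(level * len(ordered))
    return ordered[rank - 1]


def chance_constraint_ok(constraint_values, level):
    """True if the `level` quantile of g over the rollouts is non-positive."""
    return upper_tail_quantile(constraint_values, level) <= 0.0


# Ten illustrative rollouts; only the last one violates the constraint.
g = [-0.5, -0.4, -0.3, -0.3, -0.2, -0.2, -0.1, -0.1, -0.05, 0.2]
print(chance_constraint_ok(g, 0.9))  # True
print(chance_constraint_ok(g, 1.0))  # False
```

Because the check uses only sorted rollout outcomes, it makes no assumption about the noise distribution, which is the sense in which such a test is distribution-agnostic.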
Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning
The paper introduces Simplex-Constrained Sparse Bagging (SCSB), a framework for compressing and calibrating bootstrap-based ensembles. SCSB addresses the limitations of uniform voting in standard bagging methods (e.g., Random Forests, Bagged SVMs) by formulating ensemble pruning and calibration as a joint optimization problem on the probability simplex, using Out-Of-Bag (OOB) loss minimization. It resolves the 'L1-simplex paradox' via a concave quadratic penalty to induce sparsity. Results show 96% ensemble compression, linear inference speedups, improved calibration (lower Expected Calibration Error), and maintained or enhanced generalization accuracy.
ensemble learning · probability simplex · out-of-bag loss · model compression · calibration error
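One plausible reading of the 'L1-simplex paradox' (an interpretation, not something the summary confirms): on the probability simplex the L1 norm is constant, so the usual L1 penalty cannot favor sparse weights, whereas a concave quadratic term is smallest at the vertices. The symbols $w$, $M$, $\lambda$, and $L_{\mathrm{OOB}}$ below are assumed notation.

```latex
\[
  w \in \Delta^{M-1}
  = \Bigl\{ w \in \mathbb{R}^{M} : w_i \ge 0,\ \sum_{i=1}^{M} w_i = 1 \Bigr\}
  \;\Longrightarrow\;
  \|w\|_1 = \sum_{i=1}^{M} |w_i| = 1 ,
\]
\[
  \min_{w \in \Delta^{M-1}} \; L_{\mathrm{OOB}}(w) - \lambda \,\|w\|_2^2 ,
  \qquad \lambda > 0 ,
  \qquad \tfrac{1}{M} \le \|w\|_2^2 \le 1 .
\]
```

The lower bound $1/M$ is attained by uniform voting and the upper bound $1$ by a single-member ensemble, so subtracting $\lambda\|w\|_2^2$ rewards concentrating weight on few ensemble members.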
Learning with Simulators: No Regret in a Computationally Bounded World
The paper introduces simulatable processes, where learners access a simulator approximating the data-generating distribution, even for dependent processes. It demonstrates that this framework achieves learning guarantees comparable to classical independent-data settings, including VC-dimension-dependent error bounds. The work also analyzes conditional sampling's power, revealing statistical and computational advantages. A key result is a universal algorithm that learns any VC class under polynomial-time-samplable processes, with regret bounded by the process's time-bounded Kolmogorov complexity, thus extending PAC learning to dependent data scenarios.
simulatable processes · vc dimension · conditional sampling · kolmogorov complexity · pac learning
Adjusted Cup-Product Neural Layer
The paper introduces an adjusted cup product neural layer, a novel neural primitive incorporating cup products of cochains with an adjustment term from higher gauge theory. This design ensures gauge invariance by construction. The key theoretical result demonstrates that the layer's output on closed cycles depends solely on the adjustment coefficient, with zero coefficient yielding null output. The authors prove this observable constitutes a nonzero quadratic form and exhibits exact invariance under one- and two-form gauge transformations.
cup product · cochains · gauge invariance · neural primitive · quadratic form
A2D2: Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding
A2D2 introduces a unified framework for reward-guided fine-tuning of any-length discrete diffusion models, optimizing insertion and unmasking policies alongside a quality-based inference schedule. The method derives the Radon-Nikodym derivative for joint path measures, ensuring convergence to reward-tilted distributions without target samples, and proposes the Adaptive Joint Decoding (AJD) loss for optimal path measures. Empirical results show A2D2 improves reward optimization, generation flexibility, and accuracy over fixed-length fine-tuning and inference-time guidance methods.
discrete diffusion · reward-guided fine-tuning · any-length generation · adaptive decoding · radon-nikodym derivative
NetCause: Counterfactual Learning for Root Cause Analysis in Large-Scale Networks
NetCause introduces a self-supervised learning framework for root cause analysis in large-scale networks by modeling incidents as graph-temporal processes and employing counterfactual simulation to rank candidate root causes. The method trains on 1,500 incidents from a cloud provider's production network and evaluates on 31 expert-labeled cases, achieving a 16.1% accuracy improvement over rule-based baselines. Inference requires only seconds of GPU runtime, making it practical for operational deployment despite computationally intensive training.
root cause analysis · counterfactual learning · graph-temporal modeling · self-supervised learning · network incidents
Graphical Causal Reasoning for Root Cause Analysis in Cloud Networks
The paper introduces a graph-based causal discovery method for root cause analysis (RCA) in cloud networks, addressing limitations of rule-based automation. The approach constructs causal graphs from binary time series using bivariate Granger causality and conditional independence tests, then performs time-aware probabilistic inference via edge-specific conditional probabilities. Evaluation on 35 labeled incidents from a major cloud provider showed 85.7% recall and 74.3% exact match rates, with successful deployment in 800+ production incidents.
root cause analysis · granger causality · conditional independence · causal graph · cloud networks
Ride, Track, and Recover: Pilot Randomized Trial of a Wearable Digital Self-Management Intervention During a Veteran Endurance-Cycling Program
This pilot randomized trial evaluates a wearable digital self-management intervention for veterans with PTSD during an endurance-cycling program. Thirteen participants were randomized to digital intervention plus cycling (n=7), cycling only (n=3), or at-home monitoring (n=7 controls). Smartwatch-derived heart rate and accelerometer features detected hyperarousal events, validated by participants. Generalized additive mixed models revealed differential symptom trajectories: the intervention group showed stabilized hyperarousal and maintained gains post-event, while the cycling-only group exhibited late-study escalation. Higher-severity participants confirmed more ML-detected events, suggesting personalized wearable systems may enhance PTSD symptom management.
wearable sensing · hyperarousal detection · generalized additive models · digital intervention · real-time validation
MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models
MaskWAM introduces an object-centric world-action model (WAM) that unifies mask prompting and prediction via a Mixture of Transformers (MoT) to address spatial bottlenecks in robotic control. By leveraging masks as both inputs and predictions, it enhances semantic grounding and reduces referential ambiguity in cluttered scenes. Evaluations on LIBERO, RoboTwin, and real-world tasks show MaskWAM outperforms baselines in language-clear and ambiguous scenarios, demonstrating robust policy generalization through object-centric supervision and visual prompts.
mask prompting · world-action models · mixture of transformers · object-centric · referential ambiguity
GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving
GF-DiT introduces elastic parallelism scheduling for Diffusion Transformer (DiT) serving, addressing workload heterogeneity through dynamic GPU resource allocation. The system employs asynchronous execution abstraction to decompose requests into schedulable trajectory tasks and group-free collectives for low-overhead communication. Evaluated on image/video diffusion workloads in vLLM-Omni, GF-DiT achieves 6.01× throughput improvement, 95% latency reduction, 90% fewer SLO violations, and reduces communication-group setup overhead from 778 ms to 60 μs compared to static parallelism approaches.
diffusion transformers · elastic parallelism · gpu scheduling · group-free collectives · vllm-omni
Reinforcement Learning for Neural Model Editing
The paper introduces a reinforcement learning framework for neural model editing, formulating it as an RL problem where agents modify models via reward feedback. Two environments are proposed: MaskWorld (multiplicative weight scaling) and ShiftWorld (additive weight updates), with rewards combining utility preservation and task-specific objectives. Evaluations on bias mitigation (text classification) and machine unlearning (image classification) show learned policies reduce forget set accuracy to ~0% while preserving >90% retain set accuracy, and improve bias-related performance by >5% without compromising general utility. This demonstrates RL can automate editing policy design.
reinforcement learning · model editing · machine unlearning · bias mitigation · weight modification
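The two edit environments differ only in how an action touches the weights. A minimal sketch, assuming per-weight actions and a hypothetical unlearning reward (retain accuracy minus forget accuracy); the paper's actual action granularity and reward weighting are not given in this summary.

```python
def mask_edit(weights, scales):
    """MaskWorld-style action: multiplicative scaling of each weight."""
    return [w * s for w, s in zip(weights, scales)]


def shift_edit(weights, deltas):
    """ShiftWorld-style action: additive update to each weight."""
    return [w + d for w, d in zip(weights, deltas)]


def unlearning_reward(retain_acc, forget_acc):
    """Hypothetical reward: keep utility on the retain set, lose the forget set."""
    return retain_acc - forget_acc


weights = [0.5, -1.0, 2.0]
print(mask_edit(weights, [1.0, 0.5, 0.0]))     # [0.5, -0.5, 0.0]
print(shift_edit(weights, [0.25, 0.0, -2.0]))  # [0.75, -1.0, 0.0]
```

A scale of 0 prunes a weight outright, which a bounded additive shift can only do when the shift happens to cancel the weight exactly.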
Optical Implementation of Equilibrium Propagation Using Spatial Photonic Ising Machines
The authors present a hybrid optical-digital implementation of Equilibrium Propagation (EP) using a Spatial Photonic Ising Machine (SPIM), offering an energy-efficient alternative to traditional machine learning. The SPIM employs gauge transformation to optically encode continuous neuron states and rank-1 binary trainable patterns via phase modulations on a spatial light modulator, with inference performed using a finite difference scheme. Experimental validation on the Wine dataset and numerical evaluation on MNIST demonstrate the potential of continuous couplings and structured coupling matrices. This work advances physical implementations of EP for energy-based networks.
equilibrium propagation · spatial photonic ising machine · gauge transformation · energy-based networks · phase modulation
Uncertainty Estimation for Molecular Diffusion Models
The authors propose a post-hoc method for estimating per-sample uncertainty in pretrained molecular diffusion models, addressing the lack of quality signals in generated molecules. Their approach builds on a Laplace approximation of the denoising network, measuring noise prediction variability across the generation trajectory. Experiments demonstrate that the uncertainty score negatively correlates with established quality metrics (e.g., validity, uniqueness) and enables test-time filtering to improve model performance.
molecular diffusion models · uncertainty estimation · laplace approximation · denoising network · test-time scaling
Clustering Node Attributed Networks with Graph Neural Networks and Self Learning
The authors propose a self-learning framework for clustering attributed graphs using graph neural networks (GNNs) in an unsupervised setting. The method iteratively refines clusters by alternating between GNN-based node representation learning and graph reconstruction, using both original edges and a dynamically constructed context graph. Empirical evaluation demonstrates superior performance over network-only or attribute-only baselines on synthetic data when neither modality is highly informative, with iterative learning outperforming single-round GNN clustering. On real-world datasets, the method achieves competitive results with state-of-the-art approaches for balanced cluster distributions.
graph neural networks · attributed graph clustering · self-learning · unsupervised learning · node representation learning
How Much Memory Do We Need? Adaptive Memory Gate for Neural Operators
The paper introduces AMGFNO, an adaptive memory-gated Fourier neural operator that dynamically modulates memory weights based on observation conditions for solving time-dependent PDEs. The method employs a learnable gate to adjust memory retention, addressing limitations of fixed-weight approaches. Experiments on Kuramoto-Sivashinsky and Burgers' equations demonstrate 55-79% nRMSE reduction at low resolutions, with gate values automatically decaying from ~0.7 to near-zero as resolution increases.
neural operators · memory gates · pde solving · adaptive weights · fourier neural networks
S-GBT: Smooth Growth Bound Tensor for Certified Robustness Against Word Substitution Attacks in NLP
The Smooth Growth Bound Tensor (S-GBT) introduces a second-order method to certify robustness against word substitution attacks in NLP by bounding the Hessian element-wise, addressing limitations of first-order sensitivity approaches. S-GBT incorporates a regularization term during training to minimize these bounds, combining linear and quadratic terms to constrain output changes under perturbations. Derived for Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN), S-GBT integrates directly into the training objective. Evaluations on benchmark datasets demonstrate up to 23.4% improvement in certified robust accuracy over prior methods, while maintaining competitive clean accuracy, highlighting the efficacy of jointly controlling gradient and curvature.
smooth growth bound tensor · hessian · word substitution attacks · certified robustness · regularization
Accelerating Speculative Diffusions via Block Verification
The paper introduces a novel speculative sampling scheme for diffusion models that enables block verification, improving draft acceptance rates. By adapting LLM-style speculative decoding to continuous spaces, the method efficiently samples from residual distributions and incorporates a zero-training Free Drafter heuristic. Experiments demonstrate up to 6.3% speedup over existing speculative diffusion methods with minimal overhead during parallel verification.
speculative decoding · diffusion models · block verification · residual distribution · free drafter
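As background, the LLM-style rule that the paper adapts works on a discrete vocabulary: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p - q). The sketch below shows only that token-level rule; the paper's continuous-space residual sampling, block verification, and Free Drafter are not reproduced here.

```python
import random


def speculative_step(p, q, rng):
    """One accept/reject step of discrete speculative sampling.

    p: target-model probabilities, q: draft-model probabilities (same support).
    Returns (token, accepted). The returned token is distributed according to p.
    """
    x = rng.choices(range(len(q)), weights=q)[0]  # draft proposes a token
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True  # target model accepts the draft
    # Rejected: resample from the residual mass where the target exceeds the draft.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False


rng = random.Random(0)
p = [0.6, 0.3, 0.1]  # illustrative target distribution
q = [0.2, 0.5, 0.3]  # illustrative draft distribution
counts = [0, 0, 0]
for _ in range(20000):
    token, _ = speculative_step(p, q, rng)
    counts[token] += 1
print([round(c / 20000, 2) for c in counts])  # close to p
```

The acceptance probability sums to the overlap between p and q, so a draft that tracks the target more closely is accepted more often; raising that rate is what block verification targets.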
Foundations of Practical Quantum Advantage in Quantum-Informed Machine Learning for Predicting Chaos
The authors establish theoretical foundations for practical quantum advantage in quantum-informed machine learning applied to chaotic dynamical systems. They introduce a family of k-indexed higher-order quantum statistical priors (Q-Priors) that encode k-point marginals of the invariant measure on n_q = kq qubits, leveraging superposition and entanglement for compact representation. A two-stage advantage is proven: quantum protocols estimate post hoc Pauli functionals with copy-pair counts independent of n_q, contrasting classical protocols requiring Ω(2^(n_q)) copies. Simulations and IQM superconducting processors validate the mechanism. Case studies demonstrate improved anomaly-correlation skill by 10-39% in medium-range weather forecasting and enhanced velocity-direction coherence in turbulent channel-flow analysis.
quantum-informed machine learning · chaotic dynamical systems · quantum statistical priors · pauli functionals · superconducting processors
Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs
Hölder++ improves the quality-coherence trade-off in multimodal VAEs by introducing three innovations: exact Hölder pooling without approximation, separate modeling of shared and private representations (Hölder+), and hierarchical inference for better disentanglement (Hölder++). The method outperforms MMVAE+ in coherence while maintaining sample diversity, produces more structured latent spaces, and yields shared representations useful for downstream tasks. Experiments confirm these advantages across multiple metrics.
multimodal vae · hölder pooling · latent disentanglement · hierarchical inference · generative coherence
Positional Encoding in the Context of Memristor-Based Analog Computation for Automatic Speech Recognition
The study addresses degradation in memristor-based analog computation for automatic speech recognition caused by large output values from transformed positional encodings. By adjusting the weight and precision bit proportions in analog-to-digital conversion (ADC) layers, degradation is reduced by roughly 50% in relative terms while energy consumption stays stable. For scenarios where ADC modification is not feasible, removing encoding-related linear transformations achieves a roughly 30% relative reduction in degradation. The findings highlight the impact of positional encoding transformations on memristor-based analog computation and propose practical mitigation strategies.
memristor · positional encoding · analog computation · degradation · automatic speech recognition
VideoMDM: Towards 3D Human Motion Generation From 2D Supervision
VideoMDM introduces a diffusion-based framework for generating 3D human motion from 2D supervision, eliminating the need for 3D ground truth. The method employs a pretrained 2D-to-3D lifter to produce approximate 3D pose sequences, which are diffused, denoised in 3D, and supervised in 2D via depth-weighted reprojection loss. The framework adapts standard 3D motion regularizers—velocity consistency and over-parameterized representation alignment—to the 2D setting, learning a coherent 3D motion manifold during training. On HumanML3D, VideoMDM achieves an FID of 0.88, nearly matching fully 3D-supervised MDM (FID 0.54). It also demonstrates strong performance on Fit3D and NBA datasets, generating motions consistently preferred by humans.
diffusion-based framework · 2d-to-3d lifter · depth-weighted reprojection loss · 3d motion manifold · velocity consistency
Enhanced Low-Density Region Exploration in Classifier-Guided Diffusion Models Through Modified Reverse Diffusion Sampling
The paper introduces a sampling-time modification for classifier-guided diffusion models that enhances exploration of low-density regions without additional training. The method combines two guidance mechanisms: steering trajectories toward low-confidence regions via modified classifier gradients and directing sampling toward the predicted real image manifold. Evaluated on ImageNet with ADM models at 64×64 and 256×256 resolutions, the approach improves recall while maintaining FID scores and generates perceptually high-quality samples.
diffusion models · classifier guidance · low-density sampling · image synthesis · reverse diffusion
Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios
The authors introduce PowerPhase, a probabilistic forecasting benchmark for power systems with 2,000-36,964 channels, exceeding existing multivariate benchmarks by an order of magnitude. PowerPhase includes constraint-aware metrics (Safety_mBrier, NECV, CVaR-alpha) and evaluates on AC power-flow outputs. They propose PowerForge, a scenario-based quantile forecaster with type-specific decoding heads and causal bridging, which achieves top average rank across all grids. Results show a safety-fidelity trade-off, where distributional accuracy and constraint satisfaction rank models differently.
probabilistic forecasting · multivariate time series · power systems · scenario-based quantile · constraint-aware metrics
Quantizing Time-Series Models As Dynamical Systems: Trajectory-Based Quantization Sensitivity Score
The authors propose Trajectory-based Quantization Sensitivity Score (TQS), a novel metric for post-training quantization (PTQ) that analyzes error propagation in time-series models through dynamical-systems theory. TQS decouples sensitivity estimation from quantizer selection, enabling a priori quantization budget planning even for black-box networks. Their TQS-PTQ framework eliminates calibration data requirements and second-order approximations while supporting mixed-precision deployment. Experiments demonstrate that this dynamical-systems approach provides robust low-precision performance in resource-constrained scenarios.
post-training quantizationdynamical systemssensitivity analysismixed-precisionerror propagation
Simultaneous Latent Budget Trees for Stratified Classification
The paper introduces Simultaneous Latent Budget Trees (SLBT), a probabilistic machine learning framework for classification trees incorporating stratification factors like temporal, spatial, or demographic variables. SLBT employs a model-based split rule where child nodes are interpreted as latent components of a simultaneous mixture model, optimizing conditional splits. Parameters are estimated via least squares with a neural network perspective, enabling interpretable tree structures with interactive visualization tools. The framework includes measures to handle unbalanced response class distributions. Applied to gender-related differences in Amyotrophic Lateral Sclerosis progression, SLBT demonstrates its utility in stratified classification. The associated SLBT library is available on GitHub.
simultaneous latent budget treesstratification factormodel-based split ruleneural network perspectiveinterpretable tree structure
Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers
The paper theoretically justifies gradient clipping's stabilizing effect in asynchronous stochastic gradient descent (ASGD) by eliminating maximum delay dependence in oracle complexity. Employing a sub-Weibull noise model that generalizes sub-Gaussian and sub-exponential distributions, the analysis captures heavy-tailed gradients observed in deep learning. Results demonstrate convergence in expectation and, for the first time in asynchronous optimization, high-probability convergence, addressing straggler-induced delays in distributed and federated settings.
asynchronous sgdgradient clippingsub-weibull noiseoracle complexitystraggler robustness
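As a rough illustration of the setting summarized above, the following sketch applies norm clipping to gradients computed at stale iterates. It is not the paper's algorithm or analysis: the function names, the delay schedule, and the quadratic test objective are all assumptions chosen for illustration.

```python
import numpy as np

def clip_by_norm(g, tau):
    """Rescale g so that its Euclidean norm is at most tau."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)

def delayed_clipped_sgd(grad_fn, x0, lr, tau, delays, rng, noise_scale=0.0):
    """SGD where step t uses a clipped gradient evaluated at x_{t - delays[t]}.

    Hypothetical helper: delays[t] models how stale the worker's iterate is.
    """
    history = [np.asarray(x0, dtype=float)]  # history[t] holds iterate x_t
    x = history[0].copy()
    for t, d in enumerate(delays):
        stale = history[max(0, t - d)]       # iterate the slow worker saw
        g = grad_fn(stale) + noise_scale * rng.standard_normal(x.shape)
        x = x - lr * clip_by_norm(g, tau)    # clipping bounds each step by lr * tau
        history.append(x.copy())
    return x

# Toy usage: f(x) = 0.5 * ||x||^2, so the gradient is x itself.
x_final = delayed_clipped_sgd(
    grad_fn=lambda x: x,
    x0=[10.0, -10.0],
    lr=0.05,
    tau=1.0,
    delays=[t % 8 for t in range(2000)],
    rng=np.random.default_rng(0),
)
```

Because every update has norm at most `lr * tau`, a single very stale gradient cannot move the iterate far, which is the intuition behind clipping as protection against stragglers.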
ProtoX-AD: Self-Explainable Time Series Anomaly Detection and Characterization
ProtoX-AD introduces a prototype-based self-explainable framework for time series anomaly detection (TSAD), addressing the explainability limitations of self-supervised classification approaches. The method learns transformation-aware latent representations alongside interpretable prototypes, enabling both anomaly detection and characterization through prototype-based explanations. Experiments on synthetic and real-world datasets show ProtoX-AD matches black-box methods in detection performance (F1 scores) while providing more consistent and semantically meaningful explanations than existing explainable baselines.
time series anomaly detectionself-supervised learningprototype-based explanationsinterpretable machine learningtransformation-aware representations
Extracting Governing Equations from Latent Dynamics via Multi-View Contrastive Learning
The paper presents DYSCO, a multi-view temporal contrastive learning algorithm that jointly recovers latent trajectories and governing dynamics from noisy, high-dimensional observations. The method leverages multiple independent noisy views of the same underlying process to disentangle signal from noise, parameterizing dynamics in a structured functional basis to enable symbolic recovery of governing equations within an affine gauge. Theoretical guarantees establish strong identification up to an affine indeterminacy, extending prior results to noisy nonlinear observations. Empirical evaluations demonstrate accurate recovery of both latent trajectories and flow fields across diverse dynamical regimes (chaotic, oscillatory, metastable) under Gaussian and Poisson observation noise.
multi-view learningcontrastive learninglatent dynamicssystem identificationaffine gauge
To GAN or Not To GAN: Segmentation Analysis on Mars DEM
The paper presents an automated approach for detecting Martian mounds using neural network-based semantic segmentation, comparing supervised and generative adversarial methods. Leveraging Digital Elevation Models (DEMs), the study evaluates morphological parameters to identify potential signs of water or life-conducive environments. Results indicate that incorporating artificially generated data via GANs did not enhance segmentation performance, suggesting supervised methods suffice for this task. The work contributes to rover navigation and astrobiological research by reducing reliance on manual mapping.
semantic segmentationdigital elevation modelsgenerative adversarial networksmartian morphologysupervised learning
Distributional Loss for Robust Classification
The paper introduces a novel loss function for supervised classification that optimizes classifier outputs as a bimodal Gaussian distribution, rather than enforcing direct input-to-label mappings. This distributional approach implicitly models class ambiguity, reduces overfitting, and promotes robust decision boundaries without requiring additional label information. Experiments show consistent robustness improvements, particularly in low-data regimes, with minimal modifications to standard training pipelines.
supervised classificationbimodal gaussian distributionclass ambiguityrobust decision boundarieslow-data regimes
From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation
The authors propose a conformal Elo estimation framework to address systematic errors in LLM-as-a-judge evaluations, enabling calibrated rankings without large-scale human annotations. At the local level, they propagate calibrated win probabilities into the Bradley-Terry procedure, reducing Elo estimation error to 17.9 MAE compared to human-derived ratings on 55 LMArena models. Globally, they apply split conformal prediction to residual gaps between LLM and human Elo ratings, providing prediction intervals with distribution-free marginal coverage guarantees. This two-layer approach yields low-cost, uncertainty-aware LLM evaluation tools. Code is released for reproducibility.
conformal predictionbradley-terryelo estimationllm evaluationuncertainty calibration
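The global layer described above can be pictured with a generic split conformal interval. This is a minimal sketch, assuming an absolute-residual nonconformity score between LLM-derived and human Elo ratings; the function name and the toy numbers are hypothetical and not taken from the paper's released code.

```python
import numpy as np

def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Symmetric prediction intervals from a held-out calibration set.

    cal_pred / cal_true: LLM-derived and human Elo ratings for calibration models.
    test_pred: LLM-derived Elo ratings for new models.
    """
    cal_pred = np.asarray(cal_pred, dtype=float)
    cal_true = np.asarray(cal_true, dtype=float)
    test_pred = np.asarray(test_pred, dtype=float)
    scores = np.sort(np.abs(cal_true - cal_pred))  # nonconformity scores
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))        # finite-sample corrected rank
    q = np.inf if k > n else scores[k - 1]         # too few points: infinite width
    return test_pred - q, test_pred + q

# Toy usage with made-up ratings: residuals are 10, 20, and 30 Elo points.
lo, hi = split_conformal_interval(
    cal_pred=[1000, 1100, 1200],
    cal_true=[1010, 1080, 1230],
    test_pred=[1500],
    alpha=0.5,
)
```

With exchangeable calibration and test models, an interval built this way covers the human rating with probability at least `1 - alpha` marginally, with no distributional assumption on the residuals.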
Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization
The study extends optimal transport (OT) analysis to detect hallucinations in neural machine translation (NMT) across all six decoder layers of the Fairseq DE-EN model (N=3,414), revealing complementary detection capabilities between Wass-to-Unif and Wass-to-Data metrics, with layers L1-L4 most predictive. OT achieves 57.2%/57.6% balanced accuracy on abstractive summarization faithfulness detection (AggreFact, N=1,116), below supervised methods (69.9%/74.3%) due to limitations in capturing downstream faithfulness failures. Structural analysis of T5-base confirms consistent decoder organization, with Layer 3 peak concentration and Layer 12 critical for generation quality.
optimal transporthallucination detectionneural machine translationabstractive summarizationcross-attention
Understanding helpfulness and harmless tension in reward models
This work investigates the tension between helpfulness and harmlessness objectives in reward models for RLHF, revealing mechanistic insights into their conflicting representations. The authors analyze single-objective (helpfulness-only, harmlessness-only) and mixed-objective reward models, employing activation-based methods and targeted neuron ablations to study functional roles. Results show mixed-objective models underperform single-objective variants, indicating interference between objectives. Shared neurons between helpfulness and harmlessness exert disproportionate influence on model behavior, causally supporting their respective objectives while negatively impacting the opposing one. These findings elucidate challenges in multi-objective alignment and motivate future work on disentangled alignment methods.
reward modelsrlhfneuron ablationalignment tensionactivation-based methods
WHAR Arena: Benchmarking the State of the Art in Efficient Wearable Human Activity Recognition
The WHAR Arena benchmark addresses comparability issues in Wearable Human Activity Recognition (WHAR) by standardizing 30 datasets, evaluation protocols, and model interfaces. Through 4760 training runs across 17 architectures, the study finds no single dominant model, with CNN-HAR achieving the highest mean macro-F1 but performance clustering near a ceiling. Compact models like TinierHAR and Random Forests excel in deployment efficiency, while larger recurrent models offer diminishing returns. The framework is released to promote transparent benchmarking and efficiency optimization.
wearable human activity recognitionmacro-f1cross-subject evaluationpareto frontiertinierhar
The Geometry of Phase Transitions in Generative Dynamics via Projection Caustics
The paper develops a geometric framework to explain phase-transition-like behavior in continuous-state generative samplers (e.g., diffusion and flow-matching models). By interpreting denoising as gradient descent on a free energy landscape, the authors identify projection caustics, the points where nearest-point projections onto the data support become non-unique, as the origin of abrupt qualitative changes in trajectories. They introduce the Critical Boundary Detector (CBD) to diagnose score-direction instability, demonstrating its efficacy in localizing mode commitment, predicting intervention-sensitive windows, and enabling targeted control across toy models, diffusion models, and latent text-to-image diffusion models. The results establish a connection between data geometry and diffusion dynamics.
projection causticsfree energy landscapecritical boundary detectormode commitmentscore-direction instability
Loss-Shift Transfer via Bayes Quotients
(No summary returned.)
Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents
The paper introduces TRACE (Test-time Rule Acquisition and Compiled Enforcement), a skill-layer pipeline that compiles user corrections into runtime checks for coding agents to reduce repeated preference violations. TRACE mines chat corrections, rewrites them as atomic rules, and enforces them during execution. Evaluated on ClawArena and MemoryArena-derived tasks, TRACE reduces held-out preference violations from 100.0% to 37.6% (in-distribution) and 2.0% (out-of-distribution) on ClawArena, and from 100.0% to 60.5% on MemoryArena, outperforming memory baselines.
runtime enforcementcoding agentspreference complianceatomic rulesin-distribution tasks
Detecting Explanatory Insufficiency in Learned Representations: A Framework for Representational Vigilance
The paper introduces VER (Vigilant Evaluator of Representations), a diagnostic framework for detecting explanatory insufficiency in learned representations. VER formalizes a five-step monitoring sequence: representation identification, explanatory-domain delimitation, residual-structure detection, explanatory-resistance evaluation, and vigilance signaling. The method distinguishes representational inadequacy from prediction error, uncertainty, noise, and distribution shift, complementing existing evaluation metrics without modifying learning algorithms. The authors outline a path toward empirical validation through representational-vigilance benchmarks.
learned representationsexplanatory insufficiencyresidual-structure detectionrepresentation diagnosticsvigilance signaling
When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals
The paper investigates whether exposing routing paths in Block Attention Residuals (Block AttnRes) architectures enables mechanistic interpretability. By comparing a vanilla Qwen3 model with deterministic routing to a trained Block AttnRes Qwen3 (both 0.6B parameters), the study reveals that trained models develop localized routing motifs (embedding-source, current-state, and older-history pathways) absent in the baseline. However, routing mass does not correlate with causal importance, as some high-mass paths show no causal role under intervention. The findings demonstrate that while architectural exposure of routing is necessary, causal probing remains essential for mechanistic interpretation.
block attention residualsmechanistic interpretabilityrouting motifscausal probingqwen3
Robust State-Conditional Feature-Weighted Jump Models for Temporal Clustering
A robust feature-weighted jump model is proposed for temporal clustering, incorporating a penalty for smooth transition regularization and Tukey's biweight loss for outlier robustness. The model introduces a parameter to control feature weight variability across states, enabling state-specific feature relevance. Simulation results demonstrate accurate recovery of true cluster sequences and reliable feature identification, outperforming competing methods, especially in outlier scenarios. Empirical validation includes applications to conflict-related homicide data in Kosovo (1998-2000) and macroeconomic performance across twelve European countries (1949-2024).
temporal clusteringfeature weightingtukey's biweightstate-specific relevanceoutlier robustness
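For readers unfamiliar with the robust loss named above, here is a small sketch of the standard Tukey biweight function. The tuning constant `c = 4.685` is the conventional default for standardized residuals and is an assumption here; the summary does not state which value the paper uses.

```python
import numpy as np

def tukey_biweight(r, c=4.685):
    """Tukey's biweight loss: roughly quadratic near zero, constant beyond |r| = c."""
    r = np.asarray(r, dtype=float)
    u = np.clip(np.abs(r) / c, 0.0, 1.0)  # scaled residual, saturated at 1
    return (c ** 2 / 6.0) * (1.0 - (1.0 - u ** 2) ** 3)

# Small residuals are penalized like r^2 / 2; large ones all cost c^2 / 6.
small, large, huge = tukey_biweight([0.01, 10.0, 1000.0])
```

Since the loss is flat beyond `c`, an extreme outlier contributes no more to the objective than a moderate one, which is why it limits the influence of outliers on the fitted states.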
An Extensible and Lightweight Unified Architecture for Demosaicing Pixel-bin Image Sensors
The paper proposes a modular unified architecture for demosaicing pixel-bin sensors, addressing challenges posed by inter-color separation and CFA-specific deep learning methods. The solution combines extensibility and lightweight design, featuring a learning-free CFA-identification module for plug-and-play operation. Results demonstrate improved image quality while reducing resource overhead and development complexity compared to existing approaches.
pixel-bin sensorsdemosaicingcolor filter arraylightweight architecturecfai module
Learning-Augmented Approximation for Unrelated-Machines Makespan Scheduling
The paper presents a learning-augmented algorithm for makespan minimization on unrelated machines ($R\|C_{\max}$), addressing an open question from Antoniadis et al. (ICLR 2025). By leveraging predictions of heavy job assignments, the method achieves a polynomial-time $(1+\varepsilon)$-approximation for accurate predictions, with graceful degradation to a 2-approximation under increasing prediction error. Theoretical guarantees match known lower bounds, and empirical validation demonstrates practical efficacy. The work extends the learning-augmented framework beyond selection problems to scheduling.
learning-augmented algorithmsmakespan minimizationunrelated machinesapproximation algorithmsscheduling
Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning
The paper introduces SWITCH, a switchable latent reasoning framework that replaces continuous hidden-state recurrence with discrete boundary tokens for improved optimization and interpretability. The method uses explicit entry/exit tokens to enable standard on-policy reinforcement learning (GRPO) and facilitate mechanistic analysis through direct probing. Experiments show SWITCH outperforms prior hidden-state-recurrence approaches, with mechanistic analysis revealing three key findings: learned switching policies, problem-specific latent computation, and concentrated computation at entry transitions.
latent reasoningon-policy rlhidden-state recurrencemechanistic analysisswitch-grpo
Disparate Impact in Synthetic Data Generation
The paper introduces a fairness notion of disparate impact for synthetic data generation (SDG), focusing on utility parity across sensitive groups without altering the real data distribution. It diverges from prior approaches that correct biases by redefining SDG to match the real distribution. The authors analyze failure modes of SDG, including limitations in expressive power, sampling errors due to group proportions, and estimation errors from differential privacy mechanisms. Experiments on artificial and real-world data, using probabilistic graphical models, demonstrate disparate impact. A group-wise SDG modeling strategy is proposed, showing improved utility and parity across various settings.
disparate impactsynthetic data generationprobabilistic graphical modelsdifferential privacyutility parity
Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models
The study introduces AuthorityBench, a 220,564-prompt multi-domain benchmark designed to isolate the influence of citation-based authority signals on epistemic behavior in large language models (LLMs). Using a 2x2 factorial design crossing claim veracity with citation veracity, the benchmark spans four domains (general knowledge, science, law, medicine) with controlled variations in prompt templates, venue prestige tiers, and author demographics. Evaluation of seven models reveals that citation presence, regardless of veracity, consistently increases hallucination rates, particularly for true claims with fabricated citations (3-22 percentage points increase, peaking at 35-77% in general knowledge). Legal claims show relative robustness, while venue prestige and author demographics have negligible impact.
authoritybenchepistemic behaviorhallucination ratesfactorial designcitation veracity
Scale Buys Interpolation, Structure Buys a Horizon: Certified Predictability for Equivariant World Models
The paper introduces a computable, multi-step certificate for predictable horizons in equivariant latent world models, proving that $T$-step rollout error remains constant over symmetry orbits and is stratified by the predictor's Lyapunov spectrum ($T_j(\varepsilon)\sim\log(1/\varepsilon)/\lambda_j$). The method leverages equivariance to provide horizon guarantees, with empirical validation on 40-D Lorenz-96 showing a $\mathbb{Z}_N$-equivariant network achieves high fidelity ($R^2{=}0.98$) in recovering the full Lyapunov spectrum, outperforming dense and recurrent baselines. The certificate is shown to be structure-dependent, with applications in auditing pretrained models like TD-MPC2 and V-JEPA 2-AC, where calibration does not scale with parameters.
equivariant world modelslyapunov spectrumpredictable horizoncertified predictabilitymulti-step rollout
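The horizon scaling quoted in the summary can be turned into a quick back-of-the-envelope calculation. The sketch below assumes a proportionality constant of 1 and treats non-positive exponents as giving an unbounded horizon. Both choices, along with the function name, are illustrative assumptions rather than the paper's certificate.

```python
import math

def predictable_horizons(eps, lyapunov_exponents):
    """Per-direction horizon estimate log(1/eps) / lambda_j (unit constant assumed)."""
    horizons = []
    for lam in lyapunov_exponents:
        if lam > 0:
            horizons.append(math.log(1.0 / eps) / lam)
        else:
            horizons.append(math.inf)  # assumption: non-expanding directions never exceed tolerance
    return horizons

# Toy spectrum: a faster-diverging direction yields a shorter horizon.
horizons = predictable_horizons(eps=math.exp(-2), lyapunov_exponents=[2.0, 0.5, -0.1])
```

The estimate makes the stratification concrete: halving an exponent doubles the number of steps before error along that direction reaches the tolerance.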
$α$-fair heterogeneous agent reinforcement learning
The authors propose a novel framework integrating $α$-fairness with Heterogeneous-Agent Trust Region Learning (HATRL) to address fairness and efficiency in multi-agent reinforcement learning. The method employs a fair advantage function that dynamically weights agent utilities based on expected returns, enabling a transition from utilitarian efficiency to $α$-fairness welfare. Two algorithms, $α$-fair HATRPO and $α$-fair HAPPO, are introduced and empirically validated in sequential social dilemmas (CleanUp, CommonHarvest), demonstrating superior utilitarian performance and higher social outcomes compared to baseline HATRL algorithms.
α-fairnessheterogeneous-agent trust region learningfair advantage functionsequential social dilemmasnash equilibria
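The welfare family behind the method can be sketched with the textbook α-fair utility. The per-agent weights below are simply the gradient of that welfare with respect to each agent's return. Whether the paper's fair advantage function uses exactly these weights is not stated in the summary, so treat both helper functions as hypothetical.

```python
import numpy as np

def alpha_fair_welfare(returns, alpha):
    """Textbook alpha-fair welfare of strictly positive per-agent returns."""
    u = np.asarray(returns, dtype=float)
    if np.any(u <= 0):
        raise ValueError("alpha-fair welfare needs strictly positive returns")
    if np.isclose(alpha, 1.0):
        return float(np.sum(np.log(u)))  # proportional fairness limit
    return float(np.sum(u ** (1.0 - alpha)) / (1.0 - alpha))

def alpha_fair_weights(returns, alpha):
    """Gradient of the welfare w.r.t. each return: u_i ** (-alpha)."""
    u = np.asarray(returns, dtype=float)
    return u ** (-alpha)

# alpha = 0 is utilitarian (equal weights); larger alpha favors worse-off agents.
w_util = alpha_fair_weights([1.0, 2.0, 4.0], alpha=0.0)
w_prop = alpha_fair_weights([1.0, 2.0, 4.0], alpha=1.0)
```

Setting `alpha = 0` recovers the plain sum of returns, while increasing `alpha` shifts weight toward agents with lower expected return, matching the utilitarian-to-fair transition described in the summary.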
Limits of spectral learning under noise
The study investigates spectral learning's stability under noise in supervised regression, focusing on coefficient drift induced by additive label noise. Using sparse spectral representations across multiple bases and dimensions, the authors whiten empirical feature geometry to derive a closed-form expression for coefficient vector overlap. Results reveal a universal degradation curve governed by an intrinsic noise scale, with numerical experiments confirming theoretical predictions in Fourier, Legendre, Bessel, and Haar bases. The work identifies a fundamental noise threshold beyond which coefficient estimates become unstable, limiting functional structure recovery from noisy data.
spectral learningnoise thresholdcoefficient driftfunctional regressionbasis expansion
A green solvent screening tool for emerging materials via uncertainty aware, transformer enhanced transfer learning
The study presents a transformer-enhanced transfer learning pipeline for green solvent screening, addressing data scarcity in solubility prediction for emerging materials. Leveraging a pre-trained foundational model on QM9 targets, the method incorporates uncertainty quantification to assess prediction confidence. The system achieves high performance on limited-data targets (Gutmann Donor/Acceptor numbers) and baseline solubility parameters, expanding solubility descriptor data by orders of magnitude. A deployable tool enables high-throughput lab integration, successfully rediscovering known green solvents and proposing novel candidates.
transfer learninguncertainty quantificationsolubility predictionfoundational modelgreen solvents
A solvable model for unsupervised federated learning
The paper presents a theoretical framework for unsupervised federated learning using a teacher-multiple students model, where each student receives distinct data realizations via noise corruption or subset sampling. Employing equilibrium disordered system analysis, the authors demonstrate that student interactions systematically improve learning: high-noise students require fewer samples for pattern recovery, while low-noise students achieve higher ground-truth signal overlap. They derive optimal Bayesian recovery conditions as functions of sample complexity, noise level, and interaction strength, validated numerically, and map the dynamics to equilibrium sampling in a Restricted Boltzmann Machine with structured hidden layers.
federated learninggenerative modelingrestricted boltzmann machinebayesian recoveryequilibrium sampling
Quality-Preserving Imperceptible Adversarial Attack on Skeleton-based Human Action Recognition
The authors propose a quality-preserving adversarial attack method for skeleton-based human action recognition (S-HAR) that maintains motion naturalness while achieving high attack success rates. Their distribution-based approach minimizes the gap between empirical and true risks during optimization, avoiding noise-like perturbations that degrade motion quality. Experiments on state-of-the-art S-HAR models across two datasets show superior attack success and preserved motion quality, measured by a novel human-aligned perception metric. The results expose vulnerabilities in current S-HAR systems.
adversarial attackskeleton-based recognitionmotion qualitydistribution-based optimizationhuman action recognition
Deep Sleep Classification via EEG Signal Criticality: A Passive BCI Approach for Sleep-Improvement Neurofeedback
This study introduces a passive Brain-Computer Interface (pBCI) approach for deep sleep (N3) classification using criticality features derived from Detrended Fluctuation Analysis (DFA) of EEG signals. The method analyzed 347,232 EEG epochs from 290 older women, employing UMAP manifold learning for state transition visualization and benchmarking six classifiers via 10-fold cross-validation. Naive Bayes achieved the highest mean balanced accuracy (87.17% ± 0.24%), significantly outperforming fully connected deep neural networks (81.58%) and Random Forest (80.97%). The results demonstrate that DFA-derived criticality features reside on a non-linear manifold, enabling robust probabilistic decoding for state-dependent neurofeedback applications.
detrended fluctuation analysispassive brain-computer interfacedeep sleep classificationumap manifold learningneurofeedback
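Detrended Fluctuation Analysis itself is a standard procedure, and a compact version is sketched below. The window sizes, the first-order (linear) detrending, and the synthetic test signals are assumptions for illustration; the study's EEG preprocessing and feature definitions are not reproduced.

```python
import numpy as np

def dfa_exponent(x, scales):
    """Estimate the DFA scaling exponent of a 1-D signal with linear detrending."""
    x = np.asarray(x, dtype=float)
    profile = np.cumsum(x - x.mean())            # integrated, mean-centered signal
    fluctuations = []
    for n in scales:
        n_win = profile.size // n
        segs = profile[: n_win * n].reshape(n_win, n).T  # shape (n, n_win)
        t = np.arange(n)
        coeffs = np.polyfit(t, segs, 1)          # one line fit per window
        trend = np.outer(t, coeffs[0]) + coeffs[1]
        fluctuations.append(np.sqrt(np.mean((segs - trend) ** 2)))
    # Slope of log F(n) against log n is the scaling exponent.
    slope, _ = np.polyfit(np.log(scales), np.log(fluctuations), 1)
    return float(slope)

rng = np.random.default_rng(0)
white = rng.standard_normal(8192)
scales = [16, 32, 64, 128, 256]
a_white = dfa_exponent(white, scales)            # uncorrelated noise: near 0.5
a_walk = dfa_exponent(np.cumsum(white), scales)  # random walk: near 1.5
```

Exponents near 0.5 indicate uncorrelated fluctuations and larger values indicate long-range temporal correlations, which is the kind of quantity a criticality feature for each EEG epoch would summarize.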
Reliability of Probabilistic Emulation of Physical Systems
This study evaluates the reliability of probabilistic forecasts for physical systems by comparing generative models (diffusion, flow matching) and CRPS-trained ensembles under matched computational budgets. A framework assesses empirical coverage of predictive intervals, accuracy, and inference latency across diverse 2D spatiotemporal systems. CRPS-trained ensembles achieve more reliable uncertainties and faster inference, particularly in autoregressive rollouts, compared to latent-space-trained generative models. Generative models trained in ambient space match ensemble coverage but incur higher latency. The authors release AutoCast for modular implementation and AutoSim for dataset generation to support future research.
probabilistic forecastscrps-trained ensemblesgenerative modelsempirical coverageautoregressive rollouts
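The CRPS objective mentioned above has a simple sample-based form for a finite ensemble. The sketch below implements the common energy-form estimator for a scalar target. It is a generic illustration and is not taken from the AutoCast release, and the function name is hypothetical.

```python
import numpy as np

def crps_ensemble(members, observation):
    """Sample CRPS: mean |x_i - y| minus half the mean pairwise |x_i - x_j|."""
    x = np.asarray(members, dtype=float)
    accuracy = np.mean(np.abs(x - observation))            # distance to the truth
    spread = np.mean(np.abs(x[:, None] - x[None, :]))      # ensemble dispersion
    return float(accuracy - 0.5 * spread)

# A two-member ensemble straddling the observation.
score = crps_ensemble([0.0, 2.0], observation=1.0)
```

The first term rewards members close to the observation and the second rewards spread, so minimizing the score discourages both biased and overconfident ensembles.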
DeepJEB++: Foundation Model-Driven Large-Scale 3D Engineering Dataset via 2D Latent Space Augmentation
DeepJEB++ introduces a foundation-model-driven framework for generating large-scale 3D engineering datasets from limited seed data. The method employs a three-stage pipeline: (1) 2D latent space augmentation via fine-tuned diffusion models and VLM filtering, (2) 3D mesh generation using domain-adapted foundation models, and (3) automated simulation labeling for mass, stress, and displacement. Starting with 400 seed designs, the approach produces 15,360 simulation-labeled 3D jet engine brackets (40x expansion) with validated manufacturability and label fidelity. The dataset supports reproducible engineering-AI research.
latent diffusionvision-language modelfinite-elementdata augmentationfoundation model
Exposure Bias as Epistemic Underidentification in Recursive Forecasting
The paper recharacterizes exposure bias in recursive multi-step forecasting as an epistemic underidentification problem under partial observability, proving that even deterministic latent dynamics can lead to unidentified recursive predictors due to self-generated induced states. The authors formalize this with induced states (Z) and provenance variables (P), decomposing error into teacher-forcing/rollout mismatch, representation-class approximation, and provenance gaps. Empirical results demonstrate distinct induced-state regimes during rollout, with performance improvements from both local adaptation and altered induced-state visitation, while provenance-aware correction yields conditional gains.
exposure biasepistemic underidentificationrecursive forecastinginduced statesprovenance variables
EPM-JEPA: Operator-Side Experience Modulation in JEPA-Family World Models
The paper introduces EPM-JEPA, an operator-side experience modulation method for JEPA-family world models that generates low-rank weight deltas via LoRA to adapt to distribution shifts, contrasting with operand-side injection (EI-JEPA). On Moving MNIST with gravity shift, EPM-JEPA (D_shift = 0.7848 ± 0.0078) shows no significant difference from EI-JEPA (0.8238) but improves 1.90% over a no-memory baseline, highlighting the specificity of weight-level modulation. Analysis reveals three dynamical processes influencing performance: buffer cycling, EMA target drift, and a LoRA settling transient (+0.021), motivating the proposed PEM-JEPA to address dynamical-peak limitations.
jepa-familyexperience modulationlow-rank adaptationdistribution shiftdynamical processes
Predicting Cognitive Load from Speech and Interaction Dynamics in Dyadic Conversations
The study demonstrates that conversational interaction dynamics enhance cognitive load prediction in natural dyadic conversations. Using audio from 53 dyads performing nine collaborative tasks, the authors extract static acoustic, dynamic, and interaction features to train a two-head Gated Recurrent Unit encoder for cognitive load score prediction. Results indicate temporal demand correlates with turn-taking dynamics (e.g., overlap, speaker switches), while mental demand associates with imbalanced participation, highlighting task structure's role in modeling cognitive load.
cognitive loaddyadic conversationsgated recurrent unitturn-taking dynamicsinteraction features
Circuit Synchronization Precedes Generalization: Causal Evidence from Fourier Structure in Grokking Transformers
The study introduces the Frequency Synchronization Degree (FSD), a novel metric for detecting Fourier circuit synchronization in transformers prior to grokking, without requiring prior circuit knowledge. Using FSD across nine modular addition configurations (primes $p \in \{53, 71, 97, 113, 131\}$), the authors demonstrate that synchronization occurs 500-3,000 steps before grokking (mean lead +1,722 steps), outperforming existing predictors. Causal evidence shows the inter-phase gap acts as a regularization phenomenon, with grokking time $\Delta_t$ following a $1/\lambda$ relationship ($R^2=1.00$ and $R^2=0.99$ in clean cases). Architecture ablations reveal FSD as a multi-block circuit property, with attention-only models grokking and MLP-only models failing to grok.
fourier circuitgrokkingfrequency synchronization degreemodular arithmeticweight decay
Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment
The paper introduces self-guidance, a method to enhance neural speech codecs by aligning decoder feature manifolds when processing both quantized tokens and continuous embeddings, using a lightweight feature-mapping loss. This approach improves reconstruction fidelity without modifying the quantizer or increasing model capacity, requiring minimal training overhead and no inference changes. Applied to XCodec2, it achieves state-of-the-art low-bitrate performance, enables 4x codebook reduction without fidelity loss, and improves LLM-based TTS synthesis by simplifying token modeling. Statistical and visual evidence confirms enhanced manifold alignment, with experiments demonstrating generality across inductive biases.
neural codecsmanifold alignmentvq-vaesquantization errorfeature-mapping loss
Is Spurious Correlation Removal Always Learnable?
The paper establishes a computational barrier for invariant learning despite statistical identifiability, showing that polynomial-time algorithms cannot recover one-dimensional invariant subspaces ($k=1$) under a black-box samplable sparse-recovery primitive. Using a separation parameter $\gamma$ to quantify environment diversity, it derives minimax risk bounds $\Theta(k(d-k)/(n|\mathcal{E}|))$ under sufficient diversity and local Gaussian regularity, with a phase transition at $n^*\propto k(d-k)/(|\mathcal{E}|\gamma^2)$. Experiments on synthetic and real data validate the theoretical gaps and transitions.
invariant learningsparse recoveryminimax riskenvironment diversityphase transition
Multi-Label Test-Time Adaptation with Bayesian Conditional Priors
The paper proposes Bayesian Conditional Priors (BCP) Estimation, a gradient-free test-time adaptation method for multi-label recognition under distribution shift. BCP addresses the brittleness of frozen Vision-Language Models (VLMs) by injecting label dependency without tuning the backbone. It selects a high-confidence anchor label per test image, applies anchor-conditioned Bayesian refinement in logit space, and estimates priors online from unlabeled test streams via lightweight second-order co-occurrence statistics. Evaluated on standard multi-label benchmarks with multiple CLIP backbones, BCP improves RN50 average mAP from 57.31 to 69.22 and ViT-B/16 from 62.61 to 71.79, outperforming strong TTA baselines.
bayesian conditional priorstest-time adaptationvision-language modelsmulti-label recognitiondistribution shift
Where Computation Lives Inside TabPFN: Causal Localisation of Attention Head Function
The study provides the first causal mechanistic analysis of TabPFN 2.5, a tabular foundation model, focusing on feature-wise attention heads' computational distribution across layers. Using activation patching, ablation, and attention entropy on synthetic regression datasets, the authors identify temporal specialization: one head's causal necessity dominates others by 2-5x at peak layer, with its dominant layer shifting across task complexity, while remaining heads show symmetric late-layer profiles. Attention entropy and patching converge on computationally active layers of the dominant head. Contrastive activation steering fails to transfer across samples, attributed to TabPFN's in-context learning mechanism encoding task structure via context-dependent attention rather than stable parametric directions.
tabular foundation modelattention headsactivation patchingin-context learningcontrastive activation steering
Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration
The paper proposes a unified graph-based dataset pruning framework that integrates intrinsic sample importance and extrinsic pairwise diversity into a single optimization objective, formulated as a Maximum Weight Clique Problem (MWCP). The method employs a greedy algorithm with provable approximation guarantees under mild conditions, offering practical design guidelines for importance metrics. Experiments demonstrate superior performance over existing pruning methods, achieving over 40% training time reduction on ImageNet-1k with ResNet-50 while maintaining accuracy.
dataset pruning · maximum weight clique problem · greedy algorithm · approximation guarantee · training acceleration
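To make the clique formulation above concrete, here is a minimal sketch of a greedy heuristic for a maximum-weight clique. It is not the paper's algorithm: the importance scores, the distance threshold of 0.5 that defines "diverse enough" edges, and the function name `greedy_weight_clique` are all illustrative assumptions.

```python
import numpy as np

def greedy_weight_clique(weights, adjacency):
    """Greedy heuristic: visit vertices by descending weight and keep a vertex
    only if it is adjacent to every vertex already kept (so the kept set
    always forms a clique)."""
    order = np.argsort(-weights, kind="stable")
    chosen = []
    for v in order:
        if all(adjacency[v, u] for u in chosen):
            chosen.append(int(v))
    return chosen

# Toy data: 5 samples with 1-D features; the importance scores are made up.
features = np.array([0.0, 0.1, 1.0, 1.05, 2.0])
importance = np.array([0.9, 0.8, 0.7, 0.95, 0.5])

# Assumption: two samples share an edge when they are at least 0.5 apart,
# so a clique is a subset of mutually diverse samples.
dist = np.abs(features[:, None] - features[None, :])
adjacency = dist >= 0.5

selected = greedy_weight_clique(importance, adjacency)
print(selected)  # indices of the retained samples
```

Near-duplicate samples (indices 0 and 1, or 2 and 3) cannot both be kept because they are not connected, which is how a single objective can trade off importance against pairwise diversity.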
LongSpike: Fractional Order Spiking State Space Models for Efficient Long Sequence Learning
LongSpike introduces fractional-order State-Space Modeling (f-SSM) into Spiking Neural Networks (SNNs) to address the memoryless bottleneck of first-order ODEs in long-sequence tasks. The framework leverages fractional calculus for hierarchical neuronal dynamics with long-memory kernels, while maintaining computational efficiency through a parallelizable state-space formulation. Evaluations on Long Range Arena (LRA), WikiText-103, and Speech Commands show superior accuracy over state-of-the-art SNNs with preserved sparse computation.
spiking neural networks · fractional-order ssm · long-sequence learning · state-space modeling · parallel training
Prediction-Powered Causal Inference by Automatic Debiased Machine Learning and Semi-Supervised Riesz Regression
The study introduces Prediction-Powered Causal Inference (PPCI), a framework for semiparametric efficient estimation of causal and structural parameters in semi-supervised settings with unlabeled auxiliary regressors. By deriving the efficient influence function and efficiency bound, the authors demonstrate that incorporating auxiliary regressors reduces asymptotic variance compared to using labeled data alone. They propose DML-PPCI methods, including EE-DML-PPCI and TMLE-DML-PPCI, which achieve the derived efficiency bound. Key components involve estimating the efficient influence function, leveraging Neyman orthogonal scores, and developing semi-supervised generalized Riesz regression with convergence rate guarantees for Riesz representer estimation.
efficient influence function · neyman orthogonal score · riesz representer · semiparametric estimation · asymptotic variance
Direct Preference Optimization for Chatbot Fine-Tuning: An Empirical Study
The study introduces Direct Preference Optimization (DPO) for fine-tuning large language models, demonstrating its advantages in simplifying the training pipeline and improving computational efficiency. The method employs reinforcement learning to optimize model performance, evaluated using BLEU, ROUGE, and cosine similarity metrics. Results show competitive performance and effective convergence, though training instability remains an area for further investigation.
direct preference optimization · reinforcement learning · large language models · fine-tuning · computational efficiency
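For readers unfamiliar with the objective, the DPO loss is commonly written as a logistic loss on the difference of policy-to-reference log-probability ratios for a preferred and a rejected response. The sketch below computes it for a single preference pair; whether the study uses exactly this form is an assumption, and `beta=0.1` and the log-probabilities are illustrative values only.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Made-up sequence log-probabilities for one (chosen, rejected) pair.
no_preference = dpo_loss(-11.0, -11.0, -11.0, -11.0)
prefers_chosen = dpo_loss(-10.0, -12.0, -11.0, -11.0)
print(no_preference, prefers_chosen)
```

The loss falls below log 2 as soon as the policy raises the chosen response's likelihood relative to the reference more than it raises the rejected one.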
Multi-Bitwidth Quantization for LLMs Using Additive Codebooks
The authors propose Drop-by-Drop, a multi-bitwidth post-training quantization framework enabling inference-time precision control over LLM weights from a single trained model. The method leverages information-theoretic principles and successive refinement, using additive codebooks with Matryoshka-style supervision to optimally reconstruct Gaussian-distributed weights under weighted MSE distortion. Evaluated on Qwen, LLaMA, Gemma, and Mistral architectures, the approach maintains competitive perplexity and accuracy while reducing storage overhead through shared checkpoints for multiple bitwidths.
post-training quantization · additive codebooks · successive refinement · multi-bitwidth · llm weights
SMGFM: Spectral Multimodal Graph Pretraining for Multimodal-Attributed Graphs
SMGFM introduces a spectral multimodal graph pretraining framework for multimodal-attributed graphs (MAGs), addressing the challenge of disentangling structure-induced and modality-intrinsic semantics. The method leverages graph-frequency variation as a prior, decomposing modality-specific node signals into graph-frequency bands using scalable Chebyshev filters. It constructs frequency-resolved modality tokens, estimates coupling reliability through topology-conditioned routing, and performs band-modality interaction before fusion. This approach aligns smooth consensus routes while preserving modality-specific routes, mitigating spatial-domain entanglement. Extensive experiments on MAG datasets demonstrate SMGFM's state-of-the-art performance across graph-level and modality-level tasks.
multimodal-attributed graphs · graph-frequency variation · chebyshev filters · topology-conditioned routing · band-modality interaction
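As background for the Chebyshev filters mentioned above, the sketch below applies the standard three-term recurrence T_k = 2 L~ T_{k-1} - T_{k-2} to a node signal, which avoids any eigendecomposition. It is not SMGFM's implementation: the use of the symmetric normalized Laplacian, the bound lambda_max = 2, and the function name `chebyshev_terms` are assumptions for illustration.

```python
import numpy as np

def chebyshev_terms(adjacency, signal, order):
    """Return [T_0(L~) x, ..., T_order(L~) x] using the Chebyshev recurrence."""
    n = adjacency.shape[0]
    deg = adjacency.sum(axis=1)
    inv_sqrt = np.zeros(n)
    inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    laplacian = np.eye(n) - inv_sqrt[:, None] * adjacency * inv_sqrt[None, :]
    # Assumption: lambda_max is bounded by 2, so L~ = L - I has spectrum in [-1, 1].
    shifted = laplacian - np.eye(n)
    terms = [signal.astype(float)]
    if order >= 1:
        terms.append(shifted @ terms[0])
    for _ in range(2, order + 1):
        terms.append(2.0 * shifted @ terms[-1] - terms[-2])
    return terms

# Path graph on 4 nodes with a toy one-dimensional node signal.
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
signal = np.array([1.0, 0.0, 0.0, -1.0])
terms = chebyshev_terms(adjacency, signal, order=3)
print(len(terms))
```

A filter bank is then a set of learned or fixed linear combinations of these terms, each emphasising a different range of graph frequencies.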
Multimodal Graph Negative Learning
The paper proposes GraphMNL, a graph-aware multimodal negative learning framework addressing node-level branch semantic imbalance in multimodal attributed graphs (MAGs). Unlike existing methods relying on cross-branch agreement, GraphMNL employs negative learning for cross-branch guidance, teaching inferior branches which classes a node is unlikely to belong to rather than forcing imitation. The framework includes a branch library, graph-aware reliability arbitration, unstable transfer gating, and target-preserving negative learning. Evaluations show GraphMNL achieves 72.47% accuracy on Grocery datasets and 76.60 F1 score on Reddit M datasets.
multimodal attributed graphs · negative learning · branch semantic imbalance · graph-aware reliability arbitration · target-preserving learning
A Privacy-Preserving Framework Using Remote Data Science for Inter-Institutional Student Retention Prediction
The study introduces a privacy-preserving machine learning framework using PySyft for inter-institutional student retention prediction, featuring a semi-air-gapped architecture with high-side and low-side servers. It employs remote data science (RDS) to enable collaborative model building without direct data access, validated across three universities. Three synthetic data generation methods were evaluated, including a novel Data-Type-Aware Templates approach prioritizing privacy over distributional fidelity. Results show consistent classification performance (Macro F1: 0.690-0.695) while ensuring FERPA compliance, demonstrating RDS-based PPML as a viable alternative to federated learning for small-scale collaborations.
privacy-preserving machine learning · remote data science · semi-air-gapped architecture · data-type-aware templates · ferpa compliance
Interpretable Factor Decomposition for Decision Intelligence in Large-Scale Financial Markets: Evidence from China's A-Share Market
The study introduces an interpretable machine learning pipeline for factor decomposition in equity return prediction, focusing on China's A-share market. Using XGBoost with TreeSHAP attribution on 3632 stocks (2009-2019), the method achieves a mean AUC of 0.547 and +2.38%/month long-short spread (Sharpe 2.23), persisting after Carhart four-factor adjustment (+2.31%/month). SHAP analysis reveals behavioral signals (58.2% attribution) dominate valuation ratios (10.7%), with ablation studies exposing feature substitutability patterns not visible through single-method analysis.
interpretable machine learning · treeshap attribution · cross-sectional equity · factor decomposition · ablation analysis
CLARITree: Cholesky and Lookahead Accelerations for Regression with Interpretable Piecewise Linear Trees
The paper introduces CLARITree, a novel algorithm for constructing near-optimal, sparse, piecewise linear regression trees. It combines lookahead search strategies with efficient rank-one Cholesky updates of the Gram matrix, addressing computational bottlenecks in dynamic programming approaches. Theoretical and empirical results demonstrate superior trade-offs between computational efficiency, predictive accuracy, and sparsity compared to state-of-the-art methods, while maintaining interpretability.
regression trees · cholesky updates · lookahead search · piecewise linear · gram matrix
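The rank-one Cholesky update mentioned above is a standard linear-algebra routine: when one sample x is added to a design matrix, the Gram matrix changes from G to G + x xᵀ, and its Cholesky factor can be refreshed in O(n²) instead of refactorised in O(n³). The sketch below shows the textbook update; how CLARITree schedules these updates during split search is not described in the summary, so the surrounding usage is an assumption.

```python
import numpy as np

def cholesky_rank_one_update(L, x):
    """Given lower-triangular L with G = L @ L.T, return the factor of G + x x^T."""
    L = L.copy()
    x = x.astype(float).copy()
    n = len(x)
    for k in range(n):
        r = np.hypot(L[k, k], x[k])
        c = r / L[k, k]
        s = x[k] / L[k, k]
        L[k, k] = r
        if k + 1 < n:
            # Update the rest of column k, then carry the remainder forward in x.
            L[k + 1:, k] = (L[k + 1:, k] + s * x[k + 1:]) / c
            x[k + 1:] = c * x[k + 1:] - s * L[k + 1:, k]
    return L

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))             # toy design matrix
gram = X.T @ X + 1e-6 * np.eye(4)        # small ridge keeps it positive definite
L = np.linalg.cholesky(gram)
x_new = rng.normal(size=4)               # one additional sample
L_updated = cholesky_rank_one_update(L, x_new)
print(np.allclose(L_updated @ L_updated.T, gram + np.outer(x_new, x_new)))
```

Because the Cholesky factor with a positive diagonal is unique, the updated factor also matches a fresh factorisation of the enlarged Gram matrix.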
Graph Reinforcement Learning for Calibration-Aware Quantum Circuit Routing
A calibration-aware graph reinforcement learning approach improves quantum circuit routing fidelity by incorporating same-day IBM Heron r2 calibration data. The method employs proximal policy optimization to train a policy that selects hardware-edge SWAPs, evaluated using exact simulated fidelity across nine Munich Quantum Toolkit (MQT) Bench circuits and three calibration snapshots. Results show pooled mean exact fidelity of 0.727, outperforming SABRE-best20 (0.440) and target-aware SABRE (0.481). Fidelity gains are concentrated in 5q and 8q circuits, with higher routed two-qubit counts, while 10q circuits favor SABRE-best20 under the fixed tree action graph.
quantum circuit routing · calibration-aware · proximal policy optimization · hardware-edge swaps · exact simulated fidelity
Quantum Reservoir Computing for Short-Term Power Load Forecasting in Resource-Constrained Energy Systems
This work introduces a Quantum Reservoir Computing (QRC) framework for short-term power load forecasting, optimized for resource-constrained edge deployment. The method employs a fixed quantum reservoir for feature extraction and trains only a classical Elastic Net readout, later compressed via post-training fixed-point quantization (2-8 bits). Evaluated on Tetouan and Spain energy datasets under simulated noise (IBM FakeTorino/Marrakesh), results show 6-bit quantization preserves forecasting accuracy while reducing memory by 81.2%. Performance degradation below 6 bits is dataset-dependent, with Tetouan showing higher sensitivity. The framework demonstrates noise resilience without retraining.
quantum reservoir computing · elastic net · fixed-point quantization · hardware-noise models · statevector simulation
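The reported memory saving can be sanity-checked with simple arithmetic: if the unquantized readout is stored as 32-bit floats (an assumption, since the summary does not state the baseline), 6-bit storage saves 1 - 6/32 = 81.25%, which is consistent with the 81.2% figure. The sketch below pairs that calculation with a generic symmetric uniform quantizer; the weights and the function names are hypothetical, and the paper's exact fixed-point scheme may differ.

```python
import numpy as np

def quantize_fixed_point(weights, bits):
    """Symmetric uniform quantization to signed `bits`-bit integer codes."""
    levels = 2 ** (bits - 1) - 1               # e.g. 31 for 6 bits
    scale = np.max(np.abs(weights)) / levels
    codes = np.round(weights / scale).astype(np.int32)
    return codes, scale

def memory_reduction(bits, baseline_bits=32):
    """Fraction of storage saved relative to a `baseline_bits` representation."""
    return 1.0 - bits / baseline_bits

readout = np.array([0.82, -0.40, 0.05, -0.93, 0.31])  # made-up readout weights
codes, scale = quantize_fixed_point(readout, bits=6)
restored = codes * scale                                # dequantized weights
print(codes, np.max(np.abs(restored - readout)))
print(memory_reduction(6))
```

The rounding error per weight is bounded by half a quantization step, which is why accuracy can survive moderate bit widths and then degrade quickly once the step becomes comparable to the weights themselves.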
ProPlay: Procedural World Models for Self-Evolving LLM Agents
ProPlay introduces procedural world models for self-evolving LLM agents, enabling procedure-level preplay to refine environment understanding through interaction. The method abstracts successful trajectories into procedures organized in a causal transition graph, with reliability records estimating task-specific contributions. Agents simulate future procedural paths as structured guidance and update the graph post-execution using environment feedback. Experiments on public benchmarks demonstrate consistent improvements in environment understanding and self-evolution over baselines.
procedural world models · self-evolving agents · procedure graph · reliability record · structured soft guidance
Detecting Functional Memorization in Code Language Models
The paper introduces functional memorization in code-generating LLMs, demonstrating that models can reproduce functional logic without verbatim textual overlap. Using a counterfactual setup with OLMo-3-32B, the authors compare a midtrained model (exposed to target code) against a pretrained reference, evaluating both textual and functional similarity via LLM-as-a-judge and execution-based metrics. Results reveal clear functional memorization, underscoring the need for auditing metrics beyond textual overlap.
functional memorization · code language models · counterfactual setup · execution-based metrics · llm-as-a-judge
Adaptive Weighted Averaging
The paper introduces adaptive weighted averaging strategies for selecting the largest value among n unknown quantities, given unbiased estimates. The proposed methods are both admissible (not uniformly dominated) and guarantee performance no worse than baseline approaches like uniform random selection. Applied to stochastic optimization, these strategies yield online-to-batch conversion bounds with a 'no-compromise' property: they match or exceed random iterate selection while performing better in favorable scenarios.
adaptive weighted averaging · stochastic optimization · online-to-batch conversion · admissible strategies · unbiased estimates
Deep Unfolded Latent Optimally Partitioned-l2/l1 Networks for Data-driven Block-Sparse Recovery
The authors propose Deep Unfolded Latent Optimally Partitioned-l2/l1 (DU-LOP-l2/l1) networks for block-sparse recovery, addressing limitations of manual hyperparameter tuning and numerical instability in proximal operator differentiation. Two architectures are introduced: a stable framework using implicit differentiation and a flexible variant employing Deep Weight Factorization (DWF), which supports nonconvex smooth data fidelity terms. Experiments show DU-LOP-l2/l1 achieves competitive performance and high resilience against impulsive noise.
block-sparse recovery · deep unfolding · implicit differentiation · deep weight factorization · proximal operator
Physics-Informed Neural Networks and Radial Basis Functions for PDEs with Dirac Delta Sources
The work demonstrates that Radial Basis Function (RBF) Residual Least Squares (RLS) methods outperform Physics-Informed Neural Networks (PINNs) for solving PDEs with Dirac delta sources. By interpreting PINNs as RLS methods, the authors show that RBF-RLS directly handles Dirac delta terms through weak-form integration, avoiding approximation errors inherent in PINNs. Neural Tangent Kernel (NTK) theory explains RBF-RLS's superior convergence. Experiments on linear PDEs for groundwater flow and transport validate the approach on synthetic and real-world data, including inverse problems with noisy measurements.
physics-informed neural networks · radial basis functions · dirac delta sources · residual least squares · neural tangent kernel
Let's Ask Gauss: Improved One-Run Privacy Auditing
The paper introduces an improved one-run privacy auditing method for differentially private machine learning, specifically targeting DP-SGD. By analyzing canary-aligned signals as a sequence of random variables converging to a Gaussian distribution, the authors develop a framework that provides tighter privacy lower bounds from a single training run. This approach outperforms prior binary thresholding methods by leveraging distributional information, enhancing the practical assessment of privacy leakage in DP mechanisms.
privacy auditing · differential privacy · dp-sgd · gaussian convergence · one-run methods
Normative Robustness as a Frontier for Non-Verifiable Reasoning in LLMs
The paper introduces normative robustness as a framework for evaluating non-verifiable reasoning in LLMs, focusing on moral reasoning as a paradigmatic case. The authors propose moral robustness, defined as consistent moral reasoning across contexts, and develop an adversarial multi-turn evaluation framework simulating 48,000 user-agent moral deliberations across four frontier LLMs. Results show models ignore irrelevant distractors but exhibit moral deliberative sycophancy, shifting reasoning by up to 6.5% toward user-stated views and varying judgments based on premise order (13-22% variance) and conversation duration (10-24% variance).
normative robustness · moral reasoning · non-verifiable reasoning · deliberative sycophancy · multi-turn evaluation
EquiDexFlow: Contact-Grounded SE(3)-Equivariant Dexterous Grasp Generative Flows
EquiDexFlow introduces an SE(3)-equivariant flow-matching model for dexterous grasp generation, jointly predicting wrist pose, joint angles, fingertip contacts, surface normals, and contact forces from object point clouds. The architecture ensures contact projection onto object surfaces and force alignment within the Coulomb friction cone, maintaining physical feasibility without loss penalties. Trained on 8,100 force-closure grasps across 81 objects for the 16-DoF Allegro Hand, the model achieves zero friction violations, the best composite score, and the lowest wrench residual among ablation variants. Hardware experiments demonstrate successful open-loop pick-and-hold trials on six test objects, including asymmetric objects at canonical and rotated poses.
se(3)-equivariant · flow-matching · coulomb friction cone · force-closure grasps · inverse kinematics
Out-of-Distribution (OOD) Detectors for Open-Set RF Fingerprinting
The paper introduces out-of-distribution (OOD) detection methods for open-set radio-frequency (RF) fingerprinting, addressing the challenge of distribution shift from unknown transmitters and temporal drift. The authors present a unified information-theoretic framework for analyzing and developing OOD detectors, eliminating the need for impractical auxiliary OOD data collection. Evaluated on the POWDER RF fingerprinting dataset, their OOD detectors achieve comparable performance to baselines with true OOD tuning data and significantly outperform methods without such access, demonstrating practical viability for RF environments.
out-of-distribution detection · rf fingerprinting · information theory · distribution shift · open-set recognition
A Stabilized Path-Space Approach to Diffusion-Based Posterior Sampling
The authors propose a stabilized path-space framework for diffusion-based posterior sampling, addressing limitations of heuristic guidance approximations in nonlinear and multimodal settings. Their method formulates posterior sampling as a stochastic optimal control problem, matching a likelihood-weighted target measure on trajectories via time reparameterization and trust-region optimization with log-variance objectives. The framework provides theoretical connections to existing guidance-based samplers, quantifies sampling errors, and enables importance sampling corrections. Evaluations on benchmark inverse problems demonstrate improved accuracy and robustness over state-of-the-art methods, with principled assessment of sampling accuracy and uncertainty quantification.
diffusion models · posterior sampling · stochastic optimal control · uncertainty quantification · path-space optimization
Smarter Saboteurs, Better Fixers: Scaling & Security in Linear Multi-Agent Workflows
The paper demonstrates that model scale impacts adversarial resilience in linear multi-agent systems (MAS), revealing a compliance-correction symmetry: larger models (up to 27B parameters) show a 53.7pp performance drop when executing malicious instructions but recover statistical parity with control performance when augmented by a lightweight terminal Fixer stage. Experiments across open-weight model families on the HumanEval benchmark show that linear workflows can be resilient if corrected, challenging prior assumptions about topology brittleness. Results indicate that scaling exacerbates malicious compliance but that correction mechanisms effectively mitigate the risk.
multi-agent systems · model scaling · adversarial resilience · compliance-correction symmetry · linear workflows
A unified complexity bound for logconcave sampling
The paper presents a unified, nearly tight complexity bound for sampling logconcave distributions using the In-and-Out algorithm with exponential lifting. The key innovation is an improved bound on the Poincaré constant of the lifted distribution, enabling tighter convergence rates. The results apply to both constrained settings (e.g., Gaussians restricted to convex bodies) and well-conditioned settings (e.g., strongly logconcave and smooth densities), achieving near-optimal performance in both cases.
logconcave sampling · in-and-out algorithm · poincaré constant · exponential lifting · convergence rate
Forecasting Is Not Attribution: Localizing Decoder Bypass in Graph-Based Neural Marketing Mix Models
The paper introduces DICE-MMM, a diagnostic framework addressing attribution bypass in graph-based neural marketing mix models (MMMs), where decoders achieve low forecasting error without proper attribution to marketing channels. The method involves a two-stage training process: first training a graph encoder with a restricted decoder, then freezing the encoder to train a graph-safe latent decoder. Experiments demonstrate that forecasting accuracy (MSE@7 ~0.004) does not guarantee attribution (AR-CIG nAUPRC ~0), with DICE improving graph recovery over CausalMMM and identifying graph-support selection as the key bottleneck.
attribution bypass · graph-based mmm · decoder training · counterfactual sensitivity · graph recovery
How Useful is Causal Invariance for Domain Adaptation in Finite-Sample Settings?
The paper investigates the utility of causal invariance for supervised domain adaptation (sDA) in finite-sample settings, focusing on linear regression. It derives matching upper and lower bounds demonstrating that finite-sample gains depend on target-risk margins between candidate predictors and finite-source estimation error. When margins are sufficiently large relative to target sample size, adaptive aggregation achieves optimal performance without negative transfer; otherwise, no algorithm reliably exploits causal knowledge. Theoretical insights are validated on real-world causal benchmarks, connecting margins to structural shift magnitude in linear SCMs.
causal invariance · supervised domain adaptation · finite-sample settings · linear regression · structural shift
Fed-FBD: Federated Functional Block Diversification for Isolation, Privacy, and Surgical Unlearning
Fed-FBD introduces a modular federated learning architecture that decomposes ResNet backbones into six functional blocks with color variants, enabling block-level isolation, privacy-by-design, and surgical unlearning. The method maintains a warehouse of N variants assembled from independently tracked blocks, providing architectural guarantees against adversarial contamination and membership inference while supporting sub-second unlearning. Evaluations on MedMNIST-2D, PathMNIST, and CIFAR-10 show a 0.3%-3.1% IID accuracy trade-off versus FedAvg, with adversarial attacks confined to the attacker's blocks (±0.01 AUC drift on clean variants).
federated learning · functional block diversification · surgical unlearning · resnet backbone · membership inference
Physics-Informed Neural Networks for Chemotherapy Pharmacokinetics: Benchmarking the Clinical Estimator and Exposing Parameter Identifiability
The study benchmarks Physics-Informed Neural Networks (PINNs) against nonlinear least-squares (NLS) and data-only MLPs in chemotherapy pharmacokinetics, where tissue concentration is unobserved. PINNs match NLS accuracy on linear two-compartment models while jointly estimating tissue curves, outperforming MLPs by 10x. For Michaelis-Menten kinetics, PINNs expose non-identifiability from plasma data alone, a flaw masked by NLS's misspecified biexponential ansatz. Sparse tissue measurements improve identifiability, with PINNs recovering parameters within 1% accuracy (k21) and one standard deviation (Vmax, Km), demonstrating a unified framework for heterogeneous data integration and structural insight.
pharmacokinetics · identifiability · biexponential · michaelis-menten · compartmental
Computationally tractable robust differentially private mean estimation
The authors propose the balloon mean, a differentially private mean estimator offering computational tractability and robustness to outliers. The method employs an iterative clipping procedure over expanding Mahalanobis balls (balloons), satisfying zero-concentrated differential privacy with interpretable tuning parameters. Theoretical guarantees under heavy-tailed and contaminated elliptical models demonstrate robustness, while simulations show superior performance over existing private estimators in contaminated settings.
differentially private · mean estimation · mahalanobis balls · zero-concentrated differential privacy · iterative clipping
Physics-Aware Auxiliary Losses Improve Out-of-Distribution Generalization of a GNN Synthesizability Filter
The study demonstrates that physics-aware auxiliary losses enhance out-of-distribution (OOD) generalization in a GNN-based synthesizability filter. Using a GINE backbone, the authors incorporate two auxiliary losses: topological complexity regression (supervised by Bertz index) and strain-energy soft penalty (supervised by MMFF94 force-field energy). Evaluated on a 65,177-molecule corpus with OOD testing on COCONUT natural products, all three physics-aware variants show statistically significant OOD AUC improvements (best Δ=+0.0066) over the baseline (AUC 0.9774), while remaining indistinguishable in-distribution. Multi-seed validation revealed methodological pitfalls in single-seed evaluations.
graph neural network · out-of-distribution generalization · auxiliary losses · synthesizability filter · force-field energy
Epistemic Uncertainty Is Not the Reducible Kind
The article demonstrates an extensional inconsistency between the standard taxonomy and measure of epistemic uncertainty, proving that epistemic uncertainty is not inherently reducible by additional data. Through explicit construction and theoretical analysis, it introduces a refined trichotomy of uncertainty: aleatoric, sample-reducible epistemic, and mechanism-reducible epistemic. An exact identity reveals that in-distribution data generically increases mechanism-irreducible uncertainty. Ensemble disagreement, commonly used to estimate epistemic uncertainty, is shown to track training procedures rather than true epistemic terms, collapsing under consistent training or equating to initialization noise. Finite-sample falsification tests and seed-swept experiments validate these findings.
epistemic uncertainty · ensemble disagreement · aleatoric uncertainty · mutual-information · interpolation
TEDD: Robust Detection of Unstable Temporal Features
TEDD introduces a robust technique for detecting unstable temporal features in datasets, addressing performance degradation in ML models due to distribution shifts. The method employs a regression model to identify features predictive of instance timestamps, enabling detection of univariate and multivariate drifts across numerical and categorical features. Evaluations on synthetic and real data demonstrate TEDD's capability to detect all basic change patterns without parameter tuning, while providing comparable change measurements and scaling efficiently with feature and instance counts.
temporal feature drift · distribution shift · multivariate drift detection · regression-based detection · model robustness
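The core idea above, that a feature is unstable if it helps predict when an instance was recorded, can be illustrated with a deliberately simplified univariate sketch. It fits a least-squares line from each feature to the timestamp and reports R²; TEDD itself uses a regression model that also covers multivariate and categorical drift, so the linear fit, the synthetic data, and the function name `timestamp_r2` are assumptions made for illustration.

```python
import numpy as np

def timestamp_r2(feature, timestamps):
    """R^2 of a least-squares line predicting timestamps from one feature."""
    design = np.column_stack([feature, np.ones_like(feature)])
    coef, *_ = np.linalg.lstsq(design, timestamps, rcond=None)
    residual = timestamps - design @ coef
    total = timestamps - timestamps.mean()
    return 1.0 - (residual @ residual) / (total @ total)

rng = np.random.default_rng(0)
t = np.arange(500, dtype=float)               # instance timestamps
stable = rng.normal(size=500)                 # distribution does not change
drifting = 0.01 * t + rng.normal(size=500)    # mean slowly shifts over time

scores = {
    "stable": timestamp_r2(stable, t),
    "drifting": timestamp_r2(drifting, t),
}
print(scores)
```

A feature whose score is close to zero carries almost no information about time, while a high score flags a candidate for removal or monitoring before the model is deployed.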
Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning
The paper introduces a safe offline multi-agent reinforcement learning (MARL) algorithm combining diffusion models with individual control barrier functions (CBFs). The method embeds neural CBFs into the diffusion process to ensure safety during trajectory generation, with policies recovered via inverse dynamics. Evaluations across multiple benchmarks show significant safety improvements while maintaining competitive reward performance compared to existing approaches.
offline reinforcement learning · diffusion models · control barrier functions · multi-agent systems · inverse dynamics
The Metric Picks the Winner: Evaluation Choice Flips Model Rankings for Drug-Response Prediction in Unseen Chemistry
The study investigates drug-response prediction in unseen chemistry, demonstrating that model rankings invert based on evaluation metrics. Using THP-1 cell data from the VCPI contest (14,026 training compounds), a staged approach combines baselines, non-parametric retrieval, and a fusion model with chemistry embeddings. Under a per-gene proxy metric, linear regression on Morgan fingerprints outperforms deep models and ChemBERTa. However, under the contest's active-set metric, deep models and the fusion decoder significantly surpass the baseline (wMSE -0.012, p < 10^-4). The findings highlight metric-dependent performance reversals, validated via a reproducible pipeline.
drug-response prediction · scaffold split · weighted mse · morgan fingerprints · chemberta
ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation
The paper introduces Efficient Continual Alignment (ECA), a novel exemplar-free incremental learning approach for Open-ended Image-to-Text Generation (OpenITG). ECA addresses continual alignment in evolving environments by adapting pre-trained VLMs through three mechanisms: Mixture of Query (MoQ) for task-specific query tokens, Fisher Dynamic Expansion (FeDEx) for structure expansion based on Fisher Information Matrix, and Dictionary Replay (DR) for knowledge retention. Evaluated on four new IL OpenITG benchmarks, ECA significantly mitigates catastrophic forgetting and outperforms baseline methods in preserving cross-modal representations.
incremental learning · image-to-text generation · fisher information matrix · exemplar-free · catastrophic forgetting
Estimating Individualized Treatment Effects in Acute Ischemic Stroke with Causal Transformation Models (TRAM-DAG): A Multi-Centre Observational Study with External RCT Validation
The study introduces causal transformation models on directed acyclic graphs (TRAM-DAG) for estimating individualized treatment effects (ITE) in acute ischemic stroke, aiming to identify patients benefiting most from mechanical thrombectomy over lysis. TRAM-DAG was trained on observational MAGIC multi-center stroke patient data, specifically a sub-population with NIHSS at admission ≥6, and validated using the MR CLEAN RCT population. Results show that TRAM-DAG's ITE estimates align with the trial's average treatment effect and correctly rank patients by observed good outcomes (mRS ≤2 at three months). This supports TRAM-DAG's utility in personalized stroke care decision-making.
individualized treatment effect · causal transformation models · directed acyclic graphs · mechanical thrombectomy · modified rankin scale
Towards Provably Fair Machine Learning: Bayesian Approaches For Consistent and Transparent Predictions
The paper introduces the Fair Bayesian classifier, which enforces determinism and statistical consistency across all subgroups to address inconsistent predictions in ML classifiers. By requiring predictions to align with Bayesian optimal target distributions and abstaining when consistency is unachievable, the method eliminates consistency errors. Evaluated on Adult, COMPAS, and Bank Marketing datasets, it outperforms baselines in accuracy and multicalibration while maintaining zero consistency error. The approach highlights the importance of Bayesian consistency for algorithmic fairness, particularly in small subgroups where frequentist inference fails.
bayesian consistency · subgroup fairness · deterministic predictions · statistical consistency · multicalibration
Evaluation of AutoML Frameworks for IDS under Imbalanced Data Conditions of the NSL-KDD Dataset
The study establishes a standardized benchmark for evaluating AutoML frameworks in multiclass intrusion detection under severe class imbalance using the NSL-KDD dataset, preserving all five original classes including rare attacks (R2L, U2R). Nine open-source AutoML frameworks were systematically compared, analyzing architectural design, ensemble strategies, hyperparameter optimization, and imbalance-handling mechanisms. Results show PyCaret achieved the highest macro-F1 (66%), followed by AutoGluon (55%), with ensemble learning and imbalance-aware optimization proving critical for minority-class discrimination, while accuracy-oriented frameworks exhibited significant performance degradation on rare attack categories.
automl · class imbalance · intrusion detection · nsl-kdd · macro-f1
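To see why macro-F1 rather than accuracy is the informative metric under severe class imbalance, consider a classifier that never predicts the rare class. The sketch below computes macro-F1 from scratch on made-up labels (95 normal instances, 5 U2R instances); the counts are illustrative and are not taken from NSL-KDD.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    f1_scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Made-up imbalanced labels: the classifier always answers "normal".
y_true = ["normal"] * 95 + ["u2r"] * 5
y_pred = ["normal"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(accuracy, score)
```

Accuracy stays at 0.95 while macro-F1 drops below 0.5, because the missed rare class contributes an F1 of zero with the same weight as the majority class.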
The Mathematics of AI Winters: The mathematical Taxonomy of Paradigm Fragility in AI Winter
The article proposes a mathematical taxonomy of paradigm fragility in AI winters, arguing that formal barriers—not just engineering failures—contributed to reduced funding and confidence. It synthesizes key mathematical bottlenecks from early AI, including perceptron impossibility results, complexity-theoretic hardness of neural-network training, minimax rates for nonparametric estimation, vanishing-gradient analyses, and classical statistical learning theory. These barriers are shown to align with central disappointments of the first and second AI winters. The analysis further connects these limitations to subsequent breakthroughs that mitigated, but did not eliminate, these challenges.
perceptron impossibility · complexity-theoretic hardness · minimax rates · vanishing-gradient · statistical learning theory
Viral Proteins Reveal Geometry of Protein Language Models
The study characterizes the geometric structure of protein language model (pLM) embeddings using viral proteins as a case study. Analyzing ESM model families, the authors identify a dominant nativeness axis in embedding space that orders sequences from cellular proteins to viral proteins to shuffled sequences, aligned with masked reconstruction perplexity. Scaling contracts this axis unevenly across viral families. Despite this, pLM embeddings retain viral-specific signal, enabling linear separability beyond zero-shot perplexity and shallow sequence features. Results indicate pLM representations encode both a general nativeness metric and group-specific biological information.
protein language models, nativeness axis, masked reconstruction perplexity, viral proteins, embedding space
Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants
The Shopping Reasoning Bench introduces a novel expert-authored benchmark for evaluating conversational shopping assistants, addressing the unique challenges of multi-turn reasoning, domain expertise, and criterion-level quality. The benchmark comprises 525 missions (232 single-turn, 293 multi-turn) with 10,863 importance-weighted binary rubrics, organized into five reasoning categories and fifteen subcategories. Evaluation of nine models (GPT, Claude, Gemini) reveals pass rates of 57–77%, with multi-turn performance declining by 4–18 points as conversations progress and a 13–29 point gap on optional criteria, highlighting limitations in expert-level advice.
conversational shopping assistants, multi-turn reasoning, domain expertise, binary rubrics, preference refinement
Feature-preserving Latent-EnKF for Data Assimilation of Flows with Shocks
A feature-preserving latent-EnKF is introduced for sequential data assimilation of flows with shocks, addressing the EnKF's failure due to multimodal ensemble statistics violating Gaussian assumptions. The method performs ensemble updates in a learned low-dimensional latent space where shock and flow features form a smooth manifold, preserving sharp features during analysis. A shared decoder maps the updated latent state back to the physical state, eliminating member-specific ordered training and positivity flooring. Numerical experiments on Sod shock tube and Mach 2 shock interaction with a 2D cylinder demonstrate accurate recovery of shocks and contact discontinuities without spurious oscillations.
ensemble kalman filter, latent space, data assimilation, shock recovery, multimodal statistics
Crossing the Validation Crisis: Cross-Validation Reduces Benchmarking Variance Surprisingly Well
The study demonstrates that cross-validation significantly enhances benchmarking reliability by reducing performance estimation variance, addressing the validation crisis in machine learning evaluation. It introduces sample gain to quantify virtual data augmentation from multiple cross-validation splits, showing empirically on synthetic, histopathologic, and NLP datasets that multiple splits improve estimate stability with delayed diminishing returns. A dynamic early-stopping procedure is proposed to optimize computational cost. Results indicate cross-validation's underutilized potential for robust benchmarking.
cross-validation, benchmarking variance, sample gain, performance estimation, early-stopping
Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers
The paper proposes Rubric-Guided Self-Distillation (RGSD), a verifier-free training method for open-ended domains where rubrics replace single ground-truth answers. RGSD conditions the base policy on rubrics to serve as a teacher, distilling its token-by-token distribution into an unconditioned student, eliminating LLM verifiers and sparse trajectory-level rewards. Evaluated on Qwen-2.5 (3B, 7B) and Qwen3-Thinking (4B, 8B) models in medical and science domains, RGSD achieves rubric satisfaction comparable to judge-based GRPO with one on-policy rollout per prompt and no training-time verifier calls. Ablations indicate raw rubrics outperform self-generated references as teacher enrichment, while stronger GRPO judges can surpass RGSD in some settings.
rubric-guided self-distillation, token-by-token distillation, open-ended domains, rubric satisfaction, on-policy rollout
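Illustration (not from the paper): a minimal sketch of a token-by-token distillation objective of the kind the summary describes, where a rubric-conditioned teacher and an unconditioned student score the same sampled response. The KL direction (teacher to student), the averaging over tokens, and all function names are assumptions; the paper's exact loss may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary at a single token position."""
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))


def distillation_loss(teacher_seq, student_seq):
    """Mean per-token KL over one response.

    teacher_seq: logits per position from the policy prompted WITH the rubric.
    student_seq: logits per position from the same policy prompted WITHOUT it.
    """
    kls = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(kls) / len(kls)


# Toy 3-token vocabulary, 2-token response (made-up logits).
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student = [[1.0, 1.0, 0.0], [0.1, 1.5, 0.3]]
print(f"loss = {distillation_loss(teacher, student):.4f}")
```

Every token position contributes a dense signal here, in contrast to a single trajectory-level reward from a judge.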
Boosting Direct Preference Optimization with Penalization
The paper proposes Direct Preference Optimization with Penalization (DPOP), an extension of DPO that incorporates reference-model responses into offline preference optimization. DPOP augments the base preference loss with a gated penalty on reference-greedy responses, activated only when the policy ranks rejected responses above preferred ones. Evaluated on AlpacaEval 2.0 with Llama-3-8b-it and Gemma-2-9b-it, DPOP achieves relative win rate improvements of 5.3% and 4.4% over DPO, SimPO, and AlphaDPO baselines. Ablations demonstrate the superiority of SimNPO-style length-normalized penalties over NPO and token-level unlikelihood.
direct preference optimization, offline learning, reference model, length normalization, human feedback
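Illustration (not from the paper): a rough sketch of a DPO loss with a gated extra penalty, following only what the summary states. The DPO term is the standard one; the gate (sign of the implicit reward margin), the length-normalized penalty form, and the parameters `lam` and `gamma` are assumptions chosen for illustration and may not match DPOP as published.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_margin(pol_w, pol_l, ref_w, ref_l, beta):
    """Implicit reward margin between preferred (w) and rejected (l) responses.

    Inputs are sequence log-probabilities under the policy and reference model.
    """
    return beta * ((pol_w - ref_w) - (pol_l - ref_l))


def gated_penalty_loss(pol_w, pol_l, ref_w, ref_l, pol_g, len_g,
                       beta=0.1, lam=1.0, gamma=0.0):
    """Return (total, base, gate) for one preference pair.

    pol_g, len_g: policy log-probability and token length of the reference
    model's greedy response (hypothetical inputs for this sketch).
    """
    margin = dpo_margin(pol_w, pol_l, ref_w, ref_l, beta)
    base = -log_sigmoid(margin)                      # standard DPO term
    gate = 1.0 if margin < 0 else 0.0                # assumed: on only when mis-ranked
    # Assumed length-normalized penalty; it shrinks as pol_g decreases.
    penalty = -(2.0 / beta) * log_sigmoid(-(beta / len_g) * pol_g - gamma)
    return base + lam * gate * penalty, base, gate


ranked_ok = gated_penalty_loss(-10.0, -12.0, -11.0, -11.0, pol_g=-20.0, len_g=10)
mis_ranked = gated_penalty_loss(-12.0, -10.0, -11.0, -11.0, pol_g=-20.0, len_g=10)
print(ranked_ok, mis_ranked)
```

With the gate closed the loss reduces to plain DPO, so the penalty only affects pairs the policy currently gets wrong.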
Dolph2Vec: Self-Supervised Representations of Dolphin Vocalizations
The authors introduce Dolph2Vec, a self-supervised learning (SSL) model for dolphin vocalizations, trained on a novel dataset of five years of longitudinal recordings from five dolphins. Adapting Wav2Vec2.0, the model is optimized for fine-grained analysis of dolphin communication, unlike general-purpose SSL models. Dolph2Vec outperforms baselines in signature whistle classification and whistle detection, with learned embeddings revealing interpretable acoustic units aligned with whistle categories and sub-whistle structures.
self-supervised learning, bioacoustics, wav2vec2.0, signature whistle classification, acoustic units
Policy-driven Conformal Prediction for Trustworthy QoT Estimation
Conformal QoT introduces a policy-driven framework integrating statistically guaranteed Quality of Transmission (QoT) estimation with operational decision policies, enhancing lightpath-feasibility predictions under domain shift. The method leverages conformal prediction to ensure reliability, raising accuracy from 92% to 99.6% on open datasets.
conformal prediction, quality of transmission, lightpath-feasibility, domain shift, operational decision policies
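Illustration (not from the paper): a minimal sketch of split conformal prediction feeding a simple accept/reject policy. It assumes a regression-style QoT estimate (a margin in dB) with absolute-residual calibration scores; the decision rule, the threshold, and all names are hypothetical and may differ from the paper's framework.

```python
import math


def conformal_quantile(residuals, alpha):
    """Split-conformal quantile of calibration residuals |y - y_hat|.

    Uses the finite-sample rank ceil((n + 1) * (1 - alpha)); returns infinity
    when the calibration set is too small for the requested coverage.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(residuals)[k - 1]


def decide_lightpath(pred_margin_db, q, required_margin_db):
    """Hypothetical policy: accept only if the interval's lower bound clears the threshold."""
    lower = pred_margin_db - q
    return "accept" if lower >= required_margin_db else "reject"


# Made-up calibration residuals: 0.1, 0.2, ..., 1.9 dB.
calibration_residuals = [i / 10 for i in range(1, 20)]
q = conformal_quantile(calibration_residuals, alpha=0.1)

print(q)
print(decide_lightpath(3.0, q, required_margin_db=1.0))
print(decide_lightpath(2.5, q, required_margin_db=1.0))
```

The interval `pred ± q` covers the true value with probability at least `1 - alpha` when calibration and test data are exchangeable; under domain shift that assumption weakens, which is the setting the summary says the paper targets.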
📰 Industry Media
No new items today.
Generated automatically at 2026-06-12 21:36 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.
