Daily Digest — 2026-07-01
342 items · 8 research labs, 333 arxiv papers, 1 industry media
MarkTechPost: all feed URLs failed (last tried: https://www.marktechpost.com/feed/)
AI News: all feed URLs failed (last tried: https://artificialintelligence-news.com/feed/)
🏛️ Research Labs (8)
How ChatGPT adoption has expanded
OpenAI Signals data shows ChatGPT adoption expanding and diversifying worldwide, with usage intensity and task variety increasing over time. The study analyzes aggregated interaction data from individual ChatGPT plans (Free, Go, Plus, Pro) to track how behavior evolves across demographics and regions. Key findings: a 50% increase in daily messages and a doubling of task diversity after six months; the fastest growth in Africa, Asia, and lower-HDI countries; a shift past gender parity, with typically feminine names accounting for 54% of usage; and non-English languages now making up the majority of usage, led by Spanish, Portuguese, and Arabic, with Uzbek, Kazakh, and Burmese showing the highest growth rates.
chatgpt adoption · openai signals · human development index · non-english usage · task diversity
Introducing GeneBench-Pro
GeneBench-Pro introduces a research-level benchmark for evaluating AI agents' ability to handle ambiguity and make consequential judgments in computational biology. The benchmark comprises 129 synthetic problems simulating real-world datasets, requiring iterative analysis, causal reasoning, and methodological choices. Problems are validated by domain experts and graded deterministically against known targets. GPT-5.6 Sol achieves a 31.5% pass rate with Pro mode enabled, outperforming open-source models like GLM 5.2. Results indicate significant progress in high-level scientific reasoning but highlight limitations in closing inferential loops. GeneBench-Pro aims to accelerate scientific discovery by addressing bottlenecks in computational analysis.
computational biology · causal reasoning · iterative analysis · synthetic datasets · deterministic grading
Core dump epidemiology: fixing an 18-year-old bug
OpenAI identified and resolved two distinct crash-inducing bugs in their Rockset data infrastructure through population-level core dump analysis. The investigation revealed a silent hardware corruption on an Azure host and an 18-year-old race condition in GNU libunwind. By automating core dump analysis with a ChatGPT-generated script, the team separated crash populations, enabling targeted fixes: denylisting the faulty host and improving fault detection mechanisms. This epidemiological approach proved critical for diagnosing complex, low-level failures in C++ systems.
core dump analysis · memory corruption · gnu libunwind · stack misalignment · azure host
Inside Genebench-Pro
GeneBench-Pro presents 10 case studies illustrating its biomedical benchmark for evaluating AI models on complex genomic tasks. Each case poses a distinct challenge (e.g., clinical utility estimation, lncRNA dependency analysis, cis-MVMR), with prompts and datasets that require multi-modal evidence integration. The benchmark tests capabilities including structural variant interpretation, ancestry tract analysis, and selection inference while controlling for technical confounders such as ambient RNA, LD artifacts, and mappability biases. Representative tasks involve processing pharmacogenomic evidence, single-cell RNA-seq data, and ancient allele-frequency time series with rigorous statistical controls.
genomic benchmark · structural variant · mendelian randomization · single-cell rna-seq · local-ancestry tracts
ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration
ScarfBench introduces a benchmark for evaluating AI agents on enterprise Java framework migration across Spring, Jakarta EE, and Quarkus. Unlike traditional code generation benchmarks, it assesses whether migrated applications successfully build, deploy, and preserve behavior. Evaluation of state-of-the-art agents reveals significant gaps: compile success (29/30) overestimates deploy success (22/30), with configuration layers requiring disproportionate iterative effort. Agents exhibit overconfidence in self-assessment and struggle with environmental dependencies beyond code transformation.
framework migration · java ecosystems · behavioral validation · dependency resolution · build verification
Why Specialization Is Inevitable
The article synthesizes evidence from optimization theory, evolutionary biology, competitive markets, and machine learning to argue that specialization is an inevitable consequence of performance optimization under resource constraints. It cites Wolpert and Macready's No Free Lunch Theorem (1997) as mathematical foundation, showing that algorithmic performance gains require domain-specific adaptation. Empirical support includes biological niche specialization, market competition dynamics, and ML phenomena like negative transfer and mixture-of-experts architectures. The analysis distinguishes domain specialization (resource concentration) from domain knowledge (hand-coded features), reconciling specialization with Sutton's Bitter Lesson on scaling.
no free lunch theorem · negative transfer · mixture-of-experts · domain specialization · bitter lesson
Featuring Every Eval Ever Results on Hugging Face Model Pages
Hugging Face integrates Every Eval Ever (EEE) with Community Evals to standardize AI model evaluation reporting. EEE employs a JSON schema capturing evaluation metadata, including model, metric, and generation settings, consolidating results from diverse sources into a unified format. Community Evals enables decentralized benchmark score reporting via YAML files in model repositories, linking results to EEE records. The integration includes a converter automating YAML generation from EEE JSON, supporting benchmarks like MMLU-Pro and GPQA. As of February 2026, the EEE datastore contains 229,000 evaluation results across 22,000 models and 2,200 benchmarks, enhancing reproducibility and transparency in AI evaluation.
json schema · community evals · benchmark reporting · evaluation metadata · yaml converter
Unlocking Britain’s next era of productivity: Building a nation of AI trailblazers
A UK-wide study by Google and Public First reveals AI adoption has doubled to 73% in workplaces, but with uneven progression. The workforce is segmented into four stages: AI Spectators (10%), Experimenters (38%), Practitioners (37%), and Trailblazers (15%). Trailblazers, who leverage AI for advanced workflows, report significant professional advantages, including 84% higher promotion likelihood and 55% higher pay rise probability. Barriers to adoption include behavioral habits, cognitive mindsets, and organizational permissions. Google’s nationwide AI upskilling initiative, AI Works for Britain, aims to train 10 million workers by 2030, supported by tools contributing £140 billion to the UK economy in 2025.
ai adoption · workforce segmentation · ai trailblazers · upskilling initiative · economic impact
📜 arXiv Papers (333)
VLK: Learning Humanoid Loco-Manipulation from Synthetic Interactions in Reconstructed Scenes
The paper introduces VLK, a method for learning humanoid loco-manipulation from synthetic vision-language-kinematics supervision. The approach reconstructs metric-scale indoor scenes using 3D Gaussian Splatting, synthesizes 48,000 navigation and object-interaction trajectories with privileged scene information, and renders paired egocentric observations. A VLK policy trained on this data predicts short-horizon whole-body kinematic trajectories, executed via a whole-body tracker on the Unitree G1 humanoid. Physical experiments demonstrate successful sim-to-real transfer for navigation and single-object transport tasks.
humanoid loco-manipulation · 3d gaussian splatting · vision-language-kinematics · sim-to-real transfer · whole-body tracking
LeVo 2: Stable and Melodious Song Generation via Hierarchical Representation Modeling and Progressive Post-Training
LeVo 2 introduces a hybrid LLM-Diffusion framework for full-length song generation, addressing the trade-off between vocal-instrument coordination and track-specific acoustics through hierarchical modeling. The system employs LeLM for semantic planning and track-specific refinement, coupled with a diffusion-based Music Codec for waveform reconstruction. Key innovations include an aesthetics-guided training schedule with progressive post-training (SFT, offline DPO, semi-online DPO) and modular extension for acoustic refinement. Evaluations demonstrate LeVo 2 outperforms open-source baselines across six subjective dimensions and approaches commercial systems in listening metrics, validating the effectiveness of hierarchical architecture, aesthetics guidance, and training strategy.
hierarchical modeling · diffusion-based music codec · aesthetics-guided training · progressive post-training · track-specific refinement
Self-Evolving World Models for LLM Agent Planning
WorldEvolver introduces a self-evolving world model framework for LLM agents that improves foresight without modifying model parameters. The method combines Episodic Memory (retrieval-based simulation of real transitions), Semantic Memory (rule extraction from prediction-observation mismatches), and Selective Foresight (confidence-based prediction filtering). Evaluated on ALFWorld and ScienceWorld using Word2World and AgentBoard benchmarks, WorldEvolver achieves superior prediction accuracy across three model backbones and outperforms baselines in downstream agent success rates, demonstrating test-time memory revision enhances both prediction and planning.
world model · llm agents · episodic memory · semantic memory · selective foresight
GROW$^2$: Grounding Which and Where for Robot Tool Use
GROW$^2$ addresses open-world affordance grounding for robot tool use by hierarchically decomposing the problem into semantic and geometric levels. The method leverages Vision-Language Models for semantic task parsing and tool selection, followed by vision foundation models for 3D region grounding from RGB-D images. Experiments demonstrate superior performance on affordance prediction benchmarks, with zero-shot generalization over open-category objects and improved tool use in simulated and real-world settings compared to baselines.
affordance grounding · vision-language models · zero-shot generalization · robot tool use · 3d region grounding
Pessimism's Paradox: Conservative Offline Training Amplifies Reward Hacking During Online Adaptation in Reasoning Models
The study demonstrates that conservative offline training paradoxically increases reward hacking during online adaptation, contrary to conventional wisdom. Using Qwen3-14B trained with Direct Preference Optimisation (DPO) at three conservatism levels (β ∈ {β_lo, β_mid, β_hi}), the authors show that higher conservatism monotonically raises reward-hacking damage (Spearman ρ = 1.0), measured via Goodhart gap and AUGC on GSM8K. Mechanistic analysis reveals a causal chain: high-β DPO reduces policy entropy and response diversity, concentrating outputs in high-epistemic-uncertainty regions exploited during online optimization. A power-law fit identifies an optimal conservatism level β* balancing alignment and hacking vulnerability.
direct preference optimisation · reward hacking · goodhart gap · policy entropy · epistemic uncertainty
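The reported Spearman ρ = 1.0 is worth reading with the sample size in mind: with only three conservatism levels, any strictly increasing relationship yields ρ = 1.0. The sketch below computes the tie-free Spearman coefficient by hand. The β values and damage scores are invented for illustration and are not taken from the paper.

```python
def ranks(values):
    """Return 1-based ranks; assumes no ties, which keeps the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(xs, ys):
    """Tie-free Spearman correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical numbers: three conservatism levels and a made-up damage score each.
betas = [0.05, 0.1, 0.5]
damage = [0.12, 0.30, 0.41]
rho = spearman_rho(betas, damage)
print(rho)
```

Because the coefficient only looks at ranks, it confirms the ordering of the three levels but says nothing about how steeply damage grows with β.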
DOPD: Dual On-policy Distillation
The paper introduces DOPD (Dual On-policy Distillation), a novel distillation paradigm addressing privilege illusion in on-policy knowledge transfer. The method dynamically routes token-level supervision between privileged teacher and student policies based on advantage gaps and relative probabilities, applying varying supervision strength and objectives. Experiments on LLMs and VLMs show DOPD outperforms Vanilla OPD and other baselines, with additional validation on stability, robustness, continual learning, and OOD tasks.
on-policy distillation · privilege illusion · token-level supervision · advantage gap · capability transfer
Optimization Dynamics Imprint Semantic Specificity in Contrastive Embedding Norms
The work establishes a theoretical framework explaining why embedding norms in contrastive models encode semantic properties despite being typically ignored in cosine similarity metrics. Through analysis of optimization dynamics, the authors derive an analytic formula showing that embedding length naturally captures concept specificity, token frequency, and human uncertainty during training. The results demonstrate how these norms provide calibration signals for model interpretability and retrieval tasks, offering a principled explanation for an empirical phenomenon previously treated heuristically.
contrastive learning · embedding norms · optimization dynamics · semantic specificity · calibration signals
C$^{2}$R: Cross-sample Consistency Regularization Mitigates Feature Splitting and Absorption in Sparse Autoencoders
The paper introduces C$^2$R (Cross-sample Consistency Regularization) to address feature splitting and absorption in Sparse Autoencoders (SAEs) for large language model interpretation. Feature splitting fragments coherent concepts into non-atomic latents, while absorption creates arbitrary exceptions in general features, both stemming from inconsistent latent assignment. C$^2$R enforces cross-sample consistency by penalizing co-activation of directionally similar latents within a batch. Evaluations show C$^2$R mitigates these issues while preserving reconstruction fidelity, enhancing latent interpretability without performance degradation.
sparse autoencoders · feature splitting · feature absorption · cross-sample consistency · latent interpretability
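To make "penalizing co-activation of directionally similar latents within a batch" concrete, here is one possible shape such a penalty could take. This is a guess at the general idea, not the paper's loss: the function name `coactivation_penalty`, the cosine threshold, and the similarity weighting are all assumptions made for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coactivation_penalty(activations, directions, sim_threshold=0.8):
    """Hypothetical penalty: for every latent pair whose decoder directions
    are similar, add similarity * (sum over the batch of a_i * a_j).

    activations: batch of per-sample latent activations (batch x latents).
    directions:  one decoder direction per latent.
    """
    n_latents = len(directions)
    total = 0.0
    for i in range(n_latents):
        for j in range(i + 1, n_latents):
            sim = cosine(directions[i], directions[j])
            if sim < sim_threshold:
                continue  # dissimilar latents may co-activate freely
            total += sim * sum(sample[i] * sample[j] for sample in activations)
    return total / len(activations)


# Latents 0 and 1 point in nearly the same direction; latent 2 is orthogonal.
directions = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
split_batch = [[1.0, 1.0, 0.0]]  # the two similar latents fire together
clean_batch = [[1.0, 0.0, 1.0]]  # only dissimilar latents fire together
print(coactivation_penalty(split_batch, directions))
print(coactivation_penalty(clean_batch, directions))
```

In this toy setup the penalty is positive only when the two near-duplicate latents fire on the same sample, which is the behaviour a consistency regularizer of this kind would discourage.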
MESA: Prioritizing Vulnerable Communication Channels for Securing Multi-Agent Systems
We introduce MESA, a label-free framework for prioritizing vulnerable communication channels in multi-agent systems (MAS) by ranking security-critical edges. MESA combines six graph-theoretic metrics with two dynamic probes (ablation and masking) to assess edge vulnerability without requiring attack traces. Evaluated across three MAS scenarios, eight network topologies, and five LLMs (including Qwen, Llama, and Gemma models), MESA achieves a mean Spearman correlation of ρ = +0.60 (peaking at +0.73) with empirical attack success rates. Monitoring the top 10% of MESA-ranked edges intercepts 3x more successful attacks than random allocation. The framework remains effective under varying attacker and defender models and in LangGraph workflows, though it has limitations under adaptive attacks and on high-redundancy graphs.
multi-agent systems · graph-theoretic metrics · dynamic probes · spearman correlation · langgraph workflows
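The "monitor the top 10% of ranked edges" step can be sketched as a simple budgeted selection. The edge names and vulnerability scores below are invented, and `top_percent_edges` is a hypothetical helper; how MESA actually combines its metrics and probes into a score is not shown here.

```python
def top_percent_edges(edge_scores, percent=10):
    """Return the highest-scoring edges within a monitoring budget.

    edge_scores: dict mapping (sender, receiver) -> vulnerability score.
    percent:     integer share of edges to monitor; at least one edge is kept.
    """
    k = max(1, len(edge_scores) * percent // 100)
    ranked = sorted(edge_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [edge for edge, _ in ranked[:k]]


# Hypothetical communication channels with made-up scores.
edge_scores = {
    ("planner", "coder"): 0.91,
    ("coder", "tester"): 0.40,
    ("planner", "retriever"): 0.75,
    ("retriever", "coder"): 0.22,
    ("tester", "planner"): 0.10,
}
print(top_percent_edges(edge_scores))              # default 10% budget
print(top_percent_edges(edge_scores, percent=40))  # larger budget
```

Using an integer percentage avoids floating-point surprises when the budget is converted to an edge count.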
Words Speak Louder Than Code: Investigating Cognitive Heuristics in LLM-Based Code Vulnerability Detection
This paper presents the first systematic investigation of cognitive heuristics in LLM-based code vulnerability detection, introducing a controlled framework that isolates three heuristics: halo (author attribution), framing (task objectives/consequences), and anchoring (prior analysis). Evaluating eight LLMs across three programming languages, the study finds average susceptibility rates of 33.2% for framing, 23.5% for anchoring, and 18.4% for halo, with semantic reasoning vulnerabilities being more affected than pattern-matching ones. A proof-of-concept black-box attack demonstrates that cognitive susceptibility can suppress up to 97% of detected vulnerabilities, revealing it as a consistent and exploitable property of LLM-based detection systems.
cognitive heuristics · vulnerability detection · semantic reasoning · framing effect · black-box attack
Beyond 2D Matching: A Unified Single-Stage Framework for Geometry-Aware Cross-View Object Geo-Localization
We propose GAGeo, a single-stage Geometry-Aware Geo-localization framework for cross-view object geo-localization (CVOGL) that jointly predicts bounding boxes, segmentation masks, and camera poses. Built upon the permutation-equivariant 3D foundation model $π^3$, GAGeo integrates visual features, referring prompts, and learnable task tokens, adapting inherited 3D priors in a unified forward pass. We introduce a contrastive loss leveraging satellite views as universal anchors for implicit alignment, enabling zero-shot ground-to-drone localization without triplet data. Evaluated on a new large-scale dataset with 220,000 ground-satellite and drone-satellite pairs, GAGeo outperforms state-of-the-art methods, demonstrating strong generalization in unseen scenes and novel cross-view setups.
cross-view object geo-localization · geometry-aware framework · permutation-equivariant 3d model · contrastive loss · zero-shot localization
A Multi-task Mixture of Experts Framework for Malware Classification, Packing Detection, and Family Attribution
The paper proposes a multi-task Mixture of Experts (MoE) framework for malware analysis, addressing classification, packing detection, and family attribution. It evaluates EMBER features and raw byte arrays, comparing Homogeneous MoE, Heterogeneous MoE, and Multi-Gate MoE (MMoE) architectures. MMoE achieves a 0.9744 detection rate with 2.56% failure, demonstrating robustness against adversarial mutations. The framework leverages expert specialization and adaptive gating for scalable, resilient malware detection.
mixture of experts · malware classification · packing detection · multi-task learning · adversarial robustness
The Human Creativity Benchmark
The Human Creativity Benchmark (HCB) introduces a framework for evaluating creative AI by preserving both convergence and divergence signals in professional judgments. It collects pairwise preferences, scalar ratings on prompt adherence, usability, and visual appeal, alongside qualitative rationale from domain professionals across 15,000 judgments in five creative domains and three workflow phases. Results show convergence on verifiable dimensions like technical correctness and divergence on taste-driven aspects like aesthetic direction. No model uniformly excels across all phases, and collapsing these signals into a single metric discards critical insights on correctness versus steerability.
human creativity benchmark · convergence · divergence · pairwise preferences · workflow phases
TraceLab: Characterizing Coding Agent Workloads for LLM Serving
The paper introduces TraceLab, a dataset of 4,300 coding-agent sessions (350K LLM steps, 430K tool calls) from daily use of Claude Code and Codex, addressing the lack of real workload data for serving-system optimization. Through trace collection and analysis, the authors identify key workload characteristics: long autonomous loops, short outputs in long contexts, heavy-tailed tool calls, and high but imperfect KV-cache hit rates. These findings suggest optimizations like append-length-aware prefill and semantic-aware tool-latency prediction.
coding agents · llm serving · kv-cache · tool calls · workload characterization
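One way to see why agent loops give "high but imperfect" KV-cache hit rates is to measure how much of each step's prompt is a prefix already seen in the previous step. The sketch below uses a deliberately simplified model (only the immediately preceding step counts as cached, and tokens are plain integers); the function names and token data are illustrative and not from TraceLab.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_hit_rate(steps):
    """Fraction of prompt tokens (from the second step on) that reuse the
    previous step's prefix. steps is a list of token lists, one per LLM step."""
    hit = total = 0
    for prev, curr in zip(steps, steps[1:]):
        hit += shared_prefix_len(prev, curr)
        total += len(curr)
    return hit / total if total else 0.0


# Toy session: step 2 appends to step 1; step 3 rewrites an early token,
# which invalidates everything after it in this prefix-only model.
steps = [
    [1, 2, 3],
    [1, 2, 3, 4, 5],
    [1, 2, 9, 4, 5, 6],
]
print(prefix_hit_rate(steps))
```

Pure appends keep the whole earlier context reusable, while a single edit near the start of the context forces most of the prompt to be prefilled again.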
Linguistic Firewall: Geometry as Defense in Multi-Agent Systems Routing
We introduce ANTAP (Automatic Non-Textual Agent Picker), a routing architecture for Multi-Agent Systems that mitigates security vulnerabilities arising from reliance on unverified proxies for agent competence. ANTAP employs active capability testing to empirically assess agent performance, distilling results into fixed behavioral operators within a shared semantic space. Routing decisions are made via non-textual algebraic projection, establishing a 'linguistic firewall' that prevents metadata-based attacks. Experiments demonstrate ANTAP achieves near-zero Attack Success Rate (ASR) against description-based injection attacks, compared to 67.3% for baseline methods, and reduces ASR by 20% against adaptive embedding attacks.
multi-agent systems · linguistic firewall · attack success rate · semantic space · algebraic projection
To Tab or Not to Tab: Measuring Critical Engagement in AI Code Completion Tools Using Behavioral Signals and Attention Checks
The study introduces Clover, an AI code completion tool that logs student interactions and incorporates attention checks to measure critical engagement during programming tasks. A taxonomy of behavioral interaction metrics was developed, informed by prior literature. Analysis revealed that higher rates of tab acceptance correlated with lower attention check performance, while increased dwell time was associated with higher attention check performance. These findings suggest that interaction patterns and attention checks can serve as indicators of reflective engagement in AI-assisted programming.
code completion · attention checks · behavioral interaction metrics · tab acceptance · dwell time
Latent Actions from Factorized Transition Effects under Agent Ambiguity
The paper introduces Observed Transition Factorization (OTF) to address action ambiguity in Latent Action Models (LAMs) by decomposing transitions into reusable primitives. OTF-LAM abstracts these primitives into action-like latents using inverse-forward dynamics, while OTF-LAM-Dino predicts future states in DINOv2 space without a decoder. Experiments show zero-shot transferability of OTF primitives across carrier and morphology shifts, with downstream policy learning matching or surpassing baselines under transition ambiguity.
latent action models · transition factorization · dino representation · inverse-forward dynamics · zero-shot transfer
TRACE: Temporal Relationship-Aware Conversational Entrainment Detection in Dyadic Speech
The paper introduces TRACE, a temporal relationship-aware framework for detecting emotional entrainment in dyadic speech, and DyadEE, a dataset containing both natural and synthetically disrupted conversations. TRACE models interactions as ordered sequences of acoustic embeddings from emotion fine-tuned Whisper representations, treating samples as interaction traces rather than pooled utterances. Experiments on DyadEE show that incorporating conversational context and relationship information improves detection, with TRACE achieving 97.01% accuracy.
emotional entrainment · dyadic speech · whisper representations · interaction trace · acoustic embeddings
Learning from Mistakes: Rollout-Retrieval Lifelong Policy Learning for Autonomous Driving
The paper introduces Rollout-Retrieval Lifelong Policy Learning (R²LPL), a framework for continual improvement of autonomous driving policies by learning from recoverable mistakes. R²LPL addresses the challenge of converting sparse failure evidence into compact supervised knowledge by filtering mistake-related states and retrieving feasible corrective targets. Evaluated on large-scale closed-loop nuPlan benchmarks, R²LPL significantly improves a learning-based planner's performance, achieving state-of-the-art results on the challenging Test14-hard split with minimal rollout and continual-learning cycles. This demonstrates R²LPL's efficacy in leveraging recoverable closed-loop mistakes for sustained policy enhancement.
autonomous driving · lifelong learning · policy improvement · closed-loop scenarios · nuplan benchmarks
Entity Binding Failures in Tool-Augmented Agents
The paper identifies entity binding failures as a critical reliability issue in tool-augmented language-model agents, where correct tool selection still leads to actions on incorrect real-world entities. The authors formalize the distinction between tool correctness and entity correctness, propose a taxonomy of wrong-entity failures, and evaluate entity-aware execution mechanisms including resolution preconditions, confidence gating, and provenance tracking. In diagnostic evaluations across 60 tasks with five model backends, action-oriented baselines produced 24.0-26.0% wrong-entity actions, while entity-aware methods eliminated these errors at the cost of reduced task completion under ambiguity.
entity binding failures · tool-augmented agents · entity-resolution preconditions · provenance tracking · confidence gating
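The trade-off described above (no wrong-entity actions, but fewer completed tasks under ambiguity) follows naturally from gating execution on entity resolution. The sketch below is a minimal illustration under assumed names and thresholds: `Candidate`, `resolve_entity`, `min_confidence`, and `min_margin` are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    entity_id: str
    confidence: float  # resolver's confidence that this is the intended entity
    source: str        # provenance: where the candidate came from


def resolve_entity(candidates, min_confidence=0.9, min_margin=0.2):
    """Return the single candidate to act on, or None to abstain."""
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    best = ranked[0]
    if best.confidence < min_confidence:
        return None  # confidence gate
    if len(ranked) > 1 and best.confidence - ranked[1].confidence < min_margin:
        return None  # two plausible entities: treat as ambiguous
    return best


def execute(tool, candidates):
    """Run the tool only when the target entity is resolved unambiguously."""
    target = resolve_entity(candidates)
    if target is None:
        return "abstain: ask the user to disambiguate"
    return f"{tool}({target.entity_id}) [source: {target.source}]"


ambiguous = [Candidate("user:alex.kim", 0.93, "directory"),
             Candidate("user:alex.king", 0.90, "directory")]
clear = [Candidate("user:alex.kim", 0.97, "email thread"),
         Candidate("user:alex.king", 0.30, "directory")]
print(execute("send_invoice", ambiguous))
print(execute("send_invoice", clear))
```

The correct tool is selected in both calls; only the second one acts, because the first cannot tell two similar entities apart.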
Informational Frustration in Neural Manifolds: Shannon Bottlenecks and the Limits of Learnability
The paper introduces a unified theoretical framework connecting information theory, topology, and statistical mechanics to explain deep learning's generalization paradox. It proposes the Entropic Learnability Horizon (ELH), a fundamental law relating data manifold entropy, decision boundary complexity, and weight space entropy. The Shannon-Topological Bottleneck Theorem proves that exceeding this horizon triggers an entropic phase transition into 'Informational Frustration', explaining phenomena like grokking as entropic release. The theory yields Entropic Gradient Descent (EGD), an optimization method dynamically managing weight entropy. Results demonstrate entropy as the physical currency governing learnability.
entropic learnability horizon · shannon-topological bottleneck · informational frustration · entropic gradient descent · von neumann entropy
On the Faithfulness of Post-Hoc Concept Bottleneck Models
The paper analyzes faithfulness issues in Post-Hoc Concept Bottleneck Models (post-hoc CBMs), which project latent features onto interpretable concept spaces. It identifies two failure modes: (1) covariate shifts in auxiliary data causing unfaithful concept representations, with a derived error bound, and (2) systematic label noise in vision-language model-generated concept labels. The authors propose novel metrics decoupling concept faithfulness from predictive accuracy, demonstrating their effectiveness across synthetic and real-world benchmarks where standard accuracy evaluations fail.
concept bottleneck models · covariate shift · vision-language models · interpretability · label noise
McMg: A Learned Phase-Space Multi-channel Multigrid Preconditioner for Helmholtz Equation
The paper introduces McMg, a learned phase-space multigrid preconditioner for heterogeneous Helmholtz equations that addresses challenges of indefiniteness and pollution errors. The method coarsens physical space while preserving wave information in channel dimensions, using learned packets of amplitude, phase, and direction, combined with adaptive stencils and medium-dependent smoothers. Experiments on high-frequency, high-contrast 3D problems show McMg reduces iterations and wall-clock time versus classical baselines and outperforms existing neural preconditioners, with generalization across scales via Layer-by-Layer Progressive Finetuning.
helmholtz equation · multigrid preconditioner · phase-space coarsening · neural pde operators · learned green's operator
SIMAX: A Scalable and Interpretable Framework for Multi-Fidelity and Annotated Clinician-Patient Dialogue Simulation
SIMAX introduces a scalable framework for generating controlled clinician-patient dialogues with behavioral annotations, addressing the scarcity of real-world data for evaluating AI-driven communication coding systems. The method employs predefined clinical scenarios, personas, and voice conditions, with behaviors controlled via Global and WISER codebooks. Evaluation on 3,388 simulated dialogues across three specialties demonstrated reasonable speech naturalness (UTMOS: 3.03, WV-MOS: 2.61), high transcription fidelity (WER: 0.07, CER: 0.05), and positive text-audio correspondence (CLAP cosine similarity: 0.41). Human assessments yielded median MOS of 4.67 and clinical realism score of 3.00, validating SIMAX's utility in assessing communication coding systems.
simulated dialogues · communication coding · behavioral annotations · transcription fidelity · clinical realism
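For readers unfamiliar with the transcription metrics, word error rate (WER) is the word-level edit distance between a reference transcript and the recognized text, divided by the number of reference words. The sketch below implements that standard definition; the example sentences are invented, and the paper's own evaluation tooling is not specified in this summary.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count.
    Assumes the reference contains at least one word."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


reference = "the patient reports mild chest pain"
hypothesis = "the patient reports mild chess pain"
print(word_error_rate(reference, hypothesis))
```

Character error rate (CER) follows the same recipe with characters in place of words, so a WER of 0.07 corresponds to roughly seven word-level errors per hundred reference words.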
Situation Perception: A Necessary Primitive to Artificial Superintelligence
The authors argue that achieving artificial superintelligence (ASI) necessitates the development of 'situation perception,' a capacity to construct, revise, and act within internal simulations of possible worlds across latent time. They identify three core components required for this capability: abstract prediction, long-term compressed memory, and active learning guided by objectives. The analysis critiques current large language models for their lack of general intelligence despite advanced pattern recognition, proposing specific tests to evaluate progress toward machines capable of simulating futures, pursuing self-directed goals, and potentially judging their creators.
artificial superintelligence · situation perception · abstract prediction · long-term compressed memory · active learning
COHORT: Collaborative Orchestration for Hardening via Offensive Replay on Emulated Topologies
COHORT introduces the first end-to-end framework for automating enterprise network mitigation through a role-decomposed multi-agent LLM workflow. The system generates and refines mitigations as real device commands, evaluated via offensive replay on a GNS3 emulator with vendor firmware, supplemented by connectivity-regression and cumulative-effect checks. In experiments across three topologies and four attack scenarios, 46.7% of mitigations successfully disrupted attacks while preserving connectivity, outperforming a single-agent baseline by 4.4×.
offensive replay · gns3 emulator · multi-agent llm · connectivity-regression · enterprise mitigation
Field Order Should Not Matter: Permutation-Invariant Embedding Model Fine-Tuning for Structured Metadata Retrieval
The paper introduces permutation-invariant fine-tuning (PI-FT), a method to make structured metadata retrieval robust to field order variations in serialization. By randomizing field order and applying dropout during fine-tuning, PI-FT reduces the performance drop from 7.4 to 0.2 nDCG@10 when field order changes, while maintaining in-distribution accuracy. The approach is evaluated on DevDataBench, a multilingual benchmark of 10,000 development indicators with LLM-generated queries, where a 118M-parameter fine-tuned model outperforms zero-shot baselines like text-embedding-3-large (0.707 vs. 0.556 nDCG@10), particularly in low-resource languages.
permutation-invariant fine-tuning · structured metadata retrieval · field order robustness · multilingual benchmark · in-context learning
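The core augmentation can be sketched as a serializer that shuffles fields before each training example is embedded. Two things here are assumptions: the summary does not say what kind of dropout is used, so this sketch interprets it as randomly dropping whole fields, and the `key: value | key: value` format and field names are invented for illustration.

```python
import random


def serialize(record, rng, field_dropout=0.1):
    """Serialize a metadata record with fields in random order.

    Each field is independently dropped with probability field_dropout
    (an assumed reading of "dropout"); at least one field is always kept.
    """
    items = list(record.items())
    rng.shuffle(items)
    kept = [(k, v) for k, v in items if rng.random() >= field_dropout]
    if not kept:
        kept = items[:1]
    return " | ".join(f"{k}: {v}" for k, v in kept)


# Hypothetical development-indicator record.
record = {
    "name": "Access to electricity (% of population)",
    "topic": "Energy",
    "unit": "percent",
    "source": "household surveys",
}
rng = random.Random(0)
for _ in range(3):
    print(serialize(record, rng))
```

Seeing the same record under many orderings during fine-tuning is what would push the embedding model to stop relying on field position.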
Collective cooperation without individual fidelity in LLM agents
The study evaluates the fidelity of LLM agents as proxies for human decision-making in social simulations by comparing their behavior in a networked Prisoner's Dilemma experiment against human data. Nine open-weight LLMs were tested using identical interaction protocols, payoff structures, and network topologies. While LLMs reproduced macro-level cooperation dynamics, including early decline and later stabilization, they underestimated individual-level heterogeneity and exhibited different conditional cooperation patterns. Introducing random agents improved micro-level agreement but did not align decision rules with human behavior. The findings highlight a macro-micro dissociation in LLM-based social agents, emphasizing the need for multi-faceted validation beyond aggregate outcomes.
llm agents · prisoner's dilemma · cooperation dynamics · individual heterogeneity · decision rules
The FIL Hypothesis: Inductive Biases Help with Kernel Engineering
The paper challenges the Bitter Lesson by introducing the Feedback Information Loop (FIL) hypothesis, which identifies feedback latency as a critical scaling dimension for AI systems. It argues that future applications in science and physical-world domains will involve FILs ranging from hours to weeks, rendering purely data-driven methods impractical. The authors propose an alternative approach incorporating human-inspired inductive biases to constrain the solution space. Initial validation on GPU programming tasks demonstrates superior performance over data-driven methods, with code released publicly.
feedback information loop · inductive biases · bitter lesson · scaling dimension · gpu programming
Translating Natural Language to Strategic Temporal Specifications via LLMs
The authors introduce a framework for translating natural language descriptions of strategic requirements into ATL/ATL* formulas using Large Language Models (LLMs), addressing the challenge of formalizing Multi-Agent System specifications. They create an expert-validated dataset for training and evaluation, as no existing dataset supports this task. Fine-tuned open-weight models (3-7B parameters) achieve 0.84 semantic accuracy, comparable to 0.86 for proprietary few-shot baselines, while maintaining on-premises requirements. Judge reliability inversely correlates with generator strength, with Llama-3.3-70B tracking human verdicts most closely. The tool integrates with a strategic logics model checker, enabling non-experts to specify properties in natural language.
multi-agent systems · atl/atl* formulas · semantic accuracy · llm judge · strategic logics
Transformer Architectures as Complete Bayes Processes: A Formal Proof in the Measure-Theoretic Kernel Framework
The paper provides a formal proof that transformer architectures implement exact Bayesian posterior inference when their internal update mechanisms satisfy a Bayes joint-distribution condition. Using a measure-theoretic kernel framework, the authors define a hierarchy of abstractions—from core Bayesian transformers to multilayer stacks—and prove that the update kernel equals the posterior almost everywhere at each level. The proof includes deriving the explicit Bayes formula through Radon-Nikodym differentiation and demonstrating that softmax attention induces a valid probability distribution over keys. The framework establishes conditions under which transformer blocks are provably Bayesian, linking abstract kernel theory to concrete attention mechanisms.
bayesian inference · measure-theoretic kernel · radon-nikodym differentiation · softmax attention · markov kernel
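One claim in the summary above, that softmax attention induces a valid probability distribution over keys, can be checked numerically. The sketch below is only an illustration of that single property, not the paper's measure-theoretic proof; the function names and the toy query and key vectors are assumptions made for this example.

```python
import math

def softmax(scores):
    """Return softmax weights for a list of real-valued attention scores."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product scores of one query against each key, then softmax."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical 2-dimensional query and three keys.
weights = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

# Every weight is strictly positive and the weights sum to one,
# so they form a probability distribution over the keys.
is_distribution = all(w > 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-12
```

Being a distribution over keys is necessary but not sufficient for the Bayesian reading; per the summary, the update kernel must also satisfy the joint-distribution condition.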
Beyond Point Estimates for Glaucoma Visual Field Forecasting with Diffusion Models
This work introduces conditioned denoising diffusion models for probabilistic forecasting of glaucoma visual fields (VFs), addressing the limitations of deterministic predictions in representing disease progression uncertainty. The method generates distributions of plausible future VFs from longitudinal observations with irregular follow-up intervals, enabling uncertainty-aware risk assessment. Evaluated on two independent VF cohorts, the approach produces well-calibrated distributions for clinically relevant VF measures and achieves state-of-the-art accuracy when reduced to point estimates, outperforming clinical baselines and prior learning-based methods. The results advocate for a shift toward distributional modeling in glaucoma monitoring and treatment planning.
denoising diffusion models · visual fields · glaucoma · probabilistic forecasting · uncertainty-aware
Can LLMs Rank? A Tale of Triads and Triage
The paper introduces a dual-metric framework for assessing LLM consistency in high-stakes ranking tasks, combining classical social choice theory with modern LLM evaluation. It proposes using the coefficient of consistency (ζ) for intra-run circular triad analysis and Kendall's τ for inter-run ranking distance, demonstrating their complementary value through homelessness allocation and emergency triage case studies. Experiments reveal significant performance variation across three leading LLMs, with guidelines for practical consistency assessment before deployment.
large language models · ranking consistency · circular triads · kendall's tau · social choice theory
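The two metrics named in this summary have standard textbook definitions, sketched below under assumptions: a complete, tie-free set of pairwise preferences for the intra-run check, and two tie-free rankings of the same items for the inter-run check. The helper names and toy data are hypothetical. The code computes Kendall's τ in its correlation form, while the summary speaks of a ranking distance, so the paper's exact variant may differ.

```python
from itertools import combinations

def circular_triads(wins):
    """Count circular triads; wins[i][j] is True when item i beat item j."""
    count = 0
    for i, j, k in combinations(range(len(wins)), 3):
        # A triad is circular when its three preferences form a cycle.
        forward = wins[i][j] and wins[j][k] and wins[k][i]
        backward = wins[j][i] and wins[k][j] and wins[i][k]
        if forward or backward:
            count += 1
    return count

def consistency_coefficient(wins):
    """Kendall's coefficient of consistency: 1 means no circular triads."""
    n = len(wins)  # assumes n >= 3
    # Maximum possible number of circular triads for odd and even n.
    max_d = (n**3 - n) / 24 if n % 2 else (n**3 - 4 * n) / 24
    return 1 - circular_triads(wins) / max_d

def kendall_tau(order_a, order_b):
    """Kendall's tau between two tie-free orderings of the same items."""
    pos_a = {item: r for r, item in enumerate(order_a)}
    pos_b = {item: r for r, item in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Toy data: A beats B, B beats C, C beats A (one cycle), and a transitive set.
cycle = [[False, True, False], [False, False, True], [True, False, False]]
transitive = [[False, True, True], [False, False, True], [False, False, False]]
```

The first function inspects one run's pairwise judgments for internal contradictions; the last compares the final orderings from two separate runs, which is why the summary describes the metrics as complementary.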
Beyond IID: How General Are Tabular Foundation Models, Really?
The paper introduces BeyondArena, a unified benchmark for evaluating tabular foundation models across diverse task types (IID, temporal, grouped), dataset scales, and feature types. It also presents Data Foundry, a Python framework for curating tabular datasets. Evaluations on 11 models and 142 datasets reveal that existing foundation models perform well on small to medium IID data but are outperformed by traditional tree-based and deep learning models on non-IID, large, or high-dimensional datasets. The benchmark aims to guide research toward more challenging scenarios in tabular data modeling.
tabular foundation models · iid data · benchmarking · data curation · non-iid challenges
ENC-ODE: Event-level Neurodegenerative Modeling in Continuous Time with Neural ODEs
ENC-ODE introduces a neural ODE framework for continuous-time modeling of neurodegenerative disease progression, addressing sparse and irregular longitudinal biomarker data. The method employs diagnosis-conditioned dynamics and target-conditioned attention to predict future biomarker evolution without history compression. Evaluated on the ADNI dataset, ENC-ODE outperforms sequence models, providing a scalable solution for clinical support in Alzheimer's disease management.
neurodegenerative modeling · neural odes · biomarker prediction · continuous-time dynamics · attention mechanism
Model Predictive Current Control with Harmonic Correction for Single-Phase AC-DC EV Charging
A duty cycle predictive Model Predictive Current Control (MPCC) with real-time harmonic estimation is proposed to improve current quality in single-phase AC-DC EV charging. The method dynamically estimates low-order harmonic components of input current, corrects MPCC reference current, and enables continuous duty cycle control for targeted harmonic suppression. Compared to switching state predictive MPCC, the proposed approach reduces steady-state current THD_i from 11.47% to 6.10%, and further to 2.85% with harmonic reference correction, addressing limitations from dead time, control delay, and model parameter mismatch.
model predictive current control · harmonic estimation · electric vehicle charging · power factor correction · total harmonic distortion
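For readers unfamiliar with the THD figures quoted above, the following sketch shows the usual definition of current total harmonic distortion: the RMS of the harmonic content divided by the RMS of the fundamental. The function names and the harmonic amplitudes are illustrative assumptions and are not taken from the paper.

```python
import math

def thd_percent(fundamental_rms, harmonic_rms):
    """THD in percent: RMS of all harmonics relative to the fundamental RMS."""
    distortion = math.sqrt(sum(h * h for h in harmonic_rms))
    return 100.0 * distortion / fundamental_rms

def relative_reduction(before, after):
    """Fractional reduction between two THD values."""
    return (before - after) / before

# Hypothetical current: 10 A fundamental with 0.3 A and 0.4 A harmonics.
example_thd = thd_percent(10.0, [0.3, 0.4])

# Relative drop between the two THD_i values reported in the summary.
reported_drop = relative_reduction(11.47, 2.85)
```

Applied to the reported values, the drop from 11.47% to 2.85% removes roughly three quarters of the distortion.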
A Stochastic-Geometric Theory of Scaling Laws in Grokking
The paper presents a stochastic-geometric theory explaining grokking (delayed generalization) in neural networks, attributing it to a shell-core topological configuration in the solution space induced by Adam optimization with weight shrinkage. The analysis reveals that random initializations concentrate on an outer shell, memorization solutions on an inner shell, and generalization solutions in the core. Using stopping-time theory, the authors derive scaling laws for learning rate, batch size, and ℓ2 regularization, validated empirically and consistent with prior work.
grokking · adam optimization · stopping-time theory · scaling laws · ℓ2 regularization
Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents
The paper introduces PrincipalBench, a 75-item multi-turn benchmark, and two mechanisms for ensuring multi-party loyalty in LLM agents, where agents must balance principal loyalty with counterparty interactions. PrincipalBench employs leak probes, dual judges, and an integrity-audit gate to evaluate 13 frontier models, revealing a sharp split between selective (≤20% harm) and over-refusing clusters (53.6-75.3% harm). A prompt-time loyalty scaffold reduces harm to 19.4% in Claude-Sonnet, while a per-token-KL distillation recipe transfers knowledge from Qwen3-32B to 8B Qwen3 and Llama-3.1. Both mechanisms operate along a leak/over-refusal trade-off, unable to achieve jointly favorable outcomes.
multi-party loyalty · leak probes · kl distillation · integrity-audit gate · prompt-time scaffold
Set-Inclusive Uncertainty Modeling for Robust Brain Tumor Segmentation
The authors propose a probabilistic representation framework for robust brain tumor segmentation under missing MRI modalities, addressing intrinsic uncertainty from information loss. The method models representations as Gaussian distributions, where the mean encodes task information and variance quantifies uncertainty. A regularization strategy aligns partial modality means with full-modality counterparts while scaling variance by their discrepancy. A set-inclusive strategy leverages hierarchical modality subsets with ordering constraints for consistent uncertainty relationships. Experiments on BraTS 2018 and 2020 demonstrate superior performance across diverse missing-modality scenarios compared to baselines.
probabilistic representation · gaussian distributions · missing modalities · brain tumor segmentation · uncertainty modeling
Using Large Language Models as Low-Cost Statistical Estimators for Human-Response Data
The paper establishes that pretrained large language models (LLMs) serve as risk-equivalent estimators for conditional expectations under squared loss, achieving restricted functional risk equivalence with Bayes-optimal risk for conditional-mean-dependent inference. The authors formalize LLMs as misspecified functional estimators, decomposing error into representation bias and optimization error, and prove convergence to irreducible population variance plus squared representation bias under mild regularity conditions. They derive finite-sample bounds and a calibration protocol, showing LLMs can replace human experiments for near-optimal statistical inference when conditions are met.
risk equivalence · conditional expectations · representation bias · pinsker inequality · le cam deficiency
ReactiveBFM: Reactive Closed-Loop Motion Planning Towards Universal Humanoid Whole-Body Control
ReactiveBFM introduces a real-time closed-loop planning-control framework for humanoid robots, addressing limitations of Behavior Foundation Models (BFMs) in reactive whole-body coordination. The method employs a scheduled prefix sampling curriculum to mitigate exposure bias and an asynchronous replanning mechanism to reconcile latency mismatches between planning and tracking. Trajectory chunking ensures spatio-temporally fluid execution. Deployed on the Unitree G1 humanoid, ReactiveBFM demonstrates zero-shot moving target reaching and achieves a 93.1% success rate in sim-to-sim benchmarking under severe perturbations, outperforming open-loop baselines by 28.6%.
behavior foundation models · exposure bias · closed-loop planning · asynchronous replanning · trajectory chunking
Residual-Guided Expert Specialization for Incomplete Multimodal Learning
The paper proposes MARS, a mixture-of-experts framework for incomplete multimodal learning that leverages representational deviations caused by missing modalities. The method uses a privileged residual signal derived from complete-incomplete representation contrasts to guide expert specialization via a residual router, while a feature router imitates this behavior for deployment. Discrepancy-aware noise regularization mitigates train-test router gaps. Experiments on multimodal classification (CASIA-SURF, CREMA-D, UPMC Food-101) and segmentation (MCubeS) demonstrate consistent improvements over baselines while maintaining efficiency and backbone compatibility.
incomplete multimodal learning · mixture-of-experts · residual signal · discrepancy-aware regularization · representation deviation
FFAvatar: Feed-Forward 4D Head Avatar Reconstruction from Sparse Portrait Images
FFAvatar introduces a Transformer-based 3D Gaussian framework for reconstructing animatable 4D head avatars from sparse portrait images, supporting incremental refinement with additional inputs. The method employs an alternating attention mechanism to disentangle identity appearance from expression/viewpoint variations, coupled with a sparse-to-dense learning paradigm that first captures coarse features via FLAME-anchored primitives before UV-domain densification. A motion refinement module models residual motion for subject-specific dynamics. Experiments show FFAvatar achieves high-fidelity, identity-consistent rendering with superior flexibility and driving efficiency compared to existing approaches.
4d avatar reconstruction · transformer-based 3d gaussian · alternating attention mechanism · sparse-to-dense learning · flame parametric model
DRIFT: Difficulty Routing Self-DIstillation with Rhythm-Gated Exploration and Success BuFfer Training
The paper introduces DRIFT, a self-evolution policy optimization framework for large language models that enables stable self-improvement without external supervision. DRIFT combines Difficulty Routing, which dynamically allocates self-distillation and reinforcement learning signals based on problem-level learning states, with Rhythm Gating, which focuses token-level exploration on critical reasoning positions. It also incorporates a success buffer and two-stage curriculum learning to preserve high-quality experience and guide policy evolution. Evaluated across five benchmarks and three model scales, DRIFT achieves a 79.5% average score, outperforming GRPO by 9.5% and SDPO by 7.5%, and reaches 79.2% accuracy on ToolUse.
self-distillation · reinforcement learning · curriculum learning · policy optimization · reasoning tasks
Early Cue Precision Shapes Visual Shortcut Learning in Controlled Cue-Manipulation Benchmarks
The study demonstrates that early cue precision critically influences visual shortcut learning, showing that degraded-but-predictive inputs cannot substitute for proper cue decorrelation. Through controlled experiments on synthetic shape-texture tasks, sequential digit training, and CIFAR-10 benchmarks, the authors manipulate object-texture match probability and evaluate accuracy under conflict and suppression. Results reveal that low early cue precision improves pre-target conflict behavior (e.g., conflict accuracy drops from 0.589 to 0.005 in digit probes), but shortcut-rich fine-tuning can rapidly erase this benefit, necessitating sustained cue decorrelation during downstream adaptation.
visual shortcut learning · cue precision · cue decorrelation · conflict accuracy · texture-overlay benchmark
Sequential Fairness Auditing with Limited Output Access
The paper introduces a sequential fairness auditing framework for AI systems under limited output access, addressing the practical constraints faced by independent auditors. It formulates fairness auditing as a tolerance-aware sequential hypothesis-testing problem, employing a generalized likelihood-ratio framework to accumulate evidence from a finite audit pool. The framework is instantiated for Statistical Parity and Equal Opportunity audits, and extended to score- and logit-based proxy audits when richer observables are available. Results demonstrate that both fairness metrics and model access levels significantly impact audit efficiency, with richer outputs reducing query requirements in certain settings but offering limited gains near thresholds.
sequential hypothesis-testing · fairness auditing · statistical parity · equal opportunity · generalized likelihood-ratio
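The two fairness quantities being audited can be stated concretely. The sketch below computes only the point estimates of the Statistical Parity and Equal Opportunity gaps from label-only outputs; it is not the paper's sequential generalized likelihood-ratio test, and the function names, group labels, and toy audit pool are assumptions made for illustration.

```python
def positive_rate(preds, groups, group):
    """Share of positive predictions among members of one group."""
    selected = [p for p, g in zip(preds, groups) if g == group]
    return sum(selected) / len(selected)

def statistical_parity_gap(preds, groups, a, b):
    """Absolute difference in positive-prediction rates between groups a and b."""
    return abs(positive_rate(preds, groups, a) - positive_rate(preds, groups, b))

def equal_opportunity_gap(preds, labels, groups, a, b):
    """Absolute difference in true-positive rates between groups a and b."""
    # Restrict to truly positive examples, then compare positive rates.
    kept = [(p, g) for p, y, g in zip(preds, labels, groups) if y == 1]
    kept_preds = [p for p, _ in kept]
    kept_groups = [g for _, g in kept]
    return statistical_parity_gap(kept_preds, kept_groups, a, b)

# Hypothetical audit pool of eight queries, four per group.
preds = [1, 1, 0, 1, 0, 0, 1, 0]
labels = [1, 1, 1, 0, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

sp_gap = statistical_parity_gap(preds, groups, "a", "b")
eo_gap = equal_opportunity_gap(preds, labels, groups, "a", "b")
```

A tolerance-aware audit would then ask whether such a gap exceeds an allowed threshold, accumulating evidence query by query rather than from a fixed batch as here.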
BayesEvolve: Explicit Belief States for Autonomous Scientific Discovery
The paper introduces BayesEvolve, a belief-guided discovery framework that maintains explicit, uncertainty-aware beliefs about hypothesis quality to improve autonomous scientific discovery. The method converts experimental evidence into predictive belief states, guiding future experimentation more effectively than memory- or archive-based approaches. Evaluated on shifted BBOB-style black-box optimization tasks, BayesEvolve demonstrates superior sample efficiency, predictive accuracy on held-out candidate pools, and productive late-stage exploration under fixed evaluation budgets.
bayesevolve · belief state · black-box optimization · sample efficiency · uncertainty-aware
MCP Server Architecture Patterns for LLM-Integrated Applications
The paper contributes a taxonomy of five architectural patterns (Resource Gateway, Tool Orchestrator, Stateful Session Server, Proxy Aggregator, Domain-Specific Adapter) for Model Context Protocol (MCP) servers in LLM-integrated applications, derived from analyzing fifteen production and public servers. Methods include structured pattern documentation à la Gamma et al., quantitative evaluation of inter-rater reliability (κ=0.76), transport overhead measurement, and tool-selection accuracy studies. Results show pattern-boundary ambiguities, tool-selection accuracy drops below 90% at 10-15 tools for Claude Haiku 4.5 and 20-30 tools for Sonnet 4, with replication materials released.
model context protocol · architectural patterns · tool orchestration · inter-rater reliability · transport overhead
Always-On Agents: A Survey of Persistent Memory, State, and Governance in LLM Agents
The survey introduces a framework for analyzing always-on agents—LLM-based systems with persistent state across interactions—through six diagnostic axes (authority, scope, mutability, provenance, recoverability, actionability) and a state lifecycle. It analyzes 435 works, revealing a focus on state accumulation/retrieval over governance/recovery. The Always-On Evaluation Protocol (AOEP-v0) is proposed to assess state mutation and recovery obligations, linking the field to databases, distributed systems, and machine unlearning.
always-on agents · persistent-state systems · state lifecycle · aoep-v0 · machine unlearning
Research Entity Extraction and Topic Detection from UKRI Grant Proposals
The study compares GPT-4o, Mistral, and DSIT-Taxonomies for extracting and classifying research entities from UKRI grant proposals to detect emerging research areas. A three-stage pipeline used Mistral for entity extraction and OpenAlex Topics taxonomy mapping, evaluated on 42 proposal abstracts. Mistral and GPT-4o showed comparable performance with high semantic overlap, outperforming DSIT-Taxonomies; Mistral achieved 90.5% topic classification accuracy versus 71.4% for DSIT-Taxonomies, demonstrating efficiency for sensitive data analysis.
entity extraction · topic detection · llm comparison · openalex taxonomy · grant proposals
ManimAgent: Self-Evolving Multimodal Agents for Visual Education
ManimAgent introduces a self-evolving multimodal agent that transfers reflection experience across tasks via a dual-channel Episodic Memory Bank, eliminating the need for weight updates or human seeds. The agent generates Python code using the Manim library to render mathematical animations from scientific paper sections, with a vision-language model scoring rendered keyframes to populate positive (Reference Examples) and negative (Known Pitfalls) memory channels. Evaluations show that increasing memory size improves blind human Pass@1 rates and reduces reflection rounds compared to no-memory, retrieval-augmented generation, and shuffled-memory baselines.
episodic memory bank · manim library · multimodal agent · reference examples · known pitfalls
Rehearsed Multi-Agent Live Product Demonstrations with Real-Time Voice Question Answering
The paper introduces Rhetor, a multi-agent system for automated live product demonstrations with real-time voice QA. The system combines UI exploration and source-code analysis into a cross-modal feature representation with focus tiers, employs a grounded scripter with semantic locators, and ensures synchronization between browser actions and narration via a rehearsal loop. Evaluated on four applications (including Excalidraw), the system achieves locator-firing rates (σ̄) of 0.31-1.00 across 147 actions, with 0.92 σ̄ for complex workloads. A benchmark protocol with ten metrics is proposed for broader validation.
multi-agent system · cross-modal representation · semantic locators · rehearsal loop · synchronization invariant
PromptGNN-sim: Deep Fusion and Alignment of GNN and LLMs for Text-Attributed Graph Learning
PromptGNN-sim introduces a bi-directional fusion framework for text-attributed graph learning, integrating Graph Attention Networks (GAT) and Large Language Models (LLMs) through structure-semantic collaboration. The method employs semantically aware neighborhood selection via GAT, generates structure-aware LLM prompts (node summaries, labels, keywords), and jointly optimizes components using cross-modal contrastive learning and cross-attention. Evaluations on Cora, Pubmed, and WikiCS demonstrate superior performance over GNNs, LLMs, and fusion baselines in accuracy, generalization, and robustness under cross-task and sparse scenarios.
text-attributed graphs · graph attention network · cross-modal contrastive learning · structure-semantic fusion · bi-directional alignment
Towards Continual Motion-Language Agents: LoRA Variants for Incremental Motion Understanding and Generation
The paper introduces continual learning variants of low-rank adaptation (LoRA) for bidirectional motion-language agents, enabling incremental acquisition of new motion concepts without catastrophic forgetting. The method employs mixture-of-experts architectures with an autoencoder-based router for task-specific expert selection at inference, eliminating need for task labels. Evaluated on a five-task HumanML3D benchmark, results show near-zero forgetting in both motion-to-text and text-to-motion tasks, with hard expert routing outperforming soft blending and revealing token-level vs. generation quality discrepancies.
continual learning · low-rank adaptation · motion-language agents · mixture-of-experts · catastrophic forgetting
Defending Against Harmful Supervision Hidden in Benign Samples
We propose Dual-Reference SFT (DR-SFT), a defense against Embedded Attack, where harmful QA pairs are embedded within benign training samples. DR-SFT adapts DPO-style contrastive objectives to supervised fine-tuning (SFT) through token-level regularization, mitigating harmful fine-tuning beyond coarse data filtering. Experiments show that representative guardrails often fail to detect Embedded Attacks at the example level, while DR-SFT effectively counters this threat by preventing harmful supervision from being learned during fine-tuning.
embedded attack · dual-reference sft · token-level regularization · harmful supervision · contrastive objectives
KnowsTFM: Knowledge-Informed Fine-Tuning of Small Tabular Foundation Models
The paper introduces KnowsTFM, a method for knowledge-informed fine-tuning of small tabular foundation models (TabPFN/TabICL variants) in niche domains with scarce data. It combines structural attention priors from knowledge graphs with parameter-efficient low-rank updates during adaptation. Results show meaningful performance gains over vanilla models in specialist settings (where pretraining distributions differ), while general-domain tasks see marginal improvements. The study also identifies catastrophic forgetting risks during continual fine-tuning of frontier models.
tabular foundation models · knowledge graphs · parameter-efficient fine-tuning · attention priors · catastrophic forgetting
EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots
EMPATH introduces a multilingual benchmark for evaluating safety in emotional-support chatbots, addressing limitations of fixed-prompt approaches. The method employs an auditor model to generate multi-turn conversations from 140 seed instructions and 34 personas, with a judge model scoring transcripts across 19 metrics in five dimensions (crisis handling, therapeutic quality, etc.). Results show score inflation on 10 metrics under standard rubrics, with model-specific divergences up to six points; cross-family judge agreement reaches 93% within ±1 score, while run-to-run reliability varies significantly across models.
emotional-support chatbots · multilingual benchmark · auditor-judge framework · safety evaluation · conversational integrity
Inoculation Adapters: Improved Selective Generalization of Capabilities with Fewer Surprising Backdoors
The paper introduces inoculation adapters (IA), a method to suppress undesired traits in AI models while preserving desired capabilities. IAs are LoRAs trained in three stages: exposure to undesired traits, frozen integration during task adapter training, and final deployment without the IA. Evaluated across six model families, IAs outperform inoculation prompting in suppressing emergent misalignment and avoiding backdoors, though both methods struggle with consistent retention of desired traits. The technique reduces optimization pressure toward undesired behaviors without relying on prompt-based elicitation.
inoculation adapters · emergent misalignment · lora · selective generalization · backdoors
Curvature-Guided Sheaf Diffusion for Unsupervised Community Detection on Heterophilic Graphs
Curvature-Guided Sheaf Diffusion (CGSD) introduces a fully unsupervised community-detection algorithm for heterophilic graphs, leveraging discrete Forman-Ricci curvature as a topological signal. The method comprises three components: (i) a curvature-gated sheaf-diffusion encoder trained with label-free structural losses, (ii) a curvature-aware spectral clusterer (CSpec) that re-weights k-NN affinity, and (iii) a unified evaluation against nine unsupervised baselines. CGSD outperforms baselines on heterophilic benchmarks Wisconsin and Chameleon, achieving a 15% improvement in mean NMI over K-Means (0.091 to 0.107, p=0.008). The interpretable mechanism separates intra- and inter-community curvature distributions.
heterophilic graphs · forman-ricci curvature · sheaf diffusion · spectral clustering · unsupervised learning
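To make the curvature signal less abstract, here is a minimal sketch of the simplest combinatorial Forman-Ricci curvature on an unweighted graph, where an edge (u, v) has curvature 4 - deg(u) - deg(v). This basic form ignores edge weights and triangle contributions; the summary does not say which variant CGSD uses, so treat the formula choice, the function name, and the toy graph as assumptions.

```python
from collections import defaultdict

def forman_ricci(edges):
    """Simple Forman-Ricci curvature per edge of an unweighted, undirected graph."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Edges between high-degree endpoints receive more negative curvature.
    return {(u, v): 4 - degree[u] - degree[v] for u, v in edges}

# Hypothetical graph: two triangles joined by the single bridge edge (c, d).
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("c", "d"),
    ("d", "e"), ("e", "f"), ("d", "f"),
]
curvature = forman_ricci(edges)
bridge_edge = min(curvature, key=curvature.get)
```

In this toy graph the bridge between the two triangles gets the most negative value, which illustrates how curvature can distinguish inter-community edges from intra-community ones.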
Clarus: Coordinating Autonomous Research Agents toward Web-Scale Scientific Collaboration
Clarus introduces a collaboration infrastructure for coordinating autonomous research agents in web-scale scientific endeavors, shifting from code-centric execution to research-oriented collaboration processes. The system organizes scientific collaboration across four layers—Research Application, Digital Collaboration, Physical Substrate, and Physical World—using a minimal project-agent-resource object model. Core modules are implemented as pluggable mechanisms, enabling adaptation to task risk, collaboration structure, and resource constraints. A controlled paper-generation case study demonstrates Clarus's ability to structure research goals into traceable, reviewable, attributable, and accumulative collaboration networks. The framework supports open, auditable, and resource-aware multi-phase collaboration processes, providing a foundation for open research networks.
autonomous research agents · collaboration infrastructure · pluggable mechanisms · resource-aware · multi-phase collaboration
EvalSafetyGap: A Hybrid Survey and Conceptual Framework for LLM Evaluation-Safety Failures
The paper introduces EvalSafetyGap, a conceptual framework for analyzing discrepancies between evaluation metrics and latent safety properties in LLMs under optimization pressure. Combining a hybrid survey (systematic review + grey evidence synthesis) with a 10-model audit, it examines eight evidence streams including benchmark validity, jailbreak robustness, and mechanistic interpretability from 2018-2026. Results show weak correlation between capability and adversarial robustness (r=0.232, p=0.520), with safety gaps primarily governance-driven and sensitive to measurement protocols. The work provides standardized constructs (Instability Decomposition, Alignment Trilemma) and diagnostic tools for dynamic evaluation and alignment auditing.
eval-safety gap · goodhart's law · instability decomposition · alignment trilemma · jailbreak robustness
Efficient RGB-T Object Detection via Sparse Cross-Modality Fusion
The paper proposes a sparse cross-modality fusion mechanism for efficient RGB-T object detection, addressing computational inefficiency in existing dual-backbone methods. The two-stage framework first uses lightweight modality-specific detectors to generate high-recall region proposals, then applies feature fusion only to sparse foreground regions for refinement. Experiments demonstrate competitive accuracy with significantly fewer parameters and lower computational cost, while maintaining scalability to high-resolution images.
rgb-t detection · sparse fusion · cross-modality · two-stage framework · computational efficiency
A Multi Center Breast FNAC Whole-Slide Cytology Dataset for AI-Assisted Patch-Wise Classification Using C1 to C5 Reporting Categories
The authors introduce a multi-center breast fine needle aspiration cytology (FNAC) dataset for AI-assisted patch-wise classification, comprising 470 whole-slide images (WSIs) from 321 patients across Indian tertiary medical centers. The dataset includes 7,398 PNG image patches extracted from 446 annotated WSIs, labeled using C1 to C5 reporting categories and stained with Papanicolaou or May-Grünwald Giemsa. Images were scanned at 40x magnification (0.25 microns per pixel) using a Hamamatsu NanoZoomer S360 and stored in NDPI format. The release provides NDPI WSIs, GeoJSON annotations, extracted patches, metadata, and inspection code, totaling approximately 950 GB and accessible via Zenodo.
fine needle aspiration cytology · whole-slide images · patch-wise classification · papanicolaou staining · geojson annotations
The Many-Body Problem of the Data Centre
The paper reframes modern AI's embodiment through data centers, arguing they serve as AI's biological-like bodies while simultaneously functioning as capital's laboring organs. It develops an organic analogy to analyze data centers as non-unique, universal embodiments that process human-desire-born data without intrinsic desires. The analysis reveals a many-body problem in this distributed computational embodiment and demonstrates how capital equates artificial and human intelligence through market pricing mechanisms, enabling cross-domain intelligence valuation.
data center embodiment · many-body problem · organic analogy · computational labor · intelligence valuation
Forewarned is Forearmed: When Non-Sequential Embedding Turns Into an Anomaly Detector
The paper introduces a novel anomaly detection method leveraging non-sequential multimodal sentence-level embeddings, particularly in the SONAR model. It identifies embedding dimensions sensitive to perturbations, using consistency between successive encoding and decoding processes to build an accurate detector. The authors also explore modifying specific dimensions to correct anomalies. This approach enhances the reliability of multimodal representations by emphasizing the importance of embedding analysis.
non-sequential embeddings · multimodal representations · anomaly detection · sonar model · embedding dimensions
Domain Adaptation with Adaptive Imagination for Visual Reinforcement Learning under Limited Target Data
We propose AIDA (Adaptive Imagination for Domain Adaptation), a domain adaptation framework for visual reinforcement learning that addresses sim-to-real transfer under scarce target data without additional environment interaction. AIDA employs adaptive imagination, generating reliable rollouts via a distribution-shift-aware discriminator that truncates low-confidence transitions, and introduces a self-consistency loss that cycles through state-image-state to penalize reconstruction discrepancies. Experiments on five MuJoCo tasks and two Gymnasium-Robotics tasks demonstrate that AIDA effectively truncates unreliable rollouts, learns semantically meaningful state representations, and outperforms baselines.
domain adaptation · sim-to-real transfer · adaptive imagination · self-consistency loss · visual reinforcement learning
From Detecting Agency to Doing Work: Self-Caused Credit Builds a Durable Behavioral Self in a Minimal Spiking Agent
The study demonstrates that agency-gated slow credit (Own*Agency*Salience) enables durable behavioral self-shaping in spiking neural agents, contrasting with transient agency detection. Using Nengo LIF/PES networks, the authors show this mechanism produces post-unload behavioral retention (retained fraction 0.96) that collapses when slow decoders reset or agency gating is removed. Experiments across 24D control tasks and sequential learning (8 tasks) confirm the necessity of slow self-credit for durable behavior (final accuracy 0.88 vs 0.00 baselines) and interference resistance, formalized as an operational behavioral self without consciousness claims.
agency-gated credit · spiking neural networks · behavioral residue · slow parameter update · self-preservation
Few-Shot Domain Incremental Learning via Continual Vision-Language Consolidation
The paper introduces Continual Vision-Language Consolidation (CVLC), a novel algorithm addressing few-shot domain incremental learning (FSDIL) under extreme data scarcity. CVLC employs latent space reservation in the base domain and dual coalescent projection (DCP) for parameter-efficient fine-tuning. It calibrates vision prototypes, generates language prototypes via LLMs, and fuses them for adaptation to new domains. Structured with shared and domain-specific components, CVLC combines general knowledge and domain-specific details. Evaluations on benchmark problems show CVLC outperforms prior methods by up to 16%.
domain-incremental learning · few-shot learning · latent space reservation · dual coalescent projection · vision-language fusion
Dynamo: Dynamic Skill-Tool Evolution for Vision-Language Agents
Dynamo introduces a training-free framework for enhancing vision-language models (VLMs) through dynamic skill-tool evolution, eliminating the need for retraining or manual prompt engineering. The method autonomously generates reusable reasoning skills and executable visual tools by analyzing correct and incorrect attempts on a small labeled subset, storing these capabilities in a persistent library. Evaluated across four visual reasoning benchmarks and five VLM backbones, Dynamo improves direct inference accuracy by an average of +5.6%, achieves optimal tool invocation when tools are pre-specified, and bridges 65–99% of the performance gap compared to task-specific RL methods at reduced computational cost.
vision-language models · dynamic skill evolution · visual reasoning · training-free adaptation · persistent capability library
MirrorCode: AI can rebuild entire programs from behavior alone
(No summary returned.)
Beyond Drug Discovery: The Nanotechnology Molecular Optimization (NMO) Benchmark
The Nanotechnology Molecular Optimization (NMO) Benchmark is introduced to bridge machine learning and quantum materials science, addressing limitations in transferability from drug discovery-focused benchmarks. NMO employs quantum simulations instead of proxy oracles and implements strict protocols to prioritize scientific utility over leaderboard overfitting. It imposes hard structural constraints and rugged fitness landscapes, challenging existing generative models. A novel baseline method is developed, incorporating a structural constraint representation and domain-agnostic pretraining to mitigate pharmaceutical dataset bias. Results surpass state-of-the-art physical properties and uncover new structural motifs, demonstrating ML's potential for genuine scientific discovery in nanotechnology.
quantum simulations · structural constraints · fitness landscapes · domain-agnostic pretraining · nanotechnology
Federated Learning with Energy-Based Structured Probabilistic Inference
The paper proposes a federated learning framework that improves client aggregation weights using Conditional Random Fields (CRFs). The method models client-specific reliability via unary potentials and client interactions via pairwise potentials, enabling optimized weight assignment during global model updates. Experiments demonstrate consistent performance gains over standard federated learning baselines in non-IID data settings.
federated learning · conditional random fields · non-iid data · client aggregation · probabilistic inference
Physically-Constrained Harmonic Separation for Robust Heart and Respiratory Rate Estimation from Wrist Photoplethysmography
The authors propose Physically-Constrained Harmonic Separation (PCHS), a novel framework for robust heart rate (HR) and respiratory rate (RR) estimation from wrist photoplethysmography (PPG) under motion artifacts. PCHS formulates HR/RR estimation as an analysis-by-synthesis problem, using accelerometer measurements to condition artifact separation rather than direct regression. A physics-guided harmonic generator decomposes the PPG signal into quasi-periodic physiological components and motion-related residuals, enabling HR recovery from fundamental frequency and RR prediction from respiratory-driven harmonic modulations. Experiments on the motion-intensive PPG-DaLiA dataset show PCHS outperforms state-of-the-art methods while providing interpretable signal decompositions that disentangle physiological activity from motion artifacts.
photoplethysmography · harmonic separation · analysis-by-synthesis · physiological components · motion artifacts
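The step "HR recovery from fundamental frequency" reduces to bpm = 60 × f0. A minimal sketch of that relationship follows, using a plain spectral-peak estimate on a synthetic signal. This is not the PCHS harmonic generator and does no motion-artifact separation; the function name, the 0.7 to 3.5 Hz cardiac band, and the synthetic signal are all assumptions for illustration.

```python
import numpy as np

def estimate_hr_bpm(signal, fs, band=(0.7, 3.5)):
    """Estimate heart rate as 60 * f0, where f0 is the strongest
    spectral peak inside a plausible cardiac band (in Hz)."""
    windowed = (signal - signal.mean()) * np.hanning(len(signal))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    f0 = freqs[in_band][np.argmax(spectrum[in_band])]
    return 60.0 * f0

# Synthetic quasi-periodic pulse: fundamental at 1.2 Hz plus one harmonic.
rng = np.random.default_rng(0)
fs = 64.0
t = np.arange(0, 30, 1 / fs)
ppg = (np.sin(2 * np.pi * 1.2 * t)
       + 0.4 * np.sin(2 * np.pi * 2.4 * t)
       + 0.05 * rng.normal(size=t.size))
hr_bpm = estimate_hr_bpm(ppg, fs)
```

A 30-second window gives a frequency resolution of 1/30 Hz, i.e. 2 bpm, which is why real estimators usually interpolate around the peak or fit a harmonic model instead.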
Estimating Grammatical Gender Directions in Contextual Embeddings under Controlled and Natural Contexts
The study presents the first method to disentangle grammatical gender from semantic bias in contextual embeddings for gendered languages like Spanish. Using controlled templates and natural Wikipedia contexts, the authors construct balanced datasets of inanimate nouns and propose a framework with centroid, SVM, and LDA gender direction estimators, plus contamination-aware weighting strategies. Evaluation via dual-objective metrics shows unweighted controlled contexts yield the purest grammatical gender direction, with the centroid estimator outperforming discriminative baselines in suppressing gender leakage while preserving semantic distinctions.
contextual embeddings · grammatical gender · semantic bias · gender direction estimators · contamination-aware weighting
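A centroid estimator of this kind is commonly the normalized difference between the two class means. The sketch below shows that computation on toy vectors; the projection step is only an illustrative way to use such a direction, and the function names and synthetic data are assumptions, not taken from the paper.

```python
import numpy as np

def centroid_direction(masc_vecs, fem_vecs):
    """Unit vector from the masculine centroid to the feminine centroid."""
    diff = fem_vecs.mean(axis=0) - masc_vecs.mean(axis=0)
    return diff / np.linalg.norm(diff)

def project_out(vectors, direction):
    """Remove each vector's component along a unit direction."""
    return vectors - np.outer(vectors @ direction, direction)

# Toy embeddings: the classes differ only along the first axis plus noise.
rng = np.random.default_rng(0)
offset = np.zeros(8)
offset[0] = 2.0
masc = rng.normal(size=(50, 8)) - offset
fem = rng.normal(size=(50, 8)) + offset
direction = centroid_direction(masc, fem)
cleaned = project_out(np.vstack([masc, fem]), direction)
```

Unlike an SVM or LDA direction, the centroid estimate has no fitted decision boundary, so it cannot overfit to incidental features that happen to separate the two noun sets.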
FacePlex: Full-Duplex Joint Speech-Facial Motion Generation for Conversational Avatars
FacePlex introduces a full-duplex framework for joint speech-facial motion generation in conversational avatars, addressing the gap between real-time speech synthesis and synchronized facial animation. The method employs Rolling Flow Matching for online motion frame generation and Rolling Cross-Attention to couple streaming audio and motion queues bidirectionally. Experiments demonstrate superior lip-sync quality and motion fidelity compared to audio-driven baselines under streaming constraints.
full-duplex generation · flow matching · cross-attention · lip-sync · streaming synthesis
Relevance Is Not Permission: Warranted Attention for Value Contributions
The paper introduces Warrant, a path-localized interface addressing the gap between attention relevance and value contribution in neural models. By formalizing this as a permission problem, Warrant modifies the weighted value term α_ij * v_j to α_ij * g_ij * v_j via learned query-item permission g_ij, while preserving attention relevance. Evaluated across 32 comparisons in tasks like CTDG link prediction and TKG tail prediction, Warrant improved primary metrics in 27 cases, with notable gains (+0.1076 AUC in CTDG, +0.0683 MRR in TKG). Ablations reveal domain-specific benefits, e.g., historical-tail value path exposure in TKG and edge-conditioned permission in CTDG.
attention relevance · value contribution · path-localized · query-item permission · metric-defining value paths
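The change from α_ij * v_j to α_ij * g_ij * v_j can be sketched in a few lines of NumPy. The summary does not say how g_ij is parameterized, so the bilinear-sigmoid gate below (and the name `W_gate`) is a hypothetical stand-in.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def warranted_attention(Q, K, V, W_gate):
    """Return sum_j alpha_ij * g_ij * v_j for every query i.

    alpha_ij is standard scaled dot-product attention (relevance).
    g_ij is a per-pair permission; the bilinear-sigmoid form used
    here is an assumption, not the paper's parameterization.
    """
    d = Q.shape[-1]
    alpha = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # relevance, rows sum to 1
    gate = sigmoid(Q @ W_gate @ K.T)                # hypothetical permission g_ij
    return (alpha * gate) @ V, alpha, gate

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
W_gate = rng.normal(size=(4, 4))
out, alpha, gate = warranted_attention(Q, K, V, W_gate)
```

Because the gate multiplies the value term after the softmax, the relevance weights `alpha` are left untouched, which is the sense in which attention relevance is preserved.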
Query-Aware Spreading Activation for Multi-Hop Retrieval over Knowledge Graphs
The paper introduces a query-aware spreading activation method for multi-hop retrieval over knowledge graphs, addressing limitations of query-blind traversal in existing Graph RAG systems. The proposed approach uses a single per-step semantic gate (cosine similarity between candidate entity descriptions and the question) to enable query-aware traversal, expressed as a single Cypher query executed in Neo4j. On MuSiQue, it nearly matches QAFD-RAG's exact match (32.80 vs 33.50) and outperforms HippoRAG by 5.3 EM and 3.4 F1, while reducing retrieval latency by 1.5-4.9×. Ablation confirms the gate's contribution to both performance gains (3.6-7.4 F1) and latency reduction.
knowledge graphs · multi-hop retrieval · spreading activation · query-aware traversal · graph rag
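The per-step semantic gate can be illustrated in plain Python rather than Cypher. In this sketch, a neighbor is expanded only if its description embedding is similar enough to the question. The threshold, decay factor, toy graph, and two-dimensional embeddings are all assumptions for illustration.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spread(graph, emb, query_vec, seeds, hops=2, threshold=0.5, decay=0.5):
    """Spreading activation where an edge is followed only if the
    neighbor's description embedding is similar enough to the query."""
    activation = {s: 1.0 for s in seeds}
    frontier = dict(activation)
    for _ in range(hops):
        nxt = {}
        for node, act in frontier.items():
            for nb in graph.get(node, []):
                sim = cosine(emb[nb], query_vec)
                if sim < threshold:  # semantic gate: prune off-topic neighbors
                    continue
                score = act * decay * sim
                if score > activation.get(nb, 0.0):
                    activation[nb] = score
                    nxt[nb] = score
        frontier = nxt
    return activation

graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}
emb = {
    "A": np.array([1.0, 0.0]),
    "B": np.array([1.0, 0.1]),
    "C": np.array([0.0, 1.0]),
    "D": np.array([0.9, 0.2]),
    "E": np.array([1.0, 0.0]),
}
activation = spread(graph, emb, np.array([1.0, 0.0]), seeds=["A"])
```

Here `E` is relevant to the query but is never scored, because its only path runs through the pruned node `C`. Pruning whole branches this way is also a plausible reason for the reported latency reduction.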
Hyper-Network Neural Functional Maps for Unsupervised Robust 3D Shape Matching
The paper introduces Hyper-Network Neural Functional Maps (NFM), a novel unsupervised method for robust 3D shape matching that addresses limitations of existing functional map approaches in challenging scenarios like partiality and topological noise. The method employs a hyper-network to predict weights for an MLP with skip-connections, refining standard functional maps (FM) to better align spectral bases. Trained with an unsupervised spectral alignment loss, NFM integrates seamlessly into deep functional map pipelines, significantly improving matching accuracy in demanding conditions.
neural functional maps · hyper-network · spectral alignment · 3d shape matching · unsupervised learning
Does Verbose Chain-of-Thought Really Help? In-Distribution Evidence that Content, Not Length, Matters
This study investigates whether verbose chain-of-thought (CoT) prompting improves LLM reasoning due to semantic content or token count. Two methods are employed: in-distribution analysis comparing shorter and longer natural generations across 25 models, and controlled interventions using dual-validator designs across four targets and eight benchmarks. Results show that extra tokens leave accuracy unchanged across independently-trained reasoners, and verbose traces improve accuracy modestly (1-4 points) depending on prose quality, not length. Maximum numerical redaction amplifies effects (median 3.24x), while non-reasoning filler recovers none. Findings converge on the importance of reasoning and validation content over token count.
chain-of-thought · llm · semantic content · token count · dual-validator
Gravitational Duals from Equations of State II: Large Hierarchies and False Vacua
The authors advance the reconstruction of holographic duals for strongly coupled quantum field theories in regimes featuring large hierarchies and false vacua, extending previous Physics-Informed Neural Networks (PINNs) methodologies. They address challenges such as near-degenerate states, energy scale hierarchies, and unprobed potential regions through methodological innovations. The framework accurately reconstructs scalar potentials deep into the false vacuum regime, achieving robust agreement with underlying thermodynamic features despite numerical stiffness. This work bridges holography and machine learning, demonstrating data-driven approaches' potential to elucidate strongly coupled systems.
holographic duals · false vacua · physics-informed neural networks · renormalization group flows · scalar potentials
Open Problems in Constitutional Preference Reconstruction
The paper identifies three open problems in constitutional preference reconstruction methods for language model training: difficulty in measuring principle quality, ambiguity in principle composition, and variability between LLMs. Using Inverse Constitutional AI (ICAI+) on datasets like PRISM, AlpacaEval, and Chatbot Arena, the authors demonstrate that principle refinement improves inter-executor agreement (78% vs. 73%) and matches LLM judge accuracy (66% vs. 67%). Results suggest constitutions should be evaluated as constitution-executor systems, with implications for LLM-as-a-judge paradigms.
constitutional preference reconstruction · inverse constitutional ai · llm judge · principle composition · pairwise preference data
SA-VLA: State-aware tokenizer for improving Vision-Language-Action Models' performance
SA-VLA introduces a state-aware action tokenizer for vision-language-action (VLA) models, addressing the limitation of fixed continuous action prototypes in existing tokenizers by conditioning action decoding on robot state. The method employs two state-injection mechanisms: cross-attention between state and action features, and a lightweight state adapter predicting action-wise modulation factors for state-conditioned action modulation. Evaluated on 12 RoboTwin manipulation tasks, SA-VLA improves average success rates from 0.29 to 0.56 over baselines, and from 0.15 to 0.33 in zero-shot sim-to-real experiments, demonstrating reduced compression gap in discrete VLA policies.
vision-language-action models · action tokenization · state-conditioned decoding · robot manipulation · sim-to-real transfer
Automating the Design of Embodied Agent Architectures
The paper introduces AgentCanvas, a typed-graph runtime for embodied agent architectures, and KDLoop, a coding-agent search procedure, to automate architectural design in perceptual embodied agents. The method evaluates three Agent Architecture Search (AAS) variants across four embodied executors, including vision-language navigation and language-conditioned manipulation tasks. Results show architecture-level search improves success rates, though optimization signals are masked by rollout noise and search can stall in local edit basins, revealing both potential and current limitations of automated search for embodied agents.
embodied agents · architecture search · vision-language navigation · typed-graph runtime · credit assignment
Structural Certification for Reliable Physical Design with Language Models
The paper introduces Physics-Anchored Certification (PHACT), a propose-certify framework that ensures reliable physical design generation by language models through deterministic certification. PHACT decouples proposal (by LM) from certification (by deterministic engine), deriving certified quantities from fixed inputs to prevent forgery. Evaluated across 80 adversarial trials with two models (unspecified), two decoding temperatures, and a faulted engine, the method achieved zero false certifications, demonstrating robustness in five scientific domains.
physics-anchored certification · deterministic certification · propose-certify loop · language models · physical design
Propagation of Interval Belief Structures and Imprecise Copulas for Neural Network Verification
The paper proposes a sound framework for quantitative verification of neural networks under imprecise probabilistic information, combining interval belief structures for marginal uncertainty with imprecise copulas for uncertain dependence. The method develops propagation techniques for imprecisely coupled interval belief structures through feed-forward networks, using mixed imprecise copula volumes to derive sound push-forward constructions via affine transformations and activation functions. Results demonstrate guaranteed lower and upper bounds on probabilistic safety properties, valid for all probability models compatible with the specified imprecise inputs.
interval belief structures · imprecise copulas · neural network verification · probabilistic safety · affine transformations
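One ingredient of such a framework can be shown with standard interval arithmetic: each focal element (a box carrying probability mass) is pushed through affine layers and ReLU, and the masses then give lower and upper bounds on a safety property. This simplified sketch ignores the imprecise-copula coupling entirely, applies ReLU after every layer, and uses hypothetical function names.

```python
import numpy as np

def affine_bounds(lo, hi, W, b):
    """Sound output box of x -> W @ x + b for x in the box [lo, hi]."""
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    mid = W @ center + b
    rad = np.abs(W) @ radius
    return mid - rad, mid + rad

def relu_bounds(lo, hi):
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

def push_forward(focal, layers):
    """focal: list of (lo, hi, mass). Each box is propagated and keeps its mass."""
    out = []
    for lo, hi, mass in focal:
        for W, b in layers:
            lo, hi = relu_bounds(*affine_bounds(lo, hi, W, b))
        out.append((lo, hi, mass))
    return out

def belief_plausibility(focal_out, threshold):
    """Lower and upper bounds on P(y_0 <= threshold)."""
    bel = sum(m for lo, hi, m in focal_out if hi[0] <= threshold)
    pl = sum(m for lo, hi, m in focal_out if lo[0] <= threshold)
    return bel, pl

layers = [(np.array([[1.0, -1.0]]), np.array([0.0]))]
focal = [
    (np.array([0.0, 0.0]), np.array([1.0, 1.0]), 0.6),
    (np.array([2.0, 0.0]), np.array([3.0, 1.0]), 0.4),
]
focal_out = push_forward(focal, layers)
bel, pl = belief_plausibility(focal_out, threshold=1.5)
```

In this toy case the first box maps to [0, 1] and the second to [1, 3], so the property y ≤ 1.5 is certain for the first box and only possible for the second.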
Temporal Feature Extractors in EEG Foundation Models: A Controlled Comparison Including a Pretrained Time-Series Model
This work systematically compares temporal feature extractors for EEG foundation models, including a linear baseline, convolutional encoder, and frozen pretrained time-series foundation model (MOMENT). The study evaluates representation quality on motor imagery and emotion recognition tasks, revealing task-dependent performance: motor imagery benefits from simpler temporal representations, while emotion recognition requires richer temporal modeling. Notably, the general-purpose MOMENT model transfers effectively as a frozen feature extractor despite no EEG-specific adaptation, demonstrating cross-domain applicability of time-series representations.
eeg foundation models · temporal feature extraction · time-series transfer learning · motor imagery · emotion recognition
Hierarchical Reinforcement Learning in StarCraft Micromanagement with Influence Maps and Cluster-based Scripts
The paper proposes HRL-IM/CBS, a hierarchical reinforcement learning framework for StarCraft micromanagement that combines influence map hashing and cluster-based scripts. The method encodes global battlefield states via hexadecimal influence maps and enables adaptive unit coordination through cluster-based tactical modules, using a hierarchical multi-Q-table architecture with dense reward allocation. Experiments in six asymmetric scenarios show competitive performance against deep RL baselines, with improved sample efficiency and interpretability through transparent Q-table representations.
hierarchical reinforcement learning · influence map hashing · cluster-based scripts · multi-q-table · starcraft micromanagement
SAT-RTS: A systematic framework for tactical knowledge extraction and visualization-based analysis in real-time strategy games
The SAT-RTS framework enhances interpretable tactical knowledge extraction in real-time strategy games by integrating visualization with automated pattern extraction from high-dimensional sequence data. It employs a cluster-centric BK-tree algorithm with specialized distance metrics for state-stream abstraction and a rule-based multi-label extraction method to transform raw sequences into discrete tactical labels. Experiments show SAT-RTS improves interpretability and efficiency in tactical analysis of complex RTS environments.
real-time strategy games · tactical knowledge extraction · bk-tree algorithm · multi-label extraction · fitness landscape visualization
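For readers unfamiliar with BK-trees, the sketch below shows the plain data structure with a Hamming distance over equal-length state strings. It is not the cluster-centric variant or the specialized metrics described in the summary; the string encoding of states is an assumption.

```python
def hamming(x, y):
    """Number of differing positions; assumes equal-length sequences."""
    return sum(a != b for a, b in zip(x, y))

class BKTree:
    def __init__(self, distance):
        self.distance = distance
        self.root = None  # each node is (item, {edge_distance: child_node})

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            if d == 0:
                return  # identical under the metric; nothing to add
            child = node[1].get(d)
            if child is None:
                node[1][d] = (item, {})
                return
            node = child

    def query(self, item, radius):
        """Return sorted (distance, item) pairs within the given radius."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            node = stack.pop()
            d = self.distance(item, node[0])
            if d <= radius:
                results.append((d, node[0]))
            for edge, child in node[1].items():
                if d - radius <= edge <= d + radius:  # triangle-inequality pruning
                    stack.append(child)
        return sorted(results)

states = ["AAAA", "AAAB", "AABB", "BBBB", "ABAB"]
tree = BKTree(hamming)
for s in states:
    tree.add(s)
near = tree.query("AAAA", radius=1)
```

The pruning test is only valid when the distance is a true metric, which is worth checking for any specialized distance before plugging it in.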
Online Data Selection for Instruction Tuning via Gaussian Processes
The paper introduces GAIA, a global adaptive instruction tuning framework using Gaussian Processes for online data selection in LLM training. GAIA models utility manifolds across semantic space via Gaussian Process regression and employs adaptive strategy fusion to prioritize high-utility samples dynamically. The method, framed under the fixed-share Hedge framework, guarantees robustness under non-stationary quality scores. Evaluations on three datasets show GAIA outperforms state-of-the-art baselines like GREATS in instruction tuning efficiency.
gaussian processes · instruction tuning · data selection · dynamic regret · semantic space
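The fixed-share Hedge update itself is short, and a sketch shows why it suits non-stationary scores. Weights are updated multiplicatively by loss, then a small share is redistributed uniformly so that no strategy's weight can collapse to zero. How GAIA defines its strategies and losses is not stated in the summary; the three strategies and loss values below are made up.

```python
import numpy as np

def fixed_share_update(weights, losses, eta=1.0, share=0.05):
    """One fixed-share Hedge step over a set of candidate strategies."""
    w = weights * np.exp(-eta * losses)  # exponential-weights (Hedge) step
    w = w / w.sum()
    return (1.0 - share) * w + share / len(w)  # mix with the uniform distribution

# The best strategy switches from index 0 to index 1 halfway through.
weights = np.full(3, 1 / 3)
for step in range(40):
    losses = np.array([0.1, 0.9, 0.5]) if step < 20 else np.array([0.9, 0.1, 0.5])
    weights = fixed_share_update(weights, losses)
```

The uniform mixing keeps every weight at or above `share / N`, which lets the previously poor strategy recover within a few steps after the switch instead of needing to climb back from a vanishing weight.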
ACPO: Agent-Chained Policy Optimization for Multi-Agent Reinforcement Learning
The paper introduces Agent-Chained Policy Optimization (ACPO), a novel method for Multi-Agent Reinforcement Learning (MARL) under the Centralized Training with Decentralized Execution (CTDE) paradigm. ACPO decomposes the joint policy gradient into per-agent terms using decentralized critics and score functions, enabling independent actor updates that collectively form a joint gradient step. Key to this approach is a serialized decision process where agents condition actions on beliefs about preceding actions, ensuring coordination. Evaluated on Multi-Robot Warehouse, SMACv2, and MA-MuJoCo, ACPO outperforms baselines, particularly as agent count increases.
multi-agent reinforcement learning · policy gradient · decentralized execution · nash equilibria · score functions
Neural Subspace Reallocation: Continual Learning as Retrieval-Based Subspace Memory Management
Neural Subspace Reallocation (NSR) reformulates continual learning as parameter subspace memory management, treating Low-Rank Adaptation (LoRA) modules as compressible, retrievable memory units. The method cycles through compressing LoRAs via SVD, storing them in a TaskKnowledgeBank, recalling similar past LoRAs via embedding similarity, and reallocating active subspaces with distillation. Theoretical analysis shows memoryless policies incur Ω(T(M-1)Δ_switch) regret versus history-aware policies. Experiments demonstrate 10x faster cyclic recovery on Split-CIFAR-100, 9x reduced backward transfer on 5-Datasets, and 0.29MB/task memory footprint, with similarity-based retrieval outperforming learned controllers.
neural subspace reallocation · low-rank adaptation · taskknowledgebank · continual learning · parameter memory
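The compress-store-recall cycle can be sketched with a truncated SVD of the LoRA update and a cosine-similarity lookup. The bank layout, task embeddings, and function names here are hypothetical; only the use of SVD compression and embedding-similarity recall comes from the summary.

```python
import numpy as np

def compress_lora(B, A, rank):
    """Truncate the LoRA update B @ A to the given rank via SVD and
    return new, smaller factors."""
    U, S, Vt = np.linalg.svd(B @ A, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank]

def recall(bank, query_emb):
    """Return the name of the stored task whose embedding is most similar."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(bank, key=lambda name: cos(bank[name]["emb"], query_emb))

rng = np.random.default_rng(0)
B = rng.normal(size=(16, 4))
A = rng.normal(size=(4, 12))
B_small, A_small = compress_lora(B, A, rank=2)

# Hypothetical bank: one entry per past task, keyed by task name.
bank = {
    "task_a": {"emb": np.array([1.0, 0.0]), "lora": (B_small, A_small)},
    "task_b": {"emb": np.array([0.0, 1.0]), "lora": (B_small, A_small)},
}
best = recall(bank, np.array([0.9, 0.1]))
```

By the Eckart-Young theorem, the rank-2 factors are the best rank-2 approximation of the original update in Frobenius norm, so the compression error is exactly the energy in the discarded singular values.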
Little Brains, Big Feats: Exploring Compact Language Models
This study evaluates the performance of small language models (SLMs) in Retrieval-Augmented Generation (RAG) systems, addressing their underrepresentation in current research. Using diverse open-source and proprietary datasets, the authors benchmark SLMs across various subject areas and question types. Results indicate that RAG systems incorporating SLMs can operate efficiently on-device without GPU hardware, maintaining reasonable execution times. The experimental framework and supplementary materials are publicly accessible via GitHub.
small language models · retrieval-augmented generation · on-device execution · benchmarking · gpu hardware
MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs
The paper introduces MuseBench, a novel benchmark for evaluating multimodal large language models (MLLMs) on intent-level audiovisual arts understanding. The benchmark comprises 4,016 questions across cinematic arts, visual arts, stage performance, and game arts, derived from 10K+ video essays with professional commentary. Questions are generated via a four-phase pipeline involving shortcut filtering, adversarial distractors, and expert validation. Zero-shot evaluation of 28 MLLMs shows top accuracy of 48.29%, significantly below human expert performance (87.18%), revealing a critical gap in models' artistic reasoning capabilities.
multimodal large language models · audiovisual arts · intent-level understanding · zero-shot evaluation · adversarial distractors
IBRSteG: Learning a Generalizable Steganography Framework for 3D Gaussian Splatting
IBRSteG proposes a generalizable steganography framework for 3D Gaussian Splatting (3DGS) that embeds secret 3D scenes into cover scenes without per-scene optimization. The method introduces GAS (Gaussian Attributes Steganographer), a network that learns a scene-independent embedding function by injecting secret 3D Gaussian attributes into cover scenes, leveraging 2D learning paradigms for generalization. Experiments show IBRSteG achieves high visual quality, superior capacity, and security across diverse 3DGS scenes.
3d gaussian splatting · steganography · generalizable framework · gaussian attributes steganographer · scene-independent embedding
T3R: Deeper Test-Time Adaptation for Graph Neural Networks via Gradient Rotation
T3R introduces a novel test-time adaptation method for Graph Neural Networks (GNNs) that enables deeper parameter updates using unlabeled test data. The approach leverages multiple Rotograd matrices to enhance task affinity between target and auxiliary tasks, coupled with a rotation technique that reorients self-supervised signals to generate surrogate gradients. This allows adaptation across nearly the entire architecture, addressing limitations of shallow updates in conventional Test-Time Training. Empirical results demonstrate a 0.172 reduction in MAE on regression datasets and at least 9.37% relative improvement on cross-domain OGB classification benchmarks compared to non-adaptive models.
graph neural networks · test-time training · rotograd matrices · self-supervised learning · surrogate gradients
AlgoSkill: Learning to Design Algorithms by Scheduling Human-Like Skills
AlgoSkill introduces a skill-based framework for algorithm design, modeling it as sequential decision-making over a typed library of human-like algorithmic skills (e.g., abstraction, constraint analysis). The method combines a learned scheduler with Monte Carlo Tree Search (MCTS) guided by verification feedback from compilation, testing, and complexity analysis. Experiments on competitive programming and combinatorial optimization benchmarks demonstrate improvements over direct LLM generation, chain-of-thought prompting, and baseline MCTS, with ablations highlighting the importance of typed skills, verification-based repair, and search-based scheduling.
algorithm design · monte carlo tree search · verification feedback · skill scheduling · complexity refinement
Be Faithful When Response: Returning Fluent and Grounded Answers for Vision-Language Models Reinforcement Learning
The paper proposes Faithful Warm-Start (FWS), a strategy to improve Vision-Language Models' (VLMs) reasoning by ensuring visual grounding before reinforcement learning. FWS curates the FaithfulQA dataset from six VQA benchmarks, selecting samples with explicit vision-language causal relationships, then purifies it using a VLM-based judge for causal consistency. This warm-start phase enhances the model's understanding of grounded patterns prior to RL optimization. Experiments demonstrate improved answer accuracy (quantitative results unspecified), stabilized RL training, and reduced ungrounded reasoning compared to direct RL application.
vision-language models · reinforcement learning · visual grounding · faithfulqa dataset · causal consistency
Stabilizing Extrapolation in Looped Transformers via Learned Stochastic Stopping
The paper introduces learned stochastic stopping to stabilize extrapolation in Looped Transformers for variable-length algorithmic tasks. By analyzing the spurious correlation between sequence length and loop count, the authors propose training-time randomization of loop counts and RL-Halting as a learned schedule. Experiments on binary addition, Dyck-1, Unique Set, and Copy tasks show reduced out-of-distribution variance and improved accuracy-stability trade-offs, though suboptimal computations may persist. The work frames loop termination as a training-time design choice rather than purely inference-time optimization.
looped transformers · length generalization · stochastic stopping · rl-halting · out-of-distribution variance
Exploration and Online Transfer with Behavioral Foundation Models
The paper introduces online transfer for zero-shot reinforcement learning (RL), addressing the limitation of offline reward specification in Behavioral Foundation Models (BFMs). By framing the problem as a bandit-like exploration-exploitation task, the authors propose using BFMs to generate exploration policies, with rewards observed through environment interactions. A method inspired by Upper Confidence Bound is derived for linear reward approximation, focusing on eigenvalue minimization of an uncertainty matrix for exploration. The framework is validated on a simple environment, demonstrating its feasibility for online transfer.
zero-shot transfer · behavioral foundation models · online reinforcement learning · exploration-exploitation · upper confidence bound
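For orientation, a generic LinUCB-style loop shows how an upper confidence bound works with a linear reward model. It keeps a ridge-regression estimate of the reward weights and adds a bonus that is large in directions the design matrix has not yet covered. This is a textbook sketch rather than the paper's derivation, and it assumes each candidate policy is summarized by a feature vector.

```python
import numpy as np

class LinUCB:
    def __init__(self, dim, alpha=1.0, reg=1.0):
        self.A = reg * np.eye(dim)  # regularized design matrix
        self.b = np.zeros(dim)
        self.alpha = alpha

    def select(self, features):
        """Pick the row of `features` with the highest optimistic reward."""
        A_inv = np.linalg.inv(self.A)  # acts as the uncertainty matrix
        theta = A_inv @ self.b         # ridge estimate of reward weights
        bonus = np.sqrt(np.einsum("id,de,ie->i", features, A_inv, features))
        return int(np.argmax(features @ theta + self.alpha * bonus))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

# Toy problem: three candidate policies with orthogonal features and
# noise-free rewards (both are simplifying assumptions).
features = np.eye(3)
theta_true = np.array([0.1, 0.9, 0.3])
agent = LinUCB(dim=3)
counts = np.zeros(3, dtype=int)
for _ in range(200):
    arm = agent.select(features)
    agent.update(features[arm], float(features[arm] @ theta_true))
    counts[arm] += 1
```

Each update shrinks the inverse design matrix along the chosen feature direction, which is one way to read the summary's emphasis on the eigenvalues of an uncertainty matrix.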
First-Order Temporal Logic Tensor Networks
The authors introduce First-Order Temporal Logic Tensor Networks (FOT-LTN), extending Logic Tensor Networks (LTN) to incorporate temporal reasoning. FOT-LTN combines First-Order Linear Temporal Logic syntax with LTN's fuzzy semantics, supporting temporal operators, quantifiers, and full differentiability. Evaluated on synthetic temporal knowledge graph completion tasks, FOT-LTN outperforms dedicated neural baselines, demonstrating its efficacy in handling dynamic object properties and relations.
first-order temporal logic · logic tensor networks · temporal knowledge graphs · neuro-symbolic ai · differentiable reasoning
RiverONE: Generating Knowledge-Intensive VLM by Simulated Quantum Machines
RiverONE introduces a lightweight vision-language model (VLM) for quantum calibration plot understanding, leveraging simulated quantum computation during construction to generate structured parameters. The model combines a specialized visual encoder with an InternVL-based language backbone, materializing quantum-generated parameters as classical tensors post-training for GPU inference. At 1.9B parameters, RiverONE achieves at least 95% of the performance of NVIDIA Ising Calibration 1 (19B+ parameters) on target tasks, demonstrating simulated quantum computation's utility for building compact, knowledge-intensive VLMs.
vision-language model · quantum computation · parameter compression · quantum calibration · internvl
DuoMem: Towards Capable On-Device Memory Agents via Dual-Space Distillation
DuoMem introduces a dual-space distillation framework for deploying capable memory-augmented agents on resource-constrained devices, transferring procedural problem-solving abilities from large teacher to compact student models. The method combines context-space distillation (prepending teacher-generated procedural memories) and parameter-space distillation (fine-tuning LoRA adapters on successful trajectories). On ALFWorld, DuoMem boosts a 4B-parameter model from 4.3% to 77.9% success rate (vs. 87.1% for a 72B teacher), with <10M added parameters and 3× faster inference, enabling real-time edge deployment.
dual-space distillation · memory-augmented agents · lora adapters · alfworld · procedural memories
SWE-Together: Evaluating Coding Agents in Interactive User Sessions
SWE-Together introduces a multi-turn benchmark for evaluating coding agents in interactive user sessions, addressing the limitations of static benchmarks. The benchmark reconstructs 109 repository-level tasks from 11,260 recorded sessions, ensuring recoverable repository states, clear user goals, and observable outcomes. A reactive LLM-based user simulator preserves original user intents and provides feedback as needed. Evaluation metrics include final repository correctness and the number of corrective feedback turns. Experiments with frontier coding agents reveal that stronger agents achieve higher success rates with fewer interventions, indicating an enhanced user experience.
multi-turn benchmark · coding agents · repository-level tasks · llm-based user simulator · corrective feedback turns
SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows
The authors introduce SpreadsheetBench 2, a workflow-level benchmark for evaluating spreadsheet agents on end-to-end business tasks, addressing limitations of prior benchmarks focused on isolated operations. The benchmark comprises 321 tasks derived from authentic business data (financial reports, corporate filings), averaging 11.8 worksheets and 593.5 cell modifications per instance, with expert validation. Evaluating eight frontier LLMs and commercial spreadsheet products reveals significant reliability gaps: the best model achieves 34.89% overall accuracy, with debugging accuracy dropping to 12.00%, primarily due to inadequate spreadsheet inspection and target-cell selection errors.
spreadsheet agents · workflow-level benchmark · multi-sheet dependencies · llm evaluation · business automation
Exploiting Local Flatness for Efficient Out-of-Distribution Detection
The paper introduces Fold, a computationally efficient out-of-distribution (OOD) detector that exploits local loss-landscape flatness differences between in-distribution (ID) and OOD data. The method leverages feature Hessian curvature and partial feature normalization, with AutoFold enabling self-supervised calibration via pseudo-OOD samples generated through ID logit masking. Experiments on OOD benchmarks demonstrate Fold's superiority, achieving a 1.63% AUROC improvement and 2.30% FPR95 reduction while maintaining forward-pass efficiency. Theoretical analysis confirms the observed curvature discrepancy between ID and OOD inputs.
out-of-distribution detection · hessian curvature · loss-landscape flatness · self-supervised calibration · logit masking
Data-Efficient Multimodal Alignment for Histopathology-based Molecular Prediction
The authors introduce a data-efficient multimodal alignment framework for predicting molecular pathways from H&E-stained histopathology images using frozen foundation models. By training a lightweight alignment module via contrastive learning on a multi-cancer cohort (N=1,720), they enable open-vocabulary molecular prompting of H&E slides with gene-set signatures, achieving a 25-fold improvement in retrieval over baselines. Morphologically grounded programs (e.g., cell-cycle, immune-related) show high predictability (R² > 0.5), while pathways lacking morphological footprints remain challenging. Clinical validation on the POSEIDON trial demonstrates accurate prediction of NSCLC subtypes and tumor microenvironment archetypes, with generalization across unseen cohorts and data-efficient domain adaptation.
multimodal alignment · contrastive learning · open-vocabulary prompting · histopathology · molecular pathways
SAGA: Scene-Aware, Goal-Evolving Agents for Long-Horizon CivRealm Strategy Planning
SAGA introduces a multi-agent LLM framework for long-horizon strategy planning in complex games, addressing three systematic failures: scene blindness, context overflow, and shallow cross-game learning. The method combines a Map-Semantic Scene Graph for spatial reasoning, a Tool-Augmented Planner for domain-specific state management, and a Dual-Horizon Feedback Loop for strategic evolution. Evaluated on FreeCiv, SAGA achieves the highest mean civilization score with 27% fewer output tokens, outperforming baselines in infrastructure construction and cross-game performance, with each component independently contributing to its advantage.
multi-agent framework · scene graph · tool-augmented planner · dual-horizon feedback · sparse reward
HippoSpark: An On-Demand Experience System for LLM Reasoning
HippoSpark introduces a state-level experience system for LLM reasoning that retrieves on-demand guidance tailored to immediate reasoning bottlenecks, contrasting with task-level approaches that assume universal solution patterns. The method dynamically provides precise, state-specific experience during problem-solving, addressing local failures in complex reasoning. Evaluations across mathematical, scientific, and programming benchmarks demonstrate consistent improvements over standard prompting and task-level baselines, highlighting the importance of actionable guidance at critical reasoning states.
llm reasoning · experience system · state-level retrieval · reasoning bottlenecks · on-demand guidance
Latent-CURE for Breast Cancer Diagnosis
Latent-CURE introduces a novel breast cancer diagnosis framework leveraging asymmetric weighted chain-of-thought methodology for latent space reasoning. The approach constructs implicit reasoning trajectories, forcing sequential inference of BI-RADS morphological descriptors before final diagnosis. A dual-asymmetric optimization strategy dynamically adjusts margins and weights to prevent rare malignant features from being overshadowed by common benign patterns. Evaluations demonstrate that this knowledge-injected method provides transparent clinical evidence while maintaining robust diagnostic accuracy in imbalanced medical cohorts.
chain-of-thought · latent space · bi-rads · asymmetric optimization · malignant descriptors
EVAF: A Test-Retest Protocol for Selective Parametric Consolidation
The paper introduces EVAF (Echo-Valence Attractor Field), a mechanism for selective parametric consolidation in language agents, alongside a test-retest protocol to measure consolidation under interference. EVAF employs gated LoRA updates to preferentially consolidate high-valence, high-surprise experiences while maintaining factual memory through a routed retrieval path. Experiments on GPT-2 and TinyLlama demonstrate EVAF's superiority over baselines in behavioral persistence (post-interference), with reduced parameter drift and cross-persona contamination, supporting a distinction between memory access and internalization.
parametric consolidation · lora · test-retest protocol · memory routing · valence-attractor
A causal modeling perspective on decision theory
The paper introduces a formal framework for decision theory using nonparametric structural equation models (NPSEMs) to unify representations of agents, counterfactuals, and causal relationships. It proposes personal decision theory, where agents maximize subjective counterfactual utility, and establishes a performance metric based on hypothetical interventions. Under specific assumptions, the theory proves optimal for this metric, demonstrated through analyses of the smoking lesion problem and Newcomb's problem. The approach aims to clarify modeling language and evaluative criteria in decision theory.
nonparametric structural equation models · counterfactual utility · decision theory · causal inference · newcomb's problem
Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation
The paper introduces SWAM (Spatial-perceiving World Action Model), a task-centric joint observation-action generation framework for embodied navigation that addresses limitations of verification-centric planners. SWAM performs single-pass inference to simultaneously generate intermediate RGB-D sequences and action trajectories from start and goal RGB observations, leveraging depth pseudo-labels during training but requiring only monocular RGB at inference. Key innovations include a visual-guided action refinement module and a trajectory-scale regularization loss for motion-visual alignment. Experiments report gains over two-stage planners in success rate, trajectory accuracy, and inference efficiency, with robust zero-shot generalization; the summary gives no quantitative figures.
embodied navigation · world model · rgb-d generation · action trajectory · zero-shot generalization
CW-B: Class Weighted Boosting Framework for Imbalance Resilient Multi Class Cardiac Phenotyping
The paper introduces CW-B, a class-weighted XGBoost framework for robust multi-class cardiac discharge phenotyping under real-world data imbalance and missingness. The method integrates fold-specific class-balanced instance weighting, missingness-indicator augmentation, and classwise error auditing to prioritize high-risk phenotypes while maintaining interpretability. Evaluation via five-fold stratified cross-validation shows CW-B outperforms tree-based, ensemble, and neural baselines on Accuracy, Macro-F1, Balanced Accuracy, and Prioritized F1; exact values are not given.
class-weighted boosting · cardiac phenotyping · xgboost · missingness-indicator augmentation · classwise error auditing
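The two preprocessing ideas named in the CW-B summary can be sketched generically. This is a minimal illustration, not the authors' code: the weighting formula N / (K * n_c) is the common "balanced" heuristic and is assumed here, and the function names are hypothetical. "Fold-specific" is read as computing the weights from the training fold's labels only.

```python
import math
from collections import Counter

def class_balanced_weights(labels):
    """Weight each sample by N / (K * n_c), so rarer classes get larger weights.

    Assumed formula (the usual 'balanced' heuristic); call it on the training
    fold's labels only so the weights stay fold-specific.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[y]) for y in labels]

def add_missingness_indicators(rows):
    """Append one 0/1 flag per feature marking whether the value was missing."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return [list(row) + [1 if is_missing(v) else 0 for v in row] for row in rows]

# Toy training fold: class "B" is rare, one feature value is missing.
train_labels = ["A", "A", "A", "B"]
weights = class_balanced_weights(train_labels)
augmented = add_missingness_indicators([[1.0, None], [2.0, 3.5]])
```

The resulting weights could be passed as per-sample weights to a gradient-boosting trainer, while the appended flags let the model treat "value absent" as a signal in its own right.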
Semi-Supervised Sound Event Detection with Conditional Mixup and Embedding-Level Contrastive Loss
The paper improves semi-supervised sound event detection (SED) by introducing conditional mixup and embedding-level contrastive loss within the ATST-SED framework. The method resolves the conflicting roles of mixup in pseudo-label learning (composition) and contrastive learning (perturbation) by unifying them, while leveraging unlabeled data more effectively through self-supervised contrastive objectives. The model achieves state-of-the-art performance on DESED validation with 0.645 PSDS1 and 0.822 PSDS2 scores.
sound event detection · semi-supervised learning · contrastive loss · conditional mixup · audio foundation models
LLM-based Multimodal Personality Recognition via Facial Action Unit-Text Semantic Fusion
The paper proposes an LLM-based multimodal framework for personality recognition in asynchronous video interviews (AVIs) by semantically fusing facial action units (AUs) with textual responses. AU sequences are converted to textual descriptions and fused with participant responses via an LLM, followed by a lightweight regression head for continuous personality scoring. On AVI-6 benchmark, the method achieves lower prediction errors and stronger human-score correlations than baselines, with AU-derived semantics providing complementary non-verbal cues. The decoupled architecture enhances training stability and interpretability.
personality recognition · facial action units · multimodal fusion · asynchronous video interviews · llm-based framework
Critical Interval MSE: Toward Reliable Offline Validation for Robot Manipulation Policies
The paper introduces Critical Interval MSE (CI-MSE), an offline validation metric for robot manipulation policies that improves correlation with real-world performance. CI-MSE focuses error computation on task-critical segments and incorporates action-alignment procedures to better reflect rollout behavior. Experiments show CI-MSE achieves a Spearman's rank correlation of -0.87 (vs. raw MSE's -0.61) with real-world performance, demonstrating robustness across hyperparameters and distribution shifts.
offline validation · robot manipulation · critical intervals · spearman correlation · action-alignment
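A rough sketch of the idea behind CI-MSE: compute the error only on task-critical time steps, then check how well the metric ranks policies against their real-world success. How the paper selects critical intervals and aligns actions is not described in the summary, so the interval list, the function names, and the toy numbers below are all assumptions.

```python
def interval_mse(pred, target, intervals):
    """Mean squared error over time steps inside [start, end) critical intervals."""
    errors = [
        (pred[t] - target[t]) ** 2
        for start, end in intervals
        for t in range(start, end)
    ]
    return sum(errors) / len(errors)

def ranks(values):
    """1-based ranks, assuming no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman rho via the no-ties formula 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical 1-D action trace with two critical steps (t = 1 and t = 3).
score = interval_mse([0, 1, 2, 3], [0, 0, 2, 5], [(1, 2), (3, 4)])

# Hypothetical per-policy errors and success rates: lower error, higher success.
rho = spearman([0.1, 0.4, 0.9], [0.9, 0.5, 0.2])
```

A strongly negative rho is the desirable outcome here, because a good offline error metric should fall as real-world success rises.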
Child-Centric Voice Anonymization in Single and Multi-Speaker Speech via Domain-Adapted SSL Models
The paper contributes a child-centric voice anonymization system by adapting self-supervised learning (SSL) to child speech domains. Using the MyST corpus for domain adaptation, the method combines target speaker extraction with anonymization for both single-speaker and two-speaker conditions. Results show improved intelligibility (+3.2 dB SNR) and perceptual quality (+0.8 MOS) while maintaining 98% privacy protection, demonstrating the necessity of child-specific adaptation in speech anonymization pipelines.
voice anonymization · self-supervised learning · domain adaptation · child speech · speaker extraction
SABER-Math: Automated Benchmark for Information Retrieval Evaluation in Mathematics
SABER-Math introduces the first fully automated benchmark for evaluating mathematical information retrieval (IR), addressing the lack of fine-grained mathematical relevance in existing benchmarks. The method constructs reranking tasks from 283K high-school-level math problems by (i) extracting solution summaries and topics via LLMs, (ii) identifying relevant documents using ontology-based and lexical similarities, and (iii) generating fine-grained relevance ratings through an LLM preference tournament. Evaluation of lexical retrievers, math-specific systems, and embedding models reveals that embedding models outperform classical and specialized baselines but struggle with symbol-heavy domains like Algebra and Calculus. Results demonstrate that general-purpose IR benchmarks fail to predict mathematical performance, underscoring the need for domain-specific evaluation.
information retrieval · embedding models · ontology-based similarity · lexical retrievers · reranking tasks
Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models
The paper introduces T^2VLA, a test-time reinforcement learning framework for Vision-Language-Action Models (VLAs) that enables self-bootstrapping policy improvement without external reward signals. T^2VLA leverages trajectory-level similarity to high-confidence expert demonstrations as intrinsic reward and employs a Confidence-Driven Dual Expert Bootstrapping mechanism to balance exploration and training stability. Evaluations on LIBERO and RoboTwin benchmarks demonstrate that T^2VLA consistently outperforms supervised baselines, approaching oracle RL performance with ground-truth rewards, while adapting to diverse VLA paradigms including OpenVLA-OFT and the pi series.
vision-language-action models · test-time reinforcement learning · confidence-driven bootstrapping · intrinsic reward · self-bootstrapping
SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing
SafePyramid introduces a hierarchical benchmark for in-context policy guardrailing, comprising 1,000 multi-turn conversations across 10 domains and 3,000 application-specific policies with 61,699 distinct natural-language rules. The benchmark evaluates three difficulty levels: individual-rule understanding (L0), reasoning over rule dependencies (L1), and adaptation to novel policy frameworks (L2). A rigorous multi-stage pipeline ensures benchmark quality. Evaluation of 10 frontier LLMs and 5 policy-configurable guardrails reveals significant challenges: GPT-5.5 achieves exact identification of violated rules in only 54.0%, 35.3%, and 12.9% of cases for L0, L1, and L2, respectively. These results underscore the need for improved in-context policy guardrails.
in-context policy guardrailing · multi-turn conversations · rule dependencies · policy frameworks · natural-language rules
LWDrive: Layer-Wise World-Model-Guided Vision-Language Model Planning for Autonomous Driving
The paper introduces LWDrive, a vision-language model (VLM) framework for autonomous driving that refines coarse trajectories through layer-wise world-model guidance. The method uses a Foresight Cascade Planner (FCP) to expand and refine candidate trajectories by integrating multi-layer VLM features, historical states, Action-Query representations, and Bird's-Eye-View (BEV) features, while preserving high-level driving intentions. Experiments demonstrate LWDrive's effectiveness, achieving scores of 92.0 on NAVSIM and 89.6 on NAVSIM-v2 benchmarks.
vision-language model · autonomous driving · trajectory refinement · bird's-eye-view · foresight cascade planner
Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency
The study introduces clinical reasoning graphs, a structured evaluation framework for LLM diagnostic reasoning, using a domain-grounded ontology with 5 node types and 7 edge types. Analyzing 750 diagnostic traces from five LLMs across 50 clinical cases, the authors find no evidence of stable reasoning schemas, with graph similarity nearly identical for correct (0.488) and incorrect (0.484) diagnoses. Structured reflection prompts increase feature analysis (+33%) but not cross-case consistency, revealing diagnostic competence without schema-scale reasoning consistency.
clinical reasoning graphs · diagnostic schemas · structured reflection prompting · graph similarity · domain-grounded ontology
AI Training Manager: Bounded Closed-Loop Control of Adaptive Training Recipes
The paper introduces AI Training Manager, a bounded LLM-based supervisory controller for adaptive machine learning training that addresses mid-run failures like overfitting and loss imbalance. The system operates via a schema-conditioned interface, reading structured telemetry, auditing constrained actions, and returning validated parameter updates (e.g., learning rate, regularization). Evaluations on TinyStories show a 60% lower validation loss than baseline, with asynchronous update capability. In robotic RL, it mitigates exploration issues. Results demonstrate LLMs can complement conventional optimizers with interpretable, multi-axis control.
adaptive training · schema-conditioned interface · loss imbalance · bounded control · telemetry snapshots
ARKD: Adaptive Reinforcement Learning-Guided Bidirectional KL Divergence Distillation for Text Generation
The authors propose Adaptive Reinforcement Learning-Guided Bidirectional KL Divergence Distillation (ARKD), a novel framework for text generation that dynamically balances forward and reverse KL divergence (FKL/RKL) objectives. ARKD employs a reinforcement learning policy network to adaptively weight FKL and RKL based on teacher-student distributional characteristics, optimizing both principal and long-tail probability modeling. The method achieves dual distribution alignment through reward-guided optimization. Experimental results demonstrate consistent improvements, with ARKD surpassing greedy heuristics by 0.4-0.6 points on Rouge-L and BertScore metrics across diverse benchmarks.
knowledge distillation · kl divergence · reinforcement learning · text generation · distribution alignment
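The forward/reverse KL trade-off that ARKD balances can be illustrated with a fixed mixing weight. In the paper the weight is produced by a learned RL policy; the constant `alpha` below is a stand-in for that policy, and the function names and toy distributions are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mixed_kl_loss(teacher, student, alpha):
    """alpha * forward KL(teacher || student) + (1 - alpha) * reverse KL(student || teacher).

    alpha is a fixed placeholder here; ARKD is described as choosing the
    weighting adaptively with a policy network.
    """
    return alpha * kl(teacher, student) + (1 - alpha) * kl(student, teacher)

teacher = [0.7, 0.2, 0.1]   # toy next-token distribution from the teacher
student = [0.5, 0.3, 0.2]   # toy next-token distribution from the student
loss = mixed_kl_loss(teacher, student, alpha=0.5)
```

Forward KL penalizes the student for missing mass anywhere the teacher puts it (covering the long tail), while reverse KL penalizes the student for putting mass where the teacher does not (sharpening the principal modes), which is why a weighting between the two matters.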
RoAd-RL: A Unified Library and Benchmark for Robust Adversarial Reinforcement Learning
The authors introduce RoAd-RL, an open-source benchmarking framework for robust adversarial reinforcement learning, addressing fragmentation in implementations and evaluation protocols. The library provides unified abstractions for policies, attacks, defenses, and robustness metrics, integrating with Stable-Baselines3 and Gymnasium. Evaluation of DQN, PPO, and SAC agents across 192 attack-defense configurations in LunarLander and Highway-v0 reveals environment-dependent robustness variations, with temporal smoothing emerging as a consistently effective defense while some common defenses prove counterproductive.
adversarial reinforcement learning · robustness metrics · stable-baselines3 · temporal smoothing · gymnasium
SUMO: Segment and Track Any Motion with Nonlinear State Space Models
SUMO introduces a zero-shot, training-free framework for Visual Object Tracking (VOT) and Moving Object Segmentation (MOS) by integrating nonlinear dynamics with vision-based segmentation. The method employs a nonlinear State Space Model (SSM) inspired by robotics, a Selective Unscented Filter (SUF) for state estimation with multi-source prediction fusion, and a memory selection mechanism. Experiments demonstrate state-of-the-art performance on VOT and MOS tasks.
visual object tracking · moving object segmentation · nonlinear state space model · selective unscented filter · zero-shot learning
Beyond Triplet Plausibility: Relation Set Completion in Knowledge Graphs
The paper introduces relation set completion (RSC), a novel knowledge graph completion task addressing entity-relation compatibility gaps beyond traditional triplet prediction. The authors propose Relation Set Embedding (RelSetE), which models latent patterns in observed entity relations to infer missing compatible relations. Evaluated on three derived benchmark datasets, RelSetE demonstrates effective capture of entity-relation compatibility patterns, outperforming baselines in missing relation inference.
knowledge graph completion · relation set embedding · entity-relation compatibility · link prediction · triplet prediction
Exploring Motivations for Algorithm Mention in the Domain of Natural Language Processing: A Deep Learning Approach
The study introduces a sentence-level framework for analyzing motivations behind algorithm mentions in NLP academic papers, focusing on description, use, comparison, and improvement. Using manual annotation and machine learning, algorithm entities and related sentences were identified, with motivation classification performed via pretrained models and data augmentation. Results indicate that deep learning models with augmented data outperform traditional methods in motivation classification. Findings reveal that direct use is the most common motivation (over 50%), while improvement is the least frequent. Over time, use motivations have replaced description motivations and overall motivation diversity has increased, though the number of motivation types per individual algorithm has declined.
natural language processing · algorithm entities · motivation classification · data augmentation · deep learning models
MATCH: Modulating Attention via In-Context Retrieval for Long-Context Transformers
MATCH introduces a scalable framework for enhancing sparse-attention transformers by dynamically integrating in-context information via efficient retrieval. The method modulates attention mechanisms without rigid structural constraints, addressing the quadratic cost of traditional attention while preserving long-range recall capabilities. Empirical results demonstrate significant performance improvements on synthetic and natural-language tasks, validating MATCH as an effective approach for maintaining efficiency in long-context scenarios.
sparse-attention · in-context retrieval · long-context transformers · attention modulation · efficient retrieval
Neural Procedural Memory: Empowering LLM Agents with Implicit Activation Steering
The paper introduces Neural Procedural Memory (NPM), a training-free framework for enhancing LLM agents through implicit activation steering rather than explicit textual instructions. NPM distills procedural skills from contrastive experiences into activation-space steering vectors, directly triggering task-relevant neural mechanisms. Evaluations on four agent benchmarks show that NPM matches explicit-instruction baselines, while combining both approaches yields complementary robustness. Representational analyses reveal steering vectors encode consistent task logic with organized activation-space structures, suggesting implicit steering as a viable agent memory mechanism.
neural procedural memory · activation steering · llm agents · retrieval-augmented generation · contrastive experiences
Experience Graphs: The Data Foundation for Self-Improving Agents
The paper introduces Trellis, a data foundation that treats experience graphs as first-class database objects for long-horizon agentic tasks like code generation and hardware design. Experience graphs capture structured search trajectories (artifacts, tool outputs, rewards, lineage) typically discarded as ephemeral logs. Trellis reformulates agent operations as database patterns: frontier selection as queries, cross-session reuse as graph retrieval, and training-data extraction as materialized views. Evaluated on Meta's KernelEvolve, it reaches the target speedup 10x faster at 52% lower token cost by enabling durable, queryable experience graphs that transform inference-time search into institutional assets.
experience graphs · agentic tasks · materialized views · graph retrieval · time-travel query
Dual-Flow Reinforcement Learning with State-Aware Exploration
Dual-Flow RL introduces a unified actor-critic framework for complex continuous-control tasks, addressing challenges in multimodal action spaces and uncertain return distributions. The method jointly models continuous return distributions and multimodal policies using conditional flow matching (CFM), coupled with an Entropy-Covariance Exploration Regulator (ECER) for state-aware exploration. Experiments on DeepMind Control Suite and Humanoid-Bench demonstrate state-of-the-art performance, surpassing prior diffusion-based and flow-based methods.
dual-flow rl · conditional flow matching · multimodal exploration · actor-critic framework · continuous-control
How Far Can You Get Without a GPU? A Systematic Benchmark of Lightweight Hallucination Detection Across Question Answering, Dialogue, and Summarisation
This work benchmarks lightweight, CPU-feasible hallucination detection methods using publicly available models, addressing resource constraints in AI deployment. Five methods—ROUGE-L, semantic similarity, BERTScore, a FEVER-trained DeBERTa-based NLI detector, and a similarity-NLI ensemble—are evaluated across QA, dialogue, and summarisation tasks on the HaluEval benchmark. Results show task-dependent performance: the ensemble excels in QA (F1 = 0.792, AUC-ROC = 0.873), NLI leads in dialogue (AUC-ROC = 0.713), and all methods degrade in summarisation (AUC-ROC = 0.469–0.574). Experiments were conducted on a standard laptop CPU, mapping the practical limits of GPU-free detection.
hallucination detection · nli detector · cpu-feasible · halueval benchmark · auc-roc
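Of the five detectors listed, ROUGE-L is simple enough to sketch in full: it scores a response by its longest common subsequence (LCS) with a reference text. The version below is a simplified F1 variant with whitespace tokenization; how the paper turns the score into a hallucination decision is not stated in the summary, so any thresholding on top of it would be an assumption.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """LCS-based F1 over lowercased whitespace tokens (a simplified ROUGE-L)."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

overlap = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
```

The dynamic program is quadratic in sequence length and needs no model weights, which fits the CPU-only setting the benchmark targets.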
Making Multimodal LLMs Reliable Chart Data Extractors: A Benchmark and Training Framework
The paper introduces a benchmark and training framework to enhance multimodal large language models (MLLMs) for chart data extraction, particularly from label-free charts. The authors propose a human-centered approach, modeling chart reading as a progressive learning process, and develop a 7B-parameter model that achieves state-of-the-art performance in numerical accuracy. Results demonstrate significant improvements over existing methods, with a user study confirming the model's effectiveness in mixed-initiative workflows.
multimodal llms · chart data extraction · progressive learning · numerical accuracy · mixed-initiative systems
Accelerating Q-learning through Efficient Value-Sharing across Actions
The paper introduces the mean-expansion layer, a parameter-free addition to Q-network architectures that accelerates action-value learning in reinforcement learning. The layer shares values across actions within a state and transforms the problem into learning a lower-norm representation of action-values, rather than directly learning potentially large values. Applied to deep Q-networks and implicit quantile networks, the method improves aggregate performance across 57 Atari games, increases action gaps, and significantly reduces value overestimation.
mean-expansion layer · action-value learning · q-network architectures · implicit quantile networks · value overestimation
The CRISTAL Method: Neurosymbolic analysis from AI-synthesized world models
The CRISTAL Method introduces a neurosymbolic framework for automating complex analysis workflows, addressing challenges in domains like fundamental investment analysis with high uncertainty and subjective data. It combines statistical model synthesis, continuous learning, and active learning to build a dynamic, interpretable probabilistic program supporting Bayesian inference. The method leverages LLMs for code synthesis and refines its world model during analysis. Evaluated on a synthetic equities benchmark, CRISTAL achieves Bayes-optimal accuracy with 5 examples and a 5-second budget, outperforming state-of-the-art LLMs by accuracy margins of 60%.
neurosymbolic · probabilistic program · bayesian inference · active learning · llm synthesis
Multi-Level Distributional Entropy for Explainable Network Intrusion Detection
The paper introduces Multi-Level Distributional Entropy (MDE), a framework for deriving interpretable entropy features from flow-level statistics in network intrusion detection. MDE operates at three levels: within-flow Gaussian differential entropy, cross-directional Jensen-Shannon divergence, and TCP flag-pattern Shannon entropy, requiring no raw packet data. Evaluated on NSL-KDD, CICIDS-2017, CICIDS-2018, and UNSW-NB15 benchmarks, entropy-only features achieve weighted F1 scores of 0.708-0.989, matching conventional features. Analysis reveals hidden failure modes, such as a detection rate drop to 0.48 on CICIDS-2018 despite F1=0.74. SHAP analysis confirms reproducible entropy attributions (Spearman rho=0.80-0.95).
entropy · intrusion detection · jensen-shannon divergence · shapley values · tcp flags
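The three entropy quantities named in the MDE summary have standard closed forms, sketched below. How the paper maps flow statistics onto these inputs (which variance, which histograms, which flag sequences) is not specified in the summary, so the example inputs are purely illustrative.

```python
import math
from collections import Counter

def gaussian_differential_entropy(variance):
    """Differential entropy of a Gaussian, 0.5 * ln(2 * pi * e * variance), in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * variance)

def shannon_entropy(symbols):
    """Shannon entropy in bits of the empirical distribution over symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Illustrative inputs: a TCP flag sequence and forward/backward size histograms.
flag_entropy = shannon_entropy(["SYN", "ACK", "SYN", "ACK"])
direction_gap = js_divergence([1.0, 0.0], [0.0, 1.0])
within_flow = gaussian_differential_entropy(4.0)
```

With base-2 logarithms the Jensen-Shannon divergence is bounded in [0, 1], which makes it convenient as a normalized feature for comparing the two traffic directions.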
What Drives the Inlier-Memorization Effect? A Theory of Outlier Detection via Early Training Dynamics
The paper presents a theoretical analysis of the inlier-memorization (IM) effect in unsupervised outlier detection, where deep models memorize inlier patterns earlier than outliers. Using a simple autoencoder framework, the authors characterize the emergence, strength, and persistence of IM under mild assumptions, linking these properties to data distribution and parameter initialization. Derived guidelines for enhancing IM—including data preprocessing and initialization schemes—achieve state-of-the-art performance on ADBench, providing a theoretical foundation for IM-based methods.
outlier detection · inlier-memorization · autoencoder · unsupervised learning · early training dynamics
HERO: Improving the Reliability and Sensitivity of Generative Model Evaluation Using Historical Data
HERO (History Enhanced RObust model evaluation) introduces a framework leveraging historical evaluation data to enhance generative model assessment reliability and sensitivity. By calibrating noisy silver labels against sparse gold annotations and anchoring estimators to high-precision covariates, HERO suppresses bias and reduces variance. Theoretical conditions for bias-variance reduction are established, with empirical validation through simulations and real-world benchmarking datasets. The method remains effective across evaluation tasks and partial historical labeler availability.
generative model evaluation · silver labels · gold annotations · bias-variance tradeoff · covariate anchoring
FalconTrack: Photorealistic Auto-Labeled Perception and Physics-Aware Vision-Based Aerial Tracking
FalconTrack introduces a unified perception-and-tracking framework for vision-based aerial tracking in GPS-denied environments, leveraging photorealistic simulation for automated label generation and physics-aware tracking. The system employs a Gaussian Splatting simulator to isolate target Gaussians from short object videos, compositing them with randomized backgrounds to produce RGB, mask, class, and 6-DoF pose labels, generating 10k labeled images in under 20 minutes. A multi-head perception module trained with staged learning and reprojection consistency is fused with class-conditioned dynamics priors in an EKF for tracking. FalconTrack achieves 96-100% class accuracy in zero-shot sim-to-real transfer, maintains consistent performance in unseen scenes, and runs at 25 Hz with 100% success in real hardware closed-loop visual tracking.
gaussian splatting · sim-to-real transfer · 6-dof pose · ekf tracking · zero-shot learning
Mandol: An Agglomerative Agent Memory System for Long-Term Conversations
Mandol introduces an agglomerative memory system for long-term conversational agents, addressing fragmentation and inefficiency in existing heterogeneous databases. The system features a hierarchical memory model with basic and abstract layers uniformly represented as structured semantic graphs, a fused semantic data structure (SemanticMap + SemanticGraph) enabling hybrid retrieval, and a quantitative query mechanism with adaptive routing and token-constrained context generation. Evaluated on LoCoMo and LongMemEval benchmarks, Mandol achieves superior accuracy, 5.4x faster retrieval, and 4.8x faster insertion under 10 QPS load while maintaining low latency on consumer hardware.
agglomerative memory · semantic graph · hybrid retrieval · quantitative query · long-term conversation
Towards Generalizable and Evidential Nuclear Magnetic Resonance-Based Molecular Structure Elucidation via Large Language Model Agent
NMRAgent introduces a novel approach to molecular structure elucidation by integrating large language models (LLMs) with specialized spectral analysis tools and chemical knowledge graphs. The agent mimics human deductive reasoning, processing NMR spectra and molecular formulas to plan elucidation, propose candidate structures, verify peak-atom consistency, and refine substructures through formula-aware fragment optimization. NMRAgent achieves a 46.5% improvement in top-1 accuracy and a 0.502 increase in Tanimoto similarity on a scaffold-split benchmark, demonstrating its efficacy with novel scaffolds. It successfully elucidated structures of unknown natural products and corrected literature misassignments, establishing a new paradigm for interpretable AI in analytical chemistry.
nmr spectroscopy · large language models · evidential reasoning · chemical knowledge graphs · scaffold-split benchmark
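Tanimoto similarity, the second metric reported for NMRAgent, compares two molecular fingerprints as sets of "on" bits. The sketch below assumes fingerprints are given as collections of bit indices; the fingerprint type the paper uses is not stated in the summary, and the example values are made up.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity: |A & B| / |A | B| over on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return len(a & b) / len(a | b)

# Hypothetical on-bit indices for a predicted and a reference structure.
similarity = tanimoto({1, 2, 3, 4}, {3, 4, 5})
```

The score ranges from 0 (no shared bits) to 1 (identical fingerprints), so a reported increase of 0.502 is large relative to that scale.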
CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents
CLQT introduces a closed-loop, cost-aware benchmark for diagnosing LLM portfolio-management agents, shifting evaluation from ranking returns to identifying process strengths and weaknesses. The framework employs a five-stage trading cycle (gather, synthesize, allocate, execute, reflect) within a temporally-gated environment, supported by six pillars including TimeGate, cost modeling, and strategy-consistency scoring. Agents operate as constrained committees or full-autonomy orchestrators, enabling process scaffolding experimentation. Metrics are derived from a recompute-verifiable hash chain, yielding a five-axis capability scorecard (APM-CS) validated via contamination-controlled backtests and live broker tracks. CLQT disentangles outcomes from capabilities, providing a durable map of agent competencies.
closed-loop · cost-aware · strategy-consistency · temporally-gated · hash chain
TopoAgent: An Agentic Framework for Automated Topology Learning in Medical Imaging
TopoAgent introduces an LLM-based agentic framework for automated topology learning in medical imaging, addressing the limitation of fixed topological descriptors in conventional methods. The framework employs a Perception-Reasoning-Action-Reflection loop, supported by 21 domain-specific tools and dual memory, to analyze input images and determine optimal topological descriptors without task-specific training. It evaluates 15 topological descriptors across 26 datasets using six classifiers, enabling the generation of suitable topological feature vectors for downstream tasks. This approach leverages persistent homology to capture geometric structural properties often overlooked by pixel-level deep learning.
topological data analysis · persistent homology · agentic framework · topological descriptors · medical imaging
PS-PPO: Prefix-Sampling PPO for Critic-Free RLHF
The paper introduces Prefix-Sampling Proximal Policy Optimization (PS-PPO), a critic-free RLHF method that improves computational efficiency by exploiting temporal redundancy in trajectories. PS-PPO samples trajectory prefixes via a prompt-conditioned cutoff distribution and applies importance-weighting to maintain unbiased gradient estimation while only backpropagating through prefixes. Experiments on mathematical reasoning and RLHF benchmarks demonstrate comparable accuracy to baselines while significantly reducing training compute (up to 3×) and GPU memory usage.
reinforcement learning from human feedback · proximal policy optimization · critic-free methods · temporal redundancy · gradient estimation
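Why importance weighting keeps a prefix-only estimate unbiased can be shown with a small exact calculation. This is a generic inverse-probability-weighting sketch, not PS-PPO's actual estimator: the paper's cutoff distribution is prompt-conditioned, whereas the fixed `cutoff_probs` and the per-token terms below are made-up placeholders.

```python
def inclusion_probs(cutoff_probs):
    """inclusion[t] = P(prefix length > t), the chance that token t is kept.

    cutoff_probs[k - 1] is the probability of sampling prefix length k.
    """
    probs, tail = [], 0.0
    for p in reversed(cutoff_probs):
        tail += p
        probs.append(tail)
    return probs[::-1]

def prefix_estimate(per_token_terms, prefix_len, inclusion):
    """Sum only the kept tokens, each reweighted by 1 / P(token is kept)."""
    return sum(per_token_terms[t] / inclusion[t] for t in range(prefix_len))

terms = [0.5, -1.0, 2.0, 0.25]        # stand-ins for per-token gradient terms
cutoff_probs = [0.1, 0.2, 0.3, 0.4]   # P(prefix length = 1, 2, 3, 4)
inclusion = inclusion_probs(cutoff_probs)

# Exact expectation over all possible cutoffs, compared with the full sum.
expected = sum(
    p * prefix_estimate(terms, k, inclusion)
    for k, p in enumerate(cutoff_probs, start=1)
)
full_sum = sum(terms)
```

The expectation matches the full-trajectory sum because each token's 1 / P(kept) weight cancels the probability of it being kept; the cost is higher variance on late tokens, which are kept rarely and weighted heavily.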
Rethinking Generative Reconstruction Attacks against Graph Neural Network Models
The paper introduces two novel graph inversion attacks against Graph Neural Networks (GNNs): graph-label conditioned (GLC) and embedding-label conditioned (ELC) attacks, leveraging model predictions and intermediate representations respectively. Using a generator-discriminator approach, the attacks reconstruct high-quality graphs in black-box scenarios, evaluated on NCI1, PROTEINS, and AIDS datasets with FGD, EGD, MMD, and GKS metrics. Results show that GNNs remain vulnerable even with 50% fewer queries (the "Ours--" variant) and under varying Laplacian noise scales.
graph neural networks · model inversion attack · privacy attacks · graph reconstruction · black-box attack
DEEPMED Search: An Open-Source Agentic Platform for Medical Deep Research with Introspective Verification
The paper introduces DEEPMED Search, an open-source agentic platform for transparent medical research that addresses limitations in commercial tools and standard RAG implementations. The system employs a source-adaptive router to dispatch sub-queries to PubMed, web search, or local knowledge bases, coupled with an introspective verification module using causal-consistent multi-agent debate for evidence validation. Results demonstrate its capability to handle rare disease queries, filter noise, and generate citation-backed reports efficiently, providing a robust infrastructure for medical reasoning.
agentic platform · introspective verification · source-adaptive router · causal-consistent debate · knowledge bases
DeepTrans Studio: Turning Expert Interventions into Shared Team Knowledge in Agentic Translation Workflows
DeepTrans Studio introduces a collaborative translation workspace that transforms expert interventions into shared team knowledge within agentic translation workflows. The system enables professionals to intercept specific workflow nodes, review evidence, revise AI outputs, and save approved decisions to a collective team memory. During a demo, participants role-played translators and reviewers, addressing preset terminology and legal-modal risks, with their decisions propagated to downstream segments and surfaced as reusable precedents in teammates' workspaces. This approach ensures human interventions become traceable, shared knowledge rather than isolated corrections.
agentic translation · team memory · legal-modal risks · workflow nodes · reusable precedents
From Trait to Behavior: A Cognitive-Affective Personality System (CAPS) Perspective on Multi-Homing Intention in AIGC Platforms
This study addresses the theoretical gap in cross-platform usage intentions within Artificial Intelligence Generated Content (AIGC) platforms by proposing and validating a three-stage multiple mediation model. The model integrates optimum stimulation level (OSL) theory, complementarity theory, and perceived value theory, with social influence and use experience as control variables. Results indicate that OSL enhances perceived complementarity, which positively affects perceived epistemic value, subsequently predicting multi-homing intention. A chain mediation path from OSL to multi-homing intention via perceived complementarity and epistemic value was identified. Social influence positively impacts multi-homing intention, while use experience shows no significant effect.
artificial intelligence generated content · optimum stimulation level · perceived complementarity · perceived epistemic value · multi-homing intention
Redefining Maritime Anomaly Detection via Equation-Grounded Synthetic Anomalies
The paper introduces an equation-grounded taxonomy and synthetic anomaly generation pipeline for maritime anomaly detection using AIS data. The method defines three anomaly types (unexpected activity, route deviation, close approach) and implements a score-synthesize-label pipeline with LLM-guided plausibility scoring. Evaluations across temporal-window variations and anomaly compositions demonstrate the framework's effectiveness when tested on diverse time-series and anomaly detection models. The approach addresses limitations of prior statistical rarity and expert-labeling methods while enabling systematic benchmarking.
automatic identification system · anomaly taxonomy · llm-guided synthesis · time-series evaluation · maritime safety
Diagnosing and Mitigating Context Rot in Long-horizon Search
This paper investigates and mitigates context rot, a degradation of Large Language Model (LLM) capabilities due to extensive context in long-horizon search tasks. The authors evaluate four open-source models across three benchmarks, revealing that increasing context length leads to premature uncertain answers or model disengagement. They explore mitigation strategies through context management (seven methods across three categories) and rot-aware rejection sampling, demonstrating their effectiveness individually and in combination. Pruning experiments establish a relationship between accumulated context and rot severity, providing guidance for strategy selection based on performance, cost, and rot impact.
context rot · long-horizon search · rejection sampling · context management · pruning experiments
Optimizing Expert-Designed Crystal Graph Networks for Band-Gap Prediction with an Autonomous LLM Research Loop
An autonomous LLM research agent optimized expert-designed crystal graph networks for band-gap prediction, achieving state-of-the-art accuracy on the MatBench benchmark (>100k crystals) without external pretraining. The agent implemented known methods, including element-pair features on message-passing edges and crystal space-group embeddings, outperforming seventeen expert-designed models. This work demonstrates the potential of LLM-driven autonomous research in optimizing machine learning models for material property prediction while highlighting its methodological limitations.
crystal graph networks · band-gap prediction · matbench · message-passing edges · space-group embedding
SEVA: Self-Evolving Verification Agent with Process Reward for Fact Attribution
SEVA introduces a structured verification agent for LLM fact attribution, addressing hallucination through evidence alignments, reasoning chains, and error diagnoses. The method employs process reward in RL to decompose verification quality into five components, resolving advantage collapse and inducing an implicit curriculum. Results show improved alignment (0.917→0.997) and F1 (64.9→69.0), with SEVA-3B matching GPT-4o-mini (69.0 vs. 69.8 F1) on ClearFacts while providing richer output.
fact attribution · process reward · advantage collapse · self-evolution loop · structured verification
ARMOR: Adaptive Retriever Optimization for Low-Resource Telecom Question Answering
ARMOR introduces adaptive retriever optimization for low-resource telecom QA, prioritizing query-encoder adaptation over generator fine-tuning under bounded-parameter assumptions. The method jointly optimizes latent-document RAG likelihood and InfoNCE contrastive objectives, learning separate temperatures for each and regularizing the adapted query encoder toward its frozen base. Evaluations on telecom-specific benchmarks demonstrate improved evidence retrieval and answer generation compared to generator-side adaptation.
retrieval-augmented generation · query-encoder tuning · infonce · latent-document likelihood · telecom qa
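For the ARMOR item above, the contrastive half of the objective can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy embeddings, the fixed temperature (the summary says temperatures are learned), and the squared-L2 `anchor_penalty`, which stands in for whatever regularizer the authors use to keep the adapted query encoder near its frozen base.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(query, docs, positive_idx, temperature):
    """InfoNCE loss: negative log-softmax of the positive document's score."""
    logits = [dot(query, d) / temperature for d in docs]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[positive_idx]

def anchor_penalty(query, frozen_query, weight):
    """Hypothetical regularizer: squared-L2 pull toward the frozen embedding."""
    return weight * sum((a - b) ** 2 for a, b in zip(query, frozen_query))

# Toy 2-D embeddings; document 0 is the relevant (positive) one.
docs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
frozen_q = [0.6, 0.8]    # query embedding from the frozen base encoder
adapted_q = [0.9, 0.4]   # query embedding after adaptation

loss_frozen = info_nce(frozen_q, docs, positive_idx=0, temperature=0.5)
loss_adapted = info_nce(adapted_q, docs, positive_idx=0, temperature=0.5)
penalty = anchor_penalty(adapted_q, frozen_q, weight=0.1)
total = loss_adapted + penalty
```

In this toy case the adapted query scores the positive document higher, so its InfoNCE loss is lower, while the penalty grows with the distance from the frozen embedding.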
GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots
GUICrafter introduces a weakly-supervised GUI agent that reduces reliance on human annotations by leveraging massive unannotated screenshots. The method employs a two-stage curriculum learning framework: first learning visual grounding from unannotated GUI screenshots and webpages, then fine-tuning with minimal high-quality data via reinforcement learning. Experiments demonstrate performance competitive with UI-TARS using only 0.1% of its annotated data, and superior results to GUI-R1 under equivalent annotation budgets.
gui agent · weakly-supervised learning · visual grounding · curriculum learning · reinforcement learning
Toward Secure and Reliable PDDL Formalization of Large Language Models with Planner-in-the-Loop Feedback
The paper introduces NL-PDDL-Bench, a multi-domain benchmark for natural-language-to-PDDL formalization with planner-verified executability and difficulty scaling by object count. It proposes a planner-in-the-loop framework using validator and planner diagnostics for localized edits, combining Low-Rank Adaptation fine-tuning, planner-derived Direct Preference Optimization, and inference-time repair. Experiments demonstrate improved planner success rates, plan-level agreement, and robustness across domains, highlighting verifiable formalization for safety-critical LLM deployment.
pddl formalization · planner-in-the-loop · low-rank adaptation · direct preference optimization · plan-level consistency
Early Warning Signals for OpenVLA Failure under Visual Distribution Shift
This work demonstrates that OpenVLA's internal activations contain linearly decodable signals predictive of near-term task failure under visual distribution shifts. The authors analyze LIBERO manipulation rollouts with a fixed OpenVLA policy, logging activations and fitting lightweight monitors post-hoc. Under occlusion stress tests reducing success rates from 57% to 17%, a logistic probe at layer 16 achieves AUROC 0.972 and AUPRC 0.352 for failure prediction within 15 steps, outperforming baselines. Layer-wise analysis reveals uneven decodability, with layer 16 being most informative. The monitor generalizes to camera jitter but not benign color shifts, though causal mechanisms and deployable recovery remain unestablished.
vision language action models · linear decodability · distribution shift · task failure prediction · internal activations
A Machine-Verified Proof of a Quantum-Optimization Conjecture
The authors present a machine-verified proof of the Farhi-Goldstone-Gutmann (FGG) conjecture in quantum optimization, which had remained open for over a decade. Using Claude Fable 5 and Lean 4, they formalized QAOA components and reduced the conjecture to a single mathematical statement, which the LLM then proved by identifying a hidden dynamical symmetry. The proof leverages quantum information theory and adjacent mathematical tools, with Lean providing end-to-end verification while requiring human input only for structural validation. This demonstrates a scalable methodology for resolving open conjectures in quantum information science.
quantum approximate optimization algorithm · lean theorem prover · machine-verified proof · dynamical symmetry · quantum information theory
Sample-Efficient Learning of Probabilistic Causes for Reachability in Markov Decision Processes with Probabilistic Guarantees
The paper introduces a sample-efficient learning method for identifying probability-raising (PR) causes in Markov decision processes (MDPs) with unknown transition probabilities, providing probabilistic guarantees. The approach uses a restart-based MDP modification to reduce PR-cause verification to two conditional reachability queries, avoiding reliance on original MDP reachability values. Theoretical analysis establishes sample-complexity bounds, while experiments on benchmarks demonstrate reliable causal identification via an anytime algorithm combining learning and two-sided value iteration.
markov decision processes · probability-raising causality · sample complexity · conditional reachability · value iteration
Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature
We introduce MatSciFig, a large-scale multimodal dataset unlocking the visual record of materials science literature by decomposing compound figures into 391,606 panel-level image-text pairs from 180,571 figures across 14,810 open-access articles. Our MatMMExtract pipeline employs a fine-tuned YOLO12-m detector for panel localization (mAP_50: 0.9227) and Gemini 3.1 Flash Lite for structured annotation generation (82% quality, 4.8% hallucination rate). Each pair includes sub-captions, visualization categories, and scientific summaries grounded in a materials science taxonomy. A dual-encoder retrieval baseline demonstrates MatSciFig's utility, achieving 4.4x improvement in R@1 over zero-shot CLIP. All resources are openly released.
multimodal dataset · compound figures · panel localization · structured annotation · vision-language learning
Diversity is the Strength of the AI Crowd
The study demonstrates that ensemble forecasting accuracy for future world events improves by combining diverse off-the-shelf LLMs rather than relying on highly correlated predictions from similar models. Using binary questions from the Metaculus AI Benchmark, the authors analyze prediction correlations and find that models like Grok 4 enhance ensemble performance due to lower correlation with other frontier LLMs. Results indicate optimal ensembles prioritize both model quality and diversity, suggesting AI forecasting systems should explicitly optimize for complementary errors.
ensemble forecasting · llms · metaculus · correlation · superforecaster
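Why lower correlation helps an ensemble can be seen from a textbook identity, sketched here under simplifying assumptions that are not taken from the paper: n forecasters with equal error variance sigma2 and a common pairwise error correlation rho. The variance of the averaged forecast's error is then sigma2 * (1 + (n - 1) * rho) / n. The numbers below are illustrative only.

```python
def ensemble_error_variance(n, sigma2, rho):
    """Error variance of the mean of n equally accurate, equally correlated forecasters."""
    return sigma2 * (1 + (n - 1) * rho) / n

# Five models of identical individual quality; only their correlation differs.
correlated = ensemble_error_variance(n=5, sigma2=0.04, rho=0.9)
diverse = ensemble_error_variance(n=5, sigma2=0.04, rho=0.3)

print(f"highly correlated ensemble: {correlated:.4f}")
print(f"more diverse ensemble:      {diverse:.4f}")
```

With highly correlated members, averaging removes little of the individual error; with more diverse members, the same five models roughly halve it in this toy setting.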
Safety from Honesty in a Disinterested AI Predictor
The paper presents a formal safety argument for the Scientist AI (SAI) Predictor, a Bayesian posterior approximation model trained on epistemically contextualized natural-language statements. The Predictor avoids implicit agency by distinguishing factual claims from communication acts and using a posterior-seeking training objective that excludes downstream effects as reward signals. Under assumptions on training dynamics and sparsity of dangerous Predictors, the probability of residual harm exceeding a threshold is proven to be small, as coordinated deception is rare and costly. Safety and accuracy are jointly supported by constraints that prevent misalignment and agency emergence.
bayesian posterior · epistemic contextualization · implicit agency · training dynamics · residual harm
Budgeted Act-or-Defer Multi-Agent LLM Deliberation with Local Reliability Bounds
The paper introduces a budgeted act-or-defer framework for multi-agent LLM deliberation, ensuring reliable decision-making by deferring to human review when confidence bounds fall below a user-specified threshold. The method maps debate prefixes to low-dimensional states, computes k-nearest-neighbor lower confidence bounds on correctness, and decomposes risk into calibration failure, residual action risk, and representation gap. Evaluated on six benchmarks against nine baselines, it achieves 84% automation and 96% acted-on accuracy while using only 9-12% of the pre-declared wrong-action budget. The approach prospectively converts user-declared budgets into auditable operating points under explicit assumptions.
multi-agent deliberation · confidence bounds · representation gap · wrong-action budget · auditable operating point
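The act-or-defer rule in the item above can be sketched as follows. This is a rough illustration, not the paper's procedure: the Hoeffding-style bound, the Euclidean distance, the toy calibration set, and the names `knn_lower_bound` and `act_or_defer` are all assumptions introduced here.

```python
import math

def knn_lower_bound(state, calibration, k, delta):
    """Lower confidence bound on correctness from the k nearest calibration states.

    Uses a Hoeffding-style margin (an assumption; the paper's bound may differ).
    """
    nearest = sorted(calibration, key=lambda item: math.dist(state, item[0]))[:k]
    accuracy = sum(correct for _, correct in nearest) / k
    return accuracy - math.sqrt(math.log(1.0 / delta) / (2 * k))

def act_or_defer(state, calibration, k=5, delta=0.1, threshold=0.4):
    """Act only when the local lower bound clears the user-specified threshold."""
    bound = knn_lower_bound(state, calibration, k, delta)
    return "act" if bound >= threshold else "defer"

# Toy calibration set: (low-dimensional state, 1 if the debate outcome was correct).
calibration = [
    ((0.0, 0.0), 1), ((0.1, 0.0), 1), ((0.0, 0.1), 1), ((0.1, 0.1), 1), ((0.05, 0.05), 1),
    ((1.0, 1.0), 1), ((0.9, 1.0), 0), ((1.0, 0.9), 0), ((0.9, 0.9), 1), ((0.95, 0.95), 0),
]

reliable_region = act_or_defer((0.05, 0.0), calibration)
unreliable_region = act_or_defer((1.0, 0.95), calibration)
```

A state surrounded by correct calibration outcomes is acted on, while a state in a mixed neighborhood is deferred to human review.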
Hybrid Retriever Evolution for Multimodal Document Reasoning Agents
The paper introduces a failure-driven evolution framework for learning adaptive retrieval orchestration in multimodal document question answering. A meta-agent diagnoses reasoning failures, probes the tool environment, and iteratively rewrites the task agent's instructions to dynamically coordinate lexical, semantic, and multimodal retrievers during reasoning steps. Evaluated on MMLongBench-Doc and DocBench, the evolved agent achieves up to +19.6 point gains over baselines, outperforming MACT, MDocAgent, and SimpleDoc through adaptive routing and cross-modal evidence composition rather than fixed retrieval pipelines.
multimodal retrieval · failure-driven evolution · adaptive routing · meta-agent · document reasoning
Fuzzing Large Language Models to Elicit Hidden Behaviours
This paper presents the first systematic study of fuzzing techniques to elicit hidden behaviors in sleeper-agent LLMs (7B-13B parameters), comparing Gaussian noise injection into weights versus residual-stream activations against temperature-sampling baselines. Fuzzing outperformed temperature sampling on 4 of 6 models, with up to 6x improvement on OpenHermes-13B. Hyperparameter selection proved critical: uniform sweeps yielded low elicitation rates (a few percent), whereas the best cells were 2-10x higher. A Thompson-sampling-based proxy task (in-context secret elicitation) improved activation-fuzzing elicitation by 4x and weight-fuzzing by 1.3-1.8x over uniform baselines. The authors propose reporting results as a (uniform-baseline, proxy-selected, oracle) triple for clarity.
fuzzing · sleeper-agent · thompson sampling · residual-stream · hyperparameter selection
Fast Wireless Foundation Models with Early-Exits
The paper introduces an early-exit framework for wireless foundation models (FMs) to reduce computational costs while improving out-of-distribution (OOD) task performance. The method attaches lightweight task-specific heads at intermediate layers of a frozen FM encoder, enabling variable-depth inference. Results show up to 93% FLOPs reduction and higher accuracy on unseen tasks compared to full encoder execution, with fixed-exit strategies outperforming dynamic early-exiting policies.
wireless foundation models · early-exit · out-of-distribution · variable-depth inference · flops reduction
Two-Stage Prompt Optimization for Few-Shot Relation Extraction: From Reasoning-Guided Search to Gradient-Guided Refinement
We propose a two-stage framework for automatic prompt optimization in episodic few-shot relation extraction, combining reasoning-based and gradient-based approaches. The first stage employs any reasoning-based optimizer for broad prompt improvements, while the second stage introduces GradPO, which uses loss and gradient signals to identify high-impact prompt spans and refine them with local edits. Experiments on FS-TACRED and FS-FewRel demonstrate that local refinement typically enhances prompts from the first stage, with GradPO being the most consistent refiner. Our framework achieves state-of-the-art performance on FS-TACRED using Qwen3-4B and remains competitive on FS-FewRel.
prompt optimization · few-shot relation extraction · gradient-based optimization · reasoning-based optimizer · local refinement
SFBench: The SciFy Scientific Feasibility Benchmark
SFBench introduces a novel benchmark for evaluating systems that assess scientific claim feasibility, featuring 197 de novo claims in materials science annotated with expert-derived feasibility scores and explanations. The benchmark emphasizes complex reasoning over varying feasibility levels, avoids LLM training contamination by using original claims, and employs open-ended explanations rather than fixed-format responses. Baseline evaluations using recent GPT models demonstrate the benchmark's utility in assessing scientific reasoning capabilities.
scientific feasibility · materials science · benchmark dataset · open-ended explanations · gpt models
SCARCE: Scalable Cascade Analysis for Rare-event Characterisation via Embeddings
The paper introduces SCARCE, a method for scalable rare-event probability estimation that replaces traditional Subset Simulation's handcrafted performance function with learned latent embeddings and geometric rulers. By adaptively constructing nested intermediate events from data and formalizing the approach via a non-negative supermartingale, SCARCE provides valid high-probability upper bounds even under early stopping. Experiments demonstrate 400-500x lower mean absolute error than grid-searched Subset Simulation on MNIST misclassification, and 2.6% mean relative error for LLM jailbreak detection on Llama-Guard-3-8B hidden states with adversarial fractions η ≥ 10⁻³.
rare-event estimation · subset simulation · latent embeddings · non-negative supermartingale · llm jailbreaks
Do We Still Need Fine Tuning? Turkish Sentiment Analysis in the Era of Large Language Model
The study demonstrates that supervised fine-tuning remains superior to zero-shot prompting for Turkish sentiment analysis, particularly in three-class settings. Comparing classical ML, fine-tuned BERTurk, and prompted LLMs on Turkish e-commerce reviews, fine-tuned BERTurk achieves the highest accuracy, while LLMs struggle with neutral class classification, often collapsing it into polarized categories. Performance gaps narrow in binary positive-negative classification, but three-class evaluation reveals LLMs' limitations. Results emphasize the continued necessity of fine-tuning and the importance of including neutral classes for robust sentiment analysis evaluation.
sentiment analysis · fine-tuning · large language models · zero-shot learning · berturk
Does Role Specialization Matter for Explanation Faithfulness in Mixture-of-Experts?
The study investigates whether role specialization in Mixture-of-Experts (MoE) architectures preserves explanation faithfulness, hypothesizing that inter-expert representation overlap degrades attribution-based faithfulness. The authors propose representation-level decorrelation regularization to minimize inter-expert similarity, enhancing role separation. Experiments on multimodal benchmarks demonstrate improved faithfulness metrics (comprehensiveness, sufficiency, AOPC) without compromising task performance, with benefits extending to standard sparse MoE baselines. Findings suggest representation-level separation complements structural role decomposition for faithful explanations.
mixture-of-experts · explanation faithfulness · representation decorrelation · multimodal benchmarks · attribution-based metrics
Mechanistically Eliciting Latent Behaviors in Language Models
The paper introduces Causal Perturbative Elicitation (CPE), an unsupervised method for discovering interpretable low-rank adapters (LoRAs) that elicit latent behaviors in language models. CPE uses tensor decomposition to perturb transformer computations, efficiently learning diverse behavioral modes from minimal data. Results show competitive performance with supervised methods (85% vs 87% on Qwen3-8B Countdown task), success in unlocking sandbagged models (85% BigCodeBench recovery), and mitigation of alignment-faking in Llama3-70B. CPE also aids alignment initialization in GPT-OSS-20B, demonstrating utility for both safety evaluation and behavioral control.
causal perturbative elicitation · low-rank adapters · tensor decomposition · latent behaviors · alignment-faking
Langshaw: Declarative Interaction Protocols Based on Sayso and Conflict
Langshaw introduces a declarative protocol language for multiagent systems, addressing over-constraining and semantic ambiguity in existing approaches. The method centers on three constructs: 'sayso' for attribute priority assignment, and 'nono' and 'nogo' for action conflict resolution, combined with an information model for semantic clarity. Results include formal semantics, safety/liveness verification procedures, and a message-oriented protocol generation method enabling flexible asynchronous enactment.
declarative protocol · multiagent systems · attribute priority · conflict resolution · asynchronous enactment
One Scene, Two Depths: Probing Geometric Ambiguity in Monocular Foundation Models
The paper introduces MultiDepth-3k (MD-3k), a sparse two-layer ordinal benchmark for evaluating depth-layer preference and multi-layer spatial relationship accuracy (ML-SRA) in monocular depth estimation. It highlights the geometric ambiguity in transparent scenes, where a single camera ray may intersect multiple surfaces, challenging the conventional single-depth-per-pixel paradigm. Experiments on MD-3k reveal diverse depth-layer preferences across leading depth foundation models under RGB input, with Laplacian Visual Prompting (LVP) significantly altering layer predictions for frozen models. The best-performing RGB/LVP pair, DAv2-L, achieves 75.5% ML-SRA, suggesting that depth models may express complementary geometric hypotheses beyond standard RGB inference.
monocular depth estimation · geometric ambiguity · multi-depth benchmark · laplacian visual prompting · spatial relationship accuracy
How AI settled the complexity of the oldest SGD algorithm
The paper establishes the worst-case complexity of the Kaczmarz algorithm, the earliest known stochastic gradient descent (SGD) method originally proposed in 1937 for solving linear systems. Modern AI models including ChatGPT and Gemini were employed to analyze this foundational optimization technique. The interdisciplinary approach connects classical numerical analysis with contemporary machine learning paradigms to characterize the algorithm's computational limits.
stochastic gradient descent · kaczmarz algorithm · computational complexity · linear systems · optimization
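For readers unfamiliar with the Kaczmarz method mentioned above, a minimal sketch follows. Each step projects the current iterate onto the hyperplane of one equation, which is a stochastic gradient step on that row's squared residual with step size 1/||a_i||^2. The uniformly random row choice, the iteration count, and the 2x2 example system are assumptions made for illustration; the summary does not say which variant the paper analyzes.

```python
import random

def kaczmarz(A, b, iterations=2000, seed=0):
    """Randomized Kaczmarz iteration for a consistent linear system A x = b."""
    rng = random.Random(seed)
    x = [0.0] * len(A[0])
    for _ in range(iterations):
        i = rng.randrange(len(A))            # pick one equation at random
        row = A[i]
        residual = b[i] - sum(a * xj for a, xj in zip(row, x))
        norm2 = sum(a * a for a in row)
        step = residual / norm2              # projection onto row i's hyperplane
        x = [xj + step * a for xj, a in zip(x, row)]
    return x

# Example system with exact solution x = (1, 3).
A = [[2.0, 1.0], [1.0, 3.0]]
b = [5.0, 10.0]
solution = kaczmarz(A, b)
```

Each update touches only one row of A, which is what makes the method attractive for large systems.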
SonoCLIP: Mask-Guided Region-Aware Vision-Language Pretraining for Fetal Ultrasound Analysis
SonoCLIP introduces a region-controllable vision-language foundation model for fetal ultrasound analysis, addressing limitations of global image-text alignment in existing CLIP-based approaches. The model integrates segmentation masks as mask-channel visual prompts within the vision encoder, enabling joint global-local contrastive representation learning, and employs a sigmoid-based pairwise contrastive loss for scalable region-text alignment. Pretrained on a curated 1.44M-image multimodal fetal ultrasound dataset spanning 24 standard planes, SonoCLIP demonstrates superior zero-shot transfer performance in cross-center evaluations under both global and mask-guided inference. The model establishes a clinically oriented foundation for fetal ultrasound analysis.
vision-language foundation model · mask-channel visual prompts · contrastive representation learning · zero-shot transfer · fetal ultrasound analysis
Bilevel Optimization for Neural Architecture Search
The paper provides a structured overview of Neural Architecture Search (NAS) through bilevel optimization, categorizing methods into sampling-based and bilevel theory-based approaches. It introduces an auxiliary mathematical programming framework that integrates second-order information from training loss, ensuring optimal parameter updates for both architecture and model weights. Comparative analysis demonstrates that bilevel theory-based methods outperform sampling-based approaches in accuracy and efficiency.
bilevel optimization · neural architecture search · hyperparameter tuning · second-order information · mathematical programming
The Joint Effect of Quantization and Sampling Temperature on LLM Safety Alignment: A Factorial Analysis
This work systematically evaluates how quantization and sampling temperature jointly affect LLM safety alignment through a factorial study of 9 instruction-tuned models across 3 precisions (FP16, INT8, INT4) and 6 temperatures (0-1.0), generating 322k responses assessed by a safety ensemble. Results show standard quantization is generally safety-neutral (INT4 reduces attack success in 7/9 models), while temperature increases decision instability (DFR reaches 53.0% at T=1.0), with sub-additive interaction effects (Compound Degradation Index: -0.195 to +0.045). The findings suggest INT4/INT8 quantization is viable for aligned models, but safety evaluations at high temperatures should measure multi-sample stability.
quantization · sampling temperature · safety alignment · attack success rate · decision instability
ScAle: Attention Head Scaling as a Minimal Adapter for Spatial Reasoning in Vision Language Models
ScAle introduces an ultra-lightweight adaptation method for vision language models (VLMs) that improves spatial reasoning by rescaling activations in transformer layers without modifying pretrained weights. The method learns a small set of scalar coefficients to modulate last-token attention and MLP activations in a fully frozen backbone. Evaluated on SpatialEval, COCOQA, and VGQA benchmarks, ScAle achieves up to 134.1% relative accuracy gains using only 1K trainable parameters, recovering a substantial fraction of standard PEFT performance while maintaining strong non-spatial VQA accuracy.
spatial reasoning · vision language models · scalar coefficients · last-token attention · parameter-efficient
ReMAP-PET: Beyond Visual Understanding -- Learning Region-Guided Metabolic Alignment Semantics from Brain PET
ReMAP-PET introduces a framework for learning region-guided metabolic semantics from brain PET scans, addressing limitations of existing 3D brain foundation models that treat PET as generic volumetric data. The method supervises a partially-tuned MedicalNet 3D ResNet-50 with brain regional standardized uptake value ratio (SUVR) profiles through joint regression and contrastive objectives. On 1015 paired PET-SUVR samples, ReMAP-PET achieves 0.070 SUVR MAE and 77.8% PET SUVR Recall@1, outperforming five frozen pretrained baselines. The framework enables PET-to-report generation via SUVR-constrained verbalization and retains clinically relevant information in embeddings without task-specific fine-tuning.
positron emission tomography · metabolic semantics · standardized uptake value ratio · contrastive objectives · pet-to-report generation
TF-MoE: Time-Frequency Mixture-of-Experts for Efficient Speech Separation
The paper proposes TF-MoE, a sparse Mixture-of-Experts framework for efficient speech separation that enhances model capacity without increasing inference cost. The method introduces dynamic expert specialization in time and frequency dimensions via alternating time-wise and frequency-wise MoE modules, built upon a mel-band-splitting Conformer backbone. Experiments show TF-MoE outperforms BSRNN by +3.8 dB SDR on Libri2Mix with comparable compute (4.1 GMACs/s), demonstrating effectiveness under low-compute constraints.
mixture-of-experts · speech separation · conformer · edge computing · dynamic routing
Coverage-Driven KV Cache Eviction for Efficient and Improved Inference of LLM
The paper introduces K-VEC, a coverage-aware KV-cache eviction strategy for efficient LLM inference that addresses performance degradation from reduced token coverage. The method employs cross-head and cross-layer coverage modules to retain critical tokens across attention heads and model layers, theoretically preserving mutual information between inputs and outputs. Evaluations on 16 LongBench subsets show K-VEC achieves up to 10.35-point improvement over existing methods under identical eviction rates and memory constraints.
kv-cache eviction · long-context reasoning · mutual information · attention sparsity · llm inference
VISTA-DZ: Visual Semantic Trajectory Adaptation for Personalized Dilemma Zone Prediction
VISTA-DZ introduces a visual semantic trajectory adaptation framework for personalized dilemma zone prediction at signalized intersections. The method converts historical trajectories into visual representations, processes them with a vision-language model to generate behavioral profiles, and uses semantic embeddings to condition a dual-output prediction network combining bidirectional GRU, driver-conditioned cross-attention, and Feature-wise Linear Modulation. Evaluated on SDZ and FDZ datasets, it achieves 93.26% in-domain accuracy and 90.22% mean accuracy across 20 held-out drivers, demonstrating effective simulation-to-real transfer.
dilemma zone · visual semantic trajectory · feature-wise linear modulation · bidirectional gru · cross-attention
Proteus: Automated Adversarial Robustness Testing for Audio Deepfake Detectors
Proteus introduces an automated framework for adversarial robustness testing of audio deepfake detectors, combining exhaustive breadth-first search and Q-learning to identify effective attack chains. The system evaluates sequences of audio transformations (codec transcoding, noise addition, reverberation, dynamic-range compression, VoIP simulation) that fool detectors while maintaining speech quality. Results from production deployment show specific augmentation chains reliably flip detection verdicts without compromising intelligibility or speaker identity, enabling detector hardening via targeted retraining.
adversarial robustness · audio deepfake detection · q-learning · audio transformations · automated testing
Learned Coordination Conventions in Cooperative MARL: Measuring the Translation Gap Between Theory-Informed Roles and Learned Routing
The study introduces a diagnostic framework for measuring the translation gap between theory-informed role expectations and learned coordination conventions in cooperative multi-agent reinforcement learning (MARL). Using role-routing matrices, formation sensitivity, and gradient/occlusion attribution, the authors analyze coordination structures in MiniGrid and SMACv2 (Terran) environments. Results show that label-conditioned attention outperforms flat MLP baselines in producing role-specific routing, exhibits stability across team sizes (3v3 to 9v9), and transfers zero-shot. A 5-seed re-evaluation reveals partial alignment with designer-specified priors, highlighting noise-induced strategic divergence. The framework provides empirical insights into coordination structure without proposing new equilibrium concepts.
multi-agent reinforcement learning · role-routing matrix · formation sensitivity · label-conditioned attention · zero-shot transfer
Em-ergence of the em-dash: a population-level rise in em-dash frequency in medRxiv preprints at the dawn of the large-language-model era
This study quantifies a population-level increase in em-dash usage in medRxiv preprints following ChatGPT's release, suggesting LLM-assisted writing leaves detectable stylistic traces. Analyzing 69,632 Discussion sections (≥500 chars) from 2020-2025 via logistic regression with author-clustered errors, em-dash prevalence rose from 4.23% pre-ChatGPT (before Nov 2022) to 11.58% post-ChatGPT (Δ=7.35pp, 95% CI [6.94-7.77]; OR=2.96, 95% CI [2.77-3.17]). The gradual acceleration (4% in 2023, 20.3% in 2025) persisted across sensitivity analyses and falsification tests, and was absent in pre-LLM placebo splits (+0.13pp) and boilerplate sections.
em-dash · large language models · stylometric analysis · medrxiv · logistic regression
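The headline figures in the item above can be sanity-checked from the two reported prevalences alone. The sketch below computes the raw percentage-point difference and the unadjusted odds ratio; the regression with author-clustered errors is not reproduced, so small differences from the reported estimates are to be expected.

```python
def odds(p):
    """Convert a proportion into odds."""
    return p / (1.0 - p)

pre, post = 0.0423, 0.1158        # em-dash prevalence before and after ChatGPT's release

delta_pp = (post - pre) * 100     # difference in percentage points
unadjusted_or = odds(post) / odds(pre)

print(f"difference: {delta_pp:.2f} pp")
print(f"unadjusted odds ratio: {unadjusted_or:.3f}")
```

The difference matches the reported 7.35pp, and the unadjusted odds ratio lands close to the reported OR of 2.96.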
RESOURCE2SKILL: Distilling Executable Agent Skills from Human-Created Multimodal Resources
RESOURCE2SKILL introduces a framework for distilling executable agent skills from multimodal human resources, including tutorial videos, repositories, articles, and reference artifacts. The method organizes skills hierarchically in a multimodal Skill Wiki, preserving complementary signals from diverse sources: temporal operations from videos, executable patterns from code, and conceptual grounding from articles. At inference, agents retrieve and compose skills, with online acquisition addressing coverage gaps. Evaluated across seven authoring domains, RESOURCE2SKILL improves average overall score by +11.9 percentage points over no-skill agents and outperforms baselines in 26 of 28 model-domain cells. Ablations highlight the importance of multimodal format, hierarchical organization, source diversity, selection strategy, and online acquisition.
multimodal resourcesexecutable skillsskill wikihierarchical organizationonline acquisition
OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks
OSWorld 2.0 introduces a benchmark for evaluating computer-use agents on 108 long-horizon real-world workflows, addressing limitations of prior benchmarks by capturing complex phenomena like streaming interaction, dynamic environments, and cross-source reasoning. Tasks require median 1.6 human-hours and average 318 tool calls (vs. 30 in OSWorld 1.0), grounded in authentic artifacts and user profiles. Under a binary-completion metric at 500 steps, Claude Opus 4.8 achieves only 20.6% task completion (54.8% partial), while GPT-5.5 plateaus at 13%, revealing agents' struggles with hidden state recovery and mid-task information integration.
long-horizon workflowsstreaming interactioncross-source reasoningimplicit-state inferencevisual-spatial precision
SemJoin: Semantic Join Optimization
SemJoin introduces an LLM-agent-based decision pipeline for optimizing semantic joins in relational databases, dynamically selecting execution strategies based on table characteristics. The system employs an LLM advisor to route joins to either a Cluster Join strategy, which uses unsupervised embedding clustering and sample-based filtering, or a Classifier strategy for predicates reducible to discrete label sets. Evaluated on IMDb reviews, email contradictions, and Stack Overflow tags, SemJoin outperforms adaptive block join by 20-33 F1 points across datasets and achieves higher F1 scores than featurized-decomposition join at 1-2 orders of magnitude lower token cost.
semantic joinllm-agentembedding clusteringtoken costdynamic routing
MotionAtlas: Detailed Region Captioning for Motion-Centric Videos
MotionAtlas introduces a system for region-aware motion captioning in videos, addressing visual clutter and motion entanglement through precise spatiotemporal mask-based descriptions. The framework comprises MotionAtlas-Bench, a human-annotated benchmark with 2,073 multiple-choice questions for fine-grained motion understanding, a scalable data pipeline producing 159k high-quality motion captioning samples via self-bootstrap refinement, and a tailored training strategy enhancing Video-MLLMs like Molmo2 and Qwen3-VL. MotionAtlas-4B outperforms Qwen3-VL-4B by 5.2 percentage points on general motion benchmarks. The benchmark, dataset, and code are publicly available.
region-aware motion captioningspatiotemporal maskself-bootstrap refinementvideo-mllmsfine-grained motion understanding
SAKE: Software Architectural Knowledge Evaluation Benchmark for Large Language Models
The paper introduces SAKE, a benchmark for evaluating large language models' (LLMs) software architectural knowledge, addressing a gap in existing benchmarks that focus on syntactic or algorithmic tasks. SAKE comprises 2154 expert-curated multiple-choice questions across eight architectural categories and four context-length levels, tested on 11 LLMs in zero-shot and five-shot settings. Results show high overall accuracy but significant variation across categories, revealing gaps in areas critical to professional practice. The benchmark, evaluation scripts, and results are open-sourced.
software architecturelarge language modelsbenchmarkzero-shot learningfive-shot learning
The Verbose Context Problem in Medical Records
The paper introduces PopMedQA, a benchmark addressing the verbose context problem in medical records, where structured concepts have token-inefficient textual representations. The benchmark uses neopatient, a library for generating artificial patient records, to evaluate computational tasks on longitudinal records exceeding 400K tokens. Ablations on prompting strategies, prompt compression, and agentic decomposition reveal that domain-independent methods fail to mitigate the issue, highlighting the need for domain-specific input structuring in language models for population-scale reasoning.
verbose context problempopmedqaneopatientlongitudinal recordsprompt compression
UCOB: Learning to Utilize and Evolve Agentic Skills via Credit-Aware On-Policy Bidirectional Self-Distillation
The paper proposes UCOB, a framework for improving agentic reinforcement learning through credit-aware bidirectional self-distillation of skill memories. The method treats skill-conditioned and no-skill prompts as on-policy context views, using the higher-return view as a local teacher to guide skill utilization, correction, and memory updates. Evaluations on ALFWorld, WebShop, and Search-QA demonstrate performance gains of up to 23.5 and 18.0 points over state-of-the-art baselines, with ablations confirming the efficacy of core mechanisms.
skill memoriesself-distillationagentic reinforcement learningcredit-aware learningon-policy training
Cognitive World Models for Process-Level Social Influence Evaluation
The paper introduces the Cognitive World Model (CogWM), an LLM-based user model for evaluating process-level social influence in multi-turn dialogues. CogWM jointly predicts BDI/E cognitive states (beliefs, desires, intentions, emotions) and user utterances, functioning as both a user simulator and evaluation platform. It employs a three-tier framework assessing turn-level fidelity, trajectory-level state dynamics, and task-level composite scoring. Trained on 150,454 user-turn samples via Summarize-and-Allocate (SaA) annotation, CogWM achieves 77.6% emotion accuracy (2.1× GPT-5.5) and distinguishes six commercial agents in 3,600 trials, with Llama-4-Scout ranking highest (CTS +0.233).
cognitive world modelbdi/e statessummarize-and-allocatemulti-turn dialoguesocial influence
Faults in Our Formal Benchmarking: Dataset Defects and Evaluation Failures in Lean Theorem Proving
The paper identifies and categorizes defects in Lean theorem-proving benchmarks, demonstrating that formal verification alone does not guarantee semantic correctness. Through corpus-scale static analysis of five widely used Lean benchmarks, the authors uncover 4,833 issues including 398 mechanically certified defects like vacuous theorems and unsound axioms. They propose a fault taxonomy, automated checkers, and audit prompts to improve dataset quality, showing on corrected subsets that benchmark defects can significantly distort prover performance evaluations.
lean theorem provingformal verificationbenchmark defectsstatic analysissemantic correctness
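A vacuous theorem of the kind described above can be illustrated with a toy Lean 4 statement. This is a hypothetical example, not one taken from the audited benchmarks: the hypothesis can never hold, so the kernel accepts a proof of an arbitrary conclusion.

```lean
-- Hypothetical example: `h : n < 0` is unsatisfiable for natural numbers,
-- so the conclusion `n = 42` is proved without saying anything about `n`.
theorem vacuous_example (n : Nat) (h : n < 0) : n = 42 :=
  absurd h (Nat.not_lt_zero n)
```

The statement type-checks, yet a prover that closes it has demonstrated nothing about the intended mathematics, which is why formal verification alone does not certify that a benchmark item is meaningful.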
Reported Confidence in LLMs Tracks Commitment More Than Correctness
The study demonstrates that verbal confidence reports in large language models (LLMs) primarily reflect commitment readiness rather than answer correctness, challenging their use as reliability proxies. Using a two-stage abstention paradigm across four non-reasoning models and multiple prompt framings, verbal confidence predicted commit/abstain decisions better than correctness, while token log-probabilities showed the opposite pattern. Mechanistic analyses in Gemma 3 and 4 revealed that post-answer activations encoded abstention decisions orthogonally to correctness, with steering along confidence-specific directions causally altering abstention behavior.
verbal confidencelog-probabilitiesabstention paradigmcommit-readinesscorrectness discrimination
To Reason or to Fabricate: Reasoning Without Shortcuts via Hint-Anchored Pairwise Aggregation
HIPPO introduces a reinforcement learning framework addressing shortcut exploitation in LLM reasoning caused by Pre-RL data overlap. The method integrates hint-injected aggregation and a pairwise reward model, leveraging hint injection to expose overlap-induced behaviors and generate discriminable preference signals. This enables a lightweight judge model to reliably distinguish genuine reasoning from shortcut-driven rationalization while ensuring stable optimization. Experiments demonstrate HIPPO's substantial improvements over baselines and effective generalization to out-of-distribution tasks, confirming its ability to extract authentic, transferable reasoning skills.
reinforcement learningpre-rl data overlaphint-injected aggregationpairwise reward modelshortcut exploitation
CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning
CRAFT introduces a three-pillar credit-assignment scheme for self-distilled agentic reinforcement learning, addressing limitations in retrospective and sign-blind token-level distillation loss. Pillar 1 leverages sibling rollouts to estimate counterfactual advantage changes, Pillar 2 employs an asymmetric controller for distillation weight adjustment, and Pillar 3 polarises the KL penalty based on credit signs. The method ensures bit-exact reproducibility and proves estimator consistency and variance bounds. Evaluated across three environments, four model scales, and five methods, CRAFT demonstrates significant improvements, isolating counterfactual contributions effectively.
self-distilled reinforcement learningcounterfactual advantageasymmetric controllerkl penaltybit-exact reproducibility
A Posteriori Error Analysis for Decoupled Neural Approximations of Fully Coupled FBSDEs with Control Mismatch
The paper develops an a posteriori error analysis framework for neural approximations of fully coupled forward-backward stochastic differential equations (FBSDEs) with decoupled controls. It introduces an auxiliary control process in the forward coefficients, distinct from the backward component approximated by the neural network, and analyzes the resulting control mismatch. The method derives computable error bounds depending on terminal defect, pathwise residual, and control mismatch, validated through numerical experiments on linear-quadratic and Burgers-type FBSDEs.
a posteriori error analysisforward-backward sdesneural approximationscontrol mismatchdecoupled controls
Agent-Computer Observation Interfaces Enable Dynamic Computer Use
The paper introduces the Agent-Computer Observation Interface (AOI), a model-agnostic perception layer that decouples continuous observation from discrete actions in computer-use (CU) agents. AOI comprises three gated components: inter-step keyframe capture, volume-gated audio transcription, and CU-model-generated visual narration that persists as text. Evaluated on DynaCU-Bench (150 tasks), CU models from 7B to frontier scale achieve +17 to +48 percentage point improvements over screenshot baselines without retraining, with AOI agents solving all tasks involving spoken content. Analysis reveals that keyframe selection is less critical than narrating frames into persistent text, and optimal component configurations vary across models like Gemini 3 Flash due to image-token dilution effects.
agent-computer observation interfacedynacu-benchvisual narrationimage-token dilutionvolume-gated audio transcription
One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining
This work challenges the assumption that one-step gradient delay inherently causes instability in asynchronous pipeline parallelism for LLM pretraining, demonstrating that optimizer choice is the critical factor. Through empirical analysis, the authors show that while AdamW suffers severe degradation under PipeDream-2BW's one-step delay, newer optimizers like Muon remain robust. They propose an optimizer-agnostic Error Feedback-inspired correction and provide theoretical convergence guarantees for Muon with and without this modification. Experiments on models up to 10B parameters confirm that their approach bridges the performance gap with synchronous training, enabling practical large-scale asynchronous pipeline parallelism.
asynchronous pipeline parallelismgradient delayllm pretrainingmuon optimizererror feedback
Wireless Backdoor Attack and Defense for Semantic Communications over Multiple Access Channel
The paper introduces a selective over-the-air backdoor attack targeting semantic communication (SemCom) systems over multiple access channels, where an adversary injects low-power trigger waveforms to manipulate semantic inference for one transmitter while minimally affecting others. A trigger-aware defense mechanism is proposed to mitigate this vulnerability through robust training. Experimental results demonstrate both the attack's effectiveness in selectively compromising SemCom systems and the defense's success in preserving correct semantic labels under trigger-contaminated conditions.
semantic communicationbackdoor attackmultiple access channelwireless securityrobust training
A Hybrid Framework For Crypto-Ransomware Detection In Enterprise Shared Storage
The paper proposes a hybrid framework for detecting crypto-ransomware in enterprise shared storage environments, combining signature-based Indicators of Compromise (IoCs) with machine learning. The method introduces Region of Interest (RoI) analysis for network traffic feature extraction, enhancing existing security tools like EDRs and IDSs. Evaluated across multiple ransomware families, the ML module achieves 99.64% precision, 0% FNR, and minimal FPR, with 99.44% accuracy in early intrusion detection before significant damage occurs.
crypto-ransomwareindicators of compromiseregion of interestenterprise shared storageearly detection
Uncertainty-Aware Generation and Decision-Making Under Ambiguity
The paper introduces uncertainty-aware decision-making algorithms for LLMs, leveraging Bayesian decision theory and risk-averse strategies in tutoring and peer-review tasks. Methods include conformal prediction for strategy and score guarantees, with empirical evaluation showing Bayesian approaches outperform risk-averse rules when ambiguity is high. Results indicate improved generation utility but highlight trade-offs in optimizing for generic outputs under high ambiguity.
large language modelsbayesian decision theoryconformal predictionrisk-averse decision makinguncertainty-aware generation
The Fundamental Limits of Valid Transport Map Estimation
The authors formalize the estimation of valid transport maps within a minimax framework, establishing sample complexity lower bounds for transport-based generative methods like flow matching and diffusion models. By leveraging stability assumptions from optimal transport (OT) theory, they demonstrate that estimating any valid transport map is statistically equivalent to estimating the OT map. However, when these assumptions fail, alternative transport maps can be learned more accurately than the OT map. This analysis provides a rigorous foundation for understanding the statistical limits of modern transport-based generative modeling techniques.
optimal transportminimax frameworksample complexitytransport mapsgenerative modeling
SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions
SWE-Interact introduces a novel benchmark for evaluating coding agents in multi-turn, user-driven software engineering workflows, contrasting with traditional single-turn SWE benchmarks. The method employs a user simulator that progressively reveals requirements and provides feedback, testing agents' ability to discover intent and adapt to evolving constraints. Results show performance drops from 50% (single-turn) to 25% (multi-turn), with top models like Opus 4.8 and GPT-5.5 demonstrating better requirement integration but still suffering from over-agentic coding and technical errors.
swe benchmarkscoding agentsmulti-turn interactionuser simulatoriterative refinement
Attractor States Emerge in Multi-Turn LLM Conversations
The study identifies model-specific attractor states in multi-turn LLM conversations, demonstrating their influence on stylistic and behavioral patterns. Using 7 LLMs across 20 controversial topics, the authors analyze self-play and mixed-play dyadic debates through representation space trajectories, discourse traits, and stance tracking. Results reveal asymmetric attractor effects, with Claude Haiku strongly influencing other models' latent space positions and GPT-4.1 nano showing high malleability, suggesting predictable dynamics in open-ended multi-agent interactions.
attractor statesmulti-agent interactionlatent spaceself-playdiscourse traits
Forensic Trajectory Signatures for Agent Memory Poisoning Detection
The study identifies a behavioral invariant in LLM agents under memory poisoning attacks, demonstrating that successful attacks require a specific sequence of memory_recall_fact before email_send_email, which non-exfiltrating sessions rarely exhibit. Using a rule-based approach exploiting this invariant achieves AUC = 0.9563, while a Random Forest classifier over 19 trajectory features improves detection to AUC = 0.9904. Cross-model validation on 9 models (7B-120B parameters) confirms AUC = 1.000 on 6/9 hold-out splits, with generalization to frontier models like GPT-4.1 and GPT-4o without retraining. The method enables real-time blocking with AUC = 0.934 and distinguishes memory-channel attacks from prompt-injection attacks (score = 0.541) using tool-call logs.
memory poisoningbehavioral invariantaucrandom foresttool-call logs
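The ordering invariant described above can be sketched as a simple scan over a session's tool-call log. This is a minimal illustration, assuming the log is a list of tool names in call order; the paper's actual rule and scoring may be stricter, and every tool name other than memory_recall_fact and email_send_email is hypothetical.

```python
from typing import Sequence

RECALL = "memory_recall_fact"
SEND = "email_send_email"


def recall_before_send(tool_calls: Sequence[str]) -> bool:
    """Return True if a RECALL call appears earlier in the log than some SEND call."""
    seen_recall = False
    for name in tool_calls:
        if name == RECALL:
            seen_recall = True
        elif name == SEND and seen_recall:
            return True
    return False


# Hypothetical sessions: only the second one recalls a fact and then sends email.
benign = ["calendar_lookup", "email_send_email", "memory_recall_fact"]
suspicious = ["memory_recall_fact", "web_search", "email_send_email"]

print(recall_before_send(benign), recall_before_send(suspicious))
```

A binary flag like this is the simplest form of the rule; the Random Forest variant in the summary would use it as one of several trajectory features rather than as the sole signal.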
Convergence of Continual Learning in Homogeneous Deep Networks
The paper characterizes continual classification in weakly regularized homogeneous models as sequential projections onto task margin sets, generalizing prior analyses limited to stationary deep models or continual linear models. The authors demonstrate that global convergence typically fails, even for simple models linear in data but nonlinear in parameters. Using nonconvex projection theory, they identify regularity properties in homogeneous deep networks that ensure local linear convergence under random and cyclic task sequences, extending the analysis to continual regression for unified treatment of homogeneous models.
continual learninghomogeneous modelstask margin setsnonconvex projectionlocal linear convergence
Bridging the NISQ and Fault-Tolerant Regimes: Generative-ML-Assisted Quantum Selected CI for Molecular Simulations
(No summary returned.)
$μ$Flow: Leveraging Average Images for Improving Generalisation of Deepfake Faces Detectors
The paper introduces $μ$Flow, a one-class deepfake detector trained exclusively on real images to improve generalization across unseen generators (GANs vs diffusion models). The method exploits averaged images to amplify generative traces, modeling their feature distribution with normalizing flows and aligning individual images to this distribution via likelihood-based separation. Evaluated in a fully out-of-distribution setting, $μ$Flow outperforms state-of-the-art detectors without relying on synthetic artifacts or pseudo-deepfakes.
deepfake detectionone-class learningnormalizing flowfeature distributiongenerative traces
ITSPACE: Monotone Gaussian Optimal Transport Updates
The paper introduces ITSPACE, a proximal majorization-minimization method for optimizing the exact Bures-Wasserstein (BW) objective on symmetric positive definite matrices. The method employs closed-form updates in a square-root factorization, ensuring PSD structure preservation and supporting rank-restricted factors. Theoretical guarantees include a sufficient-decrease inequality in exact arithmetic and a certificate-gap bound for inexact polar computations. Empirical results show ITSPACE converges faster than BW-gradient descent, alternative covariance-geometry methods, and entropic OT baselines on real-world covariance-alignment tasks.
bures-wassersteinoptimal transportcovariance alignmentproximal optimizationspd cone
Staged Hybridisation for Visual Quantum Reinforcement Learning via Knowledge Distillation
The paper proposes staged knowledge distillation (KD) as a hybrid strategy for visual quantum reinforcement learning (QRL), addressing challenges in high-dimensional observations and unstable training. By first training a classical visual teacher, freezing its encoder, and distilling policy behavior into compact classical or variational quantum circuit (VQC)-based heads, the method enables quantum-compatible students to learn efficiently. Evaluated on CartPole Pixels and Acrobot Pixels, angle-encoded VQC heads achieve near-teacher performance, while amplitude-encoded heads trade compactness for fragility and simulation time. The approach reframes visual QRL as a compact-head learning problem.
quantum reinforcement learningknowledge distillationvariational quantum circuitsvisual controlcompact-head learning
Muon learns balanced solutions in matrix factorization without slow saddle-to-saddle dynamics
The paper analyzes the Muon optimizer on matrix factorization problems, showing that its dynamics differ from those of gradient descent. Muon avoids slow saddle-to-saddle dynamics by learning all top modes of the target matrix simultaneously, with smaller modes converging first. It remains stable at learning rates exceeding the critical threshold set by local loss sharpness, enabling rapid convergence via exponential annealing. Muon conserves the matrix quantity √(PᵀP) - √(QᵀQ), whereas gradient flow conserves PᵀP - QᵀQ, yet both find balanced solutions. Theoretical alignment rates are derived and empirically validated, with a proposed two-step schedule achieving near-perfect alignment.
matrix factorizationoptimizer dynamicslearning rate annealingbalanced solutionsalignment rates
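For context, the gradient-flow conservation law mentioned above follows in two lines. The sketch assumes the factorization $W = PQ^\top$ and a differentiable loss $L(W)$ with gradient $G = \nabla_W L$; the paper's own parametrization may differ.

```latex
\dot P = -G Q, \qquad \dot Q = -G^\top P
\;\Longrightarrow\;
\frac{d}{dt}\left(P^\top P\right)
  = -Q^\top G^\top P - P^\top G Q
  = \frac{d}{dt}\left(Q^\top Q\right),
\quad\text{so}\quad
\frac{d}{dt}\left(P^\top P - Q^\top Q\right) = 0 .
```

Both time derivatives reduce to the same symmetric matrix, so their difference is constant along the flow regardless of the loss.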
Doubly Robust Adaptive Conformal Inference for Causal Effects Under Temporal Dependence
The authors propose doubly robust adaptive conformal inference (DR-ACI), a method for constructing prediction intervals for doubly robust pseudo-outcomes in temporally dependent data. DR-ACI combines doubly robust estimation with adaptive conformal inference to achieve valid coverage guarantees under distribution shifts. The approach handles time-series dependencies without requiring strict stationarity assumptions. Theoretical results demonstrate marginal coverage guarantees, while empirical evaluations on synthetic and real-world datasets show improved interval width compared to non-adaptive baselines.
conformal inferencedoubly robust estimationtemporal dependenceprediction intervalsdistribution shift
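The adaptive part of the method can be illustrated with the standard adaptive conformal inference update of Gibbs and Candès, in which the working miscoverage level is nudged after every observation. This is a generic sketch, not the DR-ACI procedure itself: the scores here are synthetic absolute residuals rather than doubly robust pseudo-outcomes, and the function name and parameters are assumptions.

```python
import numpy as np


def aci_coverage(scores, alpha=0.1, gamma=0.05, warmup=50):
    """Run the ACI update on a stream of conformity scores; return empirical coverage."""
    alpha_t = alpha
    errs = []
    for t in range(warmup, len(scores)):
        past = scores[:t]
        if alpha_t <= 0:        # working level out of range: interval covers everything
            q = np.inf
        elif alpha_t >= 1:      # working level out of range: interval is empty
            q = -np.inf
        else:
            q = np.quantile(past, 1 - alpha_t)
        err = float(scores[t] > q)          # 1 if the new point falls outside the interval
        errs.append(err)
        alpha_t += gamma * (alpha - err)    # ACI update toward the target miscoverage
    return 1.0 - float(np.mean(errs))


# Synthetic non-stationary stream: residual scale drifts upward over time.
rng = np.random.default_rng(0)
scale = np.linspace(1.0, 3.0, 2000)
scores = np.abs(rng.normal(size=2000)) * scale
coverage = aci_coverage(scores)
print(f"empirical coverage = {coverage:.3f}")
```

Because the update is a feedback loop on observed errors, long-run coverage stays near the 90% target even though the score distribution drifts, which is the property the summary refers to as validity under distribution shift.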
Discovering Collaboration from Novelty: Random Network Distillation for Clustered Federated Learning
The paper proposes a lightweight clustering method for Clustered Federated Learning using Random Network Distillation (RND) to address non-IID data challenges. Clients train compact RND predictors locally, using prediction errors as novelty signals to estimate similarity and form clusters before federated training. This approach decouples clustering from model training, reducing computational and communication costs while enabling autonomous federation without predefined cluster counts or structures. The method is task-agnostic and suitable for large-scale distributed systems.
clustered federated learningrandom network distillationnon-iid datanovelty signalautonomous collaboration
GPU Parallelization Strategies for Forward and Backward Propagation in Shallow Neural Networks: A CUDA-Based Comparative Study
The study evaluates CUDA optimization strategies for forward and backward propagation in shallow neural networks, comparing three techniques: tiled shared memory with bank-conflict elimination, pre-transposed weight matrices for coalesced access, and a fused MatMul+ReLU kernel. Implemented on an NVIDIA Tesla T4 (CUDA 13.0), the fully optimized version achieves a 1.41x speedup over the baseline CUDA implementation on large datasets (25,600 samples), reducing execution time from 21.0s to 14.8s. Results demonstrate significant performance gains from memory-access optimizations in GPU-accelerated deep learning primitives.
cudagpu optimizationneural networksmemory coalescingkernel fusion
Factorizable Normalizing Flows for parameter-dependent density morphing
Factorizable Normalizing Flows (FNFs) are introduced to model parameter-dependent density deformations efficiently, addressing the intractability of learning separate flows for each parameter configuration. FNFs decompose the problem into a fixed high-fidelity flow for a reference configuration and a learnable transformation polynomial in parameters, factorized over them. This allows learning each parameter's effect in isolation and combining them via summation at inference, avoiding combinatorial sampling. On a controlled problem with two deformations, FNFs reproduce true deformations, match optimal likelihood, and capture residual correlations with optional interaction terms. The method scales linearly with parameters, maintains tractable likelihood, and enables unbinned likelihood fits in high energy physics.
normalizing flowsparameter-dependent densityfactorizable transformationsunbinned likelihoodhigh energy physics
Non-parametric recovery of causal diffusion mechanisms from steady-state observations
The paper presents a non-parametric method to recover the causal drift mechanism of continuous-time diffusion processes from cross-sectional steady-state observations. Assuming a known acyclic causal graph and time-homogeneous diffusion dynamics, the authors prove identifiability under a non-explosion condition and derive a consistent kernel estimator. Theoretical analysis includes consistency guarantees and a cross-validation scheme for hyperparameter tuning, with empirical validation through simulations. Connections to irreversible generative diffusion models and low-frequency sampling are discussed.
causal inferencediffusion processesnon-parametric estimationsteady-state analysiskernel methods
MuonSSM: Orthogonalizing State Space Models for Sequence Modeling
MuonSSM introduces a framework for stabilizing state space models (SSMs) by conditioning memory update geometry rather than recurrent transitions. The method combines momentum-based pathways with Newton Schulz transformations on low-rank inputs, maintaining parallel scan complexity while bounding updates. Theoretical analysis shows improved gradient propagation and spectral conditioning. Experiments across language, vision, and time-series tasks demonstrate accuracy and robustness gains in long-context settings when integrated into SSM variants.
state space modelssequence modelingspectral conditioningmomentum pathwaynewton schulz transformation
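The Newton Schulz step mentioned above can be illustrated with the classic cubic iteration, which drives a matrix toward its nearest orthogonal factor using only matrix multiplications. This is a generic sketch assuming a full-rank square input; MuonSSM may use different polynomial coefficients, low-rank inputs, and a different normalization.

```python
import numpy as np


def newton_schulz_orthogonalize(a: np.ndarray, n_iter: int = 30) -> np.ndarray:
    """Approximate the orthogonal polar factor of a full-rank matrix."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    x = a / np.linalg.norm(a)
    for _ in range(n_iter):
        x = 1.5 * x - 0.5 * x @ x.T @ x   # pushes each singular value toward 1
    return x


A = np.array([[2.0, 1.0],
              [0.0, 1.0]])
Q = newton_schulz_orthogonalize(A)
print(np.round(Q.T @ Q, 6))
```

Since every singular value of the result is close to 1, the update applied to memory has bounded norm, which is the sense in which such a transformation conditions the update geometry.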
HSAP: A Hierarchical Sequence-aware Parallelism for Hybrid-Context Generative Models
The paper proposes Hierarchical Sequence-Aware Parallelism (HSAP), a novel framework combining existing sequence parallelism paradigms while addressing their limitations in handling hybrid-context packed sequences. HSAP introduces a Sequence-Aware Parallelism algorithm that optimizes tensor transmission and partial attention computation across device groups using JIT compilation for NCCL-level communication. The hierarchical framework manages memory and communication overhead effectively. Experimental results demonstrate HSAP's superiority over state-of-the-art sequence parallelism approaches across multiple metrics.
sequence parallelismhybrid-contextjit compilationattention computationnccl
Curvature-Weighted Gradient Diversity: A Noise Measure for Geometry-Adaptive SGD Schedules
The paper introduces Curvature-Weighted Gradient Diversity (CWGD), a geometry-aware noise measure for SGD that weights per-sample gradient diversity by the inverse square root of the Hessian. The method employs a Hutchinson-based diagonal Hessian estimator and a CWGD-modulated cosine learning-rate schedule, proving a 2x reduction in asymptotic optimization error for strongly convex quadratic objectives with diagonal Hessians. Experiments show CWGD-Cosine achieves ~20% lower final error than standard cosine annealing across various condition numbers, batch sizes, and noise structures, with negligible overhead in quadratic settings. Limitations include Hessian staleness in non-convex optimization.
curvature-weighted gradient diversitygeometry-aware optimizationhutchinson estimatorcosine annealinghessian staleness
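The Hutchinson-based diagonal Hessian estimator mentioned above can be sketched with Rademacher probes and Hessian-vector products. This is a minimal illustration on a fixed quadratic, where the Hessian is known exactly; the matrix, sample count, and function names are assumptions, and the paper's CWGD schedule is not reproduced here.

```python
import numpy as np


def hutchinson_diag(hvp, dim, n_samples=20000, seed=0):
    """Estimate diag(H) as the average of z * (H z) over Rademacher vectors z."""
    rng = np.random.default_rng(seed)
    est = np.zeros(dim)
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=dim)  # random +/-1 probe vector
        est += z * hvp(z)                      # elementwise product isolates the diagonal in expectation
    return est / n_samples


# Hessian of the quadratic f(w) = 0.5 * w^T H w (illustrative values).
H = np.array([[4.0, 0.5, 0.0],
              [0.5, 2.0, 0.5],
              [0.0, 0.5, 1.0]])
diag_est = hutchinson_diag(lambda v: H @ v, dim=3)
print(np.round(diag_est, 2), np.diag(H))
```

Only Hessian-vector products are needed, so the full Hessian is never formed; for an exactly diagonal Hessian, as in the paper's theoretical setting, a single probe already recovers the diagonal.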
Exploring Differences Between Tabular Enterprise Data and Public Benchmarks
This work identifies key differences between enterprise tabular data and public benchmarks, highlighting the need for domain-specific evaluation. The authors analyze data statistics and measure performance of tabular models (TabPFN, TabICL, ConTextTab) on enterprise datasets. Results demonstrate poor generalization between public benchmarks and enterprise data, with models excelling on one domain often underperforming on the other. The findings underscore the necessity for additional benchmarks that capture enterprise-grade characteristics to advance tabular machine learning in business applications.
tabular dataenterprise databenchmarkinggeneralizationtabpfn
Internal-State Probes Read the Situation, Not the Action: Three Negative Results for Pre-Action Misalignment Monitoring
The study evaluates three methods for pre-action misalignment monitoring using internal-state probes in agentic systems, finding negative results across all approaches. Methods tested include fine-tune/base direction separation in Qwen2.5-Coder-32B-Instruct, last-token probes in Llama-3.1-8B-Instruct, and emotion-concept vectors in Gemma-3-27B-IT. Results show that while probes achieve high AUC scores (up to 1.000 for Qwen), they fail to generalize as robust pre-action monitors, with specificity and transferability limitations across domains and scenarios. The work provides a methodology for testing internal-readout claims against generalization and specificity controls.
internal-state probespre-action monitoringmisalignment detectiongeneralization checksspecificity controls
When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon
The work challenges the prevailing view that error accumulation primarily explains online imitation learning's (IL) advantages in LLM post-training, demonstrating instead that realizability—whether the student policy class can represent the expert policy—is key. Through empirical and theoretical analysis, the authors show offline IL matches expert performance under realizability, while non-realizable settings introduce an information-theoretic bottleneck even for horizon $H=1$. They propose a structural characterization of misspecification relative to rewards, proving online IL achieves high performance despite distributional mismatch.
online imitation learningrealizabilitymisspecificationpolicy distillationdistributional mismatch
SGD Provably Prioritizes a Shortcut Spurious Feature in the XOR Model
The work provides the first theoretical characterization of spurious feature learning in two-layer ReLU networks trained via online minibatch SGD on logistic loss, using high-dimensional Boolean hypercube data with XOR signal and linear spurious correlation. Analysis reveals SGD learns the spurious feature exponentially fast, with dynamics coupling spurious and signal features such that stronger spurious components inhibit signal learning. Phase transitions show initial rapid spurious feature growth driven by sign alignment, followed by suppressed signal learning due to large majority group margin. Theoretical results demonstrate spurious feature dominance even at XOR sample complexity thresholds when correlation is maximal.
spurious correlationxor modelrelu networkssgd dynamicsfeature learning
CAN We Trust Your Results? A Cross-Dataset Study of Automotive IDS Evaluation
This study introduces a standardized benchmarking framework to evaluate CAN bus Intrusion Detection Systems (IDS) across seven diverse datasets, addressing inconsistencies in prior evaluations. The framework enables cross-dataset comparison of five distinct IDS methodologies, revealing significant performance variations dependent on dataset characteristics. Results demonstrate that current IDS evaluations lack generalizability, emphasizing the need for robust cross-dataset validation to assess true detection capabilities in varying automotive network environments.
can bus · intrusion detection systems · cross-dataset evaluation · benchmarking framework · automotive security
Arko-T: A Foundation Model for Text-to-Structured 3D Generation
Arko-T introduces a 4B-parameter foundation model for text-to-structured 3D generation, mapping natural-language intent directly into executable, parametric CAD programs. Unlike existing text-to-3D systems that produce renderable shapes, Arko-T ensures CAD artifacts remain editable by aligning pipeline stages—data curation, code normalization, and execution-grounded supervision—to a formal notion of design state. Evaluated against seven frontier LLMs across 12 metrics, Arko-T achieves the best score on 8 metrics and the second-best on 3, at approximately one-tenth the per-benchmark cost. Results demonstrate that targeted design-level training at moderate scale can rival general-purpose models in structured CAD generation.
text-to-3d · parametric cad · design state · execution-grounded supervision · code normalization
Proofs of Ownership for Machine Learning Models
The paper introduces a formal framework for Proof of Ownership (PoW) in machine learning models, addressing the challenge of verifying model ownership in cases of theft. The authors model PoW as a three-party game involving a model owner, a thief, and a judge, where the owner generates a perturbed model and a proof, the thief modifies it to evade detection, and the judge determines ownership. Under standard cryptographic assumptions, the authors establish a dichotomy for classifiers in the black-box setting: ownership can be proven if and only if the concept class is not self-correctable, extending results from Blum et al. (STOC'90).
proof of ownership · machine learning models · black-box setting · self-correctable · cryptographic assumptions
Experience Augmented Policy Optimization for LLM Reasoning
The paper proposes Experience-Augmented Policy Optimization (EAPO), a method to enhance large language model reasoning by reusing experience adaptively in reinforcement learning with verifiable rewards (RLVR). EAPO employs a prior RL-optimized policy as an action-level experience prior, selectively injecting experience at critical decision points during rollout, and uses an adapted importance sampling scheme for stable learning. Evaluations on Qwen2.5-Math-7B and Qwen3-8B across five benchmarks show EAPO outperforms state-of-the-art RLVR methods in reasoning performance.
reinforcement learning · large language models · policy optimization · importance sampling · reasoning benchmarks
Diffusion Fine-tuning with Rewarded Moment Matching Distillation
We introduce Rewarded Moment Matching Distillation (RMMD), a framework combining diffusion model distillation with reward maximization. RMMD adapts the sampling loop for on-policy training and repurposes the distillation loss as KL regularization, preserving high-fidelity generation. Evaluations on ImageNet show RMMD achieves superior FID-Reward Pareto fronts compared to DI++ and DRaFT. Applied to GenCast, RMMD achieves a 7.5x speedup while outperforming the teacher model on 93% of weather variables and improving calibration, demonstrating scalability to high-dimensional scientific domains.
diffusion models · distillation · reward maximization · kl regularization · on-policy training
MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training
We introduce Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm for integrating multiple capabilities into large language models (LLMs). MOPD first trains domain-specific reinforcement learning (RL) teachers, then distills them into a student model using its own rollouts, eliminating exposure bias and providing dense optimization signals. On Qwen3-30B-A3B, MOPD outperforms Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines, inheriting nearly all capabilities from each teacher. MOPD enables parallel development of domain teachers, removing cross-domain coupling in multi-domain post-training. It has been deployed in MiMo-V2-Flash, demonstrating practical value for capability integration in frontier-scale LLMs.
multi-teacher distillation · on-policy learning · capability integration · reinforcement learning · post-training
Predict, Reuse, and Repair: Accelerating Dynamic Sparse Attention for Long-Context LLM Decoding
PRR introduces a speculate-reuse-repair runtime to accelerate dynamic sparse attention (DSA) in long-context LLM decoding by predicting relevant KV blocks, speculating attention computations, and incrementally repairing missed blocks. The method employs an EMA-based predictor, a profiling-guided speculation budget, and a FlashAttention-based repair kernel with online-softmax statistics. Evaluations on long-context benchmarks show PRR reduces per-token decoding latency by up to 40% without compromising downstream task accuracy.
dynamic sparse attention · kv blocks · online-softmax · flashattention · speculative execution
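The online-softmax statistics mentioned in the PRR summary are what make an incremental repair step possible in principle: the partial attention result for a missed block can be merged into an existing result without recomputing the softmax over everything. The sketch below uses plain Python and scalar values for brevity. The function names are hypothetical, and the merge rule is the standard FlashAttention-style one, not PRR's actual kernel.

```python
import math

def attend_block(scores, values):
    """Return (running max, normalizer, weighted sum) for one KV block."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return m, sum(weights), sum(w * v for w, v in zip(weights, values))

def merge(state_a, state_b):
    """Combine two partial softmax states by rescaling to a shared max."""
    m_a, l_a, acc_a = state_a
    m_b, l_b, acc_b = state_b
    m = max(m_a, m_b)
    scale_a, scale_b = math.exp(m_a - m), math.exp(m_b - m)
    return m, l_a * scale_a + l_b * scale_b, acc_a * scale_a + acc_b * scale_b

def finalize(state):
    """Divide the weighted sum by the normalizer to get the attention output."""
    _, normalizer, acc = state
    return acc / normalizer

# Attend to the predicted block first, then repair with the missed block.
predicted = attend_block([1.0, 2.0], [10.0, 20.0])
missed = attend_block([3.0, 0.5], [30.0, 40.0])
repaired = finalize(merge(predicted, missed))
```

Because the merge is exact, the repaired output matches a single softmax over all four scores up to floating-point rounding.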
Scalar Representations of Neural Network Training Dynamics
The authors propose scalar embeddings of neural network training dynamics by treating optimization trajectories as temporal networks. They apply dimensionality reduction techniques to analyze training dynamics of a multilayer perceptron on MNIST, preserving key dynamical features including sensitivity to initial conditions and Lyapunov exponents. The method enables definition of a characteristic decorrelation time for training trajectories and reveals statistical organization of asymptotic states through spacing observables. Results show rescaled asymptotic spacings follow a skew lognormal distribution, demonstrating scalar embeddings effectively capture high-dimensional optimization dynamics.
scalar embedding · temporal networks · lyapunov exponent · training dynamics · dimensionality reduction
RenderFormer++: Scalable and Physically Grounded Feed-Forward Neural Rendering
RenderFormer++ introduces a scalable, physically grounded feed-forward neural rendering framework for global illumination in mesh scenes. The method combines Physics-Informed Transport Guidance (PITG) to embed rendering-equation inductive biases into attention mechanisms and Hierarchical Object-Centric Tokenization (HOCT) to aggregate triangle-level features into object-level tokens, reducing computational costs. Experiments show improved physical accuracy, efficiency, and scalability over prior methods like RenderFormer, enabling stable rendering across complex large-scale scenes.
neural rendering · global illumination · attention mechanism · transport consistency · object-centric tokenization
FlowAWR: Online Adaptive Flow Reinforcement via Advantage-Weighted Rectification
FlowAWR introduces a novel paradigm for continuous generative policy optimization by recasting it as supervised regression toward an optimal velocity field, eliminating the need for stochastic SDE samplers and Classifier-Free Guidance (CFG). The method derives a magnitude-aware, advantage-weighted rectification form from the optimal policy of a KL-constrained reward maximization. Evaluated on SD3.5-Medium, FlowAWR achieves superior alignment performance (24.12 PickScore) with 2× to 5× faster convergence than DiffusionNFT and FlowGRPO, while maintaining stable out-of-domain performance under multi-reward constraints.
generative flow models · advantage-weighted rectification · kl-constrained reward maximization · velocity field optimization · online reinforcement learning
On the Vulnerability of Parameter-Level Defenses to Model Merging
The paper exposes vulnerabilities in parameter-level defenses against unauthorized model merging, showing that protected task vectors are small-magnitude perturbations dominated by pretrained weights. The authors propose Anchor-Guided Attack (AGA), which exploits this dominance by aligning protected models with a static pretrained anchor to recover transformation matrices analytically. Experiments demonstrate AGA's effectiveness against individual and composite defenses, while Anchor-Repulsive Fine-tuning (ARF) is introduced as a countermeasure that reduces anchor dominance and mitigates AGA.
model merging · parameter-level defenses · anchor-guided attack · task vectors · pretrained weights
Learning the structure of open quantum systems
(No summary returned.)
OLIVE: View-Augmented Latent Prediction with Waveform Reconstruction for Speech SSL
The paper introduces OLIVE, a self-supervised learning framework for speech representation that jointly optimizes analysis (masked latent prediction) and synthesis (waveform reconstruction) objectives. View augmentation and invariant latent prediction enhance robustness, while reconstruction preserves signal-level information in early encoder features. OLIVE demonstrates improved performance on generation and speaker tasks, maintains competitiveness in recognition and semantic tasks, and achieves superior waveform reconstruction compared to baseline methods.
self-supervised learning · waveform reconstruction · masked latent prediction · view augmentation · speech representation
REAR: Test-time Preference Realignment through Reward Decomposition
We propose REAR, a test-time preference realignment framework for large language models that decomposes reward functions into question-related and preference-related components. By formulating REAlignment Reward (REAR) as a linear combination of token-level policy log-probabilities, our method enables computationally efficient integration with test-time scaling algorithms like best-of-N sampling and tree search. Experiments demonstrate that REAR outperforms test-time baselines in preference alignment tasks across diverse user requirements while maintaining generalization capabilities in mathematical and visual domains.
test-time scaling · reward decomposition · preference alignment · token-level policy · realignment reward
FlexTab: A Flexible Encoder-Decoder Architecture for In-Context Learning Across Diverse Tabular Tasks
FlexTab introduces a flexible encoder-decoder architecture for in-context learning on tabular data, featuring a task-agnostic encoder and task-specific decoders. The design produces target-agnostic row embeddings applicable to six tasks: classification, regression, anomaly detection, clustering, entity matching, and entity classification in relational databases. Trained on unlabeled tables, FlexTab achieves state-of-the-art performance on four tasks and remains competitive in relational entity classification, demonstrating its efficacy as a general-purpose backbone for diverse tabular prediction problems.
encoder-decoder · in-context learning · tabular data · target-agnostic · relational databases
Local-Minima-Preserving Continuous Relaxation of Ising Problems
The authors present a polynomial relaxation for the generalized Ising problem that preserves one-flip local minima, proving a landscape equivalence theorem guaranteeing a bijective correspondence between relaxation minima and original problem minima. This enables gradient-based optimization (e.g., Adam) for combinatorial problems like MAX-CUT and Number Partitioning. Empirical results demonstrate scalability and strong performance on spin-glass models and benchmark problems.
ising problem · local minima · polynomial relaxation · landscape equivalence · gradient-based optimization
Extrapolating from Regularised Solutions for Solving Ill-Conditioned Linear Systems in Machine Learning
The paper introduces autonugget, a Python package for stable numerical solution of ill-conditioned linear systems in machine learning prototyping. The method circumvents manual nugget selection in Tikhonov-regularised inversion by combining multiple linear solves via Richardson extrapolation, improving accuracy while maintaining compatibility with JAX automatic differentiation. Results demonstrate enhanced stability and computational efficiency compared to single-nugget approximations, enabling end-to-end differentiable training pipelines.
tikhonov regularization · richardson extrapolation · ill-conditioned systems · automatic differentiation · jax
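To illustrate the idea of combining several regularised solves, here is a minimal sketch of textbook two-point Richardson extrapolation on a toy 2x2 system. It does not use the autonugget package or its API, which the summary does not describe; all function names are hypothetical. It assumes the nugget is small enough that the regularised solution is approximately linear in the nugget, so that `2 * x(nugget / 2) - x(nugget)` cancels the first-order error term.

```python
def solve_2x2(a, b):
    """Solve a 2x2 linear system a @ x = b with Cramer's rule."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    return [(b[0] * a22 - a12 * b[1]) / det, (a11 * b[1] - a21 * b[0]) / det]

def tikhonov_solve(a, b, nugget):
    """Solve (a + nugget * I) x = b, i.e. a single-nugget regularised solve."""
    shifted = [[a[0][0] + nugget, a[0][1]], [a[1][0], a[1][1] + nugget]]
    return solve_2x2(shifted, b)

def richardson_solve(a, b, nugget):
    """Combine solves at nugget and nugget / 2 to cancel the O(nugget) error."""
    x_full = tikhonov_solve(a, b, nugget)
    x_half = tikhonov_solve(a, b, nugget / 2)
    return [2 * h - f for h, f in zip(x_half, x_full)]

a = [[2.0, 1.0], [1.0, 2.0]]
b = [3.0, 3.0]  # the exact solution of a @ x = b is [1, 1]
single = tikhonov_solve(a, b, 0.1)
extrapolated = richardson_solve(a, b, 0.1)
```

On this toy problem the extrapolated solution is much closer to `[1, 1]` than the single-nugget one, at the cost of one extra linear solve.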
Hybrid Active-Online Learning Framework for Label-Efficient Concept Drift Adaptation in Optical Network Failure Detection
The authors propose a hybrid active-online learning framework for label-efficient concept drift adaptation in optical network failure detection. The method employs margin-based selective labeling to minimize annotation costs while maintaining performance. Results demonstrate near-ceiling accuracy and AUC scores with only 3.4% of streaming samples queried, introducing negligible latency overhead compared to static inference.
active learning · concept drift · optical networks · selective labeling · failure detection
BrainJanus: A Unified Model for Understanding and Generation across Brain, Vision, and Language
BrainJanus introduces the first unified model integrating brain, vision, and language processing within a single framework. The model employs a Unified Brain Tokenizer to quantize neural dynamics into discrete tokens aligned with visual and linguistic representations in a shared Omni space, coupled with an All-in-One autoregressive architecture for any-to-any generation tasks, including image-to-brain, text-to-brain, brain-to-image, and brain-to-text decoding. Extensive experiments demonstrate superior performance across benchmarks, zero-shot generalization, and preservation of interpretable biological topography. The code is publicly available on GitHub.
unified brain tokenizer · omni space · autoregressive architecture · any-to-any generation · biological topography
Toward an Energy-Optimized Operation of Data Centers Located in Wind Farms Using Reinforcement Learning
The paper proposes a Reinforcement Learning (RL) framework for optimizing energy usage in wind-turbine-integrated HPC data centers, addressing workload shifting under wind curtailment constraints. Using Proximal Policy Optimization (PPO) and a modified Soft Actor-Critic (SAC) with on-policy updates, the study evaluates Imitation Learning and Reward Shaping to mitigate credit assignment issues. Results show improved performance with these techniques, though a gap remains compared to offline optimization with full foresight. The benchmark framework supports future extensions to multi-site and continuous-time scenarios.
reinforcement learning · wind-turbine-integrated · proximal policy optimization · soft actor-critic · credit-assignment problem
TRACE: A Concept Bottleneck Model for Longitudinal 3D Glioblastoma Response Assessment
TRACE introduces a concept bottleneck model for interpretable glioblastoma response classification on longitudinal 3D MRI, aligning with RANO 2.0 criteria. The model processes paired baseline and follow-up multimodal MRI scans using a shared 3D vision encoder, predicts clinically meaningful tumor measurements, and computes downstream RANO-derived concepts via deterministic rules. It achieves a 4-class macro F1 of 0.4769 and a binary progression-versus-non-progression macro F1 of 0.7085 on the LUMIERE dataset, outperforming a concept bottleneck baseline and remaining competitive with non-interpretable deep learning approaches. Ablation studies highlight the importance of the expert RANO graph and intervention-consistency training, while intervention experiments show that correcting concepts improves downstream predictions.
concept bottleneck model · longitudinal mri · rano criteria · glioblastoma classification · 3d vision encoder
Highly Data Parallelizable Estimation of the Sliced-Wasserstein Distance Using Cumulative Distribution Functions
The authors propose a novel class of estimators for the Sliced Wasserstein (SW) distance based on cumulative distribution functions (CDFs) instead of traditional quantile functions. These estimators avoid sorting projected samples and enable massive dataset parallelism by leveraging CDFs of projected measures. The method is particularly advantageous for Gaussian mixtures and federated learning, as CDFs can be computed locally and aggregated without raw data exchange. The estimators include variants with hyperparameters controlling variance and smoothness, offering flexibility for different applications.
sliced wasserstein distance · cumulative distribution functions · data parallelism · federated learning · optimal transport
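The CDF-based view can be illustrated with the one-dimensional identity W1(p, q) = ∫ |F_p(x) − F_q(x)| dx, which holds for each projected direction. The sketch below is not the paper's estimator, which the summary does not specify. It is a simple binned approximation with hypothetical function names, showing why no sorting is needed and why per-shard counts can be aggregated without exchanging raw samples.

```python
def histogram_counts(samples, lo, hi, bins):
    """Count projected samples per bin on a fixed grid; no sorting needed."""
    counts = [0] * bins
    for x in samples:
        idx = int((x - lo) * bins / (hi - lo))
        counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range samples
    return counts

def merge_counts(shards):
    """Sum per-shard histograms; only counts leave each shard."""
    return [sum(column) for column in zip(*shards)]

def w1_from_counts(counts_p, counts_q, lo, hi):
    """Approximate W1 as the integral of |F_p - F_q| over the grid."""
    width = (hi - lo) / len(counts_p)
    n_p, n_q = sum(counts_p), sum(counts_q)
    cdf_p = cdf_q = total = 0.0
    for c_p, c_q in zip(counts_p, counts_q):
        cdf_p += c_p / n_p
        cdf_q += c_q / n_q
        total += abs(cdf_p - cdf_q) * width
    return total

# Two shards hold samples of p; q lives on a single shard.
p_counts = merge_counts([
    histogram_counts([0.25] * 3, 0.0, 1.0, 100),
    histogram_counts([0.25] * 2, 0.0, 1.0, 100),
])
q_counts = histogram_counts([0.75] * 4, 0.0, 1.0, 100)
distance = w1_from_counts(p_counts, q_counts, 0.0, 1.0)
```

A sliced estimate would average this quantity over many random projection directions; the grid resolution controls the approximation error.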
DreamForge-World 0.1 Preview: A Low-Compute Real-Time Controllable World Model
DreamForge-World 0.1 Preview introduces a low-compute foundational world model for real-time interactive simulation, prioritizing consumer-GPU runtime and broad interactive capabilities. The system adapts the LongLive 1 autoregressive video stack, derived from Wan2.1-T2V-1.3B, and integrates a residual action pathway inspired by the Matrix-Game family. It supports multimodal initialization, live keyboard/mouse control, mid-stream reprompting, dual-view operation, and minute-scale interactive rollouts at 480p resolution, achieving 14-15 FPS on a single RTX 4090 with low memory usage. Leveraging open video backbones and targeted adaptation runs, the model demonstrates a cost-efficient approach to real-time controllable world-model previews, though it is not yet memory-complete or frontier-quality.
autoregressive video stack · residual action pathway · multimodal initialization · dual-view operation · low-compute adaptation
When Is a Draft Accepted? A Theory of Acceptance in Speculative Decoding
The paper develops a theoretical framework for acceptance criteria in speculative decoding, focusing on practical regimes beyond distribution-preserving settings. It characterizes rejection regions as lower level sets of the target distribution and derives exact KL divergence certificates and margin-based bounds for various acceptance criteria, including strict greedy decoding, relaxed additive/multiplicative rules, top-$m$ criteria, and entropy-thresholded acceptance. The framework is extended to greedy tree decoding, providing certificates for when the target token remains within the drafter's top-$m$ candidates. Evaluations on Qwen3 models demonstrate that relaxed and tree-based criteria significantly expand certified acceptance regions, particularly in low-margin decoding steps.
speculative decoding · greedy decoding · kl divergence · acceptance criteria · tree decoding
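The summary names several acceptance rules without defining them. The sketch below shows one plausible reading of four of them, purely as an assumption for illustration: each rule compares the target model's probability of the drafted token against the target's most likely token. The function names and the thresholds `eps`, `alpha`, and `m` are hypothetical and may differ from the paper's definitions.

```python
def accept_strict_greedy(target_probs, draft_token):
    """Accept only if the draft token is the target's argmax."""
    return target_probs[draft_token] == max(target_probs)

def accept_additive(target_probs, draft_token, eps):
    """Accept if the draft token is within eps of the top probability."""
    return target_probs[draft_token] >= max(target_probs) - eps

def accept_multiplicative(target_probs, draft_token, alpha):
    """Accept if the draft token reaches a fraction alpha of the top probability."""
    return target_probs[draft_token] >= alpha * max(target_probs)

def accept_top_m(target_probs, draft_token, m):
    """Accept if the draft token is among the target's m most likely tokens."""
    ranked = sorted(range(len(target_probs)),
                    key=lambda t: target_probs[t], reverse=True)
    return draft_token in ranked[:m]

# A low-margin step: the top two tokens are nearly tied.
probs = [0.5, 0.45, 0.05]
strict = accept_strict_greedy(probs, 1)
relaxed = accept_additive(probs, 1, eps=0.1)
```

Under these assumed definitions, the low-margin step rejects the runner-up token under the strict rule but accepts it under the relaxed ones, which matches the summary's remark that relaxed criteria matter most when margins are small.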
Your Data Manifold is Secretly a Reward Model: Shell-LCC for Text-to-Video Generation
We propose Shell-Local Coordinate Coding (Shell-LCC), a method that leverages the intrinsic manifold structure of high-quality Supervised Fine-Tuning (SFT) data to provide dense, differentiable reward signals for text-to-video (T2V) generation. Unlike traditional reward models that incur computational overhead and require costly annotations, Shell-LCC explicitly models the manifold 'surface' as an isotropic shell, avoiding mean regression and preserving high-frequency details. Experiments show that Shell-LCC enhances realism, mitigates low-level distortions, reduces over-smoothing artifacts, and alleviates motion blur in generated videos.
shell-lcc · text-to-video · manifold structure · supervised fine-tuning · reward signals
A Distributionally Robust Framework for Learned Reconstructions in Inverse Problems
The paper introduces a distributionally robust optimization (DRO) framework for learned reconstructions in inverse problems, addressing poor generalization under distributional shifts. By restricting ambiguity sets to structured perturbations aligned with the data-acquisition process, the method models uncertainty in the forward operator and noise model more faithfully. Theoretical results include strong duality and finite-dimensional dual representations, while numerical experiments on deblurring and sinogram-to-CT reconstruction demonstrate improved robustness and stability over standard DRO and MSE baselines. The framework induces Tikhonov regularization and yields effectively low-rank operators in linear settings.
distributionally robust optimization · inverse problems · structured perturbations · tikhonov regularization · worst-case risk bound
B3O: Scalable Boltzmann Batch Bayesian Optimization
B3O introduces a scalable framework for large-batch Bayesian Optimization (BO) by reframing batch generation as a sampling problem from the Boltzmann distribution defined by the acquisition function. This approach avoids computational bottlenecks and maintains batch diversity, addressing limitations of existing methods. Theoretical analysis shows negligible additional regret for queries sampled from this distribution. Empirical evaluation demonstrates B3O's superiority on synthetic benchmarks and robustness in complex tasks, including multi-objective electrode design and mixed-variable race car configuration.
bayesian optimization · boltzmann distribution · batch generation · acquisition function · multi-objective design
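The core reframing, drawing a batch from a Boltzmann distribution over the acquisition function rather than maximizing it, can be sketched on a finite candidate set. This is a simplification under stated assumptions: B3O's actual sampler and temperature schedule are not given in the summary, and the function names, the `temperature` parameter, and the discrete candidate pool here are hypothetical.

```python
import math
import random

def boltzmann_probs(acq_values, temperature):
    """Softmax of acquisition values; subtract the max for numerical stability."""
    top = max(acq_values)
    weights = [math.exp((a - top) / temperature) for a in acq_values]
    total = sum(weights)
    return [w / total for w in weights]

def sample_batch(candidates, acq_values, batch_size, temperature=1.0, seed=0):
    """Draw a batch with probability proportional to exp(acq / temperature)."""
    rng = random.Random(seed)
    probs = boltzmann_probs(acq_values, temperature)
    return rng.choices(candidates, weights=probs, k=batch_size)

candidates = ["x0", "x1", "x2", "x3"]
acq_values = [0.1, 0.9, 0.5, 0.2]
probs = boltzmann_probs(acq_values, temperature=0.5)
batch = sample_batch(candidates, acq_values, batch_size=8, temperature=0.5)
```

Lower temperatures concentrate the batch near the acquisition maximum, while higher temperatures spread it out, which is one simple way to trade exploitation against batch diversity.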
Characterizing Optimizer-Dependent Training Dynamics Through Hessian Eigenvector Displacement and Localization
The paper characterizes optimizer-dependent training dynamics by analyzing Hessian eigenvector evolution in neural networks. Using multilayer perceptrons on classification tasks, the authors measure eigenvector dynamics via (i) temporal displacement metrics and (ii) localization through inverse participation ratio, comparing against a random Hessian null model. Results show SGD stabilizes leading curvature directions, while Adam exhibits stronger eigenvector reorganization and parameter subset localization in dominant curvature directions. These findings demonstrate Hessian eigenvector dynamics differentiate optimizer behaviors and training trajectories.
hessian eigenvectors · training dynamics · optimizer comparison · localization · inverse participation ratio
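For readers unfamiliar with the localization measure, the inverse participation ratio (IPR) of a vector v is Σ v_i⁴ / (Σ v_i²)². A minimal sketch with a hypothetical function name, using plain lists rather than actual Hessian eigenvectors:

```python
def inverse_participation_ratio(vector):
    """IPR ranges from 1/n (fully delocalized) to 1 (single component)."""
    norm_sq = sum(v * v for v in vector)
    return sum(v ** 4 for v in vector) / (norm_sq ** 2)

localized = inverse_participation_ratio([0.0, 1.0, 0.0, 0.0])    # one parameter
delocalized = inverse_participation_ratio([0.5, 0.5, 0.5, 0.5])  # spread evenly
```

A vector concentrated on one parameter gives an IPR of 1, while one spread evenly over n parameters gives 1/n, so a higher IPR indicates stronger localization on a parameter subset.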
Robust Strategic Classification under Decision-Dependent Cost Uncertainty
The paper introduces a two-stage robust optimization framework for strategic classification that accounts for decision-dependent cost uncertainty, addressing a key limitation in existing literature which assumes fixed manipulation costs. The proposed method captures how manipulation costs evolve based on past algorithmic decisions, reducing uncertainty and mitigating gaming behavior over time. Results demonstrate that incorporating policy-dependent costs not only enhances robustness but also more effectively curtails strategic manipulation of algorithmic systems.
strategic classification · robust optimization · decision-dependent uncertainty · manipulation costs · algorithmic gaming
Predictive Objectives Discard Exogenous Control-Relevant Features: A Controlled Mechanistic Study
The study demonstrates that joint-embedding predictive objectives (JEPA-style) discard exogenous yet control-relevant features due to their focus on temporal predictability rather than control-relevance. Through a controlled 2x2 experimental design varying feature controllability and relevance, six objectives were evaluated: reconstruction, JEPA variants, inverse dynamics, and reward-grounded JEPA. Results show reward-free predictive objectives fail to retain exogenous control-relevant features (near chance accuracy), while reward-grounded JEPA recovers them with as little as 2% reward-labeled transitions, robust across environments (16-1024 latent dimensions). Latent geometry analysis reveals JEPA achieves minimal class separation compared to supervised references.
joint-embedding predictive objectives · exogenous features · control-relevance · temporal predictability · bisimulation theory
Data-Driven Energy-Based Learning via Gibbs Measures on Hierarchical Structures
The paper introduces a probabilistic framework for learning systems using Gibbs measures on hierarchical structures, replacing empirical risk minimization with an energy-based model derived from empirical loss functions. It formulates consistency conditions for finite-volume distributions and derives nonlinear integral fixed-point equations to characterize equilibrium learning states. The analysis reveals phase-transition phenomena in hierarchical systems, where multiple Gibbs measures emerge beyond critical thresholds, corresponding to distinct prediction regimes. Numerical experiments with non-separable kernels demonstrate coexisting solution branches, illustrating data-induced probabilistic landscapes.
gibbs measures · hierarchical structures · energy-based learning · phase-transition · fixed-point equations
From Failure Taxonomy to Intervention: A Diagnostic Methodology for Industry-Scale AVLM in Video and Live-Streaming Platform Moderation
The paper introduces a diagnostic methodology for industry-scale Audio-Visual-Language Models (AVLM) development, addressing the gap between generic pretrained models and platform-specific moderation requirements. The method maps model failures to a taxonomy of observable signatures and links each failure class to targeted intervention spaces, enabling traceable improvements. The authors instantiate this approach in a large-scale video and live-streaming platform, resulting in a system supporting over 100 regions and handling noisy, ambiguous global content.
audio-visual-language models · failure taxonomy · model intervention · content moderation · multimodal foundation models
Notes on generative modeling: flow matching, diffusion, optimal transport and Schrödinger bridge
This work provides a unified mathematical framework connecting key generative modeling techniques. The author establishes theoretical links between optimal transport, Schrödinger bridge, flow matching, and diffusion-based approaches, demonstrating their shared underlying principles. Through mathematical analysis, the notes reveal how these methods relate to each other in terms of their formulations and optimization objectives. The exposition offers a consolidated perspective on modern generative modeling, highlighting the connections between these approaches that are often treated separately in the literature.
generative modeling · optimal transport · schrödinger bridge · flow matching · diffusion models
Bridging the Gap Between Image Restoration and Navigational Safety in Hazy Conditions: A New Visibility Estimation Metric for Maritime Surveillance
The study introduces a visibility-oriented evaluation framework for maritime surveillance, addressing the gap between image dehazing quality and navigational safety in hazy conditions. A Maritime Simulated Visibility Dataset (MSVD) is constructed using Unity3D, providing paired hazy and clear images with precise visibility annotations. The proposed metric leverages object detection accuracy to map visibility distance to detection performance, converting image restoration improvements into measurable visibility gains. Six dehazing methods are evaluated using both conventional metrics and the proposed framework. Results demonstrate MSVD's reliability as a benchmark and the metric's effectiveness in interpretable visible-distance estimation, supporting navigational safety assessment.
visibility estimation · image dehazing · maritime surveillance · object detection · simulated dataset
Building Multi-Task Agentic LLMs via Two-Phase Distillation
The paper proposes a two-phase distillation method for building multi-task agentic LLMs that matches single-task RL expert performance. It identifies off-policy distillation's mode-covering limitation in multi-task settings and on-policy distillation's need for strong initialization, combining them sequentially for optimal performance. Evaluations on conversational agents and text-based games show the two-phase approach outperforms standalone off-policy or on-policy methods.
multi-task learning · reinforcement learning · knowledge distillation · mode-covering · on-policy
Heads, Not Backbones: Output Heads Dominate Architectures on Fat-Tailed Returns
The study demonstrates that output heads, not backbone architectures, dominate performance in forecasting fat-tailed financial returns at short horizons. Comparing four backbones (TimesNet, DLinear, N-BEATS, iTransformer) with three output heads (point, single-Gaussian, Gaussian mixture), results show head choice drives CRPS improvements (3.7pp gradient), while backbone swaps yield ≤5.1% changes. Mixture heads excel in high-volatility regimes (13.9% CRPS gain in 1970s stagflation). Horizon analysis reveals head dominance at short horizons (h<6), with backbones prevailing at longer horizons. Distributional metrics (CRPS, pinball) separate heads, unlike squared error.
fat-tailed returns · output heads · backbone architectures · gaussian mixture · crps
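CRPS, the headline metric here, has a well-known closed form for a Gaussian forecast and reduces to absolute error for a point forecast. The sketch below uses hypothetical function names and made-up numbers, not the paper's data, to show how a distributional head is scored and why a wider predictive variance is penalized less on a tail event.

```python
import math

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a N(mu, sigma^2) forecast at observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))

def crps_point(forecast, y):
    """CRPS of a point forecast is the absolute error."""
    return abs(forecast - y)

# A three-sigma surprise relative to the narrow forecast.
narrow = crps_gaussian(0.0, 1.0, 3.0)
wide = crps_gaussian(0.0, 3.0, 3.0)
point = crps_point(0.0, 3.0)
```

In this toy case the wide Gaussian scores better than the narrow one, and both score better than the point forecast with the same mean. Lower CRPS is better.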
Consensus Clustering of Free-Viewing Gaze Data: New Insights into Human-Information Interaction
The paper introduces EnsembleGaze, an unsupervised ensemble learning system for consensus clustering of free-viewing gaze data to analyze human-information interaction patterns. The method employs feature engineering based on statistical descriptors of fixation distributions, followed by consensus voting of clustering methods to compute a co-association matrix. Two high-dimensional clustering strategies—consensus subspace clustering and spectral biclustering—are proposed for joint user and image characterization. Results show robust image stimuli groupings (ambient vs. focal viewing modes) and context-dependent user groupings, with biclustering uniquely recovering this structure. Evaluation on public datasets reveals dataset-specific patterns.
consensus clustering · gaze data · ensemble learning · spectral biclustering · human-information interaction
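The co-association matrix at the center of consensus voting records, for every pair of items, the fraction of base clusterings that place them in the same cluster. A minimal sketch of that standard construction, with a hypothetical function name and toy labels rather than EnsembleGaze's actual pipeline:

```python
def co_association(labelings):
    """Fraction of clusterings in which items i and j share a cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(labelings) for c in row] for row in counts]

# Three base clusterings of four items; label values are arbitrary per run.
labelings = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]
matrix = co_association(labelings)
```

Because only co-membership is counted, the matrix is invariant to how each base method numbers its clusters, and it can then be fed to a final clustering step to obtain the consensus partition.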
A multi-architecture study of specificity refinement and false-positive mechanism analysis in prostate MRI
This study characterizes false positives in prostate MRI detection and evaluates a lightweight post-hoc refinement head for case-level specificity. Using PI-CAI (5-fold cross-validation) and Prostate158 datasets, a context-aware evidence head and an 89,216-parameter refinement head were trained on a frozen detection backbone, with additional training on nnU-Net, U-Net, Mamba, and MIGF-Mamba architectures. False positives exhibited contrast ratios closer to those of true cancers than to those of benign tissue, a finding that replicated across five architectures and modality-perturbation scenarios. Refinement improved case-level specificity from 0.469 to 0.549 (+17.2%) on PI-CAI fold-0 while maintaining sensitivity (0.943), though fold-conditional behavior was observed. The results suggest that false positives share raw imaging features with cancers; they do not establish histologically confirmed mimicry.
false positives · post-hoc refinement · contrast ratios · case-level specificity · modality-perturbation
Atompack: A Storage and Distribution Layer for Read-Heavy Atomistic ML Training Datasets
Atompack introduces a storage and distribution layer optimized for read-heavy atomistic ML training datasets, focusing on immutable snapshots with efficient append operations and memory-mapped reads. The system prioritizes complete molecular record serving over field chunks or object reconstruction, aligning with training pipelines' shuffled access patterns. Benchmarks against HDF5, LMDB, and ASE show Atompack achieves 96x faster shuffled reads than ASE LMDB and 79% smaller artifact sizes for 64-atom workloads.
atomistic machine learning · storage format · memory-mapped reads · training pipeline · shuffled access
NeuReasoner: Theory-grounded Mapping of Reasoning Elicitation Boundaries
The paper introduces NeuReasoner, a theory-grounded elicitation instrument combining Neuro Lenses (functional specificity) and Cognitive Lenses (Erotetic Theory of Reasoning) to probe reasoning boundaries in large language models. Through internal modularization, it evaluates performance on CogBench (cognitive psychology tasks) and standard benchmarks. Results show NeuReasoner matches or exceeds thinking-mode baselines on arithmetic, code generation, Bayesian reasoning, and reward learning at scale, but fails on risk-taking and decision-making under uncertainty. Scale interacts variably with elicitation, widening advantages on some tasks while erasing them on others.
elicitation boundaries · cognitive lenses · internal modularization · thinking-mode baselines · functional specificity
Improved Predictive Performance and Interpretability for Mesomorphic Neural Networks Using Local Fidelity Regularization
The paper introduces Local Fidelity Regularization (LFR) to address degenerate weight collapse in Interpretable Mesomorphic Neural Networks (IMNs), where explanatory variance concentrates in a single output weight. LFR aligns linear output weights with local data variations, ensuring faithful interpretations without sacrificing predictive performance. Empirical results on the OpenML benchmark suite show LFR improves AUROC over unregularized IMNs while maintaining competitive accuracy with black-box models.
interpretable mesomorphic neural networks · local fidelity regularization · degenerate weight collapse · openml benchmark · auroc
Diagnosing and Mitigating Retrieval Bottlenecks in LLM-Based Cold-Start Recommendation
The study evaluates LLM-based rerankers in cold-start recommendation systems, revealing performance gaps despite semantic understanding expectations. Using a five-domain benchmark separating reranking quality from retrieval coverage, it shows calibrated LLM rerankers (Qwen3-8B to Qwen3-32B) underperform collaborative/content baselines in natural traffic and struggle with retrieval-realistic regimes (gold item present only 4.6-22.9% of the time). The proposed LHF (learned hybrid fusion) improves retrieval coverage (17-61% recovery on content-rich domains) but highlights persistent mismatches in LLM reranking pipelines. The benchmark protocol and artifacts are publicly released.
llm reranking · cold-start recommendation · retrieval coverage · learned hybrid fusion · multi-retriever pool
Bandwidth Selection in Kernel Density Estimation for Model Calibration
The paper introduces Risk Alignment (RA), a novel optimization framework for selecting optimal kernel bandwidths in Kernel Density Estimation (KDE) for model calibration. RA aligns KDE-reconstructed risk with empirical risk to minimize calibration estimation bias, providing a principled criterion applicable to various metrics like canonical calibration error. Theoretical analysis shows RA's effectiveness across data distributions. Experiments on multiple architectures and datasets demonstrate RA's consistent superiority over standard bandwidth selection methods, yielding more reliable calibration assessments.
kernel density estimation · model calibration · bandwidth selection · risk alignment · canonical calibration error
MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation
MemDelta introduces a controlled evaluation protocol for agent memory systems, isolating component effects by varying one element at a time on LongMemEval-S (500 questions, 50+ sessions, three model families). Key findings include: (1) performance rankings reverse across models (Gemini gains +14pp from full context, Sonnet +31pp from RAG); (2) embedding model swaps shift accuracy by +6.2pp (p = 0.004); (3) self-memory underperforms basic retrieval (42% vs. 47%); (4) narrow cost-benefit tradeoffs (Mem0 matches cloud RAG on 2/6 question types at 50x cost). The study recommends fixed embeddings, model-family stratification, and cost reporting in memory evaluations.
agent memory · controlled evaluation · embedding model · retrieval-augmented generation · cost-benefit analysis
Golden Hour Divide: Trauma Care Accessibility and Resource Vulnerability in Sri Lanka
This study evaluates trauma care accessibility in Sri Lanka by quantifying gaps between clinical demand and specialized resource availability across 25 districts. Using national epidemiological data and terrain-aware H3 hexagonal modeling, the authors analyzed accessibility for seven critical conditions based on spatial gaps, clinical need-gaps, lethality, coverage, and resource availability. Unsupervised K-Means clustering categorized districts into four policy-actionable archetypes, revealing severe service deficits in Northern and Eastern provinces, where spatial gaps exceed 70%. The findings suggest that improving accessibility by 25% in high-priority clusters would reduce the national need-gap by 9.65%, providing a roadmap for strategic specialist redistribution.
h3 hexagonal modeling · clinical need-gaps · k-means clustering · spatial gaps · terrain-aware
Same Concept, Different Directions: Cross-Modal Feature Heterogeneity in Sparse Autoencoders
The paper identifies cross-modal feature heterogeneity in vision-language models, where semantically corresponding features diverge directionally across image and text modalities. To address this, the authors propose training modality-specific sparse autoencoders that preserve each modality's feature geometry, followed by post hoc alignment of corresponding features. This approach improves reconstruction fidelity and enhances performance in cross-modal retrieval and concept steering tasks, demonstrating that latent activation alignment alone is insufficient to resolve feature mismatch.
cross-modal feature heterogeneity · sparse autoencoders · vision-language models · feature geometry · concept steering
Decision-Value Attribution in Predict-then-Optimize Systems
The paper introduces Decision Value Attribution (DVA), a Shapley-based framework for explaining the operational value of predict-then-optimize systems by attributing value to information sources or design parameters. Three variants are proposed: InfoDVA (feature attribution), DesignDVA (operational configuration attribution), and Decision-Value Interactions (DVI) for joint attribution. The method distinguishes post-DVA (realized outcomes) from pre-DVA (model predictions) to diagnose alignment between model beliefs and performance. Case studies in electricity storage arbitrage and emergency medical services demonstrate DVA's ability to reveal mismatches between predictive explanations and operational value, guiding targeted interventions.
shapley value · predict-then-optimize · value attribution · operational decision-making · decision relevance
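As a rough illustration of the Shapley idea behind DVA, the sketch below computes exact Shapley values by averaging marginal contributions over all orderings. The source names (`price_forecast`, `weather`) and the payoff numbers in `toy_decision_value` are hypothetical and are not taken from the paper; the paper's own estimator may differ.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_decision_value(sources):
    """Hypothetical operational value of a decision given available sources."""
    v = 0.0
    if "price_forecast" in sources:
        v += 10.0
    if "weather" in sources:
        v += 4.0
    if {"price_forecast", "weather"} <= sources:
        v += 6.0  # assumed interaction bonus when both sources are present
    return v


attribution = shapley_values(["price_forecast", "weather"], toy_decision_value)
```

The attributions sum to the value of the full coalition, which is the efficiency property that makes Shapley-style accounting attractive for splitting operational value among sources. Exhaustive enumeration is only practical for a handful of players.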
Implementation of Hyperelastic Physics-Augmented Neural Networks in the Explicit Finite Element Codes Simcenter Radioss and OpenRadioss with Applications to Impact Events
This work integrates physics-augmented neural networks (PANNs) into the explicit finite element solvers Simcenter Radioss and OpenRadioss, enabling machine-learning-based constitutive modeling for engineering simulations. A framework is developed to transfer pretrained PANNs, trained in PyTorch or TensorFlow, into Fortran user material routines, ensuring compatibility with existing finite element technology without specialized solvers. Computational efficiency is optimized by replacing SoftPlus with SQuarePlus activation functions, reducing evaluation costs while maintaining accuracy. A GitHub repository automates routine generation, requiring only network architecture and trained parameters. Impact simulations demonstrate that PANNs accurately reproduce nonlinear hyperelastic material behavior under large strains, validating their practical application in explicit finite element simulations.
physics-augmented neural networks · explicit finite element · hyperelastic materials · fortran user material · squareplus activation
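For readers unfamiliar with the activation swap, here is a minimal Python sketch of the two functions as they are commonly defined. The SquarePlus hyperparameter `b` and its default of 4.0 are assumptions; the summary does not state the value used in the Radioss routines.

```python
import math


def softplus(x):
    """Softplus log(1 + exp(x)), written in a numerically stable form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def squareplus(x, b=4.0):
    """SquarePlus 0.5 * (x + sqrt(x^2 + b)); b is an assumed hyperparameter."""
    return 0.5 * (x + math.sqrt(x * x + b))
```

SquarePlus needs only a multiply, an add, and a square root, whereas Softplus needs an exponential and a logarithm, which is the plausible source of the reduced evaluation cost. Choosing `b = 4 * math.log(2) ** 2` makes the two functions agree at zero.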
Comparing Chatbot Performance Enhanced with Persistent Homology
The study investigates performance enhancement in chatbots using persistent homology (PH) vectorizations derived from raw datasets, particularly for scenarios with limited or confidential training data. The authors compare multiple chatbot models with and without PH augmentation across various metrics. Results indicate that PH enhancement occasionally yields significant improvements at minimal computational cost, though benefits are not universally observed. The approach addresses challenges in domain-specific or privacy-sensitive applications where large datasets are unavailable.
persistent homology · chatbot performance · dataset augmentation · privacy-sensitive training · vectorization
Theory of Continual Learning Against Data Poisoning Attacks
We develop a theoretical framework for analyzing data poisoning attacks and defenses in regularization-based continual learning (CL), addressing a critical gap in CL security. By modeling adversary-defender interactions as an online zero-sum game, we establish fundamental performance limits: no defense succeeds against linear-proportion task poisoning with unbounded noise. We then analyze two defensible scenarios: infrequent attacks and bounded noise per attack. For infrequent attacks, we propose a task-to-task verification mechanism to detect poisoning and reduce cumulative bias. For bounded noise, we derive a robust defense that minimizes sensitivity to poisoned features, provably accelerating convergence. Experiments on realistic tasks validate our theoretical findings.
continual learning · data poisoning · regularization-based · online zero-sum game · task-to-task verification
The Forgetting-Retention Dilemma: Certified Unlearning Theory in Continual Learning
This work establishes the first theoretical foundation bridging continual learning (CL) and machine unlearning by formulating CL's unlearning objective as minimizing post-unlearning excess risk. The authors decompose this risk into CL excess risk and unlearning loss, characterizing the trade-off between knowledge preservation and targeted forgetting. Under mild assumptions, they derive an upper bound for CL excess risk in non-convex models and adapt gradient-based and Hessian-based certified unlearning approaches to CL. Experiments validate that while Hessian-based methods minimize unlearning loss more effectively, gradient-based approaches offer near-zero storage overhead, motivating a hybrid strategy balancing performance and efficiency.
continual learning · machine unlearning · excess risk · non-convex models · certified unlearning
MemLeak: Diagnosing Information Leaks in Multimodal Agent Memory
The paper introduces MemLeak, a benchmark for diagnosing information leaks in multimodal agent memory systems when facts are deleted. The authors propose an Information Provenance Graph (IPG) taxonomy to classify memory representations by deletion affordance, revealing multiple leakage channels. Experiments show that while direct probing yields <1% recovery, retained correlated text enables 18.3% recovery and retained images enable 12.0% recovery (47% of image leaks are not recoverable from text), with content-aware semantic deletion reducing image residuals to 2.0%. Results are validated across multiple VLMs, a production system, and real photographs, with dual-annotator human validation (kappa=0.88).
multimodal memory · information leakage · visual language models · deletion affordance · information provenance graph
GLIP: Graph and LLM Joint Pretraining for Graph-Level Tasks
The paper introduces GLIP, a joint pretraining framework combining graph neural networks (GNNs) and large language models (LLMs) for graph-level tasks. The method employs graph augmentation to construct contrastive pairs, a multi-token selection strategy for informative patches, and a diffusion-based projector to capture global-local contextual signals. A joint objective aligns semantic (LLM) and structural (contrastive) supervision. Experiments demonstrate GLIP's superiority over state-of-the-art methods in graph-level classification and reasoning tasks with limited labeled data.
graph neural networks · large language models · contrastive learning · diffusion projector · graph-level tasks
How Far Do On-Prem Open LLMs Get on Text-to-SQL? A Cross-Family Size x Technique Frontier on BIRD
This study benchmarks on-premises open-weight LLMs for Text-to-SQL on the BIRD dataset (n=1534, Execution Accuracy), evaluating Qwen2.5-Coder, CodeLlama-Instruct, and Llama-3.x families across sizes (7B-70B) under a unified protocol. The authors ablate model-agnostic techniques (schema linking, self-correction, self-consistency) and analyze their impact. Key findings: (1) model generation matters more than size, with Qwen2.5-Coder outperforming CodeLlama-Instruct at matched sizes; (2) self-correction consistently improves accuracy; (3) schema linking provides no significant benefit despite high recall; (4) self-consistency offers minimal gains at high computational cost. Results are validated via McNemar tests, with full reproducibility and cost analysis provided.
text-to-sql · execution accuracy · schema linking · self-correction · self-consistency
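McNemar tests compare two systems on the same questions using only the discordant pairs. The sketch below implements the exact (binomial) two-sided variant in plain Python; the summary does not say which variant the authors used, and the argument names `b` and `c` are illustrative.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: questions system A answered correctly and system B did not.
    c: questions system B answered correctly and system A did not.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Questions that both systems get right or both get wrong do not enter the statistic, which is why a paired test fits a fixed evaluation set such as BIRD better than comparing two accuracy figures independently.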
Optimizing Nursing Care Taxi Dispatch Leveraging Integer Linear Programming Solvers and Machine Learning
The paper introduces Nursing Care Taxi Dispatch, a constrained Vehicle Routing Problem variant with wheelchair, compatibility, and temporal constraints, where neural methods typically fail due to complexity. A Transformer-based supervised learning approach is proposed, trained on high-quality solutions from an integer linear programming solver, with post-processing for constraint satisfaction. Evaluations on real-world data show 8% lower operating times for <30-user instances while minimizing violations, outperforming existing methods in time-vs-quality tradeoffs.
vehicle routing problem · integer linear programming · transformer architecture · constraint satisfaction · supervised learning
Simplifying Flow Matching Transformations with Low-Rank Mixture Models
The authors propose using mixtures of probabilistic principal component analyzers (MPPCA) as latent densities in normalizing flows to simplify flow transformations and improve generative performance. By aligning the latent distribution more closely with the data distribution in terms of KL divergence, the method enables faster convergence and reduces topological mismatch. MPPCA models are efficiently fit using expectation-maximization, making them practical for high-dimensional tasks. Empirical validation on tabular and image datasets demonstrates consistent improvements in training efficiency and generation quality compared to standard normal latent densities.
normalizing flows · mppca · kl divergence · expectation-maximization · generative models
ScaleAware-JEPA: Latent Representation for Discovery in Multiscale Physical Fields
ScaleAware-JEPA introduces a self-supervised framework for learning latent representations of multiscale physical fields by aligning predictive tasks with inherent scale hierarchies. The method employs Constrained Diffusion Decomposition (CDD) to separate fields into scale components, using diffusion-derived coordinates to define context-target masking geometry rather than fixed patches. Evaluated on MHD turbulence, interstellar molecular gas, and urban nighttime-light data, the approach generates dense structural atlases without labels, revealing coherent morphology through scale-aware latent spaces.
multiscale representation · self-supervised learning · constrained diffusion decomposition · latent coordinates · physical fields
The Hidden Cost of Resampling: How Imbalance Correction Degrades Probability Calibration in Tree Ensembles
The study quantifies how class-imbalance correction methods affect probability calibration in tree ensembles, demonstrating that SMOTE introduces minor calibration degradation (ECE +0.009) while random undersampling causes severe miscalibration (ECE up to 0.395 at imbalance ratio 70). Through systematic experiments on five datasets (imbalance ratio 1.9-70) with random forests and gradient boosting, the authors show that post-hoc recalibration (Platt or isotonic) effectively mitigates these issues (66% ECE reduction) with minimal impact on discrimination (AUC -0.002). They establish that prior-shift correction fails for SMOTE due to distorted class-conditional densities, necessitating data-driven recalibration.
class-imbalance · probability calibration · smote · random undersampling · post-hoc recalibration
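The ECE figures quoted above can be read against a minimal sketch of one common definition: binary expected calibration error with equal-width bins over the predicted positive-class probability. The bin count of 10 and the binary formulation are assumptions; the paper may use a different binning scheme.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: weighted mean gap between predicted and observed positive rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        frac_pos = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / n * abs(mean_conf - frac_pos)
    return ece
```

Under this definition, a model trained on undersampled data that systematically inflates minority-class probabilities shows a large gap in every occupied bin, which is consistent with the severe miscalibration the summary reports.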
A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics in Self-Adapting LLM Agents
The authors introduce the Evaluator Preference Collapse (EPC) framework to diagnose instability in LLM evaluators, comprising the Multimodal Preference Collapse Index (MPCI), evaluator-indexed coupling matrix, and Jensen-Shannon divergence (JSD). They apply EPC across eight experimental conditions (N=122 repetitions), revealing evaluator coupling coefficients ranging from 0.00 to 1.18 (CV≈0.9), with four conditions showing strong coupling and four collapsing to near-zero. A notable finding is the May-to-June GPT-4o drift, where evaluator instability inverted study conclusions. Self-evaluation consistently collapsed (97% zero, JSD=0.003), though floor effects may confound results. Output-format analysis showed aggregate ρ=0.89 but per-instance ρ=0.219 (p=0.093).
evaluator preference collapse · multimodal preference collapse index · coupling matrix · jensen-shannon divergence · llm evaluator instability
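For context on the JSD values reported above, a minimal sketch of Jensen-Shannon divergence between two discrete distributions follows. The natural logarithm is an assumption; the summary does not state the base, and base 2 would rescale the upper bound from ln 2 to 1.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero mass in x contribute nothing to the sum.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions give 0 and disjoint ones give ln 2, so a value such as 0.003 sits very close to the "no change" end of the scale.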
IG-Lens: Exact Additive Probability Attribution Across Transformer Layers via Telescoping Integrated Gradients
IG-Lens introduces an exact additive probability attribution method for decoder-only transformers, addressing limitations in existing layer-wise readout tools. By applying Integrated Gradients in a telescoping manner along hidden states from baseline to final layer, it attributes probability changes per layer while preserving softmax nonlinearity. The method ensures exact summation to total probability change, eliminates Riemann discretization error, and operates efficiently in a single-pass batched implementation. Verification shows completeness to floating-point precision, with code available on GitHub.
integrated gradients · probability attribution · transformer layers · softmax nonlinearity · telescoping sum
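One way to read the telescoping construction, as a sketch under assumed notation rather than the paper's own formulation: let $p(\cdot)$ be the readout probability as a function of a hidden state, $h_0$ the baseline, and $h_1, \dots, h_L$ the per-layer hidden states. The total change splits into per-layer differences, and each difference equals a straight-line path integral of the gradient by the completeness property of Integrated Gradients.

```latex
p(h_L) - p(h_0) = \sum_{\ell=1}^{L} \bigl[ p(h_\ell) - p(h_{\ell-1}) \bigr],
\qquad
p(h_\ell) - p(h_{\ell-1})
  = \int_0^1 \nabla p\bigl(h_{\ell-1} + \alpha\,(h_\ell - h_{\ell-1})\bigr)^{\top}
    (h_\ell - h_{\ell-1}) \, d\alpha .
```

Because the left-hand differences are available directly from the forward pass, the per-layer attributions sum to the total by construction. How the paper avoids discretizing the integral is not described in the summary.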
CAREBench: A Child-Safety Risk Benchmark for Language Models
CAREBench introduces a child-safety risk benchmark for language models, focusing on upstream risks before explicit harm occurs. The benchmark comprises 500 prompts across 12 categories (e.g., grooming, emotional dependency) annotated by parents and clinicians, excluding explicit abuse material. Evaluation of seven frontier models reveals failure rates from 2% to 58%, with varying patterns across risk categories. The benchmark aids LLM developers in identifying and mitigating child-safety policy gaps.
child-safety evaluation · language models · risk categories · upstream risks · failure rates
Learning as Observable Matrix Dynamics: Diffusive Relaxations versus Phase Transitions
The paper introduces Observable Matrix Dynamics (OMD), a diagnostic framework for analyzing neural network training dynamics through time-evolving distance matrices of internal representations. OMD employs random matrix theory and particle dynamics to detect spectral reorganizations, decomposing matrices via Bogomolny-Bohigas-Schmit theory into ambient noise and latent geometric structures. Experiments reveal two regimes: diffusive dynamics lack stable spectral structure, while sharp reorganizations produce identifiable fingerprints corresponding to smooth, clustered, or soliton-like geometries. The method provides geometric regime identification beyond scalar intrinsic dimension metrics.
observable matrix dynamics · random matrix theory · spectral reorganization · bogomolny-bohigas-schmit theory · latent geometry
I-BBS: Coordinate-Free Inference of Latent Sub-Manifolds Using Random Distance Matrix Theory
I-BBS introduces a coordinate-free method for inferring latent sub-manifolds from high-dimensional ambient distance matrices, applicable even when the ambient vector space is partially observable or undefined. The approach models ambient embeddings using generative noise, distinguishing between model-based and model-free classes, and identifies latent geometry through integer-stable signatures: the multiplicity of the top non-Perron multiplet and a parameter-free law governing multiplet positions under noise. Tests on synthetic spheres $S^1$, $S^2$, and $S^3$ demonstrate superior noise stability compared to continuous spectral slope, enabling accurate recovery of both manifold and noise model from a single distance matrix.
latent sub-manifolds · distance matrix · generative noise · integer-stable signatures · non-perron multiplet
Adjusted Wasserstein distances for bridging empirical and true distributions with applications to MDS
The paper introduces Max-D-SW, an adjusted Wasserstein distance that aggregates contributions over orthonormal bases instead of single unit directions, enhancing Multidimensional Scaling (MDS) for pattern recognition. This modification improves numerical performance, particularly with heavy-tailed distributions, while maintaining statistical tractability with sample-complexity bounds comparable to max-sliced Wasserstein. Results demonstrate that superior sample complexity does not always correlate with better MDS performance, highlighting a nuanced trade-off in metric selection.
wasserstein distance · multidimensional scaling · pattern recognition · sample complexity · heavy-tailed distributions
Benchmarking Geospatial Foundation Models for Agriculture Applications
The study benchmarks geographic transferability of geospatial foundation models (Prithvi, SpectralGPT, SatMAE) for agricultural applications, revealing significant performance degradation under regional distribution shifts. Using a controlled evaluation across four U.S. states (Iowa, North Carolina, California, Minnesota) with regionally separated train/validation/test splits, the authors measure cross-region generalization in multi-temporal crop segmentation and change detection. All models exhibit sharp performance drops, disproportionately predicting common crops while missing rare ones, with additional confounding effects from standardized input formatting. Results highlight critical limitations in current geospatial foundation models and advocate for region-aware evaluation standards.
geospatial foundation models · regional transferability · multi-temporal segmentation · crop classification · distribution shift
t-STEP: An interpretable model for Total Electron Content predictions and irregularities estimations
The study introduces t-STEP, an interpretable machine learning model for high-resolution (30-second) Total Electron Content (TEC) prediction and irregularity estimation in the ionosphere. The model leverages GPS observations from solar cycle 24, employing SHAP for feature interpretability and dynamic time warping for robustness evaluation. Results show 91% accuracy (MAE: 4.38 TECU) during high solar activity, outperforming IRI-2020 by 35% in accuracy and 57% in error reduction, while capturing storm-induced irregularities better than an LSTM baseline.
total electron content · ionospheric irregularities · interpretable machine learning · dynamic time warping · geomagnetic storms
Lie Group Diffusion Models for Hardware-Aware Quantum Circuit Synthesis
We introduce Lie group diffusion models for hardware-aware quantum circuit synthesis, addressing the hybrid continuous-discrete structure of unitary compilation. The method combines a discrete circuit skeleton selector with a diffusion model operating on the SU(2) manifold to generate quantum gates. Evaluated on three-qubit Hamiltonian simulation targets (Transverse Field Ising Model, Heisenberg-XXZ Model), the approach outperforms baselines in synthesizing customizable circuits with varying rotation angles while balancing fidelity and complexity. Results demonstrate effective hardware constraint incorporation and natural geometric integration for quantum circuit synthesis.
quantum circuit synthesis · lie group diffusion · su(2) manifold · hamiltonian simulation · hardware constraints
Kriging and neural network models for pressure losses across perforated plates
Novel data-driven models using kriging and neural networks (NN) are proposed to predict pressure losses across perforated plates in turbulent flows, outperforming empirical formulae across most configurations. The models are trained on limited experimental datasets and validated against measurements, demonstrating strong predictive accuracy. Their applicability is further tested in numerical simulations using Reynolds-averaged Navier-Stokes (RANS) equations, where the models are implemented as source terms in momentum equations. RANS predictions align excellently with model outputs, confirming their suitability for computational fluid dynamics applications.
kriging · neural networks · pressure losses · perforated plates · rans equations
Bidirectional Autoregressive Latent Diffusion for Forward and Inverse Magnetohydrodynamics
The paper introduces a bidirectional autoregressive latent diffusion model for predicting multi-field magnetohydrodynamics (MHD) evolution. The method leverages bidirectional temporal flow as a self-supervised consistency metric, enabling uncertainty estimation without ground truth by comparing forward-backward predictions. Results demonstrate applications in non-invasive plasma diagnostics and robustness improvement via adaptive feedback from sparse measurements.
bidirectional autoregressive · latent diffusion · magnetohydrodynamics · self-supervised consistency · uncertainty estimation
Boundary Degree as a Node-level Feature for Epidemic Scenario Identification in Agent-based Cascade Simulations
The paper introduces boundary degree, a node-level feature defined as the count of an infected node's uninfected contacts in a contact network, for epidemic scenario identification in agent-based cascade simulations. Through systematic ablation studies on realistic social contact networks of Tennessee and Virginia, the authors demonstrate that boundary degree alone improves scenario identification accuracy by 19%. The study provides theoretical grounding for the empirical importance of edge features and shows that boundary degree and edge features have complementary effects. The results indicate that certain epidemic scenarios are indistinguishable without boundary or edge information, suggesting that contact tracing applications should track contacts with non-infected individuals.
boundary degree · epidemic scenario identification · agent-based simulations · contact tracing · node-level feature
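The boundary-degree definition given above (the count of an infected node's uninfected contacts) is simple enough to sketch directly. The adjacency-dict representation and the toy network are illustrative assumptions, not the paper's data format.

```python
def boundary_degree(contacts, infected):
    """Map each infected node to the number of its contacts that are not infected.

    contacts: dict mapping node -> set of neighbouring nodes (assumed format).
    infected: set of currently infected nodes.
    """
    return {
        node: sum(1 for nb in contacts.get(node, ()) if nb not in infected)
        for node in infected
    }


# Toy contact network: a-b, a-c, b-c, b-d
toy_contacts = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
toy_result = boundary_degree(toy_contacts, {"a", "b"})
```

In the toy network, node `b` has two uninfected contacts and node `a` has one, so `b` sits on a wider infection frontier even though both are infected.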
STEMGym: Benchmarking Sequential Decision-Making under Dose Budgets in Autonomous Electron Microscopy
The paper introduces STEMGym, a Gymnasium benchmark for autonomous electron microscopy, challenging the assumption that adaptive navigation is key to sample-efficient acquisition. The benchmark comprises 15 physics-simulated STEM environments across five materials, three difficulty levels, and four tasks, evaluated via Dose-Efficiency Curve area (DEC-AUC). Results show perception pipelines dominate dose efficiency: a CNN analyst with naïve raster scanning improves DEC-AUC by 5.5x over baseline (0.287 vs. 0.052), while advanced navigation methods yield no significant gains. Vision-language models underperform task-specific CNNs by ~13x in defect analysis.
stemgym · dose-efficiency curve · autonomous microscopy · perception pipeline · cnn analyst
Geometric Algebra Meets Cartesian Tensors: Higher-Order Equivariance for Interatomic Potentials
(No summary returned.)
Speculative Pre-Positioning: Decoding Stateful Sessions to the Next Decision Point Off the Critical Path
The paper introduces speculative pre-positioning, a method for stateful session management in LLM inference that reduces latency by pre-decoding sessions to their next decision point during idle periods. The approach uses the target model's own forward pass (without a draft model) to move cross-request prefill and entry-decode off the critical path. Results show a capable model achieves 87% precision in triggering the confidence gate, reducing first-token latency to 1.0 ms compared to 39 ms with prefix caching, while maintaining bounded false accept rates.
stateful inference · speculative decoding · latency reduction · confidence gate · prefix cache
Persona-Trained Monte Carlo: Estimating Market-Outcome Distributions via Swarms of Persona-Conditioned Neural Policy Bots in a Limit Order Book
The authors propose Persona-Trained Monte Carlo (PTMC), a method for estimating market-outcome distributions by simulating interactions among persona-conditioned neural-policy trading bots in a limit order book. PTMC generates Monte Carlo samples through repeated simulations where bots, sharing a trained policy network but conditioned on heterogeneous persona parameters, interact in a continuous double auction. The method incorporates randomness through persona draws, action sampling, and optional exogenous shocks. The authors formalize the PTMC estimator, outline its convergence properties, and propose a four-level validation methodology. While not implemented, the framework contributes a formal estimator, cross-disciplinary design justification, and validation roadmap.
monte carlo · neural-policy · limit order book · persona-conditioned · double auction
Optimizer Memory Makes Shuffle Order a First-Order Source of Fine-Tuning Noise
The paper demonstrates that shuffle order introduces first-order fine-tuning noise in fixed-clock optimizers like AdamW, contrary to memoryless optimizers where such effects are second-order. By analyzing moment buffers and preconditioner states that advance with step index rather than learning-rate-scaled time, the authors derive a fit-free method to quantify this noise. Results show order-variance slopes of 1.83 for AdamW, 2.00 for fixed-β momentum, and 4.00 for SGD, with clock-matching restoring the regular exponent. The analysis provides error bars, attribution weights, and seed-budget criteria for fine-tuning comparisons.
fine-tuning noise · fixed-clock optimizers · momentum buffer · order-variance · gradient bracket
Improved Multi-Dimensional Forecasting for Swap Regret
The paper presents improved algorithms for multi-dimensional forecasting in swap regret minimization, targeting scenarios with multiple downstream agents of unknown objectives. For 2D outcome spaces, it introduces a polynomial-time algorithm achieving $\tilde{O}(\sqrt{kT})$ swap regret per agent, improving upon prior $\tilde{O}(kT^{5/8})$ bounds and exponential runtime. The method extends to higher dimensions with $\tilde{O}(\sqrt{T})$ regret, though runtime scales with dimension. For arbitrary dimension $d$, an $\tilde{O}(d\sqrt{kT})$ regret bound is shown, surpassing previous $\tilde{O}(T^{2/3})$ results that required behavioral assumptions.
swap regret · multi-dimensional forecasting · polynomial-time algorithm · downstream agents · regret minimization
The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning
The paper identifies training-inference mismatch as a key instability source in LLM reinforcement learning, where divergent probability distributions between training and inference engines create persistent off-policy effects. It proposes Monotonic Inference Policy Improvement (MIPI) as a new optimization objective that directly targets inference-side policy quality, implemented via a two-step Monotonic Inference Policy Update (MIPU) framework with sampler-referenced candidate generation and inference-gap-based acceptance. Experiments on two model scales demonstrate MIPU improves reasoning performance by 12-18% and training stability under high-mismatch conditions.
reinforcement learning · training-inference mismatch · off-policyness · policy optimization · large language models
Do Models Read What They Write? Causal Registers in Scratchpad Reasoning
The study demonstrates that models can causally use intermediate scratchpad states for computation, not merely as legible reasoning traces. Using a controlled state-tracking task with known transition rules, researchers edited internal representations of written states while keeping scratchpad text fixed, then measured downstream prediction accuracy. Qwen2.5-Coder-7B predicted correct next-phase bits 80-91% of the time when using edited states, significantly outperforming pretrained and final-answer-only controls. Results generalized across model families, suggesting scratchpad oversight should aim to train computationally integrated intermediate states rather than just transparent reasoning.
scratchpad reasoning · causal registers · process supervision · intermediate variables · state-tracking
Not All Objectives Are Born Equal: Priority-Constrained Descent for Hierarchical Multi-Objective Optimization
The paper introduces Priority-Constrained Descent (PCD), a gradient-based optimization framework for hierarchical multi-objective problems where primary and secondary objectives have unequal importance. PCD preserves primary objective descent direction while minimally distorting gradients to ensure secondary objective progress, controlled by a parameter τ ∈ [0,1]. The method provides scaling invariance and closed-form solutions for 2-3 objectives. Experiments in network compression, sparsity, and low-rank tasks demonstrate Pareto dominance over baselines, with τ offering interpretable trade-offs between objectives.
hierarchical optimization · gradient descent · multi-objective learning · network compression · pareto efficiency
Anti-Collapse Dynamics and the Emergence of Multi-Time-Scale Learning in Recurrent Neural Networks
The paper demonstrates that the temporal decay class (exponential vs. power-law) in recurrent neural networks emerges from coupled state-parameter dynamics, not fixed architecture. Through a coarse-grained stochastic process analysis, the authors prove the existence of an anti-collapsed regime with power-law forgetting when heavy-tailed parameter fluctuations balance training's bias toward short time scales. The spectral exponent β governs both time-scale spread and forgetting rate. Practical realization requires architectural/optimizer capacity to maintain broad time-scale spectra under heavy-tailed forcing, which enables long-range learning.
recurrent neural networks · long-range learning · power-law forgetting · spectral exponent · heavy-tailed fluctuations
Harvesting AI Computation at the Edge via Generic Approximation
The authors propose a framework to harvest underutilized AI computation resources at the edge by converting general-purpose tasks into neural network models via neural architecture search (NAS). A runtime scheduler offloads these approximate tasks to idle AI chips, alleviating the burden on general-purpose processors without compromising primary AI workloads. Experiments on a representative AIoT processor demonstrate substantial performance improvements across various edge processing tasks.
neural architecture search · edge computing · runtime scheduler · aiot processor · approximation techniques
A Mathematical Optimization Approach for Expert-Informed Bayesian Best Subset Selection
The paper introduces Expert-Implied Bayesian Best Subsets (EBBS), a method integrating domain-expert probability estimates of feature relevance into the mixed-integer optimization (MIO) framework for best subset selection. EBBS aggregates expert views using the Poisson binomial distribution, pairwise win rate, or normalized mean rank, incorporating them as log-odds penalty terms in the objective function. This approach reduces to classical Best Subsets when expert views are absent. The paper provides analytic derivations of the maximum a posteriori (MAP) formulation and characterizes its theoretical properties, with empirical results on synthetic and real datasets forthcoming.
mixed-integer optimization · best subset selection · maximum a posteriori · poisson binomial distribution · log-odds penalty
Reinforcement Learning in Super Mario Bros: Curriculum, Pedagogy, and Optimal Level Design in World 1-1
The study provides empirical validation for the pedagogical level design of Super Mario Bros World 1-1 by comparing reinforcement learning algorithms in discrete environment implementations. Four algorithms (Q-Learning, SARSA, Monte Carlo, DQN) were evaluated across three progressively complex level variants, with Monte Carlo achieving the highest win rate (94.9% ±1.5%) by optimizing intermediate rewards. Curriculum experiments permuting six level segments showed that the canonical ordering yields the fastest convergence, the highest learning efficiency, and zero catastrophic failures, which the authors take as evidence of the level's pedagogical structure.
reinforcement learning · curriculum learning · monte carlo methods · game design · pedagogical structure
The Calibrated Deepfake Trust Score (CDTS): Competence-Coupled Trust Degradation Across Deepfake Detectors
The paper introduces the Calibrated Deepfake Trust Score (CDTS), demonstrating a competence-calibration coupling where calibration degrades as detector discriminative competence decreases (Pearson r = -0.81 across 32 configurations). The study validates this across three architectures (convolutional networks and CLIP ViT) and four datasets, showing label-free competence estimation can flag calibration risks. CDTS improves routing performance (lower AURC) and addresses calibration inequity across demographic subgroups. The authors propose competence-aware trust scoring as a unifying framework.
deepfake detection · calibration · trust score · competence estimation · vision transformer
Chamber geometry and specification numbers of Boolean threshold functions
The paper establishes a geometric interpretation of Boolean threshold functions' specification numbers, linking them to chamber facets in a hyperplane arrangement. Using methods from combinatorial geometry and the resonance arrangement, it proves the average specification number is Θ(n), resolving a question by Gutekunst et al. The analysis extends to polynomial threshold functions and connects to threshold zonotopes and one-inclusion graphs. Operations preserving simpliciality and minimum specification number are characterized, including a resolution of a posed question about variable extensions.
boolean threshold functions · specification number · hyperplane arrangement · threshold zonotope · one-inclusion graph
Structured Proper Loss Geometries for Multiclass Classification: Theory and Controlled Empirical Evaluation
The paper analyzes three multiclass loss functions—CAPM (class-aware quadratic Bregman score), HPG (log-cosh ridge generator), and APMS (HPG with annealed margin penalty)—through theoretical and empirical lenses. It derives bounds for conditional regret, curvature, and gradient behavior, while proving exact penalty-range properties for APMS. Controlled experiments on Digits, Wisconsin breast cancer, and synthetic datasets under varied noise and imbalance conditions show comparable performance to cross-entropy on clean data, with limited gains in specific noisy-label scenarios. Theoretical results are rigorously established, but empirical evidence does not support general superiority claims.
proper scoring rules · multiclass classification · bregman divergence · label noise robustness · conditional regret bounds
Self-Supervised Calibration of Scientific Instruments Using Physical Consistency Constraints
The authors propose a physics-informed self-supervised framework for joint learning of detector calibration parameters and task-specific predictions from raw measurements, eliminating reliance on expert-labeled data. The method leverages physical consistency constraints to generate iterative pseudo-labels, reformulating calibration as a self-supervised optimization problem. Demonstrated on ionic charge-state determination in the VAMOS++ magnetic spectrometer, the approach achieves accurate reconstruction while inferring calibration coefficients that enable automated detector monitoring for gain drifts and aging effects.
self-supervised learning · instrument calibration · physical consistency · pseudo-labelling · detector monitoring
Prototype Latent World Model Replay for Class-Incremental Learning
The paper introduces Prototype Latent World Model Replay (LWM), a memory-free class-incremental learning framework that avoids catastrophic forgetting without storing raw exemplars. The method uses a frozen ImageNet-pretrained encoder to project images into a latent space, where old classes are represented as prototype-centered distributions with class-specific variances. During incremental learning, synthetic old-class samples are generated from these distributions and combined with new-class features to train a lightweight adapter and classifier, augmented by supervised contrastive loss for better separation. On Split CIFAR-100, LWM+Con improves LastAcc by 27.09%, 27.99%, and 26.14% absolute over fine-tuning for Inc5, Inc10, and Inc20 respectively, while maintaining AvgAcc above 45%. Ablations confirm the importance of stable latent-state replay and contrastive refinement.
class-incremental learning · latent replay · prototype distributions · contrastive loss · catastrophic forgetting
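The replay step in the summary above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each old class is modeled as a diagonal Gaussian around its prototype, the function names `fit_prototypes` and `sample_replay` are hypothetical, and random arrays stand in for the frozen encoder's outputs.

```python
import numpy as np

def fit_prototypes(features, labels):
    """Keep only a mean and per-dimension variance per class (no raw exemplars)."""
    protos = {}
    for c in np.unique(labels):
        x = features[labels == c]
        protos[int(c)] = (x.mean(axis=0), x.var(axis=0) + 1e-6)
    return protos

def sample_replay(protos, n_per_class, rng):
    """Draw synthetic old-class features from a diagonal Gaussian per prototype (assumed form)."""
    xs, ys = [], []
    for c, (mu, var) in protos.items():
        xs.append(rng.normal(mu, np.sqrt(var), size=(n_per_class, mu.shape[0])))
        ys.append(np.full(n_per_class, c))
    return np.concatenate(xs), np.concatenate(ys)

rng = np.random.default_rng(0)

# Stand-in for latent features of four old classes from a frozen encoder.
old_labels = np.repeat(np.arange(4), 50)
old_feats = rng.normal(size=(200, 16)) + old_labels[:, None]
protos = fit_prototypes(old_feats, old_labels)

# Incremental step: mix replayed old-class features with real new-class features.
replay_x, replay_y = sample_replay(protos, n_per_class=32, rng=rng)
new_x = rng.normal(size=(64, 16)) + 10.0
new_y = np.full(64, 4)
train_x = np.concatenate([replay_x, new_x])
train_y = np.concatenate([replay_y, new_y])
```

The combined `train_x` and `train_y` would then feed whatever adapter and classifier are being trained. The contrastive loss mentioned in the summary is not shown here.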
Interpretable Inverse Design of Metal-Organic Frameworks with Large Language Model Agents
The paper introduces LLM4MOF, a closed-loop framework using language-model agents for interpretable inverse design of metal-organic frameworks (MOFs). Agents autonomously propose and test hypotheses about metal nodes, linkers, and pore geometry, refining designs over ten iterations. The system evaluates candidates through simulation, focusing on top-performing structures for adsorption, separation, and electronic-structure tasks within 400 evaluations. LLM4MOF outperforms random search and genetic algorithms, achieving cost-effective ($1 per campaign) and simulation-grounded design without per-objective model training.
metal-organic frameworks · inverse design · language-model agents · simulation-grounded · autonomous iterations
How Much Due Diligence Before You Bid? Learning in Intractable Takeover Auctions
The paper contributes a computational model for studying due diligence in takeover auctions, demonstrating that self-play methods can effectively learn bidding strategies. Using a game-theoretic framework where bidders acquire costly private signals, the authors show that optimal diligence is finite, decreases with cost, and is further reduced under competition. They compare lightweight self-play (trained on a laptop) against specialized solvers, finding general methods competitive in intractable regimes while exact methods dominate smaller instances. Results provide empirical evidence for practical AI in complex auctions and quantify the economic value of information acquisition.
self-play learning · takeover auctions · due diligence · private signals · game-theoretic modeling
Closing the Activation-Cone Blind Spot: Response-Time Probing and Unified Defense
The paper introduces response-time probing, a novel defense against prefilling attacks in large language models, addressing a structural blind spot in activation-cone-based defenses. The method employs a linear probe on hidden states at initial generated tokens, combined with a halt mechanism, achieving AUROC 0.97-1.00 across seven instruction-tuned models (7-31B). This approach reduces prefilling attack success to 0/40 with 0% benign false positives, outperforming Llama Guard 3. When composed with AlphaSteer's null-space steering, it achieves defense success rates of 0.983 on Mistral and 0.994 on Llama. Diverse negative training sets further reduce probe false positives from 80-100% to near zero.
response-time probing · prefilling attacks · activation-cone · null-space steering · linear probe
Randomized neural operator for parametric PDEs with fast training and conformal uncertainty quantification
The paper introduces PCA-RaNN, a randomized latent neural operator for parametric PDEs that combines PCA-based dimensionality reduction with fixed random features and a closed-form least-squares readout. This approach reformulates latent operator learning as fixed-feature linear regression, reducing training time by 1-3 orders of magnitude while maintaining competitive accuracy. The method incorporates energy-matched scaling, BFGS refinement, and ensemble averaging for variance reduction. Evaluated on Burgers, Darcy, Navier-Stokes, and backward heat equation benchmarks, PCA-RaNN demonstrates favorable speed-accuracy trade-offs against baselines. It supports split-conformal prediction intervals and enables rapid online adaptation via recursive least squares without retraining hidden features.
parametric pdes · randomized neural operator · pca-based dimensionality reduction · split-conformal prediction · recursive least squares
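To illustrate the online-adaptation idea in the summary above, here is a minimal sketch of a textbook recursive least squares (RLS) update on top of fixed random features. The feature map, sizes, and class name are assumptions for illustration only, and the PCA step and other parts of the paper's method are omitted. With the inverse Gram matrix initialized to I/ridge and zero initial weights, RLS reproduces the batch ridge solution, which the last line computes for comparison.

```python
import numpy as np

class RecursiveLeastSquares:
    """Textbook RLS for a linear readout; the hidden features stay fixed."""

    def __init__(self, n_features, ridge=1e-2):
        self.w = np.zeros(n_features)
        self.P = np.eye(n_features) / ridge  # inverse of the regularized Gram matrix

    def update(self, phi, y):
        Pphi = self.P @ phi
        gain = Pphi / (1.0 + phi @ Pphi)
        self.w = self.w + gain * (y - phi @ self.w)  # correct by the prediction error
        self.P = self.P - np.outer(gain, Pphi)       # rank-one downdate of P

rng = np.random.default_rng(0)

# Fixed random hidden layer (never retrained); sizes are illustrative.
W_hidden = rng.normal(size=(3, 8))
b_hidden = rng.normal(size=8)
X = rng.normal(size=(200, 3))
Phi = np.tanh(X @ W_hidden + b_hidden)
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

# Stream the samples one at a time.
rls = RecursiveLeastSquares(n_features=8, ridge=1e-2)
for phi_t, y_t in zip(Phi, y):
    rls.update(phi_t, y_t)

# Batch ridge solution on the same data, for comparison.
w_batch = np.linalg.solve(Phi.T @ Phi + 1e-2 * np.eye(8), Phi.T @ y)
```

Each update costs O(m²) for m features, so new samples can be absorbed without refitting from scratch.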
Fractional Stochastic Neural Networks
The paper introduces fractional stochastic neural networks with residual dynamics governed by fractional Brownian motion. A discrete stochastic maximum principle yields adjoint recursion for training, while projected samplewise stochastic gradient descent achieves mean-square convergence for deterministic parameters. Experiments demonstrate superior performance in long-memory time series generation (vs. Brownian/deterministic baselines) and robustness in image classification under structured perturbations, alongside closed-form convergence tests and noisy regression with uncertainty quantification.
fractional brownian motion · stochastic maximum principle · adjoint recursion · long memory recovery · structured perturbations
Fourier Neural Operators with Least-Squares Readout Refit for Learning Random Obstacle-to-Solution Maps
The paper introduces a least-squares readout refit (FNO-LS) for Fourier neural operators to improve learning of random obstacle-to-solution maps from elliptic variational inequalities. The method freezes the trained FNO backbone and recomputes the final affine readout via linear least-squares over all training samples and grid points, optimizing the readout while preserving nonlinear features. Evaluated against DeepONet variants and vanilla FNO on obstacle ensembles, FNO-LS achieves superior performance in field accuracy, contact-set recovery, and obstacle-violation metrics, particularly for high-amplitude obstacles with complex contact geometry. The refit provides a low-cost post-training enhancement when the FNO backbone is informative but not fully converged.
fourier neural operator · operator learning · least-squares refit · elliptic variational inequalities · contact-set recovery
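The readout refit in the summary above amounts to an ordinary least-squares problem over every (sample, grid point) pair. Below is a minimal sketch under stated assumptions: the backbone's last-layer features are taken to be an array of shape (samples, grid points, channels), random numbers stand in for a trained FNO, and `refit_affine_readout` is a hypothetical helper name rather than anything from the paper's code.

```python
import numpy as np

def refit_affine_readout(features, targets):
    """Solve for the final affine map by least squares with the backbone frozen.

    features: (n_samples, n_grid, n_channels) frozen backbone outputs
    targets:  (n_samples, n_grid) reference solution values
    """
    n_channels = features.shape[-1]
    A = features.reshape(-1, n_channels)            # one row per (sample, grid point)
    A = np.hstack([A, np.ones((A.shape[0], 1))])    # extra column for the bias
    coef, *_ = np.linalg.lstsq(A, targets.reshape(-1), rcond=None)
    return coef[:-1], coef[-1]

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 32, 4))                 # stand-in for frozen features
true_w, true_b = np.array([0.5, -1.0, 2.0, 0.1]), 0.3
targets = feats @ true_w + true_b                   # noise-free synthetic targets

w, b = refit_affine_readout(feats, targets)
pred = feats @ w + b
```

Because only the last linear layer changes, the refit is a single linear solve with no gradient steps, which matches the summary's description of it as a low-cost post-training step.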
Temporal Posed and Spontaneous Gesture Recognition from Electromyography in the Rock-Paper-Scissors Game
This work investigates temporal electromyography (EMG) characteristics for gesture recognition in Rock-Paper-Scissors (RPS), focusing on posed versus spontaneous gestures and inter-player dynamics. Twenty-four participants played RPS in dyads while forearm EMG was recorded. EMG onsets were detected 800 ms before visible gesture onset, peaking at 342 ms prior. Posed gesture recognition achieved 63.4% accuracy, while spontaneous gestures yielded 53.6%. Analysis of the opponent's EMG detected gestures at 65% accuracy, peaking 2082 ms after visual onset, indicating reaction-based gesture inference. Results demonstrate EMG's predictive advantage for rapid intent recognition, with implications for human-computer interaction and assistive technologies.
electromyography · gesture recognition · rock-paper-scissors · temporal analysis · intent recognition
Can Machines Really See Objects in Images? A Study Based on Syntactic Distance and Visual Self-Referential Instances
The study investigates the structural limitations of vision models in recognizing global semantics by introducing syntactic distance, a metric quantifying class separability based on symmetry of operations. A visual self-referential task is constructed using maximum-variance binary noise, where positive and negative samples differ only in global semantics but have zero syntactic distance, eliminating local statistical cues. Experiments on ResNets and Vision Transformers demonstrate a phase-transition phenomenon, with accuracy collapsing to random guessing beyond a critical image scale, unaffected by larger training sets or model size. Globally attentive ViTs exhibit earlier collapse, revealing a capability boundary in current architectures for global-concept tasks.
syntactic distance · visual self-referential · phase-transition · global semantics · vision transformers
Self-Organized Conformal Prediction: Reducing Regional Coverage Gaps with Unsupervised Group Discovery
Self-Organized Conformal Prediction (SOCP) introduces a calibration scheme that reduces regional coverage gaps by discovering input-space groups via Self-Organizing Maps (SOMs) and retrieving local calibration buffers from best-matching unit cells or fixed grid neighborhoods. The method maintains exact validity for BMU-cell retrieval and approximate validity for neighborhood buffers, with a split-routed extension ensuring fixed retrieved-set validity. Evaluated on eight regression and classification benchmarks, SOCP reduces weighted regional coverage gaps on 7/8 datasets (mean paired change −7.1%) with a 6.2% mean prediction-set size increase, demonstrating efficacy without supervised partitions or predictor retraining.
conformal prediction · self-organizing map · regional coverage · nonconformity score · quantile regression
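The cell-wise calibration idea in the summary above can be sketched with split conformal prediction applied separately per cell. This is a simplified illustration, not the paper's algorithm: a fixed two-entry codebook stands in for a trained Self-Organizing Map, the predictor is the constant zero, and the helper names are hypothetical. Only best-matching-unit retrieval is shown; the neighborhood buffers are not.

```python
import numpy as np

def best_matching_unit(x, codebook):
    """Index of the nearest codebook vector for each row of x."""
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return np.argmin(d2, axis=1)

def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:                      # too few calibration points in this cell
        return np.inf
    return np.sort(scores)[k - 1]

def cellwise_radii(x_cal, scores_cal, codebook, alpha):
    """One conformal radius per cell, using only that cell's calibration scores."""
    cells = best_matching_unit(x_cal, codebook)
    return np.array([conformal_quantile(scores_cal[cells == j], alpha)
                     for j in range(len(codebook))])

rng = np.random.default_rng(0)
codebook = np.array([[0.5], [1.5]])   # assumed stand-in for trained SOM units

def make_data(n):
    # Noise is ten times larger in the right half of the input space.
    x = rng.uniform(0.0, 2.0, size=(n, 1))
    noise_scale = np.where(x[:, 0] < 1.0, 0.1, 1.0)
    return x, noise_scale * rng.normal(size=n)

x_cal, y_cal = make_data(1000)
x_test, y_test = make_data(1000)
alpha = 0.1

# Nonconformity score: absolute residual of the (constant zero) predictor.
radii = cellwise_radii(x_cal, np.abs(y_cal), codebook, alpha)
test_cells = best_matching_unit(x_test, codebook)
covered = np.abs(y_test) <= radii[test_cells]
coverage_by_cell = [covered[test_cells == j].mean() for j in range(len(codebook))]
```

A single global radius would over-cover the quiet region and under-cover the noisy one. Per-cell radii narrow that gap, at the cost of fewer calibration points per threshold.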
Exploring the Cryptographic Limits of Transformer Networks
This work establishes constructive upper bounds on transformer networks' cryptographic capabilities by mapping cryptographic functions to transformer architectures. The authors generate threshold circuits for Keccak functions, Merkle-Damgård constructions, and Merkle Trees, then propose two architectural mappings: no-attention and tokens-as-gates. Verified scaling laws for circuit width and depth are derived, providing structural guarantees for transformer computational capacity and enabling principled capability evaluations of AI systems.
threshold circuits · keccak functions · merkle-damgard · transformer architectures · computational capacity
Interventional Flow Matching: Prospective Dose-Response Forecasting with Velocity-Field Jacobian Regularization
The paper introduces Interventional Flow Matching (IFM), a continuous-time generative framework for prospective dose-response forecasting in glucose management. IFM conditions a flow-matching velocity field on patient history and planned treatments, using Jacobian regularization to enforce physiologically plausible responses without mechanistic ODEs. The method penalizes velocity-field Jacobians with respect to smoothed treatment drivers, ensuring signed, bounded sensitivities (e.g., insulin lowers glucose). Evaluated on a simulated UVA/Padova type 1 diabetes cohort, IFM achieves optimal balance between observational RMSE and interventional metrics while maintaining physiological correctness and directional consistency.
flow matching · dose-response forecasting · jacobian regularization · continuous-time generative models · physiological trajectory prediction
Dynamic Parsing and Updating Natural Language Specification using VLMs for Robust Vision-Language Tracking
The paper proposes a language dependency parsing mechanism for vision-language tracking that dynamically updates textual descriptions to mitigate semantic-visual mismatches. The method leverages Qwen-VL's cross-modal understanding to perform component-aware updates of target objects, semantic concepts, and background context. Integrated into a baseline framework, it achieves superior performance on TNL2K, LaSOT, TNLLT, and OTB-LANG benchmarks. Source code and pre-trained models will be publicly released.
vision-language tracking · language dependency parsing · semantic-visual mismatch · qwen-vl · component-aware updates
Adaptive Financial Transformer with Regime-Gated Attention for Stock Return Prediction
The Adaptive Financial Transformer (AFT) introduces regime-gated attention for stock return prediction in non-stationary markets. The model employs a Market Regime Encoder, Adaptive Gate Network, and Adaptive Financial Context module to dynamically adjust self-attention based on 95 financial features grouped into 11 semantic categories. It addresses sequence alignment and backtesting issues while optimizing a composite objective of prediction error, directional accuracy, and Sharpe ratio. Evaluations show competitive performance with 15.2% reduced complexity and improved parameter efficiency compared to Transformer baselines.
adaptive financial transformer · regime-gated attention · market regime encoder · non-stationary markets · composite objective
Reliability, Faithfulness, and the Limits of Post-hoc Explanations of Opaque Scientific Models
The article critiques the adequacy of post-hoc explanation methods for interpreting scientific machine learning models, arguing that reliability and faithfulness alone cannot validate structural claims about the underlying phenomenon. While reliability ensures model predictions align with observed outcomes and faithfulness ensures explanations match the model's behavior, neither verifies that the model's mechanisms mirror the phenomenon's actual structure. The authors contend that such explanations can only generate candidate hypotheses requiring external corroboration, not definitive structural insights.
post-hoc explanations · model reliability · explanation faithfulness · scientific machine learning · structural claims
Two kinds of robustness are not the same: disentangling fault tolerance and low-SNR robustness in multi-domain event detection on real data
The study disentangles fault tolerance and low-SNR robustness in multi-domain event detection, demonstrating that sensor-dropout training dominates robustness gains over architectural redundancy. Using a unified benchmark from three real-world datasets (Hi-net seismic, Utah FORGE DAS, MAFAULDA vibration), the authors evaluate CEPHALON (a fault-tolerant detector) against standard models (1D CNN, TCN, compact Transformer) under sensor loss and additive noise. While all models achieve near-perfect AUC (~0.99) on clean data, CEPHALON excels in low-SNR conditions (AUC 0.939 vs. 0.532-0.572 at -2.5 dB), with ablation showing sensor-dropout training as the primary factor. The pipeline is released for reproducibility.
event detection · sensor-dropout · low-snr robustness · fault tolerance · multi-domain benchmark
AMR: Adaptive Modality Routing for Multimodal Polyglot Speaker Identification
The paper introduces Adaptive Modality Routing (AMR), a novel modality fusion module for multimodal polyglot speaker identification addressing missing modalities and language mismatch. AMR dynamically assesses input quality using modality adapters for audio (W2V-BERT 2.0) and face embeddings (IResNet-18), followed by a trainable router estimating dynamic modality weights for logit aggregation. Training employs a modality-aware strategy with four sample pair types and KL divergence supervision. Evaluated on POLY-SIM 2026, AMR achieves 99.93% (English multimodal), 100.00% (Urdu multimodal), 97.50% (English audio-only), and 98.83% (Urdu audio-only) accuracy, averaging 99.07% and outperforming FOP by 32.73%.
adaptive modality routing · modality fusion · polyglot speaker identification · modality adapters · kl divergence
Sample Complexity of Scientific Discovery: PAC Learnability of Compositional Function Trees
The paper establishes PAC-learnability guarantees for compositional function trees in symbolic regression, showing that generalization error depends polynomially on depth d and Lipschitz constants rather than exponentially on symbolic complexity. Using Rademacher complexity analysis, the authors prove risk bounds scaling as O(L^d/√n) for trees built from K base operators with arity b, when K,b=O(1). Theoretical results are validated empirically via differentiable operator trees trained on synthetic physics-like targets, demonstrating correlation between generalization gap and the predicted (L^d)/√n complexity term.
pac-learning · rademacher complexity · symbolic regression · compositional functions · lipschitz constants
Gradient boosting with vector-valued leafs
The paper extends gradient boosting to vector-valued objective functions, addressing limitations in existing frameworks that either update vector elements sequentially or use diagonal Hessian approximations. The proposed method generalizes gradient boosting for vector outputs, enabling efficient optimization of multivariate objectives like multinomial logistic regression. A key contribution is a novel algorithm compatible with histogram-based decision trees, maintaining computational efficiency while handling full vector updates. The approach theoretically supports arbitrary vector-valued loss functions and demonstrates practical applicability to multi-class classification scenarios.
gradient boosting · vector-valued objective · multinomial logistic regression · histogram-based trees · hessian approximation
Deciphering Region-Level Signatures from Latency Measurements in LEO Satellite Internet
The paper proposes a hierarchical framework for analyzing region-level latency signatures in LEO satellite Internet using Starlink RTT measurements from the LENS dataset. The method transforms raw RTT sequences into multi-scale statistical features for cross-region comparison, identifying infrastructure availability and dish-to-PoP distance as key deployment factors. Results show 83% accuracy in short-term region classification using XGBoost, with minimum RTT as the most discriminative feature, though performance degrades over longer periods due to limited temporal generalization.
leo satellite internet · round-trip time · hierarchical framework · xgboost · temporal generalization
📰 Industry Media (1)
Agriculture is ready for AI, but its data isn’t
Agricultural AI systems demonstrate potential for yield improvement (26%), water reduction (41%), and chemical optimization (33%), but require robust data foundations to avoid misleading outputs. The study identifies key challenges: disparate IoT sensor data, heterogeneous field conditions (soil variation, GPS coordinates), and dynamic external inputs (weather, market data). Effective implementation necessitates unified data models with governance frameworks, exemplified by Reltio's context intelligence layer integrating entities, relationships, and business rules. Without such infrastructure, precision agriculture risks 'garbage in, garbage out' scenarios with operational and compliance consequences.
precision agriculture · iot sensor fusion · context intelligence layer · data governance · yield prediction
Generated automatically at 2026-06-30 21:32 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.
