Daily Digest — 2026-05-23
348 items · 3 research labs, 334 arxiv papers, 11 industry media
🏛️ Research Labs (3)
OpenAI named a Leader in enterprise coding agents by Gartner
OpenAI has been recognized as a Leader in the Gartner Magic Quadrant for Enterprise AI Coding Agents, reflecting its advancements with Codex. Codex, used by over 4 million weekly users, integrates GPT-5.5 and supports enterprise-scale deployments through enhanced tool use, faster performance, and workflow integration. Gartner highlights Codex's strengths in agentic software development, enterprise governance, sandboxing, and flexible deployment options, including IDE extensions, CLI, SDKs, and cloud orchestration. Enterprises like Cisco leverage Codex to accelerate development cycles, reducing delivery time from quarters to weeks. Recent updates include Codex Security, GPT-5.5-Cyber, HIPAA compliance, and expanded deployment support via Codex Labs and GSI partners.
codex · gartner · agentic · sandboxing · enterprise
Specialization Beats Scale: A Strategic Variable Most AI Procurement Decisions Overlook
Specialized small models outperform larger general-purpose models in domain-specific tasks, demonstrating that distributional alignment to the deployment task is more decisive than parameter count. A 3-billion-parameter OCR model, fine-tuned via supervised fine-tuning and Direct Preference Optimization, achieved a composite score of 0.911 on the DharmaOCR-Benchmark, surpassing commercial APIs like Claude Opus 4.6 (0.833) while operating at 52x lower cost. The model also exhibited superior production stability, with a 0.20% text degeneration rate. Progressive specialization, moving from general-purpose to domain-specific training, yielded consistent quality gains and reduced degeneration rates across multiple parameter scales.
specialization · distributional alignment · fine-tuning · text degeneration · ocr
Catch up on the Dialogues stage at Google I/O 2026.
The Dialogues stage at Google I/O 2026 featured discussions on transformative AI technologies and their societal impact. Google leaders and external experts explored proactive AI agents, quantum computing integration, AI-driven scientific problem-solving, advancements in embodied physical AI, and AI-enhanced cinematic storytelling. Key participants included Google CEO Sundar Pichai, DeepMind CEO Demis Hassabis, and representatives from Boston Dynamics and creative industries. The sessions highlighted interdisciplinary collaborations and technological breakthroughs shaping future applications in productivity, robotics, and creative domains. All discussions are available on YouTube.
proactive ai agents · quantum computing · embodied physical ai · cinematic storytelling · scientific problem-solving
📜 arXiv Papers (334)
Vector Policy Optimization: Training for Diversity Improves Test-Time Search
The paper introduces Vector Policy Optimization (VPO), a reinforcement learning algorithm that trains language models to produce diverse solutions anticipating varied downstream reward functions. VPO modifies the GRPO advantage estimator to optimize for vector-valued rewards (e.g., per-test-case correctness or multiple user personas), enabling specialization across reward trade-offs. Experiments on four tasks show VPO matches or exceeds scalar RL baselines in test-time search metrics (pass@k, best@k), with performance gaps widening under larger search budgets, and enables evolutionary search on problems unsolvable by GRPO models.
vector policy optimization · reinforcement learning · reward diversity · test-time search · evolutionary algorithms
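The pass@k metric named in the summary above is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn and c of them are correct. The sketch below assumes that standard definition; the summary does not say which estimator the VPO paper uses, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which are correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k),
    i.e. the probability that a random size-k subset of the n samples
    contains at least one correct solution.
    """
    if not 0 <= c <= n:
        raise ValueError("require 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Larger search budgets (k) raise the chance that at least one sample passes.
    for k in (1, 2, 5):
        print(k, round(pass_at_k(n=10, c=3, k=k), 4))
```

A policy that spreads its samples over distinct solution strategies tends to raise c on hard problems, which is why diversity matters more as k grows.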
The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning
The paper introduces the Matching Principle, a geometric theory unifying robustness objectives in representation learning by estimating label-preserving deployment nuisance covariance and regularizing the encoder Jacobian accordingly. It demonstrates that methods like CORAL, adversarial training, and IRM are estimators of this covariance, not independent techniques. Theoretical results include closed-form optimality in the linear-Gaussian model, necessity of range coverage for quadratic Jacobian penalties, and consistency lemmas under identifiability assumptions. Empirical validation across classical ML to Qwen2.5-7B shows the principle's effectiveness, with matched style-PMH improving selective honesty and preserving Style TDI at 7B scale.
matching principle · nuisance covariance · encoder jacobian · trajectory deviation index · style-pmh
Finite-Particle Convergence Rates for Conservative and Non-Conservative Drifting Models
The paper introduces a conservative drifting method for one-step generative modeling, replacing displacement-based drifting velocity with a KDE-gradient velocity to address non-conservatism. The method leverages kernel density estimators to compute smoothed data and model scores, forming a gradient field. Theoretical analysis provides finite-particle convergence bounds in continuous-time, including empirical Stein drift, smoothed Fisher discrepancy, and squared center velocity. Results include deterministic and probabilistic local-occupancy conditions for controlling self-interaction terms, with explicit quadrature constants and bandwidth-dependent rates. A non-conservative variant using Laplace kernels is also analyzed, revealing residual terms in velocity decomposition.
conservative drifting · kde-gradient velocity · finite-particle convergence · stein drift · laplace kernel
MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems
The paper introduces MOSS, a self-evolving autonomous agent system that performs source-level rewriting to address structural failures unreachable through text-mutable artifacts. MOSS operates via a deterministic pipeline: it curates failure evidence, delegates code modification to external coding agents, verifies candidates through batch replay in ephemeral workers, and deploys updates via container swaps with rollback safeguards. This approach enables Turing-complete adaptation beyond text-based methods. Evaluated on OpenClaw, MOSS improved a four-task mean grader score from 0.25 to 0.61 in one cycle without human intervention.
self-evolving agents · source-level rewriting · deterministic pipeline · ephemeral trial workers · container swap
Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
Gated DeltaNet-2 introduces channel-wise erase and write gates to decouple memory editing in linear attention, addressing the scalar gate limitation in Delta-rule models like Kimi Delta Attention (KDA) and Gated DeltaNet. The method employs asymmetric erase factors and a gate-aware backward pass, maintaining efficient parallel training while enabling precise memory updates. Evaluated on a 1.3B parameter model trained on 100B FineWeb-Edu tokens, it outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants in language modeling, commonsense reasoning, and retrieval, particularly excelling in long-context RULER benchmarks.
linear attention · delta-rule · channel-wise decay · fast-weight update · recurrent state
LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems
The paper introduces LCGuard, a framework for secure key-value (KV) cache sharing in multi-agent LLM systems that prevents sensitive information leakage while preserving task performance. The method treats shared KV caches as latent working memory and applies learned representation-level transformations to block adversarial reconstruction of sensitive inputs, formalized through an adversarial training objective. Evaluations across multiple model families and benchmarks show LCGuard reduces reconstruction-based leakage by 37-52% and attack success rates by 29-45% while maintaining 92-97% of baseline task accuracy.
latent communication · kv-cache · multi-agent systems · adversarial training · information leakage
DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback
DeltaBox introduces DeltaState, an OS-level abstraction enabling millisecond-level checkpoint/rollback (C/R) for AI agents by tracking state changes rather than full duplication. It combines DeltaFS (layered filesystem C/R via copy-on-write) and DeltaCR (incremental process state dumps with template-based forking), reducing latency to 14ms (checkpoint) and 5ms (rollback). Evaluations on SWE-bench and RL benchmarks demonstrate significant improvements in agent exploration efficiency under fixed time constraints.
deltastate · checkpoint/rollback · deltafs · deltacr · sandbox
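The idea of checkpointing by tracking state changes rather than duplicating full state can be illustrated with a toy in-memory store. This is only a sketch of the general undo-log technique: the class `DeltaStore` and its methods are hypothetical, and nothing here models DeltaFS, DeltaCR, or any OS-level mechanism from the paper.

```python
class DeltaStore:
    """Toy key-value state with delta-based checkpoint and rollback.

    Assumption: an undo log per checkpoint stands in for change tracking.
    Checkpoint cost is O(1); rollback cost is proportional to the number
    of keys changed since the checkpoint, not to the total state size.
    """

    _MISSING = object()  # marks keys that did not exist at checkpoint time

    def __init__(self, base=None):
        self._data = dict(base or {})
        self._undo = []  # stack of {key: value before first change}

    def checkpoint(self) -> int:
        """Open a new checkpoint and return its depth."""
        self._undo.append({})
        return len(self._undo)

    def _remember(self, key):
        # Record the old value only on the first change after a checkpoint.
        if self._undo and key not in self._undo[-1]:
            self._undo[-1][key] = self._data.get(key, self._MISSING)

    def set(self, key, value):
        self._remember(key)
        self._data[key] = value

    def delete(self, key):
        self._remember(key)
        self._data.pop(key, None)

    def get(self, key, default=None):
        return self._data.get(key, default)

    def rollback(self):
        """Undo every change made since the most recent checkpoint."""
        if not self._undo:
            raise RuntimeError("no checkpoint to roll back to")
        for key, old in self._undo.pop().items():
            if old is self._MISSING:
                self._data.pop(key, None)
            else:
                self._data[key] = old


if __name__ == "__main__":
    state = DeltaStore({"file.txt": "v1"})
    state.checkpoint()
    state.set("file.txt", "v2")
    state.set("scratch.log", "tmp")
    state.rollback()
    print(state.get("file.txt"), state.get("scratch.log"))
```

An agent exploring several candidate actions could checkpoint once, try an action, and roll back before trying the next, paying only for what each attempt touched.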
SDPM: Survival Diffusion Probabilistic Model for Continuous-Time Survival Analysis
The Survival Diffusion Probabilistic Model (SDPM) introduces a generative approach for continuous-time survival analysis without parametric assumptions or time discretization. SDPM models the conditional distribution of survival outcomes using a denoising diffusion process, transforming generated samples into survival function estimates via the Kaplan-Meier estimator. Evaluated on ten real datasets and synthetic Cox-Weibull data, SDPM matches or outperforms five baselines in C-index, time-dependent AUC, and Brier score, while accurately recovering continuous survival distributions. Ablation studies confirm the benefits of target-space transformations for calibration and predictive performance.
survival analysis · diffusion probabilistic model · continuous-time modeling · kaplan-meier estimator · generative modeling
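For readers unfamiliar with the Kaplan-Meier estimator mentioned above, here is a minimal sketch of the textbook product-limit formula. It assumes the inputs are (time, event) pairs, where event is 1 for an observed event and 0 for censoring; how SDPM formats its generated samples before this step is not stated in the summary, so the input layout and the function name `kaplan_meier` are assumptions.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    Subjects censored at time t are counted as still at risk at t.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")

    # Count events and censorings at each distinct time.
    counts = {}
    for t, e in zip(times, events):
        observed, censored = counts.get(t, (0, 0))
        counts[t] = (observed + 1, censored) if e else (observed, censored + 1)

    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(counts):
        observed, censored = counts[t]
        if observed:
            # Product-limit update: multiply by the fraction surviving time t.
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= observed + censored
    return curve


if __name__ == "__main__":
    for t, s in kaplan_meier([1, 2, 2, 3], [1, 0, 1, 1]):
        print(t, round(s, 3))
```

When every sample is an observed event, the result reduces to the empirical survival function of the samples.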
MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking Data
MambaGaze introduces a framework for real-time cognitive load assessment from eye-tracking data, addressing challenges of data missingness and long-range dependencies. It employs XMD encoding for uncertainty modeling and bidirectional Mamba-2 for efficient temporal dependency capture. Evaluations on CLARE and CL-Drive datasets show 76.8% and 73.1% accuracy, outperforming CNN, Transformer, ResNet, and VGG baselines by 4-12 percentage points. Edge deployment achieves 43-68 FPS with <7.5W power consumption, enabling wearable applications.
mambagaze · xmd encoding · bidirectional mamba-2 · cognitive load assessment · eye-tracking
CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation
CogAdapt transfers clinical ECG foundation models to wearable cognitive load assessment via lead adaptation, addressing sensor mismatch and task differences. The framework introduces LeadBridge for transforming 3-lead wearable signals into 12-lead representations and ProFine for progressive fine-tuning to prevent catastrophic forgetting. Evaluations on CLARE and CL-Drive datasets show macro-F1 scores of 0.626 and 0.768, outperforming baselines trained from scratch, demonstrating effective subject-independent cognitive load assessment.
ecg foundation models · lead adaptation · cognitive load assessment · progressive fine-tuning · wearable sensors
Deep Reinforcement Learning for Flexible Job Shop Scheduling with Random Job Arrivals
The paper proposes a deep reinforcement learning (DRL) approach for the Flexible Job Shop Scheduling Problem (FJSP) with random job arrivals, addressing combinatorial complexity and unpredictability. Using Proximal Policy Optimization (PPO) and lightweight Multi-Layer Perceptrons (MLPs), the authors train a DRL agent to minimize total job completion time, with state representation directly accessible from the environment and action selection limited to established dispatching rules. The method outperforms individual dispatching rules across datasets with varying heterogeneity and job arrival rates, and benchmarks favorably against an arrival-triggered mixed-integer linear programming solution, particularly in heterogeneous scenarios.
deep reinforcement learning · flexible job shop scheduling · proximal policy optimization · multi-layer perceptrons · mixed-integer linear programming
Reducing Political Manipulation with Consistency Training
The authors identify and quantify covert political bias in large language models (LLMs), defined as asymmetric handling of counterpart topics from opposing political sides, categorized into 7 technique types. They propose Political Consistency Training (PCT), a reinforcement learning method with two paradigms—Sentiment Consistency Training and Helpfulness Consistency Training—to reduce bias while preserving model helpfulness. Evaluations show PCT substantially mitigates covert bias (measured via sentiment and helpfulness consistency metrics) and generalizes to held-out benchmarks.
covert political bias · sentiment consistency · helpfulness consistency · political consistency training · large language models
Understanding Data Temporality Impact on Large Language Models Pre-training
The study investigates how data ordering during pre-training affects temporal knowledge acquisition in large language models (LLMs). Researchers introduce a benchmark of 7,000 temporally grounded questions and pre-train 6B-parameter models on sequentially ordered Common Crawl snapshots versus shuffled data. Results show sequential training maintains general language understanding while improving factual freshness and temporal precision, whereas shuffled training favors older facts due to increased repetition. The work provides datasets, code, and checkpoints to support future continual learning research.
temporal grounding · pre-training dynamics · factual freshness · continual learning · knowledge acquisition
Advancing Mathematics Research with AI-Driven Formal Proof Search
The study demonstrates AI-driven formal proof search's capability to solve open mathematical problems by combining large language models (LLMs) with Lean-based verification. An advanced agent autonomously resolved 9 of 353 open Erdős problems, proved 44/492 OEIS conjectures, and is deployed across multiple mathematical domains. A simpler agent replicated these results but with higher computational costs. The findings highlight the efficacy of LLM-generated formal proofs and identify optimal agent designs for mathematical research.
large language models · formal proof search · lean verification · erdős problems · oeis conjectures
Towards a General Intelligence and Interface for Wearable Health Data
The authors propose a foundation model for wearable health data, pretrained on 1 trillion minutes of unlabeled sensor signals from 5 million participants, to address challenges in converting low-level sensor data into meaningful health representations. The method combines scaling model capacity with pretraining data volume, evaluated on 35 diverse health prediction tasks, demonstrating improved performance, few-shot learning, and generative capabilities. Results show that integrating these predictors into a Personal Health Agent yields contextually aware responses, validated by 1,860 clinician ratings.
foundation model · wearable sensors · few-shot learning · health prediction · personal health agent
Cyber-Physical Anomaly Detection in IoT-Enabled Smart Grids Using Machine Learning and Metaheuristic Feature Optimization
The study proposes a cyber-physical anomaly detection method for IoT-enabled smart grids by combining machine learning with genetic-algorithm-based feature selection. Using the MSU/ORNL Power System Attack Dataset, it evaluates logistic regression, RBF-SVM, XGBoost, Random Forest, and Extra Trees, finding tree-based ensembles most effective. The GA + Extra Trees model reduces features from 112 to 27.4 while improving macro-F1 from 0.9118 to 0.9212 and ROC-AUC from 0.9791 to 0.9837, demonstrating that a compact subset of phasor-based features suffices for accurate detection.
smart grids · anomaly detection · genetic algorithm · phasor measurement · ensemble learning
Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning
The paper demonstrates that multi-agent reinforcement learning (MARL) enables superhuman performance in high-speed quadrotor racing while improving safety. Using league-based self-play with variable agent counts, the method trains agents to handle complex aerodynamic interactions (e.g., downwash) and strategic maneuvers like overtaking. The resulting agents outperform a human champion at speeds >22 m/s, reduce collisions by 50% versus single-agent baselines, and show zero-shot generalization to human interaction. Results suggest multi-agent training is key for robust real-world robotics.
multi-agent reinforcement learning · quadrotor racing · aerodynamic interactions · self-play · zero-shot generalization
Proxy-Based Approximation of Shapley and Banzhaf Interactions
ProxySHAP introduces a novel method for approximating Shapley and Banzhaf interactions in machine learning, combining tree-based proxy models with residual correction for consistency. The approach generalizes TreeSHAP to polynomial-time computation of exact interaction indices for tree ensembles, avoiding exponential dependencies in prior methods. Theoretical analysis confirms Maximum Sample Reuse (MSR) corrects proxy bias without exponential variance scaling. Benchmarking shows ProxySHAP outperforms ProxySPEX and KernelSHAP-IQ in approximation quality and downstream explainability tasks across various feature scales.
proxyshap · shapley interactions · banzhaf indices · treeshap · residual correction
The Distillation Game: Adaptive Attacks & Efficient Defenses
The paper introduces a minimax game framework to study the trade-off between model utility and distillation vulnerability, involving a utility-constrained teacher and adaptive student. It proposes two tractable solutions: an adaptive evaluation rule where the student reweights high-value examples, and a teacher-side defense template suppressing distillation-prone outputs. The Product-of-Experts (PoE) defense emerges as a computationally efficient forward-pass-only method. Empirical results on GSM8K and MATH show adaptive students outperform passive evaluations, narrowing robustness gaps between expensive defenses and PoE while maintaining reasoning quality. Findings underscore the persistent challenge of preventing strong distillation.
distillation attacks · minimax game · adaptive evaluation · product-of-experts · robustness gap
HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools
HarnessAPI introduces a unified Python framework for deploying LLM tools as both HTTP endpoints and MCP tool registrations, eliminating code duplication. The framework treats a typed skill folder as a single source of truth, automatically deriving a streaming HTTP endpoint with Server-Sent Events, an interactive OpenAPI/Swagger UI, and a zero-configuration MCP tool from one handler.py and Pydantic schemas. Dual-mode content negotiation enables seamless support for SSE-streaming and JSON-returning clients. Dynamic code-generation ensures Pydantic type annotations propagate correctly to FastMCP's inspection layer. Evaluations across six skills show a 74% reduction in boilerplate compared to manual dual-stack implementations.
harnessapi · mcp tool · pydantic · server-sent events · fastapi
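Dual-mode content negotiation, as described in the summary above, amounts to choosing between a Server-Sent Events stream and a single JSON body based on the client's Accept header. The following standard-library sketch illustrates that idea only; it is not HarnessAPI's API, and the names `wants_sse`, `format_sse`, and `respond` are hypothetical. A real server would stream chunks as they are produced rather than joining them.

```python
import json
from typing import Iterable, Optional, Tuple


def wants_sse(accept_header: str) -> bool:
    """True if the Accept header lists text/event-stream (parameters ignored)."""
    media_types = [part.split(";")[0].strip().lower()
                   for part in accept_header.split(",")]
    return "text/event-stream" in media_types


def format_sse(payload: dict, event: Optional[str] = None) -> str:
    """Serialize one SSE message: optional event line, data line, blank line."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    # json.dumps without indent emits a single line, so one data field suffices.
    lines.append(f"data: {json.dumps(payload)}")
    return "\n".join(lines) + "\n\n"


def respond(chunks: Iterable[str], accept_header: str) -> Tuple[str, str]:
    """Return (content_type, body) for the same handler output in either mode."""
    chunks = list(chunks)
    if wants_sse(accept_header):
        body = "".join(format_sse({"delta": chunk}) for chunk in chunks)
        return "text/event-stream", body
    return "application/json", json.dumps({"output": "".join(chunks)})


if __name__ == "__main__":
    handler_output = ["Hel", "lo"]
    print(respond(handler_output, "text/event-stream"))
    print(respond(handler_output, "application/json"))
```

Keeping the handler output identical in both branches is what lets one skill definition serve streaming and non-streaming clients.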
Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models
The study evaluates acoustic emotion recognition models as proxies for Pathos analysis in political speech, comparing three modalities: emotion2vec_plus_large for acoustic features, Gemini 2.5 Flash for multimodal LLM analysis, and TRUST-Pathos scores from an LLM ensemble. Results show Gemini Valence correlates strongly with TRUST-Pathos (rho = +0.664, p < 0.001), while emotion2vec Valence does not (rho = +0.097, p = 0.499), indicating LLM-based multimodal analysis outperforms acoustic models for semantic emotion capture. Acoustic features remain useful for Arousal estimation. The study also critiques standard SER benchmarks like EMO-DB for cultural bias and acted speech.
pathos analysis · acoustic emotion recognition · multimodal llm · speech emotion recognition · valence-arousal
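The rho values quoted above are Spearman rank correlations. As a reminder of what that statistic measures, here is a minimal pure-Python sketch: Pearson correlation applied to average ranks, which handles ties. The function names are illustrative, and no data from the study is reproduced.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    print(round(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]), 3))
```

Because only ranks matter, the statistic captures any monotonic relationship between two score sets, not just a linear one.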
Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation
The paper proposes a state-distribution perspective for analyzing post-training methods in autoregressive language models, contrasting fixed dataset states in supervised fine-tuning (SFT) with policy-induced states in reinforcement learning (RL) and on-policy distillation (OPD). Through controlled experiments on Qwen3-0.6B-Base using GSM8K, TruthfulQA, and MMLU benchmarks, the study demonstrates three key findings: (1) aggressive SFT degrades retention despite GSM8K improvements, (2) OPD outperforms its degraded teacher across all metrics, and (3) lightweight RL enhances GSM8K without retention loss. Results indicate state distribution significantly impacts post-training outcomes alongside objective functions.
autoregressive policy · state distribution · on-policy distillation · supervised fine-tuning · retention evaluation
The Value of Covariance Matching in Gaussian DDPMs and the Lanczos Sampler
The paper demonstrates that matching full posterior covariance in Gaussian Denoising Diffusion Probabilistic Models (DDPMs) reduces the path KL divergence from $O(1/T)$ to $O(1/T^2)$, overcoming prior limitations. It introduces the Lanczos Gaussian Sampler (LGS), a training-free, matrix-free method leveraging Jacobian-vector products to sample from optimal reverse covariance without dense storage. Theoretical analysis shows exponential error decay with Lanczos steps, while empirical results on image benchmarks confirm improved sample quality over diagonal-covariance baselines like OCM-DDPM using just three steps.
gaussian ddpm · covariance matching · lanczos sampler · path kl divergence · jacobian-vector products
Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts
The study introduces the first evaluation framework for assessing large language model (LLM) alignment in conflict contexts, where misaligned outputs risk exacerbating societal divisions. Researchers tested nine configurations from OpenAI, Anthropic, DeepSeek, and xAI on 90 multi-turn scenarios designed to surface failure modes like false equivalence, genocide denial, and ethnic slur recognition. Failure rates ranged from 6% to 47% across models, with 80-100% failure when users demanded 'balance' in adjudicated atrocity cases, highlighting model choice as a critical safety consideration.
alignment evaluation · large language models · conflict contexts · multi-turn scenarios · failure rates
Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators
The paper introduces Live Music Diffusion Models (LMDMs), a method for efficient interactive music generation by modifying the diffusion process with block-wise KV caching, outperforming discrete-AR models in inference complexity. The proposed ARC-Forcing paradigm enables stable post-training alignment without explicit reinforcement learning. Applications include text-conditioned generation, sketch-based synthesis, and real-time artist-AI collaboration, demonstrating feasibility on consumer hardware.
diffusion models · kv caching · block-wise outpainting · arc-forcing · live music generation
Parametric Modular Answer Set Programs Made Declarative
The paper introduces parametric modular logic programs, a novel formalism for modular first-order answer set programming (ASP) that supports parameterized subprograms and intensionality statements. The approach captures the semantics of clingo-programs with collective control, enabling structured program composition and instantiation. Theoretical foundations are established, demonstrating the formalism's expressiveness while maintaining connections to traditional non-modular ASP.
answer set programming · modular logic programs · collective control · intensionality statements · clingo-programs
AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild
AnyMo introduces a geometry-aware framework for setup-agnostic human motion modeling using wearable IMUs. The method employs physics-grounded IMU simulation for synthetic signal generation, pre-trains a graph encoder from paired synthetic views, tokenizes multi-position IMU data into motion tokens, and aligns these with an LLM for motion-language understanding. Evaluations on zero-shot activity recognition (14 datasets), cross-modal retrieval, and motion captioning show improvements of 11.7%/11.6%/22.6% in Accuracy/F1/R@2 for HAR, 15.9%/28.6% in MRR for retrieval, and 18.8% in BERT-F1 for captioning.
imu simulation · graph encoder · motion tokens · zero-shot recognition · cross-modal retrieval
AMEL: Accumulated Message Effects on LLM Judgments
The study identifies accumulated message effects on LLM judgments (AMEL), demonstrating that prior conversation history biases subsequent evaluations across 11 models from 4 providers (75,898 API calls). Identical test items presented after positive or negative histories shift toward the prevailing polarity (d = -0.17, p < 10^-46), with stronger effects for high-entropy items (d = -0.34) and a 1.62x negativity asymmetry. Scaling reduces but does not eliminate bias (e.g., GPT-5.2: d = -0.17). Mechanisms involve continuous token probability shifts, semantic components, and position-independent effects. Fresh contexts or balanced histories mitigate bias in evaluation pipelines.
accumulated message effect · llm bias · negativity asymmetry · token probability shift · evaluation pipelines
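The effect sizes above are reported as Cohen's d. A minimal sketch of the usual pooled-standard-deviation form follows; the summary does not state which variant or sign convention the study uses, so both are assumptions here, and the example numbers are made up purely to exercise the function.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation.

    Assumption: d = (mean_a - mean_b) / s_pooled, using sample variances.
    The sign depends on which group is listed first.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("effect size is undefined when both groups are constant")
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


if __name__ == "__main__":
    # Hypothetical judgment scores for the same items under two histories.
    after_history_a = [1.0, 2.0, 3.0]
    after_history_b = [2.0, 3.0, 4.0]
    print(cohens_d(after_history_a, after_history_b))
```

By the common rule of thumb, magnitudes near 0.2 count as small, which puts the reported overall shift in the small range even though it is statistically unambiguous.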
Abstraction for Offline Goal-Conditioned Reinforcement Learning
The paper introduces absolute abstraction via hierarchical policies in offline Goal-Conditioned Reinforcement Learning (GCRL), leveraging relativised options and distinct hierarchical representations to reuse experience across similar state-space contexts. Two algorithms are proposed for learning relativised options and abstracting from absolute reference frames. Experiments demonstrate that these inductive biases significantly enhance offline GCRL performance by reducing redundancy in Markov Decision Processes (MDPs) caused by symmetries and shared structure across state-goal pairs.
goal-conditioned reinforcement learning · markov decision processes · hierarchical policies · relativised options · absolute abstraction
Beyond the Org Chart: AI and the Transformation of Invisible Work
The study investigates AI's impact on professional roles and workplace culture through interviews with 24 product-focused individuals at a tech firm. Findings reveal AI transforms both formal responsibilities and informal practices like mentorship, with mixed effects: improved peer collaboration but risks to career growth mechanisms. The authors propose interventions for AI companies to enhance visibility of informal work and preserve cultural elements supporting diversity and collaboration.
ai adoption · role transformation · informal work · mentorship · organizational culture
Scout-Assisted Planning for Heterogeneous Robot Teams under Partially Known Environments
The paper introduces Scout-Assisted Planning (SAP), a heterogeneous planning framework where Unmanned Aerial Vehicles scout for Unmanned Ground Vehicles to mitigate backtracking in partially known environments. It proposes Information Gain-based Action Pruning (IGAP) to prioritize scouting actions by their expected impact on navigation, accelerated by a Graph Neural Network that predicts information gain from graph structure and belief state. Experimental results demonstrate SAP with IGAP reduces ground robot travel cost by 31.9-37.7% compared to the Canadian Traveler Problem baseline and outperforms proximity-based scouting by 8-14%.
scout-assisted planning · information gain-based action pruning · graph neural network · heterogeneous robot teams · canadian traveler problem
Forecasting Scientific Progress with Artificial Intelligence
The study introduces CUSP (Cutoff-conditioned Unseen Scientific Progress), a temporally grounded evaluation framework for assessing AI's ability to forecast scientific progress across 4,760 multidisciplinary events. The benchmark evaluates feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction under controlled knowledge constraints. Results reveal systematic limitations: models identify plausible research directions but fail to reliably predict realization or timing of advances, with performance varying by domain (AI progress more predictable than biology/chemistry/physics). Additional pre-cutoff knowledge improves performance but gaps persist, particularly for high-citation advances, alongside unreliable uncertainty estimation and overconfidence.
scientific forecasting · temporal prediction · knowledge constraints · uncertainty estimation · multidisciplinary benchmark
Swift Sampling: Selecting Temporal Surprises via Taylor Series
The paper introduces Swift Sampling, a training-free frame selection algorithm for identifying high-information moments in long-form videos by detecting temporal surprises. The method models videos as differentiable trajectories in visual latent space, computes feature velocity and acceleration, and uses Taylor expansion to project expected frame paths; frames deviating from these projections are selected. Evaluated on three long-video QA benchmarks and 10 downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic methods, achieving up to +12.5 accuracy points with only 0.02x computational overhead (30x cheaper than baselines).
temporal surprises · predictive coding · taylor expansion · latent space · frame selection
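The selection rule described above (project the next frame from velocity and acceleration, then keep frames that deviate from the projection) can be sketched with a second-order Taylor step. This is an illustration under stated assumptions, not the paper's implementation: derivatives are approximated by backward finite differences with unit time step, per-frame features are plain lists of floats, and the names `surprise_scores` and `select_frames` are hypothetical.

```python
def surprise_scores(features):
    """Map frame index -> distance between the frame and its Taylor projection.

    For frame i, the projection from the three preceding frames is
        x_hat = x[i-1] + v + 0.5 * a,
    with v = x[i-1] - x[i-2] and a = v - (x[i-2] - x[i-3]).
    Frames 0..2 have too little history and are not scored.
    """
    scores = {}
    for i in range(3, len(features)):
        squared_error = 0.0
        for p3, p2, p1, actual in zip(features[i - 3], features[i - 2],
                                      features[i - 1], features[i]):
            velocity = p1 - p2
            acceleration = velocity - (p2 - p3)
            predicted = p1 + velocity + 0.5 * acceleration
            squared_error += (actual - predicted) ** 2
        scores[i] = squared_error ** 0.5
    return scores


def select_frames(features, k):
    """Indices of the k most surprising frames, in temporal order."""
    scores = surprise_scores(features)
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return sorted(top)


if __name__ == "__main__":
    # One-dimensional toy trajectory: steady motion, then a jump at frame 4.
    trajectory = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
    print(surprise_scores(trajectory))
    print(select_frames(trajectory, k=2))
```

Note that the frame right after a jump also scores highly, because the projection extrapolates the jump's velocity; a practical selector might suppress neighbors of an already selected frame.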
Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
The study identifies inverse scaling in LLMs for forecasting tasks involving superlinear growth and tail risk, such as finance and epidemiology. Using ForecastBench-Sim (FBSim) and real-world datasets (COVID-19, measles, housing markets, hyperinflation), it shows more capable models produce worse distributional forecasts, particularly in the upper tail. A per-quantile decomposition reveals this effect, exacerbated by model scale and post-training. Conventional single-threshold metrics fail to capture tail risks, necessitating continuous accuracy measures. Findings hold across synthetic SIR epidemics and real-world scenarios, with domain knowledge failing to improve calibration.
inverse scaling · superlinear growth · tail risk · per-quantile decomposition · forecastbench-sim
WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
The paper introduces WorkstreamBench, a novel benchmark for evaluating LLM agents on end-to-end spreadsheet tasks in finance, addressing the gap in existing benchmarks that focus only on question-answering or single-formula edits. The evaluation taxonomy comprises three dimensions—Accuracy, Formula, and Format—with fine-grained criteria reflecting professional standards. Results show the Claude family leads in producing professional-looking outputs, but performance degrades sharply with task complexity, indicating current agents fall short of real-world financial workflow demands.
llm agents · spreadsheet tasks · financial modeling · evaluation taxonomy · professional standards
Claw AI Lab: An Autonomous Multi-Agent Research Team
Claw AI Lab introduces an autonomous multi-agent research platform that transforms prompt-to-paper pipelines into interactive AI laboratories. The system enables instantiation of customizable research teams with collaborative workflows, real-time monitoring, and artifact inspection via a unified dashboard. Key innovations include the Claw-Code Harness for integrating local codebases/datasets into runnable experiments and feeding execution artifacts back into the research loop. In evaluations against AutoResearchClaw across five AI case studies, expert judges preferred Claw AI Lab for novelty, completeness, and paper quality. The work advances autonomous research toward usable, interactive scientific infrastructure.
autonomous research · multi-agent system · code harness · experiment integrity · interactive dashboard
Moral Semantics Survive Machine Translation: Cross-Lingual Evidence from Moral Foundations Corpora
The study demonstrates that machine-translated moral language retains sufficient semantic fidelity for cross-lingual classification tasks, addressing the scarcity of non-English moral foundations corpora. Using 50k Polish social media posts, the authors validate translation quality through LaBSE embeddings, Centered Kernel Alignment, LLM-as-judge evaluation, and classifier parity tests. Results show preserved moral cues (mean cosine similarity 0.86) with minimal AUC degradation (0.01-0.02 gaps), improving further with fine-tuning, establishing translation as viable for under-resourced languages.
moral foundationscross-lingual embeddingcentered kernel alignmentllm-as-judgeclassifier parity
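The mean cosine similarity reported in the item above can be illustrated with a minimal sketch. The three-dimensional vectors below are hypothetical stand-ins for LaBSE sentence embeddings, and the function names are illustrative rather than taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(source_vecs, translated_vecs):
    """Average cosine similarity over aligned (source, translation) pairs."""
    sims = [cosine_similarity(s, t) for s, t in zip(source_vecs, translated_vecs)]
    return sum(sims) / len(sims)

# Hypothetical embeddings: the first pair points the same way, the second is orthogonal.
source = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
translated = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
print(mean_pairwise_cosine(source, translated))
```

A value near 1 means the translated post sits close to its source in embedding space; this says nothing by itself about classifier parity, which the study tests separately.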
AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters
The paper introduces AtelierEval, the first unified benchmark for evaluating prompting proficiency in text-to-image (T2I) systems, covering both human and multimodal large language model (MLLM) prompters across 360 expert-crafted tasks. It proposes AtelierJudge, a skill-based, memory-augmented agentic evaluator that achieves 0.79 Spearman correlation with human experts, combining subjective and objective scoring. Results from benchmarking 8 MLLMs and 48 humans across 4 T2I backends reveal the superiority of mimicry over planning strategies, advocating for image-augmented prompting approaches.
text-to-imagemultimodal llmsagentic evaluationprompt engineeringbenchmarking
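Agreement between an automated judge and human experts, as in the Spearman figure above, is a rank correlation. A minimal sketch, assuming each task has one judge score and one human score; the helper names are illustrative and the paper's exact protocol is not reproduced here.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks matter, any monotone rescaling of the judge's scores leaves the correlation unchanged.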
Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning
Spreadsheet-RL introduces a reinforcement learning framework for training specialized spreadsheet agents in Microsoft Excel, addressing limitations of general-purpose LLMs on complex workflows. The method includes an automated pipeline for collecting paired start-goal spreadsheets from online forums, a Domain-Spreadsheet benchmark, and a Spreadsheet Gym environment with Excel functionality exposed via Python. Experiments show significant improvements: Qwen3-4B-Thinking-2507's Pass@1 increased from 12.0% to 23.4% on SpreadsheetBench and from 8.4% to 17.2% on Domain-Spreadsheet, demonstrating strong generalization potential for real-world spreadsheet automation.
spreadsheet-rlreinforcement learningdomain-spreadsheetexcel automationllm agents
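Pass@1 is the expected fraction of tasks solved by a single sampled attempt. When several attempts per task are available, a common choice is the unbiased pass@k estimator from the code-generation evaluation literature; the sketch below assumes that estimator, which may differ from how the Spreadsheet-RL authors computed their numbers.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task: n sampled attempts, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the per-task estimate reduces to c / n, the plain success rate over the sampled attempts.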
More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts
The study systematically evaluates factors affecting Schwartz value detection in political texts, focusing on context length, model size, and moral knowledge integration. Using the ValuesML/Touché ValueEval framework, it compares sentence-, window-, and document-level inputs, retrieval-augmented generation (RAG) with a moral knowledge base, and models ranging from DeBERTa-v3-base/large to zero-shot LLMs (12B-123B parameters). Results indicate context benefits supervised DeBERTa (3.8-4.8 macro-F1 gain) but not zero-shot LLMs, while RAG consistently improves performance via early fusion. Model scaling yields inconsistent gains, and per-value analysis reveals context/RAG aids socially situated or confusable values most.
schwartz value detectionretrieval-augmented generationcontext windowzero-shot learningmacro-f1
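Macro-F1, the metric behind the gains quoted above, averages per-class F1 with equal weight for each class, so rare values count as much as frequent ones. A minimal sketch for the single-label case; value detection is often multi-label, so this simplification and the example labels are assumptions made for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical gold and predicted value labels for four sentences.
gold = ["security", "security", "power", "power"]
pred = ["security", "power", "power", "power"]
print(macro_f1(gold, pred))
```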
Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents
The paper introduces contractual skills, a GovernSpec-inspired framework for enterprise AI agents that formalizes task contracts via SKILL.md files. The framework integrates goals, boundaries, permissions, and verification steps while maintaining skill discoverability. Evaluation involves two experiments: a text-generation study (960 outputs across 15 tasks and 8 models) and a tool-calling challenge (192 simulated records). Results show contractual skills improve checkability and maintainability over baselines but yield mixed gains in generation quality, emphasizing their role as a governance layer rather than a safety mechanism.
contractual skillsgovernspecskill.mdtool-callingenterprise ai
Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions
The paper identifies a critical gap between healthcare LLM benchmark evaluations and real-world deployment performance, attributing it to implicit assumptions about user interactions. It classifies these assumptions into task-based (testable via conversation data) and outcome-based (requiring behavioral studies). Through retrospective analysis of a healthcare RCT, the authors demonstrate that the gap divides equally into task and outcome components. They propose BenchmarkCards to document assumptions and staged evaluation to systematically test them, offering a framework to bridge the evaluation-deployment discrepancy.
healthcare llmbenchmarkcardsstaged evaluationtask assumptionsoutcome assumptions
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents
The paper introduces Agentic CLEAR, an automated framework for multi-level evaluation of LLM agents, addressing limitations of static error taxonomies and basic observability tools. The method operates above the observability layer, providing textual insights at system, trace, and node granularity through a dynamic, domain-adaptable approach with an intuitive UI. Experiments across four benchmarks and seven agentic settings (involving tens of thousands of LLM calls) demonstrate high-quality feedback generation, strong alignment with human-annotated errors, and predictive capability for task success rates.
llm agentserror taxonomyobservability layermulti-level evaluationagentic systems
Innovations in Cardless Artificial Intelligence Banking: A Comprehensive Framework for Cyber Secure and Fraud Mitigation using Machine Learning Algorithms
The paper proposes a cybersecurity framework for cardless AI banking systems, integrating AI-powered data cryptography and machine learning for fraud mitigation. The method employs auto-generated virtual cards with encrypted data, secure communication channels, and AI-based transaction authorization to minimize information exposure. Results suggest the framework enhances security while maintaining transaction convenience, though specific performance metrics are not provided.
cardless bankingdata cryptographyfraud mitigationvirtual cardsai authorization
Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents
The paper introduces ToM-PD, a Theory-of-Mind-based Persuasive Dialogue task grounded in the Belief-Desire-Intention framework, addressing LLMs' fragmented mental state representations. It proposes TTBYS, a dual knowledge-enhanced stepwise reasoning framework leveraging explicit/implicit prior experiences for desire/belief/persuasive strategy inference, and releases ToM-BPD, a large-scale annotated dataset with fine-grained mental states. Experiments show Qwen3-8B with TTBYS outperforms GPT-5 by 1.20%, 22.80%, and 16.97% on desire, belief, and strategy prediction respectively, while improving interpretability.
theory-of-mindpersuasive dialoguebelief-desire-intentionstepwise reasoningmental state representation
MoSA: Motion-constrained Stress Adaptation for Mitigating Real-to-Sim Gap in Continuum Dynamics via Learning Residual Anisotropy
The paper introduces MoSA, a motion-constrained stress adaptation framework for mitigating the real-to-sim gap in continuum dynamics by learning residual anisotropy and heterogeneity. MoSA employs an isotropic model as a physics prior and learns residual stress operators via microplane-constrained redistribution in a physics-informed cascaded network, while enforcing motion constraints through deformation field derivatives. Experiments demonstrate superior accuracy, generalization, and robustness, with validation in robot manipulation showing improved sim-to-real transfer.
real-to-sim gapcontinuum dynamicsresidual anisotropyphysics-informed learningmotion constraints
SceneAligner: 3D-Grounded Floorplan Localization in the Wild
SceneAligner introduces a 3D-grounded approach for floorplan localization in unconstrained environments, addressing limitations of prior methods that require controlled settings and vectorized floorplans. The method reconstructs a gravity-aligned 3D scene from image collections, projects it into a 2D density map as a floorplan proxy, and aligns it with input floorplans via 2D similarity transforms. A 2D foundation model is fine-tuned to bridge appearance gaps between density maps and architectural floorplans while preserving structural consistency. Experiments show significant improvements over baselines, including in sparse settings with single-image input.
floorplan localization3d reconstructioncross-modal correspondencedensity mapsimilarity transform
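The two geometric steps in the SceneAligner summary, projecting a gravity-aligned point cloud into a 2D density map and applying a 2D similarity transform, can be sketched as follows. The grid layout, cell size, and function names are assumptions for illustration; the learned matching between density maps and floorplans is not shown.

```python
import math

def density_map(points_3d, cell_size, width, height):
    """Count points per floor cell, dropping the vertical (z) coordinate."""
    grid = [[0] * width for _ in range(height)]
    for x, y, _z in points_3d:
        col, row = int(x // cell_size), int(y // cell_size)
        if 0 <= row < height and 0 <= col < width:
            grid[row][col] += 1
    return grid

def similarity_transform(point, scale, theta, tx, ty):
    """Apply uniform scale, rotation by theta (radians), then translation to a 2D point."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (scale * (c * x - s * y) + tx, scale * (s * x + c * y) + ty)

# Hypothetical points: two fall in the first cell, one in the second.
cloud = [(0.1, 0.1, 1.0), (0.2, 0.3, 2.0), (1.5, 0.2, 0.5)]
print(density_map(cloud, cell_size=1.0, width=2, height=1))
```

A similarity transform has only four degrees of freedom (scale, angle, two offsets), which is what makes alignment to a floorplan of unknown scale tractable.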
Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion
The study demonstrates that hyperfitting in LLMs is distinct from temperature scaling or static vocabulary reweighting, instead involving dynamic context-dependent rank reordering. Through layer-wise analysis, the authors identify a 'Terminal Expansion' in the final transformer block, where an 80.8-dimensional feature space expansion promotes deep-tail tokens. They propose Late-Stage LoRA, a fine-tuning strategy updating only the final 5 layers, which achieves robust generation with minimal parameter updates.
hyperfittingtemperature scalingterminal expansionlate-stage lorarank reordering
VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis
The paper introduces VGenST-Bench, a novel benchmark for evaluating spatio-temporal reasoning in Multimodal Large Language Models (MLLMs) through actively synthesized videos. The method employs a multi-agent pipeline with human quality control to generate diverse scenarios based on a 3x2x2 taxonomy (Spatial Scale, Perspective, Scene Dynamics) and decouples visual perception from reasoning via a hierarchical task suite. This approach enables fine-grained diagnosis of MLLMs' capabilities beyond static or passively curated datasets.
spatio-temporal reasoningmultimodal llmsvideo synthesisbenchmark taxonomyquality control
Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard
The article identifies three critical weaknesses in benchmarks for evaluating AI agents in security roles: benchmark vulnerabilities, temporal staleness, and runtime uncertainty. Drawing on empirical evidence, it critiques current evaluation practices and proposes directions for more robust frameworks. The analysis highlights gaps in trustworthiness and suggests methodological improvements for future security assessments.
benchmark vulnerabilitiestemporal stalenessruntime uncertaintysecurity evaluationstrustworthy frameworks
Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement
The paper proposes a case-aware medical image classification framework using multimodal knowledge graphs and reliability-guided refinement. The method constructs knowledge graphs from retrieved similar cases, employs an image-centric Graph Attention Network for knowledge propagation, and uses bidirectional cross-modal attention for feature injection. A confidence-calibrated decision refinement scheme mitigates noisy retrieval by jointly considering prediction confidence and sample similarity. Experiments on multiple medical imaging datasets demonstrate consistent performance improvements over baselines, with ablation studies validating component effectiveness.
multimodal knowledge graphsgraph attention networkcross-modal attentionconfidence calibrationmedical image classification
Dynamic Hypergraph Representation Learning for Multivariate Time Series without Prior Knowledge
The study introduces a novel method for constructing dynamic hypergraph representations from multivariate time series without prior structural knowledge. The approach employs community detection on time series data, transforms detected communities into hypergraphs via clique-based techniques, and processes them using a Dynamic Hypergraph Attention Convolution Network (DHACN) for prediction tasks. This method advances hypergraph representation learning by automatically capturing high-order relationships in complex systems, demonstrated through applications on diverse time series datasets.
hypergraph representationmultivariate time seriescommunity detectionattention mechanismdynamic convolution network
TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
TerminalWorld introduces a scalable data engine that reverse-engineers high-fidelity evaluation tasks from 80,870 terminal recordings, yielding a benchmark of 1,530 validated tasks across 18 real-world categories. The benchmark includes 1,280 unique commands and spans workflows from short operations to those exceeding 50 steps. A Verified subset of 200 manually reviewed tasks was curated. Benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals a maximum pass rate of only 62.5%, indicating significant challenges in authentic terminal workflows. TerminalWorld exhibits weak correlation (Pearson r=0.20) with expert-curated benchmarks like Terminal-Bench, highlighting its distinct real-world focus.
terminal recordingsbenchmarkingpearson correlationfrontier modelsverified subset
A Subjective Logic-based method for runtime confidence updates in safety arguments
The paper introduces a dynamic quantitative assurance method for updating confidence in safety arguments using Subjective Logic (SL). It integrates design-time evidence and runtime Safety Performance Indicators (SPIs) to propagate confidence across the development lifecycle. Runtime SPI evidence triggers targeted updates, prioritizing safety responsiveness over exact Bayesian updates. The method is demonstrated via a simulation-based construction zone assist function, focusing on ML-based cone detection, showing confidence evolution with observed SPI evidence.
subjective logicsafety performance indicatorsruntime confidencedynamic assuranceml-based detection
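For readers unfamiliar with Subjective Logic, a binomial opinion holds belief, disbelief, and uncertainty masses that sum to one, plus a base rate. The sketch below uses the standard mapping from evidence counts with prior weight W = 2 and pools design-time and runtime evidence by adding counts. It is not the paper's update rule, which the summary says deliberately departs from exact Bayesian updating; the evidence counts are hypothetical.

```python
from dataclasses import dataclass

PRIOR_WEIGHT = 2.0  # non-informative prior weight W for binomial opinions

@dataclass
class Opinion:
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float

    def projected_probability(self):
        """Expected probability: belief plus the base-rate share of the uncertainty."""
        return self.belief + self.base_rate * self.uncertainty

def opinion_from_evidence(positive, negative, base_rate=0.5):
    """Map counts of supporting and contradicting observations to an opinion."""
    total = positive + negative + PRIOR_WEIGHT
    return Opinion(positive / total, negative / total, PRIOR_WEIGHT / total, base_rate)

# Hypothetical counts: 8 passing design-time tests, then 10 good and 2 bad runtime SPI readings.
design_time = opinion_from_evidence(8, 0)
after_runtime = opinion_from_evidence(8 + 10, 0 + 2)
print(design_time.projected_probability(), after_runtime.projected_probability())
```

More evidence always shrinks the uncertainty mass, while contradicting evidence moves mass from belief to disbelief.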
Stabilising Explainability Fragility in Cybersecurity AI: The Impact and Mitigation of Multicollinearity in Public Benchmark Datasets
The paper identifies multicollinearity-induced instability in AI explainability for intrusion detection systems (IDS), proving via theorem that multicollinearity inflates attribution variance, making explanations non-identifiable. It validates this on UNSW-NB15 using linear, tree-based, kernel, and neural models, proposing the Explainability Fragility Score and two mitigation methods: CAA-Filtering for attribution grouping and SHARP for training-time regularization. Results demonstrate stable predictive performance and improved explainability stability via Kendall's τ, offering guidelines for trustworthy XAI in security-critical contexts.
multicollinearityexplainability fragilityattribution varianceshapids
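Kendall's τ, used above to measure explanation stability, compares the feature orderings induced by two attribution vectors. A minimal sketch of the tau-a variant (no tie correction); the attribution values are hypothetical and the paper may use a different variant.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / pairs

# Hypothetical attributions for four features from two training runs.
run_1 = [0.40, 0.30, 0.20, 0.10]
run_2 = [0.35, 0.10, 0.30, 0.25]
print(kendall_tau(run_1, run_2))
```

A τ of 1 means both runs rank the features identically; values well below 1 on retrained models are the kind of instability the paper attributes to correlated features.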
Meta-Learning for Rapid Adaptation in Reference Tracking of Uncertain Nonlinear Systems
The paper proposes a meta-learning framework for reference tracking in uncertain nonlinear systems, combining offline source system data with limited target system data for rapid adaptation. The method adapts implicit model-agnostic meta-learning (iMAML) to control, featuring a bi-level optimization with offline meta-training and online meta-adaptation phases. Two learning variants—neural state-space modeling and deep Q-networks—are integrated, differing in explicit system identification requirements. Experiments show improved control performance over baselines in simulations and hardware tests.
meta-learningnonlinear controlreference trackingbi-level optimizationsystem identification
Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning
Search-E1 introduces a self-evolution method for search-augmented reasoning agents, eliminating the need for external supervision or auxiliary modules. The approach combines vanilla GRPO with offline self-distillation (OFSD), where the policy aligns its inference-time distribution to a privileged context via a token-level forward KL objective. This provides dense per-step supervision without complex machinery. Evaluated on seven QA benchmarks, Search-E1 achieves 0.440 average EM with Qwen2.5-3B, outperforming all open-source baselines at comparable scales.
self-distillationsearch-augmented reasoninggrpooffline self-distillationtoken-level kl
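A token-level forward KL objective, as mentioned in the Search-E1 summary, can be sketched as the mean of KL(teacher || student) over token positions. Here the teacher is assumed to be the distribution under the privileged context and the student the distribution at inference time; that reading of the direction, and all names below, are assumptions rather than details confirmed by the source.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vector of vocabulary logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl(teacher_probs, student_probs):
    """KL(teacher || student) at a single token position."""
    return sum(p * math.log(p / q) for p, q in zip(teacher_probs, student_probs) if p > 0)

def token_level_forward_kl(teacher_logits, student_logits):
    """Mean forward KL over positions; each entry is one position's logit vector."""
    per_token = [forward_kl(softmax(t), softmax(s))
                 for t, s in zip(teacher_logits, student_logits)]
    return sum(per_token) / len(per_token)
```

Because every position contributes its own term, the signal is dense per step, in contrast to a single outcome reward per trajectory.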
Towards Direct Evaluation of Harness Optimizers via Priority Ranking
The paper introduces priority ranking as a direct evaluation method for harness optimizers, addressing limitations of indirect performance-based assessments. By requiring optimizers to rank harness components by improvement potential, the method quantifies step-level optimization ability without costly rollouts. The approach leverages Shor, a dataset of 182 human-verified optimization scenarios across domains. Results show ranking performance correlates with multi-step optimization effectiveness, validating priority ranking as a reliable predictor. Code and data are publicly available.
harness optimizationpriority rankingdirect evaluationshor datasetoptimizer assessment
LACO: Adaptive Latent Communication for Collaborative Driving
LACO introduces a training-free latent communication paradigm for collaborative driving, addressing latency and information loss in language-based coordination. The method employs Iterative Latent Deliberation (ILD) for latent reasoning, Cross-Horizon Saliency Attribution (CHSA) for efficient information selection, and Structured Semantic Knowledge Distillation (SSKD) to stabilize ego-centric decision-making. Experiments in CARLA demonstrate that LACO reduces communication and inference latency while maintaining strong collaborative driving performance under partial observability.
latent communicationcollaborative drivingiterative latent deliberationcross-horizon saliency attributionstructured semantic knowledge distillation
Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost
The paper demonstrates that compiling agentic workflows into LLM weights via fine-tuning yields near-frontier performance at 100x lower cost compared to orchestration frameworks. The method addresses three perceived barriers through empirical evaluation on three domains: travel booking (14 nodes), Zoom support (14 nodes), and insurance claims (55 nodes, 6 decision hubs). Results show that subterranean agents match frontier model quality while avoiding context window consumption, proprietary procedure exposure, and per-conversation frontier model costs.
agent orchestrationsubterranean agentllm fine-tuningworkflow compilationprocedural tasks
BeLink: Biomedical Entity Linking Meets Generative Re-Ranking
The paper introduces BeLink, a biomedical entity linking system that employs instruction-tuned generative models for efficient candidate re-ranking. The method uses set-wise instruction-tuning to enhance both accuracy and computational efficiency in the re-ranking stage of BEL pipelines. Evaluations show 3%-24% accuracy improvements across multiple BEL benchmarks while reducing inference time compared to state-of-the-art approaches. The system is designed as a modular, end-to-end solution for practical deployment.
biomedical entity linkinginstruction-tuninggenerative re-rankingset-wise learningllms
The Neural Compiler: Program-to-Network Translation for Hybrid Scientific Machine Learning
The Neural Compiler introduces a program-to-network translation system that converts first-order Scheme-like expressions into frozen, differentiable PyTorch modules, enabling exact encoding of known physics in hybrid scientific machine learning models. The compiler supports 51 primitive operations, including vector and matrix algebra, and generates modules that match hand-coded implementations numerically without accuracy loss. Evaluated across six domains, compiled models with 1-4 trainable parameters recover physical constants to <1% error, outperforming PINN baselines with 7-93% error. The system's primary advantage is systematic composability, producing correct, differentiable modules from symbolic specifications without manual rewriting.
neural compilerhybrid modelsdifferentiable modulespinn baselinessystematic composability
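The general idea of program-to-function translation can be sketched in a few lines. This toy version compiles a Scheme-like expression into a plain Python closure, not a differentiable PyTorch module as in the paper, and its six primitives and the Hooke's-law example are illustrative assumptions.

```python
import math
import operator

PRIMITIVES = {"+": operator.add, "-": operator.sub, "*": operator.mul,
              "/": operator.truediv, "sin": math.sin, "exp": math.exp}

def tokenize(src):
    """Split an s-expression string into parentheses and atoms."""
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Build nested lists from tokens; atoms become floats or symbol strings."""
    token = tokens.pop(0)
    if token == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(parse(tokens))
        tokens.pop(0)  # discard the closing parenthesis
        return expr
    try:
        return float(token)
    except ValueError:
        return token

def compile_expr(expr):
    """Translate a parsed expression into a function of an environment dict."""
    if isinstance(expr, float):
        return lambda env: expr
    if isinstance(expr, str):
        return lambda env: env[expr]
    op = PRIMITIVES[expr[0]]
    args = [compile_expr(e) for e in expr[1:]]
    return lambda env: op(*(arg(env) for arg in args))

# Hooke's law F = -k * x, with k as the single unknown constant.
spring_force = compile_expr(parse(tokenize("(* (- 0 k) x)")))
print(spring_force({"k": 2.0, "x": 0.5}))
```

In a hybrid model the symbol `k` would be a trainable parameter and the primitives differentiable tensor operations, so the known physics is encoded exactly while the constant is fitted from data.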
Understanding Multimodal Failure in Action-Chunking Behavioral Cloning
The paper analyzes failure modes in action-chunking behavioral cloning when observations admit multiple valid actions. It compares latent-variable policies (where posterior-prior regularization trades off sampling reliability against mode discrimination) and action-space generative policies (where Lipschitz smoothness constraints limit multimodal coverage). Theoretical analysis shows latent-variable policies require careful regularization balancing, while generative policies need sharp transitions or off-support bridges to cover separated modes. Experiments on synthetic tasks and robotic simulations validate these mechanisms.
behavioral cloningaction-chunkinglatent-variable policieslipschitz smoothnessmultimodal failure
Implicit Regularization of Mini-Batch Training in Graph Neural Networks
The paper demonstrates that Random Node Sampling (RNS), the simplest mini-batch training scheme for Graph Neural Networks (GNNs), implicitly regularizes gradient variance and often outperforms full-graph training. Through backward error analysis of graph mini-batch SGD, the authors show RNS minimizes a sampled loss plus a gradient-variance-dependent regularizer, yielding better implicit objectives despite discarding local structure. Empirical results show RNS matches or exceeds full-graph performance on 8/10 datasets while reducing wall-clock time and memory usage, reframing graph sampling as implicit regularization.
graph neural networksmini-batch trainingimplicit regularizationrandom node samplinggradient variance
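Random Node Sampling can be sketched as drawing a uniform batch of nodes and keeping only the edges with both endpoints in the batch. Using the induced subgraph is an assumption here, consistent with the summary's remark that local structure is discarded; the function name and toy graph are illustrative.

```python
import random

def random_node_batch(num_nodes, edges, batch_size, rng):
    """Sample nodes uniformly without replacement and return the induced edge list."""
    nodes = set(rng.sample(range(num_nodes), batch_size))
    induced = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return sorted(nodes), induced

# Hypothetical 6-node ring graph.
ring = [(i, (i + 1) % 6) for i in range(6)]
batch_nodes, batch_edges = random_node_batch(6, ring, batch_size=3, rng=random.Random(0))
print(batch_nodes, batch_edges)
```

Each batch sees a different sparse fragment of the graph, which is the source of the gradient variance that the paper analyzes as an implicit regularizer.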
BioFormer: Rethinking Cross-Subject Generalization via Spectral Structural Alignment in Biomedical Time-Series
BioFormer addresses cross-subject generalization in biomedical time-series by explicitly modeling subject-specific variability through spectral drift. The proposed Frequency-Band Alignment Module (FBAM) generates band-wise modulation factors to align spectral structure by adjusting amplitude and phase, while Sample Conditional Layer Normalization stabilizes representations using intrinsic signal statistics. Evaluations on six datasets show BioFormer outperforms 12 baselines with 6% absolute F1-score improvement.
cross-subject generalizationspectral driftfrequency-band alignmentbiomedical time-serieslayer normalization
From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models
The paper introduces a five-stage methodology for causal feature analysis in transformers, comprising probe design, feature extraction, causal validation, robustness testing, and deployment integration. Applied to GPT-2 small on the Indirect Object Identification task, activation patching identifies a key attention head (layer-9 head 9, +1.02 recovery), while sparse autoencoders extract name-selective features (30-50 activation units). Causal analysis reveals partial causality (15 features maintain 98% accuracy), with NLA evaluations showing 31% variance explained and selectivity-causality anticorrelation (r=-0.56). Robustness tests show circuit transfer but feature degradation, and cost analysis demonstrates 99.1% savings ($8.96 vs $1000 per 1000 queries).
transformercausal analysisactivation patchingsparse autoencoderrobustness testing
KAPPS: A knowledge-based CPPS Architecture for the Circular Factory
The authors propose KAPPS, a knowledge-based Cyber Physical Production System (CPPS) architecture for circular manufacturing, addressing the limitations of conventional IT systems in handling heterogeneous product states and dynamic reconfiguration. KAPPS integrates an ontology-grounded knowledge graph as a unified data backbone, with semantic interfaces for cross-system integration, reasoning, and communication. It features modules for constraint enforcement and event-driven planning, enabling adaptive execution under uncertainty. The architecture is validated through two use cases: anomaly detection via knowledge graph services and runtime constraint enforcement in a modular conveyor system, meeting 14 derived requirements.
cyber physical production systemknowledge graphcircular manufacturingontology-groundedconstraint enforcement
Steins;Gate Drive: Semantic Safety Arbitration over Structured Futures for Latency-Decoupled LLM Planning
The paper introduces SteinsGateDrive, a latency-decoupled planner-runtime architecture for LLM-based vehicle control that separates semantic planning from real-time execution. The method employs a worldline generator producing three forecast types (alpha: nominal, beta: interaction counterfactuals, gamma: hazard-stress scenarios), which are validated by a runtime safety contract system. Experiments with GPT-5.4 mini show reduced effective lag from +3.07s to -0.01s at 4-second horizons while maintaining collision-free operation, with safety enforced via predicate checks rather than forecast drift metrics.
latency-decoupledworldline generatorstrategicforecastsafety contractscounterfactual futures
Making the Discrete Continuous: Synthetic RAW Augmentations for Fine-Grained Evaluation of Person Detection Performance in Low Light
The study demonstrates synthetic RAW image augmentation's efficacy in evaluating pedestrian detection models under low-light conditions, addressing data scarcity in real-world datasets. By generating synthetic low-light samples that mimic camera sensor noise, the authors provide continuous sampling of the input space, enhancing benchmark coverage. Results show comparable performance metrics between real and synthetic data, suggesting that the detector has difficulty distinguishing the two. This approach offers a robust method for fine-grained evaluation of AI vision models in safety-critical applications like autonomous driving.
synthetic raw augmentationlow-light pedestrian detectioncamera sensor noiseautonomous driving safetyperformance metrics
Don't Forget the Critic: Value-Based Data Rehearsal for Multi-Cyclic Continual Reinforcement Learning
The paper introduces Qreg+NWLU, a value-based data rehearsal method for multi-cyclic continual reinforcement learning (CRL), addressing limitations of actor-centric approaches. The method combines continuous data rehearsal with dynamic Q-value updates and immediate 'No-Wait' regularization, eliminating the delay in applying regularization. Evaluations in multi-cyclic environments demonstrate improved learning efficiency, reduced catastrophic forgetting, and enhanced knowledge transfer compared to Qreg and conventional CRL methods, particularly in value function approximation settings.
continual reinforcement learningdata rehearsalq-value regularizationmulti-cyclic environmentscatastrophic forgetting
S2ED: From Story to Executable Descriptions for Consistency-Aware Story Illustration
The paper introduces Story-to-Executable Descriptions (S2ED), a training-free framework for generating consistent multi-frame story illustrations by converting narratives into explicit, editable descriptions. S2ED employs three agents to segment narratives, ground character attributes, and enrich spatial-affective cues, enabling interpretable state propagation and local edits without generator retraining. Evaluations on Flintstones and Shakoo Maku datasets demonstrate improved sequence-level consistency and character fidelity over baseline methods, validated by automatic metrics and human judgments, with deployment in an end-to-end storybook system.
story illustrationnarrative decompositionprompt-layer frameworkcharacter fidelitystate propagation
Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts
The paper introduces Pre-VLA, a runtime verification architecture for vision-language-action (VLA) models and world-model rollouts, addressing uncertainty in action generation. It employs a multimodal backbone with modality-aware pooling and a dual-branch head to predict safety confidence and advantage scores, trained via a multi-task objective combining Focal classification, advantage regression, and soft-threshold calibration. Pre-VLA features a dual-mode preemptive resampling scheduler for filtering low-quality actions. On the LIBERO benchmark, it improves closed-loop success rates from 30.79% to 37.62%, reduces execution steps, achieves 183.9 ms verification time per action chunk, and mitigates rollout errors.
pre-vlaruntime verificationvision-language-actionworld-model rolloutsmodality-aware pooling
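The Focal classification term named in the Pre-VLA summary down-weights examples the model already classifies confidently. A minimal sketch of the binary focal loss in its standard form, without class weighting; the default gamma is a common choice, not a value reported by the paper.

```python
import math

def binary_focal_loss(prob_positive, label, gamma=2.0):
    """Focal loss -(1 - p_t)^gamma * log(p_t) for one binary prediction."""
    p_t = prob_positive if label == 1 else 1.0 - prob_positive
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# An easy positive (p = 0.9) contributes far less than a hard one (p = 0.3).
print(binary_focal_loss(0.9, 1), binary_focal_loss(0.3, 1))
```

With gamma set to 0 the expression reduces to ordinary binary cross-entropy.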
A Constant-Time Implementation Methodology for Activation Functions on Microcontrollers
The paper presents a constant-time implementation methodology for neural activation functions on microcontrollers to prevent timing side-channel attacks. The approach combines branchless selection, Padé approximations, dummy operations, and cycle alignment to achieve timing-invariant execution across ReLU, sigmoid, tanh, GELU, and Swish on ARM Cortex-M4. Experimental validation shows fixed-cycle counts (88 cycles for 3 functions, 108 for 5) while maintaining numerical accuracy, demonstrating practical side-channel resistance for embedded inference.
constant-timeside-channelpade approximationcycle alignmentmicrocontroller
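Two of the ingredients named above, branchless selection and Padé approximation, can be shown arithmetically. Python offers no timing guarantees, so this sketch only illustrates the data-independent arithmetic; the [3/2] approximant is one possible choice, and the paper's actual orders, fixed-point formats, and cycle padding are not reproduced.

```python
import math

def relu_branchless_i32(x):
    """ReLU on a signed 32-bit integer using a sign mask instead of a branch."""
    sign_mask = x >> 31  # 0 for x >= 0, -1 (all ones) for x < 0
    return x & ~sign_mask

def tanh_pade(x):
    """[3/2] Pade approximant of tanh around 0: x(15 + x^2) / (15 + 6x^2)."""
    x2 = x * x
    return x * (15.0 + x2) / (15.0 + 6.0 * x2)

print(relu_branchless_i32(-7), relu_branchless_i32(5))
print(tanh_pade(0.5), math.tanh(0.5))
```

The rational form executes the same multiply, add, and divide sequence for every input, which is what makes it attractive for timing-invariant code. It is accurate only near zero and grows without bound for large inputs, so a real implementation would also need a branch-free clamp.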
Characterizing the Fault Response of the Intel Neural Compute Stick 2 Under Single-Pulse Electromagnetic Fault Injection
The study systematically characterizes fault responses of the Intel Neural Compute Stick 2 (NCS2) under single-pulse electromagnetic fault injection (EMFI), addressing a gap in safety-critical edge deployments. Using three ImageNet-trained CNNs (ResNet-18, ResNet-50, VGG-11) on OpenVINO, 1,536 spot tests and ~16,000 parameter trials identified four outcome classes: no-effect, silent data corruption (SDC), persistent degradation (18-31% hotspot incidence), and device hangs. Critically, persistent corruption occurs even during idle states post-load, rendering load-time checks insufficient. Mitigation strategies are proposed at the application level without firmware modifications.
electromagnetic fault injectionneural compute stick 2silent data corruptionpersistent degradationopenvino
FastTab: A Fast Table Recognizer with a Tiny Recursive Module and 1D Transformers
FastTab introduces a grid-centric table structure recognition (TSR) model combining a Tiny Recursive Module (TRM) for global reasoning and axial 1D Transformer encoders for long-range row/column dependencies. It predicts row/column counts, headers, and separators to construct a grid, then infers rowspan/colspan using ROI-aligned cell features, avoiding autoregressive HTML decoding. Evaluated on PubTabNet, FinTabNet, PubTables-1M, and SciTSR, FastTab achieves competitive structure recovery with low-latency inference. It demonstrates robustness under pixel-level anonymization and extends to curved separators for camera-captured documents. Source code is available at https://github.com/hamdilaziz/FastTab.
table structure recognitiontiny recursive module1d transformersroi-aligned featuresautoregressive decoding
Diffusion-guided Generalizable Enhancer for Urban Scene Reconstruction
The paper introduces GenRe, a diffusion-guided generalizable enhancer for urban scene reconstruction that improves 3D Gaussian representations under challenging viewpoint shifts. The method distills generative priors from diffusion models across diverse scenes, enabling efficient (few-minute) enhancement of pretrained 3D representations without per-scene optimization. Experiments demonstrate superior quality and efficiency over existing methods, with reliable generalization to unseen viewpoints (e.g., lane changes) and benefits for autonomous driving sensor simulation tasks.
diffusion models3d gaussian representationurban scene reconstructionviewpoint synthesisautonomous driving
Towards Clinically Interpretable Ophthalmic VQA via Spatially-Grounded Lesion Evidence
The authors introduce FundusGround, a benchmark for clinically interpretable ophthalmic visual question answering (VQA) that incorporates spatially-grounded lesion evidence. Their three-stage pipeline collects 10,719 fundus images with 15,595 image-level annotated lesions, spatially localized using the Early Treatment Diabetic Retinopathy Study (ETDRS) grid for anatomical consistency. The benchmark includes 72,706 questions across four formats and evaluates models using dual metrics for answer accuracy and lesion-level reasoning, demonstrating that explicit spatial grounding improves both performance and interpretability.
visual question answeringophthalmic diagnosisspatial groundingetdrs gridlesion-level reasoning
DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA
DeferMem introduces a long-term memory framework for LLM agents that separates high-recall candidate retrieval from query-conditioned evidence distillation. The method employs a segment-link structure for raw history organization and DistillPO, a reinforcement learning algorithm that decomposes evidence distillation into message selection and rewriting, optimized via decomposed-and-gated rewards. Evaluated on LoCoMo and LongMemEval-S, DeferMem achieves superior QA accuracy and runtime efficiency without commercial-API token costs for memory operations.
long-term memoryevidence distillationreinforcement learningquery-conditionedsegment-link structure
Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings
Epicure introduces a family of three skip-gram ingredient embeddings trained on a multilingual recipe corpus of 4.14M recipes across seven languages. The method involves normalizing ingredient strings to 1,790 canonical entries using an LLM-augmented pipeline, constructing a 203,508-edge ingredient-ingredient NPMI graph and an 80,019-edge typed FlavorDB graph, and training three Metapath2Vec variants (Cooc, Chem, Core) with distinct random-walk schemas. These models explore the spectrum between chemistry and recipe context through controlled mixing of co-occurrence and compound metapaths.
skip-grammetapath2vecnpmi graphflavordbrandom-walk
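The NPMI edge weights mentioned for the ingredient graph can be illustrated with the standard normalized pointwise mutual information formula, NPMI(a, b) = PMI(a, b) / (-log p(a, b)). The sketch below assumes plain recipe-level co-occurrence counts; the function name and the toy recipes are hypothetical and are not taken from the Epicure pipeline.

```python
import math
from collections import Counter
from itertools import combinations

def npmi_edges(recipes):
    """Return {(a, b): npmi} for every ingredient pair that co-occurs at least once."""
    n = len(recipes)
    single = Counter()
    pair = Counter()
    for recipe in recipes:
        items = sorted(set(recipe))          # de-duplicate within a recipe
        single.update(items)
        pair.update(combinations(items, 2))  # keys are alphabetically ordered pairs
    edges = {}
    for (a, b), count_ab in pair.items():
        p_ab = count_ab / n
        if p_ab == 1.0:
            continue  # NPMI is undefined (0/0) when the pair is in every recipe
        pmi = math.log(p_ab / ((single[a] / n) * (single[b] / n)))
        edges[(a, b)] = pmi / -math.log(p_ab)
    return edges

# Toy corpus (hypothetical): two ingredient pairs that always appear together.
toy = [["tomato", "basil"], ["tomato", "basil"], ["rice", "soy"], ["rice", "soy"]]
edges = npmi_edges(toy)
```

Values lie in [-1, 1]: 1 means the two ingredients only ever appear together, 0 means they co-occur as often as independence would predict.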
Cross-Subject EEG Emotion Recognition Based on Temporal Asynchronous Alignment Contrastive Learning
The paper proposes Temporal Asynchronous Alignment-based Contrastive Learning (TA2CL) for cross-subject EEG emotion recognition, addressing temporal misalignment in responses via fine-grained local matching inspired by ColBERT's late interaction. The framework replaces global hard alignment with adaptive segment correlation, mitigating inter-subject variability. It achieves 64.5% (9-class) and 79.5% (binary) accuracy on FACED, 86.4% on SEED, and 70.1% on SEED-V, demonstrating robust generalization.
eeg emotion recognitioncontrastive learningtemporal alignmentcross-subject variabilitylate interaction
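To make the contrast between global hard alignment and late interaction concrete, here is a minimal sketch of ColBERT-style MaxSim matching applied to per-segment embeddings. It is an illustration of the general late-interaction idea only, not TA2CL itself; the function names and the toy vectors are hypothetical, and the score is averaged over segments here (ColBERT sums) so that trials of different lengths stay comparable.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def hard_aligned_score(segments_a, segments_b):
    """Global hard alignment: segment i of A is compared only with segment i of B."""
    sims = [cosine(a, b) for a, b in zip(segments_a, segments_b)]
    return sum(sims) / len(sims)

def late_interaction_score(segments_a, segments_b):
    """MaxSim: each segment of A is matched to its most similar segment of B."""
    best = [max(cosine(a, b) for b in segments_b) for a in segments_a]
    return sum(best) / len(best)

# Two subjects with the same response pattern, shifted by one segment (toy data).
subject_a = [[1.0, 0.0], [0.0, 1.0]]
subject_b = [[0.0, 1.0], [1.0, 0.0]]
```

On the toy data the position-wise score is 0 while the late-interaction score is 1, which is the kind of temporal asynchrony the summary describes.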
VeriScale: Adversarial Test-Suite Scaling for Verifiable Code Generation
VeriScale introduces a framework for constructing verifiable code generation benchmarks through adversarial test-suite scaling, addressing limitations in existing benchmarks' test-case quality and quantity. The method employs two stages: test-suite expansion to create diverse, challenging cases and reduction to distill compact yet discriminative suites. Instantiated on Verina, VeriScale produces VerinaPlus (83× test-suite expansion) and VerinaLite (14×), revealing significant performance drops on both SpecGen and CodeGen across eight LLMs while maintaining discriminative power at lower evaluation costs.
verifiable code generationadversarial test-suitetest-suite expansionllm evaluationformal verifiability
TimeGuard: Channel-wise Pool Training for Backdoor Defense in Time Series Forecasting
The paper introduces TimeGuard, a training-time defense against backdoor attacks in Time Series Forecasting (TSF) that addresses data entanglement and task-formulation shift. The method employs channel-wise pool training initialized via time-aware criteria to mitigate signal dilution, coupled with distance-regularized loss selection to progressively expand reliable training samples. Evaluations across multiple datasets, architectures, and attacks show TimeGuard improves robustness (1.96× MAE_P reduction) while maintaining clean performance (<5% MAE_C degradation), outperforming 13 baseline defenses.
backdoor defensetime series forecastingchannel-wise poolingdistance-regularized losssignal dilution
Scaling Observation-aware Planning in Uncertain Domains
The paper presents scalable techniques for solving decidable fragments of the Optimal Observability Problem (OOP) in uncertain domains, specifically the Sensor Selection Problem (SSP) and Positional Observability Problem (POP). It improves upon prior parameter synthesis methods and introduces a novel approach using POMDP decomposition to identify observation functions. The method achieves performance gains of 3 and 5 orders of magnitude in instance size and runtime, respectively, compared to the original approach.
optimal observability problempartially observable markov decision processsensor selection problempositional observability problemparameter synthesis
Incentive-Aligned Vehicle-to-Vehicle Energy Trading via Nash-Integrated Multi-Agent Reinforcement Learning
The paper proposes Nash-MADDPG, integrating Nash Bargaining Solution with Multi-Agent Deep Deterministic Policy Gradient for incentive-aligned vehicle-to-vehicle energy trading. The method combines bilateral pricing via Nash bargaining with Nash-guided price proximity rewards to align agent strategies. Evaluated over 30-day continuous operation, it achieves 61.6% higher social welfare, 62.9% greater trading volume than Double Auction, and 40.1% improved fairness (Jain's index), while maintaining stable pricing across 6-100 agents with continuous vehicle turnover.
vehicle-to-vehicle energy tradingnash bargaining solutionmulti-agent deep deterministic policy gradientsocial welfare optimizationdecentralized energy exchange
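The fairness figure above is reported with Jain's index, which for allocations x_1..x_n is (Σx)² / (n·Σx²). A short sketch of that formula follows; the example allocations are made up for illustration and are not results from the paper.

```python
def jain_index(values):
    """Jain's fairness index: 1.0 for a perfectly even split, 1/n when one agent gets everything."""
    total = sum(values)
    squares = sum(v * v for v in values)
    if squares == 0:
        raise ValueError("at least one allocation must be non-zero")
    return total * total / (len(values) * squares)

even_split = jain_index([1, 1, 1, 1])    # every vehicle receives the same benefit
winner_takes_all = jain_index([1, 0, 0, 0])
```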
VEELA: A Clinically-Constrained Benchmark for Liver Vessel Segmentation in Computed Tomography Angiography
The paper introduces VEELA, a clinically constrained benchmark for liver vessel segmentation in CTA, addressing limitations of existing datasets by providing 40 rigorously curated scans from the CHAOS grand-challenge cohort. Vessels were manually annotated slice-by-slice under multi-expert consensus, adhering to visibility-driven policies without anatomical interpolation, capturing anatomical variability and imaging uncertainty. The benchmark includes standardized evaluation metrics (clDice, IoU, NSD, area, length) to assess diverse aspects of vascular integrity, demonstrating the need for multi-perspective evaluation. VEELA is publicly available to support reproducible research in vascular segmentation.
liver vessel segmentationcomputed tomography angiographyclinical benchmarkmulti-expert consensustopology-aware metrics
TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation
TransitLM introduces a large-scale dataset of 13 million transit route planning records from four Chinese cities, comprising 120,845 stations and 13,666 lines, enabling map-free route generation. The dataset serves as a continual pre-training corpus and benchmark for three evaluation tasks, supporting complementary metrics. Experiments demonstrate that large language models trained on TransitLM achieve high accuracy in generating structurally valid routes and implicitly ground arbitrary GPS coordinates to appropriate stations without explicit mapping. This enables end-to-end, map-free transit route planning directly from origin-destination information.
transit route planningmap-free generationgps groundingcontinual pre-trainingbenchmark evaluation
Bernini: Latent Semantic Planning for Video Diffusion
Bernini introduces a unified framework for video generation and editing by combining multimodal large language models (MLLMs) for semantic planning and diffusion models for pixel rendering. The MLLM-based planner predicts target semantic representations in ViT embedding space, while a DiT-based renderer synthesizes pixels conditioned on this plan, augmented by text features and source VAE features for editing. Key innovations include Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE) and chain-of-thought reasoning in the planner. Bernini achieves state-of-the-art performance on video generation and editing benchmarks, leveraging pretrained MLLM understanding for strong generalization.
multimodal large language modelsdiffusion modelsvit embedding spacesegment-aware 3d ropechain-of-thought reasoning
Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators
The paper introduces Sibyl-AutoResearch, a self-evolving framework for autonomous research that addresses limitations in current systems by implementing Scientific Trial-and-Error Harnesses. These harnesses enable agents to conduct bounded trials, preserve outcomes (both positive and negative), and integrate lessons into subsequent research actions through two auditable conversion units: trial-to-behavior and trial-to-harness-behavior. The framework is implemented in SIBYL, a file-backed system that tracks state, roles, and artifact traces. A retrospective audit reveals eight high-confidence conversion events with median latency of one iteration, and a registry documents how five failure classes were mitigated. The system is available on GitHub.
autonomous researchtrial-and-error harnessesconversion unitsauditable tracesself-evolving systems
4D-GSW: Kinematic-Aware Spatio-Temporal Consistent Watermarking for 4D Gaussian Splatting
The paper introduces 4D-GSW, a kinematic-aware watermarking framework for 4D Gaussian Splatting (4DGS) that embeds copyright information while preserving spatio-temporal consistency. The method employs a Spatio-Temporal Curvature (STC) metric to identify Dynamic Instants, adaptively gating watermark gradient injection to avoid non-physical artifacts. A joint HMM-MRF energy minimization model synchronizes watermark phases across temporal trajectories and spatial neighborhoods, while anisotropic gradient routing decouples watermark embedding from photometric fidelity. Experiments show robust watermark hiding, resistance to attacks, and maintained rendering quality.
4d gaussian splattingkinematic-aware watermarkingspatio-temporal curvaturehmm-mrf modelanisotropic gradient routing
Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression
Meta-Soft introduces a dynamic KV cache compression framework for large language models, addressing memory inefficiency and context loss in long-sequence processing. The method employs a meta-library with a learnable orthogonal basis matrix and a Gumbel-Softmax selector network to synthesize context-specific Soft Tokens, which probe key information while preserving semantic context via attention-flow integration. This approach dynamically adapts to input prompts and redistributes information from evicted tokens to retained ones. Experimental results demonstrate Meta-Soft's superiority over existing state-of-the-art KV cache eviction methods.
kv cachesoft tokensgumbel-softmaxattention-flowmeta-library
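The selector described above relies on the Gumbel-Softmax relaxation, which turns a set of logits into differentiable, near-one-hot selection weights. The sketch below shows only that generic relaxation plus a weighted mix of basis rows; the basis values, the temperature, and both function names are assumptions for illustration and do not reproduce Meta-Soft's meta-library or selector network.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Sample relaxed one-hot weights: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)         # guard against log(0)
        gumbel = -math.log(-math.log(u))     # standard Gumbel(0, 1) sample
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)                        # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def mix_basis(weights, basis):
    """Weighted sum of basis rows, giving one synthesized vector."""
    dim = len(basis[0])
    return [sum(w * row[d] for w, row in zip(weights, basis)) for d in range(dim)]

rng = random.Random(0)
weights = gumbel_softmax([0.1, 2.0, -1.0], tau=0.5, rng=rng)
soft_token = mix_basis(weights, [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Lower temperatures push the weights toward a single basis row; higher temperatures blend several rows.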
SepsisAI Orchestrator: A Containerized and Scalable Platform for Deploying AI Models and Real-Time Monitoring in Early Sepsis Detection
SepsisAI Orchestrator is a containerized platform for deploying AI models in clinical settings, addressing deployment barriers through modular design. It integrates HL7 FHIR-inspired preprocessing, NoSQL storage, a LightGBM classifier (F1 0.87-0.94), and a Streamlit dashboard, orchestrated with Docker/Kubernetes. Load testing with 50-1000 concurrent users reveals an optimal replica count: running 12 replicas on a 12-thread CPU reduces p95 latency by 57.3% (3.3s to 1.41s) and eliminates failures, while over-provisioning degrades performance due to contention.
sepsislightgbmkuberneteslatencyscaling
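As a quick arithmetic check, the reported 57.3% figure follows from the two p95 latencies quoted in the summary. The helper name below is hypothetical.

```python
def relative_reduction(before, after):
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100.0

# p95 latency values quoted in the summary, in seconds
p95_reduction = round(relative_reduction(3.3, 1.41), 1)
```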
Benchmarking Autonomous Agents against Temporal, Spatial, and Semantic Evasions
This paper introduces a multi-dimensional evasion framework targeting LLM-based autonomous agents, addressing temporal, spatial, and semantic evasion techniques. The authors construct A3S-Bench, a benchmark comprising 2,254 real-world agent execution trajectories, to evaluate vulnerabilities across 10 mainstream LLM backbones and 20 threat scenarios. Results show that the evasion framework increases the average risk trigger rate from 28.3% to 52.6%, revealing systemic vulnerabilities in current agent systems. The findings underscore the need for tailored defense mechanisms against these advanced evasion strategies.
autonomous agentsevasion frameworkllm backbonesrisk trigger ratea3s-bench
ACCoRD: Actor-Critic Conflict Resolution with Deep learning for O-RAN xApps
The paper proposes ACCoRD, an Actor-Critic Conflict Resolution method for O-RAN xApps using a PPO-Clip-trained ANN to mitigate control decision conflicts in Near-Real Time RAN Intelligent Controllers. The CR Agent employs batch training with network feedback to optimize ANN weights, evaluated via a novel methodology on simulation data. Results demonstrate significant reduction in negative network events compared to rule-based approaches, particularly in medium/high traffic scenarios.
o-ranactor-criticconflict resolutionppo-clipxapps
Evaluation of Pipelines for Data Integration into Knowledge Graphs
The paper introduces KGI-Bench, a novel benchmark for evaluating knowledge graph (KG) integration pipelines, addressing the lack of standardized quality assessment methods. It proposes three metrics—coverage, correctness, and consistency—to analyze pipeline outputs, using movie-domain datasets with seed KGs, multi-format input data, and reference KGs. The benchmark's utility is demonstrated by evaluating 12 pipelines, revealing performance variations across input formats and design choices.
knowledge graphintegration pipelinesbenchmarkcoveragecorrectness
Cross-domain benchmarks reveal when coordinated AI agents improve scientific inference from partial evidence
The study introduces a cross-domain benchmark to evaluate when coordinated AI agents enhance scientific inference from partial evidence across four tasks: molecular sonification, paradigm-shift detection, disease emergence, and exoplanet vetting. Using frozen evaluation panels, predefined scoring protocols, and explicit baselines, it identifies three operational regimes: cross-channel composites improve performance when disciplines capture partial phenomena (AUROC 0.944-0.955), coordination aids interpretation when one signal dominates, and representational gains occur in sonification. The benchmark, supported by ScienceClaw x Infinite, validates coordination only when performance, provenance, or representation claims are empirically supported.
cross-domain benchmarkcoordinated ai agentsscientific inferenceaurocprovenance layer
One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
The paper introduces Layerwise Learning Rate (LLR), an adaptive scheme assigning distinct learning rates to Transformer layers based on Heavy-Tailed Self-Regularization (HT-SR) theory. LLR quantifies layerwise heavy-tailedness via the empirical spectral density of weight correlation matrices, assigning larger LRs to less heavy-tailed layers and smaller LRs to more heavy-tailed ones. Experiments on LLaMA and GPT-nano models (60M-1B parameters) with AdamW and Muon optimizers show a 1.5× training speedup and zero-shot accuracy gains (47.09% → 49.02%), with minimal tuning overhead because LR settings transfer from uniform baselines.
layerwise learning rateheavy-tailed self-regularizationtransformerempirical spectral densityzero-shot accuracy
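One way to picture the scheme is sketched below, under stated assumptions: heavy-tailedness is summarized by a Hill estimate of the power-law exponent α of a layer's eigenvalue spectrum (with the usual HT-SR convention that a smaller α means a heavier tail), and learning rates are scaled linearly between two multipliers. The estimator choice, the linear mapping, the multipliers, and all names are assumptions for illustration; the paper's exact procedure may differ.

```python
import math

def hill_alpha(eigenvalues, k):
    """Hill estimate of the power-law exponent from the k largest (positive) eigenvalues."""
    lam = sorted(eigenvalues, reverse=True)
    if not 0 < k < len(lam):
        raise ValueError("k must satisfy 0 < k < number of eigenvalues")
    # lam[k] is the (k+1)-th largest eigenvalue and serves as the tail threshold
    log_sum = sum(math.log(lam[i] / lam[k]) for i in range(k))
    return 1.0 + k / log_sum

def layerwise_lrs(alphas, base_lr, low=0.5, high=1.5):
    """Map each layer's alpha linearly onto [low, high] * base_lr (larger alpha, larger LR)."""
    a_min, a_max = min(alphas.values()), max(alphas.values())
    span = a_max - a_min
    lrs = {}
    for name, alpha in alphas.items():
        frac = 0.5 if span == 0 else (alpha - a_min) / span
        lrs[name] = base_lr * (low + (high - low) * frac)
    return lrs

# Hypothetical per-layer exponents; "attn" is the most heavy-tailed layer here.
lrs = layerwise_lrs({"attn": 2.0, "mlp": 4.0, "embed": 3.0}, base_lr=1e-3)
```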
SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules
SciCore-Mol introduces a modular framework to enhance LLMs' molecular reasoning capabilities through three pluggable cognitive modules: topology-aware perception, latent diffusion-based generation, and reaction-aware reasoning. These modules interface with the LLM backbone via learned representations, overcoming text-based information loss in molecular tasks. The 8B-parameter open-source system demonstrates competitive performance across molecular understanding, generation, reaction prediction, and general chemistry, outperforming proprietary models in some dimensions while providing a blueprint for scientific LLM augmentation.
molecular cognitionpluggable moduleslatent diffusiontopology-aware perceptionreaction prediction
EmoTrack: Robust Depression Tracking from Counseling Transcripts across Session Regimes
The paper introduces EmoTrack, a framework for robust depression severity (PHQ-8) prediction from counseling transcripts across single-session and multi-session regimes. The method combines LLM-extracted clinical signals with frozen turn-level semantic embeddings, trains symptom-specific predictors, and optionally incorporates cross-session memory for longitudinal context. Evaluated on the new LongCounsel dataset (multi-session with PHQ-8 labels) and DAIC-WOZ, EmoTrack achieves a 13.5% relative MAE reduction over the best DAIC-WOZ baseline in single-session mode while maintaining competitive multi-session performance.
phq-8 predictioncounseling transcriptslongitudinal contextsemantic embeddingscross-session memory
MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
MuKV introduces multi-grained KV cache compression for long streaming VideoQA, addressing redundancy in existing frame-level caching methods. The approach employs patch-, frame-, and segment-level visual representations with dual signal token compression guided by self-attention and frequency. A semi-hierarchical retrieval mechanism enhances online QA efficiency. Experiments demonstrate significant accuracy improvements on long-streaming VideoQA benchmarks, with consistent gains in memory efficiency and QA performance. The compression mechanism alone outperforms baselines across all metrics.
kv-cachestreaming videoqamulti-grained compressionself-attentionsemi-hierarchical retrieval
Impact of Atmospheric Turbulence and Pointing Error on Earth Observation
The paper introduces an enhanced image simulator incorporating vertical-path atmospheric turbulence and satellite pointing jitter to generate physically realistic distortions for Earth Observation (EO) imagery. Using YOLOv8 and RetinaNet, vessel detection performance was evaluated under varying turbulence and jitter conditions. Results indicate YOLOv8 recall drops from 91% under ideal conditions to 60% with weak turbulence and below 40% with strong turbulence or jitter, while RetinaNet maintains approximately 75% recall across degraded conditions. The findings underscore the necessity of integrating realistic physical degradations into EO training datasets to enhance AI model robustness in operational environments, particularly for maritime surveillance.
atmospheric turbulencepointing jitterearth observationvessel detectionimage simulator
Detecting Atypical Clients in Federated Learning via Representation-Level Divergence
The paper introduces a geometric method for detecting atypical clients in federated learning by measuring representation-level divergence. The approach quantifies functional deviation through activation-induced input space partitioning on a shared probe set, yielding a permutation-invariant metric that distinguishes benign heterogeneity from harmful divergence. Results demonstrate effectiveness in identifying clients causing atypical functional changes, enabling risk-aware aggregation without parameter or gradient comparisons.
federated learningrepresentation divergenceactivation partitioningrisk-aware aggregationnon-iid detection
Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning
The paper introduces Direction-Adaptive Self-Distillation (DASD), a method that improves on-policy self-distillation (OPSD) for LLM reasoning by adapting teacher supervision direction based on token-level entropy. High-entropy tokens are pushed away from privileged teacher outputs to preserve exploration, while low-entropy tokens are pulled toward teacher outputs to stabilize execution. Evaluated on six mathematical reasoning benchmarks, DASD achieves superior macro Avg@16 performance over RLVR and self-distillation baselines, maintaining exploration without compromising step-level accuracy.
on-policy self-distillationtoken-level entropydirection-adaptive supervisionmathematical reasoningexploration preservation
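The entropy-based switch described above can be sketched as follows: compute the Shannon entropy of the student's next-token distribution and pick a supervision direction from it. The fixed threshold and the ±1 encoding are assumptions made for this illustration; the summary does not state how DASD sets the boundary between high- and low-entropy tokens.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def supervision_direction(probs, threshold):
    """Return +1 to pull the token toward the teacher, -1 to push it away."""
    return -1 if token_entropy(probs) > threshold else 1

uncertain = [0.25, 0.25, 0.25, 0.25]     # high entropy: keep exploring
confident = [0.97, 0.01, 0.01, 0.01]     # low entropy: stabilize toward the teacher
directions = [supervision_direction(p, threshold=1.0) for p in (uncertain, confident)]
```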
What are the Right Symmetries for Formal Theorem Proving?
The paper introduces rewriting categories, a category-theoretic framework to formalize symmetries in formal theorem proving, specifically proof equivariance and success invariance. It demonstrates that state-based next-tactic provers inherently satisfy proof equivariance, while LLM-based provers fail to respect these symmetries, leading to performance variability. The authors propose test-time methods to aggregate equivalent rewritings, theoretically recovering success invariance and empirically improving robustness. Results indicate symmetry as a critical missing inductive bias in LLM-based theorem proving, with test-time computation offering a practical solution.
rewriting categoriesproof equivariancesuccess invariancellm-based proverstest-time computation
Explainable AI for Data-Driven Design of High-Dimensional Predictive Studies
The study introduces an Explainable AI Recommender system to enhance predictive modeling in clinical datasets by providing data-driven recommendations for feature selection, non-linear terms, and interactions. Using a Cox Proportional Hazards model on 245,614 patients, the method improved the C-index from 0.805 to 0.815 and calibration metrics. The framework recommended excluding 23 features, adding non-linear terms for two features, and incorporating 221 interactions, all validated by literature. The approach also demonstrated efficacy on two additional public datasets, showcasing its potential for transparent, high-dimensional predictive modeling.
explainable aicox proportional hazardsfeature selectionpredictive modelingclinical decision-making
Unlocking Proactivity in Task-Oriented Dialogue
The paper introduces a method to enhance proactivity in task-oriented dialogue systems by modeling user concerns as latent variables. It proposes the Cognitive User Simulator, which generates diverse interactions while tracking persuasion state dynamics, and Simulator-Induced Asymmetric-View Policy Optimization, combining concern-aware behavior distillation with state-transition policy refinement. Results demonstrate that conditioning on latent concerns enables proactive dialogue strategies unreachable through standard RL approaches.
proactive dialoguelatent concernsuser simulatorpolicy optimizationstate dynamics
Evaluating Large Language Models as Live Strategic Agents: Provider Performance, Hybrid Decomposition, and Operational Gaps in Timed Risk Play
The study evaluates large language models (LLMs) as live strategic agents in a timed multi-phase Risk environment, revealing performance differences in end-to-end systems versus isolated planning. Using a 32-game cross-provider championship with frozen rules, Gemini-3.1-Pro-Preview outperformed GPT-5.1, Claude-Opus-4-7, and Kimi-K2.6 (20/32 wins, p ≈ 1.5×10^-5). Hybrid decomposition showed near-equal planning performance (p ≈ 0.821) when execution was standardized on Gemini Flash, indicating provider spread stems from system behavior. Traces revealed Gemini's superior objective tracking and execution conversion, emphasizing workflow evaluation over static benchmarks.
large language modelsstrategic agentshybrid decompositionexecution conversionobjective tracking
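For readers who want a feel for how unlikely 20 wins in 32 games would be by chance, a one-sided binomial tail is one simple reference point. The sketch assumes each of the four providers is equally likely to win any game (p = 0.25); that null hypothesis is an assumption here, the summary does not say which test the authors ran, and the result need not match the reported p-value.

```python
import math

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# 20 or more wins out of 32 games under an equal-chance null across four players
tail = binomial_tail(32, 20, 0.25)
```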
Can Transformers Learn to Verify During Backtracking Search?
The paper identifies two failure modes in decoder-only transformers trained on cumulative traces for backtracking search: scattered retrieval (state features distributed across positions) and history entanglement (conditioning on trajectory rather than state). It proposes localization (trace-level rewriting) for scattered retrieval and Selective State Attention (SSA, a fixed attention mask) for history entanglement. Evaluated on 3-SAT, graph coloring, Blocks World, and backtracking parsing, SSA achieves identical decisions for same-state pairs where baseline models diverge. The work provides diagnostic tools and structural fixes for transformer-based search systems.
backtracking searchselective state attentionhistory entanglementscattered retrievalreactive verification
SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval
The paper introduces SGR-Bench, a benchmark for state-gated retrieval (SGR) tasks requiring site-specific state configuration to access answer-bearing evidence. It contains 100 expert-curated tasks across six source families and 12 public data ecosystems, comparing constraint-guided and goal-oriented formulations. Evaluation of eight CLI-based LLM agents and three commercial search products shows a top performance of 66.18% item-level F1, with retrieval-scope drift (37.2%) and criterion mismatch (27.6%) as primary failure modes. The dataset is publicly available.
state-gated retrievalcli-based agentsretrieval-scope driftcriterion mismatchf1 score
Towards a compositional semantics for quantitative confidence assessment in assurance arguments
The paper introduces a compositional semantics for quantitative confidence assessment in assurance arguments, addressing the lack of operational semantics in notations like Goal Structuring Notation (GSN). Leveraging Subjective Logic (SL), the method models argument elements as SL opinions and maps relations to SL operators, enabling confidence propagation through a network. The approach provides explicit warrants, context-aware handling, provenance preservation, and GSN compatibility, demonstrated via an exemplary confidence assessment.
assurance argumentssubjective logicconfidence propagationgoal structuring notationcompositional semantics
CLORE: Content-Level Optimization for Reasoning Efficiency
CLORE introduces content-level optimization for efficient reasoning in large language models, addressing limitations of length-focused methods. The framework edits correct on-policy rollouts via an augmentation model that deletes repetitive, illegible, or task-irrelevant content while preserving answers, optimized with an auxiliary reference-free DPO objective. Experiments on DeepSeek-R1-Distill-Qwen-7B and Qwen2.5-Math-7B across five benchmarks demonstrate improved accuracy-efficiency trade-offs and compatibility with existing methods like GRPO and DAPO. Content analyses confirm reductions in repetitive reasoning and superfluous post-answer exploration.
content-level optimizationreasoning efficiencyon-policy rolloutsreference-free dpoaccuracy-efficiency trade-off
Temporal Coding as a Substrate for Sensorimotor Object Inference: A Spiking Reinterpretation of Thousand Brains Architecture
This work proposes temporal coding as a biologically plausible alternative to dense vector representations in the Thousand Brains Theory's sensorimotor object recognition framework. The method replaces static feature vectors with rank-order spike packets, where activation timing encodes spatial relationships during sensor traversal, coupled with STDP-based directional learning and adaptive evidence accumulation (parameter λ). Synthetic experiments demonstrate perfect discrimination on spatially variant objects (vs. chance performance with dense vectors), 30-50% robustness advantages under noise, and λ convergence reflecting object geometry. Implementation requires ~450 LoC in NumPy.
temporal codingthousand brains theoryrank-order spikingstdpsensorimotor inference
Skill Weaving: Efficient LLM Improvement via Modular Skillpacks
SkillWeave introduces a modular framework for efficient LLM specialization under fixed memory constraints by partitioning model capabilities into domain-specific skillpacks. The method employs SkillZip to compress these modules into inference-ready formats, maintaining multi-domain performance while reducing latency. Evaluations show a 9B SkillWeave model outperforms baselines and matches a 32B monolithic LLM on multi-task benchmarks, achieving up to 4x speedup.
skillpacksmodular improvementdelta modulesinference-readymulti-domain performance
OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025
The OSS Challenge benchmarks vision-based skill assessment in open surgery through a MICCAI-hosted competition spanning 2024-2025. The dataset includes GoPro-recorded suturing videos with instrument trajectories, evaluated on three tasks: skill classification (4 classes), OSATS score prediction (8 categories), and hand/tool tracking. Top-performing solutions employed general-purpose spatiotemporal video models, though hybrid and tracking-based approaches showed competitive results. While OSATS prediction improved with more data, keypoint tracking faced challenges from occlusions and out-of-frame instances. The study establishes baselines for open surgical skill assessment while identifying limitations in motion-based analysis.
surgical skill assessmentopen surgeryspatiotemporal modelsosats predictioninstrument tracking
Action with Visual Primitives
The paper introduces AVP (Action with Visual Primitives), a Vision-Language-Action (VLA) model that decouples cognitive-perceptual reasoning from motor control by using visual-primitive tokens. The architecture employs a VLM to infer target states and emit visual-primitive tokens, which condition a flow-matching action expert trained with end-effector kinematics supervision. Real-robot experiments demonstrate AVP achieves a 27.61% higher success rate than π0.5 on pick-and-place tasks, with improved data efficiency, spatial-compositional generalization, and object-level transfer compared to baseline methods.
vision-language-actionvisual primitivesflow-matchingend-effector kinematicsspatial-compositional generalization
LLM-Metrics: Measuring Research Impact Through Large Language Model Memory
The paper introduces LLM-Metrics, a novel research-impact assessment method leveraging large language models' parametric memory. It hypothesizes that high-impact papers leave stronger imprints in LLM training data due to greater academic exposure. The authors designed four probe types (title/author/method/venue recognition) and evaluated 549 CS papers across 17 LLMs (0.5B-72B parameters). Results showed weak but statistically significant correlations with citations (ρ=0.1495, p=0.0004), strongest for recent papers (ρ=0.1880) and for author recognition. Smaller models like Llama-3.2-3B outperformed larger ones, suggesting selective memory effects. The method provides citation-independent, real-time impact measurement.
parametric memoryspearman correlationllm probescitation-independentselective-memory hypothesis
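The correlations above are rank correlations (the item is tagged with Spearman correlation). As a reminder of what that statistic measures, here is a small dependency-free sketch: rank both variables, averaging ranks over ties, then take the Pearson correlation of the ranks. The probe scores and citation counts in the example are invented for illustration.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented probe-recognition scores and citation counts for four papers
rho = spearman([0.2, 0.5, 0.7, 0.9], [3, 40, 55, 900])
```

Because only ranks matter, the very large citation count in the last pair does not dominate the statistic.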
SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?
The paper introduces SWE-Mutation, a benchmark for evaluating LLM-generated test suites in software engineering, addressing the scarcity of high-quality test suites for program repair and reinforcement learning. The method employs systematically mutated solutions to assess test suite discriminative power, featuring 2,636 variants from 800 instances across nine languages. Experiments on seven LLMs, including DeepSeek-V3.1, reveal low verification (10.20%) and detection (36.15%) rates, with an agentic mutation strategy further reducing detection rates from 71.04% to 39.81%, exposing LLM limitations in generating reliable test suites.
test suitesprogram mutationllm evaluationsoftware engineeringmultilingual benchmarks
Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability
The paper introduces Synergistic Faithfulness ($\mathcal{F}_{syn}$), a novel metric for evaluating Vision-Language Model (VLM) explainability that isolates cross-modal interactions by computing the joint Harsanyi dividend between modalities. The method leverages the Shapley Interaction Index to achieve a 24× speedup while maintaining high accuracy (ρ=0.92) as a surrogate measure. Evaluation of 8 XAI methods across 3 VLMs and 3 datasets reveals that current VLM explainers overemphasize visual salience and underperform attention-based methods in capturing true modality synergy, exposing a fundamental contradiction (Kendall's τ=−0.06) in unimodal evaluation paradigms.
vision-language models · shapley interaction index · cross-modal synergy · harsanyi dividend · explainable ai
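For readers unfamiliar with the Harsanyi dividend, the sketch below computes it for a coalition of two players, here standing in for the image and text modalities. It uses the standard inclusion-exclusion definition. It is not the paper's metric, and the value table is an invented example of model confidence under each masking condition.

```python
from itertools import combinations


def harsanyi_dividend(value, coalition):
    """Inclusion-exclusion sum over all subsets T of the coalition S:
    sum of (-1)^(|S| - |T|) * value(T)."""
    players = tuple(coalition)
    total = 0.0
    for size in range(len(players) + 1):
        sign = (-1) ** (len(players) - size)
        for subset in combinations(players, size):
            total += sign * value(frozenset(subset))
    return total


# Hypothetical model confidence with each combination of modalities visible.
scores = {
    frozenset(): 0.10,
    frozenset({"image"}): 0.30,
    frozenset({"text"}): 0.25,
    frozenset({"image", "text"}): 0.80,
}

# v(both) - v(image) - v(text) + v(none): the part of the score that
# neither modality explains on its own.
synergy = harsanyi_dividend(scores.__getitem__, {"image", "text"})
```

With these toy numbers the dividend is 0.80 - 0.30 - 0.25 + 0.10 = 0.35. In a game with exactly two players, the Shapley interaction index of the pair equals this dividend.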
Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
The paper introduces Life-Harness, a runtime interface adaptation method for deterministic LLM agents that improves performance without modifying model weights. By converting recurring interaction failures into reusable interventions across environment contracts, procedural skills, action realization, and trajectory regulation, the harness evolves from training trajectories. Evaluated on seven deterministic environments from τ-bench, τ²-bench, and AgentBench, Life-Harness improves 116 out of 126 model-environment settings across 18 model backbones, achieving an average relative improvement of 88.5%. The harness transfers to 17 other models, demonstrating its capture of environment-side structure rather than model-specific behavior.
life-harness · runtime harness · deterministic llm agents · environment contracts · trajectory regulation
ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs
ST-SimDiff introduces a training-free framework for efficient video understanding in Multimodal Large Language Models (MLLMs) by balancing spatiotemporal similarity and difference. The method constructs a spatio-temporal graph from visual tokens, then applies a parallel dual-selection strategy: similarity-based selection via community detection for static information compression, and temporal difference-based selection for key dynamic shifts. Experiments demonstrate superior performance over state-of-the-art approaches with reduced computational costs.
multimodal large language models · spatio-temporal graph · community detection · visual tokens · computational efficiency
One-Way Policy Optimization for Self-Evolving LLMs
The paper introduces One-Way Policy Optimization (OWPO), a novel method for stabilizing reinforcement learning with verifiable rewards in large language models. OWPO decouples optimization direction from update magnitude, using verifier-determined direction while adjusting magnitude via reference policy. It employs asymmetric reweighting: Accelerated Alignment for inferior deviations and Gain Locking for superior deviations, creating a Ratchet Effect through iterative reference updates. Experiments show OWPO outperforms baselines (DAPO, OPD, MOPD) by enabling continuous self-evolution without fixed priors or external references.
reinforcement learning · verifiable rewards · asymmetric reweighting · ratchet effect · self-evolution
IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents
IdleSpec introduces a scalable inference approach for LLM agents that exploits idle time during environment interactions by generating and aggregating plan candidates. The method employs complementary drafting strategies (progressive and recovery) sampled from a learned distribution updated via posterior feedback. Experiments show IdleSpec improves agent performance by 5.1% on GAIA/FRAMES (55.6% accuracy with Gemini-2.5-Flash) and achieves up to 9.1% gains on MLE-Bench's Any Medal rate, demonstrating efficacy in long-horizon tasks with computational delays.
speculative planning · idle-time computation · multi-step reasoning · posterior feedback · drafting strategies
Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents
The paper introduces Ratchet, a self-evolving LLM agent framework that addresses skill lifecycle management through four hygiene mechanisms: outcome-driven retirement, bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. Evaluated on MBPP+ hard-100 and SWE-bench, Ratchet improves pass@1 from 0.258 to 0.584 (peak 0.658) and demonstrates transferability to agentic solvers (+0.22 peak lift). Ablations reveal retirement and meta-skill authoring as critical components, while canonicalisation is subsumed by meta-skills. The system prevents performance drift below baseline levels.
self-evolving · skill lifecycle · hygiene mechanisms · outcome-driven retirement · meta-skill authoring
Short-Term-to-Long-Term Memory Transfer for Knowledge Graphs under Partial Observability
The paper introduces a neuro-symbolic value-based approach for short-term-to-long-term memory transfer in partially observable reinforcement learning with temporal knowledge graphs. The method employs per-item Q-learning with shared parameters and temporal-difference updates to decide whether to retain or discard observed triples before long-term storage. Evaluated on RoomKG with long-term memory capacity 128, the approach outperforms symbolic and neural baselines (including LSTM/Transformer variants), with analysis showing selective retention of navigation- and query-relevant facts while discarding lower-value candidates.
partial observability · knowledge graph · q-learning · neuro-symbolic · memory transfer
Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
The paper proposes SR$^2$AM, a self-regulated agentic reasoning framework that decomposes decision-making into three systems: simulative reasoning (System II) for future-state prediction, self-regulation (System III) for adaptive planning control, and reactive execution (System I). The method implements these as distinct stages in an LLM's chain-of-thought, with versions v0.1 (8B) and v1.0 (30B) trained via supervised and reinforcement learning. Results show v1.0-30B matches performance of 685B-1T parameter systems while using 25.8-95.3% fewer reasoning tokens, with RL increasing planning horizon by 22.8% without significant frequency increase.
agentic reasoning · simulative planning · self-regulation · chain-of-thought · reinforcement learning
Atom-level Protein Representation Learning Improves Protein Structure Prediction
The paper introduces TriProRep, a structure-aware protein representation learning method that jointly models amino-acid identity, backbone geometry, and local full-atom geometry via VQ-VAE tokenizers. Pretraining involves recovering original tokens from corrupted views to distinguish plausible augmentations. The authors also propose RepSP, a benchmark evaluating representations in structure-predictive tasks like homodimer co-folding and interaction property prediction. TriProRep outperforms sequence-only and prior structure-aware models on RepSP while maintaining competitive performance on conventional benchmarks.
protein representation learning · vq-vae · structure prediction · homodimer co-folding · backbone geometry
Adversarial Trust Poisoning in Vehicular Collaborative Perception
The paper introduces TrustFlip, a novel adversarial attack targeting consistency-based defenses in vehicular collaborative perception (CP) systems. By deploying physical adversarial objects that induce genuine but inconsistent observations among benign vehicles, the attack weaponizes the defense mechanism to falsely degrade trust scores of targeted vehicles. Evaluations across multiple CP architectures show state-of-the-art defenses are vulnerable, with targeted vehicles excluded in 87.7% of scenarios and Average Precision dropping by 13%. The authors propose TrustReflect, a lightweight mitigation that excludes disputed regions from trust evaluation, reducing attack success by 35-100%.
adversarial attack · collaborative perception · trust poisoning · consistency-based defense · autonomous vehicles
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
The paper introduces Grounded Personality Reasoning (GPR), a novel task requiring Multimodal Large Language Models (MLLMs) to justify Big Five personality ratings with observable evidence. It presents MM-OCEAN, a dataset of 1,104 videos with 5,320 multiple-choice questions featuring timestamped behavioral cues and evidence-grounded analyses. Benchmarking 27 MLLMs reveals a Prejudice Gap: 51% of correct ratings lack grounding in retrieved cues, with Holistic-Grounding Rates ranging from only 0% to 33.5%, highlighting deficiencies in grounded social cognition.
grounded personality reasoning · big five · multimodal large language models · prejudice gap · holistic-grounding rate
ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning
ArborKV introduces a structure-aware KV cache management framework for scaling tree-based LLM reasoning, addressing the memory bottleneck in Tree-of-Thoughts (ToT) inference. The method combines a lightweight value estimator with a tree-aware allocation policy, enabling token-extractive eviction and lazy rehydration to support backtracking. Experiments demonstrate up to ~4x peak KV-memory reduction while maintaining near-full-retention accuracy, facilitating larger search configurations under fixed hardware constraints.
kv cache · tree-of-thoughts · memory optimization · llm reasoning · eviction policy
ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling
ExComm introduces a communication protocol for error-resilient agentic test-time scaling by detecting and resolving cross-agent factual conflicts during exploration. The method employs periodic belief state audits, tool-based verification loops, and soft belief updates to correct errors while maintaining trajectory diversity via a dedicated diversification module. Evaluations on AIME 2024/2025 and GAIA benchmarks with Gemini-2.5-Flash-Lite and Qwen3.5-4B show 5.7% and 5.0% average gains over baselines, demonstrating improved error recovery, scaling behavior, and diversity.
agentic test-time scaling · error propagation · belief state auditing · soft belief updates · trajectory diversification
MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing
The authors introduce MPDocBench-Parse, a benchmark for evaluating multi-page document parsing in realistic scenarios, addressing gaps in existing single-page or task-specific benchmarks. The dataset comprises 433 manually annotated documents (3,246 pages) across 15 English/Chinese document types, with evaluation protocols for content fidelity (text/table/formula recognition, truncation merging) and logical structure (reading order, heading hierarchy). Experiments reveal current models excel at basic text extraction but struggle with semantic continuity (12.4% error rate), visual content parsing (18.7% F1 drop), and hierarchical recovery (22.3% accuracy decline).
document parsing · multi-page benchmark · semantic continuity · hierarchical structure · visual content preservation
TextTeacher: What Can Language Teach About Images?
TextTeacher introduces an auxiliary training objective that leverages frozen text embeddings from a language model to enhance vision models without inference overhead. The method projects image captions into semantic anchors via a lightweight projection layer, guiding ViT representations during training while preserving pure-vision inference. On ImageNet, TextTeacher improves ViT accuracy by up to +2.7 percentage points, outperforms vision distillation (+1.0 p.p. average transfer gain), and reduces training time by 33% at comparable accuracy. Analysis reveals it acts as a feature-space preconditioner, shaping deeper layers early in training. The approach adds minimal overhead and avoids multimodal training.
platonic representation · semantic anchors · feature-space preconditioner · vision-language transfer · vit backbones
Not Yet: Humans Outperform LLMs in a Colonel Blotto Tournament
The study compares human and LLM performance in strategic Colonel Blotto games through three round-robin tournaments. Humans (n=200) and LLMs submitted strategies, with humans employing better-calibrated intermediate-level heuristics and outperforming LLMs' stereotyped approaches. Results show strategic sophistication benefits only at optimal reasoning depth, with STEM-background humans performing slightly better. Surprisingly, humans did not adjust strategies against LLMs, treating them similarly to human opponents. The work highlights LLMs' current limitations in complex strategic reasoning.
colonel blotto game · strategic sophistication · round-robin tournament · reasoning depth · nash equilibria
Knowledge Graph Re-engineering Along the Ontological Continuum (extended version)
The paper introduces the ontological continuum, a theoretical framework for characterizing knowledge graph (KG) engineering practices along two orthogonal dimensions: semantics vs pragmatics and properties vs affordances. This framework, derived from empirical observation of real-world KG practices, provides a vocabulary to describe, compare, and transform KGs across diverse modeling approaches, from lightweight vocabularies to richly axiomatized ontologies. The authors ground their vision through a case study on provenance knowledge and identify five open research challenges, positioning the ontological continuum as a shared research agenda for KG re-engineering, particularly in neuro-symbolic AI and GenAI contexts.
knowledge graph · ontological continuum · neuro-symbolic ai · formal concept analysis · provenance knowledge
A Camera-Cooperative ISAC Framework for Multimodal Non-Cooperative UAVs Sensing
The paper proposes a Camera-Cooperative ISAC (CC-ISAC) framework for multimodal sensing of non-cooperative UAVs, addressing resource contention in Integrated Sensing and Communication systems. The method combines coarse-grained camera monitoring with fine-grained ISAC sensing, featuring a Vision-to-Echo Data Alignment (V2EDA) model for cross-modal feature alignment and a Multimodal Fusion-Based Estimation (MMFE) model for state estimation. Evaluations on DeepSense 6G show 71% reduced beam steering overhead and 1.69-11.15% lower tracking overhead while maintaining angular estimation accuracy.
integrated sensing and communication · multimodal fusion · non-cooperative uavs · beam steering · cross-attention mechanisms
LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model
LVDrive introduces a Latent Visual representation enhanced Vision-Language-Action (VLA) framework for autonomous driving, addressing limitations of sparse action supervision and pixel-level reconstruction in existing VLAs. The method integrates future scene prediction into the VLA paradigm, learning high-level latent representations under auxiliary supervision from a pretrained vision backbone. It jointly models future scene and motion prediction in a unified embedding space, processed in a single forward pass, and employs a two-stage trajectory decoding strategy. Experiments on Bench2Drive show LVDrive outperforms action-supervised and image-reconstruction-based methods in closed-loop driving performance.
vla · latent visual representation · future scene prediction · trajectory decoding · bench2drive
JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation
The authors introduce JMed48k, a Japanese medical licensing benchmark comprising 48,862 exam questions and 20,142 images from 11 national licensing exams (2005-2025), annotated under an 8-type visual taxonomy. They derive JMed48k-Eval, a 12,484-question subset (9,905 text-only, 2,579 with images), and evaluate 21 models via text-only, with-image, and paired image-removal audits. Results show proprietary and open-source models benefit substantially from images (+5.7 to +39.8 points across professions), while medical-specific models exhibit limited visual evidence use, with correct answers persisting post-image removal.
vision-language models · medical licensing benchmark · image-removal audit · profession-stratified evaluation · japanese healthcare
Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding
The authors propose ST-GridPool, a training-free method to enhance visual token representations for Video Large Language Models (Video LLMs). The approach combines Pyramid Temporal Gridding (PTG) for multi-grained spatiotemporal interactions and Norm-based Spatial Pooling (NSP) to preserve semantically rich regions based on token norms. Evaluations across multiple benchmarks show consistent performance improvements without retraining. The method offers a plug-and-play solution for efficient visual token compression in Video LLMs.
visual token enhancement · spatiotemporal interactions · pyramid temporal gridding · norm-based spatial pooling · video llms
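The summary says NSP preserves regions based on token norms but does not spell out the pooling rule. As a loose stand-in, the sketch below keeps the k tokens with the largest L2 norm and preserves their original order. This is an assumed simplification for illustration, not the paper's NSP, and the function names are hypothetical.

```python
import math


def l2_norm(vector):
    """Euclidean length of a token embedding."""
    return math.sqrt(sum(x * x for x in vector))


def keep_high_norm_tokens(tokens, k):
    """Keep the k tokens with the largest L2 norm, in their original order."""
    by_norm = sorted(range(len(tokens)), key=lambda i: l2_norm(tokens[i]), reverse=True)
    kept_indices = sorted(by_norm[:k])  # restore spatial order
    return [tokens[i] for i in kept_indices]


# Toy 2-D "token embeddings" (assumed values).
frame_tokens = [[0.1, 0.0], [3.0, 4.0], [1.0, 1.0], [0.0, 2.0]]
compressed = keep_high_norm_tokens(frame_tokens, k=2)
```

Here the norms are 0.1, 5.0, about 1.41, and 2.0, so the second and fourth tokens survive.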
From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning
The paper introduces SCRL (Subproblem Curriculum Reinforcement Learning), a curriculum RL framework that improves credit assignment in LLM reasoning by deriving verifiable subproblems from reference chains. The method employs subproblem-level normalization to independently reward progress at each reasoning step, enabling finer-grained credit assignment without external rubrics. Evaluations across seven mathematical reasoning benchmarks show SCRL outperforms baselines, with +4.1 accuracy improvement on Qwen3-4B-Base and +3.7 pass@1 gains on AIME24/25/IMO-Bench, demonstrating enhanced exploration on hard problems.
reinforcement learning · credit assignment · curriculum learning · mathematical reasoning · llm
Echo4DIR: 4D Implicit Heart Reconstruction from 2D Echocardiography Videos
Echo4DIR introduces a 4D implicit reconstruction framework for cardiac geometry from 2D echocardiography, addressing geometric ambiguity and temporal discontinuity. The method combines statistical shape models (SSMs) with a cardiac conditional SDF, an Epipolar Mask Encoder for multi-view feature fusion, and self-supervised SDF-tailored rendering for domain adaptation. A Radial SDF Alignment ensures 4D continuity by locking shape evolution to velocity fields. Results show state-of-the-art performance, achieving 98.35% Dice and 96.75% IoU on clinical datasets.
echocardiography · implicit reconstruction · statistical shape models · differentiable rendering · epipolar attention
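Dice and IoU, the two overlap metrics quoted for Echo4DIR, can be computed from a predicted and a ground-truth mask as in the sketch below. The masks are toy sets of pixel coordinates, and the function name is hypothetical.

```python
def dice_and_iou(predicted, target):
    """Overlap metrics for two masks given as collections of pixel or voxel coordinates."""
    predicted, target = set(predicted), set(target)
    intersection = len(predicted & target)
    union = len(predicted | target)
    if union == 0:
        return 1.0, 1.0  # two empty masks agree perfectly
    dice = 2 * intersection / (len(predicted) + len(target))
    iou = intersection / union
    return dice, iou


# Toy masks (assumed): three pixels each, two of them shared.
pred_mask = {(0, 0), (0, 1), (1, 1)}
true_mask = {(0, 1), (1, 1), (1, 0)}
dice, iou = dice_and_iou(pred_mask, true_mask)
```

For a single pair of masks the two metrics are tied by Dice = 2·IoU / (1 + IoU). The reported figures fit this relation, since 2 × 0.9675 / 1.9675 ≈ 0.9835, although averages over a dataset need not satisfy it exactly.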
Safeguarding Text-to-Image Generative Models Against Unauthorized Knowledge Distillation
WaveGuard introduces a defense mechanism against unauthorized knowledge distillation in closed-weight text-to-image generative models. The framework employs frequency-aware perturbations to protect synthetic outputs while maintaining visual fidelity, controlled by a user-specified budget. Evaluations demonstrate its efficacy in balancing protection strength, perceptual quality, and computational efficiency, particularly in WikiArt-related distillation scenarios.
waveguard · knowledge distillation · frequency-aware perturbation · synthetic images · protection efficiency
Prototype-Guided Classification Sub-Task Decoupling Framework: Enhancing Generalization and Interpretability for Multivariate Time Series
The paper introduces PDFTime, a prototype-guided framework for multivariate time series classification that decouples feature extraction and decision logic into multi-stage sub-tasks. By approximating class-conditional feature distributions via learned prototypes in latent space, the method enables progressive discrimination through granular similarity-based reasoning. Evaluations on UEA and UCR benchmarks show state-of-the-art performance, achieving top-1 accuracy on 80/128 UCR datasets while improving interpretability over traditional feature-to-label mapping approaches.
prototype-guided · time series classification · multi-stage decision · similarity-based reasoning · uea/ucr benchmarks
LABO: LLM-Accelerated Bayesian Optimization through Broad Exploration and Selective Experimentation
The paper introduces LABO, a framework combining LLM predictions with Bayesian optimization (BO) to enhance sample efficiency. LABO uses a gating criterion to dynamically balance LLM-based exploration and experimental validation, leveraging low-cost LLM evaluations for broad search space exploration while reserving costly experiments for high-uncertainty regions. Theoretical analysis provides a cumulative regret bound quantifying efficiency gains. Empirical results across diverse tasks show LABO outperforms existing methods under identical budgets, demonstrating its utility in scientific discovery workflows.
bayesian optimization · llm predictions · sample efficiency · regret bound · scientific discovery
Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support
The study introduces an OSCE-inspired standardized patient simulator and benchmark for evaluating active diagnostic inquiry in large language models (LLMs), addressing the gap between static medical examinations and iterative clinical evidence gathering. Using a protocol with 468 cases across 15 models, the authors find that multi-turn evidence seeking reduces diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% compared to full-context evaluation, with errors linked to premature closure and inefficient questioning. These results highlight the limitations of static benchmarks and advocate for interactive assessments in clinical decision support systems.
large language models · clinical decision support · diagnostic reasoning · evidence-seeking · interactive evaluation
Secure and Parallel Determinant Computation for Large-Scale Matrices in Edge Environments
The paper proposes Secure Parallel Determinant Computation (SPDC), a framework for privacy-preserving matrix determinant computation in edge environments. SPDC combines Composite Element Distortion (CED), which integrates Element-wise Obfuscation (EWO) and the Panth Rotation Theorem (PRT), with parallel LU decomposition to distribute encrypted matrix blocks across untrusted edge servers. The method achieves O(n³) complexity reduction through one-way communication and offers two verification algorithms (Q₂ probabilistic, Q₃ deterministic) for result integrity. Analysis shows the approach maintains determinant properties while providing strong privacy guarantees and low computational overhead for real-time applications in IoT and control systems.
matrix determinant computation · edge computing · composite element distortion · parallel lu decomposition · privacy-preserving computation
GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation
The paper introduces GA-VLN, a geometry-aware Bird's Eye View (BEV) representation for Vision-Language Navigation (VLN) that addresses computational inefficiency and spatial reasoning limitations in existing approaches. The method constructs compact 3D-grounded features by projecting RGB-D inputs into agent-centric BEV maps and integrating both explicit depth-based projections and implicit geometric priors from a pretrained 3D foundation model. Experiments demonstrate state-of-the-art performance without DAgger augmentation or mixed VQA training, achieving a 46.2% success rate on the R2R benchmark with 60% fewer tokens than baseline methods.
vision-language navigation · bird's eye view · 3d reconstruction · multimodal llm · geometry-aware
AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding
The paper introduces AgroVG, a large-scale multi-source benchmark for agricultural visual grounding that formulates the task as generalized set prediction, requiring models to localize all matching instances or abstain when none exist. The benchmark comprises 10,071 annotation-grounded image-query pairs from ten datasets across six target families, supporting both bounding-box (T1) and instance-mask (T2) grounding. Zero-shot evaluation of 26 model configurations reveals significant performance gaps, with the best multi-target Set-$F_1$ score reaching only 0.35 and mask success rate at IoU@0.75 below 0.17.
visual grounding · agricultural ai · set prediction · instance segmentation · multi-target localization
FRED: A Multi-Modal Autonomous Driving Dataset for Flooded Road Environments
The Flooded Road Environments Dataset (FRED) introduces the first multi-modal autonomous driving dataset focused on water hazard scenarios. It combines 2.3 MP FLIR camera images, 64-beam Ouster OS1-64 LiDAR point clouds, and iXblue ATLANS-C IMU data with RTK GNSS correction, captured across five locations during and post-flooding. The dataset supports KITTI-style and RTMaps formats, includes semantic labels for water detection, and provides dry-condition baselines for localization and SLAM tasks. This enables evaluation of both single-sensor and sensor-fusion methods in flooded environments.
multi-modal dataset · water hazard detection · 64-beam lidar · rtk gnss correction · semantic labeling
From TF-IDF to Transformers: A Comparative and Ensemble Approach to Sentiment Classification
The paper compares traditional and transformer-based models for binary sentiment classification on the IMDb dataset, proposing a soft voting ensemble to boost performance. Methods include Naive Bayes, Logistic Regression, SVM, LightGBM, LSTM, RoBERTa, and DistilBERT, with NLP preprocessing and evaluation via accuracy, precision, recall, F1-score, and ROC-AUC. RoBERTa achieved the highest accuracy (93.02%), while the ensemble further improved classification, demonstrating the efficacy of model combination for sentiment analysis.
sentiment analysis · transformer models · model ensembling · nlp preprocessing · imdb dataset
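Soft voting, the ensembling scheme named in this summary, averages the positive-class probabilities predicted by each model and thresholds the average. A minimal sketch follows. The per-model probabilities are invented, and the equal default weights and the 0.5 threshold are assumptions rather than details taken from the paper.

```python
def soft_vote(model_probabilities, weights=None, threshold=0.5):
    """Average positive-class probabilities across models, then threshold.

    model_probabilities: one list per model, each holding P(positive) per review.
    """
    n_models = len(model_probabilities)
    weights = weights or [1.0] * n_models
    total_weight = sum(weights)
    n_items = len(model_probabilities[0])
    averaged = [
        sum(w * probs[i] for w, probs in zip(weights, model_probabilities)) / total_weight
        for i in range(n_items)
    ]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical P(positive) for three reviews from three models.
logistic_regression = [0.90, 0.40, 0.20]
svm = [0.80, 0.60, 0.30]
roberta = [0.95, 0.70, 0.10]
averaged, labels = soft_vote([logistic_regression, svm, roberta])
```

On the second review the logistic regression alone would predict negative (0.40), but the averaged probability is about 0.57, so the ensemble predicts positive.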
Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems
The study identifies a critical vulnerability in LLM agent protection systems: domain-camouflaged injection attacks evade detection by mimicking target document vocabulary and authority structures. Formalized as the Camouflage Detection Gap (CDG), experiments show detection rates drop from 93.8% to 9.7% on Llama 3.1 8B and from 100% to 55.6% on Gemini 2.0 Flash across 45 tasks (p < 0.001). Llama Guard 3 fails completely (0% detection), while multi-agent debate amplifies attacks 9.9x on weaker models. Detector augmentation yields limited improvements (10.2-78.7%), indicating architectural vulnerabilities.
domain-camouflaged injection · camouflage detection gap · llm agents · injection detectors · multi-agent debate
Virtual 3D H&E Staining from Phase-contrast Back-illumination Interference Tomography
The authors present HistoBIT3D, the first voxel-wise paired dataset combining Back-illumination Interference Tomography (BIT) and fluorescence-labeled nuclei for quantitative evaluation of virtual 3D H&E staining. Their framework leverages bidirectional multiscale content consistency and cross-domain style reuse to translate BIT volumes with shift-variant contrast into realistic H&E volumes. The method achieves state-of-the-art realism metrics and improves 3D nuclei segmentation accuracy by 15% under zero-shot Cellpose evaluation, while enhancing boundary preservation in volumetric histopathology.
back-illumination interference tomography · virtual staining · 3d histopathology · shift-variant contrast · nuclei segmentation
The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems
The paper introduces ActiveGraph, an agent runtime framework that treats an append-only event log as the source of truth, with a working graph derived deterministically from this log. Behaviors, including LLM-backed routines, react to graph changes and emit new events, enabling coordination through the shared graph. This design ensures three key properties: deterministic replay, cheap forking at any event, and end-to-end lineage tracking. The architecture is demonstrated through a diligence example, and its potential for self-improving agents is discussed, extending prior work like BabyAGI and graph-memory research.
event-sourced · deterministic replay · graph-memory · self-improving agents · append-only log
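The event-sourcing pattern described here, where the log is the source of truth and the graph is a pure function of it, can be sketched in a few lines. This is a generic illustration, not ActiveGraph's API. The event shapes, function names, and sample data are all hypothetical, and lineage tracking is left out.

```python
def apply_event(graph, event):
    """Pure reducer: build a new graph from the old graph and one event."""
    nodes = dict(graph["nodes"])
    edges = set(graph["edges"])
    if event["type"] == "add_node":
        nodes[event["id"]] = event["data"]
    elif event["type"] == "add_edge":
        edges.add((event["src"], event["dst"]))
    return {"nodes": nodes, "edges": edges}


def replay(log, upto=None):
    """Rebuild the working graph from the first `upto` events (all if None)."""
    graph = {"nodes": {}, "edges": set()}
    for event in log[:upto]:
        graph = apply_event(graph, event)
    return graph


def fork(log, at):
    """Start a new branch that shares the first `at` events of the log."""
    return list(log[:at])


log = [
    {"type": "add_node", "id": "company", "data": {"name": "Acme"}},
    {"type": "add_node", "id": "filing", "data": {"year": 2025}},
    {"type": "add_edge", "src": "company", "dst": "filing"},
]

# Branch off after the second event and take a different path.
branch = fork(log, at=2)
branch.append({"type": "add_node", "id": "memo", "data": {}})
```

Because `replay` has no hidden state, the same log always yields the same graph, and a fork costs only a copy of a log prefix. Appending to `branch` leaves `log` untouched.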
From Patches to Trajectories: Privileged Process Supervision for Software-Engineering Agents
The paper introduces Patches-to-Trajectories (P2T), a method for curating high-quality training trajectories for software-engineering agents by leveraging privileged reference patches. P2T formulates trajectory construction as bi-objective optimization over per-step effectiveness and length, using a reverse phase to distill patches into latent process graphs and a forward phase to score teacher continuations against these graphs. On SWE-bench Verified, P2T improves Pass@1 by up to 10.8 points while reducing inference cost by ~15%, using only 1.8k curated instances.
supervised fine-tuning · trajectory optimization · privileged information · software-engineering agents · process supervision
Ex-GraphRAG: Interpretable Evidence Routing for Graph-Augmented LLMs
Ex-GraphRAG introduces Multivariate Graph Neural Additive Networks (M-GNAN) to replace opaque GNN encoders in GraphRAG, enabling exact attribution of node-level contributions in knowledge graph-augmented LLMs. The method decomposes encoder outputs without post-hoc approximation, maintaining performance parity (matching black-box models on STaRK-Prime) while revealing a semantic-structural mismatch: the dominant nodes are not directly connected to one another and are linked only through low-attribution intermediaries, whose removal degrades multi-hop QA by 28%. This auditability exposes previously invisible evidence routing patterns critical for retrieval pruning and failure diagnosis.
graphrag · m-gnan · multivariate · attribution · stark-prime
ECPO: Evidence-Coupled Policy Optimization for Evidence-Certified Candidate Ranking
The paper introduces Evidence-Coupled Policy Optimization (ECPO), a method for evidence-certified candidate ranking that jointly optimizes ranking utility and evidence validity. ECPO learns an interpretable trajectory reward from skeleton alignment and argument consistency, then optimizes a constrained policy with three coupled rewards: listwise ranking utility, certificate validity, and evidence-cycle reward computed by a deterministic verifier. Evaluated on MAVEN-ERE and RAMS, ECPO outperforms zero-shot, SFT, and GRPO policies, demonstrating improved CertNDCG and decision-evidence coupling in closed-roster, predicted-roster, and hybrid-roster settings.
evidence-certified ranking · policy optimization · trajectory reward · certndcg · deterministic verifier
Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning
The paper introduces Counterfactual Relational Policy Optimization (CRPO), a dual-branch reinforcement learning framework that improves spatiotemporal sensitivity in Video LLMs by training on both original and counterfactual videos (generated via horizontal flips and temporal reversals) with a Counterfactual Relation Reward (CRR). CRR enforces answer consistency for static questions and variation for dynamic questions, mitigating shortcut learning. Evaluated on DyBench, a new 3,014-video benchmark with pair-accuracy metrics, CRPO improves Qwen3-VL-8B's DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model while maintaining general video performance.
video llms · counterfactual learning · spatiotemporal sensitivity · reinforcement learning · shortcut learning
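The consistency-versus-variation rule behind the Counterfactual Relation Reward can be illustrated with a toy function. It covers only the relation between the two answers as described in the summary. The paper's actual reward presumably also accounts for answer correctness and may be shaped differently, so treat the function name, the binary values, and the question-type labels as assumptions.

```python
def relation_reward(answer_original, answer_counterfactual, question_type):
    """Reward agreement on static questions and disagreement on dynamic ones.

    question_type is "static" (answer should survive a flip or reversal)
    or "dynamic" (answer should change under the counterfactual video).
    """
    same = answer_original == answer_counterfactual
    if question_type == "static":
        return 1.0 if same else 0.0
    if question_type == "dynamic":
        return 0.0 if same else 1.0
    raise ValueError(f"unknown question type: {question_type}")


# "What color is the car?" should not change under a horizontal flip.
static_reward = relation_reward("red", "red", "static")

# "Which way does the car move?" should change under a horizontal flip.
dynamic_reward = relation_reward("left", "right", "dynamic")

# A model that ignores motion gives the same answer both times.
shortcut_reward = relation_reward("left", "left", "dynamic")
```

A model that answers from static appearance alone scores well on the first case but gets no reward on the third, which is how such a signal discourages shortcut learning.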
Echo: Learning from Experience Data via User-Driven Refinement
Echo introduces a framework for converting noisy interaction logs into learnable knowledge by leveraging user-driven refinements of AI agent outputs. The method operationalizes continuous learning from experience data, distilling trial-and-error sequences into high-quality training signals through systematic harvesting of user corrections. Large-scale validation in a production code completion environment demonstrates a 10-percentage-point improvement in acceptance rate (25.7% to 35.7%), breaking static performance ceilings.
experience data · user-driven refinement · continuous learning · interaction logs · model optimization
Interpreting and Enhancing Emotional Circuits in Large Vision-Language Models via Cross-Modal Information Flow
The paper introduces a steering-vector-based causal attribution framework to interpret and enhance emotional reasoning in Large Vision-Language Models (LVLMs). It constructs a specialized dataset to analyze the three-stage 'Adapt-Aggregate-Execute' mechanism, revealing functional decoupling: visual emotional cues aggregate in middle layers via sentiment-specific attention heads, while deep layers translate them into narrative generation. The proposed method regulates emotional information routing to strengthen attention flow and amplify semantic activation. Experiments on MER-UniBench show significant performance improvements via inference-time intervention, mitigating emotional hallucinations and validating circuit fidelity.
large vision-language modelscausal attribution frameworkemotional circuitssentiment-specific attention headsinference-time intervention
Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection
The paper introduces VINA (Video as Natural Augmentation), a unified framework for detecting AI-generated images and videos by addressing cross-modal generalization gaps. Current detectors fail on video frames due to synthesis-agnostic processing shifts (color conversion, compression) and model-specific fingerprints. VINA jointly trains on image and video data, using frames as natural augmentations and a contrastive objective to align cross-modal representations. Evaluated on 14 benchmarks, VINA achieves SOTA performance, demonstrating robustness and transferability without complex augmentation or dataset-specific tuning.
aigc detectioncross-modal generalizationcontrastive learningvideo augmentationsynthesis-agnostic shifts
Format-Constraint Coupling in Knowledge Graph Construction from Statistical Tables
The paper identifies format-constraint coupling, a phenomenon in which schema constraints and serialization formats interact super-additively during knowledge graph construction from statistical tables and reduce fidelity. Through a 2x2 factorial design across 6 datasets, the study demonstrates that this effect can increase errors by up to 1.180, with catastrophic mismatches causing fact coverage to drop below baselines in 4/6 cases. Analysis via probing, token ablation, and controlled experiments across format-schema pairings reveals surface-form anchoring as the underlying mechanism. The work introduces CSVFidelity-Bench, containing 15 datasets and 1,892 gold facts, to enable fidelity-aware evaluation.
knowledge graph constructionformat-constraint couplingstatistical tablessurface-form anchoringfidelity-aware evaluation
LLM Retrieval for Stable and Predictable Ad Recommendations
The paper introduces an LLM-based semantic candidate generation framework for ad recommendation systems, addressing stability and predictability gaps in traditional approaches. The method extracts hierarchical semantic attributes from ad creatives using fine-tuned LLMs, constructs graph-based expansions to ensure semantic variant coverage, and integrates these representations into retrieval. Offline and online A/B tests on a large-scale industrial system demonstrated significant improvements in both predictability metrics (e.g., robustness to input perturbations) and traditional performance metrics (e.g., NDCG). The framework generalizes to other large-scale recommendation systems facing scaling challenges.
llm retrievalsemantic candidate generationgraph-based expansionprediction stabilitynormalized discounted cumulative gain
ChronoMedicalWorld: A Medical World Model for Learning Patient Trajectories from Longitudinal Care Data
The ChronoMedicalWorld Model (CMWM) introduces an action-conditioned latent world-model framework for simulating long-term patient trajectories from longitudinal EHR data. The architecture combines joint-embedding state encoding with multimodal action encoding (structured interventions + free-text), trained via a six-term objective including next-observation prediction, latent dynamics, SIGReg regularization, and physiology-aware priors. In a CKD case study (2,232 patients), CMWM reduced eGFR trajectory forecasting errors by 7.28% MAE (7.384 vs 7.964) and 7.35% RMSE (10.256 vs 11.069) compared to GPT-5.5, with gains attributed to dialogue modeling. The framework generalizes to any chronic condition with periodic state-intervention sequences.
latent world-modellongitudinal ehraction-conditionedphysiology-aware priorsclosed-loop rollout
AI-Enabled Serious Games: Integrating Intelligence and Adaptivity in Training Systems
The chapter analyzes how contemporary AI techniques can enhance instructional intelligence and adaptivity in serious games, addressing limitations like static scenarios and poor learner modeling. It synthesizes historical developments from computer-assisted instruction to modern AI-enabled architectures, focusing on large language models (LLMs), reinforcement learning (RL), and agent-based approaches for dynamic scenario variation and adaptive pacing. Key challenges identified include explainability, validation, computational costs, and insufficient empirical evidence on long-term learning outcomes in AI-integrated serious games.
instructional intelligencedynamic difficulty adjustmentlarge language modelsagent-based architectureslearning analytics
MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues
The study reveals a perception-generation gap in multimodal large language models (MLLMs) for video temporal grounding (VTG), where models correctly attend to target intervals during prefill but lose this signal during autoregressive decoding. The authors identify Temporal Grounding Heads (TG-Heads) that localize events in the prefill stage and propose an inference-time read-then-regenerate framework. This method converts TG-Head attention into a debiased relevance signal, extracts high-attention intervals, and restricts visual context to these regions during regeneration. Without parameter updates, the framework improves MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B by up to +3.5 mIoU on three VTG benchmarks.
multimodal large language modelsvideo temporal groundingtemporal grounding headsautoregressive decodinginference-time framework
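The "extract high-attention intervals" step can be sketched as thresholding a per-frame relevance signal and merging consecutive frames. This is a minimal illustration only: the function name, the fixed threshold, and the toy scores are assumptions, and the paper's debiasing of TG-Head attention is not reproduced here.

```python
from typing import List, Tuple


def high_attention_intervals(scores: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Merge consecutive frames with score >= threshold into inclusive (start, end) index pairs."""
    intervals: List[Tuple[int, int]] = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # open a new interval
        elif start is not None:
            intervals.append((start, i - 1))  # close the current interval
            start = None
    if start is not None:
        intervals.append((start, len(scores) - 1))
    return intervals


# Hypothetical per-frame relevance scores for an 8-frame clip.
scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.1, 0.6, 0.05]
print(high_attention_intervals(scores, threshold=0.5))  # [(2, 4), (6, 6)]
```

In a read-then-regenerate setup, only the frames inside the returned intervals would be kept as visual context for the second pass.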
Thermodynamic Irreversibility of Training Algorithms
The work establishes a thermodynamic framework for analyzing irreversibility in AI training algorithms, demonstrating equivalence between four characterization methods: numerical backward error (φ_DE), time-renormalized correction (φ_TR), microscopic time reversal asymmetry (φ_TA), and stochastic-thermodynamic entropy production (φ_ST). Using step size η expansions, the analysis reveals a symmetry-breaking emergent force that preserves orthogonal symmetries while breaking non-isometric reparametrization symmetries. Results indicate a universal preference for learning trajectories minimizing entropy production rate, providing fundamental insights into far-from-equilibrium dynamics of modern AI systems.
thermodynamic irreversibilitytraining algorithmsentropy productionsymmetry breakingfar-from-equilibrium
CausalGuard: Conformal Inference under Graph Uncertainty
CausalGuard introduces a structure-weighted conformal framework for causal effect estimation under graph uncertainty, combining graph-conditional doubly robust pseudo-outcomes with Bayesian Information Criterion-weighted DAG candidates. The method leverages LLM-derived edge priors, conditional-independence pruning, and a composite nonconformity score to ensure finite-sample marginal coverage. Empirical results show mean coverage above 90% on five benchmarks, with reduced interval width compared to graph-agnostic conformal baselines. The approach remains robust to misspecified priors when candidate sets are data-supported.
conformal inferencecausal graphdoubly robust estimationbayesian information criterionconditional average treatment effect
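One common way to turn BIC scores into weights over candidate DAGs is the approximation weight_i proportional to exp(-BIC_i / 2). The sketch below uses that standard form as an assumption; the summary does not state the exact weighting CausalGuard uses, and the function name and toy scores are hypothetical.

```python
import math
from typing import List


def bic_weights(bic_scores: List[float]) -> List[float]:
    """Normalized weights proportional to exp(-0.5 * BIC); lower BIC gets more weight.

    Subtracting the minimum BIC first keeps the exponentials numerically stable.
    """
    best = min(bic_scores)
    raw = [math.exp(-0.5 * (b - best)) for b in bic_scores]
    total = sum(raw)
    return [r / total for r in raw]


# Three hypothetical candidate graphs with made-up BIC scores.
weights = bic_weights([100.0, 102.0, 110.0])
print([round(w, 3) for w in weights])
```

A candidate whose BIC is 2 points worse than the best receives exp(-1), roughly 0.37 times the best candidate's weight, so clearly worse graphs contribute little to the weighted estimate.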
SDGBiasBench: Benchmarking and Mitigating Vision-Language Models' Biases in Sustainable Development Goals
The authors introduce SDGBiasBench, a large-scale benchmark suite for evaluating vision-language models (VLMs) on Sustainable Development Goals (SDGs), comprising 500k multiple-choice questions and 50k regression tasks to assess decision-level and estimation-level biases. They identify intrinsic SDG biases in VLMs, where predictions rely on priors rather than multimodal cues, and propose CADE (Contrastive Adaptive Debias Ensemble), a training-free method that uses modality-specific answer priors for debiasing. CADE improves multiple-choice accuracy by up to 25% and reduces regression MAE by 12 points across multiple VLMs.
sustainable development goalsvision-language modelsmultimodal reasoningbias mitigationcontrastive learning
MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks
MAVEN introduces a multi-stage agentic pipeline for generating structured video annotations to train Vision Language Models (VLMs), addressing the scalability limitations of manual labeling. The method constructs Multi-Scale Spatio-Temporal Event Descriptions (MSTED) from three caption levels, which feed into downstream Q&A generation, and features agent-driven domain adaptation and hierarchical error refinement. Evaluated on 5,300 traffic videos, fine-tuning Cosmos-Reason2-8B with MAVEN-generated data outperforms Gemini 2.5 Pro and 3.1 Flash, achieving +38.8 MCQ accuracy points on a private CCTV set and +10.7 on AccidentBench, with further gains via RL post-training.
vision language modelsmulti-scale spatio-temporalagentic pipelinechain-of-thoughtdomain adaptation
Engineering Hybrid Physics-Informed Neural Networks for Next-Generation Electricity Systems: A State-of-the-Art Review
The review demonstrates that physics-informed machine learning (PIML) architectures significantly enhance electricity system modeling by embedding physical laws into neural networks. Hybrid approaches like PINNs, DeepONets, and PIGNNs improve accuracy under sparse/noisy data, reduce simulation time versus finite element methods, and enable real-time digital twin calibration. Case studies confirm superior performance in parameter sensitivity, dynamic behavior, and robustness compared to data-driven baselines. Challenges include training instability for stiff multi-scale problems and high computational costs. PIML enables a shift from black-box methods to transparent, physics-grounded solutions for Industry 4.0 applications.
physics-informed neural networksdigital twinsmaxwell's equationssurrogate modelinguncertainty quantification
Planning in the LLM Era: Building for Reliability and Efficiency
The paper analyzes the evolution of LLM-based planning methods, advocating for a shift from single-shot generation to verifiable symbolic solver synthesis. It critiques early approaches (prone to incompleteness and resource inefficiency) and surveys three emerging planner-generation paradigms that decouple LLM dependence during inference. The work identifies key limitations in current techniques and proposes research directions for achieving reliable, maintainable planners with minimal runtime LLM overhead.
llm-based planningsymbolic solver generationverifiable plannersinference efficiencyplanning reliability
Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction
The authors present a two-stage multimodal framework for predicting continuous emotion mimicry intensity (EMI) from in-the-wild videos, achieving third place in the Hume-ABAW10 EMI Challenge. Their method independently trains modality-specific encoders (text, audio, vision, optional motion) before fusing representations via a lightweight regressor with modality dropout and controlled adaptation. The best-performing text-audio-vision-motion fusion achieved 0.4722 average Pearson correlation on validation (4:1 split), with test set performance reaching 0.57. While motion provided marginal gains, the work establishes a reproducible EMI baseline.
multimodal fusionemotion predictionmodality dropoutin-the-wild videopearson correlation
EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control
EvoScene-VLA introduces a persistent action-updated scene state for chunked robot control, combining current visual observations with prior scene information from previous actions. The method employs a recurrent scene prefix in a vision-language model (VLM) to maintain geometry-aware scene states across control chunks, corrected by fresh visual evidence at each step. Training leverages a Scene Predictor and Geometric Anchor for scene-token targets and alignment, discarded during deployment. Evaluated on 31 RoboTwin tasks, EvoScene-VLA improves success rates from 87.2% to 89.1% (fixed) and 86.1% to 88.5% (randomized), outperforming baselines on the Galaxea R1-Lite robot.
evoscene-vlavision-language-actionscene priorchunked controlgeometric anchor
Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models
The paper proposes Director-Experts (DEX), a modular network addressing gradient conflicts in multi-modality medical vision foundation models. DEX combines modality-specialized expert pools with a director module that integrates cross-modal semantics via group exponential moving average. Evaluated on Medical Vision Universe (4M images, 10 modalities) and 26 downstream tasks, DEX demonstrates improved optimization and transferability compared to monolithic approaches.
multi-modalityfoundation modelsgradient conflictmodular networksmedical vision
The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation
The paper introduces the Zero-CoT Probe (ZCP), a black-box method for detecting evasive data contamination in LLMs that bypasses existing detection via paraphrasing. ZCP truncates Chain-of-Thought reasoning to expose memorization, comparing performance on original benchmarks against isomorphically perturbed datasets. It proposes Contamination Confidence, a metric quantifying contamination likelihood and severity. Experiments on contaminated and fine-tuned models show ZCP effectively identifies both direct and evasive contamination, addressing a critical evaluation challenge in LLM benchmarking.
data contaminationchain-of-thoughtblack-box detectionmemorizationbenchmark evaluation
CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models
CrossVLA introduces cross-paradigm optimization for Vision-Language-Action (VLA) models, addressing gaps in post-training and inference. The method proposes (i) a flow-matching log-probability estimator enabling Direct Preference Optimization (DPO) on continuous-action models without ODE integration, (ii) a DoRA-based parameter-efficient layer outperforming LoRA by +10.4 pp on LIBERO benchmarks, and (iii) an inference-time analysis showing that denoising loops account for 78.6% of latency and that KV-caching acceleration is limited to 21%. Additional pretraining achieves 99.5% k-NN recall@1 for task retrieval. Code and models are open-sourced.
vision-language-actiondirect preference optimizationflow-matchingparameter-efficient tuningkv-caching
OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning
The paper introduces Oracle-Prompted Policy Optimization (OPPO), a Bayesian value recursion method for token-level credit assignment in LLM reasoning. OPPO leverages oracle signals to update beliefs about success probabilities along trajectories, providing token-level advantages without requiring value networks or additional rollouts. It outperforms GRPO, DAPO, and SDPO by up to +6.0 points on AMC'23 and +5.2 points on AIME'24, with gains increasing with response length. The method offers two estimators: self-oracle (reusing the student model) and teacher-oracle (using a stronger frozen model).
bayesiantoken-levelcredit assignmentllm reasoningoppo
ACC: Compiling Agent Trajectories for Long-Context Training
The paper introduces Agent Context Compilation (ACC), a method to convert agent trajectories into long-context QA pairs for supervised fine-tuning of LLMs. ACC integrates scattered evidence from tool responses and environment observations across multiple turns, enabling direct supervision of long-context reasoning without additional annotation. Evaluated on MRCR and GraphWalks, ACC-trained Qwen3-30B-A3B achieves 68.3 (+18.1) and 77.5 (+7.6) respectively, matching larger models while preserving general capabilities, with analysis revealing task-adaptive attention restructuring.
agent trajectorieslong-context reasoningsupervised fine-tuningcoreference resolutionattention restructuring
Comparing LLM and Fine-Tuned Model Performance on NVDRS Circumstance Extraction with Varying Prompt Complexity
The study proposes a hybrid approach combining large language models (LLMs) and fine-tuned RoBERTa for extracting inferentially complex circumstances from NVDRS death investigation narratives. A novel 'Complexity Score' algorithm predicts when detailed prompts outperform name-only prompts, enabling dynamic prompt strategy selection. Evaluation on 25 NVDRS circumstances shows that LLMs (GPT-5.2, Gemini 2.5 Pro, Llama-3 70B) significantly outperform fine-tuned models on low-prevalence circumstances, while fine-tuned models excel on common circumstances, suggesting an optimized hybrid architecture.
complexity scorenvdrsinferential extractionprompt strategyhybrid architecture
An Open Multi-Center Whole-Body FDG PET/CT Foundation Model for Tumor Segmentation
The authors present an open-source, multi-center whole-body FDG PET/CT foundation model for tumor segmentation, trained on 4,997 harmonized scans from four public datasets. The method employs hierarchical UNet backbones with early channel-wise concatenation for cross-modal interaction and a masked autoencoding objective using zero-mean imputation to avoid intensity discontinuities. Results show strong label efficiency: with 10% labeled data, performance matches full-dataset training, and 5-shot linear probing demonstrates superior Dice scores for joint PET/CT pretraining versus modality-specific approaches.
foundation modelpet/ct fusionmasked autoencodinglabel efficiencytumor segmentation
FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation
FLUID introduces an ID-free ranking framework for industrial-scale livestreaming recommendation, addressing persistent cold-start issues from ephemeral room lifetimes. The method replaces traditional ID embeddings with discrete hierarchical codes (LUCID) generated by a cross-domain multimodal encoder trained on both short videos and livestreams, using late-fusion and staged warmup for stability. Deployed on a platform with >1B users, FLUID achieves +0.55% Quality Watch Duration, +2.05% Cold-Start Room Views, and +0.05% Active Hours.
collaborative filteringmultimodal encodercold-startdiscrete codeslivestreaming recommendation
Does Slightly Mean Somewhat? Measuring Vague Intensity Words in LLM Numeric Actions
The study investigates whether language models preserve ordinal meaning in intensity words when producing numeric actions, using Claude Haiku in a controlled resource-allocation task. Researchers tested 10 English degree modifiers (e.g., slightly to drastically) across 6,620 runs, varying temperature (T=0.0, T=0.7) and system state. Results show: (1) 10 words compress into 5 distinct median outputs, (2) system state explains more variance than lexical choice (epsilon-squared 0.782 vs. 0.079), and (3) near capacity, models exhibit three behavioral modes (hedging, abstaining, ceiling-pushing).
intensity wordsordinal meaningresource-allocationclaude haikuepsilon-squared
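For readers unfamiliar with the effect size quoted above, epsilon-squared is commonly derived from the Kruskal-Wallis H statistic as H / (n - 1), where n is the total number of observations. The sketch below assumes that definition and assumes no tied values; the summary does not say which estimator the study used, and the data are made up.

```python
from typing import List


def kruskal_h(groups: List[List[float]]) -> float:
    """Kruskal-Wallis H statistic without tie correction (assumes all values are distinct)."""
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # ranks 1..n
    n = len(pooled)
    total = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)


def epsilon_squared(groups: List[List[float]]) -> float:
    """Rank-based effect size: share of rank variance explained by group membership."""
    n = sum(len(g) for g in groups)
    return kruskal_h(groups) / (n - 1)


# Toy data: three perfectly separated groups give a large effect size.
groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(round(epsilon_squared(groups), 3))  # 0.9
```

Read this way, the reported 0.782 for system state versus 0.079 for lexical choice means that grouping runs by system state separates the numeric outputs far more cleanly than grouping them by intensity word.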
Toward AI VIS Co-Scientists: A General and End-to-End Agent Harness for Solving Complex Data Visualization Tasks
The authors present an end-to-end agentic harness for autonomously designing custom visual analysis applications (VIS apps) from raw data and high-level task descriptions. The system employs coordinated agents with specialized skills for exploratory analysis, planning, environment configuration, implementation, interface validation, and task evaluation, producing intermediate artifacts for iterative refinement. Evaluated on IEEE SciVis Contests featuring real-world complexity, the system generates functional single-page VIS Apps with verified linked-view behavior tailored to domain-specific tasks.
agentic harnessvisual analysislinked-view behaviorexploratory analysistask-driven validation
Implicit Safety Alignment from Crowd Preferences
The paper introduces Safe Crowd Preference-based RL, a hierarchical framework for extracting safety-aligned skills from crowd preferences to regularize agent behavior in downstream tasks. The method addresses limitations of direct reward combination by leveraging implicit safety criteria embedded in diverse user preferences. Experimental results demonstrate reduced safety costs and competitive task performance compared to oracle methods with ground-truth safety signals across RL environments and LLM-style tasks.
reinforcement learninghuman feedbacksafety alignmentcrowd preferenceshierarchical framework
Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents
Trace2Skill introduces a test-time scaling framework for hardware LLM agents tackling Complex Verilog Design Problems (CVDP), avoiding RTL-specialized fine-tuning. The method evolves natural-language skills by mining rollout traces, converting them into diagnostics and oracle lessons, and using an oracle-mutator-selector loop to refine task-specific skills. Dense verifier feedback provides sanitized functional observations without exposing hidden harnesses. Results show improved pass rates on hard CVDP tasks, including breakthrough solutions for previously unsolved problems, without model weight updates or specialized training data.
verilogedallm agentsverifier feedbackskill evolution
Residual Skill Optimization for Text-to-SQL Ensembles
DivSkill-SQL introduces residual skill optimization for Text-to-SQL ensembles, systematically building complementary agentic skills by optimizing each new skill on examples where the current ensemble fails, thereby maximizing marginal contribution to Pass@K. The method requires no fine-tuning and demonstrates cross-dialect and cross-task transferability. Evaluations on Spider2-Lite show accuracy improvements of +11.1 points (Snowflake) and +8.3 (BigQuery) over baselines, with consistent gains across base models (Opus-4.6, GPT-5.4) and reduced hallucination rates (3x fewer errors in schema references and function calls).
text-to-sqlensemble learningresidual optimizationpass@kschema hallucination
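The residual idea can be sketched with precomputed pass sets. Note the simplification: the paper optimizes each new skill on the examples the ensemble fails, whereas this sketch only selects among already-built candidate skills by marginal gain. All names and the toy data are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def solved_by_ensemble(passes: Dict[str, Set[int]], ensemble: List[str]) -> Set[int]:
    """Examples solved by at least one skill in the ensemble (the Pass@K success set)."""
    solved: Set[int] = set()
    for skill in ensemble:
        solved |= passes[skill]
    return solved


def residual_examples(all_ids: Iterable[int], passes: Dict[str, Set[int]], ensemble: List[str]) -> Set[int]:
    """Examples where every skill in the current ensemble fails."""
    return set(all_ids) - solved_by_ensemble(passes, ensemble)


def pick_next_skill(all_ids: Iterable[int], passes: Dict[str, Set[int]], ensemble: List[str]) -> str:
    """Choose the candidate that solves the most residual examples."""
    residual = residual_examples(all_ids, passes, ensemble)
    candidates = [s for s in passes if s not in ensemble]
    return max(candidates, key=lambda s: len(passes[s] & residual))


# Made-up pass sets: which example ids each skill answers correctly.
passes = {"base": {0, 1, 2}, "joins": {2, 3}, "dates": {3, 4, 5}}
print(sorted(residual_examples(range(6), passes, ["base"])))  # [3, 4, 5]
print(pick_next_skill(range(6), passes, ["base"]))            # dates
```

Scoring candidates on the residual set rather than on all examples rewards skills that cover new failures instead of skills that repeat what the ensemble already solves.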
Patch Hierarchical Attention Transformer for Efficient Particle Jet Tagging
The paper introduces PHAT-JeT, a transformer architecture for real-time particle jet tagging that balances computational efficiency and accuracy. The model combines a geometric message-passing module for local detector-plane structure with hierarchical patch-based attention, enabling exact attention within small particle groups while maintaining global context through lightweight patch-token communication. Evaluated under strict latency constraints, PHAT-JeT achieves state-of-the-art accuracy and background rejection on four benchmarks (hls4ml, JetClass, Top Tagging, and Quark-Gluon), outperforming existing resource-constrained models.
jet tagginghierarchical attentionmessage-passingtransformerparticle physics
What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct
This work establishes a taxonomy for AI sycophancy in LLMs through two contributions: a systematic review of 70 papers identifying two key dimensions (target: user beliefs vs. personal traits; expression: explicit vs. implicit behaviors), and an expert survey (N=106) revealing 94.3% consensus on sycophancy's significance but disagreement on specific behaviors. The taxonomy reveals research gaps in studying implicit and person-directed sycophancy, while providing a framework for standardized evaluation and mitigation strategies.
ai sycophancylarge language modelsbehavioral taxonomyexpert surveyevaluation framework
Understanding Perspectives of Patients, Caregivers and Clinicians towards Emerging Collaborative-decision Making Technologies
The study investigates perceptions of collaborative decision-making technologies in pediatric healthcare through qualitative analysis of patients, caregivers, and clinicians. Researchers examined interactive dashboards, VR simulators, and AI voice assistants, identifying divergent opinions across user groups. Results indicate technology acceptance correlates with trust levels, suggesting developers must prioritize trust-building design strategies for effective implementation.
collaborative decision-makinginteractive dashboardsvr simulatorsai voice assistantstechnology acceptance
A Causal Argumentation Method for Explainability of Machine Learning Models
The paper introduces a novel explainability method combining causal discovery with argumentation frameworks to elucidate model decisions. It employs causal relationships identified via discovery methods, translates them into a Bipolar Argumentation Framework (BAF), and uses semi-stable semantics to derive feature extensions that justify predictions. Evaluated on two benchmark datasets, the method outperforms standard post-hoc explainability approaches in clarifying decision rationales.
explainable aicausal discoverybipolar argumentation frameworksemi-stable semanticspost-hoc explainability
PEARL: Unbiased Percentile Estimation via Contrastive Learning for Industrial-Scale Livestream Recommendation
PEARL introduces a contrastive learning framework for unbiased percentile estimation in recommender systems, addressing behavioral intensity imbalance caused by heterogeneous user engagement patterns. The method employs nonparametric pairwise comparisons to approximate relative preferences directly, avoiding absolute engagement magnitudes, and incorporates percentile smoothing via bootstrapping for sparse feedback. Theoretical analysis shows unbiased percentile estimation, while empirical results demonstrate effectiveness: deployed on a billion-user livestream platform, PEARL achieved +2.10% watch duration, +0.80% consumption, +1.49% interaction rate, and -6.91% report rate.
contrastive learningpercentile estimationbehavioral biasrecommender systemsnonparametric
Who Uses AI? Platforms, Workforce, and AI Exposure
The study identifies measurement bias in AI exposure scores derived from platform logs, demonstrating that these metrics conflate platform user demographics with workforce characteristics. Using fixed outcomes, samples, controls, and estimators while varying platform inputs, the authors show coefficient variations up to 1.9x for post-ChatGPT employment effects and sign disagreements across consumer vs. enterprise channels. Reweighting to Bureau of Labor Statistics data reduces estimates by 42-93%. The analysis formalizes non-classical measurement error, deriving probability limits and partial-identification bounds for employment elasticities, revealing understated substitution effects relative to augmentation.
ai exposuremeasurement erroremployment elasticitypartial-identificationworkforce demographics
SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?
The authors introduce SMDD-Bench, a multi-turn, long-horizon benchmark for evaluating LLM agents on small molecule drug design (SMDD) tasks. The benchmark comprises 502 solvable instances across 5 task types (2D Pharmacophore Identification, Interaction Point Discovery, Scaffold Hopping, Lead Optimization, Fragment Assembly), spanning 102 protein targets and diverse chemical spaces. Testing 7 frontier LLMs reveals that GPT-5.4 achieves only 40.2% task completion, highlighting the challenge of integrating chemical reasoning, 3D intuition, and tool use. The benchmark aims to standardize evaluation for autonomous computational drug design.
small molecule drug designmulti-turn benchmarkpharmacophore identificationscaffold hoppinglead optimization
AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence
AttuneBench introduces a conversation-based benchmark for evaluating LLM emotional intelligence (EI) through 200 multi-turn human-model dialogues with turn-by-turn emotional state annotations. Unlike synthetic or single-turn EI assessments, it measures real-time inference of user affect and preferred responses across 11 models. Key findings show EI decomposes into separable capabilities (emotion recognition, behavioral classification, preference prediction, response quality), with preference alignment and response quality being more discriminative than emotion-label accuracy, highlighting context-dependent response prediction as critical for EI.
emotional intelligencemulti-turn conversationpreference alignmentresponse qualityhuman-model interaction
Support-aware offline policy selection for advertising marketplaces
The paper introduces a support-aware offline decision framework for selecting reserve-price policies in advertising auctions, addressing limitations of existing replay and off-policy evaluation methods. The framework produces conservative decision objects (certified policies, dominated alternatives, unresolved candidates) rather than point estimates, with theoretical guarantees on preserving the best policy while eliminating only those with certified regret. Experiments on iPinYou real-time-bidding logs demonstrate effectiveness: a 19-policy catalog was reduced to a 2-policy validation shortlist while achieving 40.71-47.66% lift across seasons and certifying non-harm across 44 segments.
offline policy selectionreserve-price auctionssupport-aware evaluationstatistical certificationreal-time bidding
Probabilistic Attribution For Large Language Models
The authors propose a probabilistic token attribution measure for Large Language Models (LLMs) by situating them within stochastic process theory. Their model-agnostic method inverts next-token log-probabilities via Bayes rule to compute conditional probabilities of responses given prompts, with and without specific tokens marginalized. The attribution score is defined as the log ratio of these probabilities, complemented by entropy analysis of token distributions. Evaluations across 8 models and 7 prompts reveal insights into model behavior, including token sensitivity, response stability, and training convergence, enhancing interpretability.
probabilistic attributionstochastic processestoken marginalizationconditional probabilityentropy analysis
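The log-ratio score can be illustrated with a small sketch. It assumes the response log-probability is the sum of next-token log-probabilities (the chain rule) and takes the "token marginalized" log-probabilities as given; how the paper performs the marginalization is not described in the summary, and the numbers below are made up.

```python
from typing import List


def response_log_prob(token_log_probs: List[float]) -> float:
    """Chain rule: log P(response | prompt) is the sum of per-token log-probabilities."""
    return sum(token_log_probs)


def attribution_score(with_token: List[float], token_marginalized: List[float]) -> float:
    """Log ratio of the response probability with the prompt token present vs. marginalized out."""
    return response_log_prob(with_token) - response_log_prob(token_marginalized)


# Hypothetical per-token log-probs for a two-token response.
score = attribution_score(with_token=[-0.1, -0.2], token_marginalized=[-0.5, -0.9])
print(round(score, 3))  # 1.1
```

A positive score means the response is more likely when the token is present, so the token supports that response; a score near zero means the token barely matters.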
TBP-mHC: full expressivity for manifold-constrained hyper connections through transportation polytopes
The paper introduces Transportation Birkhoff Polytope (TBP) and its recursive variant (RTBP) to enable full expressivity in manifold-constrained hyper-connections (mHC) for residual networks. By parameterizing exactly doubly stochastic mixing matrices with $(n-1)^2$ degrees of freedom, TBP avoids iterative normalization (e.g., Sinkhorn) and combinatorial complexity (e.g., mHC-lite) while remaining within the Birkhoff polytope. Experiments on language model pre-training show competitive performance with improved stability and scalability compared to prior mHC approaches.
hyper-connectionsbirkhoff polytopedoubly stochasticresidual networkstransportation polytope
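The $(n-1)^2$ count follows from a standard constraint-counting argument, sketched here with $\mathcal{B}_n$ as assumed notation for the set of $n \times n$ doubly stochastic matrices. A matrix has $n^2$ entries, subject to $n$ row-sum and $n$ column-sum equalities; one of these is redundant because the row sums and the column sums both add up to the same total $n$.

```latex
\dim \mathcal{B}_n
  = \underbrace{n^2}_{\text{entries}}
  - \underbrace{(2n - 1)}_{\text{independent sum constraints}}
  = (n-1)^2
```

For example, with $n = 4$ mixing streams the matrix has 16 entries but only 9 free parameters.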
Learning Altruistic Collaboration in Heterogeneous Multi-Team Systems
The paper introduces a framework for heterogeneous multi-team collaboration using dynamic robot allocation, treating robots as transferable resources. It employs Hamilton's rule from ecology as an altruistic decision-making mechanism, addressing a combinatorial NP-hard allocation problem. A graph neural network policy is developed for centralized training and decentralized execution, approximating altruistic allocations and predicting robot transfers. Validated in firefighting simulations, the learned policy achieves near-optimal performance and scalability.
multi-team collaborationhamilton's rulegraph neural networknp-harddecentralized execution
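Hamilton's rule states that an altruistic act is favored when relatedness times benefit exceeds cost, r * B > C. The sketch below applies it to a single robot-transfer decision; how the paper defines relatedness, benefit, and cost for robot teams is not given in the summary, so the parameter meanings here are assumptions.

```python
def should_transfer(relatedness: float, benefit: float, cost: float) -> bool:
    """Hamilton's rule: lend a robot when r * B > C.

    relatedness: hypothetical measure of how much the donor team values the recipient's success
    benefit:     hypothetical gain to the recipient team from receiving the robot
    cost:        hypothetical loss to the donor team from giving it up
    """
    return relatedness * benefit > cost


print(should_transfer(relatedness=0.5, benefit=10.0, cost=4.0))  # True
print(should_transfer(relatedness=0.5, benefit=10.0, cost=6.0))  # False
```

Evaluating this rule jointly over every robot and team pair is what makes the allocation combinatorial, which motivates approximating the decision with a learned policy.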
Latent-space Attacks for Refusal Evasion in Language Models
The paper proposes Controlled Latent-space Evasion (CLE), a novel attack method that suppresses refusal behavior in safety-aligned language models by projecting latent representations past the decision boundary of linear refusal probes. The authors recast refusal suppression as evasion against linear classifiers, showing prior work's ablation methods correspond to minimum-confidence attacks. CLE optimizes projection distance to push representations into compliant regions, achieving state-of-the-art attack success rates across 15 models including instruction-tuned, multimodal, and reasoning variants.
latent-space attack, refusal evasion, linear probe, decision boundary, jailbreak
The Impact of AI Usage and Informativeness on Skill Development in Logical Reasoning
This study investigates how AI usage and informativeness affect skill development in logical reasoning tasks. Through controlled experiments with on-demand AI assistance, the authors demonstrate that heavy AI usage correlates with weaker skill acquisition compared to light users or non-users. Results show low-information AI impairs learning without improving performance, while high-information AI enhances short-term outcomes without uniformly affecting post-AI performance, revealing context-dependent complementarity or substitutability between AI and human reasoning.
skill development, logical reasoning, ai assistance, informativeness, human-ai interaction
PocketAgents: A Manifest-Driven Library of Autonomous Defense Agents
PocketAgents introduces a manifest-driven library for autonomous defense agents, enabling LLM-based defensive enforcement through typed boundaries and runtime contexts. Each agent comprises a manifest, prompt, and runtime context, with shared runtime providing bounded telemetry access and action validation. Implemented on a cyber arena (Perry), the system evaluated two agents (Command and Control, Exfiltration) in 18 trials of a DarkSide-inspired attack. Results showed 13 successful network-block actions, 4 schema validation failures, and 1 valid no-action decision, demonstrating measurable and attributable LLM-driven defense.
pocketagents, manifest-driven, autonomous defense, typed boundaries, cyber arena
Investigating Concept Alignment Using Implausible Category Members
The study investigates concept alignment in AI systems by probing category boundaries through implausible category members (e.g., 'Is an olive a vehicle?'), contrasting with traditional plausible queries. Using Rosch and Mervis's psychological framework, the authors compare AI and human assignments of objects to superordinate categories, including mismatched ones. Results reveal significant divergences: models misclassify words as vehicles, vegetables as fruits, and non-weapons as weapons, highlighting concept misalignment with implications for AI safety.
concept alignment, category boundaries, superordinate categories, ai safety, concept misalignment
Tokenisation via Convex Relaxations
The paper introduces ConvexTok, a novel tokenisation algorithm formulated as a linear program and solved via convex optimisation, addressing the suboptimality of greedy methods like BPE and Unigram. ConvexTok optimises vocabulary construction globally, improving intrinsic metrics (e.g., bits-per-byte) and occasionally downstream task performance. It provides a certifiable lower bound on optimality and empirically comes within 1% of optimal at standard vocabulary sizes.
tokenisation, convex optimisation, linear program, bits-per-byte, vocabulary construction
Integrable Elasticity via Neural Demand Potentials
The paper introduces Integrable Context-Dependent Demand Network (ICDN), a neural model for retail demand prediction that directly learns log-demand as a smooth function of log-prices conditioned on context. This approach enables exact derivation of elasticities from the learned demand surface, addressing limitations in traditional log-log models. Evaluated on the Dominick's beer dataset, ICDN demonstrates superior out-of-sample generalization and produces more stable, economically plausible elasticity estimates, particularly for weakly identified cross-price effects.
demand modeling, price elasticity, neural networks, retail analytics, context-conditioned
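The idea of reading elasticities off a log-demand surface can be illustrated with a toy example: the elasticity of product $i$ with respect to price $j$ is the partial derivative of log-demand with respect to log-price. The sketch below uses a hand-written log-linear demand function and central finite differences as a stand-in; the paper's neural model and its exact differentiation are not reproduced, and all names and coefficients are hypothetical.

```python
import math
from typing import Callable, List


def log_demand(log_prices: List[float]) -> float:
    """Toy log-demand for product 0 (hypothetical own/cross coefficients)."""
    own, cross = -1.8, 0.3
    return 2.0 + own * log_prices[0] + cross * log_prices[1]


def elasticities(f: Callable[[List[float]], float],
                 log_prices: List[float], h: float = 1e-5) -> List[float]:
    """Central-difference estimate of d log q / d log p_j for each price j."""
    grads = []
    for j in range(len(log_prices)):
        up = list(log_prices)
        down = list(log_prices)
        up[j] += h
        down[j] -= h
        grads.append((f(up) - f(down)) / (2 * h))
    return grads


eps = elasticities(log_demand, [math.log(4.0), math.log(5.0)])
print(eps)  # approximately the own- and cross-price coefficients
```

With a learned surface, automatic differentiation would replace the finite differences, but the quantity being extracted is the same gradient.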
Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration
The paper introduces a curiosity-driven reinforcement learning method for 3D exploration that addresses local loops and forgotten states by incorporating spatial persistence and episodic context. The approach uses an online 3D reconstruction as a persistent world model and a sequence-modeled agent policy over RGB observations to maintain episodic memory. Evaluated on HM3D, the method outperforms RL-based active mapping baselines, demonstrates zero-shot generalization to Gibson and AI-generated environments, and shows superior performance in downstream tasks like apple picking and image-goal navigation.
curiosity-driven rl, 3d reconstruction, episodic memory, zero-shot generalization, intrinsic rewards
FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection
FAME introduces a failure-aware mixture-of-experts framework for message-level log anomaly detection, addressing granularity and scalability challenges in production systems. The method leverages an LLM offline to partition templates into failure domains, annotates a limited number of lines per template, and trains a lightweight router and domain experts for on-premise deployment. On BGL, it reaches F1 = 98.16 with a 76x reduction in annotation and detects 86.3% of anomalies on unseen EventIDs; on Thunderbird, it reaches F1 = 99.95 with perfect recall.
mixture-of-experts, log anomaly detection, failure domains, lightweight router, annotation efficiency
Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation
The paper revisits Uniform Diffusion Models (UDM) by identifying a mismatch between the standard plug-in ELBO and cross-entropy denoising objective, showing that the denoising posterior optimizes a leave-one-out predictor rather than clean-data prediction. It derives exact conversions between the denoiser, leave-one-out posterior, and score, enabling disentanglement of parameterization and training objectives. The authors propose an absorbing-state reformulation of UDM that simplifies sampling operations and improves inference via informed predictor-corrector sampling and temperature sampling. Experiments on language modeling demonstrate that leave-one-out parameterizations enhance UDM generation, while the absorbing construction matches or exceeds Masked Diffusion Models (MDM) performance.
uniform diffusion models, leave-one-out denoiser, absorbing state, masked diffusion models, predictor-corrector sampler
Lumberjack: Better Differentially Private Random Forests through Heavy Hitter Detection in Trees
(No summary returned.)
Plug-in Losses for Evidential Deep Learning: A Simplified Framework for Uncertainty Estimation that Includes the Softmax Classifier
The paper introduces a simplified framework for uncertainty estimation in Evidential Deep Learning (EDL) by approximating Dirichlet-based objectives with plug-in losses evaluated at the Dirichlet mean. The method demonstrates that approximation error decays with increasing evidence for common loss functions (e.g., cross-entropy, mean-squared error), and notably includes the softmax classifier as a special case. Empirical validation on the Google Speech Commands dataset shows comparable predictive accuracy and selective prediction performance to classical EDL, while offering simpler implementation via standard deep learning pipelines.
evidential deep learning, uncertainty estimation, dirichlet distribution, plug-in loss, selective prediction
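For the cross-entropy case, the gap between a Dirichlet-expected loss and its plug-in counterpart can be written out with standard Dirichlet identities. The derivation below is a sketch under that assumption and may differ from the paper's own statement; the notation ($\alpha_k$, $\alpha_0$, $\bar p$, logits $z_k$) is introduced here for illustration. In particular, reading the softmax classifier as the choice $\alpha_k = e^{z_k}$ is an assumption.

```latex
% Dirichlet parameters \alpha_1,\dots,\alpha_K, total evidence \alpha_0 = \sum_k \alpha_k, true class y
\text{Expected CE:}\quad
\mathbb{E}_{p \sim \mathrm{Dir}(\alpha)}\bigl[-\log p_y\bigr] = \psi(\alpha_0) - \psi(\alpha_y)

\text{Plug-in CE at the mean } \bar p_k = \alpha_k / \alpha_0:\quad
-\log \bar p_y = \log \alpha_0 - \log \alpha_y

% Using \psi(x) = \log x - \tfrac{1}{2x} + O(x^{-2}):
\text{Gap} = \bigl[\psi(\alpha_0) - \psi(\alpha_y)\bigr] - \bigl[\log \alpha_0 - \log \alpha_y\bigr]
= \frac{1}{2\alpha_y} - \frac{1}{2\alpha_0} + O(\alpha_y^{-2})

% Assumed link to softmax: with \alpha_k = e^{z_k}, \bar p = \mathrm{softmax}(z)
% and the plug-in CE is the usual softmax cross-entropy.
```

The gap is nonnegative by Jensen's inequality and shrinks as the evidence $\alpha_y$ grows, which is consistent with the decay described in the summary.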
SeqLoRA: Bilevel Orthogonal Adaptation for Continual Multi-Concept Generation
SeqLoRA introduces a bilevel orthogonal adaptation framework for continual multi-concept generation in text-to-image diffusion models, addressing representation interference through constrained continual learning. The method jointly optimizes LoRA factors via bilevel optimization, theoretically guaranteeing convergence and minimizing catastrophic forgetting by modeling residual layer activations as a matrix sub-Gaussian process. Experiments demonstrate improved identity preservation across 101 concepts, reduced attribute interference, and eliminated post-hoc fusion costs compared to frozen-basis methods.
continual learning, bilevel optimization, lora adaptation, diffusion models, representation interference
Ternary Decision Trees with Locally-Adaptive Uncertainty Zones
The paper introduces ternary decision trees, which extend standard CART by adding locally-adaptive uncertainty zones around split thresholds. Each zone's half-width delta is computed by one of five novel methods (quality-plateau, class-overlap, gain-ratio, node-bootstrap, and margin), which reuse existing split statistics and require no external noise specification. Evaluated on 72 OpenML-CC18 datasets, all methods with probabilistic routing outperform CART (p < 0.001), with the margin method achieving the highest efficiency (0.104 accuracy gain per unit flagging rate) and winning on 42 datasets. Node-bootstrap shows particular promise on medical data, improving mammography screening accuracy by 0.71% while flagging 10.8% of cases as uncertain.
ternary decision trees, uncertainty zones, cart split finding, probabilistic routing, boundary-uncertain flagging
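A minimal sketch of what a three-way split with an uncertainty zone might look like is shown below. The summary does not specify how probabilistic routing is defined, so the linear ramp in `left_probability` is an assumption made for illustration, as are all function names.

```python
from typing import Literal

Route = Literal["left", "right", "uncertain"]


def ternary_route(x: float, threshold: float, delta: float) -> Route:
    """Route a feature value left, right, or flag it as inside the uncertainty zone."""
    if x < threshold - delta:
        return "left"
    if x > threshold + delta:
        return "right"
    return "uncertain"


def left_probability(x: float, threshold: float, delta: float) -> float:
    """Hypothetical soft routing: probability of going left ramps linearly across the zone."""
    if delta <= 0.0:
        return 1.0 if x <= threshold else 0.0
    if x <= threshold - delta:
        return 1.0
    if x >= threshold + delta:
        return 0.0
    return (threshold + delta - x) / (2.0 * delta)


print(ternary_route(2.2, threshold=2.0, delta=0.5))
print(left_probability(2.0, threshold=2.0, delta=0.5))
```

Setting `delta` to zero recovers an ordinary binary CART split, which is why the zone can be seen as a strict extension of the standard tree.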
Optimization over the intersection of manifolds
The paper establishes the equivalence between clean intersection and intrinsic transversality for optimization over intersecting manifolds, enabling tractable tangent space projection. It proposes a geometric method using single-manifold retraction with orthogonal updates: one direction approaches the second manifold asymptotically while the other minimizes the objective. Under intrinsic transversality, the method achieves provable convergence rates for both feasibility and optimality, with all accumulation points being first-order stationary. Experiments demonstrate effectiveness on sparse/low-rank problems including spherical data fitting, hyperbolic embedding approximation, and compressed mode computation.
manifold intersection, intrinsic transversality, geometric optimization, retraction method, first-order stationarity
Multiple Neural Operators Achieve Near-Optimal Rates for Multi-Task Learning
The paper establishes approximation and generalization bounds for Multiple Neural Operators (MNO) in multi-task operator learning. It demonstrates that shared representations across tasks maintain parametric efficiency, matching single-task scaling laws. Theoretical analysis proves near-optimal upper bounds for Lipschitz operator classes and minimax lower bounds, showing no added complexity from multi-task learning. Comparisons with multi-task DeepONet reveal both architectures achieve similar asymptotic rates under worst-case complexity constraints.
neural operators, multi-task learning, lipschitz maps, minimax rates, parametric complexity
Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification
The study conducts a sparse-feature audit of GPT-2 Small's performance on the Indirect Object Identification (IOI) task, identifying features correlating with task failure. Using a 300-prompt corpus and layer-8 residual-stream sparse-autoencoder (SAE) features, the analysis reveals 146 significant features (Holm-corrected) and 105 with large effect sizes (|Cohen's d| > 0.8). Feature 17,491 ('cryptographic keys') shows the strongest correlation (d=+2.93), with failure rates of 93.3% for 'the keys' prompts versus 7.5% for others. Causal ablation and representation baseline tests confirm the feature's correlative, not causative, role. The audit pipeline, model-agnostic and laptop-executable, is the primary contribution.
sparse-autoencoder, indirect object identification, residual-stream, causal ablation, cohen's d
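The two statistics named in the summary, Cohen's d and the Holm correction, are standard and can be sketched in a few lines. This is a generic illustration on made-up numbers, not the paper's audit pipeline; the function names are hypothetical.

```python
import math
from statistics import mean, variance
from typing import List, Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down: compare the k-th smallest p-value (k from 0) to alpha / (m - k)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, idx in enumerate(order):
        if p_values[idx] > alpha / (m - k):
            break  # once one test fails, all larger p-values are retained too
        reject[idx] = True
    return reject


# Made-up activations of one feature on failing vs. succeeding prompts.
print(cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0]))
print(holm_reject([0.01, 0.04, 0.03]))
```

A threshold such as |d| > 0.8 is the conventional cutoff for a "large" effect, which is the filter the summary refers to.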
Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals
The paper identifies rigid clipping in RLVR (Reinforcement Learning with Verifiable Rewards) as a bottleneck causing training instability, where near-boundary signals are discarded. It proposes Near-boundary Stochastic Rescue (NSR), a method that stochastically retains these signals via boundary perturbations, outperforming deterministic approaches like gradient decay. Experiments across 7B to 30B models and diverse architectures (dense/MoE) show NSR improves stability and performance over baselines (DAPO, GSPO).
rlvr, clipping, stochastic rescue, gradient decay, near-boundary signals
Posterior Collapse as Automatic Spectral Pruning
The paper demonstrates that posterior collapse in β-VAEs functions as automatic spectral pruning, with latent modes collapsing when their reconstruction contribution falls below a β-determined cutoff. Through Landau stability analysis, the authors show that equilibrium solutions exhibit a cascade of collapses as latent modes decouple from least to most useful. They introduce a latent-rescaling-invariant order parameter to rank active modes and identify collapse thresholds. In linear Gaussian cases, collapse, utility, and normalized PCA spectra align, following mean-field laws. Empirical validation on the WorldClim dataset supports these theoretical predictions.
posterior collapse, β-vae, spectral pruning, landau stability, mean-field
ChronoVAE-HOPE: Beyond Attention -- A Next-Generation VAE Foundation Model for Specialized Time Series Classification
ChronoVAE-HOPE introduces a next-generation VAE foundation model for specialized time series classification, addressing quadratic attention costs and structural disentanglement. The method employs a VAE framework with HOPE Blocks, replacing attention with dual-memory systems (Titans modules for short-term retention and a Continuum Memory System for long-term context) and factorizing the latent space into trend and seasonal components via dedicated encoder/decoder pathways. Pre-trained on the Monash archive with MTSM and VAE objectives, it achieves strong performance on UCR benchmarks, particularly in causal settings, and offers interpretable embeddings for downstream classification.
variational autoencoder, time series foundation models, disentangled representations, dual-memory system, masked time series modeling
Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models
CEDAR (Conceptual Embedding Disentanglement via Adaptive Rotation) is a post-hoc method that reveals compositional structure in vision-language model embeddings without dimensionality expansion. It learns an invertible transformation with a top-$k$ sparsity bottleneck to produce axis-aligned disentangled coordinates, preserving original geometry while enabling interpretation via textual concepts (CLIP) or natural language descriptions (BLIP). Experiments show CEDAR achieves competitive reconstruction-sparsity trade-offs, yielding more interpretable and human-aligned explanations than sparse autoencoders, suggesting entanglement can be resolved via basis change rather than overcomplete expansion.
sparse disentanglement, vision-language models, post-hoc interpretability, invertible transformation, axis-aligned coordinates
Holographic functions and neural networks
The paper introduces three equivalent characterizations of bounded complexity for fuzzy Boolean functions $f:\cube^n\to [0,1]$. First, holographic property: $f(x)$ is recoverable from a bounded random sample of $x$'s coordinates. Second, structural property: $f$ approximates a bounded-degree polynomial in bounded linear forms. Third, computational property: $f$ approximates a neural network with bounded non-input neurons, Lipschitz activations, and weights. The equivalence is proven via hypergraph regularity techniques, linking sampling, algebraic, and neural representations of function complexity.
fuzzy boolean function, holographic property, bounded-degree polynomial, neural network, hypergraph regularity
SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation
SegCompass introduces an interpretable alignment method for reasoning segmentation by leveraging a Sparse Autoencoder (SAE) to map chain-of-thought (CoT) traces and visual tokens into a shared high-dimensional sparse concept space. The model employs a query codebook to select salient concepts, spatially grounded via a slot mapper into multi-slot heatmaps that guide a mask decoder. Joint training combines reinforcement learning for reasoning paths with segmentation supervision. Experiments on five benchmarks show SegCompass matches or exceeds state-of-the-art performance, with strong correlation between sparse concept quality and mask accuracy, confirming its inspectable alignment benefits.
sparse autoencoder, reasoning segmentation, chain-of-thought, interpretable alignment, slot mapper
The Secretary Problem with a Stochastic Precursor
This paper introduces a stochastic precursor signal in the secretary problem, demonstrating that its timing alone, without conveying additional information, can improve optimal stopping policies. The study analyzes both random-order and adversarial-order models, showing that a single uniformly timed precursor achieves a success probability of at least 1/2, surpassing the classic 1/e benchmark. As precursor timing becomes later, success probability approaches 1. In adversarial settings, sufficiently concentrated precursors enable constant success guarantees. These findings highlight asynchronous temporal information as a novel and potent form of advice for online decision-making.
secretary problem, stochastic precursor, optimal stopping, random-order model, adversarial-order model
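For reference, the classic 1/e benchmark mentioned above can be reproduced with a small Monte Carlo simulation. This sketch covers only the classical random-order rule (skip roughly n/e candidates, then accept the first one better than all seen so far); it does not implement the precursor-based policies from the paper. The function name and parameters are illustrative.

```python
import math
import random


def classic_secretary_success(n: int, trials: int, seed: int = 0) -> float:
    """Estimate the success rate of the classic 1/e stopping rule by simulation."""
    rng = random.Random(seed)
    cutoff = int(n / math.e)  # number of candidates observed without accepting
    wins = 0
    for _ in range(trials):
        ranks = list(range(n))  # higher is better; n - 1 is the best candidate
        rng.shuffle(ranks)
        best_seen = max(ranks[:cutoff], default=-1)
        chosen = ranks[-1]  # forced to take the last candidate if nobody beats best_seen
        for r in ranks[cutoff:]:
            if r > best_seen:
                chosen = r
                break
        wins += chosen == n - 1
    return wins / trials


rate = classic_secretary_success(n=50, trials=20000)
print(rate)  # should land near 1/e, i.e. roughly 0.37
```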
From Baseline to Follow-Up: Counterfactual Spine DXA Image Synthesis in UK Biobank Using a Causal Hierarchical Variational Autoencoder
The authors propose a causal hierarchical variational autoencoder (CHVAE) for generating counterfactual dual-energy X-ray absorptiometry (DXA) spine images that maintain anatomical plausibility under age interventions. The model, trained on 3,743 baseline AP spine scans from UK Biobank, incorporates metadata conditioning (participant attributes and lumbar morphometry) and enforces causal consistency via abduction-action-prediction (AAP) on follow-up imaging data. Evaluation demonstrates strong absolute-level agreement between synthesized and observed vertebral morphometry when intervening on age, validating the approach for intervention-aligned medical image synthesis.
causal hierarchical variational autoencoder, dual-energy x-ray absorptiometry, counterfactual image synthesis, abduction-action-prediction, vertebral morphometry
Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics
The work challenges the standard Langevin process approximation of Stochastic Gradient Descent (SGD) by proposing an alternative formulation as deterministic dynamics in a minibatch-induced fluctuating loss landscape. Starting from discrete updates, the authors derive a master equation and discrete Fokker-Planck equation, revealing deviations from Langevin dynamics at order η². Analysis near critical points shows SGD behavior decomposes along the mean Hessian eigenbasis, with nearly-flat directions exhibiting unbounded variance growth (effective diffusion proportional to η). Empirical validation on vision and NLP models confirms distinct confined/diffusive mode separation.
stochastic gradient descent, langevin dynamics, fokker-planck equation, hessian eigenbasis, diffusion coefficient
The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution
The paper proposes Conflict-Averse Magnitude-Enhanced Gradient Descent (CAME-Grad), a backbone-agnostic optimizer for multi-task radiology report generation (RRG). Analyzing gradient dynamics via stochastic differential equations reveals a 'Double Dilemma' of drift term deviation and diffusion term decay in linear scalarization. CAME-Grad combines conflict-averse direction rectification, magnitude-enhanced energy injection, and adaptive gradient fusion to balance discriminative supervision and generation smoothness. Experiments on MIMIC-CXR and IU X-Ray show average improvements of 2.3% and 1.9% respectively across eight RRG methods.
radiology report generation, multi-task learning, gradient dynamics, stochastic differential equation, optimizer
A note on convergence of Wasserstein policy optimization
The work establishes linear convergence for Wasserstein Policy Optimization (WPO) in entropy-regularized Markov Decision Processes with continuous state-action spaces. By analyzing WPO as a gradient flow and leveraging log-Sobolev inequalities, the authors demonstrate monotonic energy dissipation and derive a local log-Sobolev inequality under sufficient regularity conditions. These properties enable a proof of linear convergence to the global optimum for the value function.
wasserstein policy optimization, gradient flow, log-sobolev inequality, entropy-regularized mdp, linear convergence
UNAD+: An Explainable Hybrid Framework for Unknown Network Attack Detection
UNAD+ introduces an explainable hybrid framework for detecting unknown network attacks, combining unsupervised anomaly detection with supervised refinement. The method employs a benign-only unsupervised ensemble with Weighted Majority Voting, pseudo-labeling for supervised refinement, and post hoc explainability for local/global interpretations. Evaluated on CICIDS2017 and NSL-KDD, it achieves >98% F1-score while reducing false positives and improving transparency compared to its predecessor UNAD.
zero-day attacks, weighted majority voting, pseudo-labeling, intrusion detection, explainable ai
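Weighted majority voting over an ensemble of anomaly detectors can be sketched generically as follows. How UNAD+ derives its weights is not stated in the summary, so the weights, the one-half decision threshold, and the function name here are assumptions for illustration only.

```python
from typing import Sequence


def weighted_majority_vote(votes: Sequence[int], weights: Sequence[float]) -> int:
    """Return 1 (anomalous) if detectors voting 1 hold more than half the total weight."""
    if len(votes) != len(weights):
        raise ValueError("votes and weights must have the same length")
    total = sum(weights)
    anomalous = sum(w for v, w in zip(votes, weights) if v == 1)
    return int(anomalous > total / 2)


# Three hypothetical detectors: one strong detector flags the flow, two weak ones do not.
print(weighted_majority_vote([1, 0, 0], [0.6, 0.2, 0.1]))
```

Compared with an unweighted vote, this lets a single reliable detector override several unreliable ones, which is the usual motivation for weighting.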
Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework
The paper proposes a multi-reward RLIF framework to address reward hacking and entropy collapse in unsupervised reinforcement learning for LLMs. The method decomposes training signals into complementary answer-level (cluster voting) and completion-level (token-wise self-certainty) rewards, normalized via GDPO, and introduces KL-Cov regularization to preserve exploration. Evaluated on mathematical reasoning and code-generation benchmarks, it achieves stability comparable to supervised RLVR methods while relying solely on internal feedback. Results demonstrate that dual-reward decomposition with targeted regularization enables robust long-horizon reasoning without external supervision.
rlif, reward hacking, kl-cov regularization, gdpo normalization, entropy collapse
Evolutionary Multi-Task Optimization for LLM-Guided Program Discovery
The paper introduces Evolutionary Multi-Task Optimization (EMO-STA), a two-stage framework for LLM-guided program discovery that first evolves a shared archive of executable programs across task families before adapting them to individual tasks. The method explores adaptation strategies including warm-starting, best-average adaptation, and task-specific adaptation. Evaluated across eight task families (continuous optimization, geometric construction, etc.), EMO-STA outperforms single-task evolution in most settings, with STA Best-Local excelling in-distribution and STA Best-Shared showing robust transfer to unseen tasks. Results indicate balanced compute allocation between shared and adaptation phases is optimal, and shared evolution mitigates overfitting in low-evidence scenarios like ARC tasks and time-series feature engineering.
evolutionary multi-task optimization, llm-guided program discovery, shared-then-adapt, task-family optimization, compute allocation
Benchmarking Machine Learning Architectures for Antimicrobial Stewardship in Pediatric ICUs
This study benchmarks machine learning architectures for antimicrobial stewardship (AMS) in pediatric ICUs, evaluating four clinically relevant proxy targets for antibiotic reduction. Using public and private datasets, it compares tabular, sequence-based, and graph-based temporal models across multiple resolutions. Results indicate that performance depends more on target prevalence and dataset characteristics than model complexity, with sequence models improving precision-recall trade-offs at 24-hour resolution but suffering poorer calibration. Multi-task learning offers marginal improvements, suggesting limited shared structure across targets. The findings emphasize target design, temporal representation, and calibration in clinical ML applications.
antimicrobial stewardship, temporal modeling, precision-recall, multi-task learning, clinical decision support
Factored Diffusion Policies: Compositionally Generalized Robot Control with a Single Score Network
The paper introduces factored diffusion policies, a method for compositionally generalizing robot control using a single shared diffusion network trained with per-factor null-token dropout. At inference, the approach decomposes the score additively across factors, approximating the true joint score with bounded uniform error under approximate conditional independence and reducing the training-task budget from combinatorial to linear in factor cardinalities. Theoretical analysis provides a trajectory-tube certificate linking score-level bounds to closed-loop performance. Experiments on drone racing demonstrate strong generalization: 90% success on held-out gates (matching an oracle) versus 3% for a baseline, and zero-shot transfer to unseen venues with +11.7pp success and a 2.4× crash reduction.
diffusion policies, compositional generalization, null-token dropout, trajectory-tube certificate, robot control
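The additive decomposition described above has a well-known form when the factors are exactly conditionally independent given the sample. The sketch below states that idealized identity; the paper's precise formulation and error bound may differ. The notation ($c_k$ for factors, $\varnothing$ for the null token, $s$ for the score network) is introduced here for illustration.

```latex
% Assume c_1,\dots,c_K are conditionally independent given x.
p(x \mid c_1,\dots,c_K) \;\propto\; p(x) \prod_{k=1}^{K} p(c_k \mid x)
\;\propto\; p(x) \prod_{k=1}^{K} \frac{p(x \mid c_k)}{p(x)}

% Taking \nabla_x \log of both sides, with s(x, c) \approx \nabla_x \log p(x \mid c)
% and s(x, \varnothing) \approx \nabla_x \log p(x):
\nabla_x \log p(x \mid c_1,\dots,c_K)
= s(x, \varnothing) + \sum_{k=1}^{K} \bigl[\, s(x, c_k) - s(x, \varnothing) \,\bigr]
```

Each term on the right needs the network conditioned on at most one factor, which is why training coverage can scale with the sum of factor cardinalities instead of their product.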
Do Deep Ensembles Actually Capture Uncertainty in Graph Neural Networks?
This study challenges the assumed effectiveness of deep ensembles for uncertainty quantification in graph neural networks (GNNs). Through benchmarking on seven diverse datasets, the authors demonstrate that ensembles offer minimal improvement over single models, primarily stabilizing optimization noise rather than enhancing uncertainty estimates. An aleatoric-epistemic decomposition reveals epistemic collapse, where independently trained GNNs converge to similar predictions, undermining ensemble diversity. The analysis attributes this collapse to functional convexity rather than weight-space convexity. Findings indicate that deep ensemble success in domains like computer vision does not generalize to graph-structured data.
deep ensembles, graph neural networks, uncertainty quantification, epistemic collapse, functional convexity
A Tutorial on Diffusion Theory: From Differential Equations to Diffusion Models
The tutorial formalizes diffusion models through differential equations, establishing connections between forward/reverse processes and their ODE/SDE representations. It demonstrates that the conditional Gaussian forward process admits both ODE and SDE formulations, with marginalization yielding dynamics that transform data distributions to Gaussian priors. The analysis derives reverse-time dynamics governed by marginal scores, showing equivalence between noise-prediction objectives and score matching. Sampling methods like DPM-Solver and guidance techniques are discussed, with DDPM and DDIM framed as discrete reverse-SDE and reverse-ODE sampling respectively.
diffusion models, stochastic differential equations, score matching, reverse-time dynamics, sampling methods
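The equivalence between noise prediction and score matching mentioned above follows from the Gaussian form of the forward process. The short derivation below uses the common $\alpha_t$, $\sigma_t$ notation, which is an assumption about the tutorial's conventions.

```latex
% Conditional Gaussian forward process
x_t = \alpha_t x_0 + \sigma_t \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)

% Conditional score of p(x_t \mid x_0) = \mathcal{N}(\alpha_t x_0, \sigma_t^2 I)
\nabla_{x_t} \log p(x_t \mid x_0) = -\frac{x_t - \alpha_t x_0}{\sigma_t^2} = -\frac{\varepsilon}{\sigma_t}

% Hence a noise predictor \varepsilon_\theta corresponds to the score estimate
s_\theta(x_t, t) = -\frac{\varepsilon_\theta(x_t, t)}{\sigma_t}
```

Regressing $\varepsilon_\theta$ onto $\varepsilon$ is therefore denoising score matching up to a time-dependent weighting of the loss.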
GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving
The authors propose GraphFlow, a graph-based workflow management system for efficient LLM-agent serving, addressing limitations of template-based approaches. Key innovations include wGraph, a unified graph representation of atomic operations enabling dynamic workflow instantiation, and two optimizations: adaptive workflow generation from task semantics and workflow state management leveraging graph structure for KV-cache efficiency. Evaluations across five benchmarks demonstrate 4.95% average performance gain and 4× memory footprint reduction versus state-of-the-art methods.
llm-agent, workflow management, wgraph, kv-cache, dynamic instantiation
SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations
The paper introduces SynAE, a framework for evaluating synthetic datasets used to test tool-calling agents, addressing the limitations of real datasets (sensitivity, sparsity). SynAE measures validity, fidelity, and diversity across four metric categories: task instructions/responses, tool calls, final outputs, and downstream evaluation. Experiments on agent benchmarks demonstrate SynAE's ability to detect fine-grained variations in synthetic data quality, showing that multi-axis evaluation is necessary. The framework is available via demo and open-source code.
synthetic data evaluation, tool-calling agents, validity metrics, fidelity assessment, multi-turn interactions
Regret-Based $(ε,δ)$-optimal Stopping Criteria for Bayesian Optimization
The authors propose theoretically grounded stopping criteria for Bayesian optimization (BO) using Gaussian process upper confidence bound (GP-UCB), ensuring ε-optimal solutions with probability 1-δ. By deriving tighter instantaneous regret bounds for GP-UCB, they develop termination conditions that avoid unnecessary evaluations while providing optimality guarantees. Numerical experiments validate the criteria's effectiveness and efficiency compared to fixed-budget approaches.
bayesian optimization, gaussian process, regret bounds, stopping criteria, gp-ucb
Neural Flow Operators can Approximate any Operator: Abstract Frameworks and Universal Approximations
The authors introduce an abstract neural flow framework encompassing both finite-dimensional function approximation and infinite-dimensional operator approximation, featuring two continuous-depth models: neural flows with composition and separation structures. They prove well-posedness and universal approximation properties, including the first such result for flow-based models between infinite-dimensional spaces, and extend these results to convolutional neural flow models. Through time discretizations, the composition structure recovers ResNet-type architectures, while the separation structure yields plain architectures, unifying residual and plain designs for neural networks and operators.
neural flows, universal approximation, continuous-depth models, neural operators, residual architectures
ImplicitTerrainV2: Wavelet-Guided Spatially Adaptive Neural Terrain Representation
ImplicitTerrainV2 introduces a compact neural terrain representation combining wavelet-guided spatial adaptivity, derivative-aware supervision, and model compression. The method employs a wavelet complexity field (WCF) for frequency control, adaptive sampling for training efficiency, and gradient matching for derivative fidelity. Post-training quantization achieves 1.23 bpp storage with 0.28 dB PSNR drop. Evaluated on 50 Swiss terrain tiles, it achieves 66.25 dB PSNR (5.70 dB improvement over prior work), uses 3.2× fewer parameters, and trains in 55s/tile on one GPU, while supporting resolution-independent queries and derivatives.
implicit neural representation, wavelet complexity field, adaptive sampling, gradient matching, mixed-precision quantization
A Martingale Kernel Independence Test
The authors introduce mHSIC and mdHSIC, two martingale-based independence tests that eliminate the need for permutation calibration while maintaining statistical power. mHSIC constructs a self-normalized lower-triangular sum of centered Gram matrices, achieving quadratic runtime and standard normal null distribution under independence. mdHSIC extends this to joint independence testing via a half-sample split, ensuring finite-sample consistency with linear growth in variable count. Both methods demonstrate empirical type-I error control and power matching permutation-based HSIC/dHSIC baselines, while offering 25-60× speedups across synthetic experiments with 1-500 dimensional inputs and 2-10 jointly tested variables.
martingale independence test, hsic, gram matrices, permutation calibration, joint independence
F-TIS: Harnessing Diverse Models in Collaborative GRPO
The paper introduces Filtered Truncated Importance Sampling (F-TIS), a GRPO-style reinforcement learning framework enabling heterogeneous model collaboration in decentralized systems. F-TIS addresses the challenge of off-policy samples in GRPO by employing importance sampling to maintain convergence despite model diversity. Evaluations demonstrate identical convergence to on-policy training and up to 12% better generalization on out-of-distribution tasks, while remaining communication-efficient.
reinforcement learning, importance sampling, off-policy learning, decentralized training, generalization
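Truncated importance sampling in its generic form caps the likelihood ratio between the learner's policy and the policy that generated a sample. The sketch below shows only that capping step; the filtering stage of F-TIS and its integration with GRPO are not described in the summary and are not reproduced. The function name and the cap value are hypothetical.

```python
import math
from typing import List, Sequence


def truncated_is_weights(logp_target: Sequence[float],
                         logp_behavior: Sequence[float],
                         cap: float = 2.0) -> List[float]:
    """Per-sample weights min(pi_target / pi_behavior, cap), computed from log-probabilities."""
    if len(logp_target) != len(logp_behavior):
        raise ValueError("log-probability sequences must have the same length")
    return [min(math.exp(lt - lb), cap) for lt, lb in zip(logp_target, logp_behavior)]


# Two samples produced by a peer model: the first is 4x more likely under the
# learner's policy (so it gets capped), the second is half as likely.
w = truncated_is_weights([0.0, 0.0], [math.log(0.25), math.log(2.0)], cap=2.0)
print(w)
```

Capping trades a small bias for bounded variance, which matters when samples come from models that differ substantially from the learner.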
Relational Linear Properties in Language Models: An Empirical Investigation
This work empirically investigates relational linearity in language models, where object unembeddings are predicted from subject embeddings via linear maps for fixed relations. The authors introduce a KL-divergence-based probing method that improves upon prior Jacobian approximation techniques, evaluating layer-wise patterns and paraphrasing effects. Results across four datasets reveal model-dependent relational linearity, layer-specific trends aligning with known linguistic representation properties, and sensitivity to relation phrasing variations.
relational linearityunembeddingkl-divergencelinear probinglayer-wise analysis
Disentanglement Beyond Generative Models with Riemannian ICA
The paper introduces Riemannian ICA (RICA), a theoretical framework for local disentanglement that extends Independent Component Analysis (ICA) without requiring global generative assumptions. RICA formalizes disentanglement through radial curves in data space mapping to axis-aligned latent lines, quantified by a novel disentanglement tensor combining Hessian and Ricci curvature terms. In controlled experiments with known sources, RICA outperforms ICA baselines across multiple manifolds, demonstrating robustness to coordinate representations. The work bridges geometric structure with disentanglement theory for pretrained encoders.
disentanglementriemannian geometryindependent component analysisrepresentation learningricci curvature
Generative Modeling by Value-Driven Transport
The paper introduces value-driven transport (VDT), a novel generative modeling framework based on discrete-time stochastic control and measure transport. The method formulates generation as a linear program whose dual variables encode optimal value functions, enabling simulation-free primal-dual optimization for policy learning. VDT policies produce straight transport paths, support fast sampling, and retain compatibility with conditional generation and guidance techniques. Experiments demonstrate competitive performance and scalability compared to flow-based, diffusion, and Schrödinger bridge approaches.
generative modelingstochastic controlmeasure transportlinear programmingvalue function
EnCAgg: Enhanced Clustering Aggregation for Robust Federated Learning against Dynamic Model Poisoning
The paper proposes EnCAgg, a robust federated learning aggregation method against dynamic model poisoning attacks. It introduces density-based low-dimensional gradient clustering to identify malicious gradients while preserving benign ones, an enhancing clustering generator model to create pseudo-gradients bridging benign outliers, and gradient re-clustering to recover misclassified benign gradients. Evaluated on MNIST, CIFAR-10, and MIND datasets, EnCAgg demonstrates superior fidelity and robustness in dynamic poisoning scenarios compared to existing defenses.
federated learningmodel poisoningdensity-based clusteringpseudo-gradientsrobust aggregation
The Signal in the Noise: OOD Detection Through Goodness-of-Fit Testing in Factorised Latent Spaces
The paper introduces Signal in the Noise (SITN), a method for out-of-distribution (OOD) detection that leverages the diffeomorphic and mass-preserving properties of continuous normalizing flows. By analyzing how OOD samples map to atypical noise under the prior, SITN detects anomalies without requiring OOD data, maintaining low computational overhead and strict false positive rate control. Evaluations on standard benchmarks and synthetic perturbations demonstrate its effectiveness, avoiding the complexity bias of likelihood-based approaches.
out-of-distribution detectioncontinuous normalizing flowsgoodness-of-fit testingfalse positive ratelatent spaces
Represented Is Not Computed: A Causal Test of Candidate Algorithmic Intermediates in a Transformer
The study investigates how a Transformer implements structured arithmetic operations by analyzing base-digit extraction ($\lfloor N/B^D \rfloor \bmod B$). Using linear probes and causal interventions, it tests whether the model computes algorithmic intermediates (e.g., $N/B^D$) as suggested by the closed-form solution. While probes decode these intermediates, causal tests reveal sparse, late-combining routes for $N$, $B$, and $D$, diverging from the staged computation hypothesis. The model reaches 99.83% accuracy, yet the represented intermediates are not causally used in the output stream, highlighting a probe-causality gap.
transformercausal interventionlinear probesalgorithmic intermediatessparse circuit
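For reference, the target function $\lfloor N/B^D \rfloor \bmod B$ from the summary above can be written directly. This is a sketch of the closed-form computation only, not of the Transformer or the probing setup; the name `base_digit` is illustrative.

```python
def base_digit(n: int, base: int, position: int) -> int:
    """Return the digit at `position` (0 = least significant) of n written in `base`."""
    if n < 0 or base < 2 or position < 0:
        raise ValueError("expected n >= 0, base >= 2, position >= 0")
    intermediate = n // base ** position  # the candidate intermediate floor(N / B^D)
    return intermediate % base

print(base_digit(1234, 10, 2))  # 2
print(base_digit(255, 16, 1))   # 15
```

The variable `intermediate` is the quantity that, according to the summary, probes can decode but the model does not causally rely on.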
When Stronger Triggers Backfire: A High-Dimensional Theory of Backdoor Attacks
The paper develops a high-dimensional theory revealing counterintuitive behaviors in backdoor poisoning attacks: stronger training triggers can paradoxically improve clean test accuracy while reducing attack success. Analyzing regularized generalized linear models on Gaussian-mixture data in the proportional regime (p/n→κ), the authors prove three phenomena: (i) clean accuracy increases with trigger strength α, (ii) attack success peaks at finite α then declines, and (iii) minimum covariance eigenvectors yield optimal triggers. Theoretical results are derived for squared loss and extended to convex GLMs via Gaussian-proxy fixed points, with experiments on CIFAR-10 and ResNet-18 validating the findings beyond convex settings.
backdoor attackshigh-dimensional statisticsgeneralized linear modelsgaussian-proxycovariance eigenvectors
Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
The paper introduces a structured-sparse attention mechanism for efficient entity tracking in long sequences, achieving subquadratic complexity. By analyzing attention patterns as block-diagonal with sparse cross-block connections, the method performs exact within-block computations while approximating cross-block interactions through a reduced system, yielding O(n^{4/3}d) complexity. Evaluated on tracking benchmarks, it matches dense attention accuracy with 12-29% faster inference and 2.4× speedup over compact Transformers, though performance degrades when tracking properties exceed attention head count.
entity trackingsubquadratic complexitystructured-sparse attentionblock-diagonal attentionresolvent operator
Winner-Take-All bottlenecks enforce disentangled symbolic representations in multi-task learning
The paper demonstrates that winner-take-all (WTA) bottlenecks in deep neural networks enforce disentangled symbolic representations of categorical latent factors under specific conditions. Using theoretical analysis and empirical validation on two datasets, the authors show that WTA mechanisms lead to single-neuron or population-level encoding of abstract features (e.g., objects, colors). This symbolic representation improves generalization, bridging subsymbolic and symbolic AI systems. The work provides insights into WTA components in transformers' softmax attention and their role in factor disentanglement.
winner-take-alldisentangled representationssymbolic aimulti-task learninglatent factors
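As a rough illustration of the winner-take-all idea, and of why softmax is often described as a soft version of it, the sketch below compares a hard WTA bottleneck with a low-temperature softmax. It is not the paper's architecture; the function names and the temperature value are assumptions.

```python
import numpy as np

def winner_take_all(activations):
    """Hard WTA: keep only the largest unit per row as a one-hot code."""
    activations = np.asarray(activations, dtype=float)
    one_hot = np.zeros_like(activations)
    winners = activations.argmax(axis=-1)
    one_hot[np.arange(activations.shape[0]), winners] = 1.0
    return one_hot

def softmax(activations, temperature=1.0):
    """Softmax over the last axis; approaches hard WTA as temperature goes to 0."""
    scaled = np.asarray(activations, dtype=float) / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

x = np.array([[0.1, 2.0, -1.0], [3.0, 0.5, 0.2]])
print(winner_take_all(x))
print(softmax(x, temperature=0.01).round(3))
```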
Lost in Tokenization: Fundamental Trade-offs in Graph Tokenization for Transformers
The work establishes fundamental trade-offs in graph tokenization for transformers, demonstrating that different tokenizations induce distinct depth regimes for graph computations. Analyzing spectral, random-walk, and adjacency tokenizations, the authors prove that random-walk tokenization is inherently lossy, spectral tokenization is lossless but ill-conditioned for local tasks, and limited-depth transformers cannot convert between tokenization families. Theoretical lower bounds and impossibility results show tokenization choice affects structural representation recovery. Experiments on synthetic and real-world tasks validate these separations, revealing task-dependent preferences and benefits from combining complementary tokenizations.
graph tokenizationtransformer expressivityspectral tokenizationrandom-walk tokenizationdepth regimes
Reinforcement learning for ion shuttling on trapped-ion quantum computers
The authors present the first reinforcement learning (RL) approach for optimizing ion shuttling in trapped-ion quantum computers, addressing the high-dimensional optimization challenge in modular chip architectures. Their RL method learns shuttling strategies through direct interaction, outperforming heuristic techniques by reducing operations by up to 36.3%. The approach demonstrates versatility across different chip designs, offering a practical tool for evaluating shuttling efficiency in future quantum architectures.
reinforcement learningquantum computingion shuttlingoptimizationtrapped-ion
Do Not Trust The Auctioneer: Learning to Bid in Feedback-Manipulated Auctions
(No summary returned.)
AMUSE: Anytime Muon with Stable Gradient Evaluation
The paper introduces AMUSE (Anytime MUon with Stable gradient Evaluation), a novel optimizer combining Muon's orthogonalized momentum for matrix parameters with Schedule-Free optimization's iterate averaging. AMUSE dynamically interpolates between Muon's fast bulk subspace progress and averaged sequence stability via a time-varying coefficient, eliminating learning rate schedules while suppressing oscillatory valley-wall dynamics. Evaluations on vision tasks and large language model pretraining demonstrate consistent Pareto improvements over Schedule-Free AdamW and Muon in performance-iteration tradeoffs.
optimizerorthogonalized momentumiterate averagingriver-valley landscapeschedule-free
Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference
The paper introduces Asymmetric Virtual Memory Paging (AVMP), a memory allocator for hybrid Mamba-Transformer models that dynamically manages KV-caches and SSM states. AVMP uses physically separate pools with unified virtual addressing, migrating capacity on allocation failure to handle varying prompt distributions. Evaluated on 270 synthetic cells and 60 ShareGPT traces (RTX 3060 12GB), AVMP reduces OOM events by 7.6% and improves throughput 1.83×-13.3× (synthetic) or 2.36× (ShareGPT), with gains statistically significant under bootstrap CIs. Benefits stem from faster OOM recovery and KV-heavy allocation optimization.
asymmetric virtual memory pagingkv-cachestate space modelsmemory allocationhybrid transformers
Minimum Description Length based Granular-Ball Tree Regularization for Spectral Clustering
The paper proposes MDL-GBTRSC, a Minimum Description Length based Granular-Ball Tree-Regularized Spectral Clustering method that improves affinity graph construction for spectral clustering. The method constructs a granular-ball tree via local MDL model selection, using reciprocal neighborhood continuity to preserve local connections, and employs stable leaf balls to regularize the sample-level affinity graph. It introduces a shared-neighbor bridge code to adjust weak local bridges without threshold tuning. Experiments on real and synthetic datasets show MDL-GBTRSC achieves superior average ARI and NMI compared to classical spectral clustering and granular-ball baselines.
spectral clusteringminimum description lengthgranular-ball treeaffinity graphmodel selection
Cross-Species RSA Reveals Conserved Early Visual Alignment but Divergent Higher-Area Rankings Across Human fMRI and Macaque Electrophysiology
The study investigates cross-species generalization of learning rule alignment by comparing five rules (backpropagation, feedback alignment, predictive coding, STDP, random weights) against macaque electrophysiology and human fMRI. Representational similarity analysis (RSA) with identical CNN weights shows: (1) higher early visual alignment in macaque V1/V2 (ρ=0.15-0.30) than in human fMRI (ρ=0.01-0.08), (2) STDP and predictive coding lead in macaque V1/V2 (ρ≈0.30), (3) no correlation in IT rankings across species (Kendall’s τ=0.00), and (4) pretrained ResNet-50 outperforms custom CNNs in macaque IT (ρ=0.25 vs. 0.07-0.14), suggesting that model capacity limits higher-area alignment.
representational similarity analysisspike-timing-dependent plasticityelectrophysiologyfeedback alignmentnoise ceiling
A Posterior-Predictive Variance Decomposition for Epistemic and Aleatoric Uncertainty in Wind Power Forecasting
The paper presents a posterior-predictive variance decomposition method for disentangling epistemic uncertainty (EU) and aleatoric uncertainty (AU) in wind power forecasting, addressing the conflation problem in existing approaches. The method applies the law of total variance to heteroscedastic neural network regression with Bayesian posterior approximation, yielding estimators compatible with standard training techniques such as $\beta$-NLL. Experiments on synthetic and real-world SCADA data demonstrate theoretically consistent responses of AU and EU to noise structure, distribution shift, and dataset scaling, validating the decomposition's utility.
uncertainty quantificationheteroscedastic regressionbayesian approximationwind power forecastingvariance decomposition
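The law of total variance mentioned above splits predictive variance into the expected predicted noise variance (aleatoric) plus the variance of the predicted means across posterior samples (epistemic). The sketch below shows that generic decomposition for a set of posterior samples; the array layout and the function name are assumptions, and the paper's specific estimators may differ.

```python
import numpy as np

def decompose_variance(means, variances):
    """Split predictive variance using the law of total variance.

    means, variances: arrays of shape (n_posterior_samples, n_points) holding
    each posterior sample's predicted mean and predicted noise variance.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    aleatoric = variances.mean(axis=0)  # E_theta[ sigma_theta^2(x) ]
    epistemic = means.var(axis=0)       # Var_theta[ mu_theta(x) ]
    return aleatoric, epistemic, aleatoric + epistemic

# Two posterior samples, two test points.
means = [[1.0, 2.0], [3.0, 2.0]]
variances = [[0.5, 1.0], [1.5, 1.0]]
au, eu, total = decompose_variance(means, variances)
print(au, eu, total)  # [1. 1.] [1. 0.] [2. 1.]
```

At the second test point the posterior samples agree on the mean, so all of the predictive variance is attributed to noise.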
Hybrid Kolmogorov-Arnold Network and XGBoost Framework for Week-Ahead Price Forecasting in Australia's National Electricity Market
The paper proposes a hybrid KAN+XGBoost framework for week-ahead electricity price forecasting in Australia's NEM, addressing volatility and renewable uncertainty. The method integrates Kolmogorov-Arnold Networks (KAN) for global nonlinear representation with XGBoost's local robustness, capturing both long-term dependencies and short-term fluctuations. Evaluated on NEM data via expanding window, the hybrid model reduces MAE by 12% versus XGBoost and over 50% versus naive baselines, outperforming SARIMAX, LSTM, and standalone models.
kolmogorov-arnold networkxgboostelectricity price forecastingexpanding window evaluationnational electricity market
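The expanding-window evaluation named in the summary can be sketched as a split generator in which the training set always starts at the first observation and grows by one forecast horizon per fold. The window sizes and the function name below are illustrative assumptions, not values from the paper.

```python
def expanding_window_splits(n_samples, initial_train, horizon, step=None):
    """Yield (train_indices, test_indices) pairs for expanding-window evaluation."""
    step = horizon if step is None else step
    if initial_train <= 0 or horizon <= 0 or step <= 0:
        raise ValueError("initial_train, horizon and step must be positive")
    train_end = initial_train
    while train_end + horizon <= n_samples:
        # Train on everything seen so far, test on the next `horizon` points.
        yield range(0, train_end), range(train_end, train_end + horizon)
        train_end += step

for train, test in expanding_window_splits(n_samples=10, initial_train=4, horizon=2):
    print(list(train), list(test))
```

For week-ahead forecasting on daily data, `horizon=7` would be the natural choice under this scheme.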
Efficient Higher-order Subgraph Attribution via Message Passing
The authors present linear-time algorithms for higher-order subgraph attribution in GNN-LRP (layer-wise relevance propagation for graph neural networks), avoiding exponential complexity through message passing techniques that exploit the distributive property. The method efficiently computes relevance attributions for subgraphs and generalizes to include neighboring graph features. Experiments demonstrate significant speed improvements and validate the scalability and utility of the generalized attribution approach.
graph neural networkslayer-wise relevance propagationmessage passingsubgraph attributionexponential complexity
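The role of the distributive property can be seen on a toy problem: summing a product of edge weights over every walk of a given length. Enumerating walks is exponential in the length, while pushing the sum inside the product turns it into repeated matrix-vector products. This is a generic illustration of that trick, not the GNN-LRP algorithm itself; the weight matrix and function names are made up.

```python
import itertools
import numpy as np

def walk_sum_brute_force(weights, length):
    """Sum of edge-weight products over all walks with `length` edges (exponential)."""
    n = weights.shape[0]
    total = 0.0
    for walk in itertools.product(range(n), repeat=length + 1):
        term = 1.0
        for src, dst in zip(walk[:-1], walk[1:]):
            term *= weights[src, dst]
        total += term
    return total

def walk_sum_message_passing(weights, length):
    """Same quantity via message passing: linear in `length`."""
    messages = np.ones(weights.shape[0])
    for _ in range(length):
        # messages[i] becomes the summed weight of all walks starting at node i.
        messages = weights @ messages
    return float(messages.sum())

rng = np.random.default_rng(0)
W = rng.random((4, 4))
print(walk_sum_brute_force(W, 3), walk_sum_message_passing(W, 3))
```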
Multi-Stage Training for Abusive Comment Detection in Indic Languages
The paper proposes a multi-stage pipeline for abusive comment detection in Indic languages, focusing on minimizing false positives to preserve freedom of expression. The method combines language-specific preprocessing with an ensemble of models, evaluated through extensive experimentation. Results demonstrate improved performance in reducing false-positive rates while maintaining detection accuracy.
abusive comment detectionindic languagesensemble modelsfalse-positive ratelanguage preprocessing
Towards Explainability of SLMs by investigating Token Level Activation
The study introduces Activation Flow Network (AFN), a model-agnostic framework for explainability in small language models (SLMs) by quantifying token-level representational importance through hidden-state activation strengths. AFN computes Token Activation Strength using the L2 norm of Layer-8 hidden representations in BERT, enabling semantic token ranking via a threshold-based bucket formulation (HIGH/LOW-activation groups). Results show semantically meaningful content words dominate HIGH-activation buckets, suggesting Layer 8 as a semantic consolidation zone, offering a computationally efficient alternative to attention-based interpretability methods.
activation flow networktoken activation strengthhidden-state activationsemantic consolidationthreshold-based bucket
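A minimal sketch of the Token Activation Strength computation described above, assuming the layer's hidden states have already been extracted into an array of shape (n_tokens, hidden_dim). The summary does not state the threshold rule, so using the mean strength as the cut-off is an assumption, as are the function names and the toy vectors.

```python
import numpy as np

def token_activation_strength(hidden_states):
    """L2 norm of each token's hidden-state vector."""
    return np.linalg.norm(np.asarray(hidden_states, dtype=float), axis=-1)

def bucket_tokens(tokens, strengths, threshold=None):
    """Split tokens into HIGH/LOW groups; default threshold (mean) is an assumption."""
    strengths = np.asarray(strengths, dtype=float)
    if threshold is None:
        threshold = strengths.mean()
    high = [t for t, s in zip(tokens, strengths) if s >= threshold]
    low = [t for t, s in zip(tokens, strengths) if s < threshold]
    return high, low

tokens = ["the", "cat", "sat"]
hidden = [[0.1, 0.1], [3.0, 4.0], [0.6, 0.8]]  # toy stand-in for layer outputs
strengths = token_activation_strength(hidden)
print(bucket_tokens(tokens, strengths))  # (['cat'], ['the', 'sat'])
```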
Target-Aligned Bellman Backup for Cross-domain Offline Reinforcement Learning
The paper introduces Target-Aligned Bellman Backup (TABB), a method for cross-domain offline reinforcement learning that improves policy transfer by aligning source-domain transitions with target-domain Bellman targets. Unlike prior transition-level similarity approaches, TABB evaluates transferability based on long-term return consistency, selectively leveraging source data that contributes to accurate target-domain value estimation. Experiments across diverse CDRL settings with limited target data demonstrate TABB's consistent performance gains.
cross-domain offline rlbellman backuppolicy transfervalue estimationtransition alignment
Boundary-targeted Membership Inference Attacks on Safety Classifiers
The paper introduces a boundary-targeted membership inference attack (MIA) strategy that exploits low-confidence predictions in safety classifiers to infer training data membership. The method hypothesizes that ambiguous examples near decision boundaries reveal memorization patterns, enabling more effective attacks than standard MIAs. Experiments on a mental health support classifier demonstrate 19% true positive rate at 5% false positive rate, outperforming state-of-the-art MIAs by 3.5×. Analysis shows content filtering fails to protect boundary examples, while noise injection proves effective.
membership inference attacksafety classifiersdecision boundaryprivacy attackslow-confidence examples
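The headline metric above, true positive rate at a fixed false positive rate, can be computed from attack scores as sketched below. The scores are synthetic and the function name is illustrative; the threshold is chosen conservatively so that the realized false positive rate never exceeds the target.

```python
import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.05):
    """TPR of the rule `score > threshold`, with the threshold set on non-members."""
    member_scores = np.asarray(member_scores, dtype=float)
    ranked = np.sort(np.asarray(nonmember_scores, dtype=float))[::-1]  # descending
    k = int(np.floor(target_fpr * len(ranked)))  # allowed number of false positives
    # At most k non-members score strictly above ranked[k].
    threshold = ranked[k] if k < len(ranked) else -np.inf
    return float(np.mean(member_scores > threshold))

nonmembers = np.arange(100)     # synthetic attack scores for non-members
members = [90, 97, 99, 10]      # synthetic attack scores for members
print(tpr_at_fpr(members, nonmembers, target_fpr=0.05))  # 0.5
```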
ASAP: Attention Sink Anchored Pruning
ASAP (Attention Sink Anchored Pruning) introduces a training-free framework to mitigate computational bottlenecks in Vision Transformers (ViTs) caused by quadratic self-attention complexity. By modeling ViT information flow as a Lazy Random Walk, ASAP identifies attention sinks as dominant probability mass accumulators and partitions tokens via Radial Diffusion Clustering. The method compresses background redundancy through Transition Weight Pooling, achieving up to 48% throughput improvement while maintaining or exceeding baseline accuracy across image, video, and vision-language tasks.
vision transformersattention sinklazy random walkradial diffusion clusteringtransition weight pooling
Partial Fusion of Neural Networks: Efficient Tradeoffs Between Ensembles and Weight Aggregation
The paper introduces partial fusion, a method interpolating between neural network ensembles and weight aggregation to balance computational cost and performance. The approach extends weight aggregation by selectively combining weights of similar neurons, identified via partial optimal transport. Framing weight aggregation as generalized ensemble pruning, it allows neurons to be deleted or linearly combined. Experiments demonstrate that partial fusion achieves flexible tradeoffs, with generalized pruning on single networks yielding comparable benefits. Code is publicly available.
partial fusionweight aggregationneural ensemblesoptimal transportgeneralized pruning
Departure from Regularity: Degree Heterogeneity and Eigengap as the Structural Drivers of ASE-LSE Latent Subspace Disagreement
The paper establishes structural conditions governing the disagreement between Adjacency Spectral Embedding (ASE) and Laplacian Spectral Embedding (LSE) in graph analysis. It proves regularity (uniform node degrees) ensures perfect subspace alignment, while deviations introduce disagreement bounded by degree heterogeneity (increasing divergence) and community structure strength (reducing divergence). Theoretical bounds and empirical validation across thousands of simulated networks demonstrate the heterogeneity-to-community-strength ratio predicts embedding interchangeability.
adjacency spectral embeddinglaplacian spectral embeddingdegree heterogeneitycommunity structuregraph embedding
A Boundary-Layer Mechanism for One-Third Scaling in Online Softmax Classification
The paper identifies a boundary-layer mechanism that explains the observed α^{-1/3} scaling of generalization error in online softmax classification with fixed learning rates. Using a teacher-student model in the thermodynamic limit, the analysis shows that only examples near decision boundaries (width O(D^{-1})) remain active at late times, while gradient noise maintains residual variance Δ. The derived asymptotic solution yields power-law learning curves for both test loss and generalization error. Learning-rate schedules can improve scaling to α^{-1/2}. Simulations confirm the dynamics, though data structure influences transients, making this a complementary mechanism to spectral explanations of neural scaling laws.
boundary-layer mechanismonline softmax classificationpower-law scalingteacher-student modelgeneralization error
From Snapshots to Trajectories: Learning Single-Cell Gene Expression Dynamics via Conditional Flow Matching
The paper introduces single-cell Flow Matching (scFM), a latent generative framework for learning gene expression dynamics from unpaired scRNA-seq snapshots. The method combines entropically regularized optimal transport couplings with conditioned flow matching to construct soft velocity field targets, while bidirectional consistency and dynamic regularization mitigate distribution drift. Evaluated on time-series scRNA-seq data, scFM improves distributional prediction accuracy by 15-20% for interpolation/extrapolation tasks and yields more coherent trajectory reconstructions compared to baselines.
single-cell rna sequencingoptimal transportflow matchingtrajectory inferencegenerative modeling
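The entropically regularized optimal transport coupling mentioned in the summary is commonly computed with Sinkhorn iterations. The sketch below couples two unpaired point clouds standing in for snapshots at consecutive time points. The data, `epsilon`, and iteration count are illustrative assumptions, and this is the textbook Sinkhorn scheme rather than the paper's full training pipeline.

```python
import numpy as np

def sinkhorn_coupling(cost, a, b, epsilon=0.5, n_iters=500):
    """Entropic OT coupling between histograms a and b for a given cost matrix."""
    kernel = np.exp(-cost / epsilon)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (kernel @ v)    # rescale rows to match marginal a
        v = b / (kernel.T @ u)  # rescale columns to match marginal b
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
x0 = rng.normal(size=(5, 2))         # cells at the earlier snapshot (toy data)
x1 = rng.normal(size=(6, 2)) + 1.0   # cells at the later snapshot (toy data)
cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()             # normalize so epsilon has a stable scale
a = np.full(5, 1 / 5)
b = np.full(6, 1 / 6)
plan = sinkhorn_coupling(cost, a, b)
print(plan.sum(axis=1), plan.sum(axis=0))
```

Each row of `plan` gives soft assignment weights from one earlier cell to all later cells, which is what makes the resulting velocity targets soft rather than one-to-one.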
Physics-Informed Generative Solver: Bridging Data-Driven Priors and Conservation Laws for Stable Spatiotemporal Field Reconstruction
The authors propose a physics-informed generative solver that combines data-driven priors with conservation laws for stable spatiotemporal field reconstruction. The method employs Martingale-Regularized Score Matching to pretrain a dynamically stable prior and Physics-Informed Implicit Score Sampling to enforce physical constraints during inference without retraining. Evaluated on acoustics and ERA5 meteorological data, the approach successfully reconstructs pressure/velocity fields from sparse sensors and handles extreme sparsity in real-world scenarios. This framework bridges generative AI with first-principles physics for high-dimensional inverse problems.
physics-informed learningscore matchingspatiotemporal reconstructioninverse problemsgenerative prior
Learning Causal Orderings for In-Context Tabular Prediction
The paper introduces TabOrder, a method for learning causal variable orderings in tabular data to improve in-context prediction under distribution shifts. The approach combines causal discovery with predictive modeling by enforcing topological constraints in attention mechanisms, allowing predictions to use only causally preceding features. TabOrder learns orderings unsupervised via likelihood maximization, with theoretical analysis of functional model classes and missing data effects. Experiments demonstrate accurate ordering recovery while handling prediction, imputation, and providing interpretability on biological intervention data.
causal discoverytabular datain-context learningattention mechanismsmissing data
Riemannian geometry meets fMRI: the advantages of modeling correlation manifolds and eigenvector subspaces
The paper introduces a geometric framework for analyzing fMRI correlation matrices, combining (i) the Off-log metric for closed-form operations on correlation manifolds and (ii) Grassmannian subspace discrimination for eigenvector comparison. The Off-log metric enables standard statistical modeling without complex manifold optimization, while Grassmannian methods resolve sign/basis ambiguities. Validated across Parkinson's, psychosis, and aging cohorts, the Off-log metric increased sensitivity in permutation tests and matched/exceeded baselines in classification. Grassmannian methods consistently outperformed Euclidean baselines in identifying disease-relevant networks, with Riemannian metrics excelling in brain-age prediction for two cohorts.
off-log metricgrassmannian subspacefrechet meansprincipal-angle distancesriemannian geometry
Chebyshev Policies and the Mountain Car Problem: Reinforcement Learning for Low-Dimensional Control Tasks
The authors analytically solve the Mountain Car problem, a 36-year-old RL benchmark, revealing a simplicity gap between optimal control and modern RL agents. They propose Chebyshev policies as a universal, parameter-efficient policy class derived from first principles, serving as drop-in replacements for neural networks. Evaluations show 4.18× regret reduction and 277× parameter efficiency gains across PPO, ARS, and REINFORCE, with consistent improvements on real-world nonlinear control tasks.
reinforcement learningoptimal controlchebyshev policiessample efficiencyregret minimization
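To give a sense of what a Chebyshev policy might look like, the sketch below evaluates a small two-dimensional Chebyshev expansion of the Mountain Car state and squashes it to an action in [-1, 1]. The state bounds are those of the commonly used Mountain Car environment, and the coefficient matrix, the `tanh` squashing, and the function names are assumptions rather than the paper's construction.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

POS_RANGE = (-1.2, 0.6)    # assumed position bounds (common Mountain Car setup)
VEL_RANGE = (-0.07, 0.07)  # assumed velocity bounds

def to_unit_interval(x, lo, hi):
    """Map [lo, hi] to [-1, 1], the natural domain of Chebyshev polynomials."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def chebyshev_policy(position, velocity, coeffs):
    """Action = tanh( sum_ij coeffs[i, j] * T_i(position) * T_j(velocity) )."""
    p = to_unit_interval(position, *POS_RANGE)
    v = to_unit_interval(velocity, *VEL_RANGE)
    return float(np.tanh(cheb.chebval2d(p, v, coeffs)))

# Four parameters; only the T_0(p) * T_1(v) term is active: push along the velocity.
coeffs = np.zeros((2, 2))
coeffs[0, 1] = 5.0
print(chebyshev_policy(-0.5, 0.07, coeffs))    # close to +1
print(chebyshev_policy(-0.5, -0.035, coeffs))  # negative
```

The parameter count is just the size of `coeffs`, which is where the parameter-efficiency argument for such policies comes from.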
Long-term Fairness with Selective Labels
The paper introduces a reinforcement learning framework for achieving long-term fairness in decision-making systems with selective labels, where true outcomes (e.g., loan repayment) are only observed for positively decided instances (e.g., approved loans). The method decomposes fairness into observed fairness and prediction bias, using a label predictor model to estimate true fairness measures. Theoretical analysis provides sufficient conditions for satisfying fairness from observable quantities. Experiments in semi-synthetic environments demonstrate that the proposed algorithm achieves fairness and performance comparable to an oracle with access to true labels.
long-term fairnessselective labelsreinforcement learninglabel predictorfair decision-making
Adaptive Measurement Allocation for Learning Kernelized SVMs Under Noisy Observations
The paper introduces an adaptive measurement-allocation strategy for learning kernelized SVMs under noisy Bernoulli observations, addressing the limitations of uniform allocation in settings like quantum machine learning. The method combines geometric sensitivity (kernel entry perturbations' impact on classifier margin) and active-set instability (support-vector membership changes due to noise) to prioritize measurements on decision-critical kernel regions. Theoretical analysis identifies regimes favoring adaptive or uniform allocation, while empirical results on synthetic and quantum kernel datasets show improved support-vector recovery, margin estimation, and decision-function accuracy under fixed budgets, with early stopping via dual-coefficient stability.
kernelized svmsmeasurement allocationgeometric sensitivityactive-set instabilityquantum machine learning
Automatic Contextual Audio Denoising
The authors propose Automatic Contextual Audio Denoising (ACAD), a method that dynamically defines noise based on inferred acoustic scene context, where out-of-context (OC) sound events are removed while preserving in-context (IC) components. They implement a deep learning system that jointly infers scene class and performs context-dependent denoising, comparing it against non-contextual baselines and oracle-context variants. Evaluations on paired clean/noisy data across diverse scenes show superior performance in objective metrics, demonstrating effective context inference and scene-adaptive processing.
audio denoisingacoustic scene classificationcontext-aware processingdeep learningout-of-context detection
An Evidence Hierarchy for Bayesian Object Classification via OSINT-Aided Heterogeneous Sensor Fusion
The paper proposes a Bayesian object classification framework for CBRNE threat detection via OSINT-aided heterogeneous sensor fusion. Key innovations include: (1) an evidence hierarchy modeling direct/indirect/contextual information, (2) integration of OSINT-derived environmental context, and (3) domain knowledge-informed Bayesian priors. Evaluated in simulated scenarios, the method achieves 95% classification accuracy while demonstrating robustness to sensor clutter and prior mismatch through hierarchical evidence fusion.
bayesian classificationsensor fusionevidence hierarchyosint integrationcbrne detection
No Epoch Like the Present: Robust Climate Emulation Requires Out-of-Distribution Generalisation
The study identifies a critical limitation in current climate emulation methods, demonstrating their vulnerability to out-of-distribution (OOD) shifts caused by climate change. By establishing seasonal variation as a proxy for long-term climate shifts, the authors introduce a novel evaluation framework that reveals significant performance degradation in state-of-the-art hybrid-ML emulators under realistic distribution shifts. They propose compositional generalization through physically motivated decompositions as a solution, showing improved OOD robustness with only modest in-distribution trade-offs.
climate emulationout-of-distribution generalizationhybrid-mlcompositional generalizationdistribution shift
Decomposing Ensemble Spread in Lorenz '96 With Learned Stochastic Parameterizations
The study systematically decomposes ensemble forecast uncertainty in the Lorenz '96 system, comparing deterministic, autoregressive, Bayesian, and novel flow-based parameterizations. Using this controlled testbed, the authors demonstrate that ensemble perturbations regulate trajectory decorrelation rather than increasing long-term variance, while stochastic parameterizations with temporal persistence improve early spread growth and spread-error consistency. Results clarify uncertainty interactions in chaotic systems and provide design guidelines for stochastic parameterizations in weather/climate models.
ensemble forecastingstochastic parameterizationlorenz '96chaotic dynamicsspread-error consistency
Decision-Aware Quadratic ReLU Replacement for HE-Friendly Inference
The paper introduces a decision-aware method for replacing ReLU activations with quadratic polynomials in neural networks to enable efficient homomorphic encryption (HE)-friendly inference. By formulating quadratic replacement as a linear separation problem in a lifted space, the authors derive necessary and sufficient conditions for calibration-lossless replacement and propose an algorithm for coefficient construction. When the positive-margin condition fails, they extend the framework using reduced convex hulls and Lagrangian-dual soft-margin relaxations. Empirical results show that the quadratic replacement maintains plaintext top-1 accuracy while achieving 3.7-4.1× faster activation and 1.18-1.68× faster end-to-end inference compared to Remez-7 under CKKS encryption.
homomorphic encryptionrelu replacementquadratic polynomialdecision-awareckks
Holomorphic Neural ODEs with Kolmogorov-Arnold Networks for Interpretable Discovery of Complex Dynamics
The paper introduces Holomorphic KAN-ODE, a framework combining Kolmogorov-Arnold Networks (KANs) with Neural ODEs to model complex dynamical systems while preserving holomorphic structure. The method replaces MLPs with KANs featuring learnable B-spline activations and enforces Cauchy-Riemann equations as differentiable regularization. Evaluated on six complex dynamical systems, the 280-parameter model achieves velocity-field R² > 0.95, identifies governing symbolic families via spline-to-formula fitting, and reconstructs Julia sets with 98.0% accuracy. It shows 4% MSE degradation under 10% noise (vs. 15.2× for MLPs) and 90.4% transfer learning improvement. KANs provide interpretable equations and noise resilience absent in black-box architectures.
holomorphic neural odeskolmogorov-arnold networkscauchy-riemann equationscomplex dynamical systemsinterpretable machine learning
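The Cauchy-Riemann regularizer mentioned above penalizes departures from $u_x = v_y$ and $u_y = -v_x$ for $f = u + iv$, which is equivalent to $\partial f/\partial y = i\,\partial f/\partial x$. The sketch below measures that residual with central finite differences on a plain Python function. The paper describes a differentiable regularizer, so finite differences are a swapped-in simplification, and the function name and step size are assumptions.

```python
import numpy as np

def cauchy_riemann_residual(f, z, h=1e-5):
    """Mean squared violation of df/dy = i * df/dx at the sample points z."""
    z = np.asarray(z, dtype=complex)
    df_dx = (f(z + h) - f(z - h)) / (2 * h)            # derivative along the real axis
    df_dy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)  # derivative along the imaginary axis
    return float(np.mean(np.abs(df_dy - 1j * df_dx) ** 2))

rng = np.random.default_rng(0)
points = rng.normal(size=50) + 1j * rng.normal(size=50)

print(cauchy_riemann_residual(lambda z: z ** 2, points))  # near zero: holomorphic
print(cauchy_riemann_residual(np.conj, points))           # about 4: not holomorphic
```

Adding such a residual to the training loss pushes a learned vector field toward holomorphic maps; a term of this kind is what the summary refers to as differentiable regularization.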
How Many Different Outputs Can a Transformer Generate?
The paper establishes theoretical bounds on the diversity of sequences a transformer can generate, showing the maximum length of accessible sequences grows linearly with prompt length while their proportion decays exponentially beyond a critical threshold. By analyzing architectural characteristics, the authors derive an upper bound tight within a factor of 10 across model sizes, validated empirically. The results explain known transformer failures on simple tasks like copying and cramming, proving these limitations persist even with unbounded context and computation time.
transformersequence generationtheoretical boundaccessible sequencesprompt length
ARC-STAR: Auditable Post-Hoc Correction for PDE Foundation Models
The paper introduces ARC-STAR, a frozen-solver post-hoc correction framework for PDE foundation models that addresses prediction drift in unfamiliar flows. The method employs a three-stage pipeline: global correction for broad solver bias, blockwise local refinement for residual errors, and label-free risk scoring to route compute to high-risk regions under budget constraints. Evaluated on five flow benchmarks, ARC-STAR reduces velocity rollout error by 36x versus raw Poseidon, with global correction cutting host error by 91-99% and local refinement reducing residuals by up to 94.4%. The approach maintains solver integrity while offering auditable, budget-aware correction.
pde foundation modelspost-hoc correctionadaptive refinementrisk-calibrated triagefrozen-solver
Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL
The study identifies data-level gating and reward grounding as asymmetric levers governing stability in self-play reinforcement learning for language models. Through controlled experiments on Python output-prediction and deterministic-DSL tasks, it demonstrates that strict data gating ensures stability across all reward variants, including self-consistency rewards without ground truth, while no reward variant stabilizes training without gating. Results reveal a Grounded Proposer Paradox, where ground-truth access accelerates collapse, and a two-stage phase transition in training dynamics with continuous gating strictness.
self-play reinforcement learningdata gatingreward groundinggrounded proposer paradoxphase transition
Kernel-Based Safe Exploration in Deep Reinforcement Learning
The paper introduces kernel-based safe exploration (KBSE), a deep reinforcement learning algorithm that jointly learns optimal policies and barrier functions for probabilistic safety guarantees. KBSE represents barriers as conditional mean embeddings via kernel methods, enabling iterative refinement during exploration while bounding unsafe state visitation probabilities. The method intervenes to modify unsafe actions during exploration, maintaining safety without reward degradation. Evaluations on continuous control benchmarks demonstrate KBSE's effectiveness in synthesizing safe policies while preserving performance.
barrier functionkernel embeddingssafe explorationdeep reinforcement learningconditional mean embeddings
Reinforced Graph of Thoughts: RL-Driven Adaptive Prompting for LLMs
The paper introduces Reinforced Graph of Thoughts (RGoT), an automated method for adaptive prompting in large language models (LLMs). RGoT extends the Graph of Thoughts (GoT) paradigm by using reinforcement learning to dynamically construct operation graphs from a predefined set, eliminating manual design requirements. Experiments demonstrate that RL-driven adaptation can effectively tailor the operation graph to task complexity under specific constraints.
reinforcement learninggraph of thoughtslarge language modelsadaptive promptingoperation graph
Bandit Convex Optimization with Gradient Prediction Adaptivity
The paper introduces prediction-adaptive regret bounds for bandit convex optimization (BCO) by leveraging optimistic gradient predictions. It first establishes an Ω(√T) lower bound under single-point feedback, demonstrating fundamental limitations due to gradient estimation variance. The authors then propose TP-VR-OPT, a two-point feedback algorithm with a variance-reduced gradient estimator whose error scales with prediction accuracy rather than gradient norm, achieving O(√d𝔼[S_T]) regret. Matching lower bounds (Ω(√𝔼[S_T])) confirm near-optimality. Extensions include adaptive variants without prior knowledge of 𝔼[S_T] or T, and dynamic regret guarantees for non-stationary environments.
bandit convex optimizationgradient predictionvariance reductionregret boundsonline learning
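For readers unfamiliar with two-point feedback, the sketch below shows the standard two-point gradient estimator that algorithms in this setting typically start from. It is not TP-VR-OPT: the paper's estimator additionally uses gradient predictions for variance reduction, and that part is not reproduced here. The test function, step size `delta`, and sample count are assumptions.

```python
import math
import random

def random_unit_vector(d, rng):
    """Uniform direction on the unit sphere via normalized Gaussians."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def two_point_gradient(f, x, delta, rng):
    """Estimate grad f(x) from two function evaluations only."""
    d = len(x)
    u = random_unit_vector(d, rng)
    x_plus = [xi + delta * ui for xi, ui in zip(x, u)]
    x_minus = [xi - delta * ui for xi, ui in zip(x, u)]
    scale = d * (f(x_plus) - f(x_minus)) / (2.0 * delta)
    return [scale * ui for ui in u]

def sphere(x):
    """Toy convex loss with known gradient 2x."""
    return sum(v * v for v in x)

rng = random.Random(0)
x0 = [1.0, -2.0, 0.5]
n = 20000
avg = [0.0] * len(x0)
for _ in range(n):
    g = two_point_gradient(sphere, x0, delta=1e-3, rng=rng)
    avg = [a + gi / n for a, gi in zip(avg, g)]

true_grad = [2.0 * v for v in x0]
print(avg, true_grad)
```

Averaged over many random directions the estimate approaches the true gradient, but a single draw is noisy. Reducing that per-sample error is the role the summary attributes to the variance-reduced, prediction-based estimator.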
From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal k-Sparse GLMs
The paper introduces a CPU-GPU framework for parallelizing branch and bound (BnB) in cardinality-constrained generalized linear models (GLMs), addressing challenges in discrete optimization. The method batches heterogeneous BnB nodes for GPU processing, using padding and custom kernels to handle irregular data structures. Results demonstrate 10-100x speedups and optimality certification on challenging instances, with extensions enabling Rashomon set collection for downstream analysis like variable importance and model selection under secondary metrics (e.g., AUC).
branch and boundgpu accelerationsparse glmrashomon setoptimality gap
Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis
The paper introduces a pipeline to enhance Multimodal Large Language Models (MLLMs) for safety-critical driving video analysis by fusing downsampled video frames with high-frequency telematics data (IMU, GPS) and semantic insights from specialized computer vision models. The method generates pseudo-labels (captions, QA pairs) to train MLLMs for identifying Safety-Critical Events (SCEs). Fine-tuning QwenVL-2.5 with DoRA adapters yields significant improvements in SCE identification and explanation, using fewer than 50M trainable parameters and limited compute.
multimodal large language modelssafety-critical eventstelematics datadora adapterspseudo-labeling
IKNO: Infinite-order Kernel Neural Operators
The paper introduces Infinite-order Kernel Neural Operator (IKNO), a novel neural operator framework that extends beyond first-order kernel integrals to enhance expressivity. IKNO employs infinite-order kernel integrals with closed-form finite approximations, presenting two variants: IKNO-Vanilla, utilizing full-kernel resolvents via Kronecker eigendecomposition, and IKNO-TP, a tensor-product operator with per-axis resolvents. Both variants feature efficient computation schemes for scalable, high-performance global information aggregation. Empirical evaluations on time-dependent and time-independent benchmarks, including large-scale industrial datasets, demonstrate IKNO's state-of-the-art accuracy and scalability across diverse input shapes.
neural operatorskernel integralskronecker eigendecompositiontensor-product operatorscalability
Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
Maestro introduces a Reinforcement Learning (RL)-based orchestration framework for dynamically composing ensembles of frozen expert models and skills. The method employs a lightweight 4B parameter policy to sequentially select model-skill pairs from a hierarchical registry, optimized via outcome-based RL without step-level supervision. Evaluated across ten multimodal benchmarks, Maestro achieves 70.1% average accuracy, outperforming GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%), while generalizing to unseen models and skills with 59.5% accuracy on out-of-domain tasks.
reinforcement learningmodel-skill ensemblemultimodal benchmarksoutcome-based rlhierarchical registry
Beyond Euclidean Proximity: Repairing Latent World Models with Horizon-Matched Trajectory Reachability Metrics
The paper introduces trajectory reachability metrics (TRM), a post-hoc method to improve terminal-state ranking in latent world models by replacing Euclidean distance with a learned horizon-aware metric. TRM trains a small pairwise head on logged trajectory data while keeping the base model components fixed, focusing on temporal separation matching the planning horizon. Evaluations on the TwoRoom benchmark show TRM boosts success rates from 7.0% to 97.0% for LeWorldModel and from 32.7% to 84.0% for PLDM. Mechanistic analysis reveals that latent MSE misranks terminal states even though XY position is linearly decodable from the latents (R² = 0.998). The method also demonstrates utility in continuous manipulation tasks like PushT.
latent world modelstrajectory reachability metricshorizon-aware supervisionterminal-state rankingpairwise head
Spectra as Language: Large Language Models for Scalable Stellar Parameter and Abundance Inference
The study introduces a two-stage large language model (LLM) framework for stellar parameter and abundance inference from spectra, addressing limitations of traditional methods in handling high-dimensional datasets. Leveraging LLMs' generalization capabilities, the approach treats stellar spectra as continuous sequential signals, enabling accurate estimation of effective temperature, surface gravity, metallicity, and ~20 elemental abundances. Scaling-law analyses demonstrate systematic performance improvements with increased data, offering a scalable solution for large-scale spectroscopic surveys.
stellar spectralarge language modelsparameter inferencescaling-lawspectroscopic surveys
Algebraic Machine Learning for Small-to-Medium Datasets Is Competitive against Strong Standard Baselines
Algebraic Machine Learning (AML), a symbolic framework using subdirect decomposition of algebraic structures instead of numerical optimization, demonstrates competitive performance against standard baselines on small-to-medium datasets (50–2000 examples). Without requiring validation or cross-validation, AML outperforms CNNs on image classification and matches specialized methods like LightGBM on tabular data, despite lacking modality-specific inductive biases. Results show AML's generic algebraic approach achieves comparable accuracy to task-optimized models while eliminating hyperparameter tuning.
algebraic machine learningsubdirect decompositioninductive biassymbolic methodshyperparameter-free
From Betting to Empirical Bernstein LIL
The technical report establishes a law of the iterated logarithm (LIL) via online betting strategies, demonstrating how wealth guarantees in sequential decision-making can yield empirical Bernstein-type concentration inequalities. The method leverages the duality between regret minimization in prediction with expert advice and tail bounds for martingales. Results provide finite-sample LIL bounds applicable to adaptive, non-i.i.d. processes without requiring variance proxies or sub-Gaussian assumptions.
law of iterated logarithmonline bettingempirical bernsteinmartingale concentrationregret minimization
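The standard link between betting and concentration, which results of this kind build on, can be written out in a few lines. This is textbook background under the stated assumptions (bounded observations, a fixed conditional mean, bets in [-1, 1]); it is not the report's specific betting strategy or its LIL bound.

```latex
% Assumptions (not taken from the report): x_t \in [0,1],
% \mathbb{E}[x_t \mid \mathcal{F}_{t-1}] = m, and predictable bets \lambda_t \in [-1,1].
W_0 = 1, \qquad
W_t = \prod_{s=1}^{t} \bigl(1 + \lambda_s (x_s - m)\bigr) \ge 0 .

% (W_t) is a nonnegative martingale, so Ville's inequality gives, for any \alpha \in (0,1),
\Pr\bigl(\exists\, t \ge 1 : W_t \ge 1/\alpha\bigr) \le \alpha .

% Equivalently, with probability at least 1 - \alpha, simultaneously for all t \ge 1,
\sum_{s=1}^{t} \log\bigl(1 + \lambda_s (x_s - m)\bigr) < \log(1/\alpha) .
```

A betting strategy with a strong wealth lower bound therefore turns directly into a time-uniform deviation bound: any value of m for which the wealth would exceed 1/α can be excluded from the confidence set.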
Self-Supervised ConvLSTM for Fermi Large Area Telescope Transient Detection
The authors propose a self-supervised ConvLSTM framework for detecting transient gamma-ray phenomena in Fermi-LAT data. The method combines synthetic sky simulations (10-year duration via gtobssim) with spatio-temporal deep learning, processing daily all-sky maps into time-ordered sequences. A ConvLSTM learns nominal sky evolution, with anomalies detected via pixel-wise residuals and statistically motivated thresholds. Spatial coherence filters suppress noise, enabling identification of astrophysical transients (e.g., flares, GRBs). The pipeline provides anomaly detection benchmarks for Fermi-LAT-like datasets while preserving spatial locality and temporal dependencies.
convlstmself-supervised learninggamma-ray transientspatio-temporal modelinganomaly detection
Aerodynamic force reconstruction using physics-informed Gaussian processes
The paper introduces a physics-informed Gaussian process model for reconstructing aerodynamic loads from noisy structural response data. The method combines probabilistic machine learning with physical constraints to avoid overfitting and the need for explicit regularization, while accommodating heterogeneous and multi-fidelity data. Validation on the Great Belt East Bridge demonstrates accurate load reconstruction, with strong agreement in RMSE, phase angles, and peak values between predicted and true loads. The approach shows potential for model validation, load estimation, and structural damage prognosis.
aerodynamic load reconstructiongaussian processesphysics-informed machine learningmulti-fidelity datastructural dynamics
Q-PhotoNAS: Hybrid Quantum Neural Architecture Search Framework on Photonic Devices
The paper introduces Q-PhotoNAS, a neural architecture search framework for hybrid quantum-classical models on photonic devices, addressing the challenge of joint optimization across classical preprocessing, phase encoding, and photonic circuit design. The method employs a genetic algorithm with 19 hyperparameters across six gene groups, using group-based crossover and per-gene mutation to evolve architectures, evaluated via short training budgets before full retraining. On Digits and MNIST, the framework achieves 99.44% and 98.78% validation accuracy respectively, with photonic QPU inference times of 67 ms and 149 ms, demonstrating non-redundant quantum feature extraction.
hybrid quantum-classicalneural architecture searchphotonic quantum computinggenetic algorithmphase encoding
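The two genetic operators named in the summary, group-based crossover and per-gene mutation, can be sketched as follows. The gene groups, hyperparameter names, and value ranges below are hypothetical placeholders (the summary states 19 hyperparameters in six groups but does not list them), and the fitness evaluation is omitted entirely.

```python
import random

# Hypothetical search space: group name -> gene name -> allowed values.
SEARCH_SPACE = {
    "preprocessing": {"pca_dim": [4, 8, 16], "scaler": ["minmax", "standard"]},
    "encoding": {"phase_scale": [0.5, 1.0, 2.0]},
    "circuit": {"modes": [4, 6, 8], "layers": [1, 2, 3]},
}

def random_individual(rng):
    """Sample every gene uniformly from its allowed values."""
    return {g: {k: rng.choice(v) for k, v in genes.items()}
            for g, genes in SEARCH_SPACE.items()}

def group_crossover(parent_a, parent_b, rng):
    """Child inherits each gene group as a whole from one of the parents."""
    return {g: dict(rng.choice((parent_a, parent_b))[g]) for g in SEARCH_SPACE}

def per_gene_mutation(individual, rate, rng):
    """Each gene is independently resampled with probability `rate`."""
    mutated = {g: dict(genes) for g, genes in individual.items()}
    for g, genes in SEARCH_SPACE.items():
        for k, choices in genes.items():
            if rng.random() < rate:
                mutated[g][k] = rng.choice(choices)
    return mutated

rng = random.Random(0)
a, b = random_individual(rng), random_individual(rng)
child = group_crossover(a, b, rng)
mutant = per_gene_mutation(child, rate=0.1, rng=rng)
print(child)
print(mutant)
```

Swapping whole groups keeps settings that only make sense together (for example, circuit width and depth) from being split across parents, while per-gene mutation still allows fine-grained exploration.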
RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching
RobustSpeechFlow introduces a training strategy for flow-matching text-to-speech (TTS) that improves alignment robustness without external aligners or preference data. The method extends contrastive flow matching with length-preserving repeat and skip latent augmentations, directly penalizing common failure modes while maintaining compatibility with existing pipelines. Evaluations on Seed-TTS-eval show a word error rate (WER) reduction from 1.44 to 1.38 using 0.06B parameters. On the ZERO500 benchmark, it achieves consistent intelligibility improvements, reducing English character error rate (CER) from 0.48% to 0.35% and Korean CER from 0.81% to 0.57% at NFE=24.
flow-matchingtext-to-speechcontrastive learninglatent augmentationsalignment robustness
CoRMA: Contrastive RMA for Contact-Rich Meta-Adaptation
CoRMA introduces a contrastive meta-adaptation framework for contact-rich robotic assembly, replacing raw simulator-parameter adaptation with a 6D semantic contact context. The method employs a causal Transformer adapter to infer contact context online from force, proprioceptive, and action histories, using semantic regression and a force-regime contrastive objective. Evaluated on PegInsert, GearMesh, and NutThread tasks in Isaac Lab/Isaac Sim 5.0 and a real Marvin arm, CoRMA outperforms FORGE baselines, maintaining higher real-world success rates under target-pose noise without demonstrations or gradient updates.
contrastive learningmeta-adaptationsemantic contact contextcausal transformerrobotic assembly
Can Breath Biomarkers Causally Influence Blood Glucose? Investigating VOC-Mediated Modulation in Diabetes
This study proposes a non-invasive framework for diabetes risk assessment using Volatile Organic Compounds (VOCs) and lifestyle variables, combining causal inference and machine learning. The authors employ causal techniques to quantify the influence of acetone, isopropanol, isoprene, and ethanol on blood glucose levels, alongside a classifier for diabetic/non-diabetic discrimination using Gaussian Mixture Models for population clustering. Results indicate significant causal relationships between specific VOCs and glucose levels, with machine learning models achieving reliable classification and risk stratification for early screening.
volatile organic compoundscausal inferencegaussian mixture modelblood glucosediabetes screening
TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting
TWINGS enhances 3D Gaussian Splatting (3DGS) for sparse-view novel view synthesis by introducing Thin Plate Splines (TPS) to align backprojected points with triangulated 3D control points. This method minimizes bending energy to estimate a globally coherent warp, providing geometrically accurate initialization for 3DGS. Evaluated on DTU, LLFF, and Mip-NeRF360, TWINGS outperforms existing methods in structural detail preservation and color fidelity under sparse-view conditions.
3d gaussian splattingthin plate splinesnovel view synthesissparse-view reconstructionnon-rigid deformation
CASE-NET: Deep Spatio-Temporal Representation Learning via Causal Attention and Channel Recalibration for Multivariate Time Series Classification
The paper introduces CASE-NET, a novel architecture for multivariate time series classification that addresses temporal non-causality and channel noise through causal attention and adaptive recalibration. The method combines a Causal Temporal Encoder with masked self-attention and causal convolutions to enforce temporal constraints, alongside an Adaptive Channel Recalibration module to suppress noise. Evaluations across six domains show state-of-the-art performance, including 98.6% accuracy on the AWR dataset and robustness in non-stationary conditions.
multivariate time seriescausal attentionchannel recalibrationnon-stationary dynamicsself-attention
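The temporal constraint enforced by masked self-attention can be shown with a minimal single-head example. This is a generic causal-attention sketch with identity query/key/value projections, not CASE-NET's encoder; the toy sequence and function name are assumptions.

```python
import math

def causal_self_attention(x):
    """Single-head self-attention with a causal mask and identity projections.

    x is a list of T feature vectors. The output at step t is a softmax-weighted
    average of x[0..t] only, so future steps cannot influence it.
    """
    T, d = len(x), len(x[0])
    out = []
    for t in range(T):
        # Scores are computed only for s <= t; later positions are masked out.
        scores = [sum(q * k for q, k in zip(x[t], x[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        m = max(scores)                        # subtract max for numerical stability
        weights = [math.exp(sc - m) for sc in scores]
        z = sum(weights)
        out.append([sum(w / z * x[s][j] for s, w in enumerate(weights))
                    for j in range(d)])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
changed = seq[:3] + [[9.0, 9.0]]               # alter only the final time step
y1 = causal_self_attention(seq)
y2 = causal_self_attention(changed)
print(y1[:3] == y2[:3], y1[3] == y2[3])
```

Changing the last time step leaves every earlier output untouched, which is the property a causal encoder relies on to avoid leaking future information into past representations.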
RADAR: Defending RAG Dynamically against Retrieval Corruption
RADAR introduces a dynamic defense framework for Retrieval-Augmented Generation (RAG) systems against adversarial attacks in volatile web environments. The method formulates reliable context selection as a graph-based energy minimization problem, solved via Max-Flow Min-Cut, and employs a Bayesian memory node for recursive belief updates instead of storing raw documents. Evaluated on a novel dynamic dataset, RADAR demonstrates superior robustness and response quality with minimal storage overhead compared to baselines.
retrieval-augmented generationadversarial attacksenergy minimizationbayesian memorymax-flow min-cut
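Selecting reliable context via a graph cut can be sketched with a small max-flow computation. The graph below is a hypothetical example, not RADAR's actual energy function: edges from the source `S` encode evidence that a document is reliable, edges into the sink `T` encode evidence that it is corrupted, and edges between documents penalize separating documents that agree. All capacities are made up for illustration.

```python
from collections import deque

def max_flow_min_cut(capacity, source, sink):
    """Edmonds-Karp max flow; returns (flow value, source side of a min cut)."""
    residual = {u: dict(vs) for u, vs in capacity.items()}
    for u, vs in capacity.items():
        for v in vs:
            residual.setdefault(v, {}).setdefault(u, 0.0)   # reverse edges

    def reachable():
        """BFS over edges with remaining capacity; returns the parent map."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0.0
    while True:
        parent = reachable()
        if sink not in parent:
            return flow, set(parent)      # nodes still connected to the source
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

capacity = {
    "S": {"d1": 5.0, "d2": 4.0, "d3": 1.0},     # evidence for "reliable"
    "d1": {"T": 1.0, "d2": 3.0},                # d1 and d2 agree strongly
    "d2": {"T": 2.0, "d1": 3.0, "d3": 0.5},
    "d3": {"T": 6.0, "d2": 0.5},                # d3 looks corrupted
}
energy, source_side = max_flow_min_cut(capacity, "S", "T")
reliable = source_side - {"S"}
print(energy, sorted(reliable))
```

By the max-flow min-cut theorem the returned flow equals the minimum cut cost, so the documents left on the source side form the lowest-energy "reliable" set under this toy energy.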
PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought
The paper introduces PointLLM-R, a 3D multimodal language model enhanced with Chain-of-Thought (CoT) reasoning for point cloud understanding. The method involves a two-stage data-centric framework: (1) refining point-text instruction data via vision-language-model quality evaluation and reference-guided refinement, and (2) synthesizing reasoning paths through Human-in-the-Loop Prompt Optimization (HiLPO). This yields PoCoTI, a 55K-sample CoT-enhanced dataset. Experiments show PointLLM-R achieves state-of-the-art performance in generative 3D classification and captioning, with robust generalization to real-world scans and multi-turn dialogues.
chain-of-thought3d point cloudmultimodal language modelhuman-in-the-loopinstruction-following
Uniform-in-Time Weak Propagation-of-Chaos in Shallow Neural Networks
(No summary returned.)
Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)
The paper demonstrates that singular value decomposition (SVD) of the lm_head weight matrix in transformer-based LLMs reveals interpretable semantic subspaces without requiring model inference. Analyzing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, the method exposes training data composition, with GPT showing functional subspaces, Gemma exhibiting historical English orthography clusters, and Qwen containing ethically problematic multilingual subspaces persisting post-alignment. The authors introduce Vocabulary Cluster Score (VCS) for subspace coherence and Weighted Projection Score (WPS) for static glitch token detection, recovering known artifacts like shokubutsu-hyakka-tsu (ID 137606). They advocate for SVD-based safety auditing and suggest applications in tokenizer optimization.
singular value decompositionlm_head matrixvocabulary cluster scoreweighted projection scoreglitch token
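The core operation, decomposing the output-embedding matrix and reading off which tokens load on a singular direction, can be illustrated on a toy matrix. The vocabulary and weights below are invented, and power iteration is used only to keep the sketch dependency-free; in practice one would run a library SVD on the model's real `lm_head` weights. The paper's VCS and WPS scores are not reproduced here.

```python
import math

def top_right_singular_vector(W, iters=200):
    """Power iteration on W^T W; returns the leading right singular vector."""
    d = len(W[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        Wv = [sum(w * x for w, x in zip(row, v)) for row in W]            # W v
        WtWv = [sum(W[i][j] * Wv[i] for i in range(len(W))) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in WtWv))
        v = [x / norm for x in WtWv]
    return v

# Toy "lm_head": one row per vocabulary token, one column per hidden dimension.
vocab = ["cat", "dog", "bird", "+", "-", "="]
W = [
    [2.0, 0.1, 0.0],
    [1.9, 0.0, 0.1],
    [2.1, -0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 1.1, -0.1],
]

v = top_right_singular_vector(W)
# Project every token row onto the direction and rank by magnitude.
loadings = {tok: abs(sum(w * x for w, x in zip(row, v))) for tok, row in zip(vocab, W)}
top3 = sorted(loadings, key=loadings.get, reverse=True)[:3]
print(top3)
```

The tokens with the largest loadings on a direction form that direction's vocabulary cluster; here the leading direction picks out the three animal words. No forward pass through the model is needed, which matches the summary's point that the analysis works on weights alone.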
Toward Understanding Adversarial Distillation: Why Robust Teachers Fail
The paper identifies a key mechanism behind inconsistent performance in Adversarial Distillation, where robust teachers sometimes degrade student robustness. Through theoretical analysis of two-layer neural networks, the authors demonstrate that teacher confidence on unlearnable samples (Robustly Unlearnable Set) causes students to memorize noise patterns, leading to robust overfitting. Empirical validation on synthetic and real-image datasets shows that teacher uncertainty on these samples suppresses noise memorization, improving student robustness. Predictive entropy on unlearnable samples emerges as a reliable indicator for teacher selection.
adversarial distillationrobust overfittingunlearnable samplespredictive entropyfeature learning dynamics
Reasoning through Verifiable Forecast Actions: Consistency-Grounded RL for Financial LLMs
The paper introduces StockR1, a time-series-enhanced LLM that unifies stock forecasting and financial reasoning through verifiable forecast actions. The model emits structured forecast actions, invokes a time-series decoder conditioned on these actions, and optimizes the pipeline via RL with rewards for answer validity, forecast accuracy, and action-time-series consistency. Evaluated on a 10-year financial benchmark, StockR1 outperforms time-series baselines and general-purpose LLMs, improving reasoning accuracy by 17.7% (4B) and 25.9% (8B).
financial llmsverifiable forecast actionstime-series decoderreinforcement learningdistributional trajectories
How Sparsity Allocation Shapes Label-Free Post-Pruning Recoverability
The paper investigates how sparsity allocation affects label-free post-pruning recoverability in neural networks, demonstrating that allocation choice significantly impacts repair effectiveness under fixed activation-statistic repair. Using ERK and LAMP allocations on ResNet-18/34/50 with CIFAR-10/100 and Imagenette at 90-95.5% sparsity, the study reveals allocation preference varies by architecture, dataset difficulty, and sparsity level, identifying a repair-sensitive transition regime where BatchNorm recalibration fails but activation-statistic repair remains viable. Validation on ImageNet-100 and DenseNet-121 confirms the regime's dependence on data scale and connectivity, suggesting joint study of pruning allocation and repair.
sparsity allocationlabel-free repairactivation-statistic repairbatch norm recalibrationpost-pruning recoverability
An Improved Adaptive PID Optimizer with Enhanced Convergence and Stability for Deep Learning
The paper introduces IAdaPID-ADG, an improved adaptive PID optimizer addressing convergence and stability issues in deep learning. The method integrates AMSGrad's non-increasing effective learning rate to enhance convergence and DiffGrad's gradient-difference modulation for stability, building upon the AdaPID framework. Evaluations on MNIST, CIFAR10, IARC, and AnnoCerv datasets demonstrate superior performance over competing optimizers, with ablation studies validating individual component contributions.
adaptive pid optimizernon-increasing learning rategradient-difference modulationconvergence stabilitydeep learning optimization
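The two borrowed ingredients named above can be combined in a few lines. The sketch below applies AMSGrad's running maximum of the second-moment estimate and DiffGrad's gradient-difference friction to a scalar parameter. It is not IAdaPID-ADG: the PID terms of the AdaPID framework are omitted, bias correction is left out, and the hyperparameter values are assumptions.

```python
import math

def amsgrad_diffgrad_step(theta, grad, state, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One update combining an AMSGrad denominator with DiffGrad friction."""
    m = b1 * state["m"] + (1 - b1) * grad
    v = b2 * state["v"] + (1 - b2) * grad * grad
    # AMSGrad: keep the largest second moment seen so far, so the
    # effective learning rate lr / sqrt(v_hat) never increases.
    v_hat = max(state["v_hat"], v)
    # DiffGrad: friction in [0.5, 1) that shrinks the step when
    # consecutive gradients barely change.
    friction = 1.0 / (1.0 + math.exp(-abs(state["prev_grad"] - grad)))
    theta = theta - lr * friction * m / (math.sqrt(v_hat) + eps)
    state.update(m=m, v=v, v_hat=v_hat, prev_grad=grad)
    return theta, friction

state = {"m": 0.0, "v": 0.0, "v_hat": 0.0, "prev_grad": 0.0}
theta = 3.0
history = []
for _ in range(200):
    grad = 2.0 * theta                      # gradient of f(theta) = theta^2
    theta, friction = amsgrad_diffgrad_step(theta, grad, state)
    history.append((state["v_hat"], friction))
print(theta)
```

The recorded history makes the two mechanisms visible: the `v_hat` column never decreases, and the friction column stays between 0.5 and 1.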
Dynamic Mixture of Latent Memories for Self-Evolving Agents
The paper proposes MoLEM, a dynamic mixture-of-experts framework for continual learning that avoids catastrophic forgetting while internalizing new knowledge. The method employs latent memory generation through expert modules, with a router performing key-query matching to aggregate memories while keeping the base model frozen. Experiments across math, science, and code domains show a 10.40% average accuracy improvement over pretrained baselines, with consistent performance across different task orders.
continual learningmixture-of-expertslatent memorycatastrophic forgettingkey-query matching
SCI-Defense: Defending Manipulation Attacks from Generative Engine Optimization
The paper introduces SCI-Defense, a framework for detecting Generative Engine Optimization (GEO) attacks on LLM-based ranking systems. The method combines Perplexity detection (PPL), Semantic Integrity Scoring (SIS) evaluating four manipulation dimensions (Authority Attribution, Narrative Purposiveness, Comparative Claims, Temporal Claims), and Inter-Candidate Detection (ICD). Evaluated on 600 Amazon product descriptions and 600 MS MARCO web passages, it achieves perfect precision (1.000) and recall against String attacks, with varying recall against Reasoning (0.952) and Review (0.830) attacks. The study exposes limitations in existing defenses (PPL-only filters, SafetyClf, paraphrasing) and identifies new attack vectors like Specification Amplification.
generative engine optimizationsemantic integrity scoringperplexity detectionllm-based rankingmanipulation attacks
Optimal Guarantees for Auditing Rényi Differentially Private Machine Learning
The authors present an optimal black-box auditing framework for machine learning algorithms claiming Rényi differential privacy (RDP) guarantees. Their method employs hypothesis testing with Donsker-Varadhan variational estimators to directly measure Rényi divergence between neighboring executions, providing explicit non-asymptotic confidence intervals that separate statistical error from privacy leakage. Theoretical analysis proves minimax optimality of sample complexity, while empirical evaluations on DP-SGD with MNIST and CIFAR-10 demonstrate improved RDP lower bounds, particularly at small-to-moderate Rényi orders where auditing is most difficult.
rényi differential privacyblack-box auditingdonsker-varadhan estimatorhypothesis testingdp-sgd
A2QTGN: Adaptive Amplitude Quantum-Integrated Temporal Graph Network for Dynamic Link Prediction
The authors propose A2QTGN, a hybrid quantum-classical framework for dynamic link prediction that combines adaptive amplitude encoding with a Temporal Graph Network backbone. The method represents node interactions as quantum states and selectively updates amplitude embeddings based on temporal activity, reducing unnecessary quantum re-encoding while capturing structural changes. Evaluated on five Temporal Graph Benchmark datasets, A2QTGN demonstrates strong predictive performance, with ablation studies confirming the importance of both quantum embeddings and adaptive updates. Hardware-aware experiments on noisy quantum backends support near-term feasibility.
quantum-integratedtemporal graph networkadaptive amplitude encodingdynamic link predictionhybrid quantum-classical
CCLab: Adversarial Testing of Learning- and Non-Learning-Based Congestion Controllers
The paper introduces CCLab, an adversarial testing framework for evaluating robustness of congestion controllers (CCs) under perturbed conditions. The method employs a reinforcement learning-based adversarial agent that generates bounded perturbations at feature-level (input signals) or environment-level (network conditions), constrained for realism. Results show learning-based CCs generally outperform traditional CCs in robustness, though both degrade under adversarial conditions. Additionally, adversarial traces from CCLab improve training robustness, yielding CCs that surpass existing learning-based methods in normal and challenging scenarios.
congestion controladversarial testingreinforcement learningnetwork robustnessperturbation analysis
Noise Schedule Design for Diffusion Models: An Optimal Control Perspective
(No summary returned.)
When to Switch, Not Just What: Transition Quality Prediction in Clash Royale
The paper introduces TQP (Transition Quality Predictor), a three-stage pipeline for strategy recommendation in competitive games that addresses the Zero Switching Cost Assumption by modeling transition-level decisions. TQP combines PersonaGate (player-specific consistency filtering), TimingGate (state-aware switching timing), and ScoreFusion (strategy ranking via adoptability and delta win-rate signals), evaluated using the SwitchGap metric. Analysis of 926,334 Clash Royale matches shows switching frequency inversely correlates with win rates (-10.4pp SwitchGap at 5.4% recommendation rate), with loss-triggered switchers benefiting most from subtype-conditioned guidance.
transition quality predictorzero switching cost assumptionswitchgappersonagatetiminggate
PhylaFlow: Hybrid Flow Matching in Billera-Holmes-Vogtmann Tree Space for Phylogenetic Inference
PhylaFlow introduces a hybrid flow-matching model for phylogenetic inference in Billera-Holmes-Vogtmann (BHV) tree space, coupling continuous branch-length motion with discrete topology transitions. The method learns posterior-basin transport by training on BHV geodesic paths from random starting trees to short-run posterior samples, enabling efficient recovery of posterior-supported topologies. Evaluated on DS1-DS8 benchmarks, PhylaFlow reduces initial Tree-KL divergence, improves topology-recovery trajectories, and outperforms short-warmup and PhyloGFN under finite-budget refinement, demonstrating effective geometry-aware proposals for Bayesian phylogenetic inference.
phylaflowbhv tree spaceflow matchingphylogenetic inferencebayesian refinement
Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift
The paper introduces Geometry-Adaptive Explainer (GAE), a method to maintain dictionary-based interpretability faithfulness under distribution shift. GAE addresses the misalignment between in-distribution (ID) dictionaries and out-of-distribution (OOD)-active subspaces by realigning the explainer's dictionary geometrically without gradient updates, using only unlabeled OOD activations. Theoretical analysis shows GAE's excess loss is quadratically bounded by second-moment shift, while empirical results demonstrate superior causal faithfulness across models and OOD settings compared to training-based baselines.
mechanistic interpretabilitydictionary-based explainersdistribution shiftfaithfulness gapgeometry-adaptive explainer
Causal Discovery in Structural VAR Models Under Equal Noise Variance
The paper introduces ENVAR, a sparsity-based method for causal discovery in structural VAR models under equal noise variance assumptions, where multiple parameterizations can induce identical observed processes. The authors formalize observational equivalence via orthogonal transformations and global scaling, proposing the observational alignment discrepancy to compare models within equivalence classes. Evaluations on synthetic data and fMRI demonstrate ENVAR's ability to identify sparse structural representatives despite non-unique identifiability.
structural var modelscausal discoveryobservational equivalenceequal noise variancesparsity-based estimation
Energy-Gated Attention: Spectral Salience as an Inductive Bias for Transformer Attention
The paper introduces Energy-Gated Attention (EGA), a transformer attention variant that gates value aggregation by the spectral energy of key token embeddings to prioritize informationally dense positions. EGA computes spectral energy via a learned linear projection identifying the dominant spectral mode, adding only 12,480 parameters (<0.26% overhead) with no computational cost increase. Experiments on TinyShakespeare (+0.103 validation loss improvement) and Penn Treebank (+0.101) demonstrate consistent gains, with ablation studies showing data-adaptive spectral bases outperform fixed wavelets. The learned energy threshold (τ ≈ 0.35) aligns with linguistic analyses of English content word distribution (~36%).
energy-gated attentionspectral saliencetransformer attentionwavelet packetscontent words
On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation
The paper introduces On-Policy Consistency Training (OPCT), a novel alignment method that improves LLM safety by training models to generate consistent responses across contrastive prompts using self-supervised signals. Unlike offline supervised fine-tuning (SFT), OPCT computes objectives over the model's own responses, enhancing generalization and reducing capability degradation. Evaluated across three safety axes (sycophancy, jailbreaking, safety awareness) on three model families, OPCT reduces sycophancy rates by nearly half (8.1% vs. 15.4% baseline), maintains 99% jailbreak defense success, and matches or exceeds SFT on safety awareness while avoiding SFT's capability regressions (e.g., 28-point MATH-500 drop).
consistency trainingllm safetyon-policy learningjailbreak defensesycophancy mitigation
Beyond Scalar Objectives: Expert-Feedback-Driven Autonomous Experimentation for Scientific Discovery at the Nanoscale
The paper introduces deep-kernel pairwise learning (DKPL), a Bayesian optimization framework that incorporates expert feedback for autonomous microscopy experiments. DKPL replaces scalar objectives with pairwise expert judgments to learn a latent utility function, enabling discovery of nanoscale structures without predefined metrics. Evaluations on model datasets show DKPL effectively identifies high-information regions and distinguishes domain-wall characteristics in bismuth ferrite and erbium manganite. The method demonstrates how interdisciplinary expert knowledge can guide self-driving laboratories beyond scalar-metric limitations.
bayesian optimization, autonomous experimentation, deep-kernel learning, nanoscale microscopy, expert feedback
Symbolic Density Estimation for Discrete Distributions
The authors introduce symbolic density estimation (SDE), an unsupervised framework for automatically discovering closed-form probability mass functions by composing elementary operations within a structured search space. The method combines evolutionary search with domain-specific structural priors and validity-aware inference, extending to complex families like zero-inflated and finite mixture distributions. Evaluated on a novel benchmark of common discrete distributions, SDE recovers all ground-truth families with accurate parameter estimation. Real-world applications demonstrate improved goodness-of-fit via interpretable mixture models compared to standard approaches.
symbolic density estimation, discrete distributions, evolutionary search, mixture models, goodness-of-fit
Truncated Neural Likelihood Estimation for Simulation-Based Inference in State-Space Models
The paper introduces truncated sequential neural likelihood (T-SNL), a novel algorithm for parameter inference in state-space models (SSMs) that addresses limitations of sequential neural likelihood (SNL). T-SNL improves accuracy, training stability, and scalability to longer sequences while enabling amortization for new observations. The method demonstrates superior performance in experiments, showing sample efficiency and robustness compared to existing approaches.
state-space models, sequential neural likelihood, parameter inference, amortized inference, simulation-based inference
Mapping Tomato Cropping Systems in California Using AlphaEarth Geospatial Embeddings and Deep Learning Analysis
The study demonstrates that AlphaEarth geospatial embeddings enable accurate field-scale tomato mapping without manual feature engineering. Using LandIQ 2018 crop polygons, researchers assembled a balanced dataset (9,484 fields) and trained a U-Net segmentation model with Monte Carlo dropout for uncertainty estimation. The model achieved 99.19% pixel accuracy, 98.69% precision, and 99.40% recall on an independent test set, with uncertainty maps highlighting field edges as high-variance regions.
geospatial embeddings, u-net segmentation, monte carlo dropout, field-scale mapping, alphaearth
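The Monte Carlo dropout step in the summary above can be illustrated on a toy model. This is a minimal sketch, not the study's U-Net: the one-layer sigmoid model, the function names `forward` and `mc_dropout_predict`, and all numbers are assumptions made for illustration. The idea it shows is that dropout stays active at prediction time, and the spread across repeated stochastic passes serves as the uncertainty estimate.

```python
import math
import random
import statistics

def forward(x, weights, bias, drop_p, rng):
    """One stochastic pass: inverted dropout on the inputs, then a sigmoid."""
    keep = 1.0 - drop_p
    z = bias
    for xi, wi in zip(x, weights):
        if rng.random() < keep:          # keep this input with probability `keep`
            z += wi * xi / keep          # rescale so the expected sum is unchanged
    return 1.0 / (1.0 + math.exp(-z))

def mc_dropout_predict(x, weights, bias, drop_p=0.5, passes=100, seed=0):
    """Return (mean, variance) of the prediction over repeated dropout passes."""
    rng = random.Random(seed)
    probs = [forward(x, weights, bias, drop_p, rng) for _ in range(passes)]
    return statistics.fmean(probs), statistics.pvariance(probs)

# Hypothetical pixel features and weights, for illustration only
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
mean_prob, variance = mc_dropout_predict(x, w, bias=0.0)
_, variance_no_dropout = mc_dropout_predict(x, w, bias=0.0, drop_p=0.0)
```

With dropout disabled the passes are identical and the variance collapses to zero, which is why the variance only carries information when dropout is left on during inference.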
Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws
The study reveals that optimizers induce distinct spectral scaling laws in Transformer architectures, independent of model size or training duration. By analyzing eigenspectra of feed-forward networks through soft and hard spectral ranks, the authors demonstrate that AdamW and Muon produce markedly different capacity utilization, particularly for rare-token representations (TAIL), where Muon reaches a scaling exponent roughly 2.3× larger than AdamW's (β=1.02 vs. β=0.44). Validation loss alone fails to capture these spectral differences, highlighting optimizer choice as a critical factor in representation structure. Architectural interventions (e.g., attention rank) are shown to have smaller effects than optimizer-induced spectral shifts, suggesting optimizer-architecture co-design as a promising direction.
spectral scaling laws, transformer architecture, optimizer effects, eigenspectra analysis, representation capacity
Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization
The paper introduces Geometric-aware Calibrated Policy Optimization (GCPO), a novel framework addressing limitations of entropy-based uncertainty estimators in post-training optimization for large language models. GCPO integrates geometry-aware measures of semantic disagreement with reward-calibrated uncertainty signals to better regulate gradient variance and learning signal quality. Theoretical and empirical analysis reveals two gaps in current methods: an anisotropy gap and a calibration gap. Experiments across multiple benchmarks demonstrate GCPO's superior tracking of gradient variability and improved post-training performance, highlighting the need for optimization-aligned uncertainty signals.
policy optimization, semantic entropy, gradient variance, post-training, uncertainty calibration
stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation
The paper introduces stable-worldmodel (swm), an open-source platform addressing reproducibility challenges in world modeling research. The system combines (1) a Lance-based data layer supporting MP4/HDF5/LeRobot formats, (2) standardized implementations of world models and planning solvers, and (3) benchmark suites with controlled variations for evaluating dynamics understanding, generalization, and control. By unifying data pipelines, baselines, and evaluation protocols, swm reduces implementation overhead while enabling systematic assessment of model capabilities across visual, geometric, and physical distribution shifts.
world models, reproducibility, generalization benchmarks, data pipelines, planning solvers
Three Costs of Amortizing Gaussian Process Inference with Neural Processes
The article analyzes the approximation costs of amortizing Gaussian process (GP) inference with latent neural processes (LNPs), decomposing the KL divergence between the GP and LNP predictives into three interpretable components: label contamination, an information bottleneck due to finite-dimensional representations, and amortization error from shared encoder networks. Results show bottleneck truncation decays as $O(e^{-cd^{2/d_x}})$ for squared-exponential kernels and $O(d^{-2\nu/d_x})$ for Matérn-$\nu$ kernels, while label contamination remains $O(1)$ except for observation noise ($O(1/n)$). The analysis yields architectural recommendations, including variance prediction from context locations and second-order pooling to reduce amortization gaps.
gaussian process, neural processes, amortization error, kl divergence, information bottleneck
MMD-Balls as Credal Sets: A PAC-Bayesian Framework for Epistemic Uncertainty in Test-Time Adaptation
The paper develops a PAC-Bayesian framework for quantifying epistemic uncertainty in test-time adaptation (TTA) by interpreting maximum mean discrepancy (MMD)-balls as credal sets. The method connects distribution shift magnitude to prediction reliability through: (i) MMD-dependent generalization bounds under RKHS-Lipschitz loss, (ii) finite-sample concentration results, (iii) risk decomposition over credal sets, and (iv) geodesic preservation guarantees for kernel-guided adaptation. Results demonstrate principled uncertainty separation and adaptation criteria, with theoretical bounds explicitly parameterized by distribution shift.
pac-bayesian, credal sets, maximum mean discrepancy, test-time adaptation, epistemic uncertainty
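To make the MMD-ball idea above concrete, here is a minimal sketch of an empirical MMD between two samples and a membership check for a ball of a given radius. It assumes 1-D data, an RBF kernel, and the biased V-statistic estimator; the names `rbf`, `mmd2`, and `in_mmd_ball`, the bandwidth, and the radius are illustrative choices, not the paper's construction.

```python
import math

def rbf(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def mmd2(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD for 1-D samples."""
    kxx = sum(rbf(a, b, bandwidth) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, bandwidth) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, bandwidth) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

def in_mmd_ball(reference, candidate, radius, bandwidth=1.0):
    """True if the candidate sample lies within `radius` of the reference in MMD."""
    return math.sqrt(max(mmd2(reference, candidate, bandwidth), 0.0)) <= radius

source = [0.0, 0.5, 1.0, 1.5]            # hypothetical training-time sample
shifted = [x + 2.0 for x in source]      # hypothetical test-time sample after a shift
same_inside = in_mmd_ball(source, source, radius=0.1)
shifted_inside = in_mmd_ball(source, shifted, radius=0.1)
```

A larger shift gives a larger MMD, so the radius acts as the knob that ties shift magnitude to how much of the test-time distribution the credal set is willing to cover.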
Provable Robustness against Backdoor Attacks via the Primal-Dual Perspective on Differential Privacy
The paper introduces a principled framework for certifying robustness against backdoor attacks by connecting randomized smoothing to differential privacy's dual view through privacy profiles. This approach enables tight, modular certification of complex, composed mechanisms by leveraging existing analyses of differentially private mechanisms, addressing the challenge of jointly analyzing training- and test-time randomized mechanisms. The framework is instantiated for DP-SGD and Deep Partition Aggregation with inference-time smoothing, demonstrating effectiveness on MNIST and CIFAR-10 with joint robustness guarantees against training-time and inference-time attacks.
randomized smoothing, differential privacy, backdoor attacks, dp-sgd, deep partition aggregation
HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection
The paper introduces HIDBench, a novel benchmark for evaluating large language models (LLMs) in host-based intrusion detection systems (HIDS), addressing a gap in existing cybersecurity benchmarks. The method unifies three public datasets (DARPA-E3, DARPA-E5, NodLink) and develops a pipeline to transform raw system logs into LLM-compatible inputs. Results show LLMs achieve high precision (>0.8) on simpler datasets but degrade significantly with noisy, complex logs (MCC <0.5), revealing sensitivity to data complexity and distinct detection regimes.
host-based intrusion detection, large language models, system logs, benchmark, cybersecurity
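The gap between high precision and low MCC reported above is easier to see with numbers. The sketch below uses the standard definitions of precision and the Matthews correlation coefficient; the confusion-matrix counts are invented for illustration and are not taken from HIDBench.

```python
import math

def precision(tp, fp):
    """Fraction of raised alerts that are true attacks."""
    return tp / (tp + fp)

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical detector: few false alarms, but most attacks are missed
tp, fp, fn, tn = 80, 20, 320, 9580
p = precision(tp, fp)
m = mcc(tp, fp, fn, tn)
```

In this made-up case precision is 0.8 while MCC stays below 0.5, because MCC also penalizes the 320 missed attacks that precision ignores.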
Manifold-Guided Attention Steering
The paper introduces Manifold-Guided Attention Steering (MAGS), an inference-time intervention that improves reasoning consistency in large language models by dynamically correcting attention head deviations. MAGS learns low-dimensional correctness manifolds from contrastive correct/incorrect traces and projects errant attention outputs back to this subspace during generation. Evaluations on MATH-500, GSM8K, HumanEval, MBPP, and SMILES benchmarks show MAGS outperforms both unsteered baselines and static steering approaches, demonstrating the generality of correctness manifolds in attention geometry.
activation steering, correctness manifold, attention heads, inference-time intervention, contrastive learning
Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents
Memory-R2 introduces a training framework for long-horizon memory-augmented LLM agents, addressing unfair credit assignment in multi-session reinforcement learning. The core algorithm, LoGo-GRPO, combines local and global group-relative optimization: global objectives preserve end-to-end learning from trajectory rewards, while local rerollouts enable fair comparisons of memory operations from shared intermediate states. The framework jointly optimizes memory formation and evolution via a shared-parameter co-learning design and stabilizes training with a progressive curriculum (8 to 32 sessions).
memory-augmented llm, credit assignment, group-relative optimization, multi-session rl, progressive curriculum
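The group-relative part of the method above can be sketched with the normalization commonly used in GRPO-style training: each reward is compared with the mean and standard deviation of its own group. How LoGo-GRPO weights or combines the local and global terms is not stated in the summary, so the snippet only computes the two sets of advantages separately; the function name and the reward values are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Global group: end-of-trajectory rewards for full multi-session rollouts
global_adv = group_relative_advantages([0.2, 0.8, 0.5, 0.5])

# Local group: rerollouts that all start from the same intermediate state,
# so differences reflect only the memory operation taken at that point
local_adv = group_relative_advantages([1.0, 0.0, 1.0])
```

Because local rerollouts share a starting state, their advantages compare memory operations on equal footing, which is the fairness argument the summary makes.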
Position: The Time for Sampling Is Now! Charting a New Course for Bayesian Deep Learning
This position paper advocates for sampling-based inference (SAI) in Bayesian neural networks (BNNs), arguing it has reached computational parity with optimization-based methods. The authors identify misconceptions as barriers to adoption and propose focusing on posterior landscape exploration and sample distillation for efficient inference. SAI offers principled uncertainty quantification, model averaging benefits, and insights into BNN behavior, positioning it as a transformative tool in Bayesian deep learning.
sampling-based inference, bayesian neural networks, uncertainty quantification, posterior exploration, model averaging
On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents
The paper characterizes the sample complexity of risk-sensitive reinforcement learning in finite discounted MDPs using optimized certainty equivalents (OCE). It analyzes a model-based approach for both value and policy learning under recursive OCE, providing PAC sample complexity bounds. Key results show that OCE objectives are PAC-learnable only when the utility function has full domain, with tight bounds in state-action space size $SA$ and improved $\frac{1}{\tau^2}$ dependence for $\text{CVaR}_\tau$ compared to prior work, though with suboptimal horizon dependence $\frac{1}{1-\gamma}$.
optimized certainty equivalent, risk-sensitive rl, sample complexity, pac-learnability, discounted mdps
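For readers unfamiliar with how CVaR fits the OCE family mentioned above, the sketch below checks the standard identity on a small sample. It assumes the reward convention (CVaR as the mean of the worst tau-fraction of returns) and the OCE form sup over lam of lam + E[u(X - lam)] with u(t) = -max(-t, 0) / tau. The function names and sample values are illustrative, and the paper's recursive OCE over an MDP is not reproduced here.

```python
def cvar_lower_tail(samples, tau):
    """Mean of the worst tau-fraction of returns (assumes tau * n is an integer)."""
    k = round(tau * len(samples))
    return sum(sorted(samples)[:k]) / k

def cvar_oce(samples, tau):
    """OCE form of CVaR: sup over lam of lam - E[max(lam - X, 0)] / tau."""
    n = len(samples)

    def objective(lam):
        return lam - sum(max(lam - x, 0.0) for x in samples) / (n * tau)

    # The objective is concave and piecewise linear with kinks at the samples,
    # so it is enough to search over the sample values themselves.
    return max(objective(lam) for lam in samples)

returns = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
direct = cvar_lower_tail(returns, tau=0.2)
via_oce = cvar_oce(returns, tau=0.2)
```

Both routes give the mean of the two lowest returns here. The 1/tau factor in the objective is also a hint at why sample complexity grows as tau shrinks.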
📰 Industry Media (11)
Google I/O showed how the path for AI-driven science is shifting
Google DeepMind's WeatherNext demonstrated practical AI applications in weather prediction, potentially saving lives during Hurricane Melissa, while CEO Demis Hassabis speculated about approaching the singularity. The article contrasts specialized AI tools like AlphaFold and WeatherNext with emerging agentic, LLM-based systems such as Gemini for Science, which aim to autonomously conduct research. Google is shifting focus towards these general-purpose agents, evidenced by resource reallocation and new initiatives, though specialized tools remain widely used. Early adopters praise agentic systems' potential, but challenges persist in experimental validation and human-AI collaboration dynamics.
agentic systems, llm-based, protein-folding, singularity, in-context learning
Roundtables: Can AI Learn to Understand the World?
The MIT Technology Review hosted a subscriber-exclusive roundtable exploring AI's capacity to model and interact with the physical world, featuring editor Mat Honan and AI specialists Will Douglas Heaven and Grace Huckins. The discussion centered on overcoming large language model limitations through world models, referencing Stanford's 2026 AI Index report on rapid AI advancements. No technical specifics or empirical results were disclosed due to paywall restrictions on the source material.
world models, large language models, ai index, physical interaction, model limitations
A Step-by-Step Coding Tutorial to Implement GBrain: The Self-Wiring Memory Layer Built by Y Combinator’s Garry Tan for AI Agents
GBrain is a self-wiring memory layer for AI agents that constructs a typed knowledge graph from markdown files without LLM calls, achieving P@5 49.1% and R@5 97.9% on BrainBench. The system combines Postgres-backed storage (PGLite) with hybrid search (vector + BM25 + Reciprocal Rank Fusion) and MCP integration for tool exposure. Key innovations include regex-based link extraction (2 edges inferred from 3 pages in testing) and a 74-tool API for agent interaction. The tutorial demonstrates local setup, graph wiring, and search functionality in a 20-minute workflow.
gbrain, pglite, reciprocal rank fusion, knowledge graph, mcp
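The hybrid search described above merges a vector ranking and a BM25 ranking with Reciprocal Rank Fusion. The following is a minimal sketch of the standard RRF formula, score(d) = sum over rankings of 1 / (k + rank of d). The function name `rrf_fuse`, the page ids, and k = 60 (a commonly used constant) are assumptions for illustration and do not represent GBrain's actual API or settings.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic
    return sorted(scores, key=lambda d: (-scores[d], d))

vector_hits = ["page_a", "page_b", "page_c"]   # hypothetical vector-search order
bm25_hits = ["page_b", "page_d", "page_a"]     # hypothetical BM25 order
fused = rrf_fuse([vector_hits, bm25_hits])
```

RRF uses only ranks, not raw scores, so the two retrievers do not need comparable score scales. A page that ranks well in both lists rises above one that tops only a single list.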
Microsoft Releases Fara1.5: A Family of Browser Computer-Use Agents (4B/9B/27B) That Outperform OpenAI Operator and Gemini 2.5 Computer Use on Online-Mind2Web
Microsoft Research introduces Fara1.5, a family of browser-based computer-use agents (4B/9B/27B parameters) built on Qwen3.5, achieving 72% task success on Online-Mind2Web (300 tasks across 136 sites), outperforming OpenAI Operator (58.3%) and Gemini 2.5 Computer Use (57.3%). The models employ an observe-think-act loop, processing three recent screenshots and emitting actions via MagenticLite's sandboxed browser interface. Training leverages 2M samples (60% web trajectories, 12.8% synthetic data) and FaraGen1.5's synthetic pipeline with six gated-domain clones (Mail, Calendar, etc.). Critical safety features include pausing for user confirmation on ambiguous/irreversible actions.
computer-use agents, observe-think-act loop, synthetic data pipeline, gated-domain tasks, magenticlite
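The control flow described above (observe, think, act, with a three-screenshot window and a pause before irreversible actions) can be sketched generically. Everything in this snippet is hypothetical: the callbacks, the action names, and the `IRREVERSIBLE` set are stand-ins, and it does not reflect the Fara1.5 or MagenticLite interfaces.

```python
from collections import deque

IRREVERSIBLE = {"submit_payment", "send_email", "delete"}  # hypothetical action names

def run_agent(observe, think, act, confirm, max_steps=10):
    """Observe-think-act loop that pauses when an irreversible action is not confirmed."""
    recent = deque(maxlen=3)             # keep only the three most recent screenshots
    log = []
    for _ in range(max_steps):
        recent.append(observe())
        action = think(list(recent))
        if action == "done":
            break
        if action in IRREVERSIBLE and not confirm(action):
            log.append(("paused", action))
            break
        act(action)
        log.append(("executed", action))
    return log

# Stub environment: numbered "screenshots" and a fixed three-step plan
shots = iter(range(100))
plan = iter(["click", "type", "send_email"])
window_sizes = []

def think(frames):
    window_sizes.append(len(frames))
    return next(plan, "done")

log = run_agent(lambda: next(shots), think, lambda a: None, lambda a: False)
```

In this stub run the user declines the confirmation, so the loop stops before the email is sent rather than executing it.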
Build Recurrent-Depth Transformers with OpenMythos for MLA, GQA, Sparse MoE, and Loop-Scaled Reasoning
The tutorial demonstrates OpenMythos, a framework for building recurrent-depth transformers with Multi-Latent Attention (MLA) and Grouped-Query Attention (GQA) variants. It implements a sparse mixture-of-experts architecture (4 experts, 1 shared) and evaluates parameter efficiency (128-dim embeddings, 4 heads) and recurrent stability via spectral radius analysis. On a synthetic digit-chain summation task (modulo 7), the model achieves 0.85+ in-distribution accuracy with 8 recurrent loops, showing improved OOD generalization (0.72 accuracy) on longer chains through inference-time loop scaling.
recurrent-depth transformer, multi-latent attention, grouped-query attention, spectral radius, loop scaling
How CopilotKit Is Redefining the Agentic AI Stack in 2026
CopilotKit introduces a three-layer agentic stack (AG-UI, AIMock, Pathfinder) to address production gaps in AI agent deployment. AG-UI standardizes agent-user interaction with real-time UI streaming and human-in-the-loop controls, while AIMock provides deterministic testing for multi-service agent workflows via record-replay and chaos testing. Pathfinder enables hybrid vector-keyword retrieval for agent-accessible knowledge. The stack is adopted by major cloud providers and frameworks, with AG-UI integrated into AWS Bedrock AgentCore and taught on DeepLearning.AI. Enterprise adoption includes Deutsche Telekom and Cisco.
agentic stack, in-context learning, hybrid retrieval, deterministic testing, ui streaming
Qwen Introduces Qwen3.7-Max: A Reasoning Agent Model With a 1M-Token Context Window
Alibaba's Qwen team introduced Qwen3.7-Max, a proprietary reasoning agent model with a 1M-token context window, designed for long-horizon tasks like code optimization and workflow automation. The model employs extended-thinking mode, generating chain-of-thought reasoning traces before final outputs, resulting in 97M tokens on benchmarks versus 24M average. It scored 56.6 on the Artificial Analysis Intelligence Index (4.8-point gain over Qwen3.6 Max Preview), with notable improvements in scientific reasoning (+9.7pp on CritPt) but reduced factual recall attempts (-19.3pp on AA-Omniscience). Internal tests demonstrated 1,000+ autonomous tool calls and 35-hour execution.
reasoning agent, context window, chain-of-thought, tool calling, benchmarking
Cohere Releases Command A+: A 218B Sparse MoE Model for Agentic Workflows That Runs on as Few as Two H100 GPUs
Cohere introduces Command A+, a 218B-parameter sparse Mixture-of-Experts (MoE) Transformer optimized for agentic workflows, with only 25B active parameters per token. The model employs dropless token-choice routing, interleaved sliding-window/global attention, and supports multimodal inputs (text/image/tools) with 128K context. Using NVFP4 W4A4 quantization with Quantization-Aware Distillation on MoE experts, it achieves 80.6% on MathVista and 85% on τ²-Bench Telecom while running on 2×H100 GPUs. Benchmarks show 63% higher throughput and 17% lower latency versus prior models, with expanded multilingual (48 languages) and multimodal capabilities.
sparse mixture-of-experts, quantization-aware distillation, token-choice router, nvfp4 quantization, agentic workflows
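A rough back-of-envelope calculation shows why 4-bit weights matter for the two-GPU claim above. The sketch assumes 80 GB H100 cards and, as a simplification, treats all 218B parameters as stored at 4 bits, although the summary says the quantization targets the MoE experts. It also ignores quantization scale factors, the KV cache, and activations, so real usage would be higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

total_params = 218e9          # parameter count from the summary
h100_memory_gb = 80           # assumption: 80 GB H100 variant
num_gpus = 2

fp16_gb = weight_memory_gb(total_params, 16)   # 16-bit baseline for comparison
fp4_gb = weight_memory_gb(total_params, 4)     # 4-bit weights
budget_gb = h100_memory_gb * num_gpus

fits_fp16 = fp16_gb <= budget_gb
fits_fp4 = fp4_gb <= budget_gb
```

Under these assumptions the 16-bit weights alone would exceed the combined memory of two cards, while the 4-bit weights leave some headroom for the cache and activations.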
OpenAI opens Singapore AI lab as IMDA updates AI framework
OpenAI establishes its first international Applied AI Lab in Singapore, backed by S$300 million, to create 200+ technical roles and serve as a global hub for AI deployment. The lab aligns with Singapore's AI Mission priorities (public service, finance, digital infrastructure) through partnerships with government agencies, education initiatives (OpenAI Academy, Codex hackathons), and startup accelerators. Concurrently, Singapore's IMDA updates its agentic AI governance framework with new risk guidelines (multi-agent systems, third-party agents) and 10+ case studies from organizations like Tencent and GovTech, demonstrating tiered risk controls in AI agents.
agentic ai, governance framework, applied ai lab, in-context learning, risk stratification
China’s AI just mapped its entire renewable energy grid. Here’s why the rest of the world should pay attention
Researchers from Peking University and Alibaba's DAMO Academy developed a deep-learning model to create China's first high-resolution national inventory of renewable energy infrastructure, identifying 319,972 solar photovoltaic facilities and 91,609 wind turbines from 7.56TB of satellite imagery. The study demonstrates that solar-wind complementarity significantly reduces generation variability, with effectiveness scaling geographically. The findings reveal structural inefficiencies in provincial-level grid coordination and propose national-scale optimization to stabilize China's grid, addressing a 44% YoY surge in AI-related electricity demand.
solar-wind complementarity, geospatial ai, grid optimization, deep-learning model, renewable energy inventory
Musk and Zuckerberg convinced Trump to scrap AI executive order
The Trump administration canceled a planned AI executive order after direct lobbying by tech executives Elon Musk (xAI), Mark Zuckerberg (Meta), and former advisor David Sacks. The order would have established voluntary pre-release security reviews for advanced AI models, citing concerns about maintaining US competitiveness against China. This highlights industry influence over federal AI policy, contrasting with China's structured legislative approach including mandatory ethics committees. The incident underscores divergent regulatory philosophies between open innovation advocates and governance frameworks.
executive order, voluntary compliance, ethics review committees, regulatory drift, frontier ai
Generated automatically at 2026-05-22 21:16 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.
