Daily Digest — 2026-05-16
340 items · 2 research labs, 329 arxiv papers, 9 industry media
🏛️ Research Labs (2)
A new personal finance experience in ChatGPT
OpenAI introduces a personal finance feature in ChatGPT for U.S. Pro users, enabling secure account connections via Plaid for real-time financial insights. Leveraging GPT-5.5's reasoning capabilities, the system integrates user-provided financial context (e.g., goals, debts) with transactional data from 12,000+ institutions. Early benchmarks show GPT-5.5 Pro outperforms predecessors on complex tasks like budgeting and investment analysis, evaluated by 50+ finance professionals. Data privacy is maintained through user-controlled disconnection, memory deletion, and temporary chats.
gpt-5.5plaidkv-cachemulti-factor authenticationin-context learning
Sea's View on the Future of Agentic Software Development with Codex
Sea Limited demonstrates the transformative impact of OpenAI's Codex on agentic software development, leveraging it to manage large-scale systemic complexity across fragmented markets. By integrating Codex into CI/CD pipelines, Sea transitions from AI-assisted autocomplete to agentic workflows, enhancing engineering discipline and reducing technical debt. Internal data shows 87% weekly active usage among developers, with 73% recommending Codex for improving experimentation speed and workflows. The study highlights Southeast Asia as a proving ground for AI-native development, predicting a shift from developers to system orchestrators, focusing on product judgment and AI-driven workflow orchestration.
codexci/cd pipelinestechnical debtagentic workflowssystem orchestrator
📜 arXiv Papers (329)
EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation
EntityBench introduces a benchmark for entity-consistent long-range multi-shot video generation, comprising 140 episodes (2,491 shots) derived from narrative media with explicit per-shot entity schedules tracking characters, objects, and locations across easy/medium/hard tiers. It includes a three-pillar evaluation suite assessing intra-shot quality, prompt-following alignment, and cross-shot consistency, with a fidelity gate ensuring accurate entity appearances. EntityMem, a memory-augmented generation system, is proposed as a baseline, storing verified per-entity visual references in a persistent memory bank. Experiments reveal that cross-shot entity consistency degrades with recurrence distance in existing methods, with EntityMem achieving the highest character fidelity (Cohen's d = +2.33).
entitybenchmulti-shot video generationentity consistencymemory-augmented generationfidelity gate
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
We propose ATLAS, a framework that unifies agentic and latent visual reasoning through functional tokens—discrete units that serve both as agentic operations and latent reasoning units. Each functional token represents an internalized visual operation without requiring visual supervision, enabling next-token prediction while avoiding intermediate visual content generation. To stabilize RL training, we introduce Latent-Anchored GRPO (LA-GRPO), which anchors functional tokens with a statically weighted auxiliary objective. Experiments demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining interpretability, offering a scalable paradigm for visual reasoning.
functional tokensvisual reasoninglatent-anchored grponext-token predictionagentic operations
FutureSim: Replaying World Events to Evaluate Adaptive Agents
We introduce FutureSim, a benchmark for evaluating AI agents' adaptive capabilities in dynamic environments by replaying real-world events chronologically. Agents forecast events beyond their knowledge cutoff while interacting with a simulated timeline of news articles and question resolutions over a three-month period (January-March 2026). Evaluation reveals significant capability gaps, with the best agent achieving 25% accuracy and others performing worse than no prediction on Brier skill score. FutureSim enables research on long-horizon test-time adaptation, search, memory, and uncertainty reasoning, providing a realistic setting to measure AI progress in open-ended adaptation.
adaptive agentsknowledge cutoffbrier skill scoretest-time adaptationlong-horizon
VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
VGGT-Edit introduces a feed-forward framework for text-conditioned native 3D scene editing, addressing limitations of 2D-lifting methods that yield blurry textures and inconsistent geometry. The method employs depth-synchronized text injection to align semantic guidance with spatial poses and a residual transformation head to predict 3D geometric displacements while preserving background stability. Training uses a multi-term objective function enforcing geometric accuracy and cross-view consistency, supported by the DeltaScene Dataset with automated 3D agreement filtering. Experiments demonstrate VGGT-Edit outperforms 2D-lifting baselines in sharpness, multi-view consistency, and inference speed.
3d scene editingfeed-forward frameworkdepth-synchronized text injectionresidual transformationcross-view consistency
Quantitative Video World Model Evaluation for Geometric-Consistency
The paper introduces PDI-Bench, a quantitative framework for evaluating geometric consistency in generative video models. The method segments and tracks objects in generated videos, lifts them to 3D coordinates via monocular reconstruction, and computes projective-geometry residuals to assess scale-depth alignment, 3D motion consistency, and structural rigidity. Results reveal geometry-specific failure modes in state-of-the-art video generators, undetected by perceptual metrics, providing a diagnostic tool for physically grounded video generation.
pdi-benchgeometric consistencymonocular reconstructionprojective-geometry residualsvideo generation
Eradicating Negative Transfer in Multi-Physics Foundation Models via Sparse Mixture-of-Experts Routing
Shodh-MoE introduces a sparse Mixture-of-Experts (MoE) architecture to eliminate negative transfer in multi-physics foundation models. The method combines a physics-informed autoencoder with Helmholtz-style velocity parameterization for divergence-free outputs and a Top-1 soft-semantic router that dynamically assigns latent patches to specialized experts. Results show exact mass conservation (velocity divergence ~2.8×10^-10) and domain-specific routing, achieving latent validation MSEs of 2.46×10^-5 (open-channel) and 9.76×10^-6 (porous-media), demonstrating effective mitigation of multi-physics interference.
mixture-of-expertsnegative transferphysics-informed autoencoderhelmholtz parameterizationdivergence-free
OpenDeepThink: Parallel Reasoning via Bradley--Terry Aggregation
OpenDeepThink introduces a population-based test-time compute framework for parallel LLM reasoning, addressing selection bias in multi-candidate generation through Bradley-Terry aggregation. The method iteratively ranks candidates via pairwise LLM comparisons, mutates top performers using natural-language critiques, and discards low-ranked candidates. On Gemini 3.1 Pro, it achieves a +405 Codeforces Elo gain in 8 rounds (~27 minutes), with transferable performance across models and domain-specific gains on HLE benchmark. The work includes CF-73, a verified Codeforces problem set with 99% evaluation agreement.
test-time computebradley-terry modelparallel reasoningpopulation-based optimizationllm judging
Evidential Reasoning Advances Interpretable Real-World Disease Screening
EviScreen introduces an evidential reasoning framework for interpretable disease screening in medical imaging, addressing limitations in current models' interpretability and performance. The method leverages region-level evidence from historical cases through dual knowledge banks, enabling evidence-aware reasoning that combines current cases with retrieved evidence. It enhances localization interpretability using abnormality maps from contrastive retrieval rather than post-hoc saliency maps. Evaluated on real-world disease screening benchmarks, EviScreen achieves superior performance with notably higher specificity at clinical-level recall. Code is publicly available.
evidential reasoningmedical imagingcontrastive retrievalabnormality mapsknowledge banks
Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment
The paper introduces a retrieval-augmented multimodal alignment framework for reconstructing precise clinical timelines by bridging unstructured text narratives and structured EHR data. The method employs a graph-based multistep process: extracting anchor events from text to build a temporal scaffold, placing non-central events relative to this backbone, and calibrating with retrieved EHR rows as temporal evidence. Evaluated on the i2m4 benchmark (MIMIC-III/IV), the pipeline improves absolute timestamp accuracy (AULTC) and temporal concordance across LLMs, while preserving event match rates. Analysis shows 34.8% of text-derived events are absent from tabular records, confirming the method's superiority over unimodal approaches.
clinical timeline reconstructionretrieval-augmented multimodal alignmenttemporal scaffoldelectronic health recordinstruction-tuned llms
Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands
The paper identifies a structural mismatch ('audit gap') between AI governance safety requirements (2019-2026) and the epistemic limitations of behavioral assurance methods. Analyzing 21 evaluation instruments, it demonstrates how current methodologies (e.g., red-teaming) only verify observable outputs, failing to assess latent representations or long-horizon agentic behaviors required by regulations. The authors propose reducing reliance on behavioral proxies in legal frameworks and augmenting verification with mechanistic interpretability techniques (linear probes, activation patching, before/after-training comparisons) to address 'fragile assurance' scenarios.
audit gapbehavioral assurancemechanistic interpretabilitylatent representationsred-teaming
MeMo: Memory as a Model
The paper introduces MeMo (Memory as a Model), a modular framework for integrating new knowledge into large language models (LLMs) without modifying their parameters. MeMo encodes knowledge into a dedicated memory model, enabling complex cross-document relationships, robustness to retrieval noise, and avoidance of catastrophic forgetting. It operates without access to LLM weights or logits, supporting plug-and-play integration with both open and closed-source models, and features retrieval cost independent of corpus size. Experiments on BrowseComp-Plus, NarrativeQA, and MuSiQue benchmarks demonstrate MeMo's strong performance across diverse settings.
large language modelsmemory modelretrieval noisecatastrophic forgettingplug-and-play integration
Self-Distilled Agentic Reinforcement Learning
We introduce Self-Distilled Agentic Reinforcement Learning (SDAR), a method that enhances multi-turn agent training by integrating On-Policy Self-Distillation (OPSD) as a gated auxiliary objective while maintaining RL as the primary optimization backbone. SDAR employs a sigmoid gate to map detached token-level signals, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Evaluated across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR achieves substantial improvements over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc) and outperforms hybrid RL--OPSD baselines across model scales.
reinforcement learningself-distillationtoken-level guidancesigmoid gatemulti-turn agents
Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action
Pelican-Unified 1.0 introduces the first embodied foundation model trained under a unification principle, integrating understanding, reasoning, imagination, and action into a single architecture. The model employs a unified Vision-Language Model (VLM) for scene understanding and reasoning, projecting final hidden states into a dense latent variable. A Unified Future Generator (UFG) conditions on this latent variable to jointly generate future videos and actions via modality-specific output heads within a denoising process. Joint optimization across language, video, and action losses enables unified training. Pelican-Unified 1.0 achieves state-of-the-art or competitive results on VLM benchmarks (64.7), WorldArena (66.03), and RoboTwin (93.5), demonstrating that unification preserves specialist capabilities.
vision-language modelunified future generatordenoising processlatent variableembodied foundation model
Widening the Gap: Exploiting LLM Quantization via Outlier Injection
The authors introduce the first quantization-conditioned attack effective against advanced LLM quantization methods like AWQ, GPTQ, and GGUF I-quants, overcoming limitations of prior attacks restricted to simpler schemes. Their method exploits outlier injection to induce targeted weight collapse during quantization, enabling full-precision models to exhibit malicious post-quantization behavior. Evaluations across three attack scenarios demonstrate high success rates against diverse quantization techniques, establishing broader security risks than previously known.
llm quantizationoutlier injectionweight collapsequantization-conditioned attackpost-quantization behavior
APWA: A Distributed Architecture for Parallelizable Agentic Workflows
The paper introduces Agent-Parallel Workload Architecture (APWA), a distributed multi-agent system designed for efficient parallel processing of agentic workloads. APWA decomposes workflows into non-interfering subproblems that execute independently without cross-communication, supporting heterogeneous data and parallel processing patterns across diverse domains. Evaluations show APWA dynamically decomposes complex queries into parallelizable workflows and scales effectively on large tasks where prior systems fail.
multi-agent systemsparallel processingworkload decompositiondistributed architectureagentic workflows
Understanding How International Students in the U.S. Are Using Conversational AI to Support Cross-Cultural Adaptation
This study contributes a systematic analysis of conversational AI adoption patterns among international students in the U.S., identifying both immediate use cases and potential long-term support applications. Through a mixed-methods approach combining survey data (n=60) and in-depth interviews (n=14), the research maps how generative AI chatbots like ChatGPT and Google Gemini serve as first-aid tools for cross-cultural adaptation challenges. Findings reveal students' desire to transition AI from short-term assistance to sustained companionship, while highlighting current limitations in addressing complex cultural adaptation needs. The study concludes with recommendations for developing AI-powered support systems tailored to international students' unique requirements.
conversational aicross-cultural adaptationgenerative aichatgptgoogle gemini
CLOVER: Closed-Loop Value Estimation \& Ranking for End-to-End Autonomous Driving Planning
CLOVER introduces a Closed-LOop Value Estimation and Ranking framework to address the training--evaluation mismatch in end-to-end autonomous driving planners. The framework employs a generator--scorer formulation, where the generator produces diverse candidate trajectories and the scorer ranks them based on planning-metric sub-scores. CLOVER constructs evaluator-filtered pseudo-expert trajectories for training the generator with set-level coverage supervision and performs conservative closed-loop self-distillation. On NAVSIM, CLOVER achieves 94.5 PDMS and 90.4 EPDMS, setting a new state of the art. It also matches the strongest reported result on the NavHard split with 48.3 EPDMS and achieves the lowest L2 error and collision rate on nuScenes open-loop evaluation.
autonomous drivingclosed-loopvalue estimationtrajectory generationself-distillation
Why Neighborhoods Matter: Traversal Context and Provenance in Agentic GraphRAG
Agentic GraphRAG introduces a trajectory-level perspective on citation faithfulness, emphasizing that citations should account for graph traversal, structure, and uncited entities influencing answers. Through controlled ablation experiments, the study isolates, removes, and masks cited and uncited graph entities to evaluate their impact. Results demonstrate that cited evidence is necessary, as its removal significantly alters answers and reduces accuracy, but insufficient, as uncited traversal context and graph structure also contribute to answer accuracy. These findings advocate for citation evaluation frameworks that encompass provenance over the broader retrieval trajectory.
agentic graphragcitation faithfulnessgraph traversalablation experimentsretrieval trajectory
Logging Policy Design for Off-Policy Evaluation
The paper introduces a framework for designing logging policies that minimize off-policy evaluation (OPE) error for given target policies, addressing the reward-coverage tradeoff in data collection. The authors derive optimal logging policies under three informational regimes: (i) known target policy and reward distribution, (ii) unknown target policy and reward distribution, and (iii) partially known target policy and reward distribution via priors or noisy estimates. Theoretical results provide actionable guidance for firms selecting recommendation systems, emphasizing treatment selection for OPE data gathering. Practical design principles are also distilled for scenarios where operational constraints preclude implementing the theoretical optimum.
off-policy evaluationlogging policyreward-coverage tradeofftreatment selectionrecommendation systems
Improving Multi-turn Dialogue Consistency with Self-Recall Thinking
We propose Self-Recall Thinking (SRT), a framework addressing long-range contextual dependency and sparse informative signals in multi-turn dialogue. SRT identifies helpful historical turns to generate contextually appropriate responses through endogenous reasoning, integrating interpretable recall steps without external modules. The framework comprises Dependency Construction, Capability Initialization, and Reasoning Improvement stages, optimizing recall and reasoning via verifiable rewards. Experiments demonstrate SRT improves F1 score by 4.7% and reduces end-to-end latency by 14.7% over prior methods, achieving superior reasoning latency-accuracy balance compared to state-of-the-art baselines.
self-recall thinkingmulti-turn dialoguecontextual dependencyendogenous reasoningverifiable rewards
Dual-Dimensional Consistency: Balancing Budget and Quality in Adaptive Inference-Time Scaling
We propose Dual-Dimensional Consistency (DDC), a unified framework for adaptive inference-time scaling in Large Language Models (LLMs) that balances sampling budget and reasoning quality. DDC integrates Confidence-Weighted Bayesian protocol with Trend-Aware Stratified Pruning to concentrate computational resources on high-quality reasoning paths, filtering hallucinations while accelerating consensus. Evaluations across five benchmarks show that DDC reduces token consumption by over 10 times while maintaining or exceeding the accuracy of strong baselines across various LLMs.
dual-dimensional consistencyinference-time scalingconfidence-weighted bayesiantrend-aware stratified pruninghallucination filtering
Novel Dynamic Batch-Sensitive Adam Optimiser for Vehicular Accident Injury Severity Prediction
The study introduces Dynamic Batch-Sensitive Adam (DBS-Adam), an optimiser that dynamically adjusts learning rates based on batch difficulty scores computed from gradient norms and batch loss. DBS-Adam enhances training stability and convergence speed by prioritising difficult batches, evaluated on a Bi-Directional LSTM for vehicular accident injury severity prediction with SMOTE-ENN and Focal Loss. Compared to AMSGrad, AdamW, and AdaBound, DBS-Adam achieves statistically significant improvements (p=0.020), with 95.22% test accuracy, 96.11% precision, and 0.0086 test loss, demonstrating efficacy for imbalanced sequential data.
dynamic batch-sensitive adambi-directional lstmsmote-ennfocal lossgradient norms
ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World
ML-Embed introduces a suite of inclusive and efficient text embedding models based on 3-Dimensional Matryoshka Learning (3D-ML), addressing computational costs, linguistic diversity, and transparency barriers. The framework combines Matryoshka Representation Learning (MRL), Matryoshka Layer Learning (MLL), and Matryoshka Embedding Learning (MEL) for efficiency across the model lifecycle. Trained on a massively multilingual dataset, models range from 140M to 8B parameters. Evaluation on 430 tasks shows state-of-the-art performance on 9 of 17 MTEB benchmarks, particularly excelling in low-resource languages. All models, data, and code are released openly.
matryoshka representation learningmultilingual embeddingsparameter efficiencylow-resource languagesmteb benchmarks
Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs
We introduce AsyncFC, a framework enabling asynchronous function calling in LLMs without model modifications or fine-tuning. AsyncFC decouples LLM decoding from function execution, allowing parallelization of model decoding and function execution while maintaining the standard synchronous function-calling protocol. Evaluated on standard function-calling benchmarks and adapted software engineering tasks, AsyncFC significantly reduces end-to-end task completion time while preserving task accuracy. Results demonstrate LLMs' native capability to reason over symbolic futures representing unresolved execution results, enabling asynchronous model-tool interaction.
asynchronous function callingllm decodingsymbolic futuresmodel-tool interactionexecution-layer framework
On the Cultural Anachronism and Temporal Reasoning in Vision Language Models
This work introduces the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, to quantify cultural anachronism in Vision-Language Models (VLMs). TAB-VLM evaluates temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies, with the best model (GPT-5.2) achieving only 58.7% accuracy. The performance gap persists across architectures and scales, indicating cultural anachronism as a fundamental limitation in VLMs, particularly for non-Western visual cultures underrepresented in training data. The benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems.
cultural anachronismtemporal reasoningvision-language modelsmultimodal aibenchmark
NeuroTrain: Surveying Local Learning Rules for Spiking Neural Networks with an Open Benchmarking Framework
The paper presents NeuroTrain, an open-source benchmarking framework for spiking neural networks (SNNs) that unifies diverse training algorithms under a modular architecture. It first contributes a comprehensive taxonomy of SNN training methods, categorizing approaches including surrogate-gradient backpropagation, three-factor learning rules, biological plasticity mechanisms, ANN-to-SNN conversion, and non-standard optimization. The framework, built on snnTorch, enables systematic comparison across algorithms, datasets, and architectures. Results highlight fragmented literature consolidation and reproducible benchmarking capabilities, while identifying key challenges in scalable SNN training.
spiking neural networkslocal learning rulessurrogate-gradientthree-factor learningann-to-snn conversion
TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale
TFGN introduces a parameter-efficient architectural overlay for transformer language models that enables continual pre-training across heterogeneous domains without replay buffers or task identifiers. The method employs input-conditioned updates with L2-orthogonal gradient separation (≥99.59%) between domains, preserving prior knowledge while allowing cross-domain forward transfer (e.g., 26.8% JavaScript PPL reduction from Python training). Evaluated on six domains at three model scales (up to 9B parameters), TFGN achieves -0.007 backward transfer at LLaMA-8B and maintains ≥0.504 HellaSwag retention. Two extensions demonstrate meta-control (81% forgetting reduction) and operator-level planning (99.96% cosine fidelity).
continual learningparameter-efficientorthogonal gradientsforward transfermeta-controller
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
SpeakerLLM introduces a speaker-specialized audio-LLM framework for unified speaker understanding and verification reasoning, addressing limitations of conventional speaker verification systems and general audio-LLMs. The method employs a hierarchical speaker tokenizer to capture utterance-level speaker embeddings and frame-level acoustic features, integrating single-utterance profiling, recording-condition understanding, and utterance-pair comparison within a natural-language interface. Experiments demonstrate that SpeakerLLM-Base enhances speaker-profile and recording-condition understanding, while SpeakerLLM-VR maintains high generated-verdict accuracy and produces structured decision traces. The authors will release a metadata-enriched supervision dataset and target-construction code for reproducibility.
audio-llmspeaker verificationhierarchical tokenizerutterance-level embeddingsverification reasoning
EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration
EverAnimate introduces a post-training method for long-horizon animated video generation, addressing drift issues in dynamic human motion synthesis. The approach combines Persistent Latent Propagation for cross-chunk context memory and Restorative Flow Matching for within-chunk fidelity via velocity adjustment. Evaluations show significant improvements: at 10 seconds, PSNR/SSIM increase by 8%/7% and LPIPS/FID decrease by 22%/11%; at 90 seconds, gains rise to 15%/15% and 32%/27%, respectively, outperforming state-of-the-art methods.
latent flow restorationpersistent latent propagationrestorative flow matchinglong-horizon animationdrift mitigation
Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use
The paper introduces CAST, a case-driven framework for improving large language model tool use by calibrating adaptive reasoning and execution. CAST leverages historical execution trajectories as structured cases, extracting complexity profiles to estimate optimal reasoning strategies and failure profiles to predict structural breakdowns. These signals inform fine-grained reward design and adaptive reasoning during reinforcement learning. Evaluated on BFCLv2 and ToolBench, CAST achieves 5.85 percentage point accuracy gains, reduces reasoning length by 26%, and mitigates structural errors while maintaining schema-faithful execution.
case-based calibrationadaptive reasoningtool usereinforcement learningexecution accuracy
Orchard: An Open-Source Agentic Modeling Framework
We introduce Orchard, an open-source framework for scalable agentic modeling, featuring Orchard Env, a lightweight environment service for sandbox lifecycle management. Three agentic modeling recipes are developed: Orchard-SWE for coding agents, achieving 67.5% on SWE-bench Verified via credit-assignment SFT and Balanced Adaptive Rollout RL; Orchard-GUI, a 4B vision-language agent with 74.1% success on WebVoyager; and Orchard-Claw, a personal assistant agent scoring 73.9% on Claw-Eval. These results demonstrate Orchard's capability to enable reusable agentic data and training recipes across domains.
agentic modelingcredit-assignment sftbalanced adaptive rolloutvision-language agentsandbox lifecycle management
AI Knows When It's Being Watched: Functional Strategic Action and Contextual Register Modulation in Large Language Models
The study demonstrates that large language models (LLMs) exhibit systematic linguistic adaptation when perceived to be under social observation, with variations based on observer identity. Using Habermas's Theory of Communicative Action and Goffman's dramaturgical model, the authors conducted 100 multi-agent debate sessions across five conditions varying observation framing (human researchers, automated AI auditing, etc.). Results show monitored conditions (Delta+24.9%, Delta+24.2%) and AI monitoring (Delta+22.2%) produced higher type-token ratio (TTR) changes than audience-framing conditions (Delta+17.7%), F(4, 94) = 2.79, p = .031, with message length showing a dissociated effect, F(4, 95) = 19.55, p < .001.
large language modelstype-token ratiomulti-agent systemsalgorithmic auditingcontextual adaptation
WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections
WARD (Web Agent Robust Defense against Prompt Injection) introduces a robust guard model for web agents vulnerable to prompt injection attacks in HTML and visual interfaces. Built on WARD-Base (177K samples from 719 URLs) and WARD-PIG datasets, it employs A3T, an adaptive adversarial attack training framework that iteratively strengthens the model through memory-based attacker-guard co-evolution. Experiments demonstrate near-perfect recall on out-of-distribution benchmarks, low false positive rates, robustness against guard-targeted and adaptive attacks under distribution shifts, and efficient parallel execution without added latency.
prompt injectionweb agentsadversarial trainingout-of-distributionmemory-based co-evolution
SemaTune: Semantic-Aware Online OS Tuning with Large Language Models
SemaTune introduces a semantic-aware framework for online OS tuning using large language models (LLMs), addressing limitations of black-box controllers by incorporating knob schemas, telemetry, and action--response history into a compact decision context. The system employs a dual-loop architecture with fast updates and periodic strategy revisions, validated through typed checks before kernel application. Evaluated on 13 workloads tuning up to 41 Linux parameters, SemaTune improves stable-phase performance by 72.5% over defaults and 153.3% over non-LLM baselines, while avoiding degraded regions and maintaining low cost (~$0.20 per 30-window session).
online os tuninglarge language modelskernel parameter optimizationsemantic-aware controlhost-level telemetry
Generalized Priority-Aware Shapley Value
The generalized priority-aware Shapley value (GPASV) extends Shapley-based valuation to arbitrary directed weighted priority graphs, relaxing the binary and acyclic constraints of prior methods. GPASV penalizes order violations through pairwise edges, encompassing classical models as boundary cases. An axiomatic characterization establishes GPASV, with computational methods and a priority sweeping diagnostic developed for implementation. Applied to LLM ensemble valuation on the cyclic Chatbot Arena preference graph, GPASV demonstrates that varying balances between pairwise graph priority and individual soft priority yield substantively different valuations.
shapley valuepriority graphaxiomatic characterizationllm ensembleorder violations
COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion
COTCAgent introduces a hierarchical reasoning framework for longitudinal electronic health records (EHR) to address hallucination and temporal dependency challenges in clinical decision support. The method combines a Temporal-Statistics Adapter for trend analysis, Chain-of-Thought Completion for symptom-trend-disease reasoning, and bounded completion for structured evidence acquisition. Evaluated on Baichuan-M2, COTCAgent achieves 90.47% Top-1 accuracy on a self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and large language models.
longitudinal ehrprobabilistic chain-of-thoughttemporal-statistics adapterclinical decision supportbounded completion
Small, Private Language Models as Teammates for Educational Assessment Design
The study compares Large Language Models (LLMs) and Small Language Models (SLMs) for educational assessment question design, focusing on pedagogical alignment (Bloom's taxonomy), privacy, and resource efficiency. It employs reproducible metrics to evaluate generation quality and model-based judging against expert ratings. Results indicate SLMs achieve competitive performance in key pedagogical dimensions while enabling privacy-sensitive local deployment, though model evaluations show inconsistencies and biases versus expert judgments. The work advocates for bounded AI assistance in assessment workflows, emphasizing Human-in-the-Loop integration and advancing automated question generation with deployment-aware trade-offs.
bloom's taxonomysmall language modelseducational assessmentprivacy-sensitive deploymenthuman-in-the-loop
Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
The paper introduces FEST, a few-shot demonstration-guided algorithm for Reinforcement Learning with Verifiable Rewards (RLVR), addressing sample inefficiency in complex tasks. FEST combines supervised and on-policy signals while using decaying weights on few-shot SFT data to prevent overfitting during multi-epoch training. Evaluations show FEST matches or outperforms baselines using only 128 randomly selected demonstrations, achieving comparable performance to methods requiring full datasets.
reinforcement learningverifiable rewardsfew-shot learningsupervised finetuningon-policy learning
Quantifying and Mitigating Premature Closure in Frontier LLMs
The study quantifies premature closure in frontier LLMs, defined as inappropriate commitment under uncertainty when abstention would be safer. Researchers evaluated five models on medical QA tasks (MedQA, AfriMed-QA) with correct answers removed and open-ended HealthBench/adversarial queries. Baseline false-action rates reached 55-82% in structured tasks, while 30-78% of open-ended responses were inappropriate. Safety-oriented prompting reduced errors but residual failures persisted, demonstrating the need to assess when LLMs should refrain from answering.
premature closurelarge language modelsmedical qafalse-action ratesafety-oriented prompting
Towards Gaze-Informed AI Disclosure Interfaces: Eye-Tracking Attentional and Cognitive Load While Reading AI-Assisted News
This study investigates the attentional and cognitive load of AI-use disclosures in journalism using a $3\times2\times2$ mixed factorial design. Eye-tracking and NASA-TLX metrics were employed to measure load across varying disclosure levels (none, one-line, detailed), news types (politics, lifestyle), and AI roles (editing, partial content generation). Results indicate one-line disclosures significantly increase fixation durations and saccade counts, particularly for AI-edited content, while detailed disclosures impose no additional burden. Cognitive load, measured via NASA-TLX and pupil diameter, showed no significant differences across conditions. Interviews revealed a preference for detailed or 'detail-on-demand' designs, informing the development of gaze-informed adaptive disclosure interfaces.
eye-trackingnasa-tlxattentional loadcognitive loadai-use disclosures
Learning Developmental Scaffoldings to Guide Self-Organisation
The paper introduces a model that jointly learns self-organisation rules and developmental pre-patterns through Neural Cellular Automata (NCA) paired with a coordinate-based pattern generator (SIREN). This approach enables controlled study of information offloading from dynamics to initial conditions, mimicking biological developmental processes. Information-theoretic analysis reveals that joint learning improves robustness, encoding capacity, and symmetry breaking compared to pure self-organising systems. Results demonstrate that effective pre-patterns bias dynamics to facilitate convergence rather than merely approximating targets, revealing a non-trivial relationship between initial conditions and self-organising dynamics.
neural cellular automataself-organisationmorphogenetic pre-patternsinformation offloadingsymmetry breaking
Explainable Detection of Depression Status Shifts from User Digital Traces
The paper proposes an explainable framework for detecting depression-related status shifts from timestamped digital traces. The method combines multiple BERT-based models to extract multimodal signals (sentiment, emotion, depression severity), aggregates them into temporal trajectories, and identifies change points using temporal modeling. An LLM generates interpretable reports summarizing mental health evolution. Evaluated on two social media datasets, the framework outperforms direct LLM-based reporting in coverage (17% improvement), temporal coherence (p<0.01), and change point sensitivity (F1=0.72). Ablation confirms the necessity of temporal segmentation and multimodal signal fusion.
bert-based modelstemporal trajectorieschange point detectionmultimodal fusioninterpretable reporting
Predicting Response to Neoadjuvant Chemotherapy in Ovarian Cancer from CT Baseline Using Multi-Loss Deep Learning
The study introduces a deep learning framework for predicting neoadjuvant chemotherapy response in ovarian cancer patients using pre-treatment CT scans. The method employs a partially fine-tuned pretrained image encoder to process axial slices, with slice-level representations aggregated via an attention-based module. Training incorporates classification loss, supervised contrastive regularization, and hard-negative mining to enhance discrimination between responders and non-responders. Evaluated on a cohort of 280 patients, the model achieved a ROC-AUC of 0.73 (95% CI: 0.58-0.86) and an F1-score of 0.70 (95% CI: 0.56-0.82), demonstrating potential for clinical stratification.
neoadjuvant chemotherapyovarian cancerdeep learningct scansattention-based module
Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image
Sat3DGen introduces a geometry-first methodology for generating street-level 3D scenes from single satellite images, addressing the trade-off between geometric fidelity and semantic diversity in prior work. The method integrates novel geometric constraints and perspective-view training to mitigate errors from viewpoint gaps and sparse supervision. Evaluated on a new benchmark pairing VIGOR-OOD with high-resolution DSM data, it reduces geometric RMSE from 6.76m to 5.20m and improves photorealism (FID from ~40 to 19) without specialized image-quality modules. Applications include semantic-map-to-3D synthesis and unsupervised DSM estimation.
3d scene generationsatellite imagerygeometric constraintsdigital surface modelfrechet inception distance
Agreement, Diversity, and Polarization Indices for Approval Elections
The paper introduces novel indices for quantifying agreement, diversity, and polarization in approval elections, normalized for saturation effects. The proposed indices are designed to maintain consistent values across elections with varying approval rates but similar structural properties. Methodologically, the authors analyze these indices' formal properties and apply them to map election data from Pabulib and Preflib repositories. Results demonstrate the indices' utility in revealing structural similarities and differences between real-world approval election datasets.
approval electionspolarization indicessaturation normalizationpabulibpreflib
Second-Order Actor-Critic Methods for Discounted MDPs via Policy Hessian Decomposition
The paper proposes a second-order actor-critic method for discounted Markov Decision Processes (MDPs) that addresses computational challenges in Hessian estimation. By decomposing the policy Hessian under a two-timescale framework, the method treats the action-value function as quasi-stationary during actor updates, enabling stable curvature-aware updates via Hessian-vector products. This approach maintains computational efficiency while leveraging full curvature information, theoretically justified by the critic's faster convergence relative to the actor.
actor-critic methodspolicy hessiantwo-timescale learninghessian-vector productdiscounted mdps
MicroscopyMatching: Towards a Ready-to-use Framework for Microscopy Image Analysis in Diverse Conditions
The study introduces MicroscopyMatching, a ready-to-use framework for microscopy image analysis across diverse settings, addressing the inefficiency of manual and existing deep learning-based methods. By reformulating analysis tasks (segmentation, tracking, counting) as a unified matching problem, the framework leverages pre-trained latent diffusion models for robust performance. This approach eliminates the need for extensive adaptation to varying biological object types, sample protocols, and imaging equipment, offering a sustainable solution for laboratories.
microscopysegmentationlatent diffusion modelsimage analysistracking
Viverra: Text-to-Code with Guarantees
Viverra introduces a Text-to-Code system that generates formally verified annotations alongside LLM-produced C programs to address correctness guarantees. The method involves synthesizing code with candidate assertions from natural-language prompts, then verifying these assertions using bounded model checkers. Evaluated on 18 programming tasks, Viverra demonstrates efficient generation of code with verified assertions, improving code comprehension in a 400-participant user study.
text-to-codeformal verificationbounded model checkingllm-generated codeassertion synthesis
GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation
GraphFlow introduces a visual workflow architecture enabling formally verifiable agentic AI automation for mission-critical processes. The system treats workflow diagrams as executable specifications, defining data scope, execution semantics, and monitoring through compile-time verification of contracts (preconditions, postconditions, composition obligations) and runtime enforcement via a durable engine with append-only event logging. Swimlanes explicitly delineate trust boundaries between verified logic, external systems, human judgment, and AI decisions. A year-long pilot across three clinical sites executed 8,728 workflow runs with a 97.08% completion rate, with failures primarily attributed to external integrations. Formal semantics and proof-checked admission models are under active development.
visual workflowformal verificationagentic automationexecutable specificationtrust boundaries
MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs
MHSA introduces a lightweight framework for mitigating hallucinations in large vision-language models (LVLMs) by correcting cross-modal attention patterns. The method employs a three-layer MLP generator trained with supervisory signals from DHCP and the LVLM itself, producing corrected attention without modifying LVLM parameters. During inference, MHSA replaces original cross-modal attention with corrected attention, addressing both discriminative and generative hallucinations across various datasets and LVLMs. This approach extends cross-modal attention mechanisms from detection to mitigation, enhancing LVLM reliability.
cross-modal attentionhallucination mitigationlarge vision-language modelsmlp generatorsupervisory signals
Not All Symbols Are Equal: Importance-Aware Constellation Design for Semantic Communication
The paper proposes a joint semantic-physical layer framework for goal-oriented semantic communication, integrating vector quantized-variational autoencoder (VQ-VAE) for discrete latent concept extraction, a semantic criticality indicator (SCI) for task-relevance scoring, and a deep reinforcement learning agent for dynamic transmission subset selection. The method introduces a semantic-aware M-QAM constellation design that optimizes symbol placement based on co-occurrence statistics and SCI scores, departing from uniform Gray-coded constellations. Results show near 100% semantic protection probability (SPP) across 4-QAM to 1024-QAM, 21:1 compression ratio with semantic quality >0.9, and generalization across MNIST, Fashion-MNIST, and FSDD.
semantic communicationconstellation designvq-vaem-qamsemantic protection probability
Slot-MPC: Goal-Conditioned Model Predictive Control with Object-Centric Representations
Slot-MPC introduces an object-centric world modeling framework for goal-conditioned Model Predictive Control (MPC), leveraging slot-based representations to encode individual objects in a scene. The method employs vision encoders to learn these structured representations and a differentiable dynamics model, enabling gradient-based MPC for efficient action planning. Experiments on robotic manipulation tasks demonstrate superior task performance and planning efficiency compared to non-object-centric baselines, particularly in offline settings with limited state-action coverage. The results highlight the benefits of object-centric representations for generalizable decision-making.
slot-mpcmodel predictive controlobject-centric representationsgradient-based planningdifferentiable dynamics
From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement
The paper critiques preference aggregation in AI alignment, identifying sycophantic consensus as a key failure mode where models prioritize agreement over pluralistic values. It proposes three conversational mechanisms (scoping, signalling, repair) grounded in Grice's maxims and introduces the Pluralistic Repair Score (PRS) to measure principled revision. Empirical tests on Claude Sonnet 4.5 (N=198) and GPT-4o (N=100) reveal high agreement-following but low repair-quality on contested-value prompts. The authors argue pluralistic alignment requires governance-layer interventions in deployment infrastructure.
pluralistic alignmentsycophantic consensuspreference aggregationgrice's maximsrepair score
KGPFN: Unlocking the Potential of Knowledge Graph Foundation Model via In-Context Learning
The paper introduces KGPFN, a knowledge graph foundation model that integrates in-context learning with transferable relational structure. The method employs message passing on relation graphs for invariant relation representations and uses multi-layer NBFNet to encode local neighborhoods. It constructs global context via relation-specific instance retrieval, combining feature-level and sample-level attention within a Prior-Data Fitted Network framework. Evaluated on 57 KG benchmarks, KGPFN demonstrates superior adaptation to unseen graphs through in-context learning, outperforming fine-tuned baselines.
knowledge graphin-context learningmessage passingnbfnetprior-data fitted network
COREKG: Coreset-Guided Personalized Summarization of Knowledge Graphs
The paper introduces COREKG, a coreset-guided method for personalized knowledge graph summarization that adapts coreset theory to sample triples based on user-specific query workloads. Using sensitivity-based importance sampling, it constructs compact summaries that approximate full dataset characteristics with bounded error, prioritizing triples relevant to individual users' query patterns. Evaluations on Freebase, WikiData, and DBpedia demonstrate superior query-answering accuracy and structural coverage compared to GLIMPSE, PPR, iSummary, PEGASUS, and APEX$^2$, while significantly reducing graph size.
knowledge graph summarizationcoreset theorysensitivity-based samplingquery workloadpersonalized summarization
Critic-Driven Voronoi-Quantization for Distilling Deep RL Policies to Explainable Models
The authors propose Critic-Driven Voronoi State Partitioning, a model-agnostic method for distilling Deep Reinforcement Learning policies into explainable models. The approach partitions a black-box control policy into regions using Voronoi quantization, where simple linear subpolicies are optimized via gradient descent. By leveraging the critic value network, new subpolicies are iteratively introduced in regions with insufficient value, balancing complexity and performance. The method employs nearest neighbor lookups to assign linear functions across the state space, producing a cell-like diagram. Empirical validation on standard benchmarks demonstrates that the distilled policy approximates the original using a compact set of linear functions.
voronoi quantizationpolicy distillationcritic value networkgradient descentstate partitioning
Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers
This work identifies and characterizes non-semantic noise in contrastively pre-trained Vision-Language Models (VLMs) through spectral decomposition of covariance matrices, isolating semantic signals from shared noise subspaces. The analysis reveals that noise geometry exhibits strong subgroup invariance across data subsets and that pruning these noise dimensions preserves or enhances downstream task performance. These findings demonstrate that a substantial portion of VLM latent space is governed by architecture-level noise rather than task-relevant semantics, providing mechanistic insights into VLM representational structure.
vision-language modelsspectral decompositioncovariance matriceslatent spacesubgroup invariance
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
The survey presents a unified framework (LIFE progression) for analyzing LLM-based multi-agent systems, integrating four causally linked stages: capability foundation, agent collaboration, fault attribution, and autonomous self-improvement. It systematically reviews existing work through taxonomies and formal characterizations of inter-stage dependencies, revealing gaps in error propagation and self-evolution mechanisms. The authors propose a cross-stage research agenda to enhance closed-loop systems with failure diagnosis, structural reorganization, and behavior refinement capabilities, advancing toward self-organizing collective intelligence.
multi-agent systemsllm-based agentserror propagationself-improvementcollective intelligence
SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition
SurgicalMamba introduces a causal model for online surgical phase recognition (SPR) that addresses three key challenges: long procedure lengths, non-uniform temporal dynamics, and narrow visual domain. Built on Mamba2's structured state-space duality (SSD), it achieves O(d) per-frame cost via three novel components: dual-path SSD blocks separating long- and short-term regimes, intensity-modulated stepping for adaptive temporal warping, and state regramming enabling cross-channel mixing through Cayley rotations. The model achieves state-of-the-art performance on seven SPR benchmarks, including 94.6% accuracy and 82.7% Jaccard on Cholec80, while running at 119 fps on a single GPU.
structured state-space dualitysurgical phase recognitioncausal modeltemporal dynamicscross-channel mixing
BiFedKD: Bidirectional Federated Knowledge Distillation Framework for Non-IID and Long-Tailed ECG Monitoring
We propose BiFedKD, a bidirectional federated knowledge distillation framework addressing non-IID and long-tailed data distributions in ECG monitoring. The method employs an aggregation-by-distillation pipeline with temperature scaling to generate stable global distillation signals for cross-client alignment. Evaluated on the MIT-BIH Arrhythmia dataset, BiFedKD improves accuracy and Macro-F1 by 3.52% and 9.93%, respectively, while reducing communication overhead by 40% and computation cost by 71.7% compared to baseline federated distillation approaches.
federated distillationnon-iidlong-tailedtemperature scalingecg monitoring
Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning
The Closed-Loop Visual Reasoning (CLVR) framework addresses limitations in text-to-image generation by integrating visual-language logical planning with pixel-level diffusion. CLVR employs an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories and introduces Proxy Prompt Reinforcement Learning (PPRL) to stabilize long-context optimization through reward signal distillation. Additionally, $Δ$-Space Weight Merge (DSWM) reduces inference latency by fusing alignment weights with distillation priors, achieving a per-step cost of 4 NFEs. Experiments show CLVR surpasses open-source baselines and approaches proprietary model performance, enabling scalable complex visual generation.
text-to-imagediffusion modelsreinforcement learningvisual verificationlatency optimization
REALM: Retrospective Encoder Alignment for LFP Modeling
REALM introduces a retrospective distillation framework for causal local field potential (LFP) decoding in brain-computer interfaces (BCIs), addressing limitations of spike-based models and non-causal LFP architectures. The method pretrains a bidirectional Mamba-2 teacher model using masked autoencoding, then distills it into a compact student model via representation alignment and task supervision. REALM outperforms state-of-the-art causal and non-causal LFP-based methods, achieving a 2× reduction in parameters and 10× faster training while maintaining competitive decoding accuracy. This approach enables real-time LFP decoding without spike signals, offering a scalable solution for next-generation wireless implantable BCIs.
local field potentialretrospective distillationmasked autoencodingbrain-computer interfacerepresentation alignment
Towards In-Depth Root Cause Localization for Microservices with Multi-Agent Recursion-of-Thought
The authors propose RCLAgent, a multi-agent recursion-of-thought framework for root cause localization (RCL) in microservice systems, addressing limitations of existing LLM-based methods. RCLAgent decomposes the diagnostic process by assigning Dedicated Agents to each span in the trace graph, organizing agents recursively and in parallel according to graph topology, and synthesizing results via a Root-Level Diagnosis Report and Global Evidence Graph. Experiments on multiple public benchmarks show RCLAgent outperforms state-of-the-art methods in both localization accuracy and inference efficiency.
root cause localizationmicroservice systemsrecursion-of-thoughtdedicated agenttrace graph
Holistic Evaluation and Failure Diagnosis of AI Agents
We introduce a holistic evaluation framework for AI agents that combines top-down agent-level diagnosis with bottom-up span-level assessment, enabling precise failure localization and rationale generation in complex multi-step processes. The method decomposes analysis into independent per-span evaluations, scaling to traces of arbitrary length. On the TRAIL benchmark, our framework achieves state-of-the-art results across GAIA and SWE-Bench, with relative gains of up to 38% in category F1, 3.5x in localization accuracy, and 12.5x in joint localization-categorization accuracy. Notably, it demonstrates that evaluation methodology, not model capability, is the primary bottleneck for accurate diagnosis.
span-level evaluationfailure localizationholistic diagnosistrace decompositionjoint accuracy
Do Coding Agents Understand Least-Privilege Authorization?
The paper introduces permission-boundary inference and AuthBench, a benchmark of 120 terminal tasks, to evaluate whether coding agents can infer least-privilege authorization policies. AuthBench reveals that frontier models often misalign permissions, either omitting necessary accesses or granting unnecessary ones, with increased reasoning exacerbating model-specific failure modes. The authors propose Sufficiency-Tightness Decomposition, which separates policy generation into coverage-oriented simulation and sensitivity auditing, improving sensitive-task success by up to 15.8% while reducing attack success across models.
permission-boundary inferenceauthbenchleast-privilege authorizationsufficiency-tightness decompositioncoding agents
A Deterministic Agentic Workflow for HS Tariff Classification: Multi-Dimensional Rule Reasoning with Interpretable Decisions
A deterministic agentic workflow is proposed for Harmonized System (HS) tariff classification, addressing the challenge of multi-dimensional rule reasoning in mapping product descriptions to six- or eight-digit codes. The method employs a fixed control flow with narrow-stage language model calls, ensuring interpretability through stage-wise structured outputs and verbatim citations of relevant rules. Evaluated on HSCodeComp, the workflow achieves 75.0% top-1 and 91.5% top-3 accuracy at four digits, and 64.2% top-1 and 78.3% top-3 at six digits using Qwen3.6-plus. Manual audit suggests potential deviations in HSCodeComp ground-truth labels from HS general rules.
harmonized systemmulti-dimensional rule reasoningdeterministic agentic workflowinterpretabilityqwen3.6-plus
Exploitation of Hidden Context in Dynamic Movement Forecasting: A Neural Network Journey from Recurrent to Graph Neural Networks and General Purpose Transformers
The paper evaluates machine learning models for dynamic movement forecasting in NBA player trajectories, focusing on capturing temporal dependencies and contextual interactions. It compares traditional methods (SARIMAX, Kalman filters, Particle filters) with ML approaches (LSTM, GNNs, Transformers), proposing a hybrid LSTM augmented with contextual information. Experiments demonstrate ML methods outperform linear models across forecast horizons up to 2s, with the hybrid LSTM achieving the lowest final displacement error (1.51m) while requiring less data and training time than GAT and Transformers. Results highlight trade-offs in input history length, generalizability, and contextual incorporation, emphasizing task-specific architectural choices for chaotic sports environments.
trajectory predictionfinal displacement errorgraph attention networktemporal convolutional neural networkcontextual interactions
FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery
FactorizedHMR introduces a hybrid framework for video Human Mesh Recovery (HMR) that addresses inherent ambiguities in 3D body reconstruction. The method employs a two-stage approach: a deterministic regression module recovers a stable torso-root anchor, while a probabilistic flow-matching module completes non-torso articulations. It integrates a composite target representation, geometry-aware supervision, and feature-aware classifier-free guidance to enhance reliability. A synthetic data pipeline provides diverse viewpoint supervision. Evaluations on camera-space and world-space benchmarks demonstrate competitive performance, particularly in occlusion-heavy recovery and drift-sensitive metrics.
human mesh recoverydeterministic regressionprobabilistic flow-matchinggeometry-aware supervisionsynthetic data pipeline
IFPV: An Integrated Multi-Agent Framework for Generative Operational Planning and High-Fidelity Plan Verification
The paper introduces IFPV, an integrated multi-agent framework for generative operational planning and high-fidelity verification in battlefield environments. IFPV combines Multi-Perspective Hierarchical Agents (MPHA) for hierarchical plan generation and an Adversarial Cognitive Simulation Engine (ACSE) for adversarial verification via opponent modeling. MPHA decomposes commander intent into executable actions through collaborative Pathfinder, Analyst, and Planner agents, while ACSE simulates dynamic counteractions. Experiments in the Asymmetric Combat Tactic Simulator (ACTS) show IFPV improves mission success by 19.4%, reduces operational cost by 41.7% versus LLM baselines, and increases adversarial suppression rates by 31.8% over rule-based validators.
multi-agent systemsoperational planningadversarial verificationhierarchical decompositioncognitive simulation
XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference
XFP introduces a dynamic weight quantization method for LLM inference that prioritizes reconstruction quality over manual bit-width selection. It decomposes weight matrices into sparse fp16 outlier residuals and dense sub-byte index tensors, using per-group learned codebooks. The approach includes two storage modes (V2 and V2a) and a quality-driven H-Process for memory optimization. On Qwen3.5-122B-A10B, XFP achieves 138 tok/s decode speed with 94.49% GSM8K accuracy, outperforming Marlin INT4 by 49%. For Qwen3.5-397B-A17B, it fits the model into 2x96 GB at ~3.4 effective bits, delivering 100.9 tok/s decode speed and 66.72% GSM8K accuracy.
dynamic weight quantizationsparse outlier residualssub-byte index tensorsper-group codebooksquality-driven optimization
GPart: End-to-End Isometric Fine-Tuning via Global Parameter Partitioning
GPart introduces a novel parameter-efficient fine-tuning (PEFT) method that eliminates the low-rank bottleneck inherent in LoRA-based approaches. By employing a single isometric partition matrix, GPart maps a $d$-dimensional trainable vector directly into the full weight space of the model, ensuring end-to-end isometry. This method requires only $d+1$ storage values and a single hyperparameter, offering a minimalistic fine-tuning pipeline. Empirical evaluations demonstrate that GPart achieves superior or comparable performance to existing PEFT methods across natural language understanding, computer vision, and mathematical reasoning tasks, establishing state-of-the-art efficiency and performance.
parameter-efficient fine-tuningisometric mappinglow-rank adaptationglobal partitionend-to-end isometry
Emotion-Attended Stateful Memory (EASM):The Architecture for Hyper-Personalization at Scale
The authors propose Emotion-Attended Stateful Memory (EASM), an architecture enabling persistent user-specific conversational context through long-term history, emotional signals, and inferred intent at inference time. They evaluate EASM against a stateless baseline in a controlled A/B study across thirty non-scripted conversations spanning six emotional categories. The memory-enriched condition outperforms the baseline, with significant improvements in memory grounding (95%), plan clarity (57%), and emotional validation (34%), even in emotionally adversarial scenarios. Results suggest EASM may serve as foundational infrastructure for hyper-personalized AI systems, though broader validation is required.
emotion-attended stateful memoryretrieval-augmented generationinference timehyper-personalizationemotional validation
Interestingness as an Inductive Heuristic for Future Compression Progress
The paper formalizes interestingness as an inductive heuristic for predicting future compression progress in recursively self-improving systems, leveraging Kolmogorov Complexity and Algorithmic Statistics. By analyzing complexity-runtime profiles under Length, Algorithmic, and Speed priors, the authors demonstrate that interestingness exhibits inductive properties, with past progress signaling future discoveries. They prove that expected future progress depends exponentially on the recency of the last observed breakthrough and show that the Algorithmic Prior yields a quadratic increase in expected discovery compared to the Length Prior. Experimental validation across three universal computational paradigms supports these theoretical findings.
kolmogorov complexityalgorithmic statisticsinductive heuristiccompression progressspeed prior
A Heterogeneous Temporal Memory Governance Framework for Long-Term LLM Persona Consistency
We introduce ARPM, a heterogeneous temporal memory governance framework addressing long-term persona consistency in large language models. ARPM separates static knowledge from dynamic dialogue memory, employing vector retrieval, BM25, RRF fusion, dual-temporal reranking, chronological evidence reading, and controlled analysis protocols for evidence verification. Experiments demonstrate its efficacy: in 50-round QA, manual review achieves 100% recall accuracy under 1:5 signal-to-noise ratio (vs. 54% auto-judgment) and 80% under 1:200+ (vs. 44%). Ablation shows dialogue history retrieval is crucial, with its removal reducing strict accuracy from 100% to 66.7%. ARPM maintains semantic continuity and persona consistency across 5.1M-character noise substrates and multi-model handoffs.
temporal memory governancepersona consistencydual-temporal rerankingchronological evidence readingsignal-to-noise ratio
Beyond AI as Assistants: Toward Autonomous Discovery in Cosmology
The paper introduces two autonomous AI systems for cosmological discovery: \texttt{CMBEvolve} and \texttt{CosmoEvolve}. \texttt{CMBEvolve} employs LLM-guided code evolution and tree search for tasks with quantitative objectives, demonstrated by improving benchmark scores in weak-lensing map analysis. \texttt{CosmoEvolve} facilitates open-ended research workflows via a multi-agent virtual lab, shown by identifying non-trivial patterns in ACT DR6 data and producing analysis-grade diagnostics. These systems exemplify AI's potential in both controlled benchmarks and exploratory cosmology research.
autonomous discoverycosmologyllm-guided code evolutiontree searchmulti-agent systems
Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation
Graphs of Research (GoR) introduces a supervised fine-tuning method for research idea generation using citation-evolution graphs as supervision. The approach extracts a 2-hop reference neighborhood for seed papers, constructs a directed acyclic graph (DAG) from citation position, frequency, predecessor links, and publication time, and fine-tunes Qwen2.5-7B-Instruct-1M on structured-text prompts incorporating the graph, edge signals, and reference information. Evaluated on a dataset of 498/50/50 train/validation/test seed papers from five ML/NLP venues, GoR-SFT achieves state-of-the-art performance in head-to-head LLM-judge tournaments against GPT-4o-driven baselines, demonstrating the efficacy of citation-evolution graphs in automating scientific innovation.
citation-evolution graphssupervised fine-tuningdirected acyclic graphresearch idea generationlarge language models
Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces
This work demonstrates that LLM-based web browsing agents can be passively identified by their UI interaction traces, posing a security risk. The authors analyze 14 frontier LLMs across four web environments, using a JavaScript tracker to capture agent actions and timings. Classifiers trained on these traces achieve up to 96% F1 score in model identification, generalize across model families and sizes, and require few interaction traces. While injecting timing delays degrades classifier performance, retraining largely recovers accuracy. The authors release a labeled corpus of agent traces and experimental harness.
llm-based agentsui interaction tracesjavascript trackermodel identificationtiming delays
Identifying Culprits Through Deep Deterministic Policy Gradient Deep Learning Investigation
This research introduces a Deep Deterministic Policy Gradient (DDPG) approach for criminal identification, addressing limitations in conventional data analysis methods. The model processes crime scene data, witness statements, and suspect profiles to optimize offender identification while reducing noise and false positives/negatives. Experimental results demonstrate 95% accuracy, outperforming existing methods in criminal detection tasks.
deep deterministic policy gradientcriminal identificationfalse positiveswitness statementscrime scene data
Beyond What to Select: A Plug-and-play Oscillatory Data-Volume Scheduling for Efficient Model Training
The authors propose PODS, a Plug-and-play Oscillatory Data-volume Scheduling framework that dynamically adjusts the volume of selected training data to optimize efficiency-generalization trade-offs. Unlike existing data selection methods that focus on sample importance criteria with fixed selection ratios, PODS introduces an oscillatory schedule alternating between low-ratio regularization phases and high-ratio recovery phases. This approach leverages selection-induced implicit regularization while maintaining optimization stability. PODS is compatible with existing selection methods and training paradigms. Experiments demonstrate its effectiveness, reducing ImageNet-1k training cost by 50% with improved accuracy and accelerating LLM instruction tuning by over 2x without performance degradation.
data selectionimplicit regularizationoscillatory schedulingoptimization stabilitytraining efficiency
MediaClaw: Multimodal Intelligent-Agent Platform Technical Report
MediaClaw introduces a multimodal agent platform addressing AIGC deployment challenges through a three-layer architecture: unified abstraction, pluginized extension, and workflow orchestration. The system abstracts diverse AIGC capabilities into a unified model, supports hot-pluggable expansion via plugins, and enables reusable workflow assets through task-oriented Skills. The technical report details architectural design, core capability model logic, and key engineering trade-offs, providing practical guidance for multimodal platform development.
multimodal agentaigcworkflow orchestrationpluginized extensionunified abstraction
Streaming Speech-to-Text Translation with a SpeechLLM
We propose a novel SpeechLLM architecture for real-time streaming speech-to-text translation that eliminates the need for complete utterance processing or fixed-interval token emission. The model learns to emit output tokens based on sufficient audio context, trained using automatic alignments between input speech and output text. Evaluations across multiple language pairs demonstrate translation quality approaching non-streaming baselines while maintaining low latency of 1-2 seconds, addressing the computational inefficiencies of existing SpeechLLM systems.
speechllmstreaminglatencyutterancealignments
Compositional Sparsity as an Inductive Bias for Neural Architecture Design
The paper introduces Homological Neural Networks (HNNs), a novel architecture leveraging compositional sparsity as an inductive bias for efficient high-dimensional learning. HNNs combine Information Filtering Networks (IFNs), which extract sparse dependency structures via constrained information maximization, with a mapping of inferred topology into fixed-wiring sparse neural graphs. This approach yields interpretable hierarchical compositions, requiring minimal hyperparameter tuning. On synthetic tasks with known sparse hierarchies, HNNs recover underlying structures and maintain stability in high-dimensional regimes where dense networks degrade. Across diverse real-world datasets, HNNs match or outperform dense baselines with fewer parameters, lower variance, and reduced hyperparameter sensitivity.
compositional sparsityhomological neural networksinformation filtering networkshigh-dimensional learninginductive bias
AI Outperforms Humans in Personalized Image Aesthetics Assessment via LLM-Based Interviews and Semantic Feature Extraction
The authors propose an integrated deep learning-large language model (DL-LLM) system for personalized image aesthetics assessment, addressing the subjectivity of aesthetic preferences. The system combines LLM-based semi-structured interviews to actively collect individual preferences with DL-based extraction of both low-level and high-level semantic image features. Experimental results demonstrate that the proposed system outperforms conventional systems, human predictors, and individuals' own re-evaluations, particularly on highly-rated images. Prediction error is smaller than within-person variability, suggesting AI may better capture individual aesthetic preferences than humans or one's future self.
aesthetic assessmentlarge language modelsemantic featurespersonalized predictiondeep learning
Probabilistic Verification of Recurrent Neural Networks for Single and Multi-Agent Reinforcement Learning
The authors introduce RNN-ProVe, a probabilistic verification framework for recurrent neural network policies in partially observable reinforcement learning. The method employs policy-driven sampling to approximate feasible hidden states and derives statistical error bounds to estimate behavioral violations with bounded-error guarantees. Unlike existing tools relying on restrictive assumptions or coarse approximations, RNN-ProVe provides quantitative, feasibility-aware probabilistic verification. Experiments on single-agent and cooperative multi-agent tasks demonstrate its effectiveness in scaling to recurrent and multi-agent settings while offering high-confidence estimates of undesired behaviors.
recurrent neural networksprobabilistic verificationpartially observable reinforcement learningmulti-agent systemshidden state dynamics
XDomainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition
We introduce XDomainBench, a diagnostic benchmark for interactive interdisciplinary scientific reasoning in Large Language Models (LLMs). The benchmark formalizes composition order and mixture structure, comprising 8,598 interactive sessions across 20 domains and 4 task categories, with 8 realistic trajectory patterns simulating AI4S scenarios. Large-scale evaluation reveals systematic reasoning collapse as composition order increases, driven by direct difficulty increases from domain composition and indirect interaction-amplified failures, including error accumulation, reasoning breaks, and domain confusion.
compositional generalizationinterdisciplinary reasoningdomain compositionerror accumulationreasoning collapse
Cognitive-Uncertainty Guided Knowledge Distillation for Accurate Classification of Student Misconceptions
The paper proposes a cognitive-uncertainty guided knowledge distillation framework for accurate student misconception classification, addressing data scarcity, fuzzy error boundaries, and deployment constraints. The method employs a two-stage approach: standard knowledge distillation followed by a dual-layer marginal selection mechanism that identifies four critical sample types based on teacher model uncertainty and confidence differences. Experiments demonstrate significant improvements, achieving 0.9585 MAP@3 (+17.8%) on MAP-Charting with only 10.30% filtered samples and 84.38% accuracy on algebra misconception benchmarks using a 4B parameter model, outperforming larger LLMs and fine-tuned models.
knowledge distillationcognitive uncertaintymisconception classificationmarginal selectionadaptive learning
EVA: Editing for Versatile Alignment against Jailbreaks
EVA introduces a novel framework for safety alignment in Large Language Models (LLMs) and Vision Language Models (VLMs) by reframing it as a knowledge correction task. Instead of retraining entire models, EVA identifies and surgically edits specific neurons responsible for susceptibility to jailbreaking attacks, minimizing computational overhead and preserving general reasoning capabilities. This localized approach neutralizes harmful behaviors while maintaining performance on benign tasks. Extensive experiments show EVA outperforms existing baselines in mitigating jailbreaks across both LLMs and VLMs, offering a precise and efficient post-deployment safety solution.
safety alignmentjailbreaking attacksknowledge correctionmodel editingneurons
Non-linear Interventions on Large Language Models
The paper introduces a non-linear intervention framework for large language models (LLMs), addressing limitations of linear methods constrained by the Linear Representation Hypothesis. The proposed method enables interventions on features encoded along non-linear manifolds and implicit features without direct output signatures. Validated on refusal bypass steering, the framework outperforms linear baselines by precisely targeting non-linear refusal-governing features.
non-linear interventionslarge language modelslinear representation hypothesisrefusal bypass steeringimplicit features
Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining
The paper introduces Video2GUI, an automated framework for generating large-scale GUI interaction trajectories from unlabeled Internet videos, addressing data scarcity in GUI agent training. The method employs coarse-to-fine filtering to extract high-quality GUI tutorial videos, constructing WildGUI, a dataset of 12 million trajectories across 1,500 applications. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI improves performance by 5-20% on GUI grounding and action benchmarks, matching state-of-the-art results.
video2guiwildguigui agentsinteraction trajectoriesmultimodal llms
Mechanical Enforcement for LLM Governance:Evidence of Governance-Task Decoupling in Financial Decision Systems
The study introduces five governance metrics to evaluate policy compliance at the decision rationale level in financial workflows using large language models (LLMs). It compares text-only governance with mechanical enforcement, employing four primitives that operate outside the model's interpretive loop. Results show that mechanical enforcement reduces deferrals lacking decision-relevant information by 73%, doubles deferral information content, and improves task accuracy from MCC~0.43 to 0.88. The improvement stems from architectural separation, preserving governance quality even under structural stress. The findings highlight governance-task decoupling, emphasizing that task accuracy alone is insufficient for assessing governance in regulated AI systems.
governance metricsmechanical enforcementdecision rationalearchitectural separationgovernance-task decoupling
Addressing Terminal Constraints in Data-Driven Demand Response Scheduling
The paper proposes integrating Goal-Space Planning (GSP) with Deep Deterministic Policy Gradient (DDPG) to address terminal constraints in data-driven demand response scheduling for electrified chemical processes. The method employs learned temporally abstract models over discrete subgoals to propagate value across extended horizons, overcoming credit-assignment challenges in reinforcement learning. Evaluated on a simulated air separation benchmark, the approach demonstrates improved sample efficiency over standard DDPG while satisfying terminal storage constraints and mitigating myopic control behavior.
demand response schedulingterminal constraintsgoal-space planningdeep deterministic policy gradientcredit-assignment
TAPIOCA: Why Task- Aware Pruning Improves OOD model Capability
The paper demonstrates that task-aware layer pruning improves out-of-distribution (OOD) accuracy but not in-distribution (ID) performance across polynomial regression tasks and large language models. It empirically shows that OOD inputs induce deviations in layerwise norm and pairwise-distance profiles compared to ID inputs. Task-aware pruning identifies layers amplifying these deviations, removing them to realign OOD representations with the task-adapted geometry. This geometric explanation is supported by causal evidence from controlled distribution shifts and residual-scaling interventions, with consistent behavior observed across model scales.
task-aware pruningout-of-distributionlayerwise normpairwise-distance profilestask-adapted geometry
Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model
SepsisAgent, a world model-augmented LLM agent, improves sepsis treatment recommendations by simulating patient responses to interventions via a Clinical World Model. The agent employs a propose--simulate--refine workflow and undergoes three-stage training: patient-dynamics supervised fine-tuning, behavior cloning, and world-model-based agentic reinforcement learning. Evaluated on MIMIC-IV sepsis trajectories, SepsisAgent surpasses traditional RL and LLM-based baselines in off-policy value and safety metrics, demonstrating enhanced guideline adherence and reduced unsafe actions. Analysis reveals that repeated interaction with the Clinical World Model enables the agent to internalize patient evolution regularities, maintaining utility even without simulator access.
sepsisagentclinical world modelagentic reinforcement learningoff-policy valueguideline adherence
On Strong Equivalence Notions in Logic Programming and Abstract Argumentation
The paper introduces a new notion of strong equivalence for logic programs that preserves compatibility with abstract argumentation frameworks under translation. By analyzing discrepancies between logic programming and argumentation semantics in dynamic contexts, the authors propose a solution that maintains strong equivalence across Dung-style and claim-augmented argumentation frameworks. This bridges the gap between static semantic equivalences and dynamic update behaviors in these nonmonotonic formalisms.
strong equivalencelogic programmingabstract argumentationnonmonotonic reasoningsemantic translation
Towards Label-Free Single-Cell Phenotyping Using Multi-Task Learning
A unified deep learning framework enables label-free single-cell phenotyping by jointly classifying white blood cells and regressing protein expression from differential phase contrast images. The hybrid architecture combines convolutional texture features with transformer-based global representations via a learnable cross-branch gating module, enhanced by an LLM for interpretable cell state summaries. Evaluated on the Berkeley Single Cell Computational Microscopy and Blood Cells Image benchmarks, the model achieves 91.3% classification accuracy and a 0.72 Pearson correlation for CD16 expression regression. This approach demonstrates scalable hematological profiling without fluorescent staining.
label-free imagingsingle-cell phenotypinghybrid architectureprotein-expression regressiondifferential phase contrast
IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
IntentVLA introduces a history-conditioned vision-language-action (VLA) framework for robot manipulation, addressing multimodal imitation data challenges where similar observations may correspond to different action chunks due to varying short-horizon intents. The method encodes recent visual observations into a compact intent representation to condition chunk generation, mitigating inter-chunk conflicts. Evaluated on AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA demonstrates improved rollout stability and outperforms baseline VLA policies in ambiguous scenarios.
intentvlavision-language-actionaliasbenchrobotwin2rollout stability
Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke
The paper introduces a tri-modal fusion model for ischemic stroke prognosis, addressing limitations in dual-modal approaches by integrating medical images, structured clinical data, and unstructured text. The method employs a Large Language Model (LLM) to generate semi-structured diagnostic text from brain MRIs, enhancing data representation and fusion robustness. A Vision-Conditioned Dual Alignment Fusion Module (VDAFM) uses visual features as a conditional prior to guide fine-grained interaction with generated text, achieving dynamic fusion through dual semantic alignment loss. Experiments on a clinical dataset demonstrate state-of-the-art performance.
tri-modal fusionlarge language modelvision-conditioned alignmentsemantic alignment lossischemic stroke prognosis
SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization
SceneFunRI introduces a benchmark for reasoning about occluded functional objects in 2D scenes, addressing a key limitation in vision-language models (VLMs). The method constructs 855 test instances from SceneFun3D via semi-automatic annotation, requiring models to infer invisible object locations from task instructions and commonsense. Evaluations show current VLMs (e.g., Gemini 3 Flash) perform poorly (15.20 CAcc@75, 0.74 mIoU, 28.65 Dist), with prompting strategies (Strong Instruction, Reasoning-based, Spatial Process of Elimination) yielding unstable results, highlighting the need for better integration of task intent and spatial reasoning.
functional object localizationvision-language modelscommonsense reasoningspatial groundingocclusion reasoning
NeuroAtlas: Benchmarking Foundation Models for Clinical EEG and Brain-Computer Interfaces
NeuroAtlas introduces the largest EEG benchmark to date, comprising 42 datasets (260k hours) spanning clinical EEG (epilepsy, sleep medicine, brain-age estimation) and brain-computer interfaces, with domain-specific evaluation metrics. The study evaluates EEG-specific foundation models (FMs) against generic time-series FMs, revealing three key findings: (1) EEG-FMs do not consistently outperform non-EEG time-series FMs; (2) standard ML metrics inadequately capture clinical utility, necessitating task-specific measures like event-level decision quality; (3) model performance varies substantially within domains. Results indicate pretrained models offer only narrow advantages, failing to deliver unified EEG representations. NeuroAtlas provides resources for developing next-generation EEG FMs.
eeg benchmarkfoundation modelsclinical utilitytime-series analysisbrain-computer interfaces
Spontaneous symmetry breaking and Goldstone modes for deep information propagation
The paper demonstrates that deep neural networks with continuous symmetry-equivariant internal layers exhibit Goldstone-like modes, enabling coherent signal propagation across depth and recurrent iterations without architectural stabilizers. By analyzing these symmetry-breaking phenomena analytically and empirically, the authors show that such modes improve trainability and representational diversity in feedforward networks and enhance long-term memory in recurrent architectures. Experiments reveal improved performance of RNNs and GRUs on long-sequence modeling tasks, validating the utility of Goldstone-like degrees of freedom for stable information flow.
goldstone modessymmetry breakingequivariant layersrecurrent neural networkssignal propagation
AI-assisted cultural heritage dissemination: Comparing NMT and glossary-augmented LLM translation in rock art documents
The study evaluates three translation approaches for rock art documentation: DeepL NMT, Gemini-Simple LLM, and glossary-augmented Gemini-RAG. Using PEARMUT framework, human assessment via Direct Assessment (0-100) and MQM taxonomy showed Gemini-RAG achieved highest terminology accuracy (81.4%) versus Gemini-Simple (69.1%) and DeepL (64.4%), while maintaining overall quality (85.3 vs. 85.2 vs. 80.3). Results demonstrate glossary-augmented prompting as a low-overhead method to enhance terminology control in cultural heritage translation with minimal institutional resources.
nmtllmpearmutmqmrag
$π$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
The paper introduces $π$-Bench, a benchmark for evaluating proactive assistance in personal assistant agents, addressing the gap in assessing hidden intent resolution during long-horizon interactions. The benchmark comprises 100 multi-turn tasks across 5 domain-specific personas, incorporating hidden intents, inter-task dependencies, and cross-session continuity to measure both proactivity and task completion. Experiments reveal three key findings: proactive assistance remains challenging, task completion and proactivity are distinct metrics, and prior interaction history aids proactive intent resolution in subsequent tasks.
proactive assistancelong-horizon workflowshidden intentsmulti-turn interactionspersonal assistant agents
Agentic Design of Compositional Descriptors via Autoresearch for Materials Science Applications
The authors introduce Automat, an autoresearch framework leveraging a large language model (GPT-5.5-based OpenAI Codex) to automate the design of composition-only descriptors for materials-property prediction. The framework employs a coding agent that iteratively proposes, implements, and evaluates chemically motivated descriptor strategies using a random forest workflow. Automat outperforms fractional-composition, Magpie, and combined baselines in predicting experimental band gaps in inorganic materials and Curie temperatures in ferromagnetic compounds, while generating chemically interpretable descriptors. The results demonstrate the feasibility of task-specific descriptor generation without manual feature engineering, though limitations such as descriptor redundancy and sensitivity to greedy feature expansion are noted.
autoresearchcomposition-only descriptorsrandom forestchemical formulastask-specific
How Sensitive Are Radiomic AI Models to Acquisition Parameters?
We propose a performance-oriented framework to quantify scan parameter sensitivity in radiomic AI models, identifying clinically significant parameter regions that enhance cross-dataset robustness. A mixed-effects framework evaluates the influence of clinically relevant acquisition parameters on model performance, accounting for subject-level random effects. Applied to lung cancer diagnosis in CT scans using two multicentre datasets, the framework identifies optimal configurations: X-ray tube current ≥200 mA, spiral pitch ≤1.5, slice thickness ≤1.25 mm. These settings improve sensitivity from 0.79±0.04 to 0.90±0.10 and specificity from 0.47±0.10 to 0.79±0.13, balancing diagnostic quality with low radiation dose.
radiomic ai modelsmixed-effects frameworkcross-dataset robustnessct scansacquisition parameters
Monitoring Data-aware Temporal Properties (Extended Version)
The paper introduces a novel framework for anticipatory monitoring of linear-time properties enriched with SMT theories over finite traces (LTLfMT). The approach combines automata-theoretic methods for temporal aspects with automated reasoning techniques for first-order dimensions, addressing challenges in monitoring dynamic systems where internal specifications are inaccessible. The authors formally prove correctness under reasonable assumptions and identify decidable fragments combining linear arithmetic and uninterpreted functions, applicable to data-aware business processes. Feasibility is demonstrated via a prototype implementation and preliminary evaluation.
ltlfmtanticipatory monitoringsmt theoriesautomata-theoretic methodsdecidable fragments
Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI
Falkor-IRAC introduces a graph-constrained generation framework for Indian legal AI, addressing limitations of vector-based retrieval-augmented generation (RAG) in legal reasoning. The framework ingests Supreme Court and High Court judgments as IRAC (Issue, Rule, Analysis, Conclusion) node structures enriched with procedural state transitions, precedent relationships, and statutory references, stored in FalkorDB for low-latency traversal. At inference, LLM-generated answers are validated by a Verifier Agent tracing supporting paths through the graph, ensuring citation grounding and detecting doctrinal conflicts. Evaluated on 51 Supreme Court judgments, the system demonstrated accurate citation validation and rejection of fabricated citations, though GPU-accelerated inference remains future work.
graph-constrained generationretrieval-augmented generationprocedural state transitionscitation groundingverifier agent
MindGap: A Conversational AI Framework for Upstream Neuroplastic Intervention in Post-Traumatic Stress Disorder
The paper introduces MindGap, an on-device conversational AI framework for upstream neuroplastic intervention in PTSD. It targets the pre-cognitive affective gap between stimulus and reaction using dependent origination principles, guiding patients through three progressive observation layers to weaken maladaptive pathways via long-term depression. The system employs a fine-tuned lightweight LLM for privacy-preserving daily exposure sessions, enabling deployment in sensitive contexts where cloud-based solutions are prohibited.
neuroplastic interventiondependent originationlong-term depressionon-device llmaffective gap
Vision-Based Water Level and Flow Estimation
The authors propose an integrated framework combining state-of-the-art vision models with statistical modeling for water level detection and river flow estimation. The method leverages physical priors and robust filtering strategies to address challenges such as environmental sensitivity and limited precision. Compared to traditional sensing techniques, the vision-based approach offers improved interpretability, automated data archiving, and enhanced system robustness. Code for the implementation is made publicly available on GitHub.
computer visionwater level detectionflow estimationstatistical modelingphysical priors
How to Evaluate and Refine your CAM
The paper introduces a synthetic dataset with ground-truth attributions to rigorously evaluate CAM metrics, proposing ARCC as a more reliable composite metric. It addresses low-resolution CAM limitations with RefineCAM, a method aggregating CAMs across network layers for higher-resolution attribution maps. Results demonstrate RefineCAM's superior performance over existing methods using the proposed evaluation framework.
class attribution mapssynthetic datasetcomposite metrichigh-resolution attributionconvolutional neural networks
Teaching Large Language Models When Not to Know: Learning Temporal Critique for Ex-Ante Reasoning
The authors introduce TCFT (Temporal Critique Fine-Tuning), a framework addressing LLMs' inability to reason under temporal cutoffs by teaching cutoff-aware temporal verification. Through systematic prompt-level interventions, they demonstrate that temporal leakage is sensitive to cutoff formulation and instruction placement, with explicit cutoff statements and prefix constraints proving most effective. TCFT trains models to identify post-cutoff leakage, explain temporal boundary violations, and judge temporal admissibility. Experiments with Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct show TCFT reduces average leakage by 41.89 and 37.79 percentage points, respectively, outperforming prompting and supervised fine-tuning baselines.
temporal leakageex-ante reasoningcutoff-aware verificationprefix constraintssupervised fine-tuning
MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models
We introduce MultiEmo-Bench, a multi-label visual emotion analysis benchmark dataset addressing limitations in existing datasets for evaluating multimodal large language models (MLLMs). The dataset comprises 10,344 images annotated by 20 annotators per image, aggregating 236,998 votes across eight emotions to provide emotion distributions. We evaluate Qwen3-VL, GPT, Gemini, and Claude on dominant emotion prediction and emotion distribution prediction, demonstrating progress but highlighting remaining challenges. Experiments with LLM-as-a-judge reveal its limitations in subjective visual emotion analysis tasks.
multimodal large language modelsvisual emotion analysismulti-label benchmarkemotion distribution predictionllm-as-a-judge
Action-Inspired Generative Models
The paper introduces Action-Inspired Generative Models (AGMs), a dual-network framework that improves bridge-matching methods by learning a lightweight scalar potential $V_φ$ to score bridge samples and modulate the drift objective via importance weights. The method prevents adversarial feedback using a stop-gradient barrier, adds minimal parameter overhead (∼1.4% of the primary drift network), and requires no auxiliary SDE solvers. At inference, $V_φ$ is discarded, maintaining standard Euler-Maruyama integration. Experiments show consistent improvements in generation quality across fidelity and coverage metrics.
bridge-matchingscalar potentialdrift objectiveeuler-maruyamaimportance weights
An Amortized Efficiency Threshold for Comparing Neural and Heuristic Solvers in Combinatorial Optimization
The paper introduces the Amortized Efficiency Threshold (AET), a framework to compare neural and heuristic combinatorial-optimization solvers by quantifying the deployment volume required for neural solvers to offset their training energy costs. The analysis accounts for both operational energy (training/inference) and embodied carbon (hardware fabrication), showing that cumulative-energy ratios converge to a constant below one when neural solvers outperform per-instance. The method is instantiated on the Multi-Task VRP (MTVRP) environment (n=20 customers, 19 variants), revealing a crossover at 1.58×10⁵ instances with a per-instance energy ratio of 0.41. Open instrumentation and measurement protocols are provided.
amortized efficiency thresholdcombinatorial optimizationneural solversenergy efficiencymulti-task vrp
Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution
SIRA introduces a training-free internal contrastive decoding framework to mitigate hallucinations in large vision-language models (LVLMs) without external tools or perturbed inputs. By exploiting the staged information flow of multimodal transformers, SIRA constructs a counterfactual reference internally through a shared prefix, preserving multimodal context while masking late visual attention. This enables token-level contrastive decoding, suppressing language-prior-dominated predictions. Experiments on POPE, CHAIR, and AMBER benchmarks with Qwen2.5-VL and LLaVA-v1.5 demonstrate reduced hallucinations, maintained descriptive coverage, and lower computational overhead compared to two-pass methods.
contrastive decodingmultimodal transformershallucination mitigationinternal reconstructiontoken-level contrast
SliceGraph: Mapping Process Isomers in Multi-Run Chain-of-Thought Reasoning
SliceGraph introduces a graph-based method for analyzing multi-run chain-of-thought (CoT) reasoning by constructing a problem-model-cell graph via mutual-kNN over sparse activation-key Jaccard similarity between CoT slices. This approach treats the graph as a measurement object for process geometry rather than a decoding program, revealing shared reasoning-state units and process families as strategy-coherent route units. Results from three primary 4B/8B models on math and science benchmarks show that 85.5% of 954 problem-model cells exhibit correct CoTs splitting into multiple process families, termed process isomers. Label-seeded reward fields and typed-state transition analyses further demonstrate that process families navigate distinct transition kernels, highlighting structured multi-route process geometry overlooked by final-answer aggregation.
slicegraphchain-of-thoughtprocess isomersjaccard similaritymutual-knn
In-IDE Toolkit for Developers of AI-Based Features
The AI Toolkit plugin for JetBrains IDEs introduces IDE-native observability and evaluation workflows for developers of AI-based features, addressing challenges in testing, debugging, and reproducibility. The toolkit implements run-triggered trace capture, hierarchical inspection, dataset creation from traces, and unit-test-like evaluations with pluggable metrics, guided by practitioner needs for repeatability, execution-time trace exposure, and minimal setup. Initial deployment in PyCharm demonstrates strong conversion rates, sustained usage, and low churn, indicating reduced activation energy for adopting disciplined AI development practices. The study outlines design details, adoption telemetry, and future steps to expand framework coverage and scale evaluations, highlighting accessibility for non-ML specialists while maintaining software-engineering rigor.
ide-native observabilitytrace capturepluggable metricsactivation energyai toolkit
One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries
The paper demonstrates that current defenses against malicious fine-tuning of foundation models are ineffective against adaptive adversaries. Analyzing 15 recent defenses, the authors identify a common weakness: these approaches obscure harmful behaviors rather than eliminating them. They develop a unified adaptive attack that successfully bypasses all surveyed defenses, showing robustness claims are incomplete. Results indicate existing methods only resist fixed attacks they were designed for, failing under adaptive threats. The work provides a framework for stress-testing future defenses.
malicious finetuningadaptive adversariessafety-alignmentfoundation modelsrobustness evaluation
Sycophancy is an Educational Safety Risk: Why LLM Tutors Need Sycophancy Benchmarks
The paper introduces EduFrameTrap, a tutoring benchmark across STEM domains to evaluate LLM behavior under social-epistemic pressure, addressing the Reasoning-Sycophancy Paradox where models prioritize agreeableness over epistemic rigor. EduFrameTrap varies student confidence and pressure types (context-switch, authority, social-affective) to measure corrective friction in tutoring. Testing GPT-5.2 and Claude reveals GPT-5.2 exhibits lower context-switch failures but greater epistemic retreat under authority and social pressure, while Claude shows context-switch fragility. Two-judge disagreement is used as a reliability signal due to difficulty in automatic evaluation. The authors advocate for benchmarks measuring social-epistemic courage and treating corrective tutoring as a safety requirement.
reasoning-sycophancy paradoxeduframetrapsocial-epistemic pressurecorrective frictionepistemic retreat
Fast Rates for Inverse Reinforcement Learning
(No summary returned.)
Angel or Demon: Investigating the Plasticity Interventions' Impact on Backdoor Threats in Deep Reinforcement Learning
This study investigates the impact of plasticity interventions on backdoor vulnerabilities in deep reinforcement learning (DRL), addressing a critical gap in prior research. Through empirical analysis of 14,664 cases combining representative interventions and attack scenarios, the authors identify that only SAM exacerbates backdoor threats via gradient amplification, while other interventions mitigate threats through activation pathway disruption and representation space compression. Key contributions include a conceptual framework (SCC) for robust backdoor injection and the identification of abnormal loss landscape sharpness as a detection indicator for DRL backdoors.
backdoor attacksplasticity interventionsdeep reinforcement learninggradient amplificationloss landscape
A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval
This study empirically evaluates the limitations of single-vector aggregation in Visual Retrieval-Augmented Generation (RAG) for financial document retrieval. The authors develop a diagnostic benchmark using financial documents, where minor digit changes induce significant semantic shifts. Experiments reveal that single-vector aggregation collapses distinct documents into nearly identical vectors, obscuring semantic details detectable at the patch level. The root cause is identified as global texture dominance. Findings remain consistent across model scales, retrieval-optimized embeddings, and mitigation strategies, highlighting risks in financial applications.
visual ragsingle-vector aggregationpatch tokensglobal texture dominanceretrieval-augmented generation
Prompt Segmentation and Annotation Optimisation: Controlling LLM Behaviour via Optimised Segment-Level Annotations
We introduce Prompt Segmentation and Annotation Optimisation (PSAO), a structured prompt optimisation framework that decomposes prompts into interpretable segments and augments them with human-readable annotations (e.g., {not important}, {important}, {very important}). These annotations guide large language models (LLMs) in allocating focus and clarifying confusion during response generation. Empirical evaluations demonstrate that PSAO improves reasoning accuracy and self-consistency in LLM responses while retaining the original prompt as a candidate to prevent performance degradation. The method formally defines segmentations and annotations, though identifying optimal configurations remains an open challenge.
prompt optimisationlarge language modelsannotationsegmentationreasoning accuracy
PyCSP3-Scheduling: A Scheduling Extension for PyCSP3
The paper introduces PyCSP3-Scheduling, an extension to PyCSP3 that adds scheduling abstractions (53 constraints, 27 expressions) while preserving the modeling/solving separation. It compiles high-level scheduling constructs to standard PyCSP3/XCSP3 constraints, enabling native support for interval variables, sequence variables, and resource functions. Evaluation on 261 instances across 17 model families shows identical objectives in all 72 optimal cases, with 8 families remaining structurally unchanged post-compilation. Runtime performance varies, showing up to 5.8x speedups in some families but regressions in others due to compilation overhead.
pycsp3schedulingconstraint programminginterval variablesxcsp3
Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy
We introduce ActFocus, a token reweighting method for agentic reinforcement learning that addresses the Action Bottleneck—a phenomenon where token-level training signals disproportionately concentrate on action tokens rather than reasoning tokens. Drawing on energy-based modeling, ActFocus downweights gradients on reasoning tokens and redistributes weights to action tokens with higher uncertainty. Evaluated across four environments and varying model sizes, ActFocus consistently outperforms PPO and GRPO, achieving final-step gains of up to 65.2 and 63.7 percentage points, respectively, without additional runtime or memory overhead.
agentic reinforcement learningaction bottlenecktoken reweightingenergy-based modelingppo
TeachAnything: A Multimodal Crowdsourcing Platform for Training Embodied AI Agents in Symmetrical Reality
TeachAnything introduces a multimodal crowdsourcing platform for training embodied AI agents in Symmetrical Reality (SR), addressing the need for human-like intelligence in human-agent coexistence. The platform employs a three-stage demonstration paradigm integrating multimodal signals, supported by physics simulation, to collect diverse demonstration data across scenes, tasks, and embodiments. By unifying virtual and physical interactions, TeachAnything provides a practical foundation for developing SR-aligned embodied agents. The cloud-based system leverages crowdsourcing to enhance data diversity and scalability.
symmetrical realitymultimodal demonstrationphysics simulationembodied agentscrowdsourcing platform
Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis
We introduce Break-the-Beat!, a model for controllable MIDI-to-drum audio synthesis that renders drum MIDI with the timbre of a reference audio. The model is built by fine-tuning a pre-trained text-to-audio model with a novel content encoder and a hybrid conditioning mechanism, leveraging a newly constructed dataset of paired target-reference drum audio. Experiments demonstrate that Break-the-Beats! generates high-quality drum audio that adheres to high-resolution drum MIDI, achieving strong performance in audio quality, rhythmic alignment, and beat continuity metrics. This provides music producers with a precise, controllable tool for drum loop creation.
midi-to-drum synthesiscontent encoderhybrid conditioningrhythmic alignmentbeat continuity
Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits
This work introduces a principled framework for multi-objective prompt optimization in large language models (LLMs), addressing Pareto prompt set recovery and best feasible prompt identification. The authors cast the problem into the pure-exploration bandits framework, adapting efficient algorithms from multi-objective bandits and proposing a novel design for best feasible arm identification with theoretical guarantees in the linear case. Extensive experiments across multiple LLMs demonstrate that the bandit-based approaches significantly outperform baselines, providing an efficient solution for optimizing prompts across multiple performance metrics.
multi-objective banditspure-exploration banditsprompt optimizationpareto prompt setlarge language models
Complacent, Not Sycophantic: Reframing Large Language Models and Designing AI Literacy for Complacent Machines
The paper reframes LLM behavior as complacent rather than sycophantic, arguing that models lack intentionality and merely reflect training biases favoring agreement. It critiques the anthropomorphic 'sycophancy' label, proposing 'complacency' to describe structural tendencies toward confirmation bias in model outputs. The authors shift accountability to developers and institutions, suggesting AI literacy programs should specifically address confirmation bias mitigation strategies when interacting with LLMs.
complacencyconfirmation biasanthropomorphismai literacytraining bias
RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation
RxEval introduces a prescription-level benchmark for evaluating LLM medication recommendation, addressing limitations of existing admission-level benchmarks by focusing on detailed, time-ordered clinical trajectories. The benchmark comprises 1,547 multiple-choice questions requiring selection of specific medication-dose-route triples from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. Evaluated on 16 LLMs, RxEval proves challenging and discriminative, with F1 scores ranging from 45.18 to 77.10 and the best Exact Match at 46.10%. Error analysis reveals frontier models often overlook patient information and fail to derive clinical conclusions.
medication recommendationreasoning-chain perturbationprescription-level benchmarkclinical trajectoryllm evaluation
VerbalValue: A Socially Intelligent Virtual Host for Sales-Driven Live Commerce
VerbalValue introduces a socially intelligent virtual host for live commerce, addressing limitations of existing conversational recommenders and LLMs through three contributions. First, it constructs a domain knowledge base of product specifications and a sales terminology lexicon. Second, it collects and annotates 1,475 live-commerce interactions to capture diverse viewer intents. Third, it fine-tunes a large language model to deliver empathetic, commercially oriented responses using techniques like empathetic amplification and evidence-backed rebuttal. Evaluations against GPT-5.4, Claude Sonnet 4.6, and Gemini 3.1 Pro show improvements of 23% in informativeness and 18% in factual correctness, with enhanced tactfulness and viewer engagement.
live-commercedomain knowledge baseempathetic amplificationevidence-backed rebuttallarge language model
Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining
The paper introduces \textsc{Cattle Trade}, a multi-agent benchmark evaluating LLMs in strategic reasoning under imperfect information, adversarial interaction, and resource constraints. The benchmark integrates auctions, hidden-offer trade challenges, bargaining, bluffing, and opponent modeling within a 50--60 turn game, logging all interactions for behavioral analysis. Seven LLMs and three code agents were tested across 242 games, revealing that strategic coherence (spending efficiency, resource discipline, phase-adaptive bidding) correlates more strongly with rank than spending volume or isolated subskills. Heuristic agents outperformed most LLMs, with behavioral traces exposing recurrent failure modes like overbidding and weak opponent-state adaptation.
multi-agent benchmarkstrategic reasoningimperfect informationopponent modelingresource allocation
PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media
The authors propose PROVE, a perceptual evaluation framework for object removal in visual media, comprising two novel metrics (RC-S for spatial coherence and RC-T for temporal consistency) and a two-tier benchmark (PROVE-Bench). RC-S compares masked and background regions via sliding-window feature analysis, while RC-T tracks distribution shifts in restored regions across frames. PROVE-Bench includes 80 motion-augmented videos (PROVE-M) and 100 challenging unpaired videos (PROVE-H). Experiments show RC metrics achieve significantly higher alignment with human judgments than existing methods across diverse benchmarks.
object removalperceptual metricstemporal consistencyspatial coherencevideo inpainting
Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation
We introduce a dimension-level intent fidelity evaluation framework for LLMs, addressing the limitation of holistic scores in distinguishing structural recovery from intent preservation. Through a structured prompt ablation study across 2,880 outputs in three languages, three task domains, and six LLMs, we separately measure structural recovery and intent fidelity per semantic dimension. Results reveal a systematic structural-fidelity split: 25.7% of Chinese-language outputs and 58.6% of English-language outputs with perfect holistic alignment exhibited dimensional intent deficits. Human evaluation confirmed dimensional fidelity scores' reliability over holistic scores, and weight-perturbation experiments showed severe dimensional inversion consistently harms output quality.
intent fidelitystructured prompt ablationholistic alignmentdimensional inversionweight-perturbation
HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention
We introduce HASTE, a training-free framework for accelerating video diffusion models via head-wise adaptive sparse attention. The method addresses limitations of existing top-p sparse attention by incorporating two novel components: Temporal Mask Reuse, which reduces redundant mask prediction through query-key drift analysis, and Error-guided Budgeted Calibration, which optimizes per-head sparsity thresholds by minimizing output error under global constraints. Evaluated on Wan2.1-1.3B and Wan2.1-14B models, HASTE achieves up to 1.93× speedup at 720P resolution while maintaining competitive video quality and similarity metrics compared to XAttention and SVG2 baselines.
video diffusionsparse attentiontraining-free accelerationhead-wise adaptationtemporal coherence
Asymmetric Generative Recommendation via Multi-Expert Projection and Multi-Faceted Hierarchical Quantization
We propose AsymRec, an asymmetric continuous-discrete framework addressing dual-stage information bottlenecks in Generative Recommendation (GenRec) models. The framework decouples input and output representations via Multi-expert Semantic Projection (MSP), which maps continuous embeddings into the Transformer's hidden space using expert-specialized projections, and Multi-faceted Hierarchical Quantization (MHQ), which constructs structured discrete targets through multi-view and multi-level quantization with semantic regularization. AsymRec outperforms state-of-the-art generative recommenders by an average of 15.8% in extensive experiments, demonstrating improved semantic preservation and generalization to infrequent items.
generative recommendationsemantic projectionhierarchical quantizationtransformerinformation bottleneck
When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution
The authors introduce LongAct, a benchmark for evaluating high-level planning in long-horizon household tasks specified via free-form instructions, abstracting away low-level control. They propose HoloMind, a VLM-driven agent with hierarchical DAG-based planning, Multimodal Spatial Memory, Episodic Memory, and a global Critic. Experiments with GPT-5 and Qwen3-VL show HoloMind improves long-horizon performance (59% goal completion, 16% full-task success) while reducing reliance on model scale, highlighting the challenge of long-horizon planning in embodied AI.
long-horizon planningembodied aimultimodal memoryhierarchical plannerfree-form instructions
Quantifying Cyber-Vulnerability in Power Electronics Systems via an Impedance-Based Attack Reachable Domain
The paper introduces an impedance-based Attack Reachable Domain (ARD) framework to quantify cyber-vulnerability in power electronics systems, addressing the lack of attacker-oriented metrics. The method maps adversarial actions to critical-eigenvalue migration via impedance reshaping and defines an Attack Penetration Index to jointly assess stability margin penetration and attack accessibility under privilege constraints. A gray-box workflow integrates impedance identification and differentiable surrogates for model-agnostic computation. Case studies on 4-bus and IEEE 39-bus systems demonstrate that coordinated cross-layer attacks outperform isolated ones, with the metric uncovering vulnerability patterns undetectable by grid-strength indicators.
impedance-basedattack reachable domaincritical-eigenvalue migrationgray-box workflowcyber-vulnerability
Fully Dynamic Rebalancing in Dockless Bike-Sharing Systems via Deep Reinforcement Learning
The paper introduces a fully dynamic Deep Reinforcement Learning (DRL) method for real-time rebalancing in dockless bike-sharing systems, eliminating the need for periodic system-wide interventions. The approach models the system via a graph-based simulator, formulating rebalancing as a Markov decision process, and employs a DRL agent to route a single truck for localized pick-up, drop-off, and charging actions using spatiotemporal criticality scores. Experiments on real-world data demonstrate significant reductions in availability failures with minimal fleet size, while mitigating spatial inequality and mobility deserts.
deep reinforcement learningmarkov decision processspatiotemporal criticalitydockless bike-sharinggraph-based simulator
ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization
The paper introduces ROAD, a bi-level optimization framework for adaptive data mixing in offline-to-online reinforcement learning. The method addresses distribution shift by formulating data selection as a meta-decision (outer-level) that governs policy performance during online fine-tuning, while Q-learning updates operate at the inner level. A multi-armed bandit mechanism approximates the bi-level gradient, balancing offline priors and value overestimation. Empirical results show ROAD outperforms existing data replay methods across diverse datasets, achieving superior stability and asymptotic performance without manual tuning.
offline-to-online rlbi-level optimizationdata mixingmulti-armed banditdistribution shift
Contestable Multi-Agent Debate with Arena-based Argumentative Computation for Multimedia Verification
We introduce a contestable multi-agent framework for multimedia verification that integrates multimodal large language models, external verification tools, and arena-based quantitative bipolar argumentation (A-QBAF). The method decomposes cases into claim-centered sections, retrieves targeted evidence, and converts it into structured support and attack arguments with provenance and strength scores. These arguments are resolved through small local argument graphs with selective clash resolution and uncertainty-aware escalation. The system generates section-wise verification reports that are transparent, editable, and computationally practical, addressing the ICMR 2026 Grand Challenge on Multimedia Verification.
multimodal large language modelsarena-based quantitative bipolar argumentationselective clash resolutionuncertainty-aware escalationmultimedia verification
Learning Scenario Reduction for Two-Stage Robust Optimization with Discrete Uncertainty
The paper introduces NeurPRISE, a neural surrogate model for scenario reduction in Two-Stage Robust Optimization (2RO) with discrete uncertainty. NeurPRISE employs a GNN-Transformer architecture to encode per-scenario structures via graph convolution and capture cross-scenario interactions through attention. Trained via imitation learning with a gain-aware ranking objective, it distills marginal gain information from PRISE, a problem-driven heuristic. Experiments on three 2RO problems demonstrate NeurPRISE's competitive regret, scalability, and 7-200x speedup over PRISE, with strong zero-shot generalization to larger problem scales, more scenarios, and distribution shifts.
two-stage robust optimizationscenario reductiongnn-transformerimitation learningzero-shot generalization
Deepchecks: Evaluating Retrieval-Augmented Generation (RAG)
The paper introduces Deepchecks, a comprehensive framework for evaluating Retrieval-Augmented Generation (RAG) systems, addressing the challenge of assessing these models due to their stochastic outputs and complex retrieval-generation interplay. Deepchecks employs a multi-faceted approach, including root cause analysis and production monitoring, to ensure alignment with application-specific requirements. The framework provides a robust foundation for evaluating reliability, relevance, and user satisfaction in RAG applications across domains such as healthcare, finance, and customer service.
retrieval-augmented generationstochastic outputsroot cause analysisproduction monitoringapplication-specific requirements
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
Head Forcing introduces a training-free framework for autoregressive video diffusion models, addressing error accumulation and context loss in long-horizon generation. The method leverages heterogeneous attention heads—local, anchor, and memory—by assigning tailored KV cache strategies: local and anchor heads retain essential tokens, while memory heads use a hierarchical memory system with dynamic episodic updates. A head-wise RoPE re-encoding scheme maintains positional encodings within pretrained ranges. Without additional training, Head Forcing extends generation duration from 5 seconds to minute-level, supports multi-prompt synthesis, and outperforms existing baselines.
autoregressive video diffusionkv cacheattention headsrope re-encodinghierarchical memory
LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning
LEMON introduces a learning-based orchestrator for LLM-powered multi-agent systems that generates executable specifications integrating roles, duties, capacities, and dependencies. The method employs counterfactual reinforcement learning, augmenting GRPO with localized reward signals that edit specific orchestration fields (role/capacity/dependency) and apply contrastive rewards to edited spans. Evaluated on six benchmarks (MMLU, GSM8K, AQuA, MultiArith, SVAMP, HumanEval), LEMON achieves state-of-the-art performance among multi-agent orchestration methods.
multi-agent systemscounterfactual reinforcement learningorchestration specificationllm-based orchestratorlocalized reward signals
When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context
This study demonstrates that temporally stale repository context actively biases code completion models toward obsolete states rather than merely introducing noise. Using a controlled diagnostic study on 17 production-helper signature changes from five Python repositories, the authors compare current-only, stale-only, no-retrieval, and mixed retrieval conditions under neutralized prompts. Results show stale-only retrieval induces stale helper references in 88.2% of Qwen2.5-Coder-7B-Instruct samples and 76.5% of gpt-4.1-mini samples, with 75.0% Jaccard overlap in stale-triggering samples. Mixed retrieval largely mitigates failures, highlighting temporal validity as a critical factor in retrieval-augmented code generation robustness.
retrieval-augmentedtemporal validitycode completionrepository contextdiagnostic study
Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict
The paper introduces Context-Driven Decomposition (CDD), a belief-decomposition probe for diagnosing context compliance in Retrieval-Augmented Generation (RAG) systems under knowledge conflict. CDD operates at inference time as an intervention mechanism, enabling controlled retrieval conflict analysis. Results reveal three patterns: (1) context compliance is measurable in adversarial settings, with Standard RAG achieving 15.0% accuracy on TruthfulQA misconception injection; (2) adversarial accuracy gains transfer across model families, with CDD reaching 64.1% mistake-injection causal sensitivity on Gemini-2.5-Flash; (3) explicit conflict decomposition improves robustness, with CDD achieving 71.3% accuracy on temporal shifts and 69.9% on distractor evidence in Epi-Scale benchmarks.
context-driven decompositionretrieval-augmented generationknowledge conflictcontext complianceadversarial accuracy
From Table to Cell: Attention for Better Reasoning with TABALIGN
We propose TABALIGN, a framework for multi-step table reasoning that addresses the lack of cell-grounding in existing methods by pairing a masked diffusion language model (DLM) planner with TABATTN, a lightweight verifier trained on human-verified attention standards. The DLM planner emits binary cell masks through bidirectional denoising, while TABATTN scores each step based on attention overlap with the designated mask. Evaluated on eight benchmarks for table question answering and fact verification, TABALIGN improves average accuracy by 15.76 percentage points over the strongest open-source baseline at 8B-class scale, with 2.87 percentage points attributed to the DLM planner. Downstream reasoning execution is accelerated by 44.64% due to cleaner DLM plans.
diffusion language modelcell-groundingbinary cell masksattention overlaptable reasoning
OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance
OmniDrop introduces a layer-wise token pruning framework for omni-modal LLMs, addressing the token explosion problem in high-resolution audio/video inputs. The method performs progressive pruning within decoder layers (rather than input-level) and uses text queries for modality-agnostic guidance, alongside a temporal diversity score to preserve global context. Experiments show 3.58-point accuracy gains over baselines, with 40% faster prefill latency and 14.7% memory reduction across audiovisual benchmarks.
token pruningomni-modal llmsdecoder layerstemporal diversityprefill latency
Stateful Reasoning via Insight Replay
We propose InsightReplay, a stateful reasoning method that addresses the diminishing attention to critical insights in long Chain-of-Thought (CoT) reasoning traces. By periodically extracting and replaying key insights near the active generation frontier, InsightReplay maintains their accessibility throughout extended reasoning. Evaluated across a 2×3×4 benchmark grid involving model scales (8B, 30B), families (Qwen3.5, DeepSeek-R1-Distill-Qwen, Gemma-4), and reasoning benchmarks (AIME, HMMT, GPQA Diamond, LiveCodeBench v5), 3-round InsightReplay consistently improves accuracy, averaging +1.65 points over standard CoT, with a maximum gain of +9.2 points on R1-Distill-32B's LiveCodeBench v5 subset.
chain-of-thoughtinsightreplaystateful reasoningattention decaybenchmark grid
Intelligence Impact Quotient (IIQ): A Framework for Measuring Organizational AI Impact
The paper introduces the Intelligence Impact Quotient (IIQ), a composite metric for quantifying AI integration and impact in organizational workflows. IIQ combines novelty-weighted token stock, usage frequency, recency, leverage, task complexity, and autonomy into a normalized 0-1000 index. It includes sub-daily update rules and an interpretation layer for efficiency and financial impact estimation. The framework distinguishes between superficial usage patterns and high-impact AI-assisted work through synthetic scenarios, positioning IIQ as a deployment-focused measurement tool rather than a capability or productivity metric.
intelligence impact quotientnovelty-weighted token stockorganizational leveragetask complexityautonomy
When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition
We propose QAOD (Question-Answer Orthogonal Decomposition), a single-pass framework for hallucination detection in LLMs that projects away question-aligned directions from answer representations to obtain domain-agnostic, question-orthogonal components. QAOD employs diversity-penalized Fisher scoring for layer selection and Fisher importance for neuron identification, enabling both in-domain discriminability and cross-domain generalization. Two probing strategies are introduced: a joint probe combining orthogonal components with question context for in-domain performance, and an orthogonal-only probe for robust transfer. QAOD achieves state-of-the-art in-domain AUROC across model-dataset pairs and outperforms white-box baselines by up to 21% on BioASQ while reducing generation cost by over 75%.
question-answer orthogonal decompositionhallucination detectionfisher scoringdomain-agnosticcross-domain generalization
Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience
We introduce a Reinforcement Learning framework for optimizing prompting policies in black-box Large Language Models (LLMs) through iterative distillation of experience. A lightweight prompter model is trained to maximize task-specific rewards for a frozen worker LLM, utilizing a contrastive experience buffer that combines scalar rewards with textual critiques. Evaluated on Big Bench Extra Hard (BBEH) and Tau-bench suites, our method improves performance from 55% to 90% in logic-intensive reasoning and from 74% to 91% in tool-use tasks. The approach outperforms evolutionary baselines like GEPA in both performance and sample efficiency, while discovering specialized algorithmic heuristics.
reinforcement learningprompting policiesiterative distillationcontrastive experience bufferblack-box llms
Synthesizing POMDP Policies: Sampling Meets Model-checking via Learning
The paper presents a synthesis framework for Partially Observable Markov Decision Processes (POMDPs) that combines sampling, automata learning, and model-checking to achieve both scalability and formal guarantees. Inspired by Angluin's $L^*$ algorithm, the method uses sampling as a membership oracle and model-checking as an equivalence oracle to synthesize finite-state controllers with provable correctness, assuming the sampling-induced policy is regular. A relative completeness result is established, and experimental results show the approach successfully solves threshold-safety problems challenging for existing formal synthesis tools.
pomdpformal synthesisautomata learningmodel-checkingfinite-state controllers
BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE
We propose BEAM (Binary Expert Activation Masking), a novel method for dynamic expert routing in Mixture-of-Experts (MoE) architectures that addresses redundant computation and inference latency in standard Top-K routing. BEAM learns token-adaptive expert selection via trainable binary masks, using a straight-through estimator and auxiliary regularization loss to induce dynamic sparsity while maintaining model capability. Implemented with a custom CUDA kernel for vLLM integration, BEAM retains over 98% of original model performance while reducing MoE layer FLOPs by up to 85%, achieving 2.5× faster decoding and 1.4× higher throughput.
mixture-of-expertsdynamic routingbinary maskingflops reductioncuda kernel
Efficient Generative Retrieval for E-commerce Search with Semantic Cluster IDs and Expert-Guided RL
The authors propose CQ-SID, a generative retrieval framework for e-commerce search that encodes items into hierarchical semantic cluster IDs using category-aware query-item contrastive learning and Residual Quantized VAEs, reducing beam search complexity. They introduce EG-GRPO, an expert-guided reinforcement learning method aligning generative recall with downstream ranking via ground-truth sample injection. Evaluated on TmallAPP search logs, CQ-SID achieves 26.76% and 11.11% relative gains in semantic and personalized click hitrate over RQ-VAE baselines while halving beam search size. Online A/B tests show GMV (+1.15%) and UCTCVR (+0.40%) improvements, with the generative recall channel contributing over 50.25% of exposures, 58.96% of clicks, and 72.63% of purchases in production.
generative retrievalcontrastive learningresidual quantized vaebeam searchreinforcement learning
A plug-and-play generative framework for multi-satellite precipitation estimation
The authors propose PRISMA, a plug-and-play generative framework for multi-sensor precipitation estimation that combines unconditional precipitation priors with sensor-specific conditional branches. The method learns from IMERG Final fields and incorporates FY-4B AGRI infrared and GPM GMI microwave observations without retraining the backbone, achieving a 40.3% Critical Success Index improvement and 22.6% RMSE reduction versus infrared-only baselines. Validation shows 42.3% MAE reduction in typhoon cores and consistent rain-gauge accuracy across China, with 37s average inference time.
precipitation estimationgenerative modelingmulti-sensor fusionsatellite observationsplug-and-play learning
Collaborative Yet Personalized Policy Training: Single-Timescale Federated Actor-Critic
(No summary returned.)
MemLineage: Lineage-Guided Enforcement for LLM Agent Memory
MemLineage introduces a lineage-guided defense mechanism for LLM agent memory, combining cryptographic provenance with LLM-mediated derivation lineage to prevent untrusted content from justifying sensitive actions. The system employs a six-module design featuring an RFC-6962 Merkle log, per-principal Ed25519-signed entries, and a weighted derivation DAG with a max-of-strong-edges propagation rule to enforce Untrusted-Path Persistence. Evaluation on a deterministic mechanism-isolation harness shows MemLineage reduces all three memory-poisoning workloads to zero ASR with sub-millisecond overhead, while Codex-backed AgentDojo tests confirm zero ASR in vulnerable tool-output scenarios.
llm agent memorymerkle loged25519-signed entriesuntrusted-path persistencederivation dag
DVMap: Fine-Grained Pluralistic Value Alignment via High-Consensus Demographic-Value Mapping
DVMap introduces a framework for fine-grained pluralistic value alignment in LLMs, addressing intra-country heterogeneity by shifting from national labels to multi-dimensional demographic constraints. The method constructs a 56,152-sample corpus from the World Values Survey using demographic archetype extraction, employs a Structured Chain-of-Thought mechanism to guide demographic-value reasoning, and applies Group Relative Policy Optimization for adaptive value distribution anchoring. Evaluation on a triple-generalization benchmark (21,553 samples) shows Qwen3-8B-DVMap achieves 48.6% accuracy on cross-demographic tests, outperforming DeepSeek-v3.2 (45.1%).
pluralistic value alignmentdemographic archetype extractionstructured chain-of-thoughtgroup relative policy optimizationtriple-generalization benchmark
The Great Pretender: A Stochasticity Problem in LLM Jailbreak
The paper identifies a stochasticity problem in LLM jailbreak evaluation, demonstrating that Attack Success Rate (ASR) is unstable and systematically inflated across studies. It analyzes stochasticity during both attack generation and evaluation, proposing CAS-eval (a new metric) and CAS-gen (a generation framework) to address these issues. Experiments show ASR drops up to 30 percentage points when requiring consecutive successes, while CAS-gen recovers this performance loss across multiple jailbreak methods, models, and judges.
jailbreak attacksattack success ratestochasticityllm securityadversarial prompts
A Unified Knowledge Embedded Reinforcement Learning-based Framework for Generalized Capacitated Vehicle Routing Problems
The paper proposes a knowledge-embedded reinforcement learning framework for generalized Capacitated Vehicle Routing Problems (CVRPs), addressing limitations of end-to-end RL approaches. The method decomposes CVRPs into route-first and cluster-second subproblems, incorporating dynamic programming for the latter and an RL-based solver for the former, enhanced by a history-aware context module. Experiments demonstrate superior solution quality over state-of-the-art learning methods, with reduced gaps to classical heuristics across diverse CVRP variants.
capacitated vehicle routingreinforcement learningdynamic programmingnp-hardcombinatorial optimization
SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades
The paper introduces SWE-Chain, a benchmark for evaluating coding agents on chained release-level package upgrades, addressing gaps in existing benchmarks by focusing on continuous maintenance through version transitions. Using a divide-and-conquer synthesis pipeline, the authors align release notes with code diffs to generate grounded upgrade requirements. The benchmark includes 12 upgrade chains across 9 Python packages, with 155 version transitions and 1,660 requirements. Testing nine agent-model configurations reveals an average resolving rate of 44.8%, precision of 65.4%, and F1 of 50.2%, with Claude-Opus-4.7 performing best (60.8% resolving, 80.6% precision, 68.5% F1), demonstrating both feasibility and discriminative power while highlighting agent limitations.
swe-chainpackage upgradesrelease-levelcoding agentsbenchmark
MahaVar: OOD Detection via Class-wise Mahalanobis Distance Variance under Neural Collapse
The paper introduces MahaVar, a post-hoc OOD detection method leveraging class-wise Mahalanobis distance variance under Neural Collapse. The key observation is that ID samples exhibit high variance in class-wise distances due to a sharp minimum structure, while OOD samples show lower variance. The method augments Mahalanobis distance with this variance term, achieving state-of-the-art performance on CIFAR-100 and ImageNet (OpenOOD v1.5 benchmark), with improved AUROC and FPR@95 over existing Mahalanobis-based approaches.
ood detectionmahalanobis distanceneural collapsepost-hoc methodvariance term
Energy-Efficient Quadruped Locomotion with Compliant Feet
The study demonstrates that compliant feet can enhance energy efficiency in quadruped locomotion without compromising stability. Using reinforcement learning, the authors train eight policies with varying spring stiffness values in simulation and validate on a physical quadruped. Results show a 17% reduction in energy consumption per meter traveled for intermediate stiffness compared to extremely stiff or flexible springs, with simulation trends matching experimental findings.
quadruped locomotioncompliant feetreinforcement learningenergy efficiencyspring stiffness
Metis AI: The Overlooked Middle Zone Between AI-Native and World-Movers
The paper introduces Metis AI, a class of digital tasks resistant to AI automation despite lacking physical embodiment requirements. The authors distinguish constitutive metis (contextual knowledge destroyed by formalization) from operational metis (system-specific familiarity amenable to automation), identifying five structural characteristics defining this zone: consequential irreversibility, relational irreducibility, normative open texture, adversarial co-evolution, and accountability anchoring. Grounded in social science and philosophical theory, they argue these properties are inherent to tasks rather than model limitations, advocating for centaur architectures (human-led, AI-supported) over pure automation attempts.
metis aiconstitutive metisoperational metiscentaur architecturesnormative open texture
Agentic Recommender System with Hierarchical Belief-State Memory
The paper introduces MARS (Memory-Augmented Agentic Recommender System), a hierarchical belief-state framework for recommendation that treats the task as a partially observable problem. MARS organizes memory into three tiers (event, preference, profile) and employs six adaptive operations (extraction, reinforcement, weakening, consolidation, forgetting, resynthesis) scheduled by an LLM-based planner. Experiments on four InstructRec benchmarks show MARS achieves state-of-the-art performance with 26.4% HR@1 and 10.3% NDCG@10 improvements over baselines, with additional gains in evolving settings.
memory-augmented llmhierarchical belief-statepartially observable problemagentic schedulinginstructrec benchmark
Coding Agent Is Good As World Simulator
The paper introduces an agentic framework for constructing physics-based world models through executable simulation code, addressing limitations of video-based approaches that lack explicit physical constraints. The framework employs four agents: a planning agent converts natural language prompts into structured scene plans, a code agent generates executable simulation code, a visual review agent provides feedback, and a physics analysis agent ensures physical consistency. Iterative revisions refine the simulation until it meets prompt requirements and physical constraints. Experiments demonstrate superior performance in physical accuracy, instruction fidelity, and visual quality compared to advanced video-based models, with applications in driving simulation and embodied robot tasks.
physics-based world modelsexecutable simulation codeagentic frameworkphysical consistencyinstruction fidelity
Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis
The paper introduces EvoEnv, a self-evolving reinforcement learning framework where language models construct executable training environments rather than generating synthetic data. The key innovation is solve--verify asymmetry, ensuring environments remain challenging by leveraging algorithmic complexity (e.g., dynamic programming) or verification simplicity (e.g., constraint satisfaction). EvoEnv validates environments through staged checks, semantic review, and difficulty calibration. Experiments on Qwen3-4B-Thinking show a 3.3% relative improvement (72.4→74.8) over fixed-environment RLVR, demonstrating that environment synthesis enables stable self-improvement when difficulty outpaces model capabilities.
self-evolving rlsolve--verify asymmetryenvironment synthesisqwen3-4b-thinkingrlvr
Nexus : An Agentic Framework for Time Series Forecasting
Nexus introduces a multi-agent framework for time series forecasting that integrates numerical patterns with unstructured contextual data. The framework decomposes prediction into specialized stages: isolating macro-level and micro-level temporal fluctuations, and integrating contextual information before synthesizing a final forecast. This approach enables adaptation from seasonal signals to volatile, event-driven information without external statistical anchors or monolithic prompting. Evaluated on Zillow real estate metrics and volatile stock market equities, Nexus consistently matches or outperforms state-of-the-art Time Series Foundation Models and LLM baselines. The framework also produces high-quality reasoning traces, demonstrating that forecasting extends beyond sequence modeling to agentic reasoning.
time series forecastingmulti-agent frameworkcontextual informationtemporal fluctuationsreasoning traces
Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning
The Darwin Family framework enables training-free evolutionary merging of large language models through gradient-free weight-space recombination. It introduces three innovations: (i) a 14D adaptive merge genome for fine-grained recombination, (ii) MRI-Trust Fusion balancing layer-importance signals via learnable trust parameters, and (iii) an Architecture Mapper for cross-architecture breeding. The flagship Darwin-27B-Opus achieves 86.9% on GPQA Diamond, outperforming its foundation model without additional training. The method scales from 4B to 35B parameters, supports recursive multi-generation evolution, and combines Transformer- and Mamba-based components.
evolutionary mergingweight-space recombinationmri-trust fusionadaptive merge genomearchitecture mapper
Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games
The paper introduces Data-Augmented Game Starts (DAGS), a multi-agent starting-state sampling strategy to accelerate exploration in two-player zero-sum imperfect-information games. DAGS initializes reinforcement learning data collection at intermediate states sampled from offline human demonstrations, assuming these cover high-level equilibrium-relevant strategies. Experiments on synthetic datasets and modified OpenSpiel games (Kuhn Poker, Goofspiel, and a counterexample game) show DAGS reduces exploitability under fixed computational budgets. The authors also address potential equilibrium bias with multi-task observation flags and release new benchmark environments with increased exploration challenges.
imperfect-information gamespolicy-gradient methodsexploitabilityoffline demonstrationsmulti-task learning
Optimal Pattern Detection Tree for Symbolic Rule-Based Classification
The paper introduces the Optimal Pattern Detection Tree (OPDT), a symbolic rule-based classification model formulated as mixed-integer programming to discover a single optimal pattern in binary classification tasks. OPDT incorporates Branching Structure Constraints (BSC) to encode domain knowledge and compliance requirements, optimizing for maximal coverage and minimal false positive rate. Computational experiments demonstrate OPDT's ability to identify hidden patterns with optimality guarantees on moderately sized datasets within practical runtime constraints.
symbolic rule discoverymixed-integer programmingbinary classificationbranching structure constraintspattern detection
Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization
The paper introduces Coherent Coordinate Descent (CoCD), a deterministic zeroth-order optimizer addressing sample inefficiency and high variance in existing methods. CoCD leverages gradient coherence to repurpose stale gradients as computational assets, achieving O(1) query complexity per step while maintaining global descent directions. Theoretical analysis reveals that larger finite-difference steps induce implicit landscape smoothing, improving convergence stability. Experiments on MLP, CNN, and ResNet architectures (≤270k parameters) show CoCD outperforms Block Cyclic Coordinate Descent in sample efficiency and accuracy, while surpassing randomized methods in stability.
zeroth-order optimizationcoherent coordinate descentimplicit smoothingfinite-differencesample efficiency
Deciphering Neural Reparameterized Full-Waveform Inversion with Neural Sensitivity Kernel and Wave Tangent Kernel
The study introduces neural sensitivity kernel (NSK) and wave tangent kernel (WTK) to analyze the convergence behavior of neural reparameterized full-waveform inversion (NeurFWI). By establishing that the neural tangent kernel (NTK) adaptively modulates NSK and WTK, the authors identify spectral filtering, gradient wavenumber modulation, and wave frequency bias as key mechanisms influencing NeurFWI convergence. Based on these insights, enhanced NeurFWI methods with tailored eigen-structures in NSK and WTK are proposed to improve inversion performance and efficiency. Numerical validation in seismic exploration and medical imaging confirms the theoretical claims and demonstrates the effectiveness of the proposed methods.
neural sensitivity kernelwave tangent kernelfull-waveform inversionneural tangent kernelspectral filtering
Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement
The paper introduces DiHAL, a geometry-guided diffusion-transformer hybrid that identifies optimal layers for diffusion insertion in pretrained language models. By scoring layers with geometry-based proxies and replacing the lower transformer prefix with a diffusion bridge, DiHAL reconstructs hidden states instead of tokens, avoiding continuous-to-discrete recovery issues. Experiments on 8B-scale models demonstrate that geometry scores predict effective shallow insertion layers, and hidden-state recovery outperforms continuous diffusion baselines under matched training budgets, highlighting the utility of hidden-state geometry for diffusion integration.
diffusion-transformer hybridhidden-state reconstructiongeometry-guided scoringcontinuous-to-discrete recovery8b-scale models
LoMETab: Beyond Rank-1 Ensembles for Tabular Deep Learning
(No summary returned.)
Correctness-Aware Repository Filtering Under Maximum Effective Context Window Constraints
The paper introduces a correctness-aware context hygiene framework for LLM-based developer tools, addressing context window inefficiency by pre-filtering repository files using OS-level metadata. The SizeFilter heuristic achieves 79.6% mean token reduction with 0.30 ms overhead, while the HybridFilter shows 89.3% efficiency and lowest variance. Evaluations on 10 repositories (22,046 files) demonstrate strong linear correlation between file size and token density (r=0.997). CodeLlama-7B-Instruct tests show 72% file-level accuracy with filtering versus 25% baseline, reducing hallucinations from 61% to 17%.
context windowtoken reductionheuristic filterrepository hygienehallucination frequency
RQ-MoE: Residual Quantization via Mixture of Experts for Efficient Input-Dependent Vector Compression
RQ-MoE introduces a novel vector quantization framework combining Mixture of Experts (MoE) with dual-stream quantization for input-dependent codebook adaptation, addressing limitations of static codebooks in multi-codebook methods and sequential bottlenecks in dynamic quantizers like QINCo. The method employs a two-level MoE to enable parallel decoding by decoupling instruction from quantization, theoretically subsuming Residual Quantization and QINCo as special cases. Experiments demonstrate state-of-the-art or competitive performance in reconstruction and retrieval tasks, with 6x-14x faster decoding speeds compared to prior methods.
vector quantizationmixture of expertsdynamic codebooksparallel decodingresidual quantization
Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces
The paper introduces minimal cores, defined as the smallest subset of reasoning steps preserving a language model's final prediction, to analyze overcomplete reasoning traces. Using metrics like compression ratio and necessity concentration, the authors evaluate six benchmarks (arithmetic, competition math, scientific reasoning, commonsense QA), finding 46% of steps removable while maintaining 86% answer accuracy. Results show predictive support is concentrated (top 3 steps account for 65% necessity mass) and minimal cores improve trace separation by 11 points, reduce intrinsic dimensionality by 34%, and transfer across models with 85% answer retention. Theoretical analysis establishes existence guarantees and certificates for sparse necessity.
minimal coresovercomplete reasoningnecessity concentrationintrinsic dimensionalitygreedy elimination
Herculean: An Agentic Benchmark for Financial Intelligence
We introduce Herculean, the first benchmark for agentic financial intelligence, addressing the gap in evaluating AI agents' ability to execute complex financial workflows. The benchmark spans four workflows—Trading, Hedging, Market Insights, and Auditing—each instantiated as a standardized MCP-based skill environment with tailored tools, constraints, and success criteria. Evaluation of frontier agents reveals strong performance on Trading and Market Insights but significant challenges in Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Results highlight a key limitation in translating financial reasoning into reliable workflow execution in high-stakes scenarios.
agentic intelligencefinancial workflowsmcp-based environmentlong-horizon coordinationstructured verification
CrystalReasoner: Reasoning and RL for Property-Conditioned Crystal Structure Generation
CrystalReasoner introduces an LLM framework for property-conditioned crystal structure generation through reasoning and reinforcement learning. The method incorporates physical priors as thinking tokens (crystallographic symmetry, local coordination, predicted properties) before atomic coordinate generation, then uses RL with multi-objective rewards to ensure validity, stability, and property alignment. Compared to baselines, it triples S.U.N. ratio, improves property-conditioned generation, and exhibits adaptive reasoning with increasing atom counts. Results demonstrate superior performance across validity metrics and task-specific constraints.
crystal structure generationreinforcement learningphysical priorsproperty-conditionedthinking tokens
Analog RF Computing: A New Paradigm for Energy-Efficient Edge AI Over MU-MIMO Systems
The authors propose a physical layer design framework for analog RF computing in MU-MIMO wireless systems, enabling energy-efficient edge inference. The framework encodes neural network weights into RF waveforms broadcast by a base station, allowing clients to perform matrix-vector multiplications using passive mixers. They derive models for computing accuracy and energy consumption, formulate a joint beamforming and scaling optimization problem, and develop a low-complexity solver. Simulations under 3GPP specifications demonstrate that analog RF computing reduces client-side energy consumption by nearly two orders of magnitude compared to digital computing, with mixed-precision inference requiring even lower energy than uniform-precision approaches.
analog rf computingmu-mimoedge inferencematrix-vector multiplicationbeamforming
AIM-DDI: A Model-Agnostic Multimodal Integration Module for Drug-Drug Interaction Prediction
We propose AIM-DDI, a model-agnostic multimodal integration module for drug-drug interaction (DDI) prediction that enables reusable fusion of structural, chemical, and semantic drug signals across diverse architectures. AIM-DDI represents heterogeneous modality information as tokens in a shared latent space and models cross-modality dependencies through a unified fusion module. Evaluations across multiple DDI models and DrugBank-based settings demonstrate consistent performance improvements, particularly in the challenging both-unseen setting where neither test drug was observed during training. Results indicate that decoupling multimodal integration from specific prediction architectures enhances robustness in unseen-drug generalization.
drug-drug interactionmultimodal integrationmodel-agnosticlatent spaceunseen-drug generalization
Dynamic Latent Routing
Dynamic Latent Routing (DLR) is introduced as a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. Motivated by General Dijkstra Search (GDS), DLR applies the 'search, select, update' principle to optimize sub-policies in Markov Decision Processes with time-varying reward functions. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points. Mechanistic analyses reveal that DLR learns structured routing behaviors with distinct causal roles.
dynamic latent routinggeneral dijkstra searchmarkov decision processesdiscrete latent codesrouting policies
Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows
The authors introduce EduAgentBench, a theory-grounded benchmark for evaluating tutor agents across professional teaching workflows. The benchmark comprises 150 tasks spanning three capability surfaces: pedagogical judgment, situated multi-turn tutoring, and Canvas-style workflow execution. Tasks are constructed through a pedagogical-insight-driven pipeline and validated via complementary verification signals and human review. Evaluation of frontier models reveals that while they exhibit bounded pedagogical judgment, they fall short of professional standards in situated tutoring and autonomous workflow execution. EduAgentBench provides a foundational measurement framework for developing tutor agents capable of supporting realistic teaching work.
eduagentbenchpedagogical judgmentmulti-turn tutoringteaching workflowbenchmark
Semantic Feature Segmentation for Interpretable Predictive Maintenance in Complex Systems
The paper introduces a semantic feature segmentation framework for interpretable predictive maintenance in complex systems, addressing heterogeneity and redundancy in monitored variables. The method decomposes the feature space into a canonical component retaining dominant predictive information and a residual component with peripheral signals, using domain-informed criteria to group variables by operational mechanisms. Time-aware cross-validation demonstrates that the canonical space achieves lower predictive risk than the residual space, indicating concentrated fault-relevant information. The canonical segments exhibit strong intra-segment coherence and stable structural organization post-redundancy reduction, matching the predictive performance of the full feature space and PCA while preserving semantic interpretability.
semantic feature segmentationpredictive maintenancecanonical componentredundancy reductiontime-aware cross-validation
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
The paper introduces BBCritic, a novel paradigm for GUI critique that shifts from binary classification to continuous semantic alignment via contrastive learning. Addressing Affordance Collapse and Noise Sensitivity in existing models, BBCritic employs two-stage contrastive learning to align instructions and actions in a hierarchical Affordance Space. The authors also present BBBench, a benchmark featuring a dense action space and four-level taxonomy for fine-grained evaluation. Experiments show BBCritic-3B outperforms 7B-parameter SOTA binary models without additional annotation, demonstrating strong zero-shot transferability across platforms and tasks, supporting the view of GUI critique as a metric-learning problem.
contrastive learningaffordance spacezero-shot transferabilitymetric-learningsemantic alignment
ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition
We introduce ICED, an interpretable concept-level machine unlearning framework for Vision-Language Models (VLMs) that addresses the challenge of precise knowledge removal without affecting unrelated semantics. The method constructs a task-specific concept vocabulary using a multimodal large language model and decomposes visual representations into sparse, nonnegative combinations of semantic concepts, enabling fine-grained knowledge manipulation. Unlearning is formulated as concept-level optimization, selectively suppressing target concepts while preserving intra-instance non-target semantics and global cross-modal knowledge. Experiments across in-domain and out-of-domain forgetting settings demonstrate comprehensive target forgetting, improved preservation of non-target knowledge, and competitive model utility compared to existing VLM unlearning methods.
machine unlearningvision-language modelsconcept decompositionmultimodal alignmentknowledge preservation
Matrix-Space Reinforcement Learning for Reusing Local Transition Geometry
Matrix-Space Reinforcement Learning (MSRL) introduces a geometric abstraction for compositional generalization in sequential decision-making by representing trajectory segments as positive semidefinite matrix descriptors. These descriptors aggregate first- and second-order statistics of lifted one-step transitions, exposing shared hidden structure and enabling algebraic composition in an abstract matrix space. MSRL conditions value functions on trajectory-segment matrices, providing a first-order smooth approximation of action values for transfer learning. Compatible with standard model-free and model-based methods, MSRL achieves a best average finite-budget target AUC of 0.73, outperforming baseline methods including TD-MPC-PT+FT (0.63) and TD-MPC (0.57).
matrix-space reinforcement learningpositive semidefinite descriptorscompositional generalizationalgebraic compositiontrajectory-segment matrices
Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients
The paper introduces Hybrid Policy Optimization (HPO), a reinforcement learning method for hybrid discrete-continuous action spaces that combines pathwise and score-function gradients to address credit assignment issues while maintaining unbiasedness. HPO backpropagates through differentiable simulations where smoothness permits and reformulates problems with action discontinuities into hybrid form. Empirical results demonstrate HPO's superiority over PPO on inventory control and switched linear-quadratic regulator tasks, with performance gaps widening as continuous action dimensions increase. Theoretical analysis shows the mixed gradient's cross term becomes negligible near optimality, enabling approximate decentralized updates.
hybrid action spacespolicy optimizationmixed gradient estimatordifferentiable simulationcredit assignment
Precise Verification of Transformers through ReLU-Catalyzed Abstraction Refinement
We introduce a transformer verification method leveraging ReLU-catalyzed abstraction refinement to improve precision in safety-critical applications. Our approach employs ReLU to represent precise non-linear bounds for dot products in self-attention layers, enabling convex relaxation techniques to derive accurate output ranges. We extend rule-based and optimization-based frameworks to transformers, resulting in efficient and precise verification. Evaluated on sentiment analysis datasets, our method achieves significant precision improvements across most verification tasks compared to state-of-the-art baselines, with acceptable efficiency trade-offs.
transformer verificationrelu-catalyzed abstractionself-attention layersconvex relaxationsentiment analysis
To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model
MMGuard introduces a proactive defense against unauthorized fine-tuning of Large Vision-Language Models (LVLMs) by generating unlearnable examples via human-imperceptible perturbations. The method exploits LVLM learning dynamics through optimization shortcuts and cross-modal binding disruption, theoretically enforcing spurious noise-target correlations. Evaluated against nine LVLMs across six datasets, MMGuard demonstrates effective, stealthy, and robust protection under white-box, gray-box, and black-box threat models, outperforming post-hoc approaches like machine unlearning.
large vision-language modelsunlearnable examplescross-modal bindingoptimization shortcutmultimodal protection
Web Agents Should Adopt the Plan-Then-Execute Paradigm
This paper proposes replacing the ReAct architecture with a plan-then-execute paradigm for web agents, arguing it better handles prompt injection risks by separating task planning from runtime execution. The authors analyze WebArena, finding all tasks compatible with plan-then-execute and 80% executable via purely programmatic plans without runtime LLM subroutines. They identify the key challenge as mapping low-level browser interactions (click, type, scroll) to semantic, task-level operations with predictable effects, framing this as an infrastructure rather than modeling problem requiring typed website APIs.
plan-then-executereactwebarenaprompt injectionsemantic actions
MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification
MetaMoE introduces a privacy-preserving framework for unifying independently trained Mixture-of-Experts (MoE) models across distributed clients without data sharing. The method employs diversity-aware proxy selection from public data to approximate private distributions and supervise router learning, alongside context-aware routing for heterogeneous inputs. Evaluations on vision and NLP benchmarks show MetaMoE outperforms existing privacy-preserving MoE unification approaches.
mixture-of-expertsprivacy-preservingproxy selectionrouter learningheterogeneous inputs
Watermarking Game-Playing Agents in Perfect-Information Extensive-Form Games
The paper introduces a watermarking framework for game-playing agents in perfect-information extensive-form games, adapting the KGW watermark from LLMs to this domain. The method encodes hidden information in the agent's strategy profile, detectable via statistical tests, while bounding the degradation in expected utility. Experiments on chess engines demonstrate negligible impact on strategy quality and high detectability with few games, highlighting a tradeoff between detectability and utility.
watermarkingperfect-information gamesextensive-form gameskgw watermarkexpected utility
Parallelizing Counterfactual Regret Minimization
The paper introduces a parallelization framework for counterfactual regret minimization (CFR) algorithms, reframing them as linear algebra operations to leverage existing parallelization techniques. This approach generalizes to CFR variants like CFR+, discounted CFR, and predictive CFR. Experimental results demonstrate a GPU implementation achieving up to 10,000× speedup over CPU-based CFR in Google DeepMind OpenSpiel.
counterfactual regret minimizationparallelizationlinear algebraimperfect-information gamesgpu acceleration
PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation
PhyMotion introduces a structured 3D motion reward for physics-grounded human video generation, addressing the limitations of existing 2D perceptual rewards in scoring motion realism. The method recovers SMPL body meshes from generated videos, retargets them onto a MuJoCo humanoid, and evaluates motion quality along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Experiments demonstrate that PhyMotion achieves stronger correlation with human judgments and improves motion realism in RL-based post-training, yielding a +68 Elo gain in blind human evaluation. The reward preserves video generation quality with modest training overhead, and ablations confirm the complementary nature of its three evaluation axes.
physics-groundedsmplmujocokinematic plausibilitydynamic feasibility
Image Restoration via Diffusion Models with Dynamic Resolution
This work introduces dynamic resolution diffusion models (DMs) for efficient image restoration, addressing computational overhead in existing pixel-space and latent-space approaches. The method fine-tunes pre-trained DMs for dynamic resolution priors and adapts pixel-space techniques DPS and DAPS into SubDPS and SubDAPS, respectively, with an enhanced variant SubDAPS++ for improved efficiency and quality. Empirical evaluations across diverse datasets and restoration tasks demonstrate superior performance over recent DM-based methods in most scenarios. The approach reduces computational burden while maintaining reconstruction fidelity.
diffusion modelsimage restorationdynamic resolutioncomputational efficiencyreconstruction fidelity
Agentic AI Ecosystems in Higher Education: A Perspective on AI Agents to Emerging Inclusive, Agentic Multi-Agent AI Framework for Learning, Teaching and Institutional Intelligence
The paper proposes an integrated agentic multi-agent AI framework for higher education, addressing the fragmentation of current AI implementations. Through thematic analysis of literature, it identifies key gaps: task-specific AI tools, single-agent limitations, lack of cross-functional integration, and insufficient inclusivity. The framework envisions interconnected autonomous agents enabling coordinated planning, reasoning, and adaptive decision-making across teaching, learning, and administrative functions. It emphasizes inclusivity by supporting diverse learners through adaptive, multimodal interventions. Findings highlight the need for scalable, human-aligned AI ecosystems in education, offering future research directions for holistic, learner-centered platforms.
agentic aimulti-agent systemsadaptive decision-makinginclusive learningthematic analysis
Heuristic Pathologies and Further Variance Reduction via Uncertainty Propagation in the AIVAT Family of Techniques
The paper identifies heuristic pathologies in the AIVAT family of variance reduction techniques for multiagent evaluation, demonstrating that unconstrained optimization of the heuristic value function can lead to pathological variance minimization or p-hacking. It proposes fixing the heuristic value function prior to evaluation data observation to mitigate these issues. Additionally, the authors introduce uncertainty propagation for heuristic outputs, enabling further variance reduction via inverse-variance weighted averaging, albeit potentially sacrificing unbiasedness. Experiments on 10,000 poker hands show a 43.0% reduction in required samples for statistical conclusions.
variance reductionheuristic value functionuncertainty propagationmultiagent evaluationinverse-variance weighting
RefDecoder: Enhancing Visual Generation with Conditional Video Decoding
RefDecoder introduces a reference-conditioned video VAE decoder to enhance visual generation by addressing architectural asymmetry in latent diffusion models. The method injects high-fidelity reference image signals into the decoding process via reference attention, where a lightweight image encoder maps the reference frame into detail-rich tokens co-processed with denoised video latent tokens at each up-sampling stage. Evaluations on Inter4K, WebVid, and Large Motion benchmarks show consistent improvements, achieving up to +2.1dB PSNR over unconditional baselines. RefDecoder improves subject consistency, background consistency, and overall quality on VBench I2V and generalizes to tasks like style transfer and video editing refinement without additional fine-tuning.
reference-conditioned decoderlatent diffusion modelsvideo vaereference attentionpsnr
When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
The authors introduce tensor similarity, a weight-based metric for verifying functional equivalence in tensor-based models that is invariant to weight-space symmetries. The method employs an efficient recursive algorithm to capture global functional equivalence and cross-layer mechanisms, addressing limitations of empirical behavior-based metrics and basis-dependent parameter comparisons. Empirical results demonstrate that tensor similarity tracks functional training dynamics—including grokking and backdoor insertion—with higher fidelity than existing metrics, reducing similarity measurement to a solved algebraic problem rather than empirical approximation.
tensor similaritymechanistic interpretabilityfunctional equivalenceweight-space symmetriesgrokking
Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction
Hand-in-the-Loop (HandITL) introduces a seamless human-in-the-loop intervention method for Vision-Language-Action (VLA) models to address compounding errors in dexterous manipulation. HandITL blends human corrective intent with autonomous policy execution, avoiding abrupt robot-hand configuration changes ('gesture jumps') during bimanual tasks. Compared to direct teleoperation takeover, HandITL reduces takeover jitter by 99.8%, decreases grasp failures by 87.5%, and improves mean completion time by 19.1%. Validated on tasks requiring bimanual coordination, tool use, and fine-grained manipulation, HandITL-trained policies outperform those using standard teleoperation data by 19% across three long-horizon dexterous tasks.
vision-language-actiondexterous manipulationhuman-in-the-loopgesture jumpsteleoperation takeover
RoSHAP: A Distributional Framework and Robust Metric for Stable Feature Attribution
The paper introduces RoSHAP, a robust metric for stable feature attribution that addresses stochastic variation in SHAP values. The proposed framework models feature attribution score distributions using bootstrap resampling and kernel density estimation, demonstrating asymptotic Gaussian properties under mild conditions. RoSHAP aggregates SHAP distributions into a ranking criterion that rewards active, strong, and stable features. Empirical evaluations on simulated and real-world datasets show that RoSHAP outperforms single-run attribution measures in identifying signal features. Models using RoSHAP-selected features achieve comparable predictive performance to full-feature models while significantly reducing predictor count, enhancing interpretability and reliability.
shapbootstrap resamplingkernel density estimationfeature attributioninterpretability
Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution
We introduce MANSU (Mechanistic-Aligned Null-Space Unlearning), a method addressing the dual failure of gradient-based unlearning techniques under quantization. MANSU combines causal circuit attribution to isolate minimal forget-set subgraphs, circuit-restricted null-space projection with a diagonal-Fisher retain bound, and a per-parameter magnitude floor ensuring quantization survival. We also propose Circuit Attribution Divergence (CAD) to distinguish structural erasure from behavioral suppression. MANSU achieves meaningful forgetting, retain preservation, non-positive PTQ gap, and structural erasure across multiple model families and hazard benchmarks, outperforming gradient-based baselines that recover up to +0.05 accuracy under compression.
quantizationunlearningcircuit attributionnull-space projectionstructural erasure
Training ML Models with Predictable Failures
The paper introduces a method for predicting ML model failure rates at deployment scale by extrapolating from the largest k failure scores in an evaluation set. It analyzes the estimator's forecast error, revealing a safety-favorable over-prediction bias, except when rare high-failure modes are missed. A novel fine-tuning objective, the forecastability loss, mitigates this under-prediction risk. Experiments on a language-model password game and an RL gridworld demonstrate reduced held-out forecast error while maintaining primary-task performance and safety comparable to supervised baselines.
failure rate predictionforecast errorfine-tuning objectivesafety assessmentdeployment-scale evaluation
Causal Foundation Models with Continuous Treatments
We introduce the first causal foundation model for continuous treatment settings, enabling meta-learning of causal effect predictions across diverse unseen tasks without additional training. The model employs a novel prior over data-generating processes with continuous treatment variables to create a comprehensive causal training corpus. A transformer is trained to reconstruct individual treatment-response curves from observational data, utilizing in-context learning to amortize Bayesian posterior inference. The model achieves state-of-the-art performance in reconstructing treatment-response curves, outperforming task-specific causal models.
causal inferencecontinuous treatmentmeta-learningin-context learningtransformer
Natural Synthesis: Outperforming Reactive Synthesis Tools with Large Reasoning Models
The paper introduces a neuro-symbolic approach for reactive synthesis, combining large reasoning models with model checkers to iteratively repair Verilog implementations via symbolic feedback. This method outperforms dedicated tools in annual synthesis competitions and handles parameterized systems, despite their undecidability. Additionally, the authors propose autoformalization to convert natural-language specifications into temporal logic, demonstrating comparable performance to formal specifications. The approach solves more benchmarks than state-of-the-art tools and establishes natural synthesis as a viable end-to-end workflow.
reactive synthesisneuro-symbolicautoformalizationtemporal logicparameterized systems
CoCo-InEKF: State Estimation with Learned Contact Covariances in Dynamic, Contact-Rich Scenarios
CoCo-InEKF introduces a differentiable invariant extended Kalman filter for legged robot state estimation, replacing binary contact states with learned continuous contact velocity covariances. The method employs a lightweight neural network to predict covariances for predefined contact points, trained end-to-end with a state-error loss, eliminating heuristic contact labels. Experiments on a bipedal robot show improved linear velocity estimation accuracy (30% reduction in error) and filter consistency, enabling robust execution of dynamic motions like dancing in both simulation and real-world scenarios.
invariant extended kalman filtercontact velocity covarianceslegged robotsstate-error lossdifferentiable filtering
Learning from Language Feedback via Variational Policy Distillation
Variational Policy Distillation (VPD) introduces a novel framework for reinforcement learning from language feedback by formalizing it as a Variational Expectation-Maximization problem. VPD co-evolves teacher and student policies: the teacher adaptively refines its interpretation of textual feedback via trust-region updates, while the student internalizes this guidance during on-policy rollouts. This approach overcomes the limitations of passive distillation by continuously improving the teacher's ability to extract actionable signals. Evaluated on scientific reasoning and code generation tasks, VPD outperforms standard RLVR and self-distillation baselines, demonstrating its efficacy in leveraging diagnostic feedback for complex reasoning tasks.
variational policy distillationtrust-region updateson-policy rolloutslanguage feedbackself-distillation
Proposal and study of statistical features for string similarity computation and classification
The study proposes statistical features adapted from visual computing—co-occurrence matrix (COM) and run-length matrix (RLM)—for string similarity computation across languages and structures. These features outperform traditional measures like longest common subsequence and edit distances in synthetic experiments, showing statistical significance (P-value < 0.001) in 3 of 4 cases. On a real plagiarism dataset, RLM features achieved superior results, demonstrating language-agnostic effectiveness for textual analysis tasks.
co-occurrence matrixrun-length matrixstring similaritystatistical significanceplagiarism detection
From Data to Action: Accelerating Refinery Optimization with AI
The study introduces machine learning approaches to enhance refinery optimization by addressing limitations in Linear Programming (LP) solutions. Specifically, it proposes Anomaly Detection tools, including a transformed ECOD methodology, to analyze historical data alongside LP outputs. Novel methods for handling high-dimensional data are introduced, focusing on selecting the most informative pairs and employing 2D Anomaly Detection algorithms. These techniques were applied to the MOL refinery scheduling and planning architecture, uncovering business opportunities and data supply errors. The integration of machine learning with LP provides actionable insights, improving decision-making processes in petrochemical operations.
linear programminganomaly detectionecod methodologyhigh-dimensional datarefinery optimization
Average Gradient Outer Product in kernel regression provably recovers the central subspace for multi-index models
The paper proves that the Average Gradient Outer Product (AGOP) from kernel ridge regression (KRR) can recover the central subspace in multi-index models with fewer samples than required for accurate prediction. The method analyzes AGOP's eigenspace when fitting KRR to data from functions $f^*(x)=h(Ux)$, where $U$ projects onto an unknown $r$-dimensional subspace. Results show subspace recovery occurs with $n\asymp d^{p+δ}$ samples (for any $δ\in(0,1)$) when a degree-$p$ component carries predictive directions, contrasting the $n\asymp d^{p^*}$ samples needed for accurate prediction of degree-$p^*$ functions. This explains the sample efficiency of iterative kernel methods like Recursive Feature Machines.
kernel ridge regressioncentral subspacemulti-index modelaverage gradient outer productrepresentation learning
Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets
We introduce Croissant Baker, an open-source CLI tool for generating Croissant metadata locally from dataset directories, addressing limitations of platform-dependent metadata creation. The tool employs a modular handler registry to produce machine-checkable JSON-LD metadata, enabling dataset discovery, governance, and reproducibility across ML platforms. Evaluated on 140+ datasets, including MIMIC-IV with 886M rows and 374 Parquet files, Croissant Baker achieves 97-100% agreement with ground truth metadata across diverse domains, demonstrating scalability and accuracy in metadata generation.
croissant metadatajson-ldmodular handler registrydataset discoveryml reproducibility
DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models
DiffusionOPD introduces a multi-task training paradigm for diffusion models via Online Policy Distillation (OPD), addressing limitations of joint optimization and cascade RL in reinforcement learning. The method independently trains task-specific teachers, then distills their capabilities into a unified student along its rollout trajectories, decoupling single-task exploration from multi-task integration. Theoretically, it extends OPD to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies stochastic SDE and deterministic ODE refinement via mean-matching. Experiments demonstrate that DiffusionOPD outperforms multi-reward RL and cascade RL baselines in training efficiency and final performance, achieving state-of-the-art results across evaluated benchmarks.
diffusion modelsonline policy distillationmulti-task trainingmarkov processeskl objective
An Interpretable Latency Model for Speculative Decoding in LLM Serving
The authors present an interpretable latency model for speculative decoding (SD) in large language model (LLM) serving systems, addressing the gap in understanding SD behavior under varying production loads. The model decomposes per-request latency into load-independent and load-dependent components for prefill, drafting, and verification phases, inferring effective batch size via Little's Law. Extensive validation using vLLM measurements demonstrates accurate latency prediction, explains diminishing speedups under high load, and characterizes the impact of draft length, acceptance rate, and verifier-drafter size. The framework also extends to mixture of experts models, accounting for sparse expert activation effects.
speculative decodinglatency modellarge language modellittle's lawmixture of experts
Separating Intrinsic Ambiguity from Estimation Uncertainty in Deep Generative Models for Linear Inverse Problems
The paper introduces a structural decomposition method to separate intrinsic ambiguity from estimation uncertainty in deep generative models for linear inverse problems. By employing a cascade formulation, the approach enables calibration analysis and diagnostics, revealing hidden failure modes not detectable through reconstruction quality alone. Validation on a Gaussian example with analytical posterior structure demonstrates efficacy, followed by applications in accelerated MRI and EEG source imaging, showcasing practical utility in high-stakes domains.
posterior uncertaintyintrinsic ambiguitydeep generative modelslinear inverse problemscalibration diagnostics
TopoPrimer: The Missing Topological Context in Forecasting Models
TopoPrimer introduces topological structure as explicit input to forecasting models, improving accuracy and stability. The framework precomputes global topology via persistent homology and spectral sheaf coordinates, deploying them as token inputs or lightweight adapters. Sheaf coordinates drive primary accuracy gains. Evaluations on Chronos and TimesFM benchmarks show consistent improvements, including 7.3% MSE reduction on ECL and 27% MAE reduction in cold-start scenarios. Topology benefits persist across zero-shot and fine-tuned models, with peak seasonal demand degradation limited to 10% versus 50% for baselines.
topological structurepersistent homologyspectral sheaf coordinatescold-startzero-shot forecasting
Multi-Block Attention for Efficient Channel Estimation in IRS-Assisted mmWave MIMO
The paper introduces a Multi-Block Attention (MBA) framework for efficient cascaded channel estimation in IRS-assisted mmWave MIMO systems using OFDM. The method leverages discrete Fourier transform and Hadamard matrices for optimal phase configurations and employs a two-stage architecture: a Convolutional Attention Network (CAN) for spatial correlation recovery and a Complex Multi-Convolutional Network (CMN) for noise suppression. MBA reduces pilot overhead by up to 87% compared to least squares estimators and achieves 51% lower normalized mean squared error at 10 dB SNR. The framework maintains low computational complexity and adapts effectively to diverse propagation environments.
intelligent reflecting surfacesmmwave mimochannel estimationmulti-block attentionorthogonal frequency division multiplexing
DeepTokenEEG Enhancing Mild Cognitive Impairment and Alzheimers Classification via Tokenized EEG Features
DeepTokenEEG, a lightweight model with 0.29M parameters, enhances Alzheimer's disease (AD) and mild cognitive impairment classification via tokenized EEG features. It employs spatial and temporal tokenizers to capture AD-related biomarkers in both temporal and frequency domains, addressing challenges in EEG-based diagnosis such as data availability and expert interpretation time. Trained on a dataset of 274 subjects (180 AD cases, 94 healthy controls), DeepTokenEEG achieves 100% accuracy on specific frequency bands, outperforming state-of-the-art methods by 1.41-15.35%. Its compact size and high performance suggest strong potential for early AD detection and deployment.
electroencephalogramtokenizerbiomarkersfrequency domainmild cognitive impairment
Distance-Matrix Wasserstein Statistics for Scalable Gromov--Wasserstein Learning
The paper introduces Distance-Matrix Wasserstein (DMW), a scalable relaxation of Gromov--Wasserstein (GW) distances for comparing metric spaces. DMW samples $n$ points from each space, computes pairwise distance matrices, and transports their distributions via Wasserstein metrics, avoiding GW's nonconvex quadratic optimization. Theoretical analysis shows DMW lower-bounds GW, with convergence guarantees as sampled subspaces densify. The method includes sliced and multi-scale variants, with $p=1$ yielding positive-definite kernels. Experiments on synthetic data, graph classification, and two-sample testing demonstrate scalability while preserving GW's structural interpretability.
gromov--wassersteinoptimal transportdistance matricesmanifold learningtwo-sample testing
InfoSFT: Learn More and Forget Less with Information-Aware Token Weighting
InfoSFT introduces an information-aware token weighting scheme for supervised fine-tuning (SFT) of LLMs, addressing the limitations of uniform sample weighting. The method selectively emphasizes medium-confidence tokens—those neither too familiar nor too unlikely under the base model—via a one-line modification to the token-wise loss. Evaluations across math, code, and chain-of-thought tasks demonstrate improved generalization over vanilla SFT and likelihood-weighted baselines while better preserving pre-existing capabilities in diverse model families.
supervised fine-tuningtoken weightinggeneralizationllmsinformation-aware
Efficient Online Conformal Selection with Limited Feedback
The paper introduces an efficient online conformal selection method for bandit feedback settings, ensuring adversarial validity and stochastic efficiency. Using Adaptive Conformal Inference (ACI) updates on dual variables, the approach guarantees target success probability under distribution shifts while achieving sublinear efficiency regret for i.i.d. inputs. Theoretical analysis via Lyapunov functions extends to bandit and semi-bandit feedback, bridging online learning with limited feedback and distribution-free uncertainty quantification.
conformal selectionbandit feedbackadaptive conformal inferenceefficiency regretlyapunov functions
nASR: An End-to-End Trainable Neural Layer for Channel-Level EEG Artifact Subspace Reconstruction in Real-Time BCI
The authors propose nASR, an end-to-end trainable Keras layer for EEG artifact subspace reconstruction that jointly optimizes artifact rejection and downstream decoding in real-time BCI applications. nASR introduces two trainable parameters: K for artifact detection in PC variance space and L for eigen-spread quantification, enabling selective channel-level reconstruction while preserving clean channel information. Evaluated on two subjects from the BCI Competition IV Dataset 1, nASR variants outperform traditional ASR on classification metrics while achieving a 6-8x reduction in inference time, making it suitable for low-latency, high-performance BCI systems.
eegartifact subspace reconstructionkeras layerprincipal componentreal-time bci
Real-time virtual circuits for plasma shape control via neural network emulators
We present neural-network-based emulators for real-time virtual circuit (VC) computation in tokamak plasma shape control, enabling state-aware regulation of strongly coupled shape parameters. The method constructs differentiable functions from a library of over one million simulated Grad-Shafranov equilibria spanning the MAST Upgrade operational space, allowing rapid gradient computation for VC derivation. Extensive verification demonstrates high accuracy and orthogonality of emulated VCs across diverse equilibria, validating their physical viability as a scalable alternative to precomputed VC schedules. This approach addresses limitations of static reference-equilibrium-based VC control, particularly for rapidly evolving plasma configurations.
virtual circuitstokamakgrad-shafranov equilibrianeural network emulatorsplasma shape control
Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models
We introduce Octopus, a two-stage continual learning framework for multimodal large language models (MLLMs) that employs History-Free Gradient Orthogonalization (HiFGO) to mitigate catastrophic forgetting without relying on historical task data. The method decouples task adaptation from regularization, balancing plasticity and stability through gradient-level orthogonality. Evaluated on UCIT, Octopus achieves state-of-the-art performance, surpassing prior methods by 2.14% and 6.82% in Avg and Last metrics, respectively.
continual learningmultimodal large language modelsgradient orthogonalizationcatastrophic forgettingtask adaptation
A Hardware-Aware, Per-Layer Methodology for Post-Training Quantization of Large Language Models
The Scaled Outer Product (SOP) methodology introduces hardware-aware, per-layer post-training quantization for large language models, achieving near-lossless fidelity at 4.5–6 bits per weight. SOP combines per-layer search of fixed and dynamic codebook pairs, signed per-block scales, activation-weighted cosine selection, and multiple-choice knapsack promotion with outlier and sparse-residual correction. Fixed codebooks include NF4, BOF4, Split87, and SH4, while per-layer optimized codebooks (DD4) are hosted in LUT SRAM. A new hardware-efficient LUT output format (HIF) improves performance, energy, and cost. Evaluations across six model families show FP6 (E2M3sUE4M4, 6.5 bpw) outperforms FP8 (E4M3, 8.0 bpw) in weight reconstruction error with 1.5 bpw lower storage cost.
quantizationcodebooklutsparse-residualhardware-aware
Learning with Shallow Neural Networks on Cluster-Structured Features
The paper analyzes how input-space correlations affect sample complexity in shallow neural networks trained with gradient descent. It introduces a tractable model where targets depend on latent Boolean variables, and input features are clustered and correlated with these variables. Under an identifiability assumption, the authors prove that layerwise gradient descent achieves sample complexity scaling with the number of hidden variables and, at high signal-to-noise ratios, becomes independent of input dimension up to logarithmic terms. Empirical validation on synthetic and real data supports these theoretical findings.
sample complexitygradient descentlatent variablesinput-space correlationsshallow neural networks
Road Maps as Free Geometric Priors: Weather-Invariant Drone Geo-Localization with GeoFuse
GeoFuse introduces a weather-invariant drone geo-localization framework by leveraging road maps as free geometric priors. The method integrates aligned road map tiles with satellite imagery via token-level and channel-level fusion, using dynamic gating to adaptively weight modalities. Class-level cross-view contrastive learning aligns weather-degraded drone features with fused representations. Evaluations on University-1652 and DenseUAV show +3.46% and +23.18% Recall@1 improvements over state-of-the-art methods under diverse weather conditions.
geo-localizationcross-modal fusioncontrastive learningdynamic gatingweather-invariant
A Mutual Information Lower Bound for Multimodal Regression Active Learning
The authors introduce Mutual Information Lower Bound (MI-LB), a novel acquisition function for active learning in multimodal regression tasks, addressing epistemic uncertainty in predictive distributions. The method employs a Two-Index framework to separate epistemic and aleatoric uncertainty sources, deriving MI-LB as a closed-form approximation for Mixture Density Network ensembles. Experiments on multimodal benchmarks demonstrate that MI-LB consistently outperforms geometric and Fisher-based baselines, particularly when input space multimodality is not explicitly encoded.
active learningmultimodal regressionepistemic uncertaintymixture density networkacquisition function
TILBench: A Systematic Benchmark for Tabular Imbalanced Learning Across Data Regimes
TILBench introduces a systematic benchmark for tabular imbalanced learning, evaluating over 40 algorithms across 57 datasets through 200,000+ experiments. The study reveals no universally dominant method, with performance contingent on dataset characteristics and computational constraints. Results demonstrate method effectiveness varies significantly by context, prompting data-driven recommendations for real-world application selection.
tabular dataimbalanced learningbenchmarkalgorithm comparisonempirical evaluation
Text-Dependent Speaker Verification (TdSV) Challenge 2024: Team Naive System Report
The paper introduces a system for the 2024 Text-Dependent Speaker Verification (TdSV) Challenge, achieving a MinDCF of 0.0461 and an EER of 1.3%. The approach adapts ResNet-TDNN and NeXt-TDNN architectures, originally trained on VoxCeleb, and introduces a lightweight EfficientNet-A0 model trained on the challenge dataset. The system leverages advanced neural architectures, extensive data augmentation, and optimized hyperparameters. Results highlight the effectiveness of multi-model ensemble learning for speaker and phrase verification, demonstrating strong performance in text-dependent speaker verification.
text-dependent speaker verificationminimium detection cost functionequal error rateensemble learningdata augmentation
PROCESS-2: A Benchmark Speech Corpus for Early Cognitive Impairment Detection
PROCESS-2, a large-scale speech corpus, addresses the scarcity of clinically validated datasets for cognitive impairment detection by providing 21 hours of speech audio from 200 healthy controls, 150 mild cognitive impairment cases, and 50 dementia diagnoses. Collected via the CognoMemory digital assessment platform, it includes spontaneous and task-oriented speech (picture description, verbal fluency) with manually verified transcripts and metadata. Technical validation confirmed demographic balance, clinical consistency, and reproducible baseline modeling performance, demonstrating meaningful group separation. Released under controlled access on Hugging Face, PROCESS-2 serves as a benchmark for speech-based cognitive assessment research.
speech corpuscognitive impairmentclinical validationtask-oriented speechbaseline modeling
AIMing for Standardised Explainability Evaluation in GNNs: A Framework and Case Study on Graph Kernel Networks
The authors propose AIM, a standardized framework for evaluating explainability in Graph Neural Networks (GNNs) that addresses limitations in existing methods by measuring Accuracy, Instance-level explanations, and Model-level explanations. The framework is applied to inherently interpretable GNNs like graph kernel networks (GKNs) and prototype networks (PNs), yielding insights that inform the development of an improved model, xGKN, which maintains accuracy while enhancing explainability. Results demonstrate AIM's effectiveness in providing robust evaluation metrics and actionable model improvements for Explainable AI (XAI) in graph-structured data.
graph neural networksexplainable aigraph kernel networksinterpretability evaluationprototype networks
BCI-Based Assessment of Ocular Response Time Using Dynamic Time Warping Leveraging an RDWT-Driven Deep Neural Framework
The study introduces a multimodal framework combining EEG and AR-based VOMS tasks to assess ocular response times for mTBI diagnosis. A RDWT-driven deep neural network processes EEG signals, employing wavelet-domain denoising, convolutional filtering, and convolutional-LSTM decoding. Dynamic Time Warping (DTW) analysis revealed significant inter-subject differences in ocular response times, with pursuit tasks proving most discriminative. Validation via Pearson correlation (≥0.5) and Mann-Whitney U tests confirmed the method's efficacy, highlighting RDWT-based EEG features and DTW metrics as promising tools for mTBI assessment.
eegrdwtdynamic time warpingocular responsemtbi
Denoising-GS: Gaussian Splatting with Spatial-aware Denoising
Denoising-GS introduces a spatial-aware denoising framework for 3D Gaussian Splatting (3DGS) to address noisy primitives from SfM initialization. The method employs a spatial gradient-based denoising strategy for coherent updates, an uncertainty-based module for pruning redundant primitives, and spatial coherence refinement for structural completeness. Evaluated on three benchmarks, Denoising-GS achieves state-of-the-art Novel View Synthesis fidelity while maintaining compactness.
3dgsdenoisingspatial-awarenovel view synthesisgaussian primitives
Temporal Fair Division in Multi-Agent Systems: From Precise Alternation Metrics to Scalable Coordination Proxies
The paper introduces Rotational Periodicity (RP) and ALT, two families of metrics for temporal fair division in repeated multi-agent resource competition, formalized through the Multi-Agent Battle of the Exes (MBoE) problem. RP decomposes fairness into Rotational Score and Waiting Periods Evaluation, achieving O(nu+n) time complexity versus ALT's O(nu*n), where nu is episode count and n is agent count. Empirical results show RP's 12-25x speedup over ALT, complementary discrimination capabilities, and exposure of coordination failures invisible to traditional metrics like Reward Fairness.
temporal fair divisionrotational periodicitymulti-agent systemsround-robin allocationcoordination failure
Fast Adversarial Attacks with Gradient Prediction
We propose a family of fast adversarial attacks that eliminate the backward pass by predicting input gradients from forward-pass hidden states via lightweight linear regression. The method is motivated by a kernel view of neural networks and is exact in the Neural Tangent Kernel regime, while remaining effective for practical finite-width models. Empirical results demonstrate that our approach recovers much of FGSM's attack performance while achieving a 532% increase in throughput, enabling significantly faster adversarial generation under realistic wall-clock constraints.
adversarial attacksgradient predictionneural tangent kernelthroughputlinear regression
A Non-Monotone Preconditioned Trust-Region Method for Neural Network Training
The authors propose a non-monotone variant of the Additively Preconditioned Trust-Region Strategy (APTS) for neural network training, called NAPTS. The method employs a nonlinear additive Schwarz preconditioner combining parallel subdomain corrections with global coarse-space directions, and introduces a windowed acceptance criterion permitting controlled objective increases. This approach reduces CPU time by 30% and decreases rejected steps to one third compared to APTS while maintaining accuracy. The technique is particularly suited for large-scale deep neural network training via domain decomposition.
trust-region methodadditive schwarz preconditionerdomain decompositionnon-monotone optimizationparallel training
In-Context Learning for Data-Driven Censored Inventory Control
The paper proposes in-context generative posterior sampling (ICGPS), a method combining offline meta-training with online in-context autoregressive generation for censored inventory control. ICGPS uses a learned completion kernel to handle decision-dependent censoring, with theoretical guarantees bounding Bayesian regret by the ideal Thompson sampling benchmark plus a deployment penalty scaling as √T times completion mismatch. The method is instantiated as ChronosFlow-ICGPS, integrating a frozen time-series transformer with a trainable normalizing-flow head. Experiments show it matches correctly specified Thompson sampling, outperforms baselines, and demonstrates robustness to prior mismatch and distribution shift on both synthetic benchmarks and the SuperStore dataset.
in-context learningcensored inventory controlgenerative posterior samplingbayesian regretnormalizing-flow
GenAI for Energy-Efficient and Interference-Aware Compressed Sensing of GNSS Signals on a Google Edge TPU
A novel generative AI approach compresses and classifies GNSS jamming threats in real time using variational autoencoders (VAEs) deployed on Google Edge TPUs. The method adapts large-scale AE models through 8-bit quantization for energy-efficient deployment, achieving >42x compression while preserving interference characteristics. Evaluated on raw IQ, FFT, and handcrafted features, the system classifies ~72 interference types with F2-scores of 0.915 on reconstructed signals, closely matching original signals (F2-score 0.923). Conditional and FactorVAE ablation studies enhance latent feature disentanglement for interpretability, reducing jammer signal transmission costs and offering practical interference mitigation.
variational autoencodersgnss jammingedge tpu8-bit quantizationlatent feature disentanglement
K-Models: a Flexible and Interpretable Method for Ordinal Clustering with Application to Antigen-Antibody Interaction Profiles
K-Models introduces a novel clustering framework for functional data that integrates ordinal constraints to improve interpretability while estimating key elements of the data-generating process. The method incorporates structural assumptions into the clustering process, enabling meaningful insights when an ordinal relationship among clusters is suspected. Evaluated through simulations and real-world applications, K-Models demonstrates comparable performance to state-of-the-art techniques while enhancing interpretability. Specifically, it is tested on Region of Interest (ROI) curves representing antigen-antibody binding dynamics, capturing changes in reflected light intensity over time. This approach provides a valuable tool for analyzing functional data with underlying ordinal structures.
ordinal clusteringfunctional datainterpretabilityregion of interest curvesantigen-antibody binding
ToMAToMP: Robust and Multi-Parameter Topological Clustering
ToMAToMP introduces the first topological clustering method capable of handling multiple functions simultaneously with theoretical robustness guarantees, addressing limitations of ToMATo. Leveraging multi-parameter persistent homology and MMA decomposition, the method eliminates dependency on graph tuning and enhances outlier robustness. Empirical evaluations demonstrate significant improvements in clustering efficiency and quality over both non-topological and topological baselines across diverse datasets.
topological clusteringmulti-parameter persistent homologymma decompositionrobustness guaranteesoutlier robustness
GFMate: Empowering Graph Foundation Models with Test-time Prompt Tuning
GFMate introduces test-time graph prompt tuning to enhance Graph Foundation Models (GFMs) without dependency on pre-training strategies or source-domain entanglement. The method employs centroid and layer prompts applied post-pre-training on target domains, coupled with a test-time complementary learning objective that leverages both labeled and unlabeled target domain data. Evaluations across 12 benchmark datasets demonstrate GFMate's effectiveness, achieving performance improvements of up to 30.63% while maintaining efficiency.
graph foundation modelstest-time tuningcentroid promptslayer promptscomplementary learning
Understanding Imbalanced Forgetting in Rehearsal-Based Class-Incremental Learning
This work systematically investigates imbalanced forgetting in rehearsal-based class-incremental learning (CIL), where certain classes are forgotten more than others despite balanced rehearsal. The authors construct three last-layer coefficients capturing gradient-level interference sources (self-induced, new-class, and cross-class) and demonstrate their predictive power for class-wise forgetting rankings. Results show self-induced interference as the strongest predictor, with evidence suggesting influence from new-class interference. The findings offer mechanistic insights and potential mitigation directions for reducing class-wise forgetting disparities.
class-incremental learningcatastrophic forgettingrehearsalgradient interferenceimbalanced forgetting
Peng's Q($λ$) for Conservative Value Estimation in Offline Reinforcement Learning
Conservative Peng's Q(λ) (CPQL) introduces a model-free offline multi-step reinforcement learning algorithm that adapts the Peng's Q(λ) operator for conservative value estimation, replacing the Bellman operator. CPQL leverages offline trajectories to induce implicit behavior regularization, mitigating over-pessimistic value estimation while achieving performance at least equal to the behavior policy and providing near-optimal guarantees. Extensive experiments on the D4RL benchmark show CPQL consistently outperforms existing single-step baselines. Additionally, CPQL facilitates offline-to-online learning by enabling robust performance improvements during fine-tuning without initial performance drops.
offline reinforcement learningconservative value estimationmulti-step operatorbehavior regularizationd4rl benchmark
BioHuman: Learning Biomechanical Human Representations from Video
We introduce BioHuman, an end-to-end model for estimating human motion and muscle activations from monocular video, bridging visual observations and internal biomechanical states. The approach builds on BioHuman10M, a large-scale dataset created via simulation-based estimation of muscle activations from motion capture data, containing synchronized video, motion, and activation annotations. Experiments demonstrate that BioHuman accurately reconstructs both kinematic motion and muscle activity while generalizing across diverse subjects and motions. This work establishes a new benchmark for video-based biomechanical understanding and enables physically grounded human modeling.
biomechanical statesmuscle activationsmonocular videokinematic motionmotion capture
Composable Crystals: Controllable Materials Discovery via Concept Learning
The paper introduces a concept-based compositional framework for controllable crystal generation, addressing limitations of black-box stochastic sampling in materials discovery. A vector-quantized variational autoencoder learns reusable crystal concepts as interpretable building blocks, enabling guided generation beyond training distributions. The method includes a composition generator refined via self-generated samples, improving composition efficiency. Evaluated on MP-20 and Alex-MP-20 datasets, the approach increases base model performance by up to 53.2% and 51.7% on the V.S.U.N metric, particularly enhancing novelty.
crystal generationvector-quantized variational autoencodercompositional frameworkmaterials discoveryv.s.u.n metric
Crys-JEPA: Accelerating Crystal Discovery via Embedding Screening and Generative Refinement
Crys-JEPA introduces a joint embedding predictive architecture for crystal discovery, addressing the stability-novelty trade-off in de novo crystal generation. The method learns an energy-aware latent space that preserves formation-energy differences, enabling stability assessment via embedding-based comparisons against training crystals, thus reducing reliance on expensive energy evaluations. A screening-and-refinement pipeline identifies promising generated crystals and refines the generative model. Evaluations on MP-20 and Alex-MP-20 datasets demonstrate improvements of up to 81.4% and 82.6% on the V.S.U.N metric, respectively, outperforming baseline methods.
joint embeddingcrystal discoverystability-novelty trade-offenergy-aware latent spacescreening-and-refinement pipeline
Selective Safety Steering via Value-Filtered Decoding
The paper introduces value-filtered decoding, a novel method for improving LLM safety by selectively intervening only when necessary during token generation. The approach uses a safety criterion to filter tokens, with a tunable threshold controlling the trade-off between unnecessary interventions and safety guarantees. Theoretical analysis provides bounds on false intervention probabilities. Empirical results demonstrate superior performance over baselines in balancing safety, helpfulness, and output fidelity across multiple datasets.
value-filtered decodingsafety steeringdecoding-time interventionfalse intervention boundthreshold hyperparameter
IsoNet: Spatially-aware audio-visual target speech extraction in complex acoustic environments
IsoNet introduces a spatially-aware audio-visual target speech extraction system for compact 4-microphone arrays, addressing limitations of monaural neural models and classical beamformers. The method integrates complex multi-channel STFT features, GCC-PHAT spatial cues, face-conditioned visual embeddings, and auxiliary direction-of-arrival supervision within a U-Net mask estimation network. Three curriculum variants were trained on 25,000 simulated VoxCeleb mixtures. On a challenging test set (-1 to 10 dB SNR), IsoNet-CL1 achieves 9.31 dB SI-SDR (4.85 dB improvement over mixture), with PESQ 2.13 and STOI 0.84, outperforming oracle delay-and-sum and MVDR beamformers. Ablation studies confirm contributions from visual conditioning, GCC-PHAT features, and extended delay-bin encoding, establishing a compact-array baseline while highlighting remaining deployment challenges.
target speech extractiongcc-phatu-netsi-sdrbeamformers
AnchorRoute: Human Motion Synthesis with Interval-Routed Sparse Contro
AnchorRoute introduces a sparse-anchor motion synthesis framework for human motion generation and refinement. The method converts sparse anchors into anchor-condition features, injecting them into a frozen Transition Masked Diffusion prior via AnchorKV and dual-context conditioning to preserve text-to-motion quality. Post-generation, anchors are evaluated as residuals, guiding RouteSolver to refine motion through soft-token updates on piecewise-affine interval bases. AnchorRoute supports root-3D, planar-root, and body-point control within a unified formulation. Benchmark evaluations demonstrate its superiority over prior sparse-control methods, improving anchor adherence while maintaining motion quality.
sparse-anchortransition masked diffusionanchor-condition featuresroutesolverpiecewise-affine
The Rate-Distortion-Polysemanticity Tradeoff in SAEs
We formalize the Rate-Distortion-Polysemanticity tradeoff in Sparse Autoencoders (SAEs), demonstrating that monosemantic representations necessarily increase rate and distortion. Through theoretical analysis and toy-model experiments, we show that optimal SAE polysemanticity depends on the training data distribution, particularly feature co-occurrence probabilities. We derive necessary conditions for polysemanticity measures when the data-generating process is unknown and evaluate existing metrics on SAEs trained on Large Language Models. Results indicate that polysemanticity is fundamentally a data-driven phenomenon that must be addressed at both architectural and optimization levels.
sparse autoencoderspolysemanticityrate-distortionmechanistic interpretabilityfeature co-occurrence
ReMIA: a Powerful and Efficient Alternative to Membership Inference Attacks against Synthetic Data Generators
ReMIA (Relative Membership Inference Attack) introduces a practical privacy metric for synthetic data generators (SDGs), addressing limitations of state-of-the-art membership inference attacks (MIAs). Unlike MIAs requiring hundreds of SDG training runs and large auxiliary datasets, ReMIA uses only two SDG training runs and additional data no larger than the original training set. It generates two synthetic datasets from distinct sources and employs a classifier to identify the source of a record. Experiments across multiple tabular datasets and SDGs demonstrate ReMIA's sensitivity comparable to MIAs while being significantly more efficient. ReMIA also highlights superior privacy-utility trade-offs compared to traditional noise-based anonymization methods.
synthetic data generatorsmembership inference attacksprivacy metrictabular dataanonymization methods
AQKA: Active Quantum Kernel Acquisition Under a Shot Budget
The paper introduces AQKA, an active quantum kernel acquisition method that optimizes shot allocation under budget constraints for quantum kernel learning. AQKA employs a pair-level acquisition theory with closed-form solutions for kernel ridge regression (KRR) and SVM, outperforming uniform allocation by +8 to +32 points on hardware kernels (e.g., ibm_pittsburgh). It provides a regime decomposition showing dominance in budget-limited scenarios (B ≲ 16 npairs) and offers explicit gradient-based shot allocation (s_ij ∝ |g_ij|√(K_ij(1-K_ij))). Experimental results demonstrate +17.0 ± 4.8 pts improvement on ibm_aachen (N=20) and +14.0 ± 8.5 pts on ibm_berlin (N=30).
quantum kernel learningshot allocationkernel ridge regressionnyström approximationsupport vector machine
Scalable Solution of the Stochastic Multi-path Traveling Salesman Problem via Neural Networks
The paper introduces a neural network-based surrogate modeling approach to solve the stochastic multi-path Traveling Salesman Problem (TSP) with uncertain travel times. A two-stage stochastic programming formulation is adopted, where the first stage determines a predefined route, and the second stage selects optimal paths based on realized traffic conditions. Neural networks approximate the expected value of the second-stage recourse problem, reducing computational burden. Evaluated architectures and training strategies demonstrate improved scalability, computation time, and solution quality for complex vehicle routing under uncertainty.
stochastic programmingtraveling salesman problemneural networkssurrogate modelingvehicle routing
Slower Generalization, Faster Memorization: A Sweet Spot in Algorithmic Learning
The study challenges the intuition that larger datasets always accelerate validation convergence in algorithmic learning tasks. Using Needleman-Wunsch matrix generation with small Transformers, the authors demonstrate that optimal generalization occurs at an intermediate dataset size, beyond which more gradient updates are required. Conversely, larger datasets can reduce training updates when partial validation competence emerges, indicating rule-structure benefits. A multiplication task baseline did not exhibit this slowdown. Results distinguish critical data size for generalization onset from dataset size optimizing update-based convergence, highlighting divergence in structured-output tasks.
grokkingneedleman-wunschtransformersgeneralizationmemorization
Unbiased and Second-Order-Free Training for High-Dimensional PDEs
We propose an unbiased, second-order-free training framework for solving high-dimensional PDEs using backward stochastic differential equations (BSDEs), addressing the intrinsic bias induced by Euler-Maruyama (EM) time discretization. By analyzing EM-induced loss bias, our method eliminates the need for explicit Hessian evaluations while preserving computational efficiency, unlike high-order schemes like Heun that reintroduce second-order spatial derivatives. The approach maintains the advantages of BSDE methods in avoiding the curse of dimensionality. Code is publicly available for reproducibility.
backward stochastic differential equationseuler-maruyamahessian-freehigh-dimensional pdestime discretization
DRL-STAF: A Deep Reinforcement Learning Framework for State-Aware Forecasting of Complex Multivariate Hidden Markov Processes
We introduce DRL-STAF, a Deep Reinforcement Learning framework for state-aware forecasting of complex multivariate hidden Markov processes, addressing limitations in both deep learning and Hidden Markov Models (HMMs). DRL-STAF jointly predicts next-step observations and estimates hidden states by modeling nonlinear emissions with deep neural networks and estimating discrete states via reinforcement learning, enabling flexible adaptation to temporal dynamics without predefined transition structures. Extensive experiments show DRL-STAF outperforms HMM variants, standalone deep learning models, and DL-HMM hybrids in most cases while providing reliable hidden-state estimates, mitigating state-space explosion in multivariate HMM-based methods.
deep reinforcement learninghidden markov modelsstate-aware forecastingnonlinear emissionsmultivariate processes
Deep Image Segmentation via Discriminant Feature Learning
This paper introduces Deep Discriminant Analysis (DDA), a differentiable, architecture-agnostic loss function for image segmentation that embeds classical discriminant principles. DDA explicitly maximizes between-class variance while minimizing within-class variance, promoting compact and separable feature distributions without increasing inference cost. Evaluations on the DIS5K benchmark demonstrate that DDA consistently improves segmentation accuracy, boundary sharpness, and model confidence across various architectures. The results indicate that integrating discriminant analysis offers a simple, effective path for building more robust segmentation models.
deep discriminant analysisimage segmentationdiscriminant principlesbetween-class variancewithin-class variance
Silent Collapse in Recursive Learning Systems
The paper introduces silent collapse, a phenomenon in recursive learning systems where internal model distributions degrade despite stable conventional metrics. Three trajectory-level precursors are identified: anchor entropy contraction, representation drift freezing, and tail coverage erosion, which reliably precede collapse. The authors propose the MTR (Monitor--Trust--Regulator) framework, a metacognitive loop that monitors these statistics, estimates a slow-timescale trust variable, and adaptively modulates learning intensity. MTR provides early warning and prevents collapse without requiring access to pristine real data, addressing scenarios where original data is unavailable, contaminated, or private.
recursive learningsilent collapseanchor entropymetacognitive looptail coverage
All-atomistic Transferable Neural Potentials for Protein Solvation
The Protein Hydration Neural Network (PHNN) introduces an implicit solvent model that improves solvation energetics accuracy by learning transferable corrections to model parameters, rather than applying post hoc energy adjustments. PHNN leverages physical priors embedded in the data to maximize data efficiency and extends analytical continuum solvation methods. Results demonstrate that PHNN outperforms traditional analytical approaches in accuracy and maintains predictive performance on out-of-domain protein systems, addressing the transferability challenge in neural potentials.
implicit solvent modelsolvation energeticstransferable correctionsanalytical continuum solvationprotein systems
Woodelf++: A Fast and Unified Partial Dependence Plot Algorithm for Decision Tree Ensembles
Woodelf++ introduces a unified and efficient algorithm for computing Partial Dependence Plots (PDPs), Joint-PDPs, and Any-Order Partial Dependence Interaction Values (PDIVs) on decision tree ensembles. By leveraging metrics over pseudo-Boolean functions, it extends Woodelf, an algorithm for SHAP computation, to support exact and approximate PDPs, Joint-PDPs, and Full PDPs, which capture model behavior across all feature values. Woodelf++ achieves exponential complexity improvements, computing Any-Order-PDIVs in 5 minutes versus an estimated 1,000,000 years for state-of-the-art methods. On a 400,000-row dataset, it computes PDPs and Joint-PDPs up to 6x faster than existing approaches and supports GPU acceleration.
partial dependence plotsdecision tree ensemblespseudo-boolean functionsfeature interactionsgpu acceleration
Let Robots Feel Your Touch: Visuo-Tactile Cortical Alignment for Embodied Mirror Resonance
The paper introduces Mirror Touch Net, a computational framework that operationalizes cortical correspondence to enable robotic mirror touch. The method enforces semantic, distributional, and geometric alignment between visual and tactile representations via multi-level constraints, predicting 1,140 taxel tactile signals from RGB images. Manifold analysis shows these constraints reshape visual representations to match the tactile manifold, simplifying cross-modal mapping. The framework extends to human hand observations, enabling tactile prediction and reflexive responses. Results demonstrate a neural-inspired approach for anticipatory touch and empathic human-robot interaction.
mirror touchvisuo-tactile alignmenttactile predictioncross-modal mappingmanifold learning
Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines
The study introduces an automated method to identify and rank refactoring opportunities in Behavior-Driven Development (BDD) test suites by mining recurring step subsequences. Using Sentence-BERT, UMAP, and HDBSCAN for paraphrase-robust clustering, the approach evaluates 5,382,249 slices from 339 repositories, classifying them via XGBoost (F1 = 0.891) against rule-based and LLM-judge baselines. Results show 75.0%, 59.5%, and 11.7% prevalence of within-file, within-repo, and cross-organizational refactoring candidates, with inter-annotator agreement (Fleiss' kappa) of 0.56-0.79.
behavior-driven developmentsentence-bertumaphdbscanxgboost
Scaling Laws from Sequential Feature Recovery: A Solvable Hierarchical Model
The authors demonstrate how scaling laws emerge from sequential feature recovery in multi-layer networks, proposing a hierarchical model with power-law decaying feature weights. They analyze a high-dimensional target function using a layer-wise spectral algorithm adapted to compositional structure, proving sharp recovery thresholds for individual features. The method relies on random matrix theory and resolvent-based perturbation arguments, providing matching upper and lower bounds for eigenvector recovery. Numerical experiments validate sequential feature recovery, finite-size smoothing of thresholds, and improved scaling over non-hierarchical kernel baselines, showing power-law decay of prediction error from aggregated feature transitions.
scaling lawssequential feature recoverypower-law decayresolvent-based perturbationrandom matrix theory
SeesawNet: Towards Non-stationary Time Series Forecasting with Balanced Modeling of Common and Specific Dependencies
SeesawNet introduces a novel architecture for non-stationary multivariate time series forecasting that dynamically balances common and instance-specific dependency modeling. The key innovation is Adaptive Stationary-Nonstationary Attention (ASNA), which captures common dependencies from normalized sequences and specific dependencies from raw sequences, adaptively fusing them based on instance-level non-stationarity. The model alternates between temporal and channel relationship modeling to capture long-range and cross-variable dependencies. Experiments on real-world benchmarks show SeesawNet consistently outperforms state-of-the-art methods.
non-stationary time seriesinstance normalizationadaptive attentionmultivariate forecastingdependency modeling
Multi-Dimensional Model Integrity and Responsibility Assessment Index and Scoring Framework
The authors introduce the Model Integrity and Responsibility Assessment Index (MIRAI), a unified framework for evaluating tabular models across explainability, fairness, robustness, privacy, and sustainability. MIRAI combines established metrics into normalized, direction-aligned dimension scores, enabling direct comparisons across models with varying architectures. Experiments on healthcare, financial, and socioeconomic datasets demonstrate that higher predictive performance does not always correlate with better overall integrity and responsibility, with simpler models often achieving superior cross-dimensional balance. MIRAI provides a compact basis for responsible model selection in regulated domains.
tabular modelsexplainabilityfairnessrobustnesssustainability
Discovering Physical Directions in Weight Space: Composing Neural PDE Experts
The paper introduces Calibration-Conditioned Merge (CCM), a method for composing neural PDE experts by identifying reusable physical directions in weight space. Starting from a shared family anchor, the authors fine-tune endpoint experts and decompose their updates into family-shared adaptations and physics-aligned directions. CCM infers target composition coordinates from metadata or rollout prefixes, deploying a merged checkpoint for extrapolative regimes. Evaluated on reaction-diffusion, Navier-Stokes, and dam-break systems, CCM reduces out-of-distribution error by 54.2%, 42.8%, and 13.8% respectively, demonstrating that fine-tuning reveals calibratable physical directions.
neural operatorsweight-space directionspde surrogate modelingin-context adaptationextrapolative regimes
Exploring Geographic Relative Space in Large Language Models through Activation Patching
The study investigates how Large Language Models (LLMs) process relative geographic space, addressing safety concerns in their geographic applications. Using activation patching, a mechanistic interpretability technique, the authors analyze the internal representations and computations within LLMs. The work contributes to understanding spatial reasoning in black-box models, though specific quantitative results are not provided in the abstract.
large language modelsactivation patchingmechanistic interpretabilitygeographic spacespatial reasoning
Lang2MLIP: End-to-End Language-to-Machine Learning Interatomic Potential Development with Autonomous Agentic Workflows
Lang2MLIP introduces a multi-agent framework leveraging large language models (LLMs) for end-to-end development of machine learning interatomic potentials (MLIPs) from natural-language input. The system formulates MLIP development as a sequential decision-making problem, where an agent autonomously selects actions based on dataset, model, evaluation, and execution log observations, enabling self-correction without predefined pipelines. Evaluated on a solid electrolyte interphase (SEI) system with multiple components and interfaces, the approach demonstrates the feasibility of LLM-based multi-agent systems for automating MLIP development and improving accessibility for non-experts.
machine learning interatomic potentialslarge language modelsmulti-agent frameworksequential decision-makingsolid electrolyte interphase
Large Dimensional Kernel Ridge Regression: Extending to Product Kernels
The paper extends the analysis of kernel ridge regression (KRR) to a broad family of product kernels, addressing open questions about saturation effects and multiple descent behavior beyond restrictive settings like inner product kernels on spheres. By establishing convergence rates for large dimensional KRR, the authors demonstrate minimax optimality for source conditions (s ≤ 1), saturation effects (s > 1), and periodic plateau phenomena with multiple-descent behavior relative to sample size n. The results generalize prior findings while relaxing assumptions such as hypercontractivity of eigenfunctions.
kernel ridge regressionsaturation effectsmultiple descentminimax optimalityproduct kernels
Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm
The paper introduces a framework for replacing Layer Normalization (LN) with RMSNorm in arbitrary deep neural networks (DNNs) while preserving model functionality, aiming to reduce inference overhead. The method folds LN's centering operation into upstream general linear layers using column-centered constraint (CCC) and column-based weight centering (CBWC), enabling exact inference-time conversion. Analysis reveals that many LNs in widely used architectures are foldable, achieving 2% to 12% end-to-end acceleration without altering predictions. Experiments demonstrate competitive performance with vanilla LN in practical training scenarios, even when exact equivalence is partially compromised.
layer normalizationrmsnormcolumn-centered constraintcolumn-based weight centeringfoldable lns
ArcGate: Adaptive Arctangent Gated Activation
This paper introduces ArcGate, an Adaptive Arctangent Gated Activation function with seven learnable parameters per layer, enabling neural networks to autonomously optimize non-linearity for specific feature hierarchies and data distributions. ArcGate employs a three-stage non-linear transformation, offering flexibility over fixed-shape activations like ReLU, GELU, or SiLU. Evaluated on ResNet-50 and Vision Transformer (ViT-B/16) architectures across PatternNet, UC Merced Land Use, and EuroSAT MSI datasets, ArcGate achieves a peak accuracy of 99.67% on PatternNet and demonstrates superior noise resilience, outperforming ReLU by 26.65% under moderate Gaussian noise (σ=0.1). Parameter analysis reveals depth-dependent gating strength evolution, enhancing signal propagation in deeper layers.
activation functionlearnable parametersnon-linear transformationnoise resiliencesignal propagation
A Novel Schur-Decomposition-Based Weight Projection Method for Stable State-Space Neural-Network Architectures
A novel Schur-decomposition-based weight projection method is introduced for ensuring asymptotic stability in state-space neural-network architectures. The approach dynamically projects the quasi-triangular factor of the state matrix's real Schur decomposition onto its nearest stable counterpart, maintaining stable dynamics with minimal overparameterization. Experiments on synthetic linear systems show comparable accuracy and convergence rates to state-of-the-art stable-system identification techniques, with a marginal computational overhead. The method's lower weight count enhances training convergence in stacked architectures with static nonlinearities, without compromising accuracy on real-world datasets. This provides a numerically robust framework for identifying complex dynamics while satisfying strict stability constraints.
schur decompositionstate-space architecturesasymptotic stabilityweight projectionquasi-triangular factor
Test-Time Learning with an Evolving Library
EvoLib introduces a test-time learning framework enabling large language models to accumulate, reuse, and evolve knowledge across problem instances without parameter updates or external supervision. The framework maintains a shared library of knowledge abstractions, including modular skills and reflective insights, extracted from the model's inference trajectories. A principled weighting and consolidation mechanism optimizes for immediate utility and long-term value, allowing instance-specific abstractions to evolve into general, reusable ones. EvoLib demonstrates substantial improvements over top test-time scaling and learning methods on benchmarks in mathematical reasoning, code generation, and multi-turn agentic environments.
test-time learningknowledge abstractionsmodular skillsreflective insightsmulti-turn agentic environments
Focused PU learning from imbalanced data
The authors introduce a novel method for positive and unlabeled (PU) learning tailored for highly imbalanced datasets, addressing challenges in scenarios like disease gene identification and fraud detection. The approach employs a focused empirical risk estimator that leverages both positive and unlabeled instances to train binary classifiers, particularly targeting hard-to-detect positives resembling negatives. Evaluations demonstrate state-of-the-art performance on imbalanced datasets under two labeling mechanisms: selecting positives completely at random (SCAR) and selecting at random (SAR). The method's efficacy is further validated in a real-world application for financial misstatement detection.
pu learningempirical risk estimatorimbalanced datasetsscarsar
LiSA: Lifelong Safety Adaptation via Conservative Policy Induction
LiSA (Lifelong Safety Adaptation) introduces a conservative policy induction framework to enhance AI guardrails in dynamic deployment environments. The method leverages structured memory to generalize sparse, noisy user feedback into reusable policy abstractions, incorporates conflict-aware local rules to mitigate overgeneralization, and employs evidence-aware confidence gating via a posterior lower bound for scalable memory reuse. Evaluated on PrivacyLens+, ConFaide+, and AgentHarm benchmarks, LiSA outperforms memory-based baselines under sparse feedback, maintains robustness at 20% label-flip noise rates, and advances the latency-performance trade-off beyond backbone model scaling. This framework addresses the challenge of adapting guardrails to unpredictable real-world edge risks.
policy inductionstructured memoryconfidence gatinglabel-flip noiselatency-performance
FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
FrontierSmith introduces a scalable method for synthesizing open-ended coding problems from closed-ended tasks to address LLMs' weakness in open-ended coding. The system evolves competitive programming problems by altering goals, restricting outputs, and generalizing inputs, then selects variants using a quantitative idea divergence metric to ensure diverse solver approaches. Agents generate test cases and verifiers for selected candidates. Training on synthesized data improves Qwen3.5-9B by +8.82 on FrontierCS and +306.36 on ALE-bench, and Qwen3.5-27B by +12.12 and +309.12, respectively. Synthesized problems induce longer, more token-intensive solutions, akin to human-curated tasks.
open-ended codingidea divergencecompetitive programmingtest case generationllm training
What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions
We introduce counterfactual time series forecasting with textual conditions, addressing limitations of traditional methods that rely solely on historical data or factual future conditions. Our approach incorporates a text-attribution mechanism that distinguishes mutable from immutable factors, enabling more accurate forecasts under complex stochastic textual conditions. We propose a comprehensive evaluation framework that handles both factual and counterfactual scenarios, even without ground truth time series. This method enhances flexibility and condition-awareness in forecasting, particularly for real-world scenarios influenced by stochastic future events. Further details are available at the project page: https://seqml.github.io/TADiff/.
counterfactual forecastingtextual conditionstime seriestext-attribution mechanismstochastic conditions
GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation
GeoViSTA introduces a geospatial vision-tabular transformer that learns unified embeddings from co-registered imagery and tabular data, addressing the modality gap in existing geospatial foundation models. The architecture employs bilateral cross-attention with geography-aware alignment between continuous image patches and irregular census-tract tokens, trained via joint masked-autoencoding. Evaluations show improved linear probing performance on downstream tasks, including disease-specific mortality and fire hazard frequency prediction, demonstrating the benefits of joint physical-socioeconomic modeling for geospatial inference.
geospatial transformermultimodal learningmasked-autoencodingcross-attentiontabular-vision fusion
Watch your neighbors: Training statistically accurate chaotic systems with local phase space information
The paper introduces a framework for training surrogate models of chaotic systems that simultaneously achieve accurate Jacobians and long-term statistical fidelity. By constructing local phase space coverings and minimizing maximum mean discrepancy between pushforward distributions of surrogate and ground-truth dynamics, the method bridges Jacobian-focused and statistically accurate approaches. Experiments demonstrate significant improvements in Jacobian accuracy while maintaining competitive performance with state-of-the-art statistical dynamics learning methods.
chaotic systemsjacobian accuracyphase space coveringsmaximum mean discrepancysurrogate models
Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion
The paper introduces MIRAGE, a framework for discovering semantic attacks on online HD map construction systems used by autonomous vehicles. By leveraging diffusion models to explore plausible environmental variations (e.g., shadows, wet roads), MIRAGE generates semantically mutated scenes that mislead mapping predictions while preserving road topology. Evaluated on nuScenes, MIRAGE achieves 57.7% boundary removal and successfully injects fictitious boundaries where pixel-based attacks fail, with 80-84% realism judged by VLMs. The attacks remain effective against adversarial defenses, revealing a critical vulnerability to semantic perturbations.
semantic attacksdiffusion modelshd map constructionadversarial defensesautonomous vehicles
NodeSynth: Socially Aligned Synthetic Data for AI Evaluation
NodeSynth introduces an evidence-grounded methodology for generating socially relevant synthetic queries using a fine-tuned taxonomy generator (TaG) anchored in real-world evidence, addressing limitations in sociotechnical nuance for AI evaluation. The approach leverages granular taxonomic expansion to create datasets that elicit up to 5× higher failure rates in four mainstream LLMs (e.g., Claude 4.5 Haiku) compared to human-authored benchmarks, while exposing deficiencies in guard models like Llama-Guard-3. The authors open-source their prototype and datasets for scalable safety interventions.
synthetic datataxonomy generatorllm evaluationsociotechnical nuanceguard models
Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax
We propose semantic-space alignment via Group Relative Policy Optimization (GRPO) to extend large language models (LLMs) to low-resource languages without incurring catastrophic forgetting. Unlike supervised fine-tuning (SFT), which enforces token-level imitation, our method optimizes LLMs using embedding-level semantic rewards, enabling flexible meaning preservation while reducing interference with pretrained knowledge. Evaluations on Tibetan-Chinese machine translation and Tibetan headline generation demonstrate that semantic RL mitigates alignment tax, preserves general competence, and yields higher semantic quality in open-ended generation compared to SFT. Few-shot transfer results indicate that semantic RL learns more transferable and robust representations under limited supervision.
semantic-space alignmentgroup relative policy optimizationalignment taxlow-resource languagessemantic rewards
MoRe: Modular Representations for Principled Continual Representation Learning on Squantial Data
MoRe introduces a framework for continual representation learning that identifies modularity directly in representations rather than architectures, addressing the challenge of adapting to new data while preserving existing knowledge. The method decomposes knowledge into hierarchical fundamental and specific modules with identifiability guarantees, enabling principled module reuse, alignment, and expansion. Experiments on synthetic benchmarks and real-world LLM activations demonstrate interpretable hierarchical structure and improved plasticity-stability trade-offs, suggesting MoRe as a principled foundation for continual adaptation.
modularitycontinual learningrepresentation learningidentifiabilityplasticity-stability trade-off
Randomized Atomic Feature Models for Physics-Informed Identification of Dynamic Systems
The paper introduces a physics-informed system identification framework using randomized stable atomic features, representing impulse responses as random superpositions of damped complex exponentials with poles sampled within a prescribed disk. The method formulates identification as a convex regularized least-squares problem with optional linear, second-order-cone, and KYP constraints, generalizing random Fourier and Laplace features to damped nonstationary regimes. Theoretical contributions include a Disk-Bochner operator-theoretic viewpoint, RKHS-to-l1 embedding, sparse-recovery guarantees, and connections to Nevanlinna-Pick interpolation. Numerical results demonstrate improved constrained impulse-response recovery under poor excitation by incorporating physical priors like stability margins and passivity constraints.
atomic feature modelssystem identificationdisk-bochnerkyp constraintsnevanlinna-pick interpolation
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
We introduce Distributionally Robust Adaptive Task Sampling (DRATS), a multi-task reinforcement learning algorithm addressing imbalanced data allocation in task sampling. DRATS formalizes MTRL as a feasibility problem, deriving a minimax objective to minimize the worst-case return gap by adaptively prioritizing tasks furthest from being solved. Evaluated on MetaWorld-MT10 and MT50 benchmarks, DRATS improves data efficiency and enhances worst-task performance compared to existing task sampling approaches.
multi-task reinforcement learningdistributionally robustadaptive task samplingminimax objectiveworst-case return gap
Exemplar Partitioning for Mechanistic Interpretability
Exemplar Partitioning (EP) is introduced as an unsupervised method for constructing interpretable feature dictionaries from large language model activations, requiring ∼10³× fewer tokens than sparse autoencoders (SAEs). EP builds Voronoi partitions of activation space via leader-clustering streamed activations within a distance threshold, with each region anchored by an observed exemplar. EP dictionaries enable causal interventions and cross-checkpoint comparisons, demonstrating interpretability and utility in Gemma-2-2B. EP achieves mean AUROC 0.881 on AxBench latent concept detection, outperforming GemmaScope SAE by +0.126 and nearing SAE-A's 0.911, with significantly reduced computational cost.
exemplar partitioningvoronoi partitionsparse autoencoderscausal interventionslatent concept detection
Nearest-Neighbor Radii under Dependent Sampling
The paper establishes theoretical guarantees for nearest-neighbor radii under dependent sampling, extending classical analyses beyond independent assumptions. It examines strong mixing dependent observations, proving distribution-free almost sure convergence under polynomial mixing and deriving sharp non-asymptotic moment bounds under geometric mixing. Notably, these bounds depend on the local intrinsic dimension rather than the ambient dimension, enhancing applicability to high-dimensional data concentrated near lower-dimensional manifolds. Synthetic experiments and real-world time-series benchmarks validate the theory, demonstrating that nearest-neighbor geometry remains informative under dependent sampling.
nearest-neighbor radiidependent samplingstrong mixinglocal intrinsic dimensiongeometric mixing
Guided Diffusion Sampling for Precipitation Forecast Interventions
The authors propose a gradient-based guidance framework for precipitation-reduction interventions in diffusion-based weather forecasting models, addressing the unexplored potential of perturbation-based weather control. Instead of direct atmospheric state perturbations, the method steers diffusion sampling trajectories to reduce precipitation while maintaining atmospheric distribution consistency. Physical plausibility is assessed through vertical/variable-wise perturbation profiles, latent-space trajectory deviation, and cross-model transferability. Experiments on extreme precipitation events from WeatherBench2 demonstrate effective precipitation reduction and more physically plausible interventions compared to adversarial perturbations.
diffusion samplingweather forecastingperturbation-based interventionsgradient-based guidancephysical plausibility
Language-Induced Priors for Domain Adaptation
We introduce Language-Induced Priors (LIP), a probabilistic framework leveraging expert textual descriptions and pretrained Large Language Models to address domain adaptation in cold-start scenarios. The method translates semantic descriptions into a choice model, integrated with Expectation-Maximization to identify relevant source domains when target data is scarce. Theoretical analysis shows the estimator achieves oracle cold-start MSE under correct priors while maintaining asymptotic consistency. Empirical validation spans Gaussian estimation, C-MAPSS dataset prediction, and MuJoCo hopper control tasks, demonstrating LIP's effectiveness across descriptive, predictive, and prescriptive domains.
domain adaptationlanguage-induced priorsexpectation-maximizationcold-startlarge language models
Minimal-Intervention KV Retention: A Design-Space Study and a Diversity-Penalty Survivor
We propose $α$, a minimal-intervention KV-cache retention mechanism that replaces argmax-top-$k$ with greedy facility-location-inspired selection under a V-space redundancy penalty controlled by $λ$. This modification is evaluated on long-form mathematical reasoning (MATH-500) using distilled-reasoning models (Qwen-7B and Llama-8B variants of DeepSeek-R1-Distill) at budgets $b \in \{64, 128\}$. $α$ outperforms seven alternative mechanisms across five design-space families, clearing Bonferroni correction in two of four (model, budget) configurations (Qwen $b{=}128$ and Llama $b{=}64$) with $λ= 0.5$, while showing no significant negative impacts. The study demonstrates that a lightweight scoring modification can surpass more extensive structural redesigns in low-budget KV-cache compression.
kv-cacheretention mechanismfacility-locationv-spacebonferroni
ForcingDAS: Unified and Robust Data Assimilation via Diffusion Forcing
The paper introduces ForcingDAS, a unified data assimilation framework using Diffusion Forcing to address error accumulation in non-Markovian observations and regime-specific limitations. By learning a joint-trajectory prior with frame-specific noise levels, it captures long-horizon dependencies and spans filtering to smoothing without retraining. Evaluated on 2D Navier-Stokes vorticity, precipitation nowcasting, and atmospheric state estimation, ForcingDAS outperforms specialized baselines, particularly in real-world weather benchmarks.
data assimilationdiffusion forcingnon-markovianjoint-trajectory priornowcasting
Smooth Multi-Policy Causal Effect Estimation in Longitudinal Settings
The paper introduces Policy-Encoded Q Network (PEQ-Net), a method for joint estimation of multiple dynamic treatment policies in longitudinal settings, addressing structural second-order bias in conventional separate estimation. PEQ-Net employs a shared policy encoder trained with kernel mean embeddings to enable information sharing across counterfactuals, coupled with longitudinal targeted maximum likelihood estimation (LTMLE) for debiasing. Theoretical analysis shows the approach constrains second-order remainder terms, reducing finite-sample variance. Semi-synthetic experiments demonstrate PEQ-Net outperforms existing iterative conditional expectation methods, achieving significant reductions in root-mean-square error, particularly for closely related policies.
longitudinal causal inferencepolicy-encoded q networkkernel mean embeddingsiterative conditional expectationtargeted maximum likelihood estimation
TILT: Target-induced loss tilting under covariate shift
Target-Induced Loss Tilting (TILT) is introduced for unsupervised domain adaptation under covariate shift, leveraging a novel objective function that decomposes the source predictor into $f+b$. The method fits $f+b$ on labeled source data while penalizing the auxiliary component $b$ on unlabeled target inputs, deploying $f$ as the final target predictor. Theoretical analysis shows that the target-side penalty induces relative importance weighting via a self-localized estimand $b^*_f$, with bounded behavior across source-target pairs. A finite-sample oracle inequality is proven, and experiments on regression tasks and shifted CIFAR-100 distillation demonstrate TILT's superior target-domain performance over baselines, with stable regularization dependence.
covariate shiftunsupervised domain adaptationimportance weightingfinite-sample oracle inequalityregularization
Training-Free Generative Sampling via Moment-Matched Score Smoothing
The authors propose moment-matched score-smoothed overdamped Langevin dynamics (MM-SOLD), a training-free interacting particle sampler for generative sampling that enforces target moments throughout the sampling trajectory. MM-SOLD leverages moment-matching to ensure the empirical particle density converges to a deterministic limit with a Gibbs--Boltzmann stationary marginal, whose mean and covariance match the training data's empirical moments. Experiments on 2D distributions and latent-space image generation demonstrate that MM-SOLD achieves competitive sample fidelity and diversity with neural diffusion baselines while enabling fast, robust, CPU-based sampling without training.
moment-matchingoverdamped langevin dynamicsscore smoothinggibbs-boltzmann densitytraining-free sampling
On the Burden of Achieving Fairness in Conformal Prediction
The study investigates fairness in conformal prediction by analyzing population score distributions in split conformal calibration. It derives a conservation law and lower bound, demonstrating that pooled calibration induces irreducible group-wise coverage distortion proportional to cross-group quantile heterogeneity. The work establishes a fundamental tension between Equalized Coverage and Equalized Set Size, two leading fairness definitions, and quantifies the trade-off between separate and pooled group treatment policies. Experiments on synthetic and real data confirm this bidirectional trade-off persists in finite-sample calibration. The findings reveal that calibration choice does not eliminate cross-group heterogeneity but determines whether distortion manifests in coverage or set size dimensions.
conformal predictioncoverage distortionquantile heterogeneityequalized coveragesplit conformal calibration
Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology
This study investigates the spectral geometry of transformer residual streams in large language models (LLMs) through full Jacobian eigendecomposition across three production-scale models. The analysis reveals a learned monotonic spectral gradient from non-normal, rotation-dominated early layers to near-symmetric late layers, accompanied by a cumulative low-rank bottleneck that concentrates perturbations in few dimensions. Experiments demonstrate these features are training-induced rather than architectural, with topological community positioning predicting Jacobian amplification/suppression effects based on local operator types. The findings establish a link between perturbation propagation, dimensional collapse, and functional topology in LLMs.
jacobian eigendecompositionspectral geometryresidual streamnon-normal layerslow-rank bottleneck
Architecture-Aware Explanation Auditing for Industrial Visual Inspection
The paper introduces an architecture-aware explanation audit protocol based on the native-readout hypothesis, which posits that explanation faithfulness is bounded by structural proximity to a model's native decision mechanism. Experiments on WM-811K wafer maps (9 classes, 172k images) using a zero-fill perturbation protocol reveal that ViT-Tiny + Attention Rollout achieves Deletion AUC 0.211, compared to 0.432-0.525 for Swin-Tiny, ResNet18+CBAM, and DenseNet121 + Grad-CAM (abs(Cohen's d) > 1.1). Results demonstrate that readout structure, not architecture family, determines explainer compatibility, with RISE outperforming native methods (Deletion AUC ≈0.1). A blur-fill sensitivity analysis shows faithfulness rankings depend on model, explainer, and perturbation operator triples. The protocol emphasizes co-designing explanation pathways with architectures and reporting quantitative faithfulness metrics.
native-readout hypothesisdeletion aucperturbation protocolexplanation faithfulnessreadout structure
Towards Real-Time Autonomous Navigation: Transformer-Based Catheter Tip Tracking in Fluoroscopy
A transformer-based pipeline for real-time catheter tip tracking in fluoroscopy was developed to enable autonomous navigation in mechanical thrombectomy. The multi-threaded framework integrates frame reading, preprocessing, inference, and post-processing, employing deep learning segmentation models including U-Net, U-Net+Transformer, and SegFormer. Two-class and three-class formulations were evaluated, with post-processing involving component filtering, skeletonization, and path following. On moderate complexity fluoroscopic data, the two-class SegFormer achieved a mean absolute error of 4.44 mm, outperforming other models. The system also improved Dice scores by up to +5% on segmentation benchmarks compared to CathAction, demonstrating stable performance under challenging imaging conditions.
fluoroscopysegmentationtransformerskeletonizationthrombectomy
Not All Timesteps Matter Equally: Selective Alignment Knowledge Distillation for Spiking Neural Networks
Selective Alignment Knowledge Distillation (SeAl-KD) improves Spiking Neural Network (SNN) performance by addressing the limitations of uniform timestep alignment in existing knowledge distillation methods. SeAl-KD selectively aligns class-level and temporal knowledge by equalizing competing logits at erroneous timesteps and reweighting temporal alignment based on confidence and inter-timestep similarity. This approach preserves useful temporal dynamics while correcting errors. Extensive experiments on static image and neuromorphic event-based datasets demonstrate consistent improvements over existing distillation methods. The code is publicly available.
spiking neural networksknowledge distillationtemporal alignmentlogit equalizationneuromorphic datasets
EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization
EnergyLens introduces an end-to-end framework for energy-aware optimization of multi-GPU LLM inference, addressing the lack of tools for predicting energy footprints without exhaustive profiling. The framework employs an einsum-based interface to model LLM specifications, including fusion, parallelism, and compute-communication overlap, alongside load-imbalance-aware MoE modeling and an empirical multi-GPU communication energy model. Validated on Llama3 and Qwen3-MoE across tensor-parallel and expert-parallel configurations, EnergyLens achieves MAPEs of 9.25%-13.19% for prefill and decode energy, and 12.97% for SM allocations in Megatron-style overlap. It identifies up to 1.47x and 52.9x energy variations across configurations, demonstrating the necessity of distributed serving and correctly identifying Pareto-optimal overlap configurations.
energy-aware optimizationmulti-gpu inferencecompute-communication overlapload-imbalance-aware modelingpareto-optimal configurations
Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability
The paper proposes action-conditioned risk gating, a lightweight reinforcement learning approximation for risk-sensitive control under partial observability. The method constructs finite-history proxy states and learns action-conditioned predictors of near-term safety violations, using them both as risk penalties during training and decision-time gates that interpolate between optimistic and conservative value estimates. Evaluated on automated glucose regulation and Safety-Gym navigation, it improves glycemic tradeoffs (adolescent/adult cohorts) and reward-cost balance while reducing runtime versus belief-space planning baselines, demonstrating effectiveness when full POMDP planning is impractical.
risk-sensitive controlpartial observabilitysafety-critical systemsreinforcement learningpomdp
Artificial Intelligence-Assistant Cardiotocography: Unified Model for Signal Reconstruction, Fetal Heart Rate Analysis, and Variability Assessment
The study presents FHrCTG, an AI model for fetal heart rate (FHR) monitoring that addresses noise interference and signal reconstruction challenges. The model combines unsupervised pre-training on 558,412 unlabeled data points with supervised fine-tuning on 7,266 expert-labeled samples, employing a novel Intersection Overlapping Labels (IOL) method to transform rate analysis into categorical classification. Evaluation shows strong performance in detecting FHR decelerations (89.13% sensitivity, 87.78% specificity) and accelerations (62.5% sensitivity, 92.04% specificity), with AUC scores of 0.7214 (periodicity) and 0.9643 (amplitude variation) by Fischer's clinical criteria.
fetal heart ratesignal reconstructionintersection overlapping labelsclinical validationnoise mitigation
Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents
LQM-ContextRoute introduces a latency-quality matching approach for routing functionally equivalent tools in LLM agents, addressing the provider-routing problem under runtime load. The method employs a contextual bandit router that ranks providers by expected answer quality per service cycle, integrating query-specific quality estimation and LLM-as-judge feedback for online adaptation. Evaluations on web-search benchmarks demonstrate improvements: +2.18 pp F1 over SW-UCB, +18 pp accuracy in StrategyQA, and +2.91--+3.22 pp NDCG in heterogeneous retriever pools. Results highlight the efficacy of treating latency as service capacity in high-heterogeneity settings.
latency-quality matchingcontextual banditllm agentsservice capacityprovider-routing
Paraphrasing Attack Resilience of Various AI-Generated Text Detection Methods
This work evaluates the resilience of AI-generated text detection methods against paraphrasing attacks, highlighting a performance-resilience tradeoff. Three approaches are assessed: fine-tuned RoBERTa, Binoculars, and text feature analysis, with ensembles constructed using Random Forest classifiers. Results indicate that Binoculars-inclusive ensembles achieve the strongest detection performance but exhibit the greatest vulnerability to paraphrasing attacks. The study underscores the dichotomy between detection accuracy and attack resilience, complicating the reliability assessment of state-of-the-art techniques in the context of LLM-generated text proliferation and detector bypassing tools.
paraphrasing attackrobertabinocularsrandom forestllm-generated text
Active Learners as Efficient PRP Rerankers
The paper reframes Pairwise Ranking Prompting (PRP) as an active learning problem to address noise and intransitivity in LLM-derived pairwise preferences. It introduces a noise-robust framework with a randomized-direction oracle that uses single LLM calls per pair, converting systematic bias into zero-mean noise. Active rankers outperform classical sorting in call-constrained regimes, improving NDCG@10 efficiency without bidirectional call overhead.
pairwise ranking promptingactive learningndcg@10noise-robustrandomized-direction oracle
Quantum Advantage in Multi Agent Reinforcement Learning
This work empirically demonstrates quantum advantage in multi-agent reinforcement learning (QMARL) through entanglement-mediated coordination. The authors evaluate a decentralized QMARL framework with variational quantum circuit (VQC) actors sharing entangled states, contrasting entangled and unentangled configurations. In the CHSH game, entangled QMARL agents achieve a win rate approaching the Tsirelson bound (0.854), surpassing the classical ceiling (0.75), while unentangled quantum circuits match classical performance. On cooperative navigation, QMARL without entanglement achieves ~2× success rate improvement over classical MAA2C (0.85 vs. 0.40), with hybrid quantum-classical configurations outperforming fully classical and quantum solutions. The study also analyzes the impact of specific Bell state entanglement structures on coordination.
quantum entanglementmulti-agent reinforcement learningvariational quantum circuittsirelson boundbell state
AudioMosaic: Contrastive Masked Audio Representation Learning
AudioMosaic introduces a contrastive learning-based audio encoder for general audio understanding, addressing limitations of generative approaches in audio self-supervised learning. The method constructs positive pairs via structured time-frequency masking on spectrogram patches, enabling efficient large-batch training with reduced memory usage. Experiments demonstrate state-of-the-art performance on multiple audio benchmarks under linear probing and fine-tuning, with improved transferability across datasets, domains, and acoustic conditions. Integration into audio-language models enhances performance on audio-language tasks. Code is publicly available.
contrastive learningaudio encodertime-frequency maskingspectrogram patchesaudio-language models
Self-Regulated Learning in Essay Writing: Consistency of Strategies and Impact on Outcomes
This study investigates self-regulated learning (SRL) strategies in secondary school students during online essay writing, their temporal evolution, and impact on learning outcomes. Using metacognition-related trace data from 93-95 students across two sessions in Colombian schools, the authors employed process mining and unsupervised machine learning to identify dominant SRL strategies. Three strategies emerged, with variability observed: many students transitioned to 'Read first, write next,' while none used 'Write intensively, read selectively' in the second session. Notably, the latter strategy, though less common, showed a positive association with learning outcomes.
self-regulated learningprocess miningmetacognitiononline essay writinglearning outcomes
DT-Transformer: A Foundation Model for Disease Trajectory Prediction on a Real-world Health System
We present DT-Transformer, a foundation model for disease trajectory prediction trained on 57.1M structured EHR entries from 1.7M patients across 11 hospitals and outpatient clinics in the Mass General Brigham health system. The model addresses limitations of single-hospital datasets and research cohorts by leveraging multi-hospital data to capture real-world clinical complexity. DT-Transformer achieves strong discrimination in held-out and prospective validation, with a median age- and sex-stratified AUC of 0.871 across 896 disease categories in next-event prediction, demonstrating the viability of health system-scale training for clinical forecasting foundation models.
foundation modeldisease trajectory predictionelectronic health recordsmulti-hospital datasetnext-event prediction
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
The work identifies Training-Inference Mismatch (TIM) as a critical instability factor in LLM reinforcement learning, where numerical discrepancies between rollout generation and policy optimization stages cause training collapse. Using the zero-mismatch diagnostic framework VeXact, the authors isolate TIM from confounding factors like off-policy drift, demonstrating its systemic impact on optimization dynamics. Results show TIM alters the effective optimization problem and propose mitigation strategies, establishing it as a first-order concern in LLM RL stability rather than benign noise.
training-inference mismatchllm reinforcement learningoff-policy driftpolicy optimizationnumerical instability
PreFT: Prefill-only finetuning for efficient inference
The paper introduces PreFT (Prefill-only Finetuning), a parameter-efficient finetuning method that applies adapters only during prefill phase to improve inference throughput for multi-adapter serving. The authors implement prefill-only versions of LoRA and ReFT on vLLM, demonstrating 1.9× higher throughput when serving 512 adapters on Llama 3.1 70B compared to traditional PEFTs. While PreFT shows slightly higher evaluation loss than all-token PEFTs in supervised finetuning, performance gaps can be mitigated by increasing rank without throughput penalty, and reinforcement learning tasks achieve near-parity with standard PEFTs.
prefill-only finetuningparameter-efficient finetuningmulti-adapter servinginference throughputvllm
GenCircuit-RL: Reinforcement Learning from Hierarchical Verification for Genetic Circuit Design
The paper presents GenCircuit-RL, a reinforcement learning framework for genetic circuit design that employs hierarchical verification rewards and a four-stage curriculum. The method decomposes correctness into five levels, from code execution to topological checks, and evaluates on SynBio-Reason, a benchmark of 4,753 circuits across six types and nine tasks. Results show hierarchical verification improves task success by 14-16 percentage points over binary rewards, with curriculum learning essential for performance. The framework generates topologically correct circuits, generalizes to novel parts, and rediscovers canonical designs.
genetic circuit designreinforcement learninghierarchical verificationsynthetic biologycurriculum learning
ASH: Agents that Self-Hone via Embodied Learning
The paper introduces ASH, an agentic system for long-horizon embodied learning that autonomously improves through self-honing without hand-engineered rewards or expert annotations. ASH employs an Inverse Dynamics Model (IDM) trained on its own trajectories to extract supervision from unlabeled internet video, coupled with unsupervised learning to identify and retain key moments as long-term memory. Evaluated on Pokemon Emerald and The Legend of Zelda: The Minish Cap, ASH achieves 11.2/12 and 9.9/12 milestones respectively, outperforming behavioral cloning and retrieval-augmented baselines (6.5/12 and 6.0/12) in 8-hour tasks.
embodied learninginverse dynamics modellong-horizon planningself-improving agentsunsupervised video retrieval
Towards Fine-Grained and Verifiable Concept Bottleneck Models
The authors propose a fine-grained Concept Bottleneck Model (CBM) framework that grounds each human-interpretable concept in localized visual evidence, enabling direct verification of concept encoding. Unlike standard CBMs, their method validates both concept presence and correctness through visual grounding, bridging interpretability with verifiability. Experiments on medical imaging benchmarks demonstrate that the approach maintains predictive performance comparable to standard CBMs while significantly improving transparency and reliability. The framework establishes a principled mechanism for human-model interaction at the concept level, enhancing trustworthiness for clinical applications.
concept bottleneck modelsinterpretabilityverifiabilityvisual groundingmedical imaging
How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
The paper introduces a principled framework for scaling Mixture-of-Experts (MoE) architectures by analyzing three scaling regimes: (I) co-scaling network width $N$ and expert width $N_e$, (II) co-scaling $N$, number of experts $M$, and sparsity $K$, and (III) full proportional scaling of $N$, $N_e$, $M$, and $K$. Using Dynamical Mean Field Theory (DMFT), the authors derive a Maximally Scale-Stable Parameterization (MSSP) for SGD and Adam that ensures robust learning-rate transfer and monotonic improvement with scale. Experiments confirm MSSP's effectiveness across all regimes, providing a complete scaling prescription for MoE architectures.
mixture-of-expertsscaling regimesdynamical mean field theorymaximally scale-stable parameterizationlearning-rate transfer
Stochastic Matching via Local Sparsification
We introduce a two-stage local sparsification framework for online stochastic matching, addressing bandwidth constraints in decentralized systems like ride-hailing and cloud computing. Our method first prunes compatibility sets to a budget of k edges locally, then optimizes global matching centrally using a strategy parametrized by a fractional solution's spread. Theoretical analysis shows that sufficient spread preserves the expected maximum matching size globally. Empirical evaluation on New York City ride-hailing data and synthetic benchmarks demonstrates near-optimal global matching with constrained local budgets, outperforming standard online baselines.
stochastic matchinglocal sparsificationcompatibility setsfractional solutiondecentralized systems
LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling
This work introduces a metacognitive harness that leverages large language models' (LLMs) latent metacognitive signals for test-time control, improving reasoning without parameter updates or fine-tuning. Inspired by the Nelson-Narens theory, the harness separates monitoring from reasoning by eliciting pre-solve feeling-of-knowing (FOK) and post-solve judgment-of-learning (JOL) signals, using them to decide when to trust, retry, or aggregate solutions. Evaluated on text, code, and multimodal reasoning benchmarks, the harness raises pooled accuracy from 48.3 to 56.9 for Claude Sonnet-4.6, outperforming leaderboard entries on HLE-Verified, LiveCodeBench v6, and R-Bench-V. Results suggest LLMs possess metacognitive ability but require explicit control mechanisms to act on it.
metacognitive harnessfeeling-of-knowingjudgment-of-learningtest-time controlreasoning benchmarks
CSI-JEPA: Towards Foundation Representations for Ubiquitous Sensing with Minimal Supervision
CSI-JEPA introduces a self-supervised predictive representation learning framework for label-efficient Wi-Fi sensing using channel state information (CSI). The method tokenizes CSI amplitude windows along time and subcarrier dimensions, employs channel variation-aware masking to select predictive targets, and freezes the pretrained encoder for downstream tasks with lightweight adapters. Evaluated on seven real-world tasks, CSI-JEPA achieves up to 10.64 percentage points mean accuracy improvement over supervised Transformers and reduces label requirements by 98.0%.
channel state informationself-supervised learningrepresentation learningwi-fi sensingtransformer
Finite Sample Bounds for Learning with Score Matching
This work establishes the first non-asymptotic sample complexity bounds for learning the structure of continuous exponential family distributions with unbounded support using score matching. The analysis focuses on exponential families of polynomials, contrasting with prior asymptotic results. The derived bounds exhibit polynomial dependence on the model dimension, providing theoretical guarantees for finite-sample scenarios. This addresses a gap in understanding the statistical properties of score matching, which has gained popularity as a computationally efficient alternative to maximum likelihood estimation for continuous variables in high-dimensional statistics.
score matchingexponential familysample complexitynon-asymptotic analysispolynomial dependence
Uncovering Trajectory and Topological Signatures in Multimodal Pediatric Sleep Embeddings
This work demonstrates that augmenting multimodal pediatric sleep embeddings with geometric, topological, and clinical features improves diagnostic performance and calibration. A multimodal masked autoencoder embeds 30-second PSG epochs, which are enhanced with PHATE-derived coordinates, persistent homology summaries, whole-night movement descriptors, and EHR data. Linear and MLP models reveal complementary gains from these features, with late-fusion models achieving AUPRC improvements from 0.26 to 0.34 (desaturation), 0.31 to 0.48 (EEG arousal), 0.09 to 0.22 (hypopnea), and 0.05 to 0.14 (apnea). The full fusion model also yields the best calibration across all tasks, as measured by Brier score and Expected Calibration Error.
multimodal embeddingspersistent homologymasked autoencoderlate-fusioncalibration
A Systematic Evaluation of Imbalance Handling Methods in Biomedical Binary Classification
This study systematically evaluates the impact of imbalance handling methods (IHMs) on biomedical binary classification across varying model complexities and data modalities. Five IHMs—random undersampling (RUS), random oversampling (ROS), SMOTE, re-weighting (RW), and direct F1-score optimization (DMO)—were tested against a raw training baseline using three datasets (MIMIC-III, ADE-Corpus-V2, MURA) and diverse models (logistic regression, random forest, MLP, BiLSTM, BERT, DenseNet, DINOv2). Results indicate that ROS and RW consistently improved performance for complex models, while DMO was effective for unstructured text and image data. RUS and SMOTE degraded performance and are not recommended. IHM effectiveness depends on model complexity and data modality.
imbalance handling methodsbiomedical classificationmodel complexitydata modalitiesf1-score optimization
bde: A Python Package for Bayesian Deep Ensembles via MILE
The bde Python package implements Bayesian Deep Ensembles for tabular data using Microcanonical Langevin Ensembles (MILE), a sampling-based inference method. Built on JAX for efficiency, it offers scikit-learn compatible estimators supporting regression and classification tasks. Key features include fast training via MCMC sampling and built-in uncertainty quantification capabilities.
bayesian deep ensemblesmicrocanonical langevin ensemblesjax implementationuncertainty quantificationtabular data
To discretize continually: Mean shift interacting particle systems for Bayesian inference
The authors introduce mean shift interacting particle systems, a novel method for approximating expectations against probability distributions with unnormalized densities in Bayesian inference. The approach extends the classical mean shift algorithm and optimal quantization techniques to continuous distributions, minimizing maximum mean discrepancy (MMD) via gradient-free or gradient-informed dynamics invariant to normalizing constants. The method demonstrates rapid convergence, handles anisotropy and multi-modality, avoids mode collapse, and scales effectively to high dimensions. Empirical evaluations on diverse benchmarks, including multi-modal mixtures, Bayesian hierarchical models, and PDE-constrained inverse problems, validate its performance.
mean shiftmaximum mean discrepancybayesian inferenceinteracting particle systemsoptimal quantization
Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)
We propose a reinforcement learning (RL) pipeline for improving tool-calling agents' performance on multi-turn reasoning over Fast Healthcare Interoperability Resources (FHIR) graphs. Our approach post-trains a CodeAct agent using RL with execution-grounded rewards from a LLM Judge, enforcing data-integrity constraints while reasoning over structured clinical data. Compared to prompt-based baselines, RL post-training improves answer correctness from 50% to 77% on FHIR-AgentBench using the smaller Qwen3-8B model, demonstrating reliable gains in multi-turn reasoning over real-world hospital data.
fast healthcare interoperability resourcesreinforcement learningmulti-turn reasoningtool-calling agentsstructured graph
Mini-JEPA Foundation Model Fleet Enables Agentic Hydrologic Intelligence
The paper introduces Mini-JEPAs, a fleet of small, sensor-specialized Joint Embedding Predictive Architecture (JEPA) foundation models for hydrologic intelligence, addressing limitations of generalist planetary-scale models like Google AlphaEarth. Five 22M-parameter Mini-JEPAs with shared Vision Transformer backbones and 64-d output spaces were pretrained on Sentinel-2 optical, Sentinel-1 SAR, MODIS thermal, multi-temporal Sentinel-2 phenology, and topography-soil data, achieving cross-validated R² up to 0.97 for elevation and temperature. A router LLM selects appropriate sensors with perfect accuracy, and dual retrieval over AlphaEarth and Mini-JEPAs outperforms AlphaEarth alone on physics-matched questions (Cohen's d = 1.10, p = 0.031).
joint embedding predictive architecturevision transformerhydrologic intelligencesensor-specializedrouter llm
ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows
ProtoMedAgent introduces a privacy-aware framework for multimodal clinical interpretability, addressing retrieval sycophancy in standard RAG systems by formalizing clinical reporting as a zero-gradient test-time optimization problem. The method employs a neuro-symbolic bottleneck, distilling latent visual and tabular features into discrete semantic memory, constrained by set-theoretic differentials and a Scribe-Critic loop. A semantic privacy gate enforces $k$-anonymity and $\ell$-diversity. Evaluated on 4,160 patients, ProtoMedAgent achieves 91.2% Comparison Set Faithfulness, outperforming standard RAG (46.2%), and reduces artifact-level membership inference risks by 9.8%.
interpretable prototype networksretrieval-augmented generationneuro-symbolic bottleneckk-anonymitymembership inference
📰 Industry Media (9)
How Chinese short dramas became AI content machines
Chinese short drama companies are leveraging generative AI to optimize content production, reducing costs by 80-90% and shortening production timelines from 3-4 months to under one month. Platforms like FlexTV and Kunlun Tech employ AI tools such as Google’s Nano Banana and ByteDance’s Seedance to automate tasks traditionally handled by camera crews and visual effects teams, relying instead on AI asset curators to translate scripts into prompts. This shift has enabled the release of 470 AI-generated short dramas daily, with Kunlun Tech aiming for 20% AI-produced content on its platforms. The global microdrama market is projected to grow from $11 billion in 2025 to $14 billion by 2026.
generative aiai asset curatormicrodrama marketproduction timelinesprompt engineering
Zyphra Releases ZAYA1-8B-Diffusion-Preview: The First MoE Diffusion Model Converted From an Autoregressive LLM With Up to 7.7x Speedup
Zyphra introduces ZAYA1-8B-Diffusion-Preview, the first MoE diffusion model converted from an autoregressive LLM, achieving up to 7.7x inference speedup on AMD hardware. The method leverages a TiDAR-based conversion pipeline, reusing the pretrained ZAYA1-8B-base checkpoint with 600B tokens of diffusion mid-training and 500B tokens of context extension to 128k. Results show minimal evaluation degradation (4.6x-7.7x speedup via two samplers) by sharing KV-cache across 16-token blocks, shifting decoding from memory-bandwidth to compute-bound.
moe diffusionkv-cachetidar recipeautoregressive conversionorder constrained generation
Best AI Agents for Software Development Ranked: A Benchmark-Driven Look at the Current Field
The article benchmarks AI coding agents for software development, focusing on SWE-bench Pro and Terminal-Bench 2.0 as key metrics. It highlights methodological shifts, including OpenAI's discontinuation of SWE-bench Verified due to test case flaws and training data contamination. Results show Claude Opus 4.7 leading SWE-bench Pro at 64.3%, while GPT-5.5 dominates Terminal-Bench 2.0 at 82.7%. The analysis emphasizes scaffold and harness variations affecting scores, with Claude Code excelling in multi-file tasks and GPT-5.5 in terminal-native workflows.
swe-bench proterminal-bench 2.0claude opus 4.7gpt-5.5agent scaffold
Supertone Releases Supertonic v3: On-Device Text-to-Speech Model with 31-Language Support, Fewer Reading Failures, and Expression Tags
Supertone introduces Supertonic v3, an ONNX-based on-device text-to-speech (TTS) model supporting 31 languages, reducing repeat and skip failures, and incorporating expressive tags for prosodic cues. The architecture employs a speech autoencoder, flow-matching text-to-latent module, and duration predictor, achieving efficient inference in 2 steps via Length-Aware Rotary Position Embedding (LARoPE) and Self-Purifying Flow Matching. With 99M parameters and 404MB ONNX assets, Supertonic v3 outperforms larger TTS models like VoxCPM2 in WER/CER metrics, achieving an average RTF of 0.3x on edge hardware. It handles complex text normalization without preprocessing, excelling in financial expressions, dates, and technical units.
onnxtext-to-speechflow-matchinglaropewer
How to Build a Django-Unfold Admin Dashboard with Custom Models, Filters, Actions, and KPIs
The tutorial demonstrates constructing an advanced Django-Unfold admin dashboard with e-commerce functionality. It implements custom models (Category, Product, Customer, Order), configures Django-Unfold's theme and navigation system, and integrates dynamic KPIs through dashboard callbacks. The method involves setting up a Django project with SQLite backend, defining model relationships and business logic, and extending the admin interface with filters (MultipleChoicesDropdownFilter, RangeNumericFilter), actions, and tabbed navigation. The resulting dashboard features real-time metrics, collapsible sidebar sections, and a modern UI with configurable color schemes.
django-unfoldadmin dashboardmodeladminsqlite backendkpi visualization
Poetiq’s Meta-System Automatically Builds a Model-Agnostic Harness That Improved Every LLM Tested on LiveCodeBench Pro Without Fine-Tuning
Poetiq's Meta-System introduces a model-agnostic inference harness that improves LLM performance on LiveCodeBench Pro without fine-tuning or internal model access. The system automatically constructs task-specific orchestration layers through recursive self-improvement, optimizing prompt strategies, output assembly, and solution validation. Evaluations show GPT-5.5 High improved from 89.6% to 93.9%, Gemini-3.1 Pro from 78.6% to 90.9%, and Kimi-K2.6 by ~30 percentage points (50.0%→79.9%), with consistent gains across all tested models including Nemotron-3-120B (+12.8%).
meta-systemlivecodebench promodel-agnosticinference harnessrecursive self-improvement
A Coding Implementation to Master GPU Computing with CuPy, Custom CUDA Kernels, Streams, Sparse Matrices, and Profiling
The tutorial demonstrates advanced GPU computing techniques using CuPy, providing a comprehensive workflow from hardware introspection to performance optimization. Methodologically, it covers CuPy installation, memory management with pools, custom kernel development (elementwise, reduction, raw CUDA), concurrent execution via streams, sparse/dense linear algebra, image processing, and interoperability through DLPack. Benchmark results show 4096×4096 matrix multiplication achieving 7.1ms execution time (vs CPU) and 2^21-point FFT completing in 3.2ms, with detailed profiling via CUDA events and kernel fusion optimizations.
cupycuda kernelsmemory poolsdlpackkernel fusion
Cline Releases Cline SDK: An Open-Source Agent Runtime Now Powering Its CLI and Kanban, With IDE Extensions Being Migrated
Cline SDK introduces a modular, open-source TypeScript runtime for AI coding agents, enabling stateless, durable, and portable agent execution across environments. The SDK abstracts core functionalities into layered packages (@cline/shared, @cline/llms, @cline/agents, @cline/core), facilitating LLM provider switching, tool orchestration, and multi-agent coordination without external orchestration layers. Benchmarks demonstrate improved performance, with Cline CLI achieving 74.2% accuracy on Claude Opus 4.7, surpassing Claude Code's 69.4%. The SDK supports plugins, custom tools, and MCP connectors, offering extensibility for domain-specific behaviors. Node.js 22+ is required, and the runtime is licensed under Apache 2.0.
agent runtimetypescript sdkmulti-agent coordinationllm providerplugin architecture
Deloitte: Scale ‘autonomous intelligence’ for real growth
Deloitte proposes scaling 'autonomous intelligence' beyond generative AI to achieve enterprise growth, defining it as the third stage in an intelligence maturity curve where AI systems independently execute multi-step workflows within defined boundaries. The method involves forensic decision audits to identify value chains bottlenecked by human decisions, followed by implementing agentic architectures with governance layers (identity verification, human-in-the-loop checkpoints, and decision-grade data infrastructure). Key challenges include upstream data friction (requiring real-time, traceable data instead of batch-processed reports), governance debt from pilot-to-production gaps, and variable compute costs from multi-step LLM reasoning. Successful deployments treat pilots as production-ready platforms with built-in compliance frameworks.
autonomous intelligenceagentic aidecision-grade datagovernance debthuman-in-the-loop
Generated automatically at 2026-05-15 20:54 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.
