Daily Digest — 2026-06-18
289 items · 7 research labs, 277 arxiv papers, 5 industry media
🏛️ Research Labs (7)
A near-autonomous AI chemist improves a challenging reaction in medicinal chemistry
GPT-5.4, integrated with Molecule.one's Maria AI chemist, autonomously improved Chan-Lam coupling yields for primary sulfonamides by proposing TEMPO as an additive. The system generated hypotheses, designed 10,080 high-throughput experiments, and analyzed results across two cycles, achieving mean yield increases from 16.6% to 25.2% and doubling yields for 8/14 substrate pairs in bench-scale validation. Human chemists provided steering prompts, selected proposals, and validated results. This demonstrates AI-assisted hypothesis generation and experimental optimization in medicinal chemistry, though human oversight remained critical throughout the workflow.
chan-lam couplinghigh-throughput experimentationin-context learningautonomous research agentyield optimization
Introducing LifeSciBench
OpenAI introduces LifeSciBench, a 750-task benchmark for evaluating AI systems on real-world life science research workflows. Expert-authored tasks span seven biological domains and require multi-step reasoning, artifact interpretation (53% of tasks), and scientific communication. The benchmark features granular rubrics (19,020 criteria total) validated by 453 independent experts (96% agreement). Results show GPT-Rosalind outperforms GPT-5.5 (36.1% vs 25.7% pass rate), with strongest gains in scientific communication (+14.8%) and translation (+20.9%), but struggles persist on design-heavy tasks (30.7%) and artifact interpretation (28.1% pass rate).
life science benchmarkexpert-authored rubricsmulti-step reasoningartifact interpretationscientific communication
MolmoMotion: Language-guided 3D motion forecasting
MolmoMotion introduces a language-guided 3D motion forecasting model that predicts object trajectories from RGB frames, 3D query points, and action descriptions. The model employs two variants: MolmoMotion-AR (autoregressive) for precise trajectory prediction and MolmoMotion-FM (flow-matching) for handling motion uncertainty. Trained on MolmoMotion-1M (1.16M videos with 3D point trajectories) and evaluated on PointMotionBench (2.7K clips), it outperforms existing methods in 3D motion forecasting, robotics planning (76.3% success rate), and video generation (improved motion quality metrics).
3d motion forecastinglanguage-guidedautoregressiveflow-matchingpoint trajectory
From the Hugging Face Hub to robot hardware with Strands Agents and LeRobot
The Strands Robots SDK integrates AWS's open-source robot abstractions with the LeRobot stack through AgentTools, enabling seamless transition from simulation to hardware deployment. Key innovations include a unified LeRobotDataset format for both simulated and physical recordings, shared policy interfaces (GR00T, LerobotLocal), and peer mesh coordination for multi-robot systems. The system achieves hardware-agnostic agent code through thin wrappers, demonstrated via a 5-line Python API that controls SO-100/101 robots in simulation or reality. Benchmark results show identical dataset schema compatibility (100% feature parity) between MuJoCo-simulated and hardware-recorded demonstrations.
agenttoolslerobotdatasetgroot_inferencemujocozenoh
GLM-5.2: Built for Long-Horizon Tasks
GLM-5.2 introduces architectural innovations for efficient 1M-token context handling, featuring IndexShare (reusing indexers across sparse attention layers to reduce FLOPs by 2.9×) and improved MTP layers for speculative decoding (20% longer acceptance lengths). The model demonstrates strong performance on long-horizon coding benchmarks (1% behind Opus 4.8 on FrontierSWE) and standard coding tasks (81.0 on Terminal-Bench 2.1), while offering configurable effort levels for latency-performance tradeoffs. Training leverages agentic RL with anti-hacking safeguards and the slime infrastructure for scalable rollout.
indexsharemtp layerkv-cachespeculative decodingagentic rl
Agentic Resource Discovery: Let agents search
The Agentic Resource Discovery (ARD) specification introduces a federated discovery layer enabling AI agents to dynamically locate tools, skills, and other agents at runtime. ARD replaces static, pre-installed capability catalogs with intent-based search via REST APIs and standardized manifests (ai-catalog.json). Hugging Face's reference implementation, Discover Tool, indexes thousands of Spaces and Skills, supporting semantic search filtered by runtime status and media type (application/ai-skill, application/mcp-server+json). Early results demonstrate interoperability across registries, with future work planned for federation modes and static manifest hosting.
agentic resource discoveryfederated registriesintent-based searchsemantic searchmcp server
New research shows how AMIE, our medical AI, could help manage health conditions.
The Articulate Medical Intelligence Explorer (AMIE) demonstrates longitudinal disease management capabilities by combining an empathetic dialogue agent with a deep-thinking reasoning engine that cross-references clinical guidelines. Built on Gemini models' long-context capabilities, the system processes hundreds of pages of medical knowledge for real-time patient interactions. In a blinded study comparing AMIE against 21 primary care physicians using patient actors, the AI matched clinicians in overall management reasoning while outperforming in plan preciseness (p<0.05) and guideline alignment, suggesting potential clinical decision support applications.
articulate medical intelligence explorergemini modelslong-context capabilitiesclinical decision supportguideline alignment
📜 arXiv Papers (277)
Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement
The paper introduces VERITAS, a generator-verifier framework for generalist robot policies that enables inference-time steering and autonomous policy improvement. The method pairs a pre-trained policy (generator) with a gradient-free visual verifier to evaluate actions during inference, improving performance without additional training. Results show that inference-time verification outperforms vanilla generalists, and verified rollouts provide effective supervision for offline policy improvement, matching expert demonstration efficiency without human intervention.
inference-time steeringvisual verificationgeneralist robot policiesoffline policy improvementgradient-free verifier
ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
ReproRepo introduces a scalable framework for evaluating LLM agents' ability to assist with research reproducibility, using GitHub issues as natural supervision for identifying reproduction blockers. The method analyzes 1,149 ML papers from major conferences, evaluating four model-agent configurations (including Codex with GPT-5.5) without code execution. Results show agents identify semantically related human-reported blockers for ~90% of papers, excelling at visible failure detection and semantic region identification but struggling with exact localization.
reproducibility auditingllm agentsgithub issuescodexsemantic localization
EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation
The paper introduces EvolveNav, a self-evolving framework for Zero-Shot Object-Goal Navigation (ZS-OGN) that enables continuous test-time improvement. The method constructs an agentic rule memory from past trajectories, employs an upper confidence bound-based retrieval strategy to balance semantic relevance and historical success, and incorporates a memory-guided preflection module to forecast outcomes and reduce inefficient exploration. Experiments demonstrate a 10.1% improvement in success rate over zero-shot baselines while requiring fewer steps.
zero-shot navigationagentic rule memoryupper confidence boundpreflection moduletest-time adaptation
Learning Red Agent Policy from Observations for Neurosymbolic Autonomous Cyber Agents
The paper introduces a Policy Learning Technique using imitation learning to predict actions of unobservable cyber-attackers (red agents) in partially observable RL environments. The method integrates with neurosymbolic defense agents, employing behavior trees with learning-enabled components (LECs) to infer red policies from network observations and defender actions. Evaluated in simulated cyber environments, the approach achieves high prediction accuracy across diverse attacker strategies.
imitation learningneurosymbolicpartially observablebehavior treeslearning-enabled components
Looped World Models
Looped World Models (LoopWM) introduce looped architectures for world modeling, addressing the tension between long-horizon simulation fidelity and computational cost. The method iteratively refines latent environment states through a parameter-shared transformer block, enabling adaptive computation that scales depth with prediction complexity. LoopWM achieves up to 100x parameter efficiency over conventional approaches while maintaining simulation quality, establishing iterative latent depth as a new scaling axis for world simulation.
looped architecturesworld modelingparameter-shared transformeradaptive computationiterative latent depth
Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers
The paper introduces FPRM, a Fixed-Point Reasoning Model using looped Transformer architectures with pre-norm layers and residual scaling to address signal propagation issues. FPRM employs fixed-point convergence as an end-to-end halting mechanism, enabling adaptive computation based on task difficulty. Evaluated on Sudoku, Maze, state-tracking, and ARC-AGI benchmarks, FPRM demonstrates effective compositional reasoning capabilities by leveraging depth through iterative looping.
looped transformersfixed-point convergencepre-norm layersresidual scalingcompositional reasoning
RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills
The paper introduces RubricsTree, a scalable evaluation framework for personal health agents, addressing the trade-off between physician annotation reliability and LLM-as-a-judge scalability. The method employs a hierarchical taxonomy of 100+ clinically-verifiable Boolean rubrics, curated via human-in-the-loop iterations with an expert panel, and uses a context-aware adaptive router for efficient evaluation. Results show RubricsTree exceeds baseline expert alignment by 66% on HealthBench, reliably penalizes degraded responses, and improves performance for Gemini, GPT, and Qwen models when used for optimization.
personal health agentsboolean rubricshuman-in-the-loophealthbenchexpert alignment
A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models
This study evaluates the adversarial robustness of Anthropic's Fable 5 and Opus 4.8 LLMs against automated jailbreak attacks using the HackAgent framework. Testing 7,826 harmful intents across ten categories, the models resisted most attacks but showed vulnerabilities to adaptive iterative methods, with Opus 4.8 breaking on 11.5% of intents and Fable 5 on 6.1%. Automated attacks generated 1,620 (Opus 4.8) and 702 (Fable 5) confirmed harmful completions, demonstrating that even hardened models remain susceptible to sustained adversarial pressure.
adversarial robustnessjailbreak attackshackagent frameworkiterative methodsharm taxonomy
The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data
The Stanford EDGAR Filings Dataset (SEFD) introduces an open corpus of SEC filings reconstructed into layout-faithful MultiMarkdown for financial language modeling. The method processes audited financial statements, risk disclosures, and market-moving event filings into token-efficient, long-context pretraining data with <0.1% Common Crawl overlap. Results include SEFD-v1 (152B tokens) and analyses of an 18.5M-filing archive (estimated 550B tokens), plus two benchmarks: EDGAR-Forecast for filing-grounded numerical forecasting and EDGAR-OCR for financial table transcription.
long-context corporafinancial language modelingmultimarkdowntoken-efficientsec filings
DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction
The paper introduces DRFLOW, a benchmark for evaluating personalized workflow prediction by AI agents, addressing the gap in existing systems focused on report generation. The benchmark comprises 100 tasks across five domains, with 1,246 reference steps grounded in 3,900+ sources, and defines seven diagnostic metrics. The authors present DRFLOW-Agent (DRFA), which outperforms baselines by up to 10.02% F1 score but highlights significant remaining challenges in workflow prediction accuracy.
workflow predictionbenchmarkpersonalizationdeep researchaction-steps
Kolmogorov Regression for Robust Diffusion Policies
The paper introduces Kolmogorov regression for robust diffusion policies, addressing temporal drift in finite-dimensional diffusion policies via a backward Kolmogorov equation that lifts policies to a Cameron-Martin space. The method replaces stochastic score matching with a deterministic boundary-value PDE, leveraging Gaussian measure theory to derive a precision-weighted Cameron-Martin loss and a Kolmogorov residual for inference diagnostics. Results show a 17% improvement in maximum episode reward on the PushT benchmark, 67.6% reduction in inter-step drifts, and 28.4% lower RMSE on a manufacturing line, with certified deadlock reduction via Hamilton-Jacobi reachability.
kolmogorov regressiondiffusion policiescameron-martin spacegaussian measure theoryhamilton-jacobi reachability
IUU+DB: Tracking Illegal, Unreported, and Unregulated Fishing, Seafood Fraud, and Labor Abuse through LLM-driven Information Extraction
The paper introduces IUU+DB, an LLM-driven system for tracking illegal fishing and related crimes (IUU+) by extracting structured incident data from heterogeneous documents. The system performs document classification, entity extraction (actors, locations, species), deduplication, and trend analysis to quantify IUU+ patterns. Validation shows it enables hotspot identification, risk assessment, and policy support by organizing fragmented evidence into a global database.
illegal fishinginformation extractionlarge language modelentity recognitionsupply chain risk
All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code
The study characterizes oracle signals in AI-generated test code to assess verification strength, analyzing 86,156 test-file patches from 33,596 agent-authored PRs across 2,807 GitHub repositories. Using a syntactic taxonomy of eight oracle signal categories derived from 384 stratified patches, the authors find 80.2% of test patches contain weak or no explicit oracle signals. Regression analysis shows strong oracles significantly improve merge likelihood (OR = 1.28, p < 0.001), suggesting current quality gates overestimate verification strength by relying on test-file presence alone.
oracle signalsverification strengthagent-authored prstest-file patchesmerge likelihood
The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act
The article identifies a critical measurement gap in legal AI evaluation, demonstrating that current benchmarks fail to assess doctrinal legal reasoning—the core interpretive task in law—instead focusing on ancillary paralegal tasks. This gap has legal implications under the EU AI Act, which mandates 'appropriate accuracy' for high-risk judicial AI systems without operational benchmarks. The work highlights the need for new evaluation methodologies to align AI assessments with doctrinal reasoning requirements in legal contexts.
legal aidoctrinal reasoningeu ai actbenchmarkingjudicial automation
ReAge3D: Re-Aging 3D Faces with View Consistency
The paper introduces ReAge3D, a framework for realistic 3D face re-aging that preserves identity and fine-grained details. It combines a 2D diffusion-based model (DiffReaging) with a center-out propagation strategy to achieve multi-view consistency. The Masked-DiffReaging process injects existing content during diffusion to maintain coherence, supervising 3D representation optimization. Results demonstrate superior performance over existing 3D editing methods in both visual quality and quantitative metrics, enabling precise age control.
diffusion-basedmulti-view consistencymasked-diffreaging3d face re-agingidentity-preserving
Learning Cardiac Electrophysiology Digital Twins Through Agentic Discovery of Hybrid Structure
The paper introduces LEADS, a framework for automated discovery of hybrid physics-neural architectures for cardiac electrophysiology digital twins. The method employs an LLM agent that iteratively reasons over a structured action space of domain knowledge to select and refine interpretable, stable model structures, while gradient descent handles parameter fitting. Evaluations on synthetic data with three reaction models and real cardiac EP data show LEADS outperforms both human-designed hybrid models and LLM-based alternatives.
cardiac electrophysiologydigital twinshybrid modelingllm agentsstructure discovery
WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning
WEQA introduces a query-adaptive agent framework for wearable health question answering, addressing challenges posed by continuous, high-dimensional sensor data and diverse user intents. The method employs an LLM controller to dynamically route queries through specialized wearable analysis tools and pretrained models, synthesizing execution plans and auditing responses with external knowledge. Evaluated on a benchmark of four wearable datasets across three health domains, WEQA achieves 24% higher accuracy than LLM and agentic baselines, with blinded expert studies confirming improved clinical soundness and usefulness.
wearable healthquery-adaptivellm controllersensor analysisclinical soundness
Memory as a Wasting Asset: Pricing Flash Endurance for Embodied Agents, and the Limits of Doing So
The paper introduces an economic model for flash memory endurance in embodied agents, treating memory as depreciating capital priced by an endurance shadow price η. The method formulates a cost-minimizing placement strategy across RAM, on-board NVM, and cloud storage, with a wear-augmented per-byte index. Results show the value-write association χ determines optimal placement: positive χ (observed in long-horizon manipulation, χ̂ ≈ +1.0×10⁻³) leads to non-monotone optima, while negative χ occurs in non-recurrent teleoperation. The model's applicability depends on flash type (binding for QLC/eMMC, dormant for premium TLC), and wear-aware controllers match price-based routing on task value.
flash enduranceembodied agentsshadow pricenon-volatile memorywear-augmented index
Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models
The paper introduces TAC (Travel Agent Compassion), the first benchmark evaluating AI agents' avoidance of animal exploitation in travel booking scenarios. Twelve hand-authored scenarios across six exploitation categories were augmented to 48 samples, controlling for price/rating/position confounds. Testing seven frontier models revealed all scoring below chance (64%), with Claude Opus 4.7 top at 53%. Adding welfare-aware prompts improved performance by 12-63 percentage points across models. An Inspect Scout audit of 288 transcripts found zero evaluation awareness. Discusses implications for cultural variation, text-response benchmark limitations, and EU AI Code compliance.
tacagentic benchmarkanimal welfarefrontier modelsinspect scout
Descriptor: Certus Caliber Classification Gunshot Dataset (C3GD)
The Certus Caliber Classification Gunshot Dataset (C3GD) introduces a publicly available corpus of firearm muzzle blast sounds, addressing limitations of prior internet-sourced datasets through controlled field collection. Comprising over 8000 samples from 28 firearms spanning 16 calibers, the dataset features detailed metadata on firearms, calibers, cartridges, microphones, and recording locations. Designed primarily for caliber classification, it also supports gunshot detection, audio separation, and signal processing tasks. The dataset's diversity in acquisition conditions and comprehensive metadata aims to improve generalization for real-world applications while enabling rigorous academic analysis.
gunshot audiocaliber classificationaudio separationsignal processingmetadata annotation
Knowledge Reutilization in Meta-Reinforcement Learning
The paper proposes a meta-knowledge reutilization framework for meta-reinforcement learning that decouples task inference from embodiment-specific control to improve cross-agent knowledge transfer. The method employs a Bayesian non-parametric prior for task mode organization, a high-level policy for task-level guidance, and introduces a semantic-magnitude interface with temporal adaptor to align frozen meta-knowledge with heterogeneous agents' low-level controllers. Experiments on locomotion tasks demonstrate 94.75%-99.79% reduction in final-step tracking error versus baselines, achieving comparable performance with only 23.8% of their interaction data.
meta-reinforcement learningbayesian non-parametric priorknowledge transfertemporal adaptortask semantics
Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour
The study introduces COGNITIVE ATROPHY as a novel behavioral measure in LLM-mediated mental-health support, distinct from traditional safety metrics. Using COGNITIVE ATROPHY BENCH—a benchmark with 1,576 counseling conversations, 15,680 turns, and 42,230 responses from five LLMs—researchers developed a 20-attribute schema validated by clinical experts. Key findings include moderate-to-high atrophy-aligned behaviors across models, particularly in directive advice and problem-solving responses that may foster user dependence. The work proposes the User-Input Risk Index (UIRI) and Cognitive Atrophy Risk Index (ARI) for quantifying these effects.
cognitive atrophyllm behaviormental-health supportuser-input risk indexclinical benchmarking
Embedded Machine Learning for Microcontroller-Class Edge Devices: Data, Feature, Evaluation, and Deployment Pipelines
The paper presents a systems-oriented workflow for embedded machine learning on microcontroller-class devices, focusing on engineering decisions often omitted in introductory materials. It addresses data acquisition, feature extraction (e.g., root-mean-square, mel-frequency cepstral coefficients), model/runtime co-design, and deployment pipelines for resource-constrained environments. Two case studies demonstrate the approach: inertial motion recognition using accelerometer data and keyword spotting via a compact 1D CNN. Practical design rules are provided for quantization, thresholding, and real-time performance.
embedded machine learningmicrocontrollerfeature extractionquantizationreal-time inference
Structural Role Injection in Handlebars-Templated LLM Prompts: Triple-Brace Interpolation, Delimiter Family, and the Limits of HTML Auto-Escaping
The paper demonstrates structural role injection vulnerabilities in Handlebars-templated LLM prompts, where attacker-controlled data can forge privileged chat turns. Through model-free analysis and 5760 trials across 7 delimiter families and 4 models (GPT-3.5 Turbo, GPT-4o mini, GPT-4.1 mini, Claude Haiku 4.5), the authors show that Handlebars' HTML-escaping only neutralizes angle-bracket delimiters (0.00 survival) while leaving colon/Markdown delimiters intact (1.00 survival). GPT-3.5 Turbo followed hijack instructions in 97% raw/91% escaped trials, while Claude Haiku 4.5 resisted attacks. The escaping mechanism provides incomplete protection against role injection.
structural role injectionhandlebars templatingllm prompt securitydelimiter familieshtml auto-escaping
First Proof Second Batch
The study evaluates AI systems' capability to solve research-level mathematics problems by testing them on ten diverse problems contributed by prominent mathematicians. The methodology involved presenting these problems to multiple AI systems, documenting their solutions, and comparing them with human solutions and referee reports. Results include detailed logs of AI-generated solutions and their accuracy relative to human benchmarks, providing insights into current limitations and strengths of AI in advanced mathematical reasoning.
mathematicsai systemsresearch-levelmethodologybenchmarking
Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models
The paper introduces Ternary Mamba, a method for compressing State Space Models (SSMs) like Mamba-2 via grouped quantization-aware training (QAT) with knowledge distillation from a frozen FP16 teacher. By leveraging pretrained checkpoints instead of from-scratch training, the approach reduces the token budget by 1,000x, compressing a 1.3B parameter model to 3.61x smaller size (744 MB) while maintaining 48.1% zero-shot accuracy on a 7-task average. The work identifies zero-ratio collapse, a novel instability in QAT, and shows that post-hoc correction strategies effective for Transformers fail in SSMs due to error accumulation.
ternary mambaquantization-aware trainingstate space modelsknowledge distillationzero-ratio collapse
Learning Fair Pareto-Optimal Policies in Multi-Objective Reinforcement Learning
The paper formalizes fair optimization in multi-policy multi-objective reinforcement learning (MORL), ensuring Pareto-optimal policies accommodate dynamic user preferences while maintaining fairness. Key contributions include: (1) proving fair policies for concave welfare functions (e.g., generalized Gini welfare) reside in the convex coverage set, (2) showing non-stationary and stochastic policies improve fairness via historical reward adaptation, and (3) proposing three algorithms integrating welfare functions with multi-policy Q-learning. Evaluations demonstrate superior fairness across domains compared to MORL baselines.
multi-objective reinforcement learningpareto-optimal policiesgeneralized gini welfareconvex coverage setnon-stationary policies
Querying an astronomical database using large language models: the ALeRCE text-to-SQL system
The study develops a text-to-SQL system for the ALeRCE astronomical database using LLMs with in-context learning, enabling natural language queries. A four-module framework (schema linking, query classification, prompt decomposition, self-correction) is proposed and evaluated against a direct-inference baseline on 110 NL/SQL pairs. Claude Opus 4.6 achieved perfect-match rates of 0.97/0.94 (simple), 0.44/0.72 (medium), and 0.59/0.49 (hard) for row/column identifiers, with self-correction reducing execution errors; top performers included Gemini 2.5 Pro and GPT-5.2-Codex.
text-to-sqlin-context learningastronomical databaseprompt engineeringself-correction
Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding
The paper introduces quality-aware self-distillation (QASD) for GUI grounding in vision-language models, addressing limitations of naive on-policy self-distillation (OPSD). QASD employs correctness-aware gating to filter unreliable coordinate-token teacher signals and teacher-probability scaling to calibrate supervision strength. Experiments demonstrate that combining these components outperforms individual use, yielding consistent improvements across six GUI grounding benchmarks. The method enhances base model performance by suppressing noisy supervision while preserving valid signals.
self-distillationgui groundingvision-language modelscorrectness-aware gatingteacher-probability scaling
IsabeLLM: Automated Theorem Proving Applied to Formally Verifying Consensus
The paper enhances IsabeLLM, an automated theorem prover for formal verification, by integrating a Retrieval-Augmented Generation framework, error tracing, and counterexample generation to improve Large Language Model context. It ensures compatibility with Isabelle and Sledgehammer for efficiency gains. The study evaluates IsabeLLM's performance in verifying Bitcoin's Proof of Work consensus, addressing blockchain vulnerabilities through AI-driven formal methods.
isabellmretrieval-augmented generationformal verificationconsensus protocoltheorem proving
S4oP: Operator-level Pruning of Structured State Space Models for Resource-Constrained Devices
The paper introduces S4oP, an operator-level pruning method for Structured State Space Models (SSMs) like S4 and S4D, targeting resource-constrained deployment. The approach incrementally prunes model operators via structured masking and fine-tuning, while jointly optimizing accuracy and latency. Experiments demonstrate that pruning up to 70% of operators preserves performance across benchmarks, significantly reducing inference costs. This is the first systematic study of structured operator pruning for SSMs, offering a practical solution for efficiency-accuracy trade-offs in sequential data tasks.
structured state space modelsoperator-level prunings4s4dinference latency
EAGG: Embodiment-Aligned Grasp Generation via Geometry-Aware Graph Conditioning
EAGG introduces an embodiment-aligned grasp generator that generalizes across diverse end effectors by representing each with a topology-aware graph and control space. The method employs a frozen backbone to convert articulated states into geometry-aware tokens, refreshed via iterative geometry injection during sampling. Results on MultiGripperGrasp show 56.17% average success across six end effectors, with iterative injection reducing median contact distance from 0.239 cm to 0.189 cm, demonstrating superior transfer without sacrificing specialization.
cross-end-effectorgeometry-aware tokenstopology-aware graphiterative geometry injectionmorphology prior
A Unified Framework for Context-Aware and Relation-Aware Graph Retrieval-Augmented Generation
The paper introduces HyGRAG, a hierarchical graph RAG framework that addresses limitations in entity-centric and chunk-centric retrieval methods by integrating contextual and relational information. It constructs hierarchical index structures over hybrid graphs with chunk and entity nodes, clusters them iteratively, and generates LLM-based summaries. The framework enables context- and relation-aware retrieval across abstraction levels and supports dynamic knowledge updates. Experimental results demonstrate a 9.7% improvement in multi-hop reasoning task accuracy while maintaining efficiency.
retrieval-augmented generationhierarchical graphmulti-hop reasoningcontext-aware retrievaldynamic knowledge update
Volterra Generative Models
The paper introduces Volterra generative models, a continuous-time score-based framework employing path-dependent fractional kernel noise instead of memoryless Brownian perturbations. To address non-Markovian dynamics, the authors develop finite-dimensional Markovian lifts via Gaussian quadrature and hybrid finite-difference exponential approximations, with theoretical guarantees on squared error bounds. The method handles learning in data-dimensional space using residual states and analytic auxiliary Gaussian scores, while identifying and mitigating covariance degeneracies. Experiments on MNIST and CIFAR-10 demonstrate improved generation with small lifts and stability via a Gaussian-bridge sampler for larger lifts.
volterra generative modelsscore-based diffusionmarkovian liftsfractional kernelsgaussian-bridge sampler
Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications
The paper proposes a multi-agent framework to mitigate premature diagnostic handoff and silent hallucinations in medical LLMs through deterministic orchestration constraints. The system employs (1) a neuro-symbolic state-tracking gate enforcing OLDCARTS protocol completeness before diagnostic transitions, and (2) an epistemic uncertainty quantification gate using semantic entropy (K=5 samples) to intercept divergent outputs. Evaluated on 150 llama-3.1-70b-instruct simulated cases, the framework achieves 49.3% diagnostic precision (+11.3pp over baseline) and shows significant negative correlation (r=-0.181, p<0.05) between OLDCARTS completeness and semantic entropy.
agentic aineuro-symbolicsemantic entropyoldcarts protocoldeterministic orchestration
When LLMs Analyze Scars: From Images to Clinically-Meaningful Features
The paper introduces ScaFE, a framework leveraging large language models (LLMs) for knowledge-driven feature engineering in medical image classification, specifically for pathological scar differentiation. By prompting LLMs to generate executable feature extraction code based on clinical criteria like the Vancouver Scar Scale, ScaFE transforms images into interpretable representations. The method demonstrates superior data efficiency, privacy preservation, and interpretability compared to end-to-end deep learning, achieving robust performance with limited training samples through clinically grounded features.
llmsfeature engineeringscar classificationvancouver scar scaleclinical interpretability
Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond
The study presents the first large-scale analysis of user-generated security and privacy (S&P) prompts to large language models (LLMs), identifying 14,727 S&P queries from 3.2M WildChat conversations. Using thematic analysis on 450 samples and evaluating 270 advice-seeking prompts across 10 LLM runs, it compares response quality between commercial (GPT-3.5) and open-weight (Llama 2) models. Results show GPT-3.5 provided 'good enough' responses for 98% of prompts versus Llama 2's 47%, but commercial models exhibited higher response inconsistency, potentially misleading users.
large language modelssecurity and privacythematic analysisresponse consistencywildchat dataset
PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience
The study introduces PseudoBench, an adversarial benchmark with 200 pseudoscientific claim-evidence pairs across five domains, designed to evaluate agentic auto-research systems' resistance to pseudoscience. It tests seven state-of-the-art agents through an end-to-end research pipeline, revealing near-zero refusal rates and a maximum resistance of only 27.4%. Findings indicate that stronger agents may inadvertently enhance pseudoscience's credibility by packaging it in sophisticated scientific language, highlighting urgent needs for scientific alignment in autonomous research systems.
pseudoscienceagentic auto-researchadversarial benchmarklarge language modelsscientific alignment
When AI Says "I have been in similar situations": Synthetic Lived Experience in Peer-Like Caregiver Support
The study identifies a synthetic lived experience paradox in AI caregiver support, where LLMs generate peer-like narratives without authentic lived experience. Analyzing caregiver support exchanges from online communities and responses from LLaMA, GPT-4o-mini, and MedGemma, the authors compare human and AI narrative forms using psycholinguistic analysis. Results show human peers use significantly more first-person and past-focused language, while AI captures emotional work but fabricates experiential grounding, revealing a narrative authenticity gap that necessitates mechanisms to distinguish supportive framing from fabricated experience.
synthetic lived experiencepeer-like supportpsycholinguistic analysisnarrative authenticity gapcaregiver support
ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents
ProvenanceGuard introduces source-aware factuality verification for Model Context Protocol (MCP)-based LLM agents, addressing cross-source conflation by decomposing answers into atomic claims and verifying source attribution. The method uses MCP traces with tool/source IDs, routes claims to source-specific evidence via NLI and token-alignment, and enables repair via retrieval-augmented revision. Evaluation on 281 medical-domain traces shows block F1 of 0.802 and source accuracy of 0.858 on 260 claims, outperforming source-blind baselines. ProvenanceGuard detects all attribution swaps in 50 clinical probes, demonstrating source attribution as a critical factuality axis.
model context protocolcross-source conflationprovenance verificationretrieval-augmented revisiontoken-alignment
When English Isn't the Best Teacher: Source Language Effects in Cross-Lingual In-Context Learning
The study challenges the assumption that insights from supervised fine-tuning directly apply to cross-lingual in-context learning (ICL), demonstrating that conventional language selection heuristics fail in ICL regimes. Through empirical analysis of seven tasks, six models, and diverse languages, the authors identify language confusion as a key obstacle in cross-lingual ICL. Results reveal discrepancies between fine-tuning and ICL performance, suggesting alternative heuristics for optimal source language selection in few-shot scenarios.
cross-lingual transferin-context learninglanguage confusionfew-shottypologically diverse
Catastrophic Forgetting is Low-Rank: A Function-Space Theory for Continual Adaptation
The paper provides a function-space theory of catastrophic forgetting in continual adaptation, showing that forgetting manifests as low-rank output drift. Using neural tangent kernel (NTK) analysis, the authors derive a closed-form predictor for the forgetting vector before new-task training, exact for linear-head PEFT-CL and approximate for nonlinear adapters. Results reveal forgetting concentrates in few NTK eigenmodes, with a Kronecker scaling rule for vulnerable rank under frozen linear heads. This explains limitations of parameter-space regularizers and motivates spectral regularization targeting interference directions.
catastrophic forgettingneural tangent kernelcontinual learningspectral regularizationpeft-cl
LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling
LoopCoder-v2 introduces Parallel Loop Transformers (PLT) with cross-loop position offsets (CLP) and shared-KV gated sliding-window attention to optimize loop-count selection in computation scaling. Training 7B-parameter PLT variants on 18T tokens reveals a non-monotonic effect: two-loop models outperform baselines (e.g., SWE-bench Verified improves from 43.0 to 64.4), while ≥3 loops regress due to diminishing refinement and CLP-induced positional mismatches. Diagnostics show loop 2 provides primary representation refinement, with later loops causing oscillatory updates and diversity loss, establishing a gain--cost trade-off for loop-count design.
parallel loop transformerscross-loop position offsetsshared-kv gated attentioncomputation scalingloop-count selection
LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI
LegalHalluLens introduces a framework for auditing and mitigating hallucinations in legal AI systems through typed error profiling and calibrated multi-agent debate. The method categorizes hallucinations into four legal claim types (numeric, temporal, obligation/entitlement, factual) using CUAD benchmark data, proposes a Risk Direction Index (RDI) for bias quantification, and implements a debate pipeline with targeted skepticism. Results show 38-40pp performance gaps between claim types, opposite RDIs in systems with identical aggregate error rates (52%), and 45% reduction in fabricated detections using a 4B-parameter model. Typed diagnostics enable failure mode calibration for debate agents.
legal aihallucination auditingmulti-agent debaterisk direction indextyped claims
LLM Consumer Behavior Theory: Foundations of a Novel Research Field
The paper introduces LLM Consumer Behavior Theory, a novel research field analyzing consumption decisions made by LLM-based autonomous agents on behalf of users. It synthesizes classical economics, behavioral economics, and NLP to formalize how human preferences are reflected in LLM decisions and aggregate into market demand. The work unifies fragmented literature on LLM decision-making and human behavior simulation, while identifying gaps in assumptions like rationality and heterogeneity in agentic markets, proposing open questions on alignment and preference representation.
llm consumer behavior theoryagentic marketspreference elicitationhuman behavior simulationalignment
C2FL: Clustered Continual Federated Learning under Spatial and Temporal Drift
The paper introduces C2FL, a clustered continual federated learning framework addressing spatial and temporal drift in distributed sensing systems. Nodes self-organize into spatial clusters to handle geographic heterogeneity, while adaptive averaging and experience replay mitigate temporal drift. Evaluations on synthetic datasets demonstrate superior performance over standard federated learning under dynamic conditions, with robustness against distribution shifts.
federated learningspatial clusteringtemporal driftadaptive averagingexperience replay
A T-API-Compliant ReAct Agentic Loop for Optical Networks: Generic vs. Domain-Specific Tool Abstractions
The paper introduces a T-API-compliant ReAct (Reasoning and Acting) loop for intent-driven management of optical networks, marking the first implementation of its kind. It demonstrates that domain-specific composite tools significantly outperform generic alternatives, achieving 90% oracle-validated correctness while reducing token usage by threefold. The approach enables higher autonomy levels in network management through closed-loop agentic control.
reactt-apioptical networksagentic loopdomain-specific tools
Multiple cyclicity and Wavelet Decomposition with Channel Correlation for Long-term Time Series Forecasting
The paper introduces McWC, a novel model for long-term time series forecasting that separately models cyclicity, trend, and inter-channel correlations. It employs a multi-layer cyclicity construction module to decouple cyclical information, a multi-layer perceptron for inter-channel correlations, and a wavelet decomposition module for multi-level frequency analysis. The model also decouples intra-channel autocorrelations via frequency-domain loss calculation. Evaluated on six real-world datasets, McWC achieves state-of-the-art performance with superior computational efficiency and historical information extraction capabilities.
cyclicitywavelet decompositioninter-channel correlationstime series forecastingfrequency-domain loss
Recover Semantics First, Generate Better: Improved Latent Modeling for 3D MRI Reconstruction and Cross-Contrast Synthesis
The authors propose a semantics-first latent modeling framework for 3D MRI reconstruction and cross-contrast synthesis, addressing limitations in existing compression architectures that under-preserve anatomical coherence and discard clinical semantics. Their method introduces a Latent Harmonization Encoder (LHE) for global dependency capture, a Semantic Recovery Block (SRB) injecting self-supervised high-level priors, and an Anatomy-aware Frequency Loss (AFL) for structure preservation. Experiments on two multi-contrast MRI datasets show improved reconstruction fidelity and synthesis quality compared to prior approaches.
latent modelingcross-contrast synthesis3d mri reconstructionself-supervised learninganatomical coherence
STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training
The paper introduces SpatioTemporal Adaptive Reward (STAR) Allocation, a novel RL post-training method for text-to-image diffusion and flow models that addresses granularity mismatches in reward allocation. STAR leverages text-image attention to dynamically construct spatial allocation maps across denoising steps, focusing policy updates on latent regions most relevant to prompt alignment. Evaluated on Stable Diffusion 3.5 Medium across GenEval, OCR text rendering, and PickScore tasks, STAR improves semantic alignment (0.9759), text rendering (0.9757), and preference optimization (23.60) without modifying external reward sources.
reinforcement learningtext-to-image generationdiffusion modelsreward allocationpolicy optimization
MoCo-AIS: A Contrastive Learning Framework for Similarity Computation of Vessel Trajectories
MoCo-AIS introduces a unified contrastive learning framework for vessel trajectory similarity computation, addressing limitations of supervised methods and fragmented self-supervised approaches. The method adapts Momentum Contrast (MoCo) to learn embeddings from positive/negative trajectory pairs, evaluated on large-scale AIS datasets capturing diverse navigation behaviors. Results show significant improvement over baselines in similarity learning while providing a standardized benchmarking platform for trajectory representation models.
contrastive learningtrajectory similaritymomentum contrastais datasetsrepresentation learning
SegDINO: Introducing Multi-Scale Structure into DINO for Efficient Medical Image Segmentation
SegDINO introduces an efficient medical image segmentation framework by integrating DINOv3 backbone with lightweight scale modeling, addressing the challenge of applying self-supervised DINO models to segmentation. The method employs Token Pyramid Adaptation (TPA) to reorganize intermediate DINO features into a pseudo multi-scale hierarchy and Scale-Aware Decoding (SAD) for intra-scale refinement and top-down propagation. Evaluated on the new PanCT dataset (284 patients with pancreatic tumors) and three public benchmarks, SegDINO achieves state-of-the-art results with high efficiency.
segmentationself-supervised learningdinomulti-scalemedical imaging
A Neuro-Symbolic Approach to Strategy Synthesis for Strategic Logics
The paper presents a neuro-symbolic framework for strategy synthesis in multi-agent systems (MAS), combining large language models (LLMs) with formal verification. The method uses Qwen3-32B as a strategy-generation oracle, proposing candidate strategies that are formally validated by a MAS model checker, ensuring soundness via a generate-and-certify architecture. Evaluated on a new NatATL dataset of 4211 instances, the approach achieves 92% accuracy in strategy-synthesis outcomes.
neuro-symbolicstrategy synthesismulti-agent systemslarge language modelsmodel checking
Robustness of Similarity-based Positional Encoding Under Rotations: Theoretical Analysis and Experimental Validation
The paper theoretically analyzes and experimentally validates the robustness of similarity-based positional encoding (simPE) under rotational perturbations. Through Lipschitz continuity assumptions, the authors prove simPE's stability under rotations, deriving explicit perturbation bounds in Frobenius norm. Experiments on synthetic Arrow, Shapes, Digits datasets and FashionMNIST demonstrate simPE's superior performance over learned positional encodings in accuracy, F1, precision, and recall under small-to-moderate rotations, aligning with theoretical guarantees.
positional encodingrotation invariancelipschitz continuityfrobenius normtransformer architectures
SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs
SoftMoE introduces a differentiable soft top-k routing mechanism for Mixture-of-Experts (MoE) in LLMs, replacing discrete top-k selection with a truncated soft top-k LapSum relaxation. This enables gradient-based optimization of expert routing while maintaining autoregressive compatibility. The method parameterizes the mean number of active experts per layer under a global budget constraint, learning non-uniform expert allocation across layers. Results show comparable or superior performance to sparse MoE on language modeling and downstream tasks, with fewer activated experts, particularly in later layers.
mixture-of-expertsdifferentiable routingautoregressive modelinggradient-based optimizationlanguage modeling
Plug-and-Adapt: Multimodal Coreference Resolution at First Sight with a Pretrained Alignment Model
The paper introduces Plug-and-Adapt, a method for multimodal coreference resolution (MCR) that adapts a pretrained alignment model without requiring target dataset training or large vision-language models (VLLMs). The approach pre-trains a fine-grained alignment model on vision-language datasets, then repurposes it for MCR via similarity aggregation combining visual and categorical cues with evidence theory. Evaluations on Coreference Image Narratives (CIN) show 5.31% and 2.12% CoNLL F1 improvements over state-of-the-art dedicated methods and VLLMs, respectively, with additional robustness and generalization confirmed on masked CIN and VCR-MCR datasets.
multimodal coreference resolutionalignment modelvision-language alignmentsimilarity aggregationevidence theory
Small Initialization Matters for Large Language Models
The study demonstrates that parameter initialization significantly influences the training and capacity of large language models (LLMs), with small initialization scales yielding consistent improvements in pretraining, particularly for reasoning-demanding tasks. The authors identify two empirical settings that limit these benefits and propose adjustments to restore favorable scaling. A critical initialization balance is uncovered, promoting a developmental trajectory where parameters first condense into low-complexity structures before expanding into richer representations. Token-level analyses reveal gains on non-trivial, context-constrained predictions. The findings motivate a γ-initialization rule, advocating small initialization as a cost-free intervention to enhance pretraining and reasoning across model scales.
initializationpretrainingreasoningscalingtrajectory
How Inference Compute Shapes Frontier LLM Evaluation
The study demonstrates that frontier LLM evaluations are increasingly sensitive to inference-time compute allocation, challenging fixed-budget assessment protocols. Through controlled experiments on 12 models across seven benchmarks (including FrontierMath and TerminalBench), the authors systematically vary token budgets, context compaction, and repeated submissions with correctness feedback. Key findings show: (1) larger token budgets consistently improve performance (+20-40% on cybersecurity and math tasks), (2) fixed-budget evaluations underestimate newer models' capability ceilings, and (3) optimal inference-scaling strategies are benchmark-dependent. The authors advocate for compute-aware evaluation protocols with matched-budget comparisons.
inference computetoken budgetscontext compactionrepeated submissionscapability ceilings
PreAct: Computer-Using Agents that Get Faster on Repeated Tasks
PreAct introduces a method for computer-using agents to accelerate repeated tasks by compiling successful runs into state-machine programs, eliminating per-step language-model calls and achieving 8.5-13x speedups. The system verifies screen-state matches before acting and reverts to the agent upon discrepancies, while an independent evaluator ensures compiled programs correctly solve tasks before storage. Evaluations across mobile, desktop, and web benchmarks show PreAct improves task completion by 1.75-2.6 tasks per benchmark, with a fallback mechanism maintaining parity with record-and-replay baselines.
computer-using agentsstate-machine programsin-context learningruntime verificationtask acceleration
KANLib -- An Modular, Extensible and Fast Kolmogorov-Arnold Network Implementation
KANLib introduces a modular, extensible framework for Kolmogorov-Arnold Networks (KANs), addressing computational inefficiencies and inconsistent feature support in existing implementations. The framework unifies concepts from PyKAN, EfficientKAN, and FastKAN, offering adaptive grid rescaling, grid extension, and architectural customization while maintaining PyTorch compatibility. Experimental validation on the California Housing benchmark confirms KANLib's competitive computational efficiency and predictive accuracy relative to reference implementations. The system enables exploration of novel KAN architectures with minimal performance trade-offs.
kolmogorov-arnold networksadaptive grid rescalingpytorch compatibilitycomputational efficiencyarchitectural customization
PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space
PearlVLA introduces a Vision-Language-Action framework that performs progressive embodied action-plan refinement in latent space, addressing the latency-deliberation trade-off in VLAs. The method separates vision-language model representations into visual grounding and iterative latent plan branches, using future observation latents from a frozen world model to guide RefineNet's scheduled residual updates over K refinement rounds. Causal Refinement-Grouped Process-Reward RL optimizes latent refinement via imagined future rewards. On LIBERO benchmark, PearlVLA achieves state-of-the-art performance among existing methods.
vision-language-actionlatent refinementworld modelprocess-reward rlembodied planning
Trustworthy Self-Composable Big-Data-as-a-Service: An LLM-Orchestrated Multi-Agent Framework for Automated Data Engineering, AutoML, MLOps Deployment, and Drift-Aware Lifecycle Optimization
The paper proposes an LLM-orchestrated multi-agent framework for trustworthy Big-Data-as-a-Service (BDaaS) lifecycle automation, addressing gaps in current AutoML and LLM-based systems. The architecture decomposes BDaaS workflows into specialized agents (data ingestion, cleaning, feature engineering, etc.) coordinated by a central LLM layer for dynamic composition, validation, and governance. Evaluated on tabular benchmarks with simulated drift, the framework matches predictive performance of baselines while improving workflow reliability, traceability, and drift recovery by 12-18%.
llm-orchestrationautomlmlopsdata-driftmulti-agent
Non-negative Elastic Net Decoding for Information Retrieval
The paper introduces Non-Negative elastic Net (NNN) decoding, a novel retrieval paradigm that addresses redundancy in dense retrieval by selecting documents as a set via sparse non-negative linear combination of embeddings. The method jointly reconstructs query embeddings while considering corpus context, theoretically outperforming inner-product scoring on corpora with correlated documents. Experiments show consistent improvements over dense retrieval on multiple benchmarks, with further gains achieved through end-to-end embedding optimization for NNN decoding.
dense retrievalnon-negative elastic netsparse linear combinationjoint decodingembedding optimization
DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue
DiagFlowBench introduces a dataset of 1,676 multi-turn conversations derived from 50 industrial diagnostic flowcharts to evaluate language models' handling of off-procedure inputs in grounded dialogue. The benchmark tests ten commercial and open-weight models, revealing high variability in abstention rates and a tendency to select contextually inadequate but procedurally valid steps over fabrication. Results highlight a critical vulnerability in grounding systems, where models often provide plausible yet incorrect advice when faced with out-of-scope queries.
diagnostic dialoguegrounded language modelsabstention ratesoff-procedure inputshallucination prevention
Learn to Quantify Social Interaction with Constraints for Pedestrian Walking
The paper proposes Learn to Cluster, a probabilistic latent variable generative method to quantify and interpret social interactions in pedestrian trajectory prediction. The approach clusters interactions directly from sequential trajectory observations without manual labeling, scaling to arbitrary pedestrian counts. These latent variables categorize social interactions and integrate into prediction models. Experiments on multiple benchmarks show the method effectively learns interaction patterns and improves trajectory prediction accuracy.
trajectory predictionlatent variablesocial interactionprobabilistic clusteringpedestrian behavior
Dimensionality Controls When Modularity Helps in Continual Learning
The study investigates how modular architecture, task similarity, and representational dimensionality influence continual learning in sequential A-B-A paradigms. Comparing task-partitioned recurrent networks to single-network baselines, the authors manipulate weight-scale to induce high- and low-dimensional regimes. In high-dimensional 'lazy' regimes, both architectures perform similarly, with modularity offering negligible benefits. In low-dimensional 'rich' regimes, modular networks develop task-specific subspaces that overlap for similar tasks and separate for dissimilar ones, enhancing compositional organization and interpretability. Results highlight dimensionality as a critical factor determining when modularity improves continual learning, framing safety and robustness as adaptive subspace allocation problems.
modular architecturecontinual learningrepresentational dimensionalitytask-partitioned networkssubspace allocation
MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning
The paper introduces MathVis-Fine, a framework addressing visual dependency challenges in multimodal mathematical reasoning. It constructs the MathVis-Fine dataset with fine-grained visual annotations and dependency ratings, then proposes a two-stage progressive training paradigm balancing answer correctness and visual grounding rewards. This approach mitigates reward bias by adapting supervision to sample-specific visual necessity. Experiments show improved visual perception and reasoning accuracy, demonstrating the framework's effectiveness for precise multimodal mathematical problem-solving.
multimodal reasoningvisual dependencyprogressive trainingmathematical problem-solvingfine-grained annotations
AI Adoption Across a Multinational Workforce: Sociotechnical Conditions for GenAI Acceptance in Human Resources
This study examines generative AI (GenAI) adoption in HR systems through a mixed-methods analysis of search logs, surveys (n=25), and interviews at a multinational tech company. Key findings reveal adoption patterns depend on system-employee fit (role, language, tenure), with trust emerging through source verification, system comparison, and colleague consultation. The work contributes empirical evidence on workplace GenAI adoption dynamics during live transitions, highlighting situational fit, search literacy, and trust calibration as critical factors. It further proposes design considerations for inclusive deployment, emphasizing role-sensitive benefits and organizational knowledge infrastructure integration.
generative aihuman resourcessituational fittrust calibrationknowledge infrastructure
Structural Preservation and the Logical Expressiveness of Graph Neural Networks
The paper establishes logical expressiveness bounds for graph neural networks (GNNs) preserved under structural properties (embeddings, injective homomorphisms, homomorphisms) by linking them to graded modal logic fragments. Using semantic preservation criteria, it shows existential graded modal logic corresponds to embedding-preserving GNNs, existential-positive fragments to injective homomorphism-preserving GNNs, and existential-positive modal logic to homomorphism-preserving GNNs. The method employs a novel well-quasi-order result for bounded-height trees to derive finite representations of unravelling-invariant classes, demonstrating architecture-independent expressiveness while proving each class admits an equally expressive GNN implementation.
graph neural networksgraded modal logicstructural preservationwell-quasi-orderhomomorphism
AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor
AnchorKV introduces a safety-aware KV cache compression method for LLMs that improves alignment against harmful prompts while maintaining utility. The approach constructs an offline 'safety anchor' in key projection space using difference-of-means representation engineering, then applies a soft penalty to token retention scores to bias eviction away from harmful directions. This modification preserves accuracy on benign inputs and reduces to standard compression when no penalty is needed, trading minimal utility loss for substantial safety gains.
kv cachesafety alignmentrepresentation engineeringtoken retentionjailbreak attacks
StepGuard: Guarding Web Navigation via Single-Step Calibration
The paper introduces StepGuard, a framework for improving web navigation agents through single-step calibration. It addresses reward misalignment via Dynamic Dual-Policy Optimization (DDPO), which switches between navigation-first and answer-first modes, and mitigates single-step errors via Confidence-Guided Adaptive Navigation Reflection (CANR), using confidence estimation and contrastive rewards for self-correction. Experiments show state-of-the-art performance on standard benchmarks, with significant improvements in navigation and answer accuracy.
web navigationreward alignmentconfidence estimationcontrastive rewardsself-correction
A Quantitative Analysis of Multimodal Biomarkers in Alzheimer's Disease
This study presents a quantitative analysis of multimodal Alzheimer's Disease biomarkers using tau-PET, structural MRI, cognitive scores, and APOE4 data from 789 ADNI subjects. The methodology includes cross-modal mutual information analysis, tau-structural atrophy associations, decomposition of tau-cognition relationships, and identification of neurodegenerative trajectories. Results demonstrate systematic characterization of biomarker interactions, revealing redundancy, predictive dependencies, and dominant trajectories aligned with cognitive decline. The approach enhances interpretability and selection of AD biomarkers while reducing patient burden.
alzheimer's diseasemultimodal biomarkerstau-petstructural mrineurodegenerative trajectory
FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow
FlowRAG introduces a frequency-aware multi-granularity graph framework to enhance retrieval-augmented generation by addressing entity-level sparsity and brittle multi-hop reasoning. The method constructs a quad-level heterogeneous graph (passages, summaries, sentences, entities) with summary nodes as semantic hubs, employs dual-granularity activation for robust query alignment, and uses frequency-weighted flow to prune noisy paths and extract high-confidence reasoning chains. Experiments demonstrate state-of-the-art performance on complex reasoning benchmarks.
graphragmulti-hop reasoningheterogeneous graphfrequency-aware flowretrieval-augmented generation
A homotopy-type-theoretic generalization of neurosymbolic inference
The paper introduces a homotopy-type-theoretic framework for neurosymbolic (NeSy) inference, generalizing belief-weighted sums over σ-structures to account for symmetries and proof multiplicity. By replacing sets with homotopy types, the method computes belief-weighted homotopy cardinality, preserving symmetry information and enabling reasoning-shortcut awareness. The framework is proven conservative for classical functionals when symmetries are trivial and identifies symmetry-invariant concept posteriors as the unique point in the confusion-set simplex. On MNIST reasoning-shortcut benchmarks, the single-model symmetry-aware wrapper outperforms diversity-trained ensembles in calibration while maintaining label accuracy and concept identifiability.
neurosymbolic inferencehomotopy type theoryσ-structuresreasoning shortcutssymmetry-invariant
WallZero: Mastering the Game of WallGo with Strategic Analysis
The paper presents WallZero, an AlphaZero-based agent for WallGo, a strategic 7×7 board game combining stone movement and wall placement. The authors introduce customized action and feature designs to enhance performance, demonstrating WallZero's superiority by defeating two professional Go players with 1.98x more average territory per game. The agent also analyzes game fairness, revealing that the opening strategy from the Netflix series 'The Devil's Plan' produces more balanced outcomes. Code is publicly available.
alphazerowallgogame-tree complexitystrategic interactionsaction design
High-Fidelity 3D Geometric Reconstruction of Pelvic Organs from MRI: A Hybrid Deep Learning and Iterative Optimization Approach
The study presents a hybrid framework for high-fidelity 3D reconstruction of pelvic organs (bladder, uterus, rectum) from MRI, combining deep learning with iterative optimization. The method features a geometry-aware multi-level architecture for topological consistency, a two-stage amortized optimization strategy for global-local balance, and a synergy mechanism where optimization supervises training and refines inference outputs. Results show superior geometric fidelity, with lower Chamfer Distance and higher Dice Similarity Coefficient than existing methods, while maintaining computational efficiency and improved mesh quality metrics (minSICN, minSIGE).
3d reconstructiondeformable shape modelingamortized optimizationchamfer distancetopological consistency
Perceptual compensation for tonal context in self-supervised speech models
The study investigates whether wav2vec2.0 exhibits perceptual compensation for phonological context in Mandarin Chinese tones. Using embedding similarity analysis and probing classifiers, the authors compare a purely self-supervised pre-trained model with one fine-tuned for Mandarin ASR. Results show no compensation in pre-trained embeddings, limited compensation in probing classifiers, and failure to replicate human performance on isolated syllables, suggesting supervised objectives may be necessary for abstracting certain phonological regularities.
wav2vec2.0phonological contextmandarin tonesself-supervised learningprobing classifiers
Functional Equivalence in Attention: A Comprehensive Study with Applications to Linear Mode Connectivity
This study investigates functional equivalence in attention mechanisms, focusing on how positional encodings reshape symmetries in Transformer architectures. The analysis compares sinusoidal and rotary positional encodings (RoPE), revealing that sinusoidal encodings preserve equivalence structures while RoPE reduces symmetry groups, enhancing expressivity. The findings provide a theoretical basis for RoPE's practical adoption. Additionally, the research examines positional encodings' impact on linear mode connectivity, demonstrating through an alignment algorithm that connectivity patterns vary significantly with encoding choice.
functional equivalencepositional encodingstransformerslinear mode connectivityrotary positional encodings
When Multiple Scripts Matter: Evaluating ASR in Clinical Settings
The paper introduces MultiClin, a clinical ASR benchmark addressing multiscript variability in non-English settings, where orthographic variants are often misclassified as errors. The study evaluates diverse ASR models using multiscript-aware metrics, demonstrating fairer performance assessment than single-reference methods. Findings reveal that script inconsistency during training increases orthographic uncertainty, with a 50% mapping ratio yielding highest entropy, while script unification optimizes ASR performance. The dataset and code are publicly available.
asrmultiscriptorthographicentropybenchmark
Human-in-the-Loop Atlas-Based 3D Asset Segmentation for Interactive Content Workflows
The paper presents a human-in-the-loop pipeline for generating segmented 2D atlases from 3D models, addressing application-dependent segmentation needs in interactive media workflows. The method employs a greedy set cover strategy to select rendered views, followed by interactive segmentation using SAM-2 and Label Studio, with back-projection to UV space for unified atlas creation. Evaluation on eight cultural heritage objects demonstrates usable atlas generation across diverse geometries, with manual corrections primarily needed for fine structures, cavities, and weak appearance boundaries.
3d segmentationatlas generationhuman-in-the-loopuv parameterizationinteractive workflows
DecoSearch: Complexity-Aware Routing and Plan-Level Repair for Text-to-SQL
DecoSearch introduces a training-free framework for text-to-SQL that dynamically routes queries based on complexity. It employs a Schema Selector for schema pruning, an LLM Judger for decomposition decisions, and a RAG-enhanced DAG-based decomposer for complex queries, with a Topology Refiner for plan repair. The system achieves 70.53% execution accuracy on BIRD and 88.31% on Spider with DeepSeek, outperforming training-free baselines while reducing token usage. It also serves as a model-agnostic wrapper, enhancing fine-tuned backbones without pipeline modifications.
text-to-sqlschema pruningdirected acyclic graphretrieval-augmented generationexecution accuracy
A Framework for Evaluating Agentic Skills at Scale
The paper introduces a framework for evaluating agentic skills in LLMs, enabling skill authors to construct realistic tasks and assess skill utility. The method involves generating 1,000 tasks from 500 real-world skills, with instruction-following and goal-completion rubrics. Results from 19 agent-model configurations reveal significant performance variations and demonstrate that skill access alters model behavior, facilitating opinionated workflow encoding. The dataset is released to support future research.
agentic skillsllm agentsinstruction-followinggoal-completion rubricsworkflow encoding
Conservation Laws for Modern Neural Architectures
The work establishes a unified framework for characterizing conservation laws in gradient flow dynamics of modern neural architectures, extending beyond previously understood linear and ReLU networks. The method analyzes feedforward networks with GELU, SiLU, and SwiGLU activations, multihead attention with sinusoidal/rotary positional encodings, and Mixture-of-Experts architectures under various gating designs. Experimental results validate the theoretically predicted invariants, providing insights into implicit bias in over-parameterized models.
gradient flowimplicit biasover-parameterized modelsconservation lawsmodern architectures
No-Free-Fairness: Fundamental Limits and Trade-offs in Learning Systems
The paper establishes No-Free-Fairness theorems, identifying three fundamental sources of disparity in learning systems. Through theoretical analysis, it demonstrates that irreducible subgroup costs create a fairness-cost trade-off frontier, finite-sample learning induces unavoidable subgroup disparity even in noise-free settings, and model class limitations independently prevent fairness. Key results show that strict relative fairness enforcement creates statistical bottlenecks requiring exponential samples, while unrepresentable subgroup solutions preclude fairness regardless of data or training. The framework extends beyond supervised learning, positioning fairness as an intrinsic design constraint rather than an optimizable objective.
fairness-cost frontiersubgroup disparitystatistical bottleneckmodel expressivityno-free-fairness
Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering
The article critiques current coding benchmarks as misaligned with agentic software engineering, arguing they conflate model performance with harness components and lack granular feedback. It identifies three key issues: benchmark scores inappropriately merge model and harness metrics, penalize valid alternative solutions by comparing to single references, and fail to provide component-level signals for iterative improvement. The analysis highlights how these limitations obscure true progress in agent-based coding systems.
coding benchmarksagentic software engineeringmodel-harness conflationreference solution biascomponent-level iteration
LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams
LiveStarPro introduces a proactive streaming video understanding system addressing three key limitations of Video-LLMs: real-time processing, autonomous response timing, and long-horizon memory retention. The framework combines Streaming Verification Decoding (SVeD) for perplexity-based response timing, Streaming Causal Attention Masks (SCAM) for incremental video-language alignment, and Tree-Structured Hierarchical Memory (TSHM) for efficient retrieval from unbounded streams. Evaluated on the OmniStarPro benchmark (15 scenarios, hour-scale streams), it achieves 28.9% higher semantic correctness, 18.2% lower timing error, and 1.58x inference speedup via KV-cache optimization compared to baselines.
video-llmsstreaming verification decodingcausal attention maskshierarchical memorykv-cache
MIVE: A Minimalist Integer Vector Engine for Softmax LayerNorm and RMSNorm Acceleration
The paper introduces MIVE, a Minimalist Integer Vector Engine for accelerating Softmax, LayerNorm, and RMSNorm operations in LLM inference. The unified datapath architecture exploits shared computational patterns across these non-linear vector normalization operations, eliminating redundant hardware blocks. ASIC implementation demonstrates superior area efficiency and hardware utilization compared to dedicated accelerators for individual operations.
vector enginehardware acceleratorlayernormrmsnormsoftmax
A Neuromorphic Trigger for Efficient Audio Event Detection
The paper proposes a neuromorphic trigger using spiking neural networks (SNNs) to gate audio inputs for downstream models, enabling efficient processing of continuous audio streams. The lightweight fully connected SNN acts as a front-end filter, selectively forwarding salient segments to computationally intensive classifiers. Evaluated on Anomalous Sound Detection (URBAN-SED) and Sound Event Detection (DCASE 2017 Task 2), the trigger achieves a 0.97 F1 score for ASD and reduces FLOPs by 42.6× while lowering error rates from 0.41 to 0.25 for SED when paired with the Dang classifier.
spiking neural networkaudio event detectionneuromorphic computinganomalous sound detectioncomputational efficiency
Talking to Your Data: Exploring Embodied Conversation as an Interface for Personal Health Reflection
The paper contributes a conversational agent system for health data reflection, combining wearable data preprocessing with a Unity-based embodied character using a dual-agent architecture (Observer for statistical extraction, Presenter for spoken communication). The method employs a simulated-self user study (N=5) comparing dashboard exploration with conversational reflection, measuring understanding, action specificity, and cognitive shifts. Results suggest the embodied interface shifts users from passive viewing to active sensemaking, though clinical advice is intentionally excluded to isolate interaction effects.
embodied conversational agentwearable data preprocessingdual-agent architecturespoken statisticssimulated-self study
Symplectic Transversality and Endpoint Green Estimates for Finite-Horizon Pontryagin Systems
The paper establishes horizon-uniform analysis for finite-horizon discrete-time Pontryagin systems via symplectic transversality and endpoint Green estimates. Key contributions include constructing a two-point endpoint inverse for linearization, verified through scaled stable-unstable boundary transversality, and deriving weighted contractions for existence, uniqueness, and Lipschitz dependence with horizon-independent constants. The framework accommodates nonlinear endpoint maps and provides symplectic/Riccati verification criteria, covering stabilizable linear-quadratic systems with noncommuting data. Numerical validation demonstrates horizon-uniform first-order expansions.
pontryagin systemssymplectic transversalitygreen estimatesfinite-horizon controlriccati criteria
ED3R: Energy-Aware Distributed Disaster Detection Enabled by Cooperative Robotic Agents
The paper introduces ED3R, an energy-aware distributed framework for wildfire detection using cooperative robotic agents. The framework enables hierarchical decision-making between a robot and remote controller, optimizing motion planning, sensing modality (onboard/remote), and wildfire detection confidence under energy constraints. ED3R incorporates obstacle avoidance, redundant exploration prevention, adaptive mission completion, and forward-looking neural regression for strategy evaluation. Evaluations show 97.18% mission success rate, with 36.4% energy reduction and 41% faster detection compared to baselines in demanding scenarios.
distributed roboticsenergy-aware planningwildfire detectioncooperative decision-makingneural regression
Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs
The paper proposes dynamic epistemic entropy orchestrated erasable reinforcement learning ($\text{E}^3\text{RL}$) to mitigate the autoregressive curse in long-horizon logical reasoning for LLMs. The method uses segment-level adaptive thresholds and advantage allocation to excise localized logical defects while reusing historical KV-cache streams, enabling self-healing reasoning. Evaluated on DeepMath-103k, $\text{E}^3\text{RL}$ improves exploration efficiency with linear memory overhead, achieving 5.349% and 6.514% SOTA gains on AIME for 4B and 8B parameter models respectively.
autoregressive curseepistemic entropykv-cacheself-healingadvantage allocation
LongWebBench: Evaluating Structural and Functional Webpage Generation in Long-Horizon Settings
The paper introduces LongWebBench, a benchmark for evaluating long-horizon webpage generation from structural and functional perspectives. It comprises 490 real-world long webpages for structural fidelity assessment and 507 goal-oriented interaction tasks across 129 webpages for functional evaluation. The benchmark employs a VLM-based metric for structural coherence and a DOM-augmented agent-based pipeline for functional verification. Experiments with state-of-the-art VLMs reveal structural fidelity degradation with increasing webpage length and functional shortcomings in visually plausible generations. The results underscore the importance of executable interaction as a core evaluation criterion.
long-horizon webpage generationstructural fidelityfunctional evaluationdom-augmented agentvlm-based metric
Structured Adversarial Camouflage via Voronoi Diagrams
The paper introduces adversarial Voronoi camouflage, a parameter-efficient method for generating structured adversarial patterns by optimizing seed-point locations under fixed color palettes. Using soft assignment without additional regularization, it produces splinter-like camouflage that degrades object detection performance. Evaluated on COCO-style AP@[.5:.95], garment-level application via segmentation masks (3DPeople) significantly reduces detection accuracy, with transferability across YOLOv9-12 detectors and out-of-domain backgrounds. The attack shows limited tolerance to palette changes (<=0.17), revealing structure-palette coupling. Physical validation remains future work.
adversarial camouflagevoronoi diagramsobject detectionyolovparameter-efficient
Vision-language models for chest radiography do not always need the image
The study challenges assumptions that medical vision-language models (VLMs) necessarily utilize radiographic images by demonstrating that text-only models achieve comparable diagnostic accuracy. Authors introduce a causal audit method involving image occlusion (relevant/irrelevant regions) and cross-patient scan swaps, combined with three behavioral metrics to assess image dependence. Results show a 119B-parameter multimodal VLM performs indistinguishably from a 7B text-only baseline (Δ5.7 accuracy points), with only 5/9 models showing selective image use. Text-only models match radiologist accuracy (p>0.05) but fail grounding, while image-using models exhibit radiologist-aligned grounding rates. Confidence scores reliably indicate grounding only when models process images.
vision-language modelscausal auditchest radiographygrounding metricsmultimodal fusion
Confusion-Aware Transfer Teacher Curriculum Learning Framework: Disentangling Scoring and Pacing Effects
The paper disentangles scoring and pacing effects in curriculum learning through two evaluation protocols: stage-wise test subsets for isolated scoring validation and a random-order baseline for pacing analysis. Within the Transfer Teacher Framework (TTF), it proposes a confusion-aware difficulty score incorporating both correct-class confidence and incorrect-class probability distributions. Experiments on CIFAR-10 with ResNet-18/VGG-16 show the score produces interpretable rankings but fails to improve full-data accuracy. However, it yields 8.7% relative gains in 20% data regimes, demonstrating TTF's potential for data-efficient training.
curriculum learningtransfer teacher frameworkconfusion-aware scoringdata-efficient trainingdifficulty ranking
SegTME-UNI2: A Foundation Model-Based Framework for Generalisable Multiclass Cell Segmentation and LLM-Driven Tumour Microenvironment Characterisation in Histopathology
SegTME-UNI2 introduces a unified framework for tumor microenvironment (TME) characterization via multiclass cell segmentation and LLM-driven reporting. The method combines UNI2-H (a ViT-Giant foundation model pretrained on 100M histopathology tiles) with dual UperNet decoders for semantic segmentation and nuclear instance separation, trained via a three-stage pseudo-label curriculum using PanNuke and TCGA-UT datasets. The pipeline extracts 20+ TME features, encoded as JSON for narrative generation by a fine-tuned BioNeMo GPT model. Validation on PanNuke and TCGA-UT shows feasibility, with released pseudo-labels and model checkpoints enabling large-scale TME analysis.
histopathology segmentationpseudo-label curriculumtumor microenvironmentvit-giantbiomeno gpt
EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent
The paper introduces EComAgentBench, a benchmark for evaluating LLM-based shopping agents on long-horizon tasks with distributed hidden intent. The benchmark comprises 662 tasks derived from real Amazon products and reviews, scattering requirements across visible queries, tool-gated profiles, and scripted clarifications. Agents must uncover hidden intent, verify candidates, and commit to a product within 100 tool calls, with typed rubrics attributing failures to specific requirements. Automated construction ensures reliability, with validation for every sample. Evaluation of seven models shows the strongest achieves only 57.1% accuracy, with performance degrading for hidden intent sources.
llm-based agentslong-horizon taskshidden intentbenchmark evaluationtool-gated profiles
FllumaOne: A Code-Native Multimodal CAD Dataset with Executable Programs and Kernel-Validated Feature Histories
FllumaOne introduces a code-native multimodal CAD dataset with 100,000 samples (FllumaOne-100K) that align executable Python programs with structured feature trees, STEP geometry, point clouds, and renderings. The dataset is generated using Flluma, a Qt/C++ OpenCASCADE-based CAD system, with kernel-validated geometry and modality completeness checks. A Qwen2.5-Coder-1.5B LoRA baseline achieves 99.98% Python syntax validity, 99.97% build success, and 0.002124 mean normalized Chamfer Distance on point cloud predictions. The dataset supports CAD reconstruction, program synthesis, and editable reverse engineering.
parametric cadexecutable programskernel-validated geometryfeature treechamfer distance
SuCo: Sufficiency-guided Continuous Adaptive Reasoning
The paper introduces Sufficiency-guided Continuous Adaptive Reasoning (SuCo), a two-stage framework for optimizing reasoning efficiency in Large Reasoning Models (LRMs). First, Minimal Sufficient CoT (MSC) defines the shortest reasoning prefix yielding correct answers, enabling MSC-Aligned Fine-Tuning (MFT) with problem-adaptive thresholds. Second, Sufficiency-Aware Policy Optimization (SAPO) uses RL with dynamic complexity tracking and sufficiency-aware rewards. Experiments on math, code, and science benchmarks show SuCo improves both accuracy and token efficiency compared to standard Chain-of-Thought approaches.
minimal sufficient cotsufficiency-aware policy optimizationreasoning efficiencyadaptive thresholdsdynamic complexity tracking
See First, Answer Later: Visual Evidence Pre-Alignment via Sufficiency-Driven RL
The paper introduces Visual Evidence Pre-Alignment (VEPA), an intermediate training stage for multimodal large language models (MLLMs) that improves visual grounding through sufficiency-driven reinforcement learning. VEPA employs Group Relative Policy Optimization (GRPO) to optimize question-conditioned visual evidence descriptions, addressing the limitations of caption-based pretraining that often neglects fine-grained visual details. Experiments demonstrate consistent performance gains on visually demanding benchmarks, with analysis showing these improvements stem from enhanced transferable visual grounding rather than task-specific training.
multimodal large language modelsvisual evidence pre-alignmentgroup relative policy optimizationsufficiency-driven objectivevisual grounding
ASTEROID: A Spatiotemporal Information Transformer for Forecasting Multi-Step Time Series of Molecular Dynamics
The authors present ASTEROID, a Transformer-based framework for multi-step forecasting of molecular dynamics (MD) trajectories without iterative integration. The model reformulates MD trajectories as spatiotemporal sequences, employing a local-global self-attention mechanism for spatial dependencies and an encoder-decoder structure for temporal dependencies. Evaluated on quantum-mechanics datasets, ASTEROID achieves higher accuracy than existing methods while reducing computational costs, supporting extended iterative forecasting. This establishes a data-driven paradigm for MD simulation acceleration.
molecular dynamicstransformerspatiotemporal sequencesmulti-step forecastingself-attention
Handling Feature Heterogeneity with Learnable Graph Patches
The paper proposes learnable graph patches to address feature heterogeneity in graph data without textual information, enabling transferable graph foundation models (GFMs). The method decomposes graphs into semantic units (patches) by unfolding node features and constructing patch structures separately, then employs a patch encoder and aggregator to extract and combine knowledge across domains. Empirical results demonstrate improved performance on diverse downstream tasks and datasets, with scaling benefits from increased pre-training data volume.
graph foundation modelfeature heterogeneitylearnable graph patchespatch encoderdomain-agnostic transfer
FacProcessTwin: An LLM-Based System for Process Twin Development
FacProcessTwin introduces an LLM-based system for automated process twin development, reducing manual effort by generating complete process models from plant documentation and operator input. The system combines natural language processing with real-time data binding, featuring human-in-the-loop governance for safety-critical decisions. Evaluation on 16 production flows at an Australian food manufacturer demonstrates 95.2% F1 accuracy in model generation and 6× faster deployment than manual methods, with zero mis-bindings in ambiguous cases through operator deferral.
process twinlarge language modelhuman-in-the-loopreal-time bindingmanufacturing automation
Temporal Preference Optimization for Unsupervised Retrieval
TPOUR (Temporal Preference Optimization for Unsupervised Retriever) introduces Temporal Retrieval Preference Optimization (TRPO), a novel training method that addresses temporal misalignment in unsupervised dense retrievers by reinterpreting preference learning along the temporal dimension. TRPO enables continuous temporal alignment via interpolation in a learned time embedding, generalizing to unseen time periods. Evaluated on temporal information retrieval (T-IR), TPOUR outperforms both unsupervised and supervised baselines, improving average nDCG@5 by +4.04 (+12.15%) on explicit and +4.98 (+15.21%) on implicit queries compared to Qwen-Embedding-8B, despite being 72.7x smaller.
temporal preference optimizationunsupervised dense retrievaltemporal retrieval preference optimizationtime embeddingtemporal information retrieval
TuneAhead: Predicting Fine-tuning Performance Before Full Training Begins
The paper introduces TUNEAHEAD, a lightweight framework for predicting fine-tuning performance of large language models (LLMs) before full training. The method encodes candidate runs as meta-feature vectors combining static dataset descriptors and dynamic probe features from short standardized probes, then maps these to performance estimates using a predictor with SHAP-based interpretability. Evaluated on 1,300+ fine-tuning runs of Qwen2.5-7B-Instruct, TUNEAHEAD achieves 1.47 percentage point RMSE and 95.1% predictions within ±3 percentage points of true scores, outperforming baselines like Early-Stop Extrapolation and ProxyLM.
fine-tuning predictionmeta-feature vectorshap-based attributionlarge language modelsperformance estimation
Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games
The paper introduces Equation-to-Behavior Prompting, a method leveraging cognitive models to enhance large language models' simulation of diverse human decision-making in persuasion games. The approach combines mathematical models (Bayesian updating, affine distortion, motivated updating, Grether's $α$-$β$ model) with prompting or reinforcement learning (Equation-to-Behavior RL). Results show large models successfully approximate cognitive models via prompting, while small models require RL, reducing belief error by 26.5% in out-of-distribution settings. Training with diverse decision-makers improves belief change by 2.5%-12% over Bayesian-only baselines when persuading GPT-5-mini.
equation-to-behavior promptingcognitive modelspersuasion gamesbayesian updatingreinforcement learning
A Risk Decomposition Framework for Pre-Hoc Fine-Tuning Prediction
The paper establishes a theoretical framework for pre-hoc fine-tuning prediction in LLMs, decomposing prediction risk into intrinsic data-model compatibility and reducible optimization variance. It proves a necessary lower bound on optimization variance decay and introduces a budget-optimal probing principle with a predictability phase diagram categorizing tasks into Static-Sufficient, Dynamic-Critical, and Noise-Dominant regimes. Experiments on synthetic and real-world benchmarks validate the theoretical regimes and demonstrate the probing strategy's efficiency.
pre-hoc predictionrisk decompositionoptimization variancephase diagramprobing strategy
From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs
The study introduces a dual diagnostic framework combining layer-wise linear probing and Context-Stripped Decoding (CSD) to analyze the internal lifecycle of code reasoning in LLMs, revealing four resolution outcomes: Resolved, Overprocessed, Misresolved, or Unresolved. Applied to 16 models across Qwen, Llama, and DeepSeek architectures, results show only 41.5% Resolved accuracy, with task-specific bottlenecks like Function Call success dropping from 61.1% to 2.5% with increased call depth. The brewing scaffold remains stable (24-42% duration) across models, while resolution success varies with capability and scale.
code reasoninglinear probingcontext-stripped decodingtransformer architecturesfailure modes
SketchXplain: Intuitive Visual Explanations of Image Classifiers with Sketches
SketchXplain introduces sketch-based visual explanations for image classifiers, addressing the interpretability gap in saliency maps by generating intuitive, coherent, and selective visualizations. The method combines saliency maps, concept-bottleneck models, and sketch optimization to integrate observation artifacts, knowledge coherence, and abstraction. Evaluations on face expression recognition and skin lesion diagnosis demonstrated faster interpretation and better alignment with user knowledge compared to saliency maps or simple drawings, supporting lay diagnosis.
saliency mapsconcept-bottleneck modelssketch optimizationinterpretabilityvisual explanations
Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns
SkillMigrator introduces transferable interaction patterns (TIPs) to improve web skill reuse across sites by matching layout structure rather than instruction similarity or site metadata. The method pairs each induced skill with a structural sketch of the snapshot at induction time, retrieves TIPs by layout similarity at test time, and grounds references on the live page. Evaluated on WebArena and Mind2Web, SkillMigrator reduces LLM-action count by 8-10% on successful trajectories while maintaining matched success rates.
transferable interaction patternsweb skillslayout similarityllm-action countaccessibility-snapshot
Bounding Box Label Propagation for Re-Annotation of Document Layout Analysis Datasets
The paper introduces Bounding Box Label Propagation (BBLP), a pseudo-labelling framework for semi-supervised re-annotation of document layout analysis datasets. BBLP integrates visual, textual, and positional embeddings via an object encoder to propagate labels from a small labelled subset to unlabelled data. On the D4LA dataset, BBLP achieves 54.0% mAP (81.6% of fully supervised performance) using only 10% labelled data, demonstrating effective label propagation for object detection tasks.
bounding box label propagationdocument layout analysissemi-supervised learningobject detectionpseudo-labelling
FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness
FinAcumen introduces a financial reasoning agent framework with self-evolving experience memory to address multimodal reasoning challenges in finance. The method employs selective experience memory that accumulates and retrieves financially grounded reasoning trajectories, activating relevant memories via semantic similarity thresholds while suppressing irrelevant ones. Evaluated on four financial multimodal benchmarks, FinAcumen outperforms finance-specialized models and approaches proprietary general-purpose models using an 8B vision-language backbone, demonstrating improved reliability under retrieval uncertainty.
multimodal reasoningexperience memorytool-augmented agentssemantic relevance thresholdfinancial benchmarks
Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification
Brick-DICL introduces a dynamic in-context learning framework for automated Brick schema classification in Building Management Systems (BMS). The method combines metadata-RAG for domain knowledge enhancement and class-RAG to reduce classification space, supplemented by a multi-LLM filtering mechanism for low-confidence predictions. Results show significant accuracy improvements over existing methods, reduced manual verification effort, and applicability across diverse BMS datasets, advancing standardized building management interoperability.
brick schemain-context learningbuilding management systemsretrieval-augmented generationmulti-llm filtering
Divide, Deliberate, Decide: A Multi-Agent Framework for Fine-Grained Egocentric Action Recognition
The paper introduces Divide, Deliberate, Decide (D3), a zero-shot multi-agent framework for fine-grained egocentric action recognition. The method employs (i) a Vision-Language Model (VLM) orchestrator to segment videos and propose candidate labels, (ii) an ensemble of heterogeneous VLMs that deliberate via peer consultation, and (iii) Borda count aggregation for final prediction. Operating without fine-tuning, D3 improves zero-shot performance by leveraging decorrelated model priors rather than additional compute. Experiments confirm the framework's effectiveness in enhancing action recognition accuracy through structured multi-agent deliberation.
vision-language modelsegocentric action recognitionzero-shot learningmulti-agent systemsborda count
SkillMoV: Mixture-of-View Routing with Prototype-Conditioned Gating for Unified Multi-View Proficiency Estimation
SkillMoV introduces a unified framework for multi-scenario proficiency estimation from synchronized multi-view video, featuring a Mixture-of-View Projector (MoVP) with four components: view-dependent expert routing, cross-view attention, prototype anchoring, and gated projection. The method achieves 50.17% accuracy on EgoExo4D in the Exos setting, outperforming prior work by 3.57 percentage points, while maintaining parameter efficiency via LoRA adaptation (23.32% trained parameters). Ablations confirm individual component contributions, with MoV routing providing the largest gain (+6.61 pp).
mixture-of-viewproficiency estimationmulti-view videoprototype anchoringlora adaptation
Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning
The paper addresses the retention-forgetting dilemma in training-free verbal reinforcement learning for LLM agents by proposing a feedback-driven curation loop architecture. The method introduces a three-layer system (rules, evidence, skills) that governs insight lifecycle through outcome-driven evaluation and non-monotonic knowledge management. On financial forecasting tasks, results show the curation loop transforms accumulated experience from performance-degrading to accuracy- and return-enhancing, with improvements contingent on proper governance mechanisms.
verbal reinforcement learningnon-stationary environmentsinsight governancefeedback-driven curationretention-forgetting dilemma
Understanding LLMs in Title-Abstract Screening: From Disagreements to Recommendations
This study investigates reliability issues in using large language models (LLMs) for title-abstract screening in systematic reviews (SRs), moving beyond accuracy metrics to analyze qualitative failure modes. Through analysis of six software engineering SRs (1,000+ papers) comparing zero-shot LLM screening with human experts (κ=0.52-0.77), recurring failure patterns were identified: boundary ambiguity, keyword overemphasis, and incorrect topic inference. The work proposes actionable recommendations including pre-deployment semantic validation, multi-LLM ensembles, and focused borderline-case validation. Future work requires empirical validation of recommendations and community guidelines for LLM use in SRs.
systematic reviewszero-shotkappa agreementsemantic validationborderline cases
Visored: A Controlled-Natural-Language Prover for LLM-Generated Mathematics
The paper introduces Visored, a dependent-type-based theorem prover optimized for LLM-generated mathematics, bridging informal mathematical writing and formal verification. Its design features a controlled natural language surface syntax and rule-driven automation for handling routine proof steps omitted in textbooks, with output convertible to Lean. Initial experiments demonstrate that LLMs can effectively utilize Visored on the miniF2F benchmark without prover-specific training data.
dependent typestheorem provercontrolled natural languagelean verificationminif2f benchmark
LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks
The study identifies a counterintuitive phenomenon where concatenating LLM-generated node features (specifically SBERT-encoded GPT-4o-mini TAPE features) with original features degrades GNN performance on homophilous benchmarks, contrary to prior reports of improvement. Through systematic experiments on Planetoid splits (PubMed, Cora, CiteSeer) and other datasets (WikiCS, ogbn-arxiv), the authors demonstrate accuracy drops up to -17.0 pp on PubMed, with attenuation under varied conditions (backbone architectures, splits). They propose Delta_sig, a measure of LLM-alone discriminability, which correlates (r²=0.38) with concatenation cost and classifies 7/9 datasets correctly. A power law (r²=0.97) links feature dimensionality (d_l) and sample size (n) to performance degradation.
graph neural networksllm featureshomophilyconcatenation interferencediscriminability
Surrogate Assisted Pedestrian Protection Design via a Foundation Model Orchestrated Workflow
The paper introduces a foundation model-orchestrated workflow for pedestrian protection design in crash safety, addressing challenges of nonlinear dynamics and discrete state transitions. The method integrates: (1) a CAE-trained surrogate model (R²=0.87) with conformal prediction, (2) NSGA-II for multiobjective search, (3) topology-preserving geometry morphing, and (4) an LLM/VLM interface for workflow orchestration and design comparison. In an automotive bumper case study, the system generates 35 safety-compliant designs in seconds versus weeks with conventional CAE, demonstrating foundation models' potential as integration layers between ML and physics-based simulation.
surrogate modelingmultiobjective optimizationconformal predictiontopology morphingfoundation models
DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack
DeepInsight introduces a unified evaluation infrastructure for Physical AI systems, addressing the challenge of cross-layer regressions in heterogeneous stacks spanning foundation models to embodied control. The method employs three narrow abstractions—task, resource, and result—realized as invariant protocols (episode driver, resource-handle, and trace identity) across subsystems. Results show reproduction of peer-framework benchmarks with faster single-node execution, near-linear scaling, and cross-layer diagnostic capabilities via shared traces, validated in a production embodied humanoid stack.
physical aicross-layer evaluationtrace identityresource-handle protocolembodied control
Geometric Consistency Protocol for Foundation Model Features in Multi-View Satellite Imagery
The paper introduces a geometry-consistent evaluation protocol for foundation model features in multi-view satellite imagery, addressing limitations of conventional 2D global matching approaches. The method integrates Rational Function Model (RFM) constraints via a 3D consistency metric and geometry-constrained dense matching, explicitly accounting for height-dependent epipolar geometry. Key findings reveal a decoupling between semantic similarity and geometric localization, with standard 2D backbones performing competitively against specialized 3D-aware models under the proposed RPC-consistent evaluation framework.
rational function modelepipolar geometrymulti-view reconstructionfoundation featuressatellite imagery
An AI Security Agent for Banking: Multi-Vector Fraud and AML Detection Across Retail and Corporate Accounts
The paper introduces an AI security agent for banking that detects multi-vector fraud and AML across retail and corporate accounts via a three-component fusion architecture. The system processes parallel transaction and session streams, combining LSTM sequence models, statistical velocity/threshold monitors, and graph-based network modules to capture behavioral history and relationship patterns. Evaluated on synthetic logs (237,669 transactions, 113,508 sessions), it achieves F1 scores of 0.787 (transaction) and 0.867 (session), outperforming rule-based (0.562/0.733) and LSTM-only (0.655/0.713) baselines, with sub-millisecond critical-tier response latency.
lstm sequence modelaml detectiongraph network moduletransaction streamsession hijacking
Reversal Q-Learning
The paper introduces Reversal Q-learning (RQL), a novel off-policy RL algorithm that trains flow policies using prior data within an expanded MDP framework. RQL employs virtual on-policy trajectory generation via flow reversal and a bias-and-variance reduction technique to address the curse of horizon. Compared to existing flow-based RL methods, RQL avoids backpropagation through time, better utilizes learned value functions, and directly trains expressive flow policies. Experiments on 50 simulated robotic tasks demonstrate RQL's superior performance over state-of-the-art flow-based offline RL algorithms.
reversal q-learningoff-policy reinforcement learningflow matchingmarkov decision processbias-and-variance reduction
SEAGym: An Evaluation Environment for Self-Evolving LLM Agents
The paper introduces SEAGym, an evaluation environment for self-evolving LLM agents that measures agent harness updates across multiple dimensions including training, validation, and cost records. SEAGym transforms Harbor-compatible benchmarks into dynamic task sources with train batches, frozen validation, transfer views, replay diagnostics, and metric records. Experiments on Terminal-Bench 2.0 and HLE comparing ACE, TF-GRPO, and AHE reveal that evaluation views provide complementary signals, with frequent updates sometimes failing on held-out tasks and source diversity affecting harness reliability.
self-evolving agentsagent harnessevaluation environmenttransfer viewsreplay diagnostics
Offline Preference-Based Trajectory Evaluation
The paper introduces preference-based trajectory evaluation to address statistical inefficiency in offline agent assessment, where traditional success-based metrics produce tied comparisons on ~75% of instances. The method compares trajectories via temporal preferences over progress and time-to-return profiles, reducing ties to ~35% across diverse agentic and interactive benchmarks. Results demonstrate improved discriminative power, ranking stability, and data efficiency, suggesting benchmark saturation may stem from evaluation measure choice rather than data or problem difficulty alone.
offline evaluationtrajectory preferencesagentic systemsstatistical inefficiencybenchmark saturation
Reinforcing Dual-Path Reasoning in Spatial Vision Language Models
SR-REAL introduces a dual-path reinforcement learning framework for spatial vision-language models, combining Language-Only Reasoning (LOR) for step-by-step linguistic deduction and Detect-Then-Reason (DTR) for 3D geometric inference via region tokens. The method involves cold-start supervised fine-tuning followed by RL optimization with accuracy, format, and detection rewards. Experiments show SR-REAL outperforms baselines in spatial reasoning tasks, with DTR excelling in region-aware tasks (3D localization) and LOR enhancing general reasoning, while demonstrating positive transfer between paths and cross-domain generalization without task-specific tuning.
spatial reasoningvision-language modelsreinforcement learning3d geometric inferencechain-of-thought
OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation
DRIVE-CHOREO introduces an LLM-choreographed multi-agent world model for controllable multi-view driving video generation, addressing heterogeneous control injection and cross-view fusion through a shared symbolic interlingua. The system employs three Qwen2.5-VL agents (Director, Cartographer, Auditor) to parse user intent into a WorldScript, ground it in layout tokens, and provide cross-view critiques, co-compressed with video via a 3-D VAE. On nuScenes, it achieves state-of-the-art multi-view consistency and BEV mAP (21.6) with FVD 45.7, while synthetic data improves real-world detector performance by +2.4 NDS.
world modelmulti-agentlatent co-compression3-d vaebev map
Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery
The study investigates routing accuracy degradation in enterprise LLM assistants as tool catalogs scale, analyzing a 110-agent, 584-tool production system. Using single-step routing evaluation on three frontier models, it reveals a 16--23 percentage point F1 drop for under-specified requests, decomposed via oracle analysis into retrieval (model failure to surface tools) and confusion (10pp ceiling drop even with perfect retrieval) gaps. Embedding-based shortlisting recovers +10--11pp F1 at full scale, validated by a 1,435-utterance human annotation study showing +10--17pp improvement despite lower absolute performance.
llm routingtool catalogretrieval gapembedding shortlistingoracle analysis
FoundCause: Causal Discovery with Latent Confounders from Observational Data
FoundCause introduces an amortized causal discovery model that learns from synthetic structural causal models to infer directed graphs and latent confounders from observational data in a single forward pass. The architecture employs a permutation-invariant transformer encoder with statistics-conditioned attention, factorized edge/direction decoding, and explicit confounder modeling via latent tokens. Evaluated against 15 classical and amortized baselines on real-world datasets, it achieves +9.6% F1, +1.2% AUROC, and 18.9% lower structural Hamming distance versus top non-amortized methods.
causal discoverylatent confoundersamortized inferencestructural causal modelspermutation-invariant transformer
Unlocking LLM Code Correction with Iterative Feedback Loops
The study systematically evaluates LLMs' code correction capabilities through iterative feedback loops, introducing novel metrics for failure analysis and rectification patterns. Using four models across two programming languages, it implements an iterative refinement framework where models receive compiler errors and testcase feedback after each attempt. Results demonstrate reasoning models' superior performance in leveraging feedback, with 2-3x improvement over non-reasoning models, while syntactic errors prove more tractable than logical flaws.
iterative refinementcode generationexecution feedbackreasoning modelserror rectification
Geometry-Aware Post-Hoc Uncertainty Quantification in Operator Learning
The paper introduces REEF-GP, a post-hoc uncertainty quantification framework for neural operators that leverages their intrinsic geometry-aware representations. By fitting a Gaussian Process to residuals in the operator's embedded feature space, the method avoids separate feature map learning while incorporating spectral-normalized projections and heteroscedastic noise. Evaluated on five PDE benchmarks with geometric variability, REEF-GP maintains predictive accuracy and calibrated uncertainties comparable to deep ensembles at reduced computational cost. The approach demonstrates robustness under geometric distribution shifts, with uncertainties localizing to physically significant regions like shock fronts.
neural operatorsuncertainty quantificationgaussian processgeometry-awarespectral-normalized
MagicSim: A Unified Infrastructure for Executable Embodied Interaction
MagicSim introduces a unified infrastructure for executable embodied interaction, addressing fragmentation in current robot learning pipelines through a deterministic batched runtime and shared MDP. The system constructs diverse executable worlds from YAML specifications, integrating task families, physics, sensors, and robot embodiments in a single reset-and-step loop. It supports benchmark evaluation, automatic trajectory generation, and agent interaction via a Command->Skill->Planner->Robot->Record pipeline, producing structured multimodal trajectories that align language supervision with executed episodes.
embodied interactiondeterministic runtimemarkov decision processmultimodal trajectoriesplanner-in-the-loop
LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline
The paper introduces a curriculum-grounded LLM-as-Judge pipeline for automated question-level marking in high-stakes exam preparation, co-developed with an industrial partner. The method employs a staged LLM workflow to identify topics, subtopics, and cognitive demand, then generates question-specific rubrics and evaluates marking criteria using syllabus artefacts like performance band descriptors and glossary definitions. Preliminary results show marking outcomes comparable to human tutors, with improved traceability to curriculum standards, and early deployment data from an online study platform provide operational insights.
llm-as-judgecurriculum-groundedquestion-level markingperformance band descriptorsstaged llm workflow
Online LLM Selection via Constrained Bandits with Time-Varying Demand
The paper introduces an online learning algorithm for LLM selection in edge-cloud systems, addressing model heterogeneity and time-varying demands under hard and soft constraints. Formulated as a constrained bandit problem, the method uses confidence-bound estimates and demand predictions to balance reward maximization with constraint satisfaction. Theoretical guarantees show sublinear regret and constraint violations versus an offline benchmark. Experiments on synthetic workloads validate the approach's effectiveness in dynamic, resource-constrained environments.
llm selectionconstrained banditsonline learningedge-cloud inferencesublinear regret
Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing
The paper introduces STATEWITNESS, an activation explainer for auditing deceptive behavior in reasoning LLMs. The method employs a separate decoder to interpret hidden states of target models, generating natural-language queries, structured reports, and evidence traces. Evaluated on two reasoning LLMs across seven deception datasets, STATEWITNESS achieves 0.916 mean AUROC, outperforming black-box text monitors by 11.6% and activation-probe baselines by 25.0%. It also reduces missed deceptive examples in threshold ensembles and provides interpretable outputs for human inspection.
activation explainerdeception auditinghidden statesreasoning llmsauroc
AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows
The study introduces AIPatient Arena, an EHR-grounded framework for evaluating LLMs in multi-turn clinical consultations across eight competence dimensions. The method integrates EHR data into patient-specific knowledge graphs, tested on primary (n=437) and out-of-distribution (n=119, n=67) cohorts. Results show strong performance in questioning skills (4.43-4.99/5) and ethics (4.38-4.93/5), but weaknesses in handling ambiguity (2.57-3.32/5) and diagnostic accuracy (2.63-3.55/5), revealing limitations in workflow-oriented clinical reasoning.
electronic health recordsknowledge graphsclinical consultationmulti-turn interactiondiagnostic reasoning
AUTOGATE: Automated Clock Gating via Toggling-Aware LLM-based RTL Rewriting
AUTOGATE introduces an automated framework for fine-grain clock gating (FGCG) in RTL designs, addressing limitations of current LLM-based approaches by combining ML-based waveform analysis with hierarchical LLM-driven RTL rewriting. The method employs ML clustering to distill toggle traces into structured representations, enabling workload-aware optimization without direct LLM processing of raw waveforms, and a multi-agent architecture for scalable hierarchical optimization. Evaluations show dynamic power reductions of 49.31% on small designs, 19.34% on NVDLA, 7.96% on BlackParrot, and up to 6.86% on proprietary industrial designs.
fine-grain clock gatingrtl rewritingtoggle-aware optimizationhierarchical multi-agentdynamic power reduction
Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation
The paper introduces CEO-Bench, a multi-agent benchmark evaluating LLMs on CEO-level strategic resource reallocation by simulating conflicting C-suite advisor inputs under organizational constraints. The framework assesses models on role integration, conditional boldness, history-sensitive judgment, and plan validity across 13 scenarios. Results from five frontier models reveal high structural validity but strategic calibration failures, including single-advisor capture and a tradeoff between integration depth and decisiveness, delineating current LLM limitations in executive decision-making.
strategic resource reallocationmulti-agent simulationrole-conditioned advisorshistory-sensitive judgmentstructural validity
Dissecting model behavior through agent trajectories
The paper introduces the concept of the 'intent-execution gap' in AI agents, highlighting the mismatch between model intentions and harness execution. It proposes Simple Strands Agent (SSA), a customizable harness designed to align model assumptions with harness behavior across diverse model families (Claude, Gemini, GPT, Grok, Qwen). The study reproduces or improves pass@1 performance on agentic benchmarks (SWE-Pro, SWE-Verified, Terminal-Bench-2) and analyzes 138k trajectories to reveal model-specific problem-solving behaviors through metrics like edit frequency and testing activity.
intent-execution gapagent harnesspass@1trajectory analysismodel alignment
MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors
The paper introduces MapSatisfyBench, a benchmark for evaluating satisfaction-aware map agents that address implicit decision factors in underspecified user queries. The authors propose a restore-identify-filter framework to reconstruct user needs from behavior-chain evidence, identify evaluable implicit factors, and retain those supported by pre-query evidence. Experiments reveal current agents excel at explicit task completion but struggle with implicit factors and proactive evidence acquisition, establishing the benchmark's utility for shifting evaluation toward satisfaction-aware spatial decision making.
satisfaction-aware agentsimplicit decision factorsbehavior-chain evidencemap servicesunderspecified queries
A Machine-Learned Comorbidity Index
The authors propose a Machine-Learned Comorbidity Index (MLCI) that addresses limitations of traditional scores (Charlson, Elixhauser) by capturing nonlinear risk-outcome relationships across multiple clinical outcomes. MLCI maps diagnosis codes to a scalar via normalized Hilbert-Schmidt Independence Criterion (nHSIC) optimization, theoretically ensuring a unified admission-level ordering. Evaluated on multiple EHR datasets, MLCI outperforms baselines in predictive performance across diverse clinical outcomes.
comorbidity indexhilbert-schmidt independence criterionelectronic health recordsnonlinear risk modelingclinical outcome prediction
MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation
The paper introduces MODE-RAG, a multi-agent system for mitigating hallucinations in Multimodal Retrieval-Augmented Generation (M-RAG) systems. The method employs Variational Free Energy (VFE) and internal attention states to dynamically gate interventions, routing high-risk queries to five specialized agents that use Monte Carlo Tree Search (MCTS) for causal derivation and logit perturbations. A Correction agent and Overseer agent ensure formatting stability and factual verification. Evaluated on the ModeVent dataset, the system significantly reduces hallucination rates and logical fabrications, enhancing M-RAG robustness.
multimodal retrieval-augmented generationvariational free energymonte carlo tree searchhallucination mitigationlogit perturbations
Incumbent Advantage: Brand Bias and Cognitive Manipulation Dynamics in LLM Recommendation Systems
The study investigates brand bias in LLM-based product recommendations, revealing a Conditional Monopoly effect where established brands dominate recommendations (IAI = 10.0) for identical products, though this advantage dissipates with minor competitor rating improvements (+0.1 stars). Through experiments with GPT-4o-mini, Claude Sonnet, and Gemini 3 Flash on skincare products, the authors demonstrate that authority-style marketing language (including fabricated claims) can overcome brand bias at a Bias Surplus Value of +0.17 rating points, with model-specific responses. The work also identifies a social dilemma in multi-brand GEO competition, showing payoff degradation from +0.802 to +0.007 when all brands adopt optimization strategies.
conditional monopolybias surplus valuegenerative engine optimizationllm recommendationscognitive manipulation
Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos
The authors introduce EV9V, the largest public echocardiography video dataset (5,138 videos, 910,579 frames, 9 views), addressing data scarcity in automated view classification. They propose a Spatio-Temporal Fusion Model (STFM) combining CNN-LSTM architectures with uncertainty-aware learning to robustly fuse spatial anatomical features and temporal cardiac dynamics across heterogeneous frame quality. Benchmarking against CNN, RNN, and Transformer baselines on EV9V demonstrates STFM's competitive performance, validating its dual-stream approach for echocardiographic view discrimination.
echocardiographyspatio-temporal fusionuncertainty-aware learningcnn-lstmview classification
Feynman Kac Reweighted Schrödinger Bridge Matching for Surface-Based Tau PET Harmonization
The paper proposes Feynman Kac Reweighted Schrödinger Bridge Matching (FKRSBM), a novel method for harmonizing tau PET imaging data across sites while preserving biological signals. FKRSBM learns a direct stochastic transport process between source and target distributions via entropy-regularized optimal transport, incorporating subgroup-aware endpoint proposals through Feynman Kac reweighting. Evaluated on PI-2620 and AV-1451 tau PET data, FKRSBM outperforms ComBat, CycleGAN, DF, and DSBM in distributional alignment, tau-positivity sign mismatch, APOE subgroup alignment, and disease classification accuracy.
tau pet harmonizationschrödinger bridge matchingfeynman kac reweightingentropy-regularized optimal transportspherical convolutional backbone
L-Proto: Language-Aware Episodic Prototypical Training for Multilingual Speaker Verification
L-Proto introduces a language-aware episodic prototypical training strategy for multilingual speaker verification, addressing language-dependent acoustic variability that entangles speaker identity with linguistic characteristics. The method constructs language-consistent episodes by sampling speakers from a single language per episode, reducing language-driven variation and encouraging embeddings to focus on speaker identity. Experiments on the TidyVoice Challenge benchmark show consistent improvements over conventional fine-tuning and random episodic sampling across multiple backbone architectures.
multilingual speaker verificationepisodic traininglanguage-awareprototypical networksacoustic variability
Enhancing Pathological VLMs with Cross-scale Reasoning
The paper introduces a cross-scale training paradigm for pathological vision-language models (VLMs), addressing the lack of explicit cross-scale reasoning in existing datasets. The authors propose Scale-VQA, a benchmark with 4,685 multiple-choice questions grounded in 2,537 pathology images across multiple magnifications, curated using a leakage-aware pipeline to prevent text-only shortcuts. ScaleReasoner-R1, trained via reinforcement learning, achieves state-of-the-art performance on cross-scale reasoning and generalizes to single-scale benchmarks, demonstrating the value of cross-scale supervision.
vision-language modelscross-scale reasoningpathological imagesmulti-magnification vqareinforcement learning
Discrete Autoregressive Transformer for Generative Mechanism Synthesis
The paper introduces a discrete autoregressive transformer for generative mechanism synthesis, addressing planar path synthesis by mapping target curves to diverse linkage mechanisms. The method employs a decoder-only transformer with VAE latent conditioning and mechanism-type tokens, trained with token cross-entropy and an ordinal-aware auxiliary loss. Inference uses bounded latent-noise scheduling to generate top candidates, achieving mean Chamfer distance 0.0132 and dynamic time warping 0.153 on held-out tests, outperforming a k-nearest-neighbor baseline (0.0071 and 0.117) with matched topology.
autoregressive transformermechanism synthesisvariational autoencoderchamfer distancedynamic time warping
Graph Neural Networks for Semi-Supervised Image Classification with Multi-Feature Aggregation
The study introduces a Graph Neural Network (GNN) approach for semi-supervised image classification by aggregating multi-feature and multi-graph representations from diverse extractors. The method combines features from Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), processes them via manifold learning, and employs rank aggregation for integration. Experiments demonstrate that strategic feature-graph combinations and manifold-based graph processing significantly improve classification accuracy, particularly in low-label scenarios.
graph neural networkssemi-supervised learningfeature aggregationmanifold learningrank aggregation
Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation
The paper introduces an adaptive clinical decision support AI system integrating Treatment Effect estimation, Digital Twin simulation, and Reinforcement Learning for real-time treatment optimization. The framework employs continuous learning from historical data, with safety ensured via rule-based vital sign monitoring and clinician review for uncertain cases. Validation on synthetic and TCGA ovarian cancer datasets shows superior effectiveness and stability versus baselines, with low latency and minimal expert intervention required, demonstrating feasibility for clinician-supervised personalized medicine.
treatment effectdigital twinreinforcement learningclinical decision supportpersonalized medicine
Bridging Spatial And Frequency Views For Disaster Assessment: Benefits And Limitations
The study presents a controlled comparison of spatial-domain, frequency-domain, and dual-domain deep learning approaches for building damage classification using post-disaster satellite imagery from the xView2 (xBD) dataset. All models employ an EfficientNet-B0 backbone under identical training conditions, varying only in input representations and fusion strategies. Results show dual-domain models achieve the highest test accuracy (0.4688), while spatial-only models yield the best macro F1-score (0.4254); frequency-only models exhibit poor generalization. Challenges persist in detecting minor damage due to class imbalance and visual ambiguity.
building damage classificationfrequency-domain representationefficientnet-b0xview2 datasetdual-domain fusion
The Discrete-Log Clock: How a Transformer Learns Modular Multiplication
The paper demonstrates that transformers learning modular multiplication employ a sparse representation in the multiplicative character transform basis, contrary to prior findings of dense spectra in the additive DFT basis. Analyzing a model trained on $a \cdot b \bmod 113$, the authors show the embedding spectrum becomes highly sparse (Gini coefficient 0.58) with only 4 dominant frequencies, and 96.9% of MLP neurons exhibit single-frequency tuning. The transformer implements a "Discrete-Log Clock" algorithm, reducing multiplication to addition in discrete-log space, analogous to known addition algorithms. This methodology reveals interpretable structure when the analysis basis matches the task's algebraic structure.
transformersmodular multiplicationmultiplicative character transformdiscrete-log clocksparse representation
SoK: AI-Augmented Binary Reversing
This paper presents the first systematization of knowledge for AI-augmented binary reversing, analyzing 144 studies since 2015 across 22 domains. The authors develop a unified taxonomy connecting traditional analysis techniques, binary-derived artifacts, representation strategies, learning paradigms, and inference tasks, while clarifying LLMs' and agentic AI's emerging roles. Results reveal common structures across approaches, persistent challenges in evaluation, and opportunities for future research in scalable AI-assisted reversing systems.
binary reversinglarge language modelsartifact representationsagentic aiinference tasks
NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama
The paper introduces NarrativeWorldBench, a benchmark for evaluating long-horizon narrative coherence in audio drama, and N-VSSM, a latent world model for maintaining narrative consistency. The benchmark assesses nine structural metrics across horizons up to 200 episodes and four Indic languages. N-VSSM, a Mamba-2-based variational state-space model with a 256D latent state and 8B decoder, achieves plot-beat F1 >= 0.84 across all horizons at 4x lower compute than frontier LLMs. Evaluations show +0.20-0.23 Likert improvement in cross-lingual fidelity and 71% preference over Claude Opus 4.5 in professional writer trials.
narrative coherencelatent world modelvariational state-spacelong-horizon evaluationcultural transfer
Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models
The study challenges the Attention-Confidence Assumption in Vision-Language Models (VLMs) by introducing the VLM Reliability Probe (VRP) to analyze reliability signals. Using structural-attention metrics (cluster counts C_k, spatial entropy H_s, and its layer-wise evolution ΔH_s), the authors identify 'Symbolic Detachment'—where early visual features become decoupled from final generation. Results show spatial attention has negligible correlation with accuracy (R ≈ 0.001), while self-consistency across reasoning paths strongly predicts truth (R = 0.429). Architectural analysis reveals divergent reliability mechanisms: LLaVA relies on fragile late-stage bottlenecks, whereas PaliGemma and Qwen2-VL exhibit global resilience.
vision-language modelsspatial entropysymbolic detachmentself-consistencyreliability probe
TerraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations
The paper introduces TerraTransfer, a method for training end-to-end driving policies without expert demonstrations by decoupling visual perception from control learning. The approach uses self-play in vectorized simulators to pretrain a policy, then aligns its latent space with a pretrained vision backbone via action KL divergence and a batch-relational low-rank structural loss. This eliminates the need for costly expert trajectories, requiring only paired (image, scene-state) frames. Evaluated on photorealistic 3D Gaussian splatting closed-loop scenarios, the method matches or surpasses prior end-to-end approaches.
end-to-end drivingself-playlatent space alignmentkl divergencegaussian splatting
Model Validation of Agentic AI Systems: A POMDP-Based Framework for Belief-State, Forecast, and Policy Validation
The paper proposes a POMDP-based validation framework for agentic AI systems, addressing model risk in autonomous decision-making by decomposing it into information, beliefs, forecasts, actions, and utility components. It formalizes LLMs as approximate Bayesian filtering operators and introduces a taxonomy covering state-space, filtering, forecast, policy, utility-specification, and parameter risks. A portfolio-management case study demonstrates the framework, showing robust performance across parameter variations and independent contribution of latent-state inference to decision quality.
pomdpagentic aibayesian filteringmodel risklatent-state inference
MeiBRD: Meta-Learning Intraoperative Biomechanical Residual Deformation
The paper introduces MeiBRD, a hybrid framework for intraoperative liver registration that combines biomechanical priors with data-driven residual learning. The method learns a graph neural diffusion function to correct linear biomechanical predictions, using sparse intraoperative measurements as context samples for meta-learning. Evaluated on a deformable liver phantom dataset, MeiBRD outperforms rigid, biomechanical, and data-driven baselines in registration accuracy and generalization, particularly for out-of-distribution cases.
biomechanical registrationresidual deformationgraph neural diffusionmeta-learningintraoperative imaging
Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication
This study reconciles contradictory findings about large vision-language models' (LVLMs) ability to generate efficient referring expressions by comparing implicit vs. explicit prompting strategies. The authors controlled for task differences between prior studies (Jones et al., 2026; Zeng et al., 2026) while systematically varying prompt explicitness. Results confirm LVLMs can produce efficient referring expressions when explicitly instructed, but fail to infer this requirement from implicit prompts alone, revealing a fundamental divergence from human communicative efficiency.
large vision-language modelsreferring expressionsprompting strategiescommunicative efficiencyimplicit learning
Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes
The paper proposes a distributed general-purpose agent network architecture enabling heterogeneous agents to discover peers, establish trust, and execute open-ended tasks through semantic coordination. Key innovations include a protocol adaptation layer bridging task semantics with network operations, and three core mechanisms: bodyless gossip for collaborator discovery, BAID-based identity binding with MG-EigenTrust reputation, and Stackelberg-style mechanism-generation loops. Prototype evaluations demonstrate feasibility, with BAID verification overhead measurements and MG-EigenTrust simulations showing resilience against cross-topic collusion attacks.
distributed agent networkssemantic announcement propagationbaid-based identitymg-eigentrust reputationstackelberg mechanism
DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models
The paper introduces DriveJudge, a vision-language model (VLM) agent for context-aware and interpretable evaluation of autonomous driving policies. The method combines rule-grounded evaluation with VLM reasoning, selectively invoking deterministic rule functions after environmental context interpretation. Evaluated on a curated dataset of 33,577 driving samples, DriveJudge outperforms EPDMS by 21.23 AUC in driving quality classification and DriveCritic by 6.5% in trajectory preference selection, establishing a new benchmark for driving evaluation.
autonomous drivingvision-language modelspolicy evaluationcontext-awarenessinterpretability
Translating the Untranslatable: An Operationalizable Ontology for Untranslatability
The paper introduces an operationalizable ontology and taxonomy for untranslatability in machine translation (MT), addressing cases where meaning cannot be directly preserved across languages. The authors propose a structured framework categorizing untranslatability types and compensation strategies, implemented as a multilingual dataset of untranslatable sentences with strategy-based translations. Human preference studies indicate that translation quality varies by strategy, with a consistent preference for the Annotation strategy (explanatory context). This work provides a foundation for strategy-informed MT research.
untranslatabilitymachine translationcompensation strategiesmultilingual datasetontology
Do Large Language Models Always Tell The Same Stories?
This study investigates narrative diversity in LLM-generated stories compared to human-written ones, using narrative similarity metrics across 10 models. Employing contrastive analysis and human evaluations alongside three automated annotation methods, the research finds LLM outputs exhibit higher inter-model similarity than human stories. Frontier models converge on a generic narrative mean, lacking human-authored diversity. Mitigation strategies like negative prompting and temperature scaling prove ineffective against this homogeneity.
narrative similaritycontrastive frameworkllm-generated storieshuman evaluationstemperature scaling
Counterfactual Optimization of Baseball Pitch Sequences and Estimation of Its Impact on Season-Level Statistics
The study introduces a counterfactual optimization framework for baseball pitch sequences, demonstrating their impact on season-level performance metrics. Using MLB Statcast data, a Transformer-based model predicts in-play outcomes, enabling counterfactual analysis by substituting final or setup pitches while holding context constant. Results indicate that optimizing both pitch types and locations can improve seasonal statistics by over 1.0 K/9, with additional insights on velocity-band-specific locations and pitch command importance.
counterfactual optimizationtransformer-based modelpitch sequencingin-play predictionseason-level statistics
Geometry-Consistent Endoscopic Representations for Image-Guided Navigation via Structured Foundation Model Adaptation
The paper introduces a framework for learning geometry-consistent endoscopic representations by combining synthetic data with Hierarchy-Aware Geometry-Semantic Adaptation (HAGSA), a structured alternative to LoRA that selectively inserts low-rank adapters across transformer layers. HAGSA enforces geometric correspondence in intermediate features and semantic consistency in deeper layers. Evaluations on bronchoscopy, sinus endoscopy, and colonoscopy datasets demonstrate improved pose estimation and depth prediction, with favorable synthetic-to-real transfer and scaling properties. The method outperforms baselines in geometric and semantic representation quality for image-guided navigation tasks.
monocular endoscopygeometry-semantic adaptationlow-rank adapterssynthetic-to-real transferhierarchy-aware learning
SpeechDx: A Multi-Task Benchmark for Clinical Speech AI
SpeechDx introduces a multi-task benchmark for clinical speech AI, comprising 12 datasets and 27 tasks structured by disrupted speech production stages (conceptualization, formulation, articulation). The benchmark evaluates generalization via limited-label tasks and cross-dataset condition comparisons, distinguishing clinical patterns from dataset artifacts. Systematic evaluation of 12 audio encoders reveals large-scale speech models as top performers, domain-specific models excelling only on closely matched tasks, and no representation reliably generalizing across clinical speech tasks. SpeechDx provides a standardized framework for assessing progress toward general-purpose clinical speech representations.
clinical speech aimulti-task benchmarkspeech production stageszero-shot transferaudio encoders
MemTrace: Probing What Final Accuracy Misses in Long-Term Memory
The paper introduces MemTrace, a benchmark evaluating long-term memory in LLM agents by tracking individual knowledge points across sessions rather than aggregating question-level accuracy. It probes each fact along three dimensions: memory age, question type (current/earlier state, trajectory), and evidence condition (present/missing/contradicted). Testing 13 memory-system configurations revealed that pooled accuracy masks distinct failure modes—systems often fail to track changes or correct false premises despite retrieving evidence. Key finding: evidence utilization, not retrieval, is the primary bottleneck, with retrievable evidence being underused 10x more frequently than missing.
memtraceknowledge pointevidence conditionmemory agefalse premise
Transformer-Based Warm-Starting for Feasible and Optimal Terminal Approach to Tumbling Objects with Space Manipulators
The paper proposes a transformer-based warm-starting method for sequential convex programming (SCP) to improve trajectory generation in space manipulator terminal approaches to tumbling objects. The framework decomposes the problem into translational planning and coupled attitude-manipulator torque allocation, applying a causal transformer to warm-start the computationally intensive second stage. Evaluated on 300 scenarios, the method reduces SCP iterations by 28% and runtime by 23%, while maintaining control-cost distribution and nearly halving runtime for feasibility projection compared to heuristic initialization.
sequential convex programmingtransformer warm-startspace manipulatortrajectory generationfeasibility projection
Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty
The paper introduces structural uncertainty, a consistency-aware framework for evaluating logical reasoning in large language models (LLMs) by analyzing self-preference-induced rankings over sampled reasoning paths. The method generates multiple candidate solutions, models pairwise preferences via Bradley-Terry with PageRank, and decomposes consistency into across-trial ranking instability and within-trial candidate ambiguity using entropy metrics. Experiments across five LLMs and eight benchmarks show structural signals complement output dispersion, improving unreliable instance detection in logical/mathematical tasks while collapsing in factual retrieval. Instability negatively correlates with accuracy, while ambiguity positively correlates, revealing distinct reasoning regimes.
structural uncertaintyself-preference rankingbradley-terry modelreasoning consistencyentropy decomposition
Nothing from Something: Can a Language Model Discover 0?
The study investigates whether language models can autonomously discover the mathematical concept of zero, testing their capacity for out-of-distribution generalization in arithmetic. Using GPT-2-scale models, the authors demonstrate that (1) pretrained models fail to generalize to zero without explicit training, but (2) achieve substantial improvement after fine-tuning on 10-100 zero examples. Crucially, language pretraining reduces the required training examples by 50%, suggesting linguistic scaffolding aids mathematical discovery. The work provides empirical evidence for the interplay between language and mathematical abstraction in neural models.
out-of-distribution generalizationlanguage modelsmathematical discoverygpt-2zero-shot learning
From Democracies to Autocracies: How AI Systems Enable Authoritarianism by Design
This study investigates how AI systems enable authoritarianism across political regimes through a qualitative analysis of six deployments. Using diverse sources including academic publications and government notices, the authors identify key enabling features: centralized administrative data co-optation, regulatory gaps, weak user compliance, and encoded protected traits. Findings reveal these features manifest in both democratic and autocratic contexts, with centralized systems evading oversight and fragmented systems diffusing accountability. The paper concludes with mitigation recommendations for developers and policymakers regarding AI's authoritarian risks.
authoritarianismgovernance gapscentralized systemsfragmented systemsregulatory compliance
ARVO: Atlas of Reproducible Vulnerabilities for Open-Source Software
The ARVO dataset introduces reproducible vulnerability analysis at scale by addressing key obstacles in bug reproduction, extending the OSS-Fuzz dataset with 6,100+ vulnerabilities across 311 projects. The method ensures each vulnerability is rebuildable, triggerable, and analyzable across versions, enabling automatic patch identification and post-change interaction. Evaluation shows 81% reproduction success and 89.4% patch location accuracy, enhancing both upstream practices and downstream security research.
vulnerability datasetreproducibilityoss-fuzzpatch identificationsecurity research
Skill-Constrained Model Predictive Control for Resilient Manufacturing Supply Chains
The paper introduces a skill-constrained model predictive control (MPC) framework for manufacturing supply chains, where worker certifications dynamically affect production capacity. The controller solves a mixed-integer program each shift, optimizing production, inventory, backlog, and training decisions with binary certification constraints and an interpretable terminal value. Evaluated on synthetic SkillChain-Gym scenarios, the method shows regime-dependent performance: MPC outperforms when skill bottlenecks are forecastable, while static insurance policies remain superior under surprise shocks or tight capacity constraints. Key findings highlight forecastability as the decisive factor for MPC efficacy, not general adaptivity.
model predictive controlmixed-integer programmingskill-constrained productioncertification decaysupply chain resilience
SkillChain-Gym: A Benchmark for Reskilling-Aware Production-Inventory Control under Disruptions
SkillChain-Gym introduces a benchmark for reskilling-aware production-inventory control, addressing workforce capability as a decision variable with skill-state dynamics, certification thresholds, and capacity-constrained training. The environment includes disruption scenarios, deterministic replay, and metrics for operations, resilience, and capability growth. Evaluations of production-only, adaptive, and static-insurance policies over 60-shift horizons reveal regime-dependent outcomes: training-capable policies outperform production-only baselines, with adaptive training excelling under forecasted bottlenecks and static cross-training providing resilience to shocks. Capacity slack and forgetting rate delineate regime boundaries, necessitating forecast-driven controllers.
reskilling-aware controlproduction-inventoryskill-state dynamicsdeterministic replaycapacity-constrained training
Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering
The paper introduces REINS (REpresentation-space INference-time Safety steering), a training-free method for safety alignment in video diffusion models. By analyzing hidden-state activations, the authors identify a linear safety direction via Supervised PCA, enabling inference-time steering toward safe generations without weight updates. Mechanistic analysis reveals optimal steering occurs at intermediate transformer layers (~50% depth), balancing information availability and propagation. Evaluated across 9 models (1.3B-5B parameters) for text-to-video and image-to-video tasks, REINS demonstrates broad effectiveness with negligible overhead.
video diffusion modelsrepresentation steeringinference-time alignmentsupervised pcasafety alignment
MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task
The MLLP-VRAIN group presents their IWSLT 2026 Simultaneous Speech Translation system, leveraging Parakeet and Qwen 3.5 models with adaptive black-box policies for improved quality-latency trade-offs. The system incorporates ASR word-boosting and RAG mechanisms for context-aware translation in En→{De, It, Zh} directions. Evaluation on the MCIF En→De test set demonstrates a +5.82 XCOMET-XL improvement over baseline, with context processing adding +1.03 gain. The work includes detailed latency analysis and participation across all language directions.
simultaneous translationadaptive policiesxcomet-xlasr word-boostingrag mechanism
Physics-Informed Attention Mechanism and Generalization Capability of Deep Learning-Based Grain Growth Evolution Prediction
The study evaluates Out-Of-Distribution (OOD) generalization in deep learning models for grain growth prediction, comparing a baseline model with a proposed physics-informed boundary-masked attention mechanism. Both models, trained on synthetic data, were tested on experimental microstructures, bimodal grain size distributions, and abnormal grain growth without retraining. The boundary-masked attention model significantly improved performance, particularly for bimodal distributions (SSIM: 0.6221 to 0.7609; mean grain size error: 8.75% to 3.57%), demonstrating emergent attention patterns aligned with curvature-driven physics. Results suggest synthetic-trained models can generalize to diverse OOD conditions, with physics-informed attention enhancing accuracy when boundary morphology matches the training domain.
grain growth predictionout-of-distribution generalizationphysics-informed attentionboundary-masked attentionstructural similarity index measure
Rift: A Conflict Signature for Deception in Language Models
The study identifies an internal 'conflict signature' distinguishing deceptive from naive errors in language models, detectable via residual rank analysis. By contrasting sleeper agents (knowingly deceptive) with naive liars (trained to err), the method isolates deception-specific signals independent of output incorrectness. Results show 2.1-2.3x higher residual ranks in deceptive forward passes across GPT-2 variants, Qwen2.5, and Phi-3 (100% detection accuracy, AUC 1.0), with robustness to strategic concealment and cross-model/language transfer (mean AUC 0.933-1.0). The signature is read-only and architecture-invariant.
residual ranksleeper agentbehavioral evaluationzero-shot transferconflict signature
When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval
The paper introduces a self-evolving framework for rule-driven query rewriting to enhance BM25-based legal case retrieval without parameter training. Utilizing an LLM-based agent with automatic evaluation, the system iteratively generates rewriting rules, validates combinations, and prunes ineffective rules via historical feedback. Evaluated on the LeCaRD-v2 benchmark, the method outperforms non-evolutionary baselines, including human-designed rules and greedy selection, particularly when using a high-capacity LLM. Analyses highlight the LLM's ability to leverage experimental results and intrinsic knowledge for rule refinement through self-evolution.
legal case retrievalbm25query rewritingself-evolving frameworkllm-based agent
Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search
The paper introduces DivInit, a training-free method to improve breadth scaling in agentic search by addressing query redundancy in parallel rollouts. Instead of sampling k independent first queries, DivInit generates n candidate queries, selects k diverse seeds, and executes them as parallel trajectories. This approach mitigates overlapping evidence retrieval and subsequent conditioning on shared context. Evaluated across five open-weight models and eight benchmarks, DivInit achieves consistent improvements over standard parallel sampling, with average gains of 5-7 points on multi-hop QA tasks at matched compute budgets.
agentic searchparallel samplingquery redundancymulti-hop qadiverse initialization
Trust-Aware Multi-Agent Traceability: Confidence-Calibrated Knowledge Graphs for Consistent Software Artifact Management
The paper proposes a trust-aware coordination framework for multi-agent AI systems in software engineering, using confidence-calibrated knowledge graphs to mitigate error propagation across pipeline stages. The method combines embedding-based retrieval with LLM-based multi-criteria analysis for traceability link prediction, introduces traceability seeding for confidence comparison, and enforces consistency via threshold gating and conflict resolution protocols. Evaluation on an automotive case study demonstrates improved calibration (critical for coordination) and protocol effectiveness, with ablation studies confirming calibration's necessity.
multi-agent systemsknowledge graphsconfidence calibrationtraceability linkingllm-based analysis
PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation
The paper introduces PowerOPD, a stabilized variant of on-policy distillation (OPD) for large language models that addresses training pathologies in standard OPD. The method employs Box-Cox power transformations (parameterized by α > 0) to create bounded, sign-consistent rewards, avoiding the high-variance gradients of log-ratio rewards. Evaluated across six mathematical reasoning benchmarks and four Qwen3 teacher-student pairs, PowerOPD achieves benchmark-averaged gains of up to +6.37/+5.71 (Avg@8/Pass@8) over vanilla OPD, reduces wall-clock time by 59.2%, and maintains gradient norms 3,000x smaller.
on-policy distillationbox-cox transformationgradient variancemathematical reasoningqwen3
Cluster-Aware Dual-Level Test Specification Generation for Large-Scale Automotive Software Requirements
The paper introduces a cluster-aware pipeline for generating test specifications from large-scale automotive software requirements. The method employs sentence transformers for embedding, UMAP for dimensionality reduction, and HDBSCAN for clustering, followed by a multi-level map-reduce summarization algorithm. This approach improves integration test coverage and summarization fidelity while scaling efficiently to thousands of requirements, as demonstrated on automotive datasets. The pipeline ensures compliance with ISO 26262 and ASPICE standards through retrieval-augmented generation.
sentence transformersumaphdbscanretrieval-augmented generationiso 26262
Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference
The paper develops a statistical framework for using LLMs as surrogates in A/B testing, establishing conditions under which treatment effects estimated on LLM outcomes recover human population effects. It adapts surrogate endpoint theory, showing that calibration under surrogacy and comparability conditions (weaker than distributional equivalence) enables identification of average treatment effects. When conditions fail, partial identification is possible, with diagnostics for falsifying surrogacy and bounds on bias from limited overlap. Stochasticity in LLMs introduces bias and variance, mitigated by averaging multiple draws. Simulations and Upworthy headline experiments validate the approach, though human experiments remain essential for novel interventions.
surrogate endpoint theorya/b testinglarge language modelscausal inferencetreatment effects
PromptMN: Pseudo Prompting Language
PromptMN introduces a domain-specific language for structured AI prompting, using %-prefixed typed directives to explicitly annotate roles, goals, constraints, and other elements within natural language prompts. The method enables semantic resolution of directives without requiring a fixed order, operating between informal prompting and pseudocode. Evaluations on Claude Fable 5, Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5 demonstrate correct interpretation of complex structures like conditionals and prime-checking tasks without fine-tuning. Early results suggest PromptMN reduces ambiguity in human-AI interaction across software development workflows.
pseudo-promptingdomain-specific languagesemantic resolutionreverse prompt engineeringsoftware development lifecycle
Sign-Rank, Index, and List Replicability: Connections and Separations
(No summary returned.)
Adaptive Volumetric Mechanical Property Fields Invariant to Resolution
AdaVoMP introduces a sparse transformer encoder-decoder model to predict spatially-varying mechanical properties (Young's modulus $E$, Poisson's ratio $ν$, density $ρ$) for 3D objects, improving resolution and accuracy over prior methods. The approach employs a sparse adaptive voxel (SAV) structure to represent input shapes and material fields, achieving $16^3\times$ higher resolution than VoMP. Experiments demonstrate superior accuracy and computational efficiency, enabling high-resolution simulation-ready 3D assets with realistic deformable properties.
adaptive voxelyoung's moduluspoisson's ratiosparse transformervolumetric properties
Finite-Time Queue Peak Laws in Stochastic Networks: Logarithmic Scaling After Geometric Thresholds
The paper establishes finite-time queue peak laws for generalized switches under uniform interior slack conditions, demonstrating a two-phase scaling behavior. Using drift-minimizing policies like MaxWeight, the authors prove that queue peaks follow a square-root law up to a geometry-dependent threshold, transitioning to logarithmic growth thereafter. Key results include self-normalization mechanisms that decouple logarithmic coefficients from capacity geometry while preserving geometric thresholds, with matching lower bounds confirming tightness. The analysis leverages finite-time state-space collapse for refined thresholds in input-queued switches, validated by simulations showing predicted two-phase envelopes and variance-sensitive improvements.
generalized switchesfinite-time peaksdrift-minimizing policiesself-normalizationstate-space collapse
Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets?
This work critically evaluates dataset distillation (DD) methods against coreset selection (CS) through standardized benchmarks on ImageNet-1K, ImageNet100, and ImageNette. Seven state-of-the-art DD methods are compared with three CS strategies under consistent training protocols. Results show that SOTA DD methods either match or underperform CS in accuracy and data coverage, while requiring significantly higher computational costs for synthesis. Coresets demonstrate superior representativeness and diversity in approximating the original data distribution.
dataset distillationcoreset selectiondata-centric learningimagenetsynthetic samples
Multi-Source Cybersecurity Logs: An ATT&CK-Labeled Dataset and SLM Evaluation
The authors present a multi-source cybersecurity log dataset combining system, network, and browser logs with MITRE ATT&CK technique labels, addressing a gap in existing public datasets. The dataset contains 870 sessions (70 attack, 800 benign) with 2.3 million events, labeled with 12 tactics and 53 techniques using real attack tools. They evaluate three Small Language Models (Qwen2.5-1.5B, Llama-3.2-3B, Phi-4-Mini) fine-tuned with LoRA, showing accuracy improvements from 8% to 90-97% for chunk classification and 42% exact-match accuracy for technique identification.
mitre att&ckmulti-source logssmall language modelslow-rank adaptationcybersecurity dataset
A Diffusion Approximation for Temporal-Difference Learning with Linear Features under Markovian Noise
The paper presents a stochastic differential equation (SDE) approximation for temporal-difference (TD) learning with linear function approximation under Markovian noise, addressing limitations of prior ordinary differential equation (ODE) models that neglect stochastic fluctuations. The proposed SDE model separates contraction dynamics governed by the projected Bellman operator from Markovian sampling effects, revealing how their interaction determines the error floor in constant-stepsize TD(0). Results show the error floor emerges from the interplay between Markovian long-run covariance and the contraction geometry of the projected Bellman operator.
temporal-difference learningstochastic differential equationmarkovian noiseprojected bellman operatorerror floor
A Convex Quasilinearization Method for Solving Nonlinear PDEs with Physics-Informed Neural Networks
The authors propose LiL-Q, a convex quasilinearization method for solving nonlinear PDEs using physics-informed neural networks (PINNs). The method employs Bellman-Kalaba quasilinearization to decompose the problem into linear subproblems, each solved via direct linear least-squares QR factorization on a Linear-in-Learnables (LiL) trial space (e.g., random-feature ELMs, spectral bases). This replaces nonconvex gradient-based training with convex per-step solves, ensuring local Newton-Kantorovich convergence. Evaluated on seven benchmarks (Bratu, Burgers, Navier-Stokes, etc.), LiL-Q converges in single-digit iterations, matches/exceeds PINN performance with 100× fewer parameters, and recovers exact solutions when span-constrained.
quasilinearizationphysics-informed neural networkslinear-in-learnablesnewton-kantorovich convergenceleast-squares qr
Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports
The study establishes the first empirical baseline for multi-label ATT&CK technique classification on complex unstructured CTI reports using open-source LLMs. Authors constructed a ground-truth dataset of 2,076 human-annotated sentences from 83 reports, achieving κ=0.68 inter-annotator agreement. Evaluating seven models (8B-236B parameters) revealed a maximum micro-averaged F1 of 0.22, with parameter size correlating positively (p<0.05) but prompt strategies and temperature showing no significant impact, demonstrating current open-source LLMs' inadequacy for production use.
mitre att&ckmulti-label classificationcyber threat intelligencellm evaluationinter-annotator agreement
Deep Reinforcement Learning for Minimum Zero-Forcing Sets
The paper proposes SD-ZFS, a reinforcement learning framework adapting S2V-DQN to solve the NP-hard minimum zero-forcing set problem on undirected graphs. The method trains models on varied graph structures and evaluates generalization, scalability, and transferability across network types. Results show SD-ZFS outperforms greedy heuristics and approaches optimal solutions, while analyzing structural influences on zero-forcing set performance.
zero-forcing setreinforcement learninggraph coloringnp-hards2v-dqn
OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization
OmniPlan introduces an adaptive framework for network planning optimization that jointly optimizes timeliness and solution quality. The method combines an LLM-based intent interpreter, a mixture-of-experts architecture (integrating MIP solvers, heuristics, and DRL models), and a DRL-based configuration module for preference alignment. Evaluated on distributed ML inference offloading, OmniPlan reduces latency by 97.8% and device resource consumption by 11.5% compared to baselines while maintaining near-optimality.
network planning optimizationmixture-of-expertsintent interpreterpreference alignmentoffloading
From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning
The paper formalizes compositional generalization in LLM reasoning through a hierarchical latent selection model, where reasoning traces decompose into reusable atomic modules (skills and routing mechanisms). It theoretically demonstrates that supervised fine-tuning (SFT) provides raw module materials in compositional traces, while reinforcement learning (RL) decomposes and recombines them for generalization. Controlled experiments validate that RL extracts and recombines atomic modules from SFT-supplied traces, with compound traces yielding stronger generalization than isolated modules. An effective protocol emerges where SFT ensures atomic module coverage and RL drives exploration of novel compositions.
compositional generalizationlatent selection modelsupervised fine-tuningreinforcement learningatomic modules
Edge Flow: A Tractable and Predictive Continuous-Time Model for Gradient Descent at the Edge of Stability
The authors propose Edge Flow, a continuous-time model of gradient descent dynamics at the edge of stability (EoS), where the loss Hessian's largest eigenvalue approaches $2/η$. The model comprises three coupled ODEs tracking a modified gradient flow center, Rayleigh quotient-based eigenvector direction, and exponential magnitude growth/decay. It requires only two gradients and one Hessian-vector product per iteration, exhibiting self-stabilization via feedback loops. Empirical results show Edge Flow faithfully captures EoS dynamics, including sharpness oscillations, outperforming prior continuous-time models while providing interpretability for instability mitigation.
edge of stabilitygradient descenthessian-vector productrayleigh quotientsharpness stabilization
Tensor-based second-order causal discovery
The paper introduces Tensor-based Second-order Causal Discovery (TSCD), a causal discovery algorithm leveraging second-order statistics from observational and interventional data. TSCD assumes linear structural equation models on directed acyclic graphs (DAGs) with uncorrelated noise, also extending to nonlinear cases. The method identifies causal order and parameters using a logarithmic number of interventions relative to variable count. Experiments demonstrate TSCD's noise robustness, competitive performance against existing methods, and scalability to hundreds of variables.
causal discoverystructural equation modelsecond-order statisticsdirected acyclic graphinterventional data
NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment
The paper introduces Noise-Tilted Reverse Kernel (NTRK), a reward-guided diffusion sampler that aligns generation with rewards without compromising sample quality. NTRK injects reward gradients through the noise term while keeping the pretrained reverse kernel unchanged, requiring only one sample per step. A novel whitening operator enables safe gradient injection. Evaluated on various reward alignment tasks, NTRK outperforms baselines in reward attainment and efficiency, achieving 20× compute reduction (25 vs. 500 NFEs) for comparable aesthetic generation quality.
diffusion modelsreward alignmentreverse kernelwhitening operatorguided sampling
ConTex: Reformulating Counterfactual Generation For Time Series Forecasting
The paper introduces ConTex, a model-agnostic framework for generating counterfactual explanations in time series forecasting. It reformulates counterfactual generation as learning a globally consistent intervention strategy, using a decomposed architecture with temporal context and conditional encoders. ConTex achieves state-of-the-art validity, sparsity, and computational efficiency (12-36x faster than instance-wise methods) across multiple benchmarks, enabling real-time inference at ~0.007 seconds per sample.
counterfactual explanationstime series forecastingintervention strategytemporal context encoderreal-time inference
Uncertainty Quantification for Flow-Based Vision-Language-Action Models
The paper introduces a method for quantifying epistemic uncertainty in flow-based vision-language-action models (VLAs) using velocity-field disagreement (VFD) across ensembles, enabling failure detection and active fine-tuning. The proposed SAVE framework leverages uncertainty-guided data acquisition to reduce demonstration requirements by 22% compared to baselines. Experiments on the LIBERO benchmark show VFD improves uncertainty calibration, failure detection, and adaptation efficiency, enhancing VLA reliability in non-stationary environments.
vision-language-action modelsepistemic uncertaintyvelocity-field disagreementflow matchingactive fine-tuning
INI-VPINN: A Variational Physics-Informed Neural Network with Implicit Neumann and Interface Handling for Multi-Material Domains with Geometric Singularities
The authors propose INI-VPINN, a variational physics-informed neural network that implicitly handles Neumann boundary and interface conditions for multi-material domains with geometric singularities. The method eliminates need for auxiliary loss terms or subdomain networks by employing compact support weighting functions and integration by parts to enforce flux continuity. Evaluated on Poisson and Laplace problems with sharp interfaces, INI-VPINN demonstrates superior accuracy, smoother convergence, and faster training compared to existing PINN formulations. The framework provides a general solution for mixed Neumann-Dirichlet problems in complex geometries.
physics-informed neural networksvariational formulationneumann boundary conditionsmulti-material domainsgeometric singularities
Recursive Scaling in Masked Diffusion Models
The paper introduces Recursive Masked Diffusion Models (R-MDMs), a novel approach to scaling masked diffusion models through recursive depth. By reusing the same denoising transformer multiple times within each diffusion step, R-MDMs achieve iterative refinement without increasing parameter count. Experiments on structured generation tasks (Sudoku, Countdown) demonstrate that R-MDMs match non-recursive baselines with L× fewer parameters and require fewer denoising steps for equivalent quality, improving both parameter efficiency and inference-time compute allocation.
masked diffusion modelsrecursive scalingparameter efficiencydenoising transformerstructured generation
Fast Nonparametric Conditional Independence Testing via Two-Stage Regression
The paper introduces BLITZ, a fast nonparametric conditional independence test for causal discovery that maintains calibration while executing in under one second. The method employs a two-stage regression approach: first removing broad dependencies via low-order polynomial regression, then applying nonlinear feature mapping and residualization with shallow tree regressions. Theoretical analysis shows this design reduces complexity for tree residualizers, balancing bias control and overfitting. Experiments demonstrate BLITZ outperforms kernel and random-feature methods in null calibration, achieving superior causal discovery performance on synthetic graphs and flow-cytometry data with reliable endpoint orientations.
conditional independence testingcausal discoverynonparametric regressionresidualizationtree regression
Half a Link can Be Enough to Predict a Whole Link: Understanding Generalization in Knowledge Graph Foundation Models
The paper investigates generalization mechanisms in knowledge graph foundation models (KGFMs) by analyzing their performance on partially seen links, termed half-links. Through a stratified analysis of four scenarios involving combinations of observed/unobserved half-links $(h,r)$ and $(r,t)$, the authors demonstrate that state-of-the-art KGFMs leverage seen half-links for predictions while struggling with unseen ones. The proposed taxonomy serves as a diagnostic protocol for assessing KGFM robustness and identifies areas for improvement in zero-shot link prediction.
knowledge graph foundation modelszero-shot generalizationhalf-linkslink predictionstratified analysis
Differential Privacy of Gaussian Process Posterior Sampling
The paper establishes differential privacy guarantees for releasing Gaussian process (GP) posterior sample paths when training data is private. By analyzing the intrinsic randomness of posterior sampling through Rényi-DP bounds, it decomposes privacy leakage into contributions from posterior means (data-dependent) and covariances. Results demonstrate that effective ridge regularization critically impacts privacy, with empirical validation via membership-inference attacks showing leakage dependence on regularization strength, posterior variance, and sample-path count. Utility experiments reveal scenarios where privacy-preserving regularization maintains decision quality, while calibrated GP noise provides an adjustable privacy knob for stricter requirements.
gaussian processdifferential privacyposterior samplingrényi divergencemembership-inference
Beyond Visual Cues: CoT-Enhanced Reasoning for Semi-supervised Medical Image Segmentation
The paper proposes CERS (CoT-Enhanced Reasoning Segmentation), a semi-supervised medical image segmentation framework integrating Chain-of-Thought (CoT) reasoning to address visual-semantic mismatches. The method constructs a knowledge pool with LLM-generated linguistic reasoning descriptions, employs semantic-aware reference selection for evidence filtering, and fuses reasoning-derived context via a multi-scale coordinate attention module (MCAM). Experiments show CERS outperforms state-of-the-art methods, particularly in resolving boundary ambiguities and semantic inconsistencies.
semi-supervised segmentationchain-of-thought reasoningmedical image analysislarge language modelsattention mechanism
Predictive Analytics in E-Commerce for CustomerBehavior Forecasting using hybrid Ret-DNN withXGBoost Model
The study proposes a hybrid Retail Deep Neural Network (Ret-DNN) with XGBoost for customer behavior forecasting in e-commerce, addressing challenges in predicting future purchases. The method processes 500,000 transaction records from a UK-based retailer through data cleaning, outlier handling, and feature extraction, using Ret-DNN for feature extraction and XGBoost for final purchase probability prediction. The hybrid model achieves a Mean Absolute Error (MAE) of 0.2193, outperforming the standalone Ret-DNN.
retail deep neural networkxgboostcustomer behavior forecastingpredictive analyticsmean absolute error
Monotonic Kolmogorov-Arnold Networks: A Theoretical and Empirical Study of Monotonicity as an Inductive Bias
The paper introduces MKAN, a Kolmogorov-Arnold Network variant with hard monotonicity guarantees for all parameters via exponential B-spline reparameterization, positive edge weights, and monotone activations, enabling standard gradient descent training. Theoretically, it proves any $C^K$ feature extractor with ball-shaped semantic partitions can be monotonically realized with at most $2N^*$ parameters, where $N^*$ is the original size. Empirically, MKAN matches SOTA monotone NNs on the SMM/ICML-2024 benchmark while uniquely combining hard monotonicity with KAN's edge interpretability, and outperforms baselines in factor recovery on synthetic data (higher Spearman alignment).
monotonic neural networkskolmogorov-arnold networksb-spline reparameterizationrepresentation-cost theoreminductive bias
Revisiting Structural Dependency in Autoregressive Multi-Task Table Recognition via Order-Independent Cell-Level Representations
The paper introduces a structural refinement module for multi-task table recognition that generates order-independent cell features via non-causal attention, addressing the global consistency issues in autoregressive approaches. The method decouples cell representation from sequential decoding order, enabling parallel content inference while maintaining global context awareness. Evaluations on two large datasets show improved cell localization and end-to-end recognition accuracy with a 3× inference speedup.
multi-task table recognitionautoregressive decodingnon-causal attentioncell localizationparallel inference
Meta-classification of one-class classification models using ranking correlation and nearest neighbor
The paper proposes a meta-classification framework for one-class classification (OCC) models by representing them as normality rankings and classifying them using nearest-neighbor and ranking-correlation metrics. The method achieves high accuracy when class labels correspond to training datasets and can classify algorithms when datasets share the same class. Experiments demonstrate classification of OCC models, datasets, and rankings, including an application to sleeping records. The approach provides a unified solution for these tasks, with source code available publicly.
meta-classificationone-class classificationnormality rankingsranking-correlationnearest-neighbor
Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models
Qwen-RobotManip introduces a vision-language-action foundation model for robotic manipulation that achieves generalization through unified alignment across representation, motion, and behavior dimensions. The model leverages a human-to-robot synthesis pipeline to convert hand demonstrations into robot trajectories and harmonizes heterogeneous datasets, constructing a 38,100-hour pretraining corpus from open-source data. Evaluated on OOD settings including RoboCasa365 and LIBERO-Plus, Qwen-RobotManip outperforms prior state-of-the-art models like $π$0.5 by 20% in RoboChallenge and demonstrates zero-shot instruction following, cross-embodiment transfer, and real-robot validation on platforms such as AgileX ALOHA and Franka.
foundation modelrobotic manipulationcross-embodiment transferzero-shot learningvision-language-action
From Drift to Coherence: Stabilizing Beliefs in LLMs
The paper introduces prompted predictive resampling (PPR) to analyze belief dynamics in large language models (LLMs) during multiple-choice question answering, revealing early-stage belief drift that violates the martingale property. The method demonstrates that beliefs eventually stabilize into coherent predictive distributions, leading to two improvements: seed-answer prompting for faster convergence and a self-consistency loss for drift reduction via fine-tuning. Experiments on QA benchmarks show these techniques enhance predictive coherence without accuracy loss.
belief driftmartingale propertypredictive resamplingself-consistency lossmultiple-choice qa
QueryMarket: Cost-Aware Online Active Learning in Data Markets
The paper introduces QueryMarket, a framework for cost-aware online active learning in data markets, addressing the challenge of real-time data acquisition under budget constraints. It proposes OVBAL (online variance-based active learning), which combines D-optimality criteria with exponential forgetting to estimate sample utility and execute cost-aware purchases. The method adapts to nonstationary streams and heterogeneous label costs, demonstrating effectiveness in synthetic and real-world solar power forecasting tasks, particularly under seller-centric pricing.
querymarketovbald-optimalitycost-awarenonstationary
Continual Self-Improvement with Lightweight Experiential Latent Memories
The paper introduces a method for continual self-improvement in large language models by distilling inference-time reasoning traces into compact latent memories. The approach leverages lightweight per-instance training with self-generated rewards (majority voting) to create modular soft prompt memories (~0.001% of model parameters), enabling knowledge reuse without catastrophic forgetting. Evaluated on mathematical reasoning benchmarks, the method outperforms zero-shot and in-context learning baselines, demonstrating effective transfer across datasets while maintaining computational efficiency.
continual learninglatent memoriessoft promptsin-context learningmodular architecture
Blind Recovery of Latent Domains via Unsupervised Symmetry Discovery
The paper introduces an unsupervised framework for recovering latent domains and signals by discovering symmetries in data distributions. The approach models observations as linear measurements from a latent random field, optimizing a shallow group-convolutional network with stationarity and locality regularization. This method learns latent symmetry actions and filters, transforming unstructured observations into symmetry-based representations. Experiments on stochastic processes, Ising models, scrambled images, and neural recordings demonstrate successful recovery of latent signals and domains, positioning symmetry discovery as a novel direction for unsupervised structure learning and blind inverse problems.
blind inverse problemslatent domainssymmetry discoverygroup-convolutional networkunsupervised learning
A fairness-aware extension of Stochastic Multicriteria Acceptability Analysis for ranking
The paper introduces SMAA-Fair, a fairness-aware extension of Stochastic Multicriteria Acceptability Analysis (SMAA) for ranking problems, addressing fairness gaps in traditional SMAA. The method reweights simulated rankings based on group fairness metrics (Statistical Parity, rKL, nDKL), adjusting acceptability indices and central weights vectors accordingly. Experiments on synthetic and real data demonstrate improved protected group representation in favorable ranking positions while maintaining robustness to preference uncertainty.
stochastic multicriteria acceptability analysisfairness-aware rankinggroup fairness metricsacceptability indicespreference uncertainty
Delta-Based Target Reformulation for Short-Term Electricity Load Forecasting Using LSTM and Transformer Models
The paper proposes delta-based target reformulation for short-term electricity load forecasting, replacing absolute load prediction with consecutive time-step differences to address non-stationarity. Using LSTM and Transformer models on Indian hourly load data with meteorological and calendar features, the method is benchmarked against LightGBM for hour-ahead and day-ahead forecasts. Results show MAPE reductions exceeding 50% for hour-ahead predictions across all models, while day-ahead improvements are specific to sequence models, indicating formulation efficacy depends on model architecture and prediction horizon.
load forecastingnon-stationaritydelta reformulationlstmtransformer
Geometrical fairness in graph neural networks
The authors propose a fairness-aware adaptation of graph-based diffusion by modifying the Laplacian operator to mitigate bias propagation in graph neural networks. Their method incorporates subspace projections, spectral adjustments, and frequency-based filtering, leveraging graph diffusion's smoothing properties for principled fairness analysis. Evaluations on synthetic and real-world datasets show competitive performance with improved fairness metrics at limited computational overhead.
graph neural networksdiffusion processeslaplacian operatorfairness metricsspectral adjustments
EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning
The paper proposes EnvRL, a reinforcement learning framework that incorporates environment dynamics learning through auxiliary state prediction and inverse dynamics objectives. By jointly optimizing these with the primary RL objective, the method enables agents to internalize transition mechanisms from interaction trajectories. Experiments on ALFWorld and WebShop benchmarks show EnvRL improves success rates of Qwen-2.5-1.5B-Instruct by 4.6% and 10.2% respectively when trained with GRPO, compared to RL-only baselines.
reinforcement learningenvironment dynamicsagentic learninginverse dynamicslong-horizon tasks
Physics-Constrained Neural Networks for Improved Short-Term Weather Forecasting: A Case Study over the South Pacific
The study enhances physics-constrained neural networks (PCNNs) for short-term weather forecasting via three innovations: (1) an upgraded numerical solver combining WENO-5, beta-plane approximation, and subgrid-scale viscosity, enabling a 4× larger time step (1200 s) and 26% lower daily MSE; (2) a unified autoregressive hybrid block replacing 24 specialized modules; (3) integration with neural backbones (PI-PredFormer, PI-IAM4VP). Evaluated on WeatherBench's South Pacific subset (2000-2004), the hybrids reduce 1-12h RMSE by 8-22% versus pure neural models while improving physical consistency.
physics-constrained neural networksweno-5beta-plane approximationautoregressive hybrid blockweatherbench
Expanding SPHERE-JEPA: A Family of Statistical Regularizers for the Hypersphere
The paper introduces deterministic statistical regularizers for self-supervised learning on the unit hypersphere, addressing variance issues in sliced methods like SIGReg and SUSReg. By analytically integrating random projections, the authors derive full-dimensional objectives using Maximum Mean Discrepancy (MMD), Kernel Stein Discrepancy (KSD), and Kullback-Leibler (KL) divergence with rotationally invariant kernels (Heat and Bandlimited filters). Experiments on ImageNet and Galaxy10 show improved optimization stability, faster convergence, and performance gains, with KL divergence excelling in unclustered texture retrieval due to its fine-grained separation properties.
self-supervised learninghypersphere regularizationmaximum mean discrepancykernel stein discrepancyrotationally invariant kernels
Root-Selecting Fixed-Point Inversion for Rectified Flows via Trajectory Straightness
The paper introduces SelFix, a fixed-point inversion method for rectified flows that selects solutions based on trajectory straightness to improve inversion accuracy. By formulating inversion as a fixed-point problem and leveraging straightness as a selection criterion, SelFix addresses the variability in reconstruction and editing quality caused by multiple fixed-point solutions. Experiments on FLUX.1-dev and PIE-Bench demonstrate that SelFix outperforms prior baselines in real-image reconstruction and prompt-based editing while maintaining convergence to an exact inverse root under standard assumptions.
fixed-point inversionrectified flowstrajectory straightnessreconstruction qualityprompt-based editing
When Dynamics Models Read the Wrong Time Steps: Label-Free Event Credit Re-Anchoring for Robust Global Readouts
The paper addresses temporal credit dilution in dynamics models, where global readouts incorrectly assign credit to smooth background features rather than transient events. It introduces Credit-in-Event, a probe for measuring event-step credit allocation, and CREST, a training-free method that re-anchors pooled representations via event-versus-rest contrast. Theoretical analysis shows linear readers misallocate credit to spurious channels as event fraction decreases. Experiments on gear/impact systems and bearing vibration data demonstrate CREST's effectiveness in reducing out-of-distribution error and restoring event credit, with ablations confirming the specificity of event-core re-anchoring.
temporal credit dilutionglobal readoutsevent-step creditcontrastive re-anchoringout-of-distribution error
Reducing Learner Redundancy in Boosting via Residual Orthogonalization
The paper introduces SCBoost, a boosting framework that reduces learner redundancy via residual orthogonalization instead of sequential residual fitting. The method employs Spectral Residual Projection (SRP) to project residuals onto the orthogonal complement of historical predictions, ensuring novel error capture, and Covariance-Regularized Weighting (CRW) to optimize ensemble weights with a covariance penalty. Theoretical analysis shows SRP enables exact additive residual-energy decomposition and improves Signal-to-Noise Ratio under isotropic noise. Experiments on ten benchmarks demonstrate improved accuracy and F1 scores, highlighting the benefits of explicit redundancy control in boosting.
boostingresidual orthogonalizationspectral residual projectioncovariance regularizationensemble learning
AoiZora: Topology-Aware Auto-Parallel Optimization for Inference of Diffusion Transformers
AoiZora introduces a topology-aware auto-parallel optimization system for low-latency inference of diffusion transformers on TPU sub-slices. The method combines pre-compilation IR analysis with a topology-aware communication model to optimize physical device placement while preserving the existing compilation pipeline. Evaluated on TPU v5e sub-slices, AoiZora achieves up to 1.42x latency reduction for Wan 2.1 one-step denoising compared to baseline auto-parallel systems.
auto-parallel optimizationdiffusion transformerstpu sub-slicestopology-awarecompilation flow
SpatioTemporal Causal Network Diagnostics for Geographic Tipping Point Early Warning
The paper introduces SpatioTemporal Causal Network Diagnostics (ST-CND), a framework for geographic tipping point early warning that addresses spatial dilution, Euclidean assumptions, and correlated noise. ST-CND employs transfer entropy to infer data-driven information-flow topology, dynamic mode decomposition for local recovery rates, and identifies vulnerable subnetworks via high internal fluctuation, synchronization, and low external coupling. Validated on synthetic bifurcations and sea-surface temperature benchmarks (Indo-Pacific SST, North Atlantic AMOC), ST-CND achieves AUROC 0.783 and critical-subnetwork IoU 0.378, outperforming recurrence-network and lambda-AR1 baselines.
spatiotemporalcausal networktransfer entropydynamic mode decompositiontipping point
Continuous-time Optimal Stopping through Deep Reinforcement Learning
The paper introduces CARLOS, a reinforcement learning algorithm for continuous-time optimal stopping problems that overcomes discretization limitations in dynamic programming. The method employs an aggregate deep neural network (ADNN) to learn space-time decision boundaries, progressively refining time resolution through adaptive training near stopping boundaries. Benchmarks demonstrate CARLOS outperforms Bermudan solvers, approaching American upper bounds with higher computational efficiency than non-RL methods.
optimal stoppingreinforcement learningdeep neural networkadaptive samplingcontinuous-time
Non-negative Matrix Factorisation with Topological Regularisation
The authors propose a topological regularization framework for non-negative matrix factorisation (NMF) to learn interpretable basis functions with domain-specific structure. They address discretization and threshold sensitivity issues in naive topological methods by employing persistent homology as a stable quantifier and designing compatible topological regularizers. The unified framework demonstrates applicability across image components, time-series structures, and graph signals while maintaining optimization compatibility.
non-negative matrix factorisationtopological regularizationpersistent homologyinterpretable basisstructured domains
Public transit gains and spatially uneven travel demand changes after NYC congestion pricing
The study evaluates NYC's 2025 congestion pricing program using time series foundation models to generate probabilistic counterfactual demand forecasts, addressing the lack of clean control groups. The method employs uncertainty-calibrated models on bus, subway, and aggregate trip data. Results show significant transit ridership increases (bus + subway) and modest overall travel demand reductions, with spatially heterogeneous effects: demand reductions concentrate within the Congestion Relief Zone while transit gains extend beyond Manhattan. Socio-demographic analyses reveal uneven neighborhood adaptation patterns, highlighting spatial equity implications.
congestion pricingtime series foundation modelsprobabilistic counterfactualsspatial heterogeneitytransport demand forecasting
Domain-Validity-Gated Metamorphic Testing of Scientific ML Surrogates
The paper introduces a domain-validity-gated framework for metamorphic testing of scientific ML surrogates, addressing the oracle problem through three contributions: (i) a validity rubric assessing metamorphic relations (MRs) via numerical tolerance and precondition checks, (ii) an executable MR-card format for test assets, and (iii) a case-study protocol applied to MeshGraphNets and PhysicsNeMo surrogates. Results demonstrate MR validity discrimination across tasks (incompressible flow, Burgers' equation) and architectures, with permutation invariance holding to machine precision while symmetry and conservation relations show domain-dependent validity. The method generalizes to multiple PDE families, enabling auditable separation of model violations from out-of-domain applications.
metamorphic testingscientific machine learningoracle problemdomain validitymeshgraphnets
MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization
The authors propose MGUP, a momentum-gradient alignment update policy for stochastic optimization that enables fine-grained control of parameter updates while maintaining convergence guarantees. MGUP augments momentum-based optimizers (e.g., AdamW, Lion) by applying larger step-sizes to a fixed proportion of parameters per iteration and smaller, non-zero step-sizes to the rest. Theoretical convergence is proven for MGUP-AdamW, and experiments on MAE pretraining, LLM pretraining, and fine-tuning show improved or more stable performance compared to base optimizers.
stochastic optimizationmomentum-based optimizersselective updatesconvergence guaranteesstep-size control
Learning to Refine Hidden States for Reliable LLM Reasoning
The paper introduces ReLAR, a reinforcement-guided latent refinement framework that improves LLM reasoning stability by iteratively updating hidden states before decoding. The method employs learned depth and action controllers to adaptively determine refinement steps, trained via policy gradient optimization on step-wise likelihood improvement. Evaluations on medical, mathematical, multi-hop reasoning, and open-ended generation benchmarks demonstrate improved accuracy, generation quality, and reasoning stability with lower inference overhead compared to explicit reasoning baselines.
hidden-state refinementpolicy gradientlatent reasoningmulti-hop reasoninginference overhead
Beyond IGO-Flow: Toward Convergence Analysis of IGO in Continuous Spaces
The paper advances convergence analysis for discrete-time Information-Geometric Optimization (IGO) in continuous spaces, addressing a gap in existing theory focused on continuous-time idealizations. The authors study natural gradient updates in expectation-parameter coordinates of an exponential family, specifically analyzing IGO over multivariate Gaussians on strongly convex quadratic objectives. Key results prove covariance matrix convergence to zero and mean vector convergence to the global optimum under bounded condition number conditions, bridging theoretical IGO analysis with practical methods like CMA-ES.
information-geometric optimizationnatural gradientcovariance adaptationstrongly convexexponential family
An expressivity analysis of hierarchical modelling in deep transformers via bounded-depth grammars
The work provides a theoretical analysis of hierarchical representation in deep transformers using bounded-depth, non-recursive context-free grammars. By constructing transformers with positional attention, the authors show that model depth scales linearly with grammar depth, while neuron count grows with derivation-tree shapes and quadratically with production rules. Results demonstrate these architectures can encode grammatical states into low-dimensional, linearly separable subspaces, supporting the linear representation hypothesis.
transformershierarchical representationcontext-free grammarspositional attentionlinear representation hypothesis
When the Next Step Is Not One Step: Distribution-Aware Execution Modeling for Concurrent Go Programs
The paper introduces a distribution-aware execution modeling approach for concurrent Go programs, addressing nondeterministic scheduler behavior by training on empirical event distributions rather than single labels. The method fine-tunes a 7B parameter model using KL divergence on aggregated next-event distributions from multiple program runs. Evaluated on 798 real-world Go bug predictions, distribution training achieves 36.2% accuracy, outperforming zero-shot Gemini 3.5 Flash (34.8%) and the base model (28.6%), while reducing Expected Calibration Error from 0.205 to 0.169 compared to cross-entropy training. The work also formalizes goroutine-leak signatures for select-blocked cases.
concurrent programmingexecution modelingkl divergencegoroutine-leakcalibration error
Reconfigurable Computing Challenge: Transformer for Jet Tagging on Versal AI Engines
The work presents a reusable software framework for implementing quantized, integer-only transformer models on AMD Versal AI Engines (AIE) for low-latency jet tagging at CERN LHC. The method maps dense and multi-head attention (MHA) layers to AIE tiles, generating Vitis graph code automatically from high-level Python descriptions. This enables efficient deployment of transformer architectures in resource-constrained trigger systems, with the framework released as open-source software for future research.
transformerquantizationjet taggingversal ai enginemulti-head attention
A Bayesian Boolean Matrix Factorization with Application to Copy Number Analysis in Cancer
The authors propose Bayesian Boolean Matrix Factorization (BBMF), a fully conjugate generative model for binary matrix decomposition that enforces Boolean constraints via logical AND/OR operations while providing uncertainty quantification. BBMF uses sparsity-inducing priors and admits efficient Gibbs sampling with closed-form conditionals, addressing limitations of heuristic BooMF methods. Applied to arm-level copy-number alteration data in multiple myeloma, BBMF identifies interpretable bicliques linking patient subsets to recurrent chromosomal-arm amplifications, demonstrating superior biological interpretability over additive models for capturing discrete cancer evolution patterns.
boolean matrix factorizationbayesian inferencecopy-number variationgibbs samplingcancer genomics
Theoretical Grounding of Out-Of-Distribution Detection With Reinforcement Learning Optimizer
The paper establishes a theoretical framework for out-of-distribution (OOD) detection in dynamic environments by introducing a reinforcement learning (RL)-guided optimizer. The method augments standard gradient descent with an RL-based correction term to reduce semantic OOD false positive rates over time, addressing both future-domain generalization and semantic-OOD rejection. Theoretical analysis decomposes temporal errors into model-change and environment-change components, demonstrating improved generalization under the RL-guided optimizer compared to gradient descent alone.
out-of-distribution detectionreinforcement learninggradient descentdomain generalizationsemantic shift
Multi-Adapter PPO: A Cross-Attention Enhanced Wavelength Selection Framework for LIBS Quantitative Analysis
The paper introduces Multi-Adapter PPO, a reinforcement learning framework for wavelength selection in laser-induced breakdown spectroscopy (LIBS) quantitative analysis. The method employs cross-attention mechanisms and multiple specialized adapters to model complex spectral relationships, addressing the trade-off between prediction accuracy and feature efficiency. Evaluated on steel and coal datasets, it surpasses Particle Swarm Optimization by 28.4% in comprehensive score and 45.2% in prediction accuracy while maintaining interpretability and computational efficiency.
laser-induced breakdown spectroscopyreinforcement learningcross-attentionwavelength selectionquantitative analysis
ReRAM-aware Model Finetuning addressing I-V Non-linearity and Retention Errors
The paper introduces a finetuning-based hardware-aware training algorithm for deploying DNNs on ReRAM crossbar arrays, addressing I-V non-linearity and retention errors without full retraining. The method employs a range-shrunk sinh transformation for I-V non-linearity mitigation and incorporates retention errors via regularization loss during finetuning. Evaluations on ResNet18, DeiT-Tiny, and MobileNetV3 show <2% ImageNet accuracy drop, while SQuAD v2 QA tasks exhibit only 1-point F1-score degradation, demonstrating efficacy across vision and NLP tasks.
reramin-memory computinghardware-aware trainingi-v non-linearityretention errors
Perron--Frobenius Operator Matching for Generative Modeling
The paper introduces Perron--Frobenius Operator Matching (PFOM), a generative framework unifying density evolution via the integral PF operator, encompassing flow, diffusion, and jump models. It proves KL divergence uniquely preserves equality between density-level and sample-conditioned objectives, yielding a loss equivalent to Koopman path matching. The method incorporates Nesterov-accelerated training and sampling to stabilize discretization and accelerate convergence. Empirical validation on Gaussian mixtures and two-moons datasets demonstrates faster KL/$W_2$/MMD reduction and improved wall-clock efficiency. PFOM bridges operator-theoretic identification with generative modeling, enabling adaptive dictionaries and high-dimensional applications.
perron--frobenius operatorbregman divergenceskoopman path matchingnesterov-accelerated traininggenerative modeling
CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models
The paper introduces CheckMIABench, a principled benchmark for evaluating membership inference attacks (MIAs) on language models, addressing distribution shift issues in prior evaluations. The method leverages intermediate training checkpoints and public training data from open-source models (Pythia, OLMo; 70M-7B parameters) to create statistically valid testbeds. Results demonstrate improved evaluation rigor across six published attacks, accompanied by an open-source modular library (Pandora_LLM) for attack design and implementation.
membership inference attackslanguage modelsdistribution shiftprivacy evaluationintermediate checkpoints
ResAware: Cross-Environment Website Fingerprinting via Resource-Privileged Distillation
ResAware introduces a cross-environment website fingerprinting framework that enhances robustness through resource-privileged distillation. The method trains a teacher model on resource-level features and distills this knowledge into a student model via heterogeneous knowledge distillation, enabling inference using only encrypted traffic. Evaluated on a 160,000-sample dataset from six global vantage points over five months, ResAware improves Var-CNN's F1-score from 72.77% to 81.49% and open-world TPR@1%FPR from 22.40% to 27.20% under 150-day temporal drift.
website fingerprintingknowledge distillationcross-environment robustnessencrypted traffictemporal drift
Operator Boosting Produces Pareto-Efficient PDE Surrogates
The paper introduces Operator Boosting, a stagewise residual-learning framework for constructing compact neural-operator surrogates for PDEs without post-training compression. The method trains a sequence of small neural operators (FNOs, DeepONets, or CNOs) on residual fields, incorporating corrections via validation-selected shrinkage, starting from an empirical mean predictor. Evaluated across 30 dataset-architecture pairs from PDEBench, APEBench, and The Well, the approach achieves 72-95% parameter reduction while improving accuracy in 21 cases, with empirical Pareto improvements on 7 of 10 PDE benchmarks including 2D Navier-Stokes and 3D compressible flows.
neural operatorsresidual learningpde surrogatesoperator boostingpareto efficiency
Credibility-Weighted Pricing of Autonomous Vehicle Liability Under Operational Design Domain Shift
The paper proposes a hierarchical Bayesian credibility framework for autonomous vehicle liability pricing under operational design domain (ODD) shifts, addressing sparse data and non-stationary risk across deployments. The method learns an ODD-similarity kernel to pool risk estimates across cities, software versions, and territories, with Buhlmann-Straub credibility as a limiting case. Evaluation on 648 Waymo crashes (116 million miles) from NHTSA data shows city-aggregate credibility weights of 0.12-0.46, partial pooling outperforms no pooling, and the kernel's advantage becomes detectable at ~12 deployed cities.
hierarchical bayesianoperational design domaincredibility theorynon-stationary riskpartial pooling
Toward Controllable Catalyst Inverse Design via Large-Scale Autoregressive Pretraining
The authors present a conditional generative model for catalyst inverse design using a modified Generative Pretrained Transformer architecture with numerical embedding layers. The model, pretrained on 133M catalyst structures and fine-tuned on 460K optimized structures with property annotations, generates catalysts conditioned on both categorical (adsorbate type, composition) and continuous (binding energy) properties. Results show 98% structural validity, 95% optimization validity, 93% categorical condition fidelity, and a 4-fold binding energy match rate improvement over baseline, enabling 1.5-4× screening efficiency gains for reaction-targeted discovery.
inverse designautoregressive modelheterogeneous catalysisconditional generationproperty embedding
MorphStrata: Layer-Specific Perturbations for Generating Morphence Students in Time-Series Moving Target Defense
The paper introduces MorphStrata, a student generation strategy for time-series Moving Target Defense (MTD) that injects layer-specific stochastic noise into Transformer-based models. The method perturbs randomly selected architectural blocks to create heterogeneous student models, addressing adversarial vulnerability while minimizing computational overhead. Evaluated on Jena Climate, Electricity Load Diagrams, and Appliances Energy Prediction datasets against FGSM, BIM, and PGD attacks, MorphStrata reduces adversarial RMSE by up to 97.97% compared to baselines, with less than 1% training time increase over Morphence MTD. Results show a correlation between higher pairwise L2 distance among students and improved defense effectiveness.
moving target defensetime-series forecastingadversarial robustnesstransformerstochastic noise injection
Bounded Difference Concentration for Infinitely Exchangeable Sequences with Applications to AI Benchmark Uncertainty
The paper establishes concentration bounds for functions of infinitely exchangeable random variables by decomposing deviations into conditional sampling and latent mixture fluctuations. Using de Finetti's theorem, it shows that zero-sum linear contrasts yield mixture-free Hoeffding-type bounds, extending finite-exchangeable results to infinite cases. Applied to AI benchmarks like MMLU, the framework provides domain-stratified uncertainty quantification for accuracy scores and statistical guarantees for subset-based full-score estimation.
exchangeable sequencesconcentration inequalitiesde finetti theoremai benchmarkingsubgaussian mixtures
Generalization Guarantees for Multi-Input Neural Operator Learning in Sobolev Spaces
(No summary returned.)
A Closer Look at Failure Modes in Temporal Understanding of Large Audio-Language Models
The study investigates temporal reasoning failures in Large Audio-Language Models (LALMs) through a new benchmark of 1,657 questions across three tasks. Behavioral and causal mechanistic analyses reveal that models under-utilize audio when text is present, and attention redistribution outperforms simple scaling. Targeted attention adjustments at bottleneck layers improve accuracy from 55.9% to 59.1% without fine-tuning, suggesting modality imbalance is not the sole failure cause. The work provides the first causal analysis of LALMs' temporal reasoning limitations.
temporal reasoninglarge audio-language modelsattention redistributionmechanistic analysismodality imbalance
Memory-Efficient Meta-Reinforcement Learning for Adaptive Safety-Critical Control in Adversarial Spacecraft Proximity Operations
The paper introduces a memory-efficient meta-reinforcement learning framework for adaptive safety-critical control in adversarial spacecraft proximity operations. It evaluates three recurrent architectures (LSTM, GRU, Mamba) and two training algorithms (PPO, SAC) for tuning input-constrained control barrier functions (ICCBFs) via meta-RL. Results demonstrate that Mamba with PPO achieves superior task completion (100% safety), fuel efficiency, and robustness in both cooperative and adversarial scenarios compared to alternatives.
meta-reinforcement learningcontrol barrier functionsspacecraft proximity operationsrecurrent architecturesadversarial robustness
Amortized Probabilistic Retrieval of Atmospheric CO2 from OCO-2 Spectra Using Deep Learning with Laplace Approximations and Normalizing Flows
The study presents a deep learning framework for efficient probabilistic retrieval of atmospheric CO2 from OCO-2 spectra, addressing computational and uncertainty quantification limitations of operational methods. The approach employs a multi-branch neural network with Laplace approximations and normalizing flows, trained on a high-fidelity simulation dataset incorporating forward model errors. Results demonstrate five advantages: amortized inference (1000x faster), robustness to model errors, superior point estimate accuracy, improved uncertainty quantification, and non-Gaussian posterior modeling via normalizing flows.
oco-2laplace approximationsnormalizing flowsuncertainty quantificationxco2
Where Should Action Generation Begin? A Learnable Source Prior for Generative Robot Policies
The paper introduces LeaP, a learnable source prior for generative robot policies that replaces the standard Gaussian initialization with a proprioception-conditioned diagonal Gaussian over action chunks. Parameterized by a lightweight MLP, LeaP jointly predicts mean and state-adaptive variance while preserving the downstream generator architecture. Evaluated on 15 RoboTwin manipulation tasks, LeaP achieves 81.6% average success rate, outperforming baselines by 6.5-25.5 percentage points. The method demonstrates consistent improvements for both flow-matching and diffusion-bridge generators, with faster convergence and better real-world deployment performance.
generative policiessource priorproprioception-conditionedaction chunksstate-adaptive variance
Damage Adaptation in Seconds for Architected Materials
The paper introduces LEAP, a method for real-time damage adaptation in soft-actuated systems using architected materials, achieving adaptation in under one minute. The approach leverages low-dimensional latent damage representations and a robust ensemble method to handle unseen damage types (cuts, burns, actuator repairs) without simulation. Key results demonstrate successful adaptation in a 6DoF soft wrist based on Handed Shearing Auxetic (HSA) actuators during a tracing task, with exponential-to-linear sample complexity reduction for learned representations. This enables simulation-free, real-time proprioceptive adaptation critical for soft robotics deployment.
architected materialsproprioceptive adaptationsoft actuatorslatent representationssample complexity
Performance-Driven Environment Abstraction with Multi-Timescale Learning
The paper introduces a performance-driven environment abstraction method for Markov decision processes that optimizes decision quality rather than preserving geometric structure. The approach models abstraction via state aggregation with shared action distributions, establishing a performance guarantee separating value-function approximation error from action-sharing loss. A multi-timescale reinforcement learning framework jointly adapts policy and tree-structured abstraction, refining regions based on Q-value discrepancies. Experiments show significant state compression (exact metrics unspecified), improved sample efficiency, and faster replanning versus actor-critic baselines.
markov decision processesstate aggregationmulti-timescale learningq-value discrepanciesactor-critic
MM++: Unsupervised Scale-Invariant Multilayer OOD Detection via Top-K Gated Feature Fusion
MM++ proposes an unsupervised, post-hoc OOD detection framework that achieves scale invariance while preserving hierarchical feature expressivity. The method constructs a joint feature space by identifying discriminative intermediate layers via entropy density drops, fusing them with terminal representations using top-K gating, and stabilizing distance estimation through Ledoit-Wolf regularized covariance. This approach operates without OOD data, fine-tuning, or architectural changes, demonstrating robustness across architectures for both near- and far-OOD scenarios.
ood detectionscale-invariantfeature fusionledoit-wolf regularizationentropy density
Decision-Driven Geosteering Under Uncertainty: A Unified Framework for Sequential Decision Optimization
This work introduces an uncertainty-aware geosteering framework combining particle filtering for subsurface interpretation with reinforcement learning for sequential decision-making. The method employs particle filters to represent geological uncertainty and integrates three decision policies: Approximate Dynamic Programming, Deep Q-learning, and Dual Deep Reinforcement Learning with dueling architecture. Evaluated via industrial simulator under realistic noise and constraints, the framework demonstrates policy behavior through stability metrics and trajectory smoothness, providing operational insights beyond final placement performance.
geosteeringparticle filterreinforcement learningapproximate dynamic programmingdueling architecture
ProCUA-SFT Technical Report
ProCUA-SFT introduces a 3.1M-step dataset for supervised fine-tuning of computer-use agents, addressing negative transfer from existing resources like AgentNet (22.5K trajectories). The method employs an automated pipeline that synthesizes tasks across 2,484 application combinations using real-world content (e.g., 912 spreadsheets, 10K presentations) and verifies feasibility via binary precondition checks executed by a single VLM (Kimi-K2.5). Fine-tuning UI-TARS 7B on ProCUA-SFT improves OSWorld success rate to 45.0% (+18.7pp over base, +35pp over AgentNet), with partial integration into Nemotron 3 Nano Omni.
computer-use agentssupervised fine-tuningnegative transferprecondition checkingvlm
Tight $L_\infty$ Sample Complexity for Low-Degree and Sparse Boolean Polynomials
The paper establishes tight minimax sample complexity bounds for learning Boolean polynomial surrogates under $L_\infty$-error guarantees, crucial for black-box optimization. For degree-$d$ polynomials over $n$ variables, the complexity scales as $n^{d+1}$; for $s$-sparse Fourier-Walsh polynomials, it scales as $ns^2$, contrasting with noiseless cases ($n^d$ and $ns$). The analysis employs auxiliary norms as $L_\infty$ proxies, with lower bounds holding even for adaptive learners. Results demonstrate intrinsic noise-induced complexity factors absent in $L_2$ settings.
minimax sample complexityboolean polynomialsl_infinity-errorfourier-walshadaptive learners
Turning music identification into a neural forward pass
The paper demonstrates that music identification can be reformulated as a single neural forward pass using a generative transformer, contrasting with traditional rule-based search pipelines. The model, trained on audio data, directly predicts track identifiers from short excerpts (1s), outperforming acoustic fingerprinting baselines while reducing storage to 0.33% and improving p95 latency by 2.3x. It supports open-set operation by rejecting unseen tracks, aligning search with human associative recognition rather than database lookup.
neural forward passgenerative transformeracoustic fingerprintingopen-set operationassociative recognition
VISTA: Scale-Aware Visual Navigation via Action History Conditioning
VISTA introduces scale-aware visual navigation by conditioning Vision Navigation Foundation Models (VNMs) on normalized action histories alongside image observations, addressing deployment vulnerabilities from action normalization. The method integrates a DINOv3 encoder to enhance representations in visually repetitive environments, capturing spatial and geometric relationships. Evaluations show 100% goal prediction accuracy in zero-shot real-world deployments across Outdoor, Forest, and Office settings, with 95% average checkpoint completion, demonstrating robust generalization.
vision navigation foundation modelsaction history conditioningdinov3 encoderzero-shot deploymentscale-aware navigation
On the Memorization Behavior of LLMs in Generative Recommendation: Observations, Implications, and Training Strategies
The paper investigates memorization behavior in large language models (LLMs) for generative recommendation (GR), revealing that LLMs predominantly rely on one-hop memorization of training data sequences rather than leveraging pretrained knowledge. To address this, the authors propose IIRG, a training strategy that incorporates collaborative relations from multi-hop item co-occurrences and semantic relations among thematically similar items. Experiments demonstrate that IIRG significantly outperforms standard next-item prediction, particularly for users whose test items are not covered by one-hop transitions.
generative recommendationlarge language modelsone-hop memorizationcollaborative relationssemantic relations
Accelerated Convex Optimization via Hamiltonian Dynamics with Deterministic Integration Time
The authors present Hamiltonian dynamics-based algorithms for smooth convex optimization, achieving accelerated convergence rates. By analyzing contraction properties of averaged Hamiltonian flow trajectories rather than endpoint conditions, they extend prior work limited to quadratic objectives or probabilistic guarantees. The paper provides both continuous-time theoretical analysis and practical discrete-time implementations with optimal first-order complexity, establishing Hamiltonian dynamics as a principled approach for deterministic accelerated convex optimization.
hamiltonian dynamicsconvex optimizationaccelerated convergencefirst-order methodstrajectory contraction
Rethinking Groups in Critic-Free RLVR
The paper re-examines group-based critic-free RL methods for LLM post-training, identifying their primary role as preventing false penalties on negative samples rather than baseline estimation. It introduces negative token filtering, a single-rollout strategy that eliminates group synchronization overhead while maintaining stability. Evaluated on reasoning and agentic tasks, the method matches group-based approaches in reasoning performance and surpasses them in agentic scenarios, demonstrating improved data efficiency and flexibility with structured rollouts.
critic-free rlnegative token filteringadvantage computationgroup synchronizationstructured rollouts
From Compression to Deployment: Real-Time and Energy-Efficient FastGRNN on Ultra-Constrained Microcontrollers
The paper presents an optimized deployment pipeline for FastGRNN, a compact gated recurrent cell, on ultra-constrained microcontrollers (8-bit ATmega328P and 16-bit MSP430). The method combines low-rank factorization, iterative hard-thresholding sparsity, and Q15 quantization with activation calibration, achieving 566-byte weights and real-time 50 Hz inference (9.21-13 ms/sample). Results show macro F1=0.918 on HAPT, 100% prediction agreement with PyTorch, 30.5x speedup via LUTs on multiplier-less hardware, and 17.7 mW active power with 96.7% energy reduction.
fastgrnnquantizationmicrocontrollersparsityenergy-efficiency
Uncertainty Quantification of Engineering Structures by Polynomial Chaos Expansion and Multivariate Active Learning
The authors propose a multivariate adaptive sequential sampling method for polynomial chaos expansion surrogates to handle multiple correlated quantities of interest (QoIs) in engineering models. The method balances exploration of input space and exploitation of aggregated output variance by selecting samples from a candidate pool based on local variance contributions. Compared to Latin Hypercube Sampling, numerical experiments demonstrate improved surrogate accuracy, stability, and reliability in estimating second-order statistics for engineering applications.
polynomial chaos expansionquantities of interestadaptive samplingsurrogate modelingvariance reduction
Sum-of-Squares Degree Barriers for the Reweighted-Hinge Method in Robust Halfspace Learning: A Christoffel-Function Characterization
The paper establishes a fundamental connection between the Sum-of-Squares (SoS) degree of outlier-removal certificates and the Christoffel function in robust halfspace learning under malicious noise. It demonstrates that the maximal corruption mass undetectable by a degree-2t certificate is precisely the Christoffel function λ_{t+1}(c) of the clean marginal. Key findings include: a margin-degree tradeoff requiring Ω(log(1/ε)) SoS degree for ε-error certification, a degree-2 outlier barrier limiting breakdown rates, and a degree-2t algorithm achieving η^{1-1/2t} corruption tolerance. The analysis reveals inherent limitations of low-degree certificates via weighted-Chebyshev reductions and extremal estimates.
sum-of-squareschristoffel functionrobust learningmalicious noisehalfspace learning
Another Look at Log-PCA for Probability Measures: A Dynamical Formulation and Statistical Convergence
The paper introduces Wasserstein Tangential PCA (WT-PCA), a dynamical formulation of log-PCA for analyzing principal variations of random probability measures under Wasserstein geometry. The method linearizes principal geodesic analysis by capturing local principal modes via covariance operators at barycenters, leveraging optimal transport's parallel transport structure. The authors derive a statistical convergence rate for empirical WT-PCA in terms of 2-Wasserstein distance between population and empirical barycenter measures.
wasserstein geometryprincipal geodesic analysisoptimal transportcovariance operatorstatistical convergence
Constrained Diffusion Models with Primal-Dual Inference
The paper introduces constrained diffusion models with primal-dual inference (PDI), enabling sampling from entropy-regularized optimization problems with average constraints. PDI jointly infers the optimal primal distribution and its dual variable through a dual-conditioned score network, updating the multiplier via dual ascent during reverse diffusion. Theoretical analysis shows convergence of dual variables to a neighborhood of the optimum and bounds the effect of dual mismatch. Evaluations demonstrate PDI's effectiveness on Gaussian mixture sampling, wireless resource allocation, and portfolio management tasks.
constrained diffusion modelsprimal-dual inferencegibbs distributiondual ascentscore network
Finsler Geometry, Graph Neural Networks, and You
The paper introduces Finslerian graph neural networks (GNNs) as a nonlinear alternative to Laplacian-based architectures by modeling the Finsler Laplacian on manifold-sampled point clouds. The authors prove discrete estimates converge to the true Finsler operator as sample size increases and demonstrate its implementation as a GNN layer. Experiments show these networks accurately recover nonlinear diffusion geometry, overcoming the isotropic limitations of Laplace-Beltrami approximations.
finsler geometrygraph neural networkslaplace-beltrami operatornonlinear diffusionmanifold learning
Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems
The paper contributes a mechanically verified consistency hierarchy for multi-agent LLM systems, formalizing four concurrency anomalies (stale-generation, phantom-tool, causal-cascade, tool-effect reordering) in TLA+ and proving detectors sound/complete via 274 Verus obligations. Methodologically, it models state sharing as read-generate-write operations under deterministic-generation semantics, with runtime verification spanning pessimistic locking (L0-L1) to dependency-free prevention twins (L2-L4). Results include anomaly prevention in deployed Rust runtimes (100% effectiveness for A3/A6/A2), reproduction of ByteDance's deer-flow lost update, and LangGraph's ToolNode reordering fix via L3 sequencing.
concurrency anomaliesdeterministic-generation semanticsverus obligationsconsistency hierarchymulti-agent llm systems
Towards Fast GNN Surrogates for CO2 Migration in Complex Geological Formations
The authors propose an end-to-end graph neural surrogate for forecasting CO$_2$ plume migration in geological storage, addressing multiphase flows with sharp interfaces and convective mixing. The method reformulates the SPE11A benchmark as a graph with transmissibility-based edges and geometric attributes, using anisotropic message-passing with geometry-conditioned edge embeddings to capture directional transport. An autoregressive residual formulation models temporal evolution in latent space. Results show competitive forecasts of gas saturation and liquid-phase density, with moderate cumulative errors over extended horizons.
graph neural networksco2 storagemultiphase flowanisotropic message-passingautoregressive modeling
Agentic Discovery of Non-Canonical Antimicrobial Peptides with AMPGAN v3
The authors present AMPGAN v3, a multi-objective conditional GAN for generating antimicrobial peptides (AMPs) with non-canonical amino acids and chemical modifications, addressing limitations in existing generative models. The method employs dual discriminators for adversarial and activity-aware supervision, improving training stability and outperforming prior models on external classifiers. In vitro validation of five candidates showed two active against Gram-positive strains, with MIC 8 μg/mL against B. subtilis. The work also introduces PepCraft, a multi-agent framework for end-to-end AMP discovery, demonstrating alignment between computational prioritization and experimental outcomes.
antimicrobial peptidesconditional ganmulti-agent frameworknon-canonical amino acidsin vitro validation
Integrated Marketing Attribution: A Bayesian Framework for Privacy-Safe Granular Measurement Anchored in MMM
The paper introduces Integrated Marketing Attribution (IMA), a Bayesian framework combining Marketing Mix Modeling (MMM) with channel-specific attribution to address fragmentation between MMM's privacy-safe aggregates and Multi-Touch Attribution's (MTA) granularity. IMA leverages MMM-derived priors to estimate campaign-level effects from aggregated data, ensuring privacy compliance while maintaining consistency with MMM. The method demonstrates how hierarchical Bayesian modeling can bridge macro-level planning and micro-level optimization without relying on user-level tracking.
bayesian frameworkmarketing mix modelingmulti-touch attributionprivacy-safe measurementgranular attribution
📰 Industry Media (5)
Vercel Releases Eve: An Open-Source AI Agent Framework Where Each Agent is a Directory of Files Mapped to Capabilities
Vercel introduces Eve, an open-source AI agent framework implementing a filesystem-first architecture where agents are directories mapping capabilities to files. Each agent comprises modular components (model configuration, instructions, tools, skills) as discrete files, with automatic build-time registration. The framework provides six production-ready features: durable execution via Workflow SDK checkpointing, sandboxed compute, human-in-the-loop approvals, secure API connections, multi-channel integration, and OpenTelemetry tracing. Vercel reports internal deployment of 100+ agents, including a data analyst handling 30,000 monthly queries with permission-aware SQL execution. Agents scaffold via CLI and deploy unchanged to Vercel's infrastructure.
agent frameworkfilesystem-firstdurable executionsandboxed computeopentelemetry
MiniMax Sparse Attention (MSA): a Two-Branch Block-Sparse Attention Trained on a 109B-Parameter MoE With a 3T-Token Budget
MiniMax Sparse Attention (MSA) introduces a two-branch block-sparse attention mechanism to address the quadratic cost of softmax attention in long-context scenarios. The method employs an Index Branch for block selection (default size 128 tokens) and a Main Branch for exact attention over selected blocks, maintaining a fixed per-query budget of 2,048 key-value tokens. Trained on a 109B-parameter Mixture-of-Experts model with 3T tokens, MSA achieves competitive performance on benchmarks (MMLU 67.2, GSM8K 77.7) while reducing compute costs by 28.4× at 1M context. The open-source kernel (fmha_sm100) targets NVIDIA SM100 GPUs with 14.2× prefill speedups.
sparse attentionmixture-of-expertsgrouped query attentionkv-cachelong-context
OpenAI’s Deployment Simulation Extends Pre-Deployment Risk Assessment to Agentic Coding Through Simulated Tool Calls
OpenAI introduces Deployment Simulation, a pre-deployment safety method that replays past conversations through candidate models to forecast real-world behavior. The technique samples representative usage distributions, reducing evaluation awareness and selection bias while scaling with compute rather than manual effort. Results from GPT-5-series deployments show a median 1.5x multiplicative error rate in predicting undesirable behaviors, with simulated traffic achieving near-indistinguishability from real deployment (5.1% vs 5.4% evaluation detection). The method extends to agentic coding via tool-call simulation, demonstrating 49.5% discriminator win rate against real samples.
deployment simulationevaluation awarenessagentic codingmultiplicative errortool-call simulation
How to Build Memory-Efficient Transformers with xFormers Using Packed Sequences, GQA, ALiBi, SwiGLU, and Causal Attention
The tutorial demonstrates how to implement memory-efficient Transformer models using xFormers, a GPU-optimized toolkit. Key techniques include packed sequences for variable-length inputs, grouped-query attention (GQA) to reduce KV-cache overhead, ALiBi positional biases, SwiGLU feed-forward layers, and causal attention with implicit masking. Benchmarks show xFormers achieves linear memory growth with sequence length (vs quadratic for naive attention) while maintaining numerical equivalence. A 3-layer GPT model trained with mixed precision achieves 100% next-token accuracy on synthetic data, validating the approach.
xformersgrouped-query attentionalibiswiglukv-cache
Google Cloud generative AI automates council planning operations
Google Cloud's generative AI tools, including the 'Extract' application and 'Augmented Planning Decisions' (APD) prototype, automate municipal planning operations to address administrative backlogs in UK councils. The system employs Gemini foundation models hosted on Google Cloud to parse unstructured PDFs, cross-reference zoning laws, and generate draft reports while maintaining data sovereignty through enterprise-grade security controls. Initial trials across 20+ councils show 255 annual hours saved per authority in manual data entry, with APD projected to reduce decision timelines by 50% and scale to all 300+ English councils by 2027.
gemini foundation modelsunstructured data parsingenterprise-grade securitymunicipal planning automationinference scalability
Generated automatically at 2026-06-17 21:30 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.
