Daily Digest — 2026-05-21
338 items · 7 research labs, 323 arxiv papers, 8 industry media
🏛️ Research Labs (7)
An OpenAI model has disproved a central conjecture in discrete geometry
An OpenAI general-purpose reasoning model autonomously disproved Erdős's 1946 planar unit distance conjecture in discrete geometry, demonstrating polynomial improvement over the previously best-known square grid constructions. The proof leverages unexpected connections to algebraic number theory, specifically infinite class field towers and Golod–Shafarevich theory, to construct point configurations with Ω(n^(1+c)) unit-distance pairs for some c > 0. External mathematicians verified the result, noting its significance as the first AI resolution of a central open problem and its introduction of deep number-theoretic techniques to geometric combinatorics.
planar unit distance problem · algebraic number theory · golod–shafarevich theory · combinatorial geometry · infinite class field towers
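To make the counted quantity concrete, here is a minimal brute-force sketch that counts point pairs at a chosen distance in a square grid, the classical construction the summary above compares against. It does not reproduce the new construction or its proof. The helper names `square_grid` and `count_pairs_at_distance` are hypothetical.

```python
from itertools import combinations


def square_grid(side):
    """Return the side x side integer grid as a list of (x, y) points."""
    return [(x, y) for x in range(side) for y in range(side)]


def count_pairs_at_distance(points, squared_distance):
    """Count unordered point pairs whose squared Euclidean distance matches.

    Working with squared integer distances avoids floating-point comparisons.
    """
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 == squared_distance
    )


if __name__ == "__main__":
    grid = square_grid(6)
    print(count_pairs_at_distance(grid, 1))   # axis-aligned neighbours
    print(count_pairs_at_distance(grid, 25))  # 25 = 0^2 + 5^2 = 3^2 + 4^2
```

Rescaling the grid so that the chosen distance becomes 1 turns any such count into a unit-distance count. Grid constructions gain from picking a distance whose square can be written as a sum of two squares in many ways.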
How Ramp engineers accelerate code review with Codex
Ramp engineers leverage OpenAI's Codex with GPT-5.5 to accelerate code review and develop agentic tooling, reducing pull request feedback time from hours to minutes. The system employs advanced reasoning capabilities to analyze codebases with thoroughness exceeding human reviewers, while offering both CLI and GUI interfaces for developer flexibility. Results include a 4x productivity boost in code reviews and successful deployment of an On-Call Assistant agent for incident management, with engineers reporting increased confidence in shipped improvements.
codex · gpt-5.5 · agentic tooling · pull request · on-call assistant
The next phase of OpenAI’s Education for Countries
OpenAI expands its Education for Countries initiative, integrating AI tools like ChatGPT and Codex into national education systems to enhance learning outcomes and economic opportunities. The program focuses on research-driven deployment, localized AI tools, and teacher training, partnering with countries such as Estonia, Jordan, Greece, Kazakhstan, Slovakia, and Singapore. Early results include Estonia's ChatGPT Edu reaching 20,000 students and 4,600 teachers, Jordan's Siraj engaging 1 million students, and Kazakhstan's 84,000 educators completing AI-readiness training. OpenAI collaborates with governments and educators to measure impact, share findings, and scale effective practices.
chatgpt · codex · ai literacy · research-driven deployment · teacher training
Introducing OpenAI for Singapore
OpenAI announces a S$300M partnership with Singapore's Ministry of Digital Development and Information (MDDI) to establish an Applied AI Lab, creating 200+ technical roles and focusing on three key areas: frontier AI deployment, local AI talent development, and broad economic AI adoption. The initiative includes an Applied AI Lab (OpenAI's first outside the US), Forward-Deployed Engineer training, and collaborations with education/government agencies on AI-enabled learning tools. The program targets public service, finance, healthcare, and digital infrastructure sectors while supporting startups and SMEs through accelerator programs and workshops.
applied ai lab · forward-deployed engineers · ai-enabled learning · frontier ai deployment · codex
We’re announcing new community investments in Missouri.
Google announced new infrastructure investments in Missouri, including a data center in Montgomery County and a $20 million Energy Impact Fund to reduce utility costs. The initiative involves a Capacity Commitment Framework with Ameren to develop over 500 megawatts of additional capacity. The project is expected to generate nine local jobs per direct position and includes workforce training programs, such as partnerships with the Construction Laborers and Contractors Joint Training Fund of Eastern Missouri, to prepare thousands for skilled roles.
data center · capacity commitment framework · energy impact fund · workforce training · megawatts
100 things we announced at I/O 2026
Google announced Gemini 3.5 Flash, a high-speed multimodal model achieving 76.2% on Terminal-Bench 2.1 and 83.6% on MCP Atlas, optimized for agentic tasks. Gemini Omni introduced video generation with physics-aware synthesis and SynthID watermarking. AI Search was upgraded to Gemini 3.5 Flash, featuring generative UI via the Antigravity platform for dynamic interface creation. Universal Cart leverages Gemini for cross-platform shopping with UCP checkout. Personal agent Gemini Spark, built on Antigravity, launched in beta for autonomous task execution with user oversight.
gemini 3.5 flash · synthid watermarking · generative ui · universal commerce protocol · agentic tasks
A new experiment brings better group meetings to Google Beam
Google Beam introduces an experimental feature leveraging HP Dimension's immersive display to enhance group meeting inclusivity by rendering remote participants at true-to-life proportions with spatial audio. This optimization automatically adjusts participant positioning and audio anchoring, simulating an in-room experience across devices. Initial research indicates a 50% improvement in social connection and a 21% increase in conversational contribution. The integration extends compatibility with Google Workspace and Zoom, aiming to bridge the hybrid inclusion gap in video conferencing.
immersive display · spatial audio · hybrid inclusion gap · true-to-life proportions · video conferencing
📜 arXiv Papers (323)
Atoms of Thought: Universal EEG Representation Learning with Microstates
The paper proposes microstates as a universal EEG representation, demonstrating superior performance over traditional time- and frequency-domain features. The method clusters continuous EEG signals into discrete microstate sequences using a tokenizer trained on a large medical dataset, then applies this representation to downstream tasks including sleep staging, emotion recognition, and motor imagery classification. Experiments show improved accuracy across tasks, with additional benefits in interpretability and scalability for cognitive neuroscience and clinical applications.
eeg representation · microstates · brain-computer interfaces · universal tokenizer · neuroinformatics
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents
The paper introduces the stochastic-deterministic boundary (SDB) as a foundational architectural primitive for production LLM agents, defining it as a four-part contract governing LLM output integration into system actions. It organizes runtime design into Coordination, State, and Control, presenting six runtime patterns adapted from distributed systems for conversational, autonomous, and long-horizon agents. A five-step methodology is proposed for pattern selection, alongside a diagnostic procedure for failure analysis and identification of replay divergence. The reliability decomposition separates model variance from architectural momentum, emphasizing SDB strength as model variance decreases. The methodology is applied to five workloads, with a reference implementation for a contract-renewal agent.
stochastic-deterministic boundary · runtime patterns · replay divergence · architectural momentum · production llm agents
Long-term Power Grid Planning via Answer Set Programming
The paper introduces the first Answer Set Programming (ASP)-based method for automated long-term power grid planning, addressing sustainability targets and demand patterns. ASP elegantly encodes topological and combinatorial invariants that are cumbersome in traditional planning languages. Experimental evaluations on synthetic and real-world grid data demonstrate the approach's expressive power and effectiveness in maintaining supply continuity and service quality over decade-long developments.
answer set programming · power grid planning · topological invariants · combinatorial optimization · sustainability targets
HaorFloodAlert: Deseasonalized ML Ensemble for 72-Hour Flood Prediction in Bangladesh Haor Wetlands
HaorFloodAlert introduces a deseasonalized machine learning ensemble for 72-hour flood prediction in Bangladesh's Sunamganj Haor wetlands, addressing limitations of riverine flood models. The ensemble combines Random Forest (0.5625) and XGBoost (0.4375), leveraging upstream Barak River Sentinel-1 SAR data with Otsu-thresholded change detection for validation (84-91% spatial match). It achieves 89.6% LOOCV accuracy, 87.5% recall, and 0.943 AUC-ROC on 77 Sentinel-1 events, while mitigating temperature-induced seasonal bias that inflated accuracy by 6.9 pp. The system includes a three-tier alert pipeline and a BRRI-calibrated boro rice damage estimator.
deseasonalized · ensemble · otsu-thresholded · sentinel-1 · auc-roc
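The two-model blend in the summary above can be sketched as a weighted average. This is a minimal illustration that assumes the quoted weights (0.5625 and 0.4375) apply to each model's predicted flood probability, i.e. soft voting. The function names and the 0.5 alert threshold are hypothetical and not taken from the paper.

```python
# Ensemble weights as quoted in the summary; they sum to 1.0.
RF_WEIGHT = 0.5625
XGB_WEIGHT = 0.4375


def ensemble_probability(p_rf, p_xgb):
    """Blend Random Forest and XGBoost flood probabilities (assumed soft voting)."""
    return RF_WEIGHT * p_rf + XGB_WEIGHT * p_xgb


def flood_alert(p_rf, p_xgb, threshold=0.5):
    """Return True when the blended probability reaches a hypothetical threshold."""
    return ensemble_probability(p_rf, p_xgb) >= threshold


if __name__ == "__main__":
    # The two models disagree; the blend leans toward the Random Forest.
    print(ensemble_probability(0.8, 0.4))
    print(flood_alert(0.8, 0.4))
```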
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
The paper introduces POW3R, a policy-aware rubric reward framework for reinforcement learning with verifiable rewards. POW3R dynamically adapts criterion-level reward weights during training, emphasizing criteria that distinguish rollouts while preserving human-assigned importance and category balance. This approach improves upon static aggregation methods, which conflate importance with optimization signal usefulness. Evaluated across three base policies on two datasets, POW3R outperforms vanilla GRPO in 24 of 30 comparisons, achieving higher mean rubric reward and strict completion rates while converging 2.5–4× faster.
rubric rewards · reinforcement learning · policy-aware · dynamic weighting · grpo
Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models
The study evaluates visual attribution methods in Large Vision Language Models (LVLMs) for chest X-ray reasoning, revealing their frequent failure to identify evidence used by models. A causal evaluation framework is developed, verifying expert-annotated regions via counterfactual editing. MedFocus, a novel concept-based attribution method, outperforms 11 existing methods by localizing anatomical regions through unbalanced optimal transport and measuring causal effects via targeted interventions. Results across six LVLMs and two output modes demonstrate MedFocus's superiority in producing spatial, concept-level, and token-level attributions.
lvlms · visual attribution · chest x-ray · optimal transport · causal evaluation
Less Back-and-Forth: A Comparative Study of Structured Prompting
This study evaluates structured prompting techniques for improving LLM response quality and efficiency. Comparing raw, checklist-improved, and clarifying-question prompts across ChatGPT, Claude, and Grok on summarization, planning, explanation, and coding tasks, checklist prompts achieved the highest mean rubric score (7.50/8) with superior quality-effort tradeoffs. Results demonstrate that structured prompting reduces interaction overhead while enhancing output quality.
llms · prompt engineering · checklist prompts · response quality · interaction efficiency
Beyond Prediction Accuracy: Target-Space Recovery Profiles for Evaluating Model-Brain Alignment
The study introduces a framework for evaluating model-brain alignment beyond prediction accuracy by identifying which reproducible response dimensions of the target brain are recovered. Using repeated fMRI measurements from the Natural Scenes Dataset (eight subjects viewing natural images), the method first identifies reproducible dimensions in early-to-intermediate visual cortex responses, then quantifies their recovery via brain-to-brain or model-to-brain predictions. Results reveal that pretrained and randomly initialized models can achieve similar accuracy while differing in dimension recovery profiles, demonstrating that scalar metrics mask mismatches. The framework provides diagnostic evaluation of artificial vision models against human visual cortex.
model-brain alignment · fmri reproducibility · response dimensions · visual cortex · natural scenes dataset
Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem
This paper presents a Lean 4 formalization case study using the Aristotle API for AI-assisted theorem proving, focusing on the Grasshopper problem (IMO 2009 Problem 6). The method involves generating a generalized Lean theorem with four verified helper lemmas addressing local components of a maximality and adjacent-swap exchange strategy, while leaving the main theorem unresolved due to an unverified global counting step. Results demonstrate that local proof search successfully verified specific components but failed to address the global combinatorial bookkeeping required for the theorem. The study highlights a central limitation in AI-assisted formalization and provides a reproducible Lean artifact with precise analysis.
lean 4 · theorem proving · aristotle api · formalization · grasshopper problem
Toto 2.0: Time Series Forecasting Enters the Scaling Era
The study introduces Toto 2.0, a family of five open-weights time series forecasting models demonstrating reliable quality improvements across 4M to 2.5B parameters. The models employ a unified training recipe and achieve state-of-the-art performance on three benchmarks: BOOM, GIFT-Eval, and TIME. Key contributions include architectural design, training data selection, and the u-muP hyperparameter transfer pipeline. Results validate the scalability of time series foundation models, with all checkpoints released under Apache 2.0.
time series forecasting · foundation models · scaling laws · hyperparameter transfer · open-weights
k-Inductive Neural Barrier Certificates for Unknown Nonlinear Dynamics
The paper introduces k-inductive neural barrier certificates (k-NBCs) for safety verification of (partially) unknown nonlinear systems, relaxing conventional constraints by allowing temporary increases in barrier function values while ensuring overall safety. The method leverages neural networks for scalable design and employs counterexample-guided inductive synthesis (CEGIS) with satisfiability modulo theories (SMT) for verification, utilizing Willems et al.'s fundamental lemma to construct data-driven system representations without requiring full dynamics knowledge. This approach removes restrictions on barrier certificate function classes, enhancing flexibility. Validation on three nonlinear case studies demonstrates efficacy in handling unknown dynamics.
k-inductive · barrier certificates · counterexample-guided inductive synthesis · satisfiability modulo theories · nonlinear dynamics
Beyond Isotropy in JEPAs: Hamiltonian Geometry and Symplectic Prediction
The paper demonstrates that isotropic Gaussian regularization in Joint-Embedding Predictive Architectures (JEPAs) is suboptimal for structured downstream geometries, showing that any fixed marginal target can be misaligned with certain geometries. It proposes HamJEPA, which encodes views as phase-space states (q,p) and predicts transitions via a learned Hamiltonian leapfrog map, incorporating non-isotropic scale and spectral floors to prevent collapse. HamJEPA outperforms SIGReg on CIFAR-100 (+4.89 kNN@20, +3.52 linear-probe at 30 epochs; +6.45 kNN@20, +10.64 linear-probe at 80 epochs) and ImageNet-100 (+4.82 kNN@20, +7.52 linear-probe at 45 epochs), with ablation confirming the symplectic coupling's role in improving neighborhood geometry.
jepa · hamiltonian geometry · symplectic prediction · isotropic regularization · phase-space encoding
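For readers unfamiliar with the leapfrog map mentioned in the summary above, here is a minimal sketch of one leapfrog step on a phase-space state (q, p). It assumes a separable Hamiltonian H(q, p) = U(q) + p²/2 with a fixed quadratic potential. HamJEPA learns its Hamiltonian, which this sketch does not attempt, and the function names are hypothetical.

```python
def leapfrog_step(q, p, grad_potential, step_size):
    """One kick-drift-kick leapfrog step for H(q, p) = U(q) + 0.5 * p**2."""
    p_half = p - 0.5 * step_size * grad_potential(q)          # half kick
    q_new = q + step_size * p_half                            # full drift
    p_new = p_half - 0.5 * step_size * grad_potential(q_new)  # half kick
    return q_new, p_new


def harmonic_grad(q):
    """Gradient of the illustrative potential U(q) = 0.5 * q**2."""
    return q


def energy(q, p):
    """Total energy for the illustrative harmonic oscillator."""
    return 0.5 * q * q + 0.5 * p * p


if __name__ == "__main__":
    q, p = 1.0, 0.0
    for _ in range(1000):
        q, p = leapfrog_step(q, p, harmonic_grad, 0.1)
    # Energy stays close to its initial value of 0.5 over many steps.
    print(energy(q, p))
```

The step is time-reversible and volume-preserving (symplectic), which is why energy drift stays bounded instead of accumulating.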
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
Graft introduces a hybrid tree construction framework for speculative decoding that breaks the Pareto tradeoff between dense and pruned drafting by coupling pruning and retrieval as mutually reinforcing operations. The method employs a sequential 'prune-then-graft' mechanism, where pruning frees computational budget for retrieval, and retrieval compensates for pruning-induced coverage loss by attaching predictive tokens with near-zero overhead. Evaluations demonstrate that Graft establishes a new Pareto frontier, achieving up to 5.41× speedup on short-context benchmarks and improving average speedup over EAGLE-3 by up to 21.8% on Qwen3-235B. The framework is training-free, lossless, and extensible to block drafting paradigms.
speculative decoding · pruning · retrieval · pareto frontier · autoregressive
Neurosymbolic Learning for Inference-Time Argumentation
The paper introduces inference-time argumentation (ITA), a neurosymbolic framework for ternary claim verification (true/false/uncertain) that combines formal argumentation semantics with LLM training. ITA guides LLMs to generate arguments and assign base scores, which are then used to compute deterministic, inspectable predictions. The method ensures faithfulness by constructing verdicts directly from argumentative structures rather than post-hoc rationales. Evaluated on two ternary claim verification datasets, ITA outperforms argumentative baselines and competes with non-argumentative direct-prediction approaches while providing transparent reasoning traces.
neurosymbolic learning · inference-time argumentation · ternary claim verification · formal argumentation semantics · llm training
INSHAPE: Instance-Level Shapelets for Interpretable Time-Series Classification
INSHAPE introduces instance-level shapelets for interpretable time-series classification, addressing limitations of population-level approaches by identifying discriminative temporal patterns specific to each time series. The method models non-overlapping segments with temporal dependencies and aggregates instance-level shapelets into prototypical patterns for global interpretability. Evaluated on 128 UCR and 30 UEA datasets, INSHAPE outperforms state-of-the-art shapelet-based methods while enhancing interpretability.
shapelets · time-series classification · interpretability · temporal patterns · instance-level
ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions
The paper introduces ThoughtTrace, the first large-scale dataset pairing real-world multi-turn human-AI conversations with users' self-reported thoughts, comprising 1,058 users, 2,155 conversations, and 10,174 thought annotations across 20 language models. Analysis reveals thoughts are semantically distinct from messages, difficult for LLMs to infer, and vary by conversation stage. Downstream applications show thoughts improve user-behavior prediction as inference-time context and provide fine-grained alignment signals for training personalized assistants. ThoughtTrace establishes user thoughts as a new data modality for studying cognitive dynamics in human-AI interaction.
thought annotation · multi-turn conversation · user-behavior prediction · alignment signals · cognitive dynamics
What Do Evolutionary Coding Agents Evolve?
The paper introduces EvoTrace, a dataset of evolutionary coding traces across four frameworks and 16 tasks, to analyze what evolutionary coding agents actually evolve beyond final benchmark scores. Using EvoReplay, a replay-based methodology with controlled interventions, the authors annotate code edits into nine recurring types and find that most score gains stem from a small subset of edits. Key results include a deterministic cycling pattern where 30% of added code lines are byte-identical re-introductions, revealing that benchmark improvements often arise from mechanisms other than novel algorithmic structure.
evolutionary coding · llm-as-judge · edit types · algorithmic structure · benchmark evaluation
BalanceRAG: Joint Risk Calibration for Cascaded Retrieval-Augmented Generation
BalanceRAG introduces joint risk calibration for cascaded retrieval-augmented generation (RAG) systems, addressing conservative stage-by-stage thresholding by certifying threshold pairs at target risk levels. The method frames threshold pairs as operating points on a two-dimensional lattice, using sequential graphical testing to identify safe points while controlling system-level error rates. Experiments on three open-domain QA benchmarks with multiple LLM backbones show that BalanceRAG meets risk targets, increases coverage and correct examples, and reduces unnecessary retrieval calls compared to always-on RAG.
retrieval-augmented generation · risk calibration · sequential graphical testing · uncertainty thresholding · open-domain qa
VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving
VL-DPO introduces a vision-language-guided framework for aligning ego-vehicle motion forecasting models with human preferences in autonomous driving. The method leverages a vision-language model (VLM) as a zero-shot reasoner to generate preference pairs from pretrained model rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). Evaluated on the Waymo Open End-to-End Driving Dataset (WOD-E2E), VL-DPO achieves an 11.94% increase in rater feedback score (RFS) and a 10.01% reduction in average displacement error (ADE) compared to the pretrained model, demonstrating the VLM's effectiveness as a proxy for human preference.
vision-language model · direct preference optimization · ego-vehicle motion forecasting · zero-shot reasoner · average displacement error
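As background for the finetuning step in the summary above, here is a minimal sketch of the standard DPO loss for a single preference pair. How VL-DPO scores trajectories or sets its hyperparameters is not stated in the summary, so the argument names and the default `beta` are assumptions chosen for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is a log-probability of a full output under either the
    policy being finetuned or the frozen reference (pretrained) model.
    """
    # Margin: how much more the policy prefers the chosen output than the
    # reference model does, relative to the rejected output.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)) rewritten as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy already favours the chosen output: loss drops below log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```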
Probability-Conserving Flow Guidance
The paper introduces Adaptive Manifold Guidance (AdaMaG), a probability-conserving flow guidance method for diffusion and flow-based generative models. Analyzing guidance through the continuity equation, the authors decompose its effect into divergence and score-parallel terms, proving the former blows up near the data manifold. AdaMaG schedules time-dependent attenuation to bound both terms without extra inference cost. Empirical results on image generation benchmarks demonstrate improved realism, reduced hallucinations, and controlled desaturation under high guidance.
adaptive manifold guidance · continuity equation · divergence term · score-parallel term · desaturation
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning
CopT introduces a reversed reasoning pipeline for large language models (LLMs) that first generates draft answers before performing on-policy thinking for reflection, addressing performative reasoning in chain-of-thought (CoT) approaches. The method uses continuous embeddings as contrastive verifiers, computing sequence-level reverse KL estimators to assess answer reliability, with dynamic KL control for draft-answer visibility during corrective thinking. Evaluations on mathematics, coding, and agentic reasoning tasks show accuracy improvements up to 23% and token reductions up to 57% without additional training.
chain-of-thought · on-policy thinking · contrastive verification · reverse kl estimator · performative reasoning
Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving
The study reveals a counterintuitive phenomenon where embodied LLM agents achieve higher task success rates with lower-fidelity RGB observations compared to ground-truth symbolic inputs on the Lockbox sequential puzzle task. Through physical robotic experiments and controlled simulations with randomized action-outcome flips, researchers demonstrate that moderate perceptual noise (40% flip probability) yields a 2.85× performance improvement by reducing repetitive action loops. The findings challenge conventional evaluation metrics, showing that success rates may reflect error-reasoning interactions rather than robust problem-solving capabilities.
embodied llms · observation fidelity · sequential puzzle · perceptual noise · action-outcome flips
Towards LLM-Assisted Architecture Recovery for Real-World ROS 2 Systems: An Agent-Based Multi-Level Approach to Hierarchical Structural Architecture Reconstruction
The paper enhances LLM-assisted architecture recovery for ROS 2 systems via (1) refined prompting for consistent synthesis and (2) a multi-level staged strategy using intermediate representations (node lists, launch dependencies) to enable hierarchical reconstruction. Evaluated on a robotic disassembly system with complex integration, the method improves structural consistency and scalability over prior work, though challenges remain in handling dynamic semantics. Results demonstrate robustness in real-world settings with heterogeneous artifacts.
ros 2 · architecture recovery · llm-assisted · hierarchical reconstruction · multi-level representation
PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling
PromptRad introduces a knowledge-enhanced multi-label prompt-tuning method for low-resource radiology report labeling, addressing limitations of rule-based systems and data-intensive fine-tuning. The approach reformulates multi-label classification as masked language modeling, integrates UMLS Metathesaurus synonyms via a multi-word verbalizer, and fine-tunes PLMs without additional classification layers. Evaluated on liver CT reports, PromptRad outperforms dictionary-based and fine-tuning baselines with only 32 training examples and matches GPT-4 performance despite smaller model size, demonstrating superior negation pattern handling.
prompt-tuning · multi-label classification · umls metathesaurus · masked language modeling · low-resource learning
Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study
The study investigates whether code cleanliness affects autonomous coding agents' performance through a minimal-pair evaluation protocol. Researchers constructed repository pairs identical in architecture and behavior but differing in static-analysis violations and cognitive complexity, then tested Claude Code on 33 tasks across six pairs. Results show no significant difference in task pass rates (660 trials), but cleaner code reduced token usage by 7-8% and file revisitations by 34%, indicating cleaner code improves operational efficiency without altering success rates.
autonomous coding agents · minimal-pair evaluation · static-analysis violations · cognitive complexity · token efficiency
When Critics Disagree: Adaptive Reward Poisoning Attacks in RIS-Aided Wireless Control System
The paper introduces Disagreement-Guided Reward Poisoning (DGRP), an adaptive attack targeting Soft Actor-Critic (SAC) agents in RIS-assisted Cognitive Radio Networks. DGRP exploits high-disagreement states between SAC's dual critics to corrupt rewards, distort value estimations, and drive suboptimal policy decisions. Experiments show DGRP degrades RIS performance gains and transmission quality more effectively than periodic-timing or exploration-triggered baselines, emphasizing the need for disagreement-aware robustness evaluation in DRL-based wireless control systems.
reward-poisoning attacks · soft actor-critic · reconfigurable intelligent surfaces · cognitive radio networks · deep reinforcement learning
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
AutoResearchClaw introduces a multi-agent autonomous research pipeline that addresses limitations of existing single-agent systems through five key mechanisms: structured multi-agent debate, self-healing execution with Pivot/Refine loops, verifiable result reporting, human-in-the-loop collaboration (seven intervention modes), and cross-run evolution. The system outperforms AI Scientist v2 by 54.7% on ARC-Bench (25-topic benchmark), with targeted human intervention proving more effective than full autonomy or exhaustive oversight. Results demonstrate how iterative failure handling and experience accumulation can enhance automated scientific discovery while preserving human judgment.
multi-agent debate · self-healing executor · human-in-the-loop · cross-run evolution · verifiable reporting
When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity
This work demonstrates that procedural knowledge packages (Skills) provide minimal benefit in offensive cybersecurity tasks, contrasting with their average 16.2 percentage point improvement across other domains. Through a 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag agent across four documentation conditions (55 to 4,147 lines), the analysis reveals only an 8.9 percentage point spread between no-Skills and full-Skills conditions, a difference that is not statistically significant (p = 0.71, χ²; p = 0.25, Cochran–Armitage trend test). The authors attribute this to high environment-feedback bandwidth, where schema-validated, low-latency tool observations render Skills redundant or even detrimental in timing side-channel scenarios. A falsifiable hypothesis and replication pipeline are proposed.
procedural knowledge · environment-feedback bandwidth · schema-validated · low-latency · side-channel
Training Neural Networks with Optimal Double-Bayesian Learning
The paper introduces a double-Bayesian probabilistic framework for determining optimal learning rates in stochastic gradient descent (SGD). The method extends classical Bayesian statistics into two antagonistic Bayesian processes, deriving a theoretically optimal learning rate. Experiments on classification, segmentation, and detection tasks validate the framework's effectiveness in improving model training and performance. The approach addresses the empirical challenges of hyperparameter selection in neural network optimization.
stochastic gradient descent · bayesian statistics · learning rate · hyperparameter optimization · neural network training
GeoX: Mastering Geospatial Reasoning Through Self-Play and Verifiable Rewards
GeoX introduces a self-play framework for geospatial reasoning that learns spatial logic through executable programs with verifiable rewards, eliminating the need for large-scale human annotations. The method employs a multimodal policy to propose and solve spatial problems via abduction, deduction, and induction, using spatial primitives and an image understanding tool. Reinforcement learning optimizes the policy using program-execution rewards. GeoX improves base VLMs by up to 5.5 points, matching or surpassing conventional baselines trained on curated data. The work also releases a self-play-derived benchmark for geospatial understanding.
geospatial reasoning · self-play · executable programs · verifiable rewards · multimodal policy
LLM Benchmark Datasets Should Be Contamination-Resistant
The paper advocates for contamination-resistant benchmark datasets to ensure reliable evaluation of large language models (LLMs), given widespread contamination in current benchmarks. It proposes leveraging the asymmetry between inference and training pipelines in Transformers to create unlearnable yet inference-supportive datasets, alongside mathematical advancements for cross-architecture interoperability. The authors call for community action to develop contamination-resistant methodologies, supporting tools, and their integration into evaluation pipelines.
benchmark contamination · transformer architecture · inference pipeline · unlearnable datasets · llm evaluation
A Case for Agentic Tuning: From Documentation to Action in PostgreSQL
The paper introduces PerfEvolve, a method for dynamic database tuning that replaces static documentation with LLM-based agentic skills. The system addresses three limitations of traditional tuning guides (version staleness, workload heterogeneity, and parameter dependencies) by implementing version-consistency verification, workload-specific profiling, and multi-parameter joint optimization. Evaluated on PostgreSQL under TPC-C and TPC-H benchmarks, PerfEvolve achieves up to 35.2% better performance than documentation-driven baselines.
postgresql · llm-based agents · parameter tuning · workload profiling · joint optimization
Learning with Foresight: Enhancing Neural Routing Policy via Multi-Node Lookahead Prediction
The paper introduces Multi-node Lookahead Prediction (MnLP), a training strategy enhancing neural routing policies by enabling multi-step node prediction during training without inference overhead. MnLP employs causal and discardable modules for auxiliary supervision, improving long-horizon planning. Experiments demonstrate MnLP's superior generalization across problem sizes, distributions, and benchmarks compared to next-node prediction baselines, while maintaining architectural flexibility.
neural routing policy · multi-node lookahead · long-horizon planning · auxiliary supervision · inference efficiency
Block-Sphere Vector Quantization
The paper presents a unified theoretical comparison of rotation-based vector quantizers (EDEN, RabitQ, TurboQuant) and introduces Block-Sphere Quantization (BlockQuant). Analysis reveals method-dependent advantages: EDEN/TurboQuant excel in MSE distortion, EDEN in inner-product distortion, and RabitQ in high-probability control. BlockQuant employs spherical geometry to quantize vector blocks, outperforming baselines in reconstruction MSE and inner-product distortion theoretically and empirically on embedding datasets and LLM inference tasks.
vector quantizationrotation-based quantizersmse distortioninner-product distortionblock-spherical geometry
Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes
The paper introduces CPD Online (CPD), a model-agnostic method for detecting optimization-based adversarial prompts in LLMs by analyzing sequential entropy changes. The approach formulates detection as an online change-point problem, standardizing token-level entropies against a system-prompt baseline and applying a one-sided CUSUM statistic. Evaluated on 1,012 attacks (GCG, AutoDAN, etc.) and benign prompts across six models (LLaMA-2, Vicuna, Qwen2.5), CPD achieves AUROC 0.88 and F1 0.82 on LLaMA-2-7B, outperforms windowed-perplexity baselines, and localizes 79.6% of triggers within adversarial suffixes. As a gate for LLaMA Guard, it reduces guard calls by 17-22% while maintaining detection quality.
adversarial promptschange-point detectiontoken-level entropycusum statisticllm jailbreaking
World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks
The paper introduces World-Ego Modeling (WEM), a novel paradigm for long-horizon embodied tasks that disentangles world and ego dynamics through motion-, semantic-, and intention-based views. The proposed WEM framework combines an implicit world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. Evaluated on the new HTEWorld benchmark (125K video clips, 4.5M frames) and existing manipulation benchmarks, WEM achieves state-of-the-art performance in hybrid navigation-manipulation tasks.
world-ego modelinglong-horizon evolutionhybrid embodied taskscascade-parallel mixture-of-expertsdiffusion generator
GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems
GEM introduces GPU-variability-aware expert mapping for Mixture-of-Expert (MoE) models, addressing straggler GPUs caused by imbalanced token loads and hardware variability. The method profiles GPU performance variability and token load distributions, strategically placing consistent (frequently used) and temporal (co-occurring) experts across GPUs to minimize synchronization delays. Evaluations demonstrate 7.9% average and up to 16.5% maximum latency reduction compared to baseline approaches.
mixture-of-expertgpu variabilitytoken load balancingexpert placementsynchronization bottleneck
A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits
The paper theoretically analyzes out-of-distribution (OOD) generalization in LLM reasoning through a measure-theoretic framework. Using optimal transport, it projects discrete reasoning trajectories into a continuous metric space, quantifying domain shifts via Wasserstein-1 distance and bounding generalization via Kantorovich duality. Key findings show that position-dependent attention mechanisms (e.g., Absolute Positional Encoding) incur higher Lipschitz constants and expected risk compared to shift-invariant alternatives (e.g., Rotary Embeddings). Additionally, mapping backtracking to Dyck-$k$ languages reveals depth scaling is necessary for $\text{TC}^0$ Transformers to avoid representation collapse, which width scaling cannot circumvent. Empirical validation across 54 Transformer configurations confirms generalization risk scales monotonically with Wasserstein domain shift.
optimal transportwasserstein distancelipschitz continuitydyck-$k$ languagebarron spaces
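For readers unfamiliar with the Wasserstein-1 distance used above to quantify domain shift, a toy sketch for the simplest case may help. It assumes two one-dimensional empirical samples of equal size, where the distance reduces to the mean absolute difference between sorted values. The paper works with projected reasoning trajectories in a general metric space, so this is only an illustration of the quantity, and the function name is hypothetical.

```python
def wasserstein1_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    In one dimension the optimal transport plan matches sorted values,
    so W1 is the mean absolute difference of the order statistics.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

source = [0.0, 1.0, 2.0]
shifted = [3.0, 1.0, 2.0]   # same shape, moved right by one unit
print(wasserstein1_1d(source, shifted))
```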
Probabilistic Tiny Recursive Model
The paper introduces Probabilistic Tiny Recursive Model (PTRM), a task-agnostic framework enhancing Tiny Recursive Models (TRMs) through stochastic exploration. PTRM injects Gaussian noise at each recursion step, enabling parallel trajectories to explore diverse solution basins, and selects among them using the model's existing Q head for early stopping. Without retraining or task-specific augmentations, PTRM achieves significant accuracy improvements: 87.4% to 98.75% on Sudoku-Extreme and 62.6% to 91.2% on Pencil Puzzle Bench, outperforming frontier LLMs (55.1%) at 0.0001x cost with 7M parameters.
probabilistic recursionstochastic explorationgaussian noisesolution basinsearly stopping
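The sampling scheme summarized above (Gaussian noise injected at each recursion step, several parallel trajectories, selection by a Q head) can be sketched as below. This is a toy under stated assumptions: `step` and `q_head` are hypothetical stand-ins for the trained recursion and scoring head, the trajectories run sequentially rather than in parallel, and the early-stopping use of the Q head is omitted.

```python
import random

def ptrm_sample(step, q_head, z0, n_traj=8, n_steps=4, sigma=0.1, seed=0):
    """Run n_traj noisy recursions from z0 and keep the best-scoring one.

    step:   hypothetical recursion update, list[float] -> list[float]
    q_head: hypothetical scorer, list[float] -> float (higher is better)
    """
    rng = random.Random(seed)
    best_z, best_q = None, float("-inf")
    for _ in range(n_traj):
        z = list(z0)
        for _ in range(n_steps):
            # Perturb the latent state, then apply the recursion step.
            z = step([zi + rng.gauss(0.0, sigma) for zi in z])
        q = q_head(z)
        if q > best_q:
            best_z, best_q = z, q
    return best_z, best_q

# Toy stand-ins: a contracting step and a score that prefers small norms.
step = lambda z: [0.5 * zi for zi in z]
q_head = lambda z: -sum(zi * zi for zi in z)
print(ptrm_sample(step, q_head, [1.0, -1.0]))
```

With a fixed seed, the first trajectory is identical regardless of `n_traj`, so adding trajectories can only keep or improve the selected score. That is the sense in which extra stochastic samples help without retraining.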
Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains
The paper proposes robotics-inspired guardrails for foundation models in socially sensitive domains, introducing formal constructs for runtime behavioral control over interaction trajectories. The Grounded Observer framework enforces constraints in uncertain, closed-loop systems, addressing cumulative failures through trajectory-level safety rather than individual outputs. Applied to small talk, autism therapy, and school de-escalation deployments, the framework demonstrates effective runtime interventions that mitigate undesirable interaction drift while adapting to social contexts. The work suggests extensions for stronger behavioral guarantees.
foundation modelsruntime behavioral controlinteraction trajectoriesclosed-loop systemsconstraint enforcement
PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
PEEK introduces a context map system for long-context LLM agents, caching reusable orientation knowledge about recurring external contexts (e.g., document corpora) as a small, constant-sized prompt artifact. The method employs three modules: a Distiller for knowledge extraction, a Cartographer for structured edits, and an Evictor for token-budget enforcement. Evaluations show 6.3-34.0% performance gains over ACE, with 93-145 fewer iterations, 1.4-5.8x lower cost, and improved solving rates (6.0-14.0%) and rubric accuracy (7.8-12.1%) across models like OpenAI Codex.
context maplong-context reasoningllm agentsknowledge distillationtoken budget
StruMPL: Multi-task Dense Regression under Disjoint Partial Supervision and MNAR Labels
StruMPL introduces a multi-task dense regression framework for disjoint partial supervision with MNAR labels and inter-task physical constraints, addressing forest aboveground biomass (AGB) estimation from Earth observation data. The method employs a shared encoder with per-variable regression, imputation, and propensity heads for MNAR correction, alongside a learnable physics module enforcing biome-specific allometric constraints. An Augmented IPW (AIPW) pseudo-outcome with stop-gradients ensures stable joint optimization. Evaluated on two biomes, StruMPL reduces high-AGB bias by ~54% and outperforms baselines in RMSE and bias metrics.
multi-task learningmissing not at randomdense regressionallometric constraintsaugmented ipw
Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models
The paper introduces SplitQ, a channel-splitting-driven post-training quantization framework for vision-language models (VLMs) that addresses modality heterogeneity in low-bit quantization. The method features Modality-specific Outlier Channel Decoupling (MOCD) to isolate salient modality-specific outlier channels and Adaptive Cross-Modal Calibration (ACC) with dual learnable branches to mitigate quantization errors. Experiments on 6 multi-modal datasets show SplitQ outperforms existing approaches, preserving 93.5% of FP16 performance under W3A3 quantization (69.5 accuracy vs. 74.3 for FP16).
post-training quantizationvision-language modelsmodality heterogeneityoutlier channel decouplingcross-modal calibration
Real-Time Parallel Counterfactual Regret Minimization
The paper introduces Parallel CFR, the first parallelization framework for real-time depth-limited Counterfactual Regret Minimization (CFR) in imperfect-information games. The method decomposes CFR iterations into a seven-stage pipeline with two orthogonal parallelism dimensions (by information set and by tree node), offloading leaf node evaluation to GPUs via batched neural network inference. Experiments on Heads-Up No-Limit Texas Hold'em show 3.3–3.4× speedup over single-threaded baselines, achieving per-iteration times of ~47–54 ms on a game tree with over 1 billion histories using a single desktop-class device (NVIDIA DGX Spark).
counterfactual regret minimizationparallel computationimperfect-information gamesgpu accelerationreal-time decision making
Fast and Featureless Node Representation Learning with Partial Pairwise Supervision
Contrastive FUSE introduces a fast, unified framework for node representation learning in graphs with partial pairwise labels and no node features. The method optimizes a spectral contrastive objective integrating community-aware structural signals and signed pairwise constraints, replacing costly modularity gradients with a lightweight approximation for scalability. It employs an efficient optimization scheme with natural gradient decomposition and adaptive learning-rate scaling, enabling rapid updates on million-edge graphs. Experiments on citation networks, co-purchase graphs, and OGB datasets demonstrate competitive or superior contrastive classification performance without node features, alongside significant runtime improvements over baselines.
spectral contrastive objectivemodularity gradientnode representation learningadaptive learning-rate scalingpairwise constraints
Streamlined Constraint Reasoning via CNN Pattern Recognition on Enumerated Solutions
We introduce a CNN-LLM pipeline for automated streamliner synthesis in constraint programming, leveraging enumerated solutions to detect structural patterns. The method trains a Convolutional Neural Network contrastively on feasible solutions versus perturbed non-solutions, then translates the CNN's discriminative signal into MiniZinc streamliner constraints via LLM-driven synthesis. Evaluated on hardened benchmarks, the pipeline achieves 98.8%, 98.6%, and 89.4% portfolio time reductions on Vessel Loading, Social Golfers, and Black Hole respectively, with geometric-mean speedups of 932x, 356x, and 1103x. Discovered streamliners include class-based packing constraints, canonicalisations, and layout-coordinate bounds.
streamliner synthesisconvolutional neural networkminizincconstraint programmingcontrastive learning
Deep Tech to Space: Space Data Centers and AI Revolution at the Edge
The article proposes Space Data Centers (SDCs) as AI-driven orbital platforms to address bandwidth and latency constraints in space-to-Earth data transmission. The authors present a constellation architecture for Low Earth Orbit SDCs, detailing orbital design, inter-satellite networking, computational resource allocation, and service orchestration. Technical and economic feasibility is analyzed through forecasting models based on technology roadmaps, with validation via Earth observation and lunar exploration case studies. The approach aims to mitigate ground station limitations caused by visibility windows and scheduling complexity.
space data centerslow earth orbitinter-satellite linksservice orchestrationtechnology roadmaps
Passive Construction Site Safety Monitoring via Persona-Scaffolded Adversarial Chain-of-Thought VLM Verification
The paper introduces a passive construction safety monitoring pipeline using a three-stage architecture: (1) YOLO11 for PPE/hazard detection, (2) SAM 3 for segmentation refinement, and (3) Qwen3-VL-8B-Instruct with persona-scaffolded adversarial chain-of-thought verification. The method achieves a 12% precision improvement over single-pass prompting, particularly in hallucination-prone violation categories, evaluated on the 12-video Ironsite corpus. The system maps violations to OSHA standards, performs ergonomic risk scoring, and generates timestamped safety reports.
yolo11sam 3qwen3-vl-8b-instructadversarial chain-of-thoughtosha standards
StableGrad: Backward Scale Control without Batch Normalization
The paper introduces StableGrad, an optimizer-level mechanism for controlling gradient scales in deep neural networks without modifying forward passes or using batch normalization. The method rescales layer-wise weight gradients during backpropagation, preserving network outputs and derivatives, which is crucial for Physics-Informed Neural Networks (PINNs) where batch normalization disrupts physical consistency. Evaluations demonstrate StableGrad's effectiveness: it improves PINN benchmark accuracy by enabling deeper models and stabilizes BatchNorm-free ResNet/EfficientNet training without architectural changes. Results indicate gradient-scale control at the optimizer level can replace forward normalization where the latter is inappropriate.
gradient scalingphysics-informed neural networksbatch normalizationoptimizer-level controldeep network training
A Framework for Evaluating Zero-Shot Image Generation in Concept-based Explainability
The paper proposes a framework for evaluating zero-shot text-to-image (T2I) generative models as synthetic concept datasets for concept-based explainable AI (XAI). It assesses synthetic concept faithfulness through four analyses: (1) concept representation similarity between synthetic and real images, (2) intra-similarity of progressively larger concept subsets, (3) downstream explanation task performance, and (4) concept removal effects on explanations. Results reveal challenges in using synthetic data for XAI, raising open questions about zero-shot pipelines. The dataset is publicly available.
concept-based xaizero-shot learningtext-to-image generationsynthetic datasetsmodel explainability
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
The authors introduce FineBench, a benchmark for evaluating fine-grained human activity understanding in Vision-Language Models (VLMs), featuring 199,420 QA pairs across 64 long-form videos with dense spatial/temporal annotations. They propose FineAgent, a modular framework combining a Localizer and Descriptor to enhance VLMs, demonstrating consistent performance improvements on FineBench. Results show proprietary models like GPT-5 perform adequately, while open-source VLMs struggle with spatial reasoning and subtle action distinctions.
vision-language modelsfine-grained understandingvideo question answeringspatial reasoningmodular framework
CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse Weather Perception in Autonomous Driving
CADENet introduces a training-free, three-thread system for adverse weather perception in autonomous driving, addressing the latency and annotation completeness bias of existing approaches. The system comprises Thread S (YOLOv11n) for zero-latency detections, Thread Q for condition-adaptive enhancement (CAPE) fused via entropy-guided NMS (EG-NMS), and Thread E for CLIP-based zero-shot weather classification requiring only text prompts. Evaluated on 1327 DAWN images, CADENet achieves Recall = 0.0103 (micro), F1 = 0.0230 on snow, and F1 = 0.0038 on rain, with Thread S sustaining ≈44 FPS. The method formalizes annotation completeness bias, ensuring recall as the primary metric.
adverse weather perceptioncondition-adaptive enhancemententropy-guided nmszero-shot classificationannotation completeness bias
A Closed-loop, State-centric, Multi-agent Framework for Passenger Load Estimation from Heterogeneous Data Streams
The paper proposes a closed-loop, state-centric, multi-agent framework for robust passenger load estimation from heterogeneous data streams, addressing challenges like incremental count errors and sensor reliability. The method enforces physical feasibility, dynamically allocates trust among evidence sources, and uses physics-derived violation residuals for training. The architecture includes a unified stop-event backbone, a coupled Perception–Physical–Fusion loop for stop-by-stop inference, and optional trip-level macro-correction modules. This approach improves accuracy in automatic passenger counting (APC) systems under varying operational conditions.
passenger load estimationheterogeneous data streamsclosed-loop frameworkmulti-agent systemautomatic passenger counting
Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation
Mega-ASR introduces a unified framework for robust automatic speech recognition (ASR) in real-world environments, addressing the 'acoustic robustness bottleneck' through scalable compound-data construction and progressive acoustic-to-semantic optimization. The method leverages Voices-in-the-Wild-2M, a dataset covering 7 acoustic phenomena and 54 compound scenarios, and employs Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Experiments show Mega-ASR outperforms state-of-the-art systems on adverse-condition benchmarks, achieving 45.69% WER on VOiCES R4-B-F (vs. 54.01%) and 21.49% on NOIZEUS Sta-0 (vs. 29.34%), with over 30% relative WER reduction in complex compositional scenarios.
automatic speech recognitionacoustic robustnessprogressive fine-tuningwer-gated optimizationcompound-data construction
Explainable Wastewater Digital Twins: Adaptive Context-Conditioned Structured Simulators with Self-Falsifying Decision Support
The paper presents CCSS-IX, an explainable digital twin for wastewater treatment plants, combining interpretable locally linear state-space models with a context-aware gating network. The system features a runtime decision layer using conformal risk control to certify or falsify operator-proposed actions, providing finite-sample coverage guarantees. Evaluated on the Avedøre and Agtrup/BlueKolding plants and BSM2 benchmark, the method achieves 0.78-1.08% RMSE versus black-box baselines, reduces aggregate regret by 43.6%, and prevents 93/187 false-safe N2O approvals (4.65× baseline improvement, p<1e-21).
digital twinconformal risk controlstate-space modelswastewater treatmentinterpretable ai
From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning
This work introduces temporal conditioning in inter-agent communication to enhance coherence in scene-to-plan reasoning for autonomous vehicles, addressing inconsistencies in continuous actions. Three planner architectures with increasing temporal integration were evaluated on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. Results indicate no statistically significant improvements in standard NLP-based correctness metrics, but qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence in the Sentinel architecture. The study establishes the first empirical benchmark for temporal scene-to-plan reasoning.
temporal conditioningscene-to-plan reasoningautonomous vehiclesinter-agent communicationbdd-x dataset
Smooth Piecewise Cutting for Neural Operator to Handle Discontinuities and Sharp Transitions
We propose Cut-DeepONet, a two-stage neural operator framework that explicitly models discontinuities in PDE solutions while reducing learning complexity. The method employs a lifting strategy to partition the domain into smooth subregions, representing discontinuities as boundaries in higher-dimensional space, and uses an auxiliary network to predict discontinuity locations for unseen inputs. Experiments on benchmark PDEs demonstrate that Cut-DeepONet outperforms state-of-the-art methods on problems with discontinuities and sharp transitions, achieving superior performance even with low-resolution training data and fewer trainable parameters.
neural operatordiscontinuitieslifting strategypartial differential equationscut-deeponet
ST-TGExplainer: Disentangling Stability and Transition Patterns for Temporal GNN Interpretability
ST-TGExplainer introduces a self-explainable temporal graph neural network (TGNN) that disentangles stability and transition patterns for improved interpretability. The method employs a disentangled information bottleneck objective to learn compact explanatory subgraphs, explicitly suppressing redundancy between historical (stability) and emerging (transition) interaction patterns. Experiments demonstrate that ST-TGExplainer achieves strong predictive performance while providing more faithful explanations compared to existing interpretable TGNNs.
temporal graph neural networksinterpretabilitydisentangled information bottleneckstability patternstransition patterns
LP-Eval: Rubric and Dataset for Measuring the Quality of Legal Proposition Generation
The paper introduces LP-Eval, a rubric and dataset for evaluating legal proposition generation quality, co-designed with legal experts. The method decomposes quality into formal validity and substantive dimensions, applying it to 100 LLM-generated propositions from Court of Justice of the European Union decisions. Results indicate LLMs produce predominantly well-formed propositions, with higher quality for established cases, while rubric-guided LLM evaluations align moderately with expert assessments but lack sensitivity to fine-grained distinctions.
legal proposition generationevaluation rubriclarge language modelsexpert annotationsformal validity
FLUXtrapolation: A benchmark on extrapolating ecosystem fluxes
The paper introduces FLUXtrapolation, a benchmark for evaluating machine learning models on ecosystem flux extrapolation under distribution shifts. It addresses the challenge of upscaling flux measurements from sparse tower sites to global estimates, considering covariate and conditional shifts across climates and ecosystems. The benchmark includes temporal, spatial, and temperature-based extrapolation scenarios, with evaluation metrics focusing on held-out domains, temporal aggregations, and tail errors. Initial results show baseline models perform similarly in median hourly RMSE but diverge under tail-focused and multi-scale assessments, highlighting the benchmark's utility for advancing flux upscaling methods.
flux upscalingdistribution shiftcovariate shiftconditional shiftrmse
Chunking German Legal Code
The study evaluates chunking strategies for retrieval-augmented generation in German statutory law, using the German Civil Code as a benchmark. It compares structural units (sections, subsections), fixed-size windows, contextual chunking, semantic clustering, Lumber, and RAPTOR-based hierarchical retrieval, measuring recall, latency, build time, and storage. Results indicate that structure-aligned methods (sections, subsections) achieve highest recall and computational efficiency, outperforming complex LLM-intensive approaches. The findings emphasize the importance of preserving domain-specific structure for legal retrieval.
retrieval-augmented generationchunking strategieslegal information retrievalcontextual chunkinghierarchical retrieval
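Structure-aligned chunking of the kind the study found most effective can be as simple as splitting at section markers. The sketch below is an illustration only: it assumes each section starts on its own line with "§" followed by a number, which is a simplification of real statute layouts, and both the function name and the sample text are placeholders.

```python
import re

def chunk_by_section(text):
    """Split statute text into one chunk per section.

    Assumes every section begins at the start of a line with '§ <number>'.
    The zero-width lookahead keeps the marker inside its own chunk.
    """
    parts = re.split(r"(?m)^(?=§\s*\d+)", text)
    return [p.strip() for p in parts if p.strip()]

sample = "§ 1 Heading A\nBody of section one.\n§ 2 Heading B\nBody of section two."
for chunk in chunk_by_section(sample):
    print(repr(chunk))
```

Each chunk keeps its section number and heading, so a retrieved passage can be cited by section without extra metadata.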
Latent Laplace Diffusion for Irregular Multivariate Time Series
Latent Laplace Diffusion (LLapDiff) proposes a generative framework for irregular multivariate time series forecasting, avoiding temporal distortion from re-gridding and drift from sequential solvers. The method models targets as low-dimensional latent trajectories, using a stable modal parameterization inspired by port-Hamiltonian dynamics and Laplace-domain complex-conjugate poles for direct irregular timestamp evaluation. A gap-aware history summarizer links continuous dynamics to observations via renewal-averaging analysis. Experiments demonstrate LLapDiff's superiority in long-horizon forecasting and missing-value imputation, with code publicly available.
latent laplace diffusionirregular time seriesport-hamiltonian dynamicsrenewal-averaging analysisgap-aware summarizer
Stitched Value Model for Diffusion Alignment
We propose StitchVM, a model stitching framework for aligning diffusion-based generative models with task-specific rewards by transferring pretrained pixel-space reward models to noisy latent regimes. The method attaches a frozen diffusion backbone to a truncated pixel-space reward model, combining robust reward capabilities with native noisy latent handling. Stitching and fine-tuning CLIP ViT-L and SD 3.5 Medium requires only 10 GPU-hours. This approach enables amortized value function construction rather than per-sample approximation, improving downstream steering and post-training methods: DPS becomes 3.2× faster with 50% reduced peak GPU memory, and DiffusionNFT becomes 2.3× faster.
diffusion alignmentmodel stitchingnoisy latentsreward transferamortized estimation
Synergistic Foundation Models for Semi-Supervised Fetal Cardiac Ultrasound Analysis: SAM-Med2D Boundary Refinement and DINOv3 Semantic Enhancement
The authors propose a semi-supervised framework for fetal cardiac ultrasound analysis, combining joint segmentation and classification via synergistic foundation models. The method integrates SAM-Med2D for boundary refinement and DINOv3 for pseudo-label enhancement, employing view-specific hard masking and a two-stage optimization strategy (EMA phase followed by Classification Fine-Tuning). Evaluated on FETUS 2026, it achieves 79.99% Dice Similarity Coefficient, 61.62% Normalized Surface Distance, and 41.20% F1-score, demonstrating effectiveness for prenatal congenital heart disease screening.
semi-supervised learningboundary refinementpseudo-label enhancementmulti-task backbonetwo-stage optimization
AffectAI-Capture: A Reproducible Multimodal Protocol for Small-Group Meeting Research
AffectAI-Capture introduces a reproducible protocol for synchronized multimodal data collection in four-person meeting interactions, integrating eye tracking, wearable physiology, audio, multi-view video, event logging, and structured self-report. The protocol employs fixed task blocks based on established group-interaction paradigms, with acquisition and post-processing organized around a unified event timeline and standardized outputs. Pilot validation confirmed audio quality and video synchronization through controlled bench tests, while full protocol sessions with participants are ongoing. This architecture links task design, instrumentation, timing provenance, and data packaging for affective, behavioral, and meeting-analytics research.
multimodal dataeye trackingwearable physiologyevent timelinetask blocks
Prior Knowledge or Search? A Study of LLM Agents in Hardware-Aware Code Optimization
The study investigates the role of pretrained priors versus search in LLM agents for hardware-aware code optimization through three experiments. Findings reveal: (1) LLMs act as greedy optimizers in black-box settings; (2) zero-shot kernel generation ignores explicit input-size instructions, with performance degrading for uncommon sizes; (3) iterative feedback improves CUDA but degrades TVM IR, indicating language density affects optimization. Results suggest LLMs rely heavily on pretrained knowledge over feedback or agentic structure.
llm agentshardware-aware optimizationkernel generationpretrained priorsfeedback-loop optimization
From SGD to Muon: Adaptive Optimization via Schatten-p Norms
The paper introduces a data-driven criterion for dynamically selecting optimal Linear Minimization Oracle (LMO) geometries in deep neural network optimization, interpolating between SGD and Muon updates. The method derives closed-form update rules from gradient and activation statistics using a single-step random feature regression surrogate, while incorporating parameter-wise preconditioning to recover SGD, Muon, Adam, and MuAdam as special cases. With only ~3% runtime overhead, the adaptive optimizer matches or outperforms Muon and AdamW across three training scenarios, demonstrating that LMO geometry can be efficiently adapted from runtime data.
linear minimization oracleadaptive optimizationschatten-p normsrandom feature regressionpreconditioning
Distribution-Free Uncertainty Quantification for Continuous AI Agent Evaluation
The paper introduces distribution-free uncertainty quantification methods for continuous AI agent evaluation, adapting split conformal prediction and adaptive conformal inference (ACI) to provide coverage guarantees for forecasted quality scores. The approach achieves calibration error below 0.02 at 24h horizons, with ACI dynamically adjusting intervals by 35% post-agent releases. It extends to multi-agent pipelines with compositional bounds (validated for inter-stage correlations ρ ∈ [-0.5, 0.9]), conformal abstention for pairwise rankings, and FDR-corrected leaderboard testing. Evaluation on 50 agents using 18 hourly signals shows mean conditional coverage of 80.4%, with 90% of agents within [72%, 90%], and cross-source sentiment divergence predicting ranking instability (r=0.64, p<0.01).
conformal predictionadaptive conformal inferenceuncertainty quantificationmulti-agent pipelinesconditional coverage
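The two building blocks named above, split conformal prediction and adaptive conformal inference (ACI), can be sketched generically as follows. This is a textbook-style illustration, not the paper's pipeline: absolute residuals are assumed as the nonconformity score, and the function names and the step size `gamma` are hypothetical.

```python
import math

def split_conformal_interval(cal_preds, cal_truths, new_pred, alpha=0.1):
    """Symmetric prediction interval from a held-out calibration set."""
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_truths))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))   # finite-sample corrected rank
    if rank > n:                              # too few calibration points
        return (-math.inf, math.inf)
    q = scores[rank - 1]
    return (new_pred - q, new_pred + q)

def aci_update(alpha_t, alpha_target, miss, gamma=0.05):
    """One ACI step: lower alpha (widen intervals) after a miss,
    raise it (tighten intervals) after a covered observation."""
    err = 1.0 if miss else 0.0
    return alpha_t + gamma * (alpha_target - err)

cal_preds = [0.0] * 9
cal_truths = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
print(split_conformal_interval(cal_preds, cal_truths, 10.0, alpha=0.5))
print(aci_update(0.1, 0.1, miss=True))
```

The split conformal interval has a marginal coverage guarantee under exchangeability; the ACI update is what lets the working miscoverage level drift when that assumption breaks, for example after a new agent release.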
OpenComputer: Verifiable Software Worlds for Computer-Use Agents
OpenComputer introduces a verifier-grounded framework for constructing verifiable software worlds for computer-use agents, integrating four components: app-specific state verifiers, a self-evolving verification layer, a task-generation pipeline, and an evaluation harness. The framework supports 33 desktop applications and 1,000 finalized tasks across diverse domains. Experiments demonstrate that OpenComputer's hard-coded verifiers outperform LLM-as-judge evaluations in aligning with human adjudication, particularly for fine-grained application states. Frontier agents exhibit challenges in end-to-end task completion, and open-source models show significant performance drops from their OSWorld-Verified scores, highlighting gaps in robust computer automation.
verifier-grounded frameworkapp-specific state verifiersself-evolving verification layertask-generation pipelineevaluation harness
Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs
(No summary returned.)
AR1-ZO: Topology-Aware Rank-1 Zeroth-Order Queries for High-Rank LoRA Fine-Tuning
AR1-ZO introduces a topology-aware rank-1 zeroth-order optimization method for high-rank LoRA fine-tuning, addressing the rank paradox where increasing LoRA rank degrades finite-difference signal quality. The method decomposes LoRA into rank-1 atoms, querying one atom per step with topology-aware scaling γ=αr to maintain signal strength without auxiliary mechanisms. Theoretical analysis confirms atom minimality and rank-independent active query dimension, while experiments on OPT and Qwen3 models demonstrate improved effectiveness for high-rank LoRA under fixed query budgets.
zeroth-order optimizationlora fine-tuningrank paradoxfinite-difference signaltopology-aware scaling
Synthesis and Evaluation of Long-term History-aware Medical Dialogue
The authors introduce MediLongChat, a framework for synthesizing longitudinal medical dialogues to address the lack of datasets for evaluating long-term patient history reasoning. Their method involves three stages: creating synthetic patient profiles, generating multi-turn dialogues per encounter, and integrating them into coherent histories. They propose three benchmark tasks (In-dialogue, Cross-dialogue, and Synthesis Reasoning) and a multi-dimensional evaluation framework combining vector-based metrics with LLM-as-a-judge assessments. Experiments reveal state-of-the-art LLMs struggle with these tasks, demonstrating the benchmark's utility and the need for specialized healthcare agent methods.
medical dialogue synthesislongitudinal reasoningllm-as-a-judgemulti-turn dialogue generationhealthcare agent evaluation
GroupAffect-4: A Multimodal Dataset of Four-Person Collaborative Interaction
The GroupAffect-4 dataset addresses gaps in multimodal affective computing by capturing co-located group interactions across four ecologically varied tasks (information pooling, negotiation, idea generation, public-goods game). It includes synchronized data from 40 participants (10 groups) with wrist-worn physiology sensors, eye-tracking glasses, close-talk microphones, self-reports, questionnaires, and personality scores. The dataset achieves 91% physiology and 98% eye-tracking coverage, validated by affective manipulation checks. Fifteen benchmark targets span within-person states, between-person traits, and group dynamics. Released with BIDS structure, Croissant metadata, and open scripts, it supports reproducible research in collaborative affect analysis.
multimodalaffective computinggroup dynamicsphysiologyeye-tracking
What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code
The study challenges the assumption that code universally enhances reasoning in language models through controlled pretraining on a 10T-token corpus. By isolating executable code and controlling for Code-NL data, it finds code improves programming but not general reasoning, often competing with mathematical tasks. Structured reasoning traces (e.g., code-text mixtures) better explain reasoning gains than pure code. Increasing math-domain sample density boosts mathematical reasoning without compromising programming, suggesting targeted cognitive scaffolds mitigate cross-domain trade-offs. Routing analyses reveal domain interactions in expert-activation patterns.
language modelsmathematical reasoningstructured reasoning tracesdata-composition effectsexpert-activation patterns
CogScale: Scalable Benchmark for Sequence Processing
The paper introduces CogScale, a benchmark of 14 scalable synthetic tasks designed to evaluate sequence processing capabilities across different architectural scales. The framework enables efficient testing of novel architectures under controlled parameter budgets (1k, 10k, 100k) before large-scale deployment. Evaluations of seven architectures (GRU, LSTM, xLSTM, ESN, Mamba, Transformer variants) demonstrate that attention mechanisms and state-space models maintain performance at higher complexity, while classical RNNs excel only in basic retention tasks under strict parameter constraints.
sequence processingparameter budgetstate-space modelssynthetic tasksarchitectural evaluation
Memory-Augmented Reinforcement Learning Agent for CAD Generation
The paper proposes a memory-augmented reinforcement learning framework for CAD generation agents to address limitations of LLM-based methods in handling complex models with long operation sequences and geometric constraints. The method integrates a geometric kernel toolchain, dual-track memory (case library and skill library), and dynamic utility retrieval with reinforcement learning for policy optimization. Experiments demonstrate significant improvements in success rate and geometric consistency for complex CAD generation tasks.
cad generationreinforcement learningmemory-augmentedgeometric constraintsdynamic retrieval
EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design
The paper introduces EngiAI, a multi-agent framework and benchmark suite for evaluating LLM-driven engineering design systems. The benchmark assesses three dimensions: workflow (7 prompt styles), retrieval-augmented generation (RAG with gated scoring), and HPC orchestration (SLURM cluster). The LangGraph-based MAS implementation coordinates 7 specialized agents for tasks like topology optimization and 3D printer control. Results show proprietary LLMs achieve 96-97% task completion on Beams2D versus 55-78% for open-source 4B-parameter models, with conditional branching proving most challenging (20-53% completion). RAG gating confirms retrieval's critical role (≈1.0 vs. near-zero scores), while HPC orchestration reveals performance variability (50-100% completion).
multi-agent systemretrieval-augmented generationtopology optimizationhpc orchestrationlanggraph
TERGAD: Structure-Aware Text-Enhanced Representations for Graph Anomaly Detection
TERGAD introduces a structure-aware text-enhanced representation framework for graph anomaly detection, addressing limitations in existing text-rich approaches by incorporating node-level topological properties. The method translates topological features into natural language narratives, processes them via Large Language Models (LLMs) to derive semantic embeddings, and fuses these with original node attributes using a gated dual-branch autoencoder. Anomaly scores are computed based on integrated reconstruction errors, capturing deviations in both attributes and semantic expectations. Experiments on six real-world datasets show TERGAD outperforms state-of-the-art baselines, with ablation studies confirming the importance of structural semantic guidance and gated fusion.
graph anomaly detectionlarge language modelssemantic embeddingsdual-branch autoencoderreconstruction error
ContextRAG: Extraction-Free Hierarchical Graph Construction for Retrieval-Augmented Generation
ContextRAG introduces a retrieval-augmented generation system that constructs hierarchical graphs without LLM-based entity extraction, reducing computational costs. The method employs residual-quantization k-means and Formal Concept Analysis with Lukasiewicz residuated logic to derive fuzzy concept graphs through soft join/meet operations. On UltraDomain's 130-task subset, it achieves 33.6% F1 (36.8% on multi-hop tasks) using only 30 LLM calls (22k tokens), versus HiRAG's extrapolated 23M tokens, with lattice-derived nodes improving F1 by +3.9pp.
retrieval-augmented generationformal concept analysisresidual-quantizationlukasiewicz logicmulti-hop reasoning
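For readers unfamiliar with the Lukasiewicz residuated logic named in the ContextRAG summary, the sketch below shows the two standard connectives on truth degrees in [0, 1]: the t-norm (fuzzy conjunction) and its residuum (fuzzy implication). These formulas are textbook definitions; how ContextRAG wires them into its soft join/meet operations is not stated in the summary, so the function names here are illustrative only.

```python
def luk_tnorm(a: float, b: float) -> float:
    """Lukasiewicz t-norm (fuzzy AND): max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)

def luk_residuum(a: float, b: float) -> float:
    """Lukasiewicz residuum (fuzzy implication a -> b): min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)

# Example truth degrees on a quarter grid (exactly representable as floats).
print(luk_tnorm(0.75, 0.5))     # 0.25
print(luk_residuum(0.75, 0.5))  # 0.75
```

The two operations are tied together by the adjointness property: `luk_tnorm(a, b) <= c` holds exactly when `a <= luk_residuum(b, c)`, which is what makes the structure a residuated lattice.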
LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models
The paper proposes LIFT and PLACE, a coarse-to-fine knowledge distillation framework for lightweight diffusion models, addressing challenges in mimicking complex teacher denoising processes. LIFT decomposes training into coarse alignment and fine refinement phases, while PLACE extends this by partitioning outputs into error-based groups for locally adaptive guidance. The framework demonstrates effectiveness across diffusion spaces (image/latent), architectures (U-Net/DiT), tasks (unconditional/conditional), and datasets, including flow-based models like MMDiT (SD3). Under extreme compression with a 1.3M-parameter student (1.6% of teacher size), conventional KD fails (FID 50-200+), whereas the proposed method achieves stable convergence with FID 15.73.
knowledge distillationdiffusion modelscoarse-to-finedenoising processparameter compression
Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges
This survey provides a comprehensive analysis of mathematical reasoning in Large Language Models (LLMs), synthesizing approximately 120 studies to evaluate datasets, architectures, training strategies, and evaluation protocols. The study introduces a unified taxonomy for mathematical datasets, categorizing them by pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks. It systematically examines reasoning architectures, including tool integration and verifier-guided reasoning, revealing gaps in process-level verification and identifying failure modes like reasoning faithfulness and benchmark biases. Key research directions focus on improving symbolic grounding and evaluation reliability for robust LLM-based reasoning systems.
mathematical reasoninglarge language modelssupervised fine-tuningverifier-guided reasoningsymbolic grounding
Measuring Safety Alignment Effects in Autonomous Security Agents
The study introduces a trace-based benchmark to evaluate safety alignment effects in autonomous security agents, addressing limitations of single-turn refusal tests. The method analyzes 30 vulnerability-analysis tasks with deterministic metrics, comparing four language model families (including Gemma, Qwen2.5-Coder, and Llama) and their uncensored derivatives across 2,300 traces. Results show Gemma models exhibit significant performance gains in less-restricted variants (14.0% vs 0.7% success for Gemma 4 31B) with improved grounding, while other models demonstrate inconsistent or negative effects. The benchmark reveals task-specific alignment tradeoffs, demonstrating the need for system-level evaluation that separates refusal rates, tool reliability, and evidence grounding.
safety alignmentautonomous agentsvulnerability analysistrace-based benchmarklanguage models
Projecting Latent RL Actions: Towards Generalizable and Scalable Graph Combinatorial Optimization
The authors propose projection agents, a novel RL-GCO approach that addresses generalization and scalability challenges in graph combinatorial optimization. The method operates in a continuous GNN-based action embedding space, predicting latent actions via a single forward pass and decoding them into discrete actions using nearest-neighbor techniques. Evaluations show 16.2x faster inference and 40% better generalization across benchmarks, with potential for super-linear decision spaces. The work includes LaGCO-RL, a Python library for latent action-space construction and RL-GCO reproducibility.
graph combinatorial optimizationreinforcement learninggraph neural networkslatent action spacenearest-neighbor decoding
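The decoding step described in the projection-agents summary (predict a latent action, then map it to a discrete action by nearest neighbor) can be sketched as follows. This is a minimal illustration that assumes Euclidean distance and a plain list of action embeddings; `decode_action` is a hypothetical name and is not taken from the LaGCO-RL library.

```python
import math

def decode_action(latent, action_embeddings):
    """Return the index of the discrete action whose embedding is closest
    (in Euclidean distance) to the predicted latent action."""
    return min(
        range(len(action_embeddings)),
        key=lambda i: math.dist(latent, action_embeddings[i]),
    )

# Three candidate actions embedded in 2-D; the prediction lands nearest to action 1.
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(decode_action([0.9, 0.2], embeddings))  # 1
```

A linear scan like this costs one distance computation per candidate action; a real system with a large action set would more plausibly use an approximate nearest-neighbor index.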
Beyond Rational Illusion: Behaviorally Realistic Strategic Classification
The paper introduces behaviorally realistic strategic classification (BRSC), addressing the limitation of assuming strict rationality in strategic classification by incorporating cognitive biases from prospect theory. The proposed Prospect-Guided Strategic Framework (Pro-SF) models agent behavior through three prospect theory mechanisms: asymmetric cost-benefit evaluation, subjective reference points, and non-rational probability distortion. Experiments on synthetic and real-world datasets demonstrate Pro-SF's effectiveness in capturing behaviorally grounded strategic manipulations, bridging machine learning and behavioral economics for more reliable decision systems.
strategic classificationprospect theorycognitive biasesstackelberg gamebehavioral economics
Transforming Constraint Programs to Input for Local Search
The paper presents an automated method for transforming constraint optimization problems into local search neighborhoods by leveraging symmetry properties. Using the IDP system, the approach compiles constraint specifications into neighborhood structures suitable for metaheuristic algorithms. Evaluation on six classical optimization problems demonstrates the technique's viability, with results suggesting effective neighborhood generation without manual intervention.
constraint optimizationlocal searchsymmetry propertiesmetaheuristic algorithmsneighborhood generation
CriterAlign: Criterion-Centric Rationale Alignment for Code Preference Judging
CriterAlign introduces a criterion-centric framework for pairwise code preference prediction, addressing limitations of pointwise rubric-based LLM judges. The method employs direct criterion-level pairwise judgments, tie-driven criterion refinement, swap-consistency filtering, and pairwise synthesis, enhanced by Human-Preference-Aligned Guidance (HPAG) derived from training examples. Evaluated on BigCodeReward, CriterAlign improves accuracy from 60.4% to 66.3% over a Qwen2.5-VL-32B monolithic judge, with ablations validating the pairwise criterion design and HPAG.
pairwise preference predictionrubric-based judginghuman-preference-aligned guidanceswap-consistency filteringcriterion-centric framework
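Swap-consistency filtering, as named in the CriterAlign summary, can be illustrated with a small sketch: the judge is asked twice with the candidates in opposite order, and the verdict is kept only if both orderings agree. The `judge` callable, its `"first"`/`"second"`/`"tie"` return values, and the function name are assumptions made for this example; the summary does not describe CriterAlign's actual interface.

```python
def swap_consistent_verdict(judge, code_a, code_b, criterion):
    """Query the judge in both presentation orders and keep the verdict only
    when it does not depend on position.

    judge(x, y, criterion) is assumed to return "first", "second", or "tie".
    Returns "A", "B", "tie", or None when the two orderings disagree.
    """
    forward = judge(code_a, code_b, criterion)
    reverse = judge(code_b, code_a, criterion)
    # Translate the reversed-order answer back into forward-order terms.
    flipped = {"first": "second", "second": "first", "tie": "tie"}[reverse]
    if forward != flipped:
        return None  # position-dependent judgment: filter it out
    return {"first": "A", "second": "B", "tie": "tie"}[forward]

# Toy judge that prefers the shorter snippet, so it is order-independent.
def shorter_is_better(x, y, criterion):
    if len(x) == len(y):
        return "tie"
    return "first" if len(x) < len(y) else "second"

print(swap_consistent_verdict(shorter_is_better, "a+b", "sum([a, b])", "concision"))  # A
```

A judge with pure position bias (always answering "first") would return contradictory verdicts across the two orderings and be filtered to `None`.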
Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models
The paper introduces Pseudocode-guided Structured Reasoning (PStar), a framework to mitigate hallucinations in Vision-Language Models (VLMs) by adaptively selecting structured pseudocode reasoning paths. PStar employs abstract reasoning functions, a pseudocode library, and a Difficulty Feature Vector (DFV) to assess question complexity and choose appropriate strategies. Experiments show PStar reduces hallucination rates, achieving 87.1% on POPE and 68.0% on MMStar, outperforming GPT-4V, thus enhancing reliability for real-world deployments.
pstarvision-language modelspseudocode reasoningdifficulty feature vectorhallucination mitigation
When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach
The paper introduces Strategic Prior-data Fitted Network (SPN), an inference-time framework that adapts tabular foundation models to strategic data manipulation by aligning predictions with post-manipulation distributions. SPN addresses the prior mismatch in pretrained PFNs by constructing strategic in-context examples, avoiding retraining. Experiments on real-world and synthetic datasets demonstrate SPN's superior robustness and predictive accuracy over both tabular foundation models and classical methods in strategic settings.
tabular foundation modelsstrategic manipulationprior-data fitted networksdistribution shiftin-context learning
The Accessibility Capability Boundary: Operational Limits and Expansion Potential of AI-Generated Browser-Native Accessibility Systems
The paper introduces the Accessibility Capability Boundary (ACB), a formal framework for analyzing the operational limits and expansion potential of AI-driven accessibility systems. It models accessibility as a multidimensional capability space constrained by variables such as deployment latency, cognitive load, and adaptability. The authors argue that AI-generated, browser-native systems leveraging standard APIs can reduce deployment friction and enable rapid interface adaptation. The framework is grounded in two prototypes: an AI-generated browser-native interface for a blind user in Nepal and an open-source webcam alignment assistant for visually impaired users. The study identifies computational, infrastructural, and verification constraints as hard boundaries for this paradigm.
accessibility capability boundaryai-generated systemsbrowser-nativedeployment latencycognitive load
P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation
P2DNav introduces a hierarchical framework for zero-shot vision-and-language navigation (VLN), addressing the limitations of existing methods by disentangling directional reasoning from local grounding. The framework comprises three components: Panorama-to-Downview (P2D), which separates navigation into panoramic direction selection and downview local grounding; Sliding-Window Dialogue Memory (SDM), organizing navigation history as multi-turn dialogue context; and Reflective Reorientation Mechanism (RRM), enabling reliability assessment and reorientation. Evaluated on the R2R-CE benchmark, P2DNav achieves significant success rate (SR) gains of 146.6% and 58.9% over state-of-the-art zero-shot waypoint-based and waypoint-free methods, respectively.
zero-shot vlpanorama-to-downviewsliding-window dialogue memoryreflective reorientation mechanismr2r-ce benchmark
optimize_anything: A Universal API for Optimizing any Text Parameter
The paper introduces optimize_anything, a universal LLM-based optimization system that achieves state-of-the-art results across six diverse tasks by formulating optimization as improving text artifacts evaluated by scoring functions. The system supports single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs, demonstrating capabilities such as raising Gemini Flash's ARC-AGI accuracy from 32.5% to 89.5%, reducing cloud costs by 40%, and generating CUDA kernels that match or beat PyTorch in 87% of cases. Ablations show that actionable side information improves convergence and final scores, while multi-task search benefits from cross-task transfer. The work is open-sourced as part of the GEPA project.
llm-based optimizationmulti-task searchcross-problem transfertext artifactscoring function
EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection
The paper proposes Emo-Boost, a multimodal deepfake detection framework that augments low-level audio-visual features with high-level emotion cues to improve generalization. The method integrates an off-the-shelf RGB/acoustic detector with EmoForensics, which models temporal consistency in vision/audio emotion representations. Results show complementary signals between emotion and low-level features, yielding a 2.1% AUC improvement in cross-manipulation generalization on FakeAVCeleb.
deepfake detectionmultimodal fusionemotion recognitiongeneralizationtemporal consistency
Component-Aware Structure-Preserving Style Transfer for Satellite Sim2Real 6D Pose Estimation
The paper introduces a component-aware structure-preserving style transfer framework for satellite synthetic-to-real data construction to improve 6D pose estimation. The method uses weakly paired real-synthetic samples, extracts part-wise real-domain style codes, and injects them into synthetic regions via mask-aligned modulation, preserving geometric annotations. Adversarial training with local contrastive consistency and edge-preserving constraints maintains downstream usability. Evaluated on 5,000 synthetic and 100 real images, the approach achieves FID 54.32 and KID 0.048, improving GDRNet's ADD pass rate to 0.260 and AUC to 0.611.
sim2realstyle transfer6d pose estimationcomponent-awareadversarial training
MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
The paper introduces MiMuon, a mixed optimizer combining Muon and momentum-based SGD, to improve generalization in large models. Through algorithmic stability analysis, the authors prove MiMuon achieves a lower generalization error bound of O(1/N) compared to Muon's O(1/(Nκ^T)), where κ is the minimum singular value difference in gradient estimates. The method employs orthogonalized gradients while maintaining Muon's O(1/T^(1/4)) convergence rate. Experiments on Qwen3-0.6B and YOLO26m validate MiMuon's efficacy.
muon optimizergeneralization errororthogonalizationalgorithmic stabilitylarge models
Spectral Integrated Gradients for Coarse-to-Fine Feature Attribution
The authors propose Spectral Integrated Gradients (SIG), a novel feature attribution method that improves upon Integrated Gradients by constructing integration paths via singular value decomposition. SIG activates singular components from largest to smallest, enabling a coarse-to-fine attribution progression that reduces gradient noise accumulation. Evaluations across multiple image classification datasets show SIG produces cleaner attribution maps and outperforms existing path-based methods in quantitative metrics.
integrated gradientsfeature attributionsingular value decompositioncoarse-to-finegradient noise
Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents
The paper introduces Formal Skill, a runtime-native abstraction for LLM agents that encodes reusable capabilities through JSON metadata, Python executors, and hook-governed control logic. This approach moves procedural knowledge from prompt text to executable state machines, improving token efficiency and policy enforcement. Implemented in FairyClaw, an event-driven runtime, Formal Skill achieves competitive scores on Harness-Bench with reduced token usage, particularly excelling in tasks requiring structured skill execution.
formal skillllm agentsruntime abstractiontoken efficiencyhook policies
A novel YOLO26-MoE optimized by an LLM agent for insulator fault detection considering UAV images
The paper proposes YOLO26-MoE, a modified YOLO26 detector integrating a sparse Mixture-of-Experts (MoE) module in its high-resolution branch for improved insulator fault detection in UAV imagery. The architecture enables adaptive feature refinement for small defects and diverse fault patterns while maintaining one-stage detection efficiency. A tool-augmented LLM agent orchestrates hyperparameter optimization and training. Evaluations show state-of-the-art performance with 0.9900 mAP@0.5 and 0.9515 mAP@0.5:0.95, surpassing contemporary YOLO variants.
yolo26-moemixture-of-expertsuav inspectioninsulator fault detectionllm agent optimization
Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption
The paper presents an empirical study of multi-model LLM scheduling under GPU memory constraints, analyzing performance impacts of layer offloading and preemption. Through systematic measurements across diverse models and hardware, it reveals non-linear throughput degradation from offloading (with smaller models showing greater sensitivity) and identifies preemption overhead dominated by model state reload rather than KV-cache transfer. Key findings include significant variation in data movement costs across architectures and the influence of sequence length on execution inefficiencies. The work provides design principles for future schedulers handling heterogeneous multi-model workloads.
large language modelsgpu offloadingpreemption overheadkv-cacheheterogeneous scheduling
Implicit Action Chunking for Smooth Continuous Control
The paper introduces Dual-Window Smoothing (DWS), an implicit action chunking framework for smooth continuous control in reinforcement learning. DWS enforces temporal coherence without expanding the action space via a dual-window design: an execution window for deterministic modulation and a value window for bias correction. A lightweight actor-side temporal regularizer promotes global continuity. Evaluated on DeepMind Control Suite and industrial energy management tasks, DWS outperforms state-of-the-art baselines, achieving smoother control, reduced jitter, and a 100% success rate in vision-based autonomous driving.
reinforcement learningaction chunkingtemporal coherencecontinuous controldual-window smoothing
SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects
SceneCode introduces executable world programs for editable indoor scenes with articulated objects, addressing limitations of static mesh-based pipelines. The framework compiles natural language prompts into code-driven indoor worlds via a room-level agentic backbone and five code-generation strategies, validated through execution-guided refinement. Evaluations demonstrate improved prompt faithfulness, cleaner mesh structure, and simulator-loadable articulation metadata compared to existing methods.
scene synthesisarticulated objectsprogrammatic generationexecutable programsphysics simulation
Lens Privacy Sealing: A New Benchmark and Method for Physical Privacy-Preserving Action Recognition
The paper introduces Lens Privacy Sealing (LPS), a hardware solution for physical privacy-preserving action recognition using adjustable laminating film to obscure camera lenses via stochastic multi-layer scattering. It presents the P³AR dataset (114K videos) with privacy annotations and proposes MSPNet, a single-stage framework with Inter-Frame Noise Suppressor (IFNS) and Cross-Frame Semantic Aggregator (CFSA) for degraded video processing. Experiments show MSPNet nearly doubles action recognition accuracy while suppressing identity recognition, achieving superior privacy-utility trade-offs and resistance to reconstruction attacks.
privacy-preservingaction recognitionhardware solutionstochastic scatteringcontrastive learning
Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries
The paper identifies and addresses library drift, a silent failure mode in self-evolving LLM skill libraries where unbounded skill accumulation degrades performance. Through reproducible triggers (skill injection ablation, premature retirement) and trace-level diagnostics (evidence logs, contribution scores), the authors isolate the drift mechanism. They propose a governance solution (outcome-driven retirement, bounded active-cap, meta-skill authoring) that improves pass@1 from 0.258 to 0.584 on MBPP+ hard-100 over 100 rounds. Eight ablations validate the load-bearing components of the fix.
library driftskill accumulationoutcome-driven retirementmeta-skill authoringtrace-level diagnostics
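The governance idea in the Library Drift summary (retire skills based on observed outcomes and cap how many are active at once) can be sketched roughly as below. Everything here is a hypothetical illustration: the `(uses, successes)` bookkeeping, the thresholds, and the ranking by raw success rate are assumptions for the example, and the paper's actual retirement rule and meta-skill authoring step are not described in the summary.

```python
def govern_library(skills, cap=3, min_uses=5, min_rate=0.3):
    """Pick the active skill set under a hard cap and retire poor performers.

    skills: dict mapping skill name -> (uses, successes).
    A skill is retired once it has at least `min_uses` uses and its success
    rate is below `min_rate`. Of the survivors, at most `cap` stay active,
    ranked by observed success rate (unused skills rank as 0.0).
    Returns (active, retired) as two lists of names.
    """
    retired = [name for name, (uses, wins) in skills.items()
               if uses >= min_uses and wins / uses < min_rate]
    survivors = [name for name in skills if name not in retired]

    def rate(name):
        uses, wins = skills[name]
        return wins / uses if uses else 0.0

    ranked = sorted(survivors, key=rate, reverse=True)
    return ranked[:cap], retired

library = {"parse_json": (10, 9), "regex_hack": (10, 1), "new_skill": (2, 0),
           "retry_loop": (8, 4), "sort_helper": (6, 5)}
print(govern_library(library, cap=2))  # (['parse_json', 'sort_helper'], ['regex_hack'])
```

The `min_uses` guard is one simple way to avoid the "premature retirement" trigger the summary mentions: a skill with too little evidence is kept in the pool rather than dropped.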
TORQ: Two-Level Orthogonal Rotation for MXFP4 Quantization
TORQ introduces a training-free Post-Training Quantization framework for MXFP4 activation quantization in LLMs, addressing structural imbalances in activation distributions. The method employs two-level orthogonal rotation: macroscopic inter-block rotation redistributes activation energy using the Schur-Horn theorem, while microscopic intra-block rotation maximizes codebook utilization via maximum-entropy guidance. Evaluated on LLaMA3 and Qwen3, TORQ reduces perplexity on WikiText to 8.43 (vs. BF16's 7.61) and increases average accuracy from 38.40% (RTN) to 73.63% (vs. BF16's 74.82%), significantly closing the gap between 4-bit and full-precision inference.
mxfp4post-training quantizationschur-horn theoremcodebook utilizationactivation quantization
EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs
The authors introduce EgoCoT-Bench, a benchmark for evaluating grounded and verifiable operation-centric reasoning in Multimodal Large Language Models (MLLMs) using egocentric videos. The benchmark comprises 3,172 QA pairs across 351 videos, organized into four task groups (12 sub-tasks) for perception, retrospection, anticipation, and high-level reasoning. Constructed via spatio-temporal scene graphs and human refinement, it reveals MLLMs' difficulties with fine-grained reasoning and inconsistent evidence in explanations despite correct answers.
multimodal large language modelsegocentric video understandingspatio-temporal scene graphsoperation-centric reasoningverifiable qa pairs
CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision
The authors introduce CaptchaBench, the first large-scale CAPTCHA benchmark with 16,000 programmatically generated samples across eight task categories, featuring detailed region and process-level annotations. They propose CaptchaMind, a reinforcement learning-based solver trained with explicit reasoning process supervision, addressing limitations of existing methods in fine-grained visual detail capture and region-level comparison. The system achieves 82.9% average success rate on CaptchaBench and 71.0% on real-world instances, significantly outperforming prior approaches without closed-source API dependencies.
captchareinforcement learningvisual reasoningprocess supervisionbenchmark
Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment
The paper introduces Generative-Evaluative Agreement (GEA) as a validity criterion for LLM-enabled adaptive assessments, addressing self-referential validation when LLMs generate items, simulate responses, and score them. GEA measures whether an LLM's scoring function recovers the skill levels its generative function was instructed to produce. Experiments on a two-stage adaptive assessment show GEA recovers roughly half the intended variance (r = 0.698) with systematic positive bias, varying strongly by skill type (r > 0.7 for syntactic skills, near zero for design-level skills) and revealing low-skill overestimation near routing thresholds. The authors propose granular, skill-decomposed rubrics as the primary mitigation strategy.
generative-evaluative agreementadaptive assessmentskill decompositionrouting thresholdvalidity criterion
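The "roughly half the intended variance" reading in the GEA summary follows from squaring the reported correlation, assuming the usual interpretation of the squared Pearson correlation as the share of variance explained:

```latex
r = 0.698
\quad\Longrightarrow\quad
r^2 = 0.698^2 \approx 0.487
```

That is, about 49% of the variance in the intended skill levels is recovered by the scoring function.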
Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters
This study systematically investigates cross-modal skill injection for Vision-Language Models (VLMs) through model merging, focusing on scenarios, methods, and hyperparameters. The research demonstrates that domain-expert LLMs can enhance VLMs in instruction-following and cross-lingual tasks but struggle with mathematical reasoning. Classic merging methods like TA and DARE outperform alternatives, with hyperparameter tuning being critical for performance. The analysis provides quantitative insights into these methodologies, offering a framework for efficient skill transfer without extensive data or computational overhead.
vision-language modelsmodel mergingcross-modal skill injectionhyperparameter tuningdomain-expert llms
Efficient Elicitation of Collective Disagreements
The authors introduce a stratified framework to efficiently elicit collective disagreements among voters over alternatives, addressing limitations of pairwise comparisons and full rankings. They propose the plurality matrix, which generalizes pairwise comparisons by recording the probability of each alternative ranking first in any subset. The framework defines the level of a disagreement measure as the smallest subset size needed to express it, proving that many existing notions, including rank-variance and divisiveness, require level 3. Theoretical and experimental results demonstrate the utility of higher levels. Two elicitation protocols are designed to estimate the plurality matrix, balancing participant count and cognitive load.
plurality matrixdisagreement measureselicitation protocolspairwise comparisonsrank-variance
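The plurality matrix described in the summary above records, for each subset of alternatives, how often each alternative is ranked first within that subset. The sketch below computes it from complete rankings as empirical frequencies. The function name, the dictionary layout, and the `max_size` cutoff (standing in for the "level") are choices made for this illustration and are not taken from the paper, which estimates these quantities through elicitation protocols rather than from full rankings.

```python
from itertools import combinations

def plurality_matrix(rankings, alternatives, max_size=None):
    """Empirical plurality matrix from full rankings (best alternative first).

    Returns a dict mapping each subset (a tuple, sizes 2..max_size) to a dict
    {alternative: fraction of voters who rank it first within that subset}.
    """
    max_size = max_size or len(alternatives)
    table = {}
    for size in range(2, max_size + 1):
        for subset in combinations(alternatives, size):
            counts = dict.fromkeys(subset, 0)
            for ranking in rankings:
                # The voter's favorite in the subset is the member that
                # appears earliest in their ranking.
                top = min(subset, key=ranking.index)
                counts[top] += 1
            table[subset] = {alt: n / len(rankings) for alt, n in counts.items()}
    return table

votes = [("a", "b", "c"), ("b", "c", "a"), ("c", "a", "b"), ("a", "c", "b")]
matrix = plurality_matrix(votes, ("a", "b", "c"))
print(matrix[("a", "b")])       # {'a': 0.75, 'b': 0.25}
print(matrix[("a", "b", "c")])  # {'a': 0.5, 'b': 0.25, 'c': 0.25}
```

Restricting `max_size` to 2 recovers ordinary pairwise comparison data, which is the sense in which the plurality matrix generalizes pairwise comparisons.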
BLINKG: A Benchmark for LLM-Integrated Knowledge Graph Generation
The paper introduces BLINKG, a benchmark for evaluating Large Language Models (LLMs) in Knowledge Graph (KG) generation from heterogeneous data sources. The benchmark comprises scenarios of increasing complexity based on real-world use cases, assessing LLMs' ability to map data schemas to ontology concepts. Experimental evaluation shows LLMs offer promising but limited performance in complex scenarios, highlighting requirements for semi-automated KG construction and opening new research directions.
knowledge graph generationlarge language modelsontology alignmentbenchmark evaluationheterogeneous data
Base Models Look Human To AI Detectors
The paper demonstrates that text generated by base models is frequently classified as human-written by commercial AI-text detectors (GPTZero, Pangram), while text from instruction-tuned models is not. To exploit this finding, the authors introduce Humanization by Iterative Paraphrasing (HIP), a detector-agnostic pipeline that fine-tunes a base model for paraphrasing and applies it iteratively. HIP achieves superior trade-offs between semantic preservation and detector evasion compared to baselines, consistently improving human-likeness scores across Llama-3 and Qwen-3 model families (0.6B to 70B). Results suggest current detectors track instruction-tuning artifacts rather than fundamental characteristics of machine-generated text, necessitating detector designs that explicitly account for this distinction.
instruction tuningparaphrasingdetector evasionsemantic preservationhuman-likeness
Position: The Turing-Completeness of Real-World Autoregressive Transformers Relies Heavily on Context Management
The paper clarifies distinctions between fixed-system and scaling-family settings for Transformer Turing-completeness claims, arguing that real-world LLMs operate in the former. It formalizes the fixed-system setting, demonstrating that context-management methods critically influence computational power. Results show existing proofs in scaling-family settings provide resource bounds but not Turing-completeness, addressing common misinterpretations.
transformersturing-completenesscontext-managementautoregressivellms
ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders
The paper introduces ARC-RL, a MuJoCo-based reinforcement learning benchmark suite featuring four stylized robotic morphologies (18-DoF Queen, 12-DoF Bastion, 18-DoF Tick, 12-DoF Leaper) inspired by ARC Raiders. The environments employ a unified reward function combining velocity tracking, gait compliance, and safety penalties without motion-capture data, alongside hand-crafted Central Pattern Generator demonstrators. Experiments compare online algorithms (SAC, SPEQ, SOPE-EO) and prior-data-augmented methods (SACfD, SPEQ-O2O, SOPE), analyzing their performance across morphological diversity and stylistic constraints.
reinforcement learningmujococentral pattern generatormorphological diversitycontinuous control
CANINE: Coaching Visually Impaired Users for Interactive Navigation with a Robot Guide Dog
CANINE introduces an automated coaching system for training visually impaired users in interactive navigation with robot guide dogs, addressing the challenge of subtle human-robot coordination. The system decomposes navigation into sub-skills, employing knowledge tracing to prioritize training on weak areas and using foundation models to infer error causes and generate adaptive verbal feedback. A controlled study with blindfolded participants demonstrates CANINE's superiority over generic instructions in learning efficiency and navigation performance. A retention study and case studies indicate lasting skill improvement and real-world applicability, consistent with the controlled-study findings.
robot guide dogknowledge tracingfoundation modelsadaptive feedbackinteractive navigation
Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models
The paper proposes a reinforcement learning-based jailbreak method for Large Reasoning Models (LRMs) that incorporates attention patterns into the reward function, alongside diverse persuasion strategies to expand the action space. Analysis reveals that successful jailbreaks correlate with lower attention to harmful tokens in the input and higher attention to those tokens in the reasoning content. Experiments on five LRMs across three benchmarks show the method achieves significantly higher attack success rates than existing approaches, demonstrating improved effectiveness, efficiency, and transferability.
large reasoning models · jailbreak attacks · reinforcement learning · attention patterns · attack success rate
CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing
The authors introduce CutVerse, a benchmark for evaluating GUI agents in professional media post-production workflows, addressing a gap in current GUI agent capabilities. The benchmark comprises 186 complex tasks across 7 applications (e.g., Premiere Pro, Photoshop), with expert demonstrations and a lightweight parser for structured action trajectory extraction. Evaluations show existing agents achieve only 36.0% task success, highlighting challenges in long-horizon reliability and domain-specific planning despite strengths in spatial grounding and multimodal alignment.
gui agents · media post-production · long-horizon tasks · multimodal interfaces · action trajectories
Sampling-Based Safe Reinforcement Learning
The paper introduces Sampling-Based Safe Reinforcement Learning (SBSRL), a model-based RL algorithm ensuring safety during learning by enforcing constraints across sampled dynamics. The method approximates worst-case optimization over uncertain dynamics and incorporates an exploration strategy based on epistemic uncertainty constraints, eliminating explicit exploration bonuses. Theoretical analysis provides high-probability safety guarantees and finite-time sample complexity bounds. Empirical results demonstrate safe and efficient exploration in simulation and real robotics, with extensions to deep-ensemble implementations for high-dimensional continuous control.
safe reinforcement learning · model-based rl · epistemic uncertainty · continuous control · sample complexity
Quantifying the Pre-training Dividend: Generative versus Latent Self-Supervised Learning for Time Series Foundation Models
The study quantifies the pre-training dividend of self-supervised learning (SSL) for time series foundation models, comparing Generative paradigms against Latent Alignment architectures. Adaptations of LeJEPA and DINO for time series are introduced, utilizing Discrete Wavelet Transform (DWT) augmentations to enforce invariance to local fluctuations. Results reveal asymmetric gains: SSL yields up to 375% improvement for anomaly detection and classification, but marginal benefits for forecasting. Representational utility is governed by a precision-invariance trade-off, aligning task-specific signal resolution with the objective. Representation quality saturates at moderate architectural depths and is independent of data origin, suggesting scaling via synthetic generation.
self-supervised learning · latent alignment · discrete wavelet transform · anomaly detection · synthetic generation
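To make the idea of a DWT augmentation that enforces invariance to local fluctuations concrete, here is a minimal sketch. It assumes a single-level Haar wavelet and an augmentation that scales down the detail coefficients; both choices, and the names `haar_dwt`, `haar_idwt`, and `smooth_view`, are illustrative assumptions rather than the paper's actual pipeline.

```python
import math

def haar_dwt(x):
    """Single-level Haar transform of an even-length sequence."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuild the sequence from both coefficient sets."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def smooth_view(x, keep=0.0):
    """Hypothetical augmentation: shrink detail coefficients by `keep`.

    keep=1.0 returns the original series; keep=0.0 removes all
    local (pairwise) fluctuation and keeps only the coarse trend.
    """
    approx, detail = haar_dwt(x)
    return haar_idwt(approx, [keep * d for d in detail])

series = [1.0, 3.0, 2.0, 2.0]
print(smooth_view(series, keep=1.0))  # reconstruction of the input
print(smooth_view(series, keep=0.0))  # each pair replaced by its mean
```

Training an encoder to map the original series and such a smoothed view to nearby representations is one way to encourage invariance to fine-scale noise.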
Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
The paper introduces DMPO (Distribution-Matching Policy Optimization), a method to prevent mode collapse in on-policy reinforcement learning by approximating forward KL minimization. DMPO constructs a target distribution over trajectories proportional to rewards and aligns the policy distribution to it, enabling sustained exploration. Evaluated on NP-hard combinatorial optimization tasks, DMPO achieves 43.9% Quality Ratio on text-based NP-Bench (9% improvement over GRPO) and 43.1% on vision-based NP-Bench (12% improvement), with gains extending to mathematical reasoning (+2.0%) and out-of-domain tasks (+2.3%).
mode collapse · distribution matching · forward kl · policy optimization · np-hard
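The core quantities in the DMPO summary, a reward-proportional target over trajectories and a forward KL from that target to the policy, can be sketched on a toy example. Normalizing over a small finite set of sampled trajectories, and the function names used here, are assumptions for illustration only; the paper's estimator may differ.

```python
import math

def reward_proportional_target(rewards):
    """Target distribution q_i proportional to reward r_i.

    Assumes non-negative rewards with a positive sum.
    """
    total = sum(rewards)
    return [r / total for r in rewards]

def forward_kl(target, policy):
    """KL(target || policy): penalizes a policy that puts little mass
    on trajectories the target considers worthwhile (mass-covering)."""
    return sum(q * math.log(q / p) for q, p in zip(target, policy) if q > 0)

rewards = [1.0, 1.0, 2.0]                  # three sampled trajectories
target = reward_proportional_target(rewards)

collapsed = [0.01, 0.01, 0.98]             # policy concentrated on one mode
spread = [0.30, 0.30, 0.40]                # policy that still covers all modes

print(forward_kl(target, collapsed))
print(forward_kl(target, spread))
```

The collapsed policy incurs a much larger forward KL than the spread one, which is the pressure toward sustained exploration that the summary describes.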
Generative Auto-Bidding with Unified Modeling and Exploration
The paper proposes GUIDE, a generative auto-bidding framework that unifies exploration and safety in digital advertising. GUIDE integrates a Decision Transformer (DT) for joint modeling of historical bidding actions and environmental states, a Q-value module for exploration regularization, and an Inverse Dynamics Module (IDM) for safe policy fallback. The framework adaptively selects actions between exploration and fallback via an 'explore-safeguard-select' pipeline. Evaluations on public datasets, simulated auctions, and large-scale deployment on Taobao demonstrate GUIDE's superiority, achieving +4.10% GMV, +1.40% clicks, +1.66% cost, and +3.52% ROI compared to state-of-the-art baselines.
decision transformer · inverse dynamics module · auto-bidding · exploration regularization · q-value module
Resilient Byzantine Agreement with Predictions
The paper characterizes algorithmic resilience in Byzantine Agreement problems with predictors flagging faulty nodes, presenting tight consistency-robustness trade-offs for both non-authenticated and authenticated settings. For $n$ nodes and parameter $α$, algorithms tolerate $α \cdot n$ faults when predictors are correct and $\frac{1-α}{2} \cdot n - 1$ faults when predictors are wrong, improving to $(1-α) \cdot n - 1$ in authenticated settings. Resilience linearly decreases with predictor inaccuracy, losing one unit per wrong prediction in non-authenticated settings and halving this decline in authenticated settings. Tight impossibility results show these bounds are exact.
byzantine agreement · algorithmic resilience · consistency-robustness trade-offs · authenticated settings · predictor accuracy
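The trade-off stated above is easy to tabulate. The sketch below simply evaluates the three expressions from the summary with exact rational arithmetic; the function name `tolerated_faults` is illustrative, and no rounding to whole nodes is applied because the summary does not specify one.

```python
from fractions import Fraction

def tolerated_faults(n, alpha, authenticated=False):
    """Evaluate the resilience bounds quoted in the summary.

    Returns (faults tolerated when predictions are correct,
             faults tolerated when predictions are wrong).
    """
    alpha = Fraction(alpha)
    correct = alpha * n
    if authenticated:
        wrong = (1 - alpha) * n - 1
    else:
        wrong = (1 - alpha) / 2 * n - 1
    return correct, wrong

n = 100
for alpha in (Fraction(0), Fraction(1, 5), Fraction(1, 3)):
    plain = tolerated_faults(n, alpha)
    auth = tolerated_faults(n, alpha, authenticated=True)
    print(f"alpha={alpha}: non-authenticated={plain}, authenticated={auth}")
```

Raising α buys more tolerance when the predictor is right at the cost of tolerance when it is wrong, which is the consistency-robustness trade-off in numeric form.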
What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
The paper introduces SERL, a selective environment-reweighted learning framework for multi-turn LLM agents that optimizes credit assignment by leveraging diverse environmental feedback. SERL combines task rewards for update direction with environment feedback (error messages, observations, etc.) to adjust update placement and magnitude, focusing on critical actions. Evaluated on ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success rates, outperforming RL and distillation baselines. Analysis demonstrates that selective, action-relevant feedback at meaningful points yields better performance than indiscriminate use of longer context.
reinforcement learning · credit assignment · multi-turn agents · environment feedback · selective distillation
Targeted Downstream-Agnostic Attack
The paper introduces Targeted Downstream-Agnostic Attack (TDAA), a method for generating adversarial examples that force pre-trained encoders to produce identical features for both adversarial inputs and a pre-selected 'threat image'. Unlike prior downstream-agnostic attacks (DAAs) that use shared perturbations, TDAA employs example-specific perturbations via a generator, ensuring high attack success and stealth. The approach is evaluated on 10 self-supervised methods across 3 benchmarks, revealing significant vulnerabilities in pre-trained encoders.
targeted attack · downstream-agnostic · adversarial examples · pre-trained encoders · feature-level anchor
When the Majority Votes Wrong, the Intervention Timing for Test-Time Reinforcement Learning Hides in the Extinction Window
The paper introduces TTRL-Guard, a framework addressing misinterpreted accuracy gains in test-time reinforcement learning (TTRL) for mathematical reasoning. TTRL-Guard targets the Correct-Answer Extinction Window, a phenomenon where correct-answer signals in low-ability problems are briefly active before suppression. The framework employs Flip-Rate-Aware Reward Scaling (FRS) to down-weight at-risk updates, Minority-Preserving Sampling (MPS) to retain gradient signals from minority correct answers, and Risk-Conditioned Sparse Updatings (RCSU) to suspend updates on polarized problems. Experiments across three models and four benchmarks demonstrate that TTRL-Guard achieves the best average pass@1 on Qwen2.5-7B-Instruct and Qwen3-4B, with a +54% relative improvement over TTRL on AIME 2025.
test-time reinforcement learning · extinction window · flip-rate-aware reward scaling · minority-preserving sampling · risk-conditioned sparse updatings
KappaPlace: Learning Hyperspherical Uncertainty for Visual Place Recognition via Prototype-Anchored Supervision
KappaPlace introduces a novel framework for uncertainty-aware Visual Place Recognition (VPR) via Prototype-Anchored supervision, addressing the lack of well-calibrated uncertainty estimation in existing methods. The approach models image descriptors as von Mises-Fisher variables and predicts concentration parameters to quantify aleatoric uncertainty, extending beyond query-centric views to match-level reliability assessment. Evaluated on five benchmarks, KappaPlace reduces Expected Calibration Error (ECE@K) by up to 50% while maintaining or improving retrieval recall. The framework supports both joint training and post-training extensions for frozen backbones, offering robust uncertainty signals for safety-critical robotics applications.
visual place recognition · prototype-anchored supervision · von mises-fisher · aleatoric uncertainty · expected calibration error
Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation
The paper proposes MOTAB, a novel LLM reasoning distillation pipeline addressing dual exposure biases in knowledge transfer from teacher to student models. The method dynamically monitors student-generated trajectories during on-policy distillation, backtracking to safe states when deviations exceed an adaptive threshold and invoking teacher correction. Experiments on LIMO-v2 and AceReason datasets show MOTAB achieves ~3% average improvement in reasoning performance by mitigating both standard exposure bias (from training-inference distribution mismatch) and reversed exposure bias (from teacher guidance on sub-optimal student contexts).
llm reasoning distillation · exposure bias · chain-of-thought · on-policy learning · knowledge transfer
When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
The paper introduces Dynamic Gradient Gating (DGG), a method to improve sample efficiency in Reinforcement Learning with Verifiable Rewards (RLVR) by dynamically controlling gradient reuse. The authors identify Disproportionate Weight Divergence (DWD), where performance degradation correlates with sharp gradient surges in the lm_head layer, and prove this layer's gradient norm bounds policy divergence. DGG monitors lm_head gradients in real-time, intercepting harmful updates. Evaluations across math, ALFWorld, WebShop, and QA tasks show DGG achieves up to 2.93× sample efficiency and 2.14× speedup while matching single-use baseline performance.
reinforcement learning · sample efficiency · gradient gating · policy divergence · lm_head
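The gating idea in the DGG summary (watch the lm_head gradient norm and block an update when it surges) can be illustrated with a small framework-free sketch. The surge rule used here, comparing each norm against a running average times a fixed `ratio`, is a hypothetical stand-in; the summary does not state the paper's actual criterion, and `GradientGate` is an invented name.

```python
class GradientGate:
    """Hypothetical gate: reject an update when the monitored gradient
    norm exceeds `ratio` times an exponential moving average."""

    def __init__(self, ratio=3.0, momentum=0.9):
        self.ratio = ratio
        self.momentum = momentum
        self.baseline = None  # running average of accepted norms

    def allow(self, grad_norm):
        if self.baseline is None:
            self.baseline = grad_norm
            return True
        if grad_norm > self.ratio * self.baseline:
            # Surge detected: skip the update and leave the baseline
            # untouched so one spike does not raise the bar.
            return False
        self.baseline = (self.momentum * self.baseline
                         + (1 - self.momentum) * grad_norm)
        return True

gate = GradientGate()
norms = [1.0, 1.1, 0.9, 5.0, 1.0]   # illustrative lm_head gradient norms
decisions = [gate.allow(g) for g in norms]
print(decisions)
```

In a training loop, the norm would be read from the output-projection layer after the backward pass and the optimizer step skipped whenever `allow` returns False.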
Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling
SIGMA introduces a signed graph-informed multi-agent reasoning framework that explicitly models trust, conflict, and neutral relations among agents to address error propagation and unreliable interaction patterns in LLM-based multi-agent systems. The method constructs a structured signed interaction graph with confidence-weighted edges, employs conflict-aware signed message passing to reinforce trustworthy signals while suppressing conflicts, and performs structure- and conflict-aware weighted aggregation for globally consistent predictions. Experiments on six benchmarks across multiple LLM backbones show SIGMA outperforms state-of-the-art baselines, achieving significant improvements in accuracy and conflict-resilient performance.
multi-agent systems · signed graph · conflict-aware · message passing · weighted aggregation
Unlocking the Potential of Continual Model Merging: An ODE Perspective
The paper proposes ODE-driven Merging (ODE-M), a novel method for Continual Model Merging (CMM) that addresses catastrophic forgetting by tracing low-loss paths in parameter space. Drawing on mode connectivity theory, ODE-M integrates a time-dependent velocity field and enforces barrier constraints to prevent loss-increasing steps during sequential model merging. Experiments show ODE-M achieves state-of-the-art performance on mainstream CMM benchmarks, outperforming existing merging rules that lack explicit control over learning capacity allocation.
continual model merging · mode connectivity · ode-driven merging · catastrophic forgetting · parameter space
A Bitter Lesson for Data Filtering
The study challenges conventional wisdom on data filtering for pretraining large models, demonstrating that unfiltered data yields superior results in high-compute regimes. Through scaling experiments targeting data-scarce scenarios, the authors show that sufficiently large models not only tolerate low-quality data but benefit from it. Results indicate that optimal performance is achieved without any data filtering when training compute and model size are sufficiently scaled.
data filtering · pretraining · scaling laws · large language models · compute-optimal training
Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models
The paper introduces DyMoS (Dynamic Motion Slider), a training-free method to address reference-frame dominance in image-to-video (I2V) models, which often produce overly static outputs. By rebalancing self-attention pathways to reduce excessive focus on reference-frame key tokens during initial denoising steps, DyMoS enhances inter-frame dynamics without modifying model weights or input images. Experiments on multiple state-of-the-art I2V backbones show improved motion dynamics while preserving visual quality and reference fidelity, controlled via a single scalar parameter.
image-to-video · reference-frame dominance · self-attention · denoising · motion dynamics
EmbGen: Teaching with Reassembled Corpora
EmbGen introduces a synthetic data generation pipeline for domain adaptation of instruction-tuned models, addressing limitations of homogenized outputs and cross-document dependencies in existing methods. The approach decomposes corpora into entity-description pairs, reassembles them via embedding-based semantic structure, and generates QA pairs through proximity, intra-cluster, and inter-cluster sampling with cluster-specialized prompts. Evaluated against EntiGraph, InstructLab, and Knowledge-Instruct on three datasets under 5M and 20M token budgets, EmbGen improves Binary Accuracy by 12.5% and 88.9% respectively on the most heterogeneous dataset while remaining competitive on others.
synthetic data generation · embedding similarity · instruction tuning · domain adaptation · binary accuracy
PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
The paper introduces PRISM, a benchmark for programmatic spatial-temporal reasoning with 10,372 human-calibrated instruction-code pairs, about 20x the scale of prior benchmarks. It spans 437 subject categories across English and Chinese, focusing on knowledge visualization. The authors propose a funnel-style evaluation framework with four metrics: Code-Level Reliability, Spatial Reasoning, Prompt-Aware Dynamic Visual Complexity (PADVC), and Temporal Density (TD). Evaluation of seven LLMs reveals an Execution-Spatial Gap, with a 41% average drop from execution success to spatial correctness, demonstrating that runnable code often fails to produce spatially coherent animations.
programmatic video generation · spatial-temporal reasoning · execution-spatial gap · knowledge visualization · dynamic visual complexity
The Evaluation Game: Beyond Static LLM Benchmarking
The paper introduces a game-theoretic framework to model the interaction between evaluators and trainers in defending against jailbreaks in Large Language Models (LLMs). Using group actions to formalize data augmentation, the study analyzes generalization regimes, showing evaluators maintain constant miss ratios below critical thresholds. Empirical evidence from Llama, Qwen, and Mistral models indicates fine-tuning on adversarial prompts yields local generalization, with refusal rates correlated to prompt distance. The framework redefines benchmarks as dynamic orbits under group actions, challenging static evaluation protocols.
jailbreaks · group actions · generalization regimes · adversarial prompts · refusal rates
Generative Recursive Reasoning
The paper introduces Generative Recursive reAsoning Models (GRAM), a framework that extends recursive reasoning to probabilistic multi-trajectory computation. GRAM models reasoning as stochastic latent trajectories, enabling multiple hypotheses and solution strategies through recursive depth and parallel sampling. It functions as a latent-variable generative model supporting both conditional ($p_θ(y \mid x)$) and unconditional ($p_θ(x)$) generation. Trained with amortized variational inference, GRAM outperforms deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks while demonstrating unconditional generation capabilities.
recursive reasoning · latent-variable models · variational inference · multi-trajectory computation · constraint satisfaction
Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings
The paper introduces CoNNS, a concept-guided noisy-negative suppression framework for zero-shot classification and grounding of chest X-ray findings. The method constructs a hierarchical concept ontology using large language models, structuring 41 clinical concepts by presence, attributes, and texts. It implements cross-patient pair relabeling via fine-grained breakdown, noisy negative filtering, and hard negative mining, followed by a Concept-Aware NCE loss for visual-text alignment. Experiments on multi-granularity zero-shot grounding tasks and five classification datasets show CoNNS outperforms state-of-the-art models.
zero-shot classification · noisy negatives · concept ontology · contrastive learning · chest x-ray
Multi-Scale Generative Modeling with Heat Dissipation Flow Matching
The paper introduces Heat Dissipation Flow Matching (HDFM), a novel generative model that integrates continuous blur-based corruption into Flow Matching (FM) to inject multi-scale priors. HDFM addresses ill-posedness in the inverse heat-dissipation process by aligning an interpolated heat-dissipation path and mitigates high-dimensional regression difficulty via $x$-prediction. Experiments demonstrate HDFM's superiority over baseline methods across datasets, with ablations confirming the benefits of blur and $x$-prediction.
heat dissipation flow matching · multi-scale priors · flow matching · inverse heat-dissipation · $x$-prediction
Toward User Comprehension Supports for LLM Agent Skill Specifications
The study evaluates LLM agent skill specifications as user comprehension aids, proposing four comprehension anchors: operational basis, output contract, boundary disclosure, and example capability demonstration. Analyzing 878 cybersecurity skills via rule-based coding, it finds only 19.0% include example demonstrations and 2.3% cover all anchors. A DNS/C2 telemetry subset (n=6) reveals missing examples complicate local checks, requiring helper code inspection. The work advocates treating specifications as capability disclosures rather than mere executable containers.
llm agent · skill specifications · comprehension anchors · rule-based coding · cybersecurity skills
Skinned Motion Retargeting with Spatially Adaptive Interaction Guidance
The paper presents a geometry-aware motion retargeting framework that preserves interaction semantics across characters with varying body proportions. The method dynamically repositions anchors via a Transformer-based refinement strategy, using differentiable soft projection to constrain anchors to target geometry, and employs a graph-based autoencoder for skeletal motion prediction. An alternating training scheme optimizes anchor adaptation and motion retargeting jointly. Evaluations show superior interaction fidelity preservation compared to state-of-the-art approaches.
motion retargeting · interaction semantics · transformer-based refinement · differentiable soft projection · graph-based autoencoder
Brain alignment of reasoning and action representations from vision-language and action models during naturalistic gameplay
The study investigates brain alignment between foundation models (vision-language models and large-action models) and fMRI recordings during naturalistic gameplay, revealing three key findings. First, both model families outperform RL baselines in voxel-wise encoding performance regardless of feature dimensionality. Second, prompt-driven improvements scale with cortical hierarchy, showing maximal gains in frontal-parietal regions (2× early visual cortex). Third, representational organization differs qualitatively: VLMs show prompt symmetry (12.5-13.6% unique variance) while LAMs exhibit action-prompt asymmetry (27% vs -5%), particularly in frontal-motor cortex. The results demonstrate action-specialized fine-tuning reorganizes multimodal representations toward action-relevant neural computations.
brain alignment · vision-language models · large-action models · fmri encoding · cortical hierarchy
PAVE: A Cognitive Architecture for Legitimate Violation in Generative Agent Societies
The paper introduces PAVE, a cognitive architecture enabling generative agents to reason about legitimate rule violations in cooperative settings. PAVE's four modules (Perception, Assessment, Verdict, Emulation) process contextual cues, score legitimacy, decide on violations, and scope actions. Evaluated in the Voville environment with four LLM backbones, PAVE agents demonstrate legitimate violation, authority deference, bounded scope, and recovery properties, outperforming vanilla agents in structured decision-making and plausibility ratings. Ablation studies confirm the legitimacy gate's necessity for these properties.
cognitive architecture · legitimate violation · generative agents · llm backbones · voville environment
IMLJD: A Computational Dataset for Indian Matrimonial Litigation Analysis
The paper introduces IMLJD, a novel computational dataset of 3,613 Indian matrimonial litigation judgments from the Supreme Court (2000-2024) and Karnataka High Court (2018-2024), annotated with structured outcome labels, metadata indicators, and a knowledge graph. The dataset covers cases under IPC Section 498A, the Protection of Women from Domestic Violence Act, and CrPC Section 482. Analysis reveals a 57.6% quashing petition success rate at the Supreme Court versus 39.7% at the Karnataka High Court, with a 19.6 percentage point differential persisting in matched temporal analysis (2018-2024).
computational jurisprudence · legal knowledge graph · structured outcome labeling · matrimonial litigation · court judgment analysis
HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models
The paper introduces HalluWorld, a controlled benchmark for studying hallucinations in language models through reference-world formulations. It constructs synthetic and semi-synthetic environments (gridworlds, chess, terminal tasks) to automatically label hallucinations while varying complexity, observability, and temporal dynamics. Evaluation reveals frontier models perform well on perceptual hallucination tasks but struggle with multi-step state tracking, causal simulation, and abstention decisions. Results indicate hallucinations stem from multiple distinct failure modes rather than a unified capability gap.
hallucination · benchmark · reference-world · state tracking · abstention
STAR-PólyaMath: Multi-Agent Reasoning under Persistent Meta-Strategic Supervision
STAR-PólyaMath introduces a multi-agent framework for mathematical reasoning that addresses reliability issues like hallucination accumulation and memory fragmentation through meta-level supervision and structured Reasoner-Verifier interaction. The system employs a Python orchestrator to separate control from inference, with a persistent Meta-Strategist providing high-level guidance. It achieves state-of-the-art results on eight benchmarks, including perfect scores on AIME 2025-2026, Putnam 2025, and HMMT February 2026, outperforming GPT-5.5 by 13.54% on MathArena Apex 2025. Ablation studies confirm the framework's orchestration drives performance gains.
multi-agent · meta-strategist · reasoner-verifier · orchestration · ablation
Agentic Trading: When LLM Agents Meet Financial Markets
The paper investigates LLM-based trading agents through a systematic review of 77 studies, focusing on their decision pipelines and market adaptability. It identifies protocol incomparability as a key issue, with only 2 out of 19 primary studies reporting time-consistent split protocols and none achieving R3 reproducibility. The study proposes an Architecture-Capability-Adaptation framework and emphasizes the need for standardized evaluation protocols and reproducible artifacts. Findings highlight rapid architectural experimentation but persistent bottlenecks in execution semantics and reproducibility.
llm-based trading agents · protocol incomparability · architecture-capability-adaptation · reproducibility audit · execution semantics
MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization
The paper introduces MOCHA, a Multi-Objective Chebyshev Annealing method for optimizing LLM agent skills under platform constraints. Unlike existing approaches that use weighted sums, MOCHA employs Chebyshev scalarization to cover the full Pareto front, including non-convex regions, combined with exponential annealing for exploration-exploitation balance. Evaluated across six agent skills, MOCHA achieves a 7.5% mean correctness improvement over baselines (up to 14.9% on FEVER and 10.4% on TheoremQA), discovering twice as many Pareto-optimal variants while baseline methods fail on 4/6 tasks.
multi-objective optimization · chebyshev scalarization · pareto front · llm agents · skill optimization
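The claim that Chebyshev scalarization reaches non-convex parts of the Pareto front where weighted sums cannot is a standard property that a three-point example makes visible. The sketch below uses a minimization convention and made-up candidate points; it illustrates the scalarization only and is not MOCHA's annealing procedure.

```python
def weighted_sum(point, weights):
    """Linear scalarization: sum_i w_i * f_i."""
    return sum(w * f for w, f in zip(weights, point))

def chebyshev(point, weights, ideal):
    """Chebyshev scalarization: max_i w_i * |f_i - z_i|, z the ideal point."""
    return max(w * abs(f - z) for w, f, z in zip(weights, point, ideal))

# Two objectives, both minimized. C is Pareto-optimal (neither A nor B
# dominates it) but sits in a non-convex dent of the front.
candidates = {"A": (0.0, 1.0), "B": (1.0, 0.0), "C": (0.6, 0.6)}
ideal = (0.0, 0.0)

def best(score):
    """Name of the candidate with the lowest scalarized score."""
    return min(candidates, key=lambda name: score(candidates[name]))

# Sweep weight vectors (w, 1 - w) over a coarse grid.
weight_grid = [(i / 10, 1 - i / 10) for i in range(11)]
ws_winners = {best(lambda p, w=w: weighted_sum(p, w)) for w in weight_grid}
ch_winners = {best(lambda p, w=w: chebyshev(p, w, ideal)) for w in weight_grid}

print("weighted sum finds:", sorted(ws_winners))
print("chebyshev finds:   ", sorted(ch_winners))
```

No weight vector makes C the weighted-sum optimum, since its score is always 0.6 while A or B scores at most 0.5; Chebyshev selects C for balanced weights.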
RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding
RE-VLM introduces a dual-stream vision-language model combining RGB and event camera data for robust scene understanding under adverse conditions. The model employs parallel encoders and progressive training to align heterogeneous visual features with language, addressing modality gaps. A graph-driven pipeline synthesizes captions and QA pairs from synchronized RGB-event streams, overcoming data scarcity. Evaluated on PEOD-Chat and RGBE-Chat datasets, RE-VLM outperforms RGB-only and event-only baselines in captioning and VQA tasks, particularly in challenging illumination scenarios. Results demonstrate significant improvements in cross-modal alignment and real-world applicability.
vision-language model · event cameras · dual-stream architecture · scene graphs · progressive training
Exploring and Developing a Pre-Model Safeguard with Draft Models
The paper introduces a pre-model safeguard that leverages jailbreak attack transferability between large and small language models (LLMs/SLMs) to improve prompt safety auditing. By systematically studying transferability factors, the authors observe that SLM draft responses predict LLM safety implications. Their design uses speculative inference with SLMs to generate draft responses, then applies existing guards to both prompt and drafts, reducing false-negative rates while maintaining computational efficiency compared to post-model guards. Experiments demonstrate improved safety prediction with lower token usage and processing time.
jailbreak transferability · speculative inference · pre-model guard · false-negative rate · draft models
Inference-Time Scaling in Diffusion Models through Iterative Partial Refinement
The paper proposes Iterative Partial Refinement (IPR), an inference-time scaling method for sequential diffusion models that operates without external verifiers. IPR re-noises and regenerates subsets of regions in an already-generated sample, conditioned on the remaining regions, which gives the model richer context for revising earlier decisions and improves the global consistency of samples. On MNIST Sudoku, IPR increases the valid solution rate from 55.8% to 75.0%, demonstrating its efficacy in tasks requiring global constraint satisfaction. The method is tailored for sequential, mixed-noise conditioning settings.
inference-time scaling · diffusion models · iterative partial refinement · mixed-noise conditioning · global constraint satisfaction
ContextFlow: Hierarchical Task-State Alignment for Long-Horizon Embodied Agents
ContextFlow introduces a hierarchical task-state alignment framework for long-horizon embodied agents, addressing task-state misalignment failures in planner-executor coordination. The method represents task stages as explicit contracts, converts runtime observations into evidence packets, and applies scoped updates (continue, refine, transfer, promote, repair) to maintain alignment. By keeping specialist executors responsible for local control while making task-frontier alignment auditable, the framework mitigates unsupported handoffs, stage lock, and executor-context mismatches. Experiments on long-horizon embodied tasks demonstrate improved diagnosis and mitigation of task-state failures through evidence-grounded updates.
task-state alignment · embodied agents · evidence packets · scoped updates · long-horizon planning
DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies
DEFLECT introduces an offline post-training refinement for Vision-Language-Action (VLA) policies to address prediction-execution misalignment in asynchronous inference. The method constructs counterfactual action pairs from a frozen reference policy and scores them using an implicit flow-matching likelihood-ratio surrogate, requiring no human labels or online rollouts. Results show +6.4 success-rate gain in high-latency regimes (5-7 control steps), +4.6 when transferred to real-scale VLA, and consistent improvements on two real-robot tasks.
vision-language-action · asynchronous inference · flow-matching · counterfactual tuning · delay-robust
Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection
The paper introduces LONSREX, a data synthesis pipeline for fine-tuning LLMs to generate necessary and sufficient rationales in explainable misinformation detection (MD). The method addresses limitations of naive filtering by proposing a metric to quantify each verification step's contribution, evaluating rationale quality. Experiments show that traditional approaches using binary labels yield insufficient rationales, while stronger LLMs produce overly verbose ones. LONSREX effectively balances these issues by optimizing rationale necessity and sufficiency.
misinformation detection · large language models · rationale generation · fine-tuning · veracity prediction
EviTrack: Selection over Sampling for Delayed Disambiguation
The paper introduces EviTrack, a test-time inference framework for sequential prediction in delayed disambiguation regimes, where early observations remain ambiguous. EviTrack maintains competing trajectory hypotheses and applies evidence- and likelihood-ratio-based selection to delay commitment until sufficient evidence accumulates, inspired by multiple hypothesis tracking. Evaluated on a synthetic benchmark with known latent ground truth, EviTrack outperforms sampling-based baselines at matched inference budget, achieving faster post-disambiguation recovery. Results demonstrate trajectory-level selection's superiority over increased sampling coverage in such regimes.
sequential prediction · delayed disambiguation · multiple hypothesis tracking · trajectory hypotheses · evidence-ratio
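The "delay commitment until sufficient evidence accumulates" behaviour described for EviTrack can be sketched with a simple log-likelihood-ratio rule over competing hypotheses. The threshold, the per-step log-likelihood inputs, and the function names are assumptions for illustration; the summary does not specify EviTrack's actual selection criterion.

```python
import math

def select_hypothesis(log_likelihoods, threshold=math.log(10.0)):
    """Return the index of the leading hypothesis, or None when the gap
    to the runner-up is still below the (assumed) evidence threshold."""
    ranked = sorted(range(len(log_likelihoods)),
                    key=lambda i: log_likelihoods[i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if log_likelihoods[best] - log_likelihoods[runner_up] >= threshold:
        return best
    return None

def track(step_log_likelihoods, threshold=math.log(10.0)):
    """Accumulate per-step log-likelihoods for every hypothesis and
    commit at the first step where one hypothesis clearly leads.

    Returns (step index, hypothesis index), or None if never decided.
    """
    totals = [0.0] * len(step_log_likelihoods[0])
    for t, step in enumerate(step_log_likelihoods):
        totals = [a + b for a, b in zip(totals, step)]
        choice = select_hypothesis(totals, threshold)
        if choice is not None:
            return t, choice
    return None

# Two hypotheses: the first two observations are ambiguous, later ones
# favor hypothesis 1.
steps = [[-1.0, -1.0], [-1.0, -1.0], [-3.0, -0.5], [-3.0, -0.5]]
print(track(steps))
```

Both hypotheses are kept alive through the ambiguous prefix, and the commitment happens only once the accumulated ratio crosses the threshold.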
FormalASR: End-to-End Spoken Chinese to Formal Text
The paper introduces FormalASR, two compact end-to-end models (0.6B and 1.7B parameters) for direct spoken Chinese to formal text transcription, eliminating the need for post-processing LLMs. The method involves constructing WenetSpeech-Formal and Speechio-Formal datasets via LLM-based rewriting and quality filtering, followed by fine-tuning Qwen3-ASR models. Results show a 37.4% relative CER reduction over verbatim baselines, alongside improved ROUGE-L and BERTScore, demonstrating efficient on-device deployment.
formalasrend-to-endspoken-to-formalqwen3-asrverbatim transcription
Swimming with Whales: Analysis of Power Imbalances in Stake-Weighted Governance
The paper analyzes power imbalances in stake-weighted governance systems, particularly in Proof-of-Stake blockchains, using the Penrose-Banzhaf power index. Methodologically, it combines analytical proofs showing that perfect power-stake alignment is unattainable but approximable under specific conditions, with empirical analysis of real-world data from Project Catalyst. Results reveal significant power distortions favoring large stakeholders, providing quantitative insights into governance centralization risks in current implementations.
proof-of-stakegovernancepenrose-banzhaf indexpower imbalanceblockchain
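To make the power-versus-stake gap concrete, here is a small sketch that computes the normalized Banzhaf index for a weighted voting game by brute-force enumeration. The stake values and quota are hypothetical and are not taken from the Project Catalyst data; enumeration is exponential in the number of voters, so this only suits toy examples.

```python
from itertools import combinations

def banzhaf_index(stakes, quota):
    """Normalized Banzhaf index: each voter's share of all swing votes."""
    n = len(stakes)
    swings = [0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for coalition in combinations(others, r):
                weight = sum(stakes[j] for j in coalition)
                # Voter i is critical if the coalition loses without i
                # and wins once i joins.
                if weight < quota <= weight + stakes[i]:
                    swings[i] += 1
    total = sum(swings)
    return [s / total for s in swings] if total else [0.0] * n

# Hypothetical three-voter game with a simple-majority quota.
stakes = [50, 30, 20]
quota = 51
power = banzhaf_index(stakes, quota)
stake_share = [s / sum(stakes) for s in stakes]

print(power)        # voting power per voter
print(stake_share)  # stake share per voter
```

In this toy game the largest holder has 50% of the stake but 60% of the normalized voting power, while the 30% and 20% holders end up with equal power, which illustrates the kind of distortion the summary describes.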
When Web Apps Heal Themselves: A MAPE-K Based Approach to Fault Tolerance and Adaptive Recovery
This study introduces a MAPE-K-based self-healing framework for web applications, integrating an AutoFix-inspired mechanism for adaptive fault recovery. The system was evaluated through fault injection experiments across 20 scenarios, achieving 90.7% F1-score in fault detection and 93.2% recovery success. AutoFix reduced time-to-recovery by 56.2% (avg. 3.92s), maintaining 88-95% throughput with only 3.1% response time increase. Feedback mechanisms improved recovery efficiency by 18.6%, demonstrating practical fault tolerance via feedback-driven adaptation.
mape-kautofixfault tolerancetime-to-recoverythroughput
AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees
AQuaUI introduces a training-free token reduction method for GUI-agent models by leveraging non-uniform spatial information density in screenshots. It employs adaptive quadtrees to merge redundant tokens while preserving critical visual elements, maintaining spatial consistency via position encoding. The method enhances temporal consistency across interactions through conditional quadtree refinement using prior states. Evaluated on GUI-Owl-1.5-32B-Instruct, AQuaUI achieves 13.22% speedup and 29.52% token reduction with 99.06% performance retention, demonstrating efficient exploitation of GUI spatial redundancy.
quadtreetoken reductiongui agentsspatial redundancymultimodal models
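The idea of merging tokens with an adaptive quadtree can be sketched as follows. This is an illustrative simplification, not AQuaUI's algorithm: each patch is reduced to a single scalar, the uniformity test and tolerance are assumptions, and the grid side must be a power of two. Uniform blocks collapse into one leaf (one merged token), while detailed regions keep splitting.

```python
def is_uniform(grid, y, x, size, tol):
    """True when all patch values in the block lie within tol of each other."""
    values = [grid[r][c] for r in range(y, y + size) for c in range(x, x + size)]
    return max(values) - min(values) <= tol

def quadtree_leaves(grid, y, x, size, tol):
    """Return (row, col, size) leaves; split a block until it is uniform."""
    if size == 1 or is_uniform(grid, y, x, size, tol):
        return [(y, x, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves.extend(quadtree_leaves(grid, y + dy, x + dx, half, tol))
    return leaves

# Hypothetical 8x8 grid of per-patch values: flat background, one busy corner.
grid = [[0.0] * 8 for _ in range(8)]
grid[0][0] = 1.0
grid[1][1] = 1.0

leaves = quadtree_leaves(grid, 0, 0, 8, tol=0.0)
token_reduction = 1 - len(leaves) / 64

print(len(leaves), token_reduction)
```

Each leaf keeps its (row, col, size) origin, which is the kind of information a position encoding would need to stay spatially consistent after merging.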
ExECG: An Explainable AI Framework for ECG models
ExECG introduces a standardized Python framework for explainable AI in ECG analysis, addressing variability in current pipelines. The three-stage architecture includes: (1) Wrapper for ECG format standardization, (2) Explainer unifying XAI methods (e.g., saliency maps, attention), and (3) Visualizer for cross-method comparison. Case studies demonstrate interoperability and reproducibility across heterogeneous ECG models, facilitating clinical deployment through improved interpretability of arrhythmia classification and abnormality detection outputs.
explainable aiecg analysisarrhythmia classificationsaliency mapsinterpretability
Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination
The study investigates modality-conflict hallucination in multimodal LLMs, where textual premises override contradictory visual evidence. Through causal analysis of attention heads in five MLLMs, it identifies opposing groups: broadly distributed hallucination-driving heads and concentrated hallucination-resisting heads. The imbalance between these groups biases generation toward erroneous premises. The proposed MACI intervention selectively suppresses driving heads during conflicts, achieving the largest hallucination reduction among the compared methods on the MMMC benchmark, with favorable accuracy trade-offs and zero-shot transfer to SCI-SemanticConflict.
modality-conflict hallucinationattention head imbalancepath patchingcausal interventionmmmc benchmark
Euclidean Embedding of Data Using Local Distances
The paper presents a method for Euclidean embedding of data using only local distance graphs, without requiring prior vector representations. The approach formulates a variational problem to match local on-graph distances to Euclidean metrics, deriving Euler-Lagrange equations in coordinate-free form. These non-linear equations are solved via iterative sparse linear updates. Key contributions include: (a) continuum-level functional equations for optimal embedding, (b) representation-free formulation relying solely on neighborhood distance graphs, and (c) local graph-based estimation. Experiments on synthetic and real datasets demonstrate preservation of local metric structure and global isometric approximation.
euclidean embeddinglocal distance graphvariational problemeuler-lagrange equationsisometric approximation
PhyWorld: Physics-Faithful World Model for Video Generation
PhyWorld introduces a physics-faithful world model for video generation, employing two-stage post-training to enhance physical plausibility. First, flow matching fine-tuning improves video-to-video continuation stability and motion coherence. Second, Direct Preference Optimization (DPO) aligns generated dynamics with physical principles. Evaluated on VBench and a dedicated physical-faithfulness benchmark, PhyWorld achieves 0.769 (vs. 0.756 baselines) in video consistency and 3.09 (vs. 2.99 baselines) in physical plausibility, demonstrating its efficacy for Physical AI simulations.
phyworldvideo generationphysical plausibilitydirect preference optimizationflow matching
AI Technologies in Language Access: Attitudes Towards AI and the Human Value of Language Access Managers
This study examines language access managers' attitudes toward AI technologies in translation services, focusing on sectors with legal and ethical constraints like healthcare and government. Through qualitative thematic analysis of 10 semi-structured interviews with US-based professionals, the research reveals conditional optimism about AI adoption, coupled with strong risk awareness and insistence on human oversight in AI-mediated language access. Findings highlight tensions between efficiency mandates and the perceived irreplaceability of human judgment in high-stakes multilingual contexts.
language accessqualitative analysishuman oversighttranslation technologyethical ai
Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses
The study evaluates LLMs' potential to address survey research challenges through a five-stage framework tested on hurricane preparedness data (n=946). It introduces an Anchored Marginal Theory-Informed LLM (A-TLM) that integrates Protection Motivation Theory (PMT) via knowledge graphs, outperforming classical imputation methods (RMSE 1.439 vs. 1.496) with minimal bias (-0.121). Structured retrieval around PMT causal relationships reduces MAE by 9.5% compared to standard RAG, while subgroup analysis reveals masked bias patterns. The framework demonstrates controlled hallucination through knowledge-grounded refusal in chatbot implementations.
large language modelsmissing data imputationprotection motivation theoryknowledge graphretrieval-augmented generation
Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution
The paper introduces Stepwise Confidence Attribution (SCA), a framework for diagnosing reasoning failures in black-box LLMs by assigning step-level confidence scores to generated traces. SCA employs Information Bottleneck principles through two methods: NIBS (non-parametric consistency measurement) and GIBS (graph-based subgraph learning). Experiments on mathematical reasoning and multi-hop QA tasks demonstrate SCA's effectiveness in identifying erroneous steps, with step-level confidence feedback improving self-correction success rates by up to 13.5% over answer-level baselines.
stepwise confidence attributioninformation bottleneckmulti-step reasoningblack-box llmsself-correction
Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models
The paper introduces Token by Token Backdoor Attack (ToBAC), the first backdoor attack targeting unified autoregressive models (UAMs) that generate both text and image tokens. It explores data-based and model-based poisoning strategies, demonstrating how innocuous triggers (e.g., common words) can propagate malicious effects across modalities, manipulating visual outputs and accompanying text. Experiments show ToBAC achieves a 55% success rate in modality-aligned brand promotion via model access and 63.1% success against JanusPro through data poisoning.
unified autoregressive modelsbackdoor attackmultimodal generationdata poisoningmodel poisoning
Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering
The paper contends that current uncertainty quantification (UQ) methods for LLMs constitute a category error, functioning as unsupervised clustering algorithms that measure internal consistency rather than external correctness. Through critical analysis, the authors identify three pathologies: hyperparameter sensitivity, conflation of stability with truth, and reliance on unstable proxy metrics due to absent ground truth. They propose a paradigm shift toward UQ methods anchored in objective verification, advocating for improved evaluation metrics, native uncertainty mechanisms, and reality-grounded confidence measures to address confident hallucinations.
uncertainty quantificationlarge language modelsconfident hallucinationsunsupervised clusteringproxy metrics
SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents
SimGym introduces a framework for simulating e-commerce A/B tests using vision-language model (VLM) agents, addressing the limitations of traditional testing (traffic diversion, slow cycles). The method combines traffic-grounded persona generation from clickstream data, live-browser agents with multimodal perception and episodic memory, and an evaluation protocol comparing simulated vs. real outcomes. Validation on UI theme changes shows 77% directional alignment with real add-to-cart shifts, reducing experimental cycles from weeks to under an hour.
a/b testingvision-language modelclickstream datamultimodal perceptionepisodic memory
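One plausible reading of "directional alignment" is the fraction of experiments in which the simulated and real metric shifts have the same sign. The sketch below assumes that definition, which the summary does not spell out; the shift values are invented for illustration.

```python
def directional_alignment(simulated, real):
    """Fraction of experiments where simulated and real shifts share a sign.
    Zero shifts are treated as non-positive (an assumption of this sketch)."""
    matches = sum(1 for s, r in zip(simulated, real) if (s > 0) == (r > 0))
    return matches / len(real)

# Hypothetical add-to-cart shifts (percentage points) for four UI variants.
simulated_shift = [0.8, -0.3, 1.2, -0.5]
real_shift = [0.5, -0.1, -0.4, -0.9]

alignment = directional_alignment(simulated_shift, real_shift)
print(alignment)
```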
Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference
The paper introduces RotateK, a rotation-based structured Key channel pruning framework for efficient vision-language model inference. The method employs online PCA-based rotation to align token-dependent channel importance into a shared low-dimensional subspace, enabling accurate pruning under lightweight head-wise masks, and uses a fused Triton attention kernel for efficient decoding. Experiments on two VLM backbones demonstrate that RotateK outperforms prior Key channel pruning methods in accuracy and decoding latency, with joint token-channel pruning improving over token-only baselines at matched KV cache budgets.
key channel pruningvision-language modelskv cacheonline pcatriton attention kernel
Not all uncertainty is alike: volatility, stochasticity, and exploration
The paper demonstrates that different sources of environmental uncertainty (volatility and stochasticity) have opposing effects on optimal exploration strategies in decision-making. By extending the Gittins index framework to Gaussian state-space bandits with latent dynamics, the authors derive Cause-Aware Uncertainty-Sensitive Exploration (CAUSE), a closed-form exploration bonus that accounts for these asymmetries. CAUSE outperforms standard exploration methods in environments with heterogeneous noise and improves upon Gittins-per-arm policies in restless bandit settings, while revealing that pathological noise inference can lead to reversed exploration patterns relevant to psychiatric modeling.
gittins indexstate-space banditsexploration-exploitationvolatilitystochasticity
Quantized Machine Learning Models for Medical Imaging in Low-Resource Healthcare Settings
The paper presents a multi-strategy compression framework for deploying deep learning models in low-resource medical imaging settings, focusing on brain tumor classification from MRI. The approach combines quantization-aware training, knowledge distillation (DenseNet-101 to DenseNet-32), and Float16 post-training quantization on MobileNetV2. The quantized MobileNetV2 achieves 82.37% validation accuracy (vs. 82.20% full-precision) with a 6.14x size reduction (35.34 MB to 5.76 MB), maintaining uniform diagnostic performance across glioma, meningioma, pituitary tumors, and healthy controls. Results demonstrate clinical viability for resource-constrained environments.
quantization-aware trainingknowledge distillationfloat16 quantizationmobilenetv2medical imaging
Aerial Inspection Behaviors via RL-based Quadrotor Control for Under-canopy Forest Environments
The paper presents a deep Reinforcement Learning (RL)-based low-level controller for quadrotor navigation in under-canopy forest environments, enabling inspection view-pose tracking (position and yaw reference tracking) for target inspection behaviors. The method combines an end-to-end RL policy (mapping states to RPMs) with a higher navigation layer comprising a Traveling Salesman Problem (TSP) planner for optimal visitation sequencing and a Rapidly-exploring Random Tree Star (RRT*) planner for collision-free path generation. Results demonstrate effective deployment in five target inspection scenarios, showing RL-based motor-level control can serve as a reliable low-level execution module when supported by navigation guidance.
reinforcement learningquadrotor controlview-pose trackingtsp plannerrrt* planner
On-Device Continual Learning with Dual-Stage Buffer and Dynamic Loss for Point-of-Care Pneumonia Diagnosis
PneumoNet introduces a domain-incremental learning method for point-of-care pneumonia diagnosis, addressing performance decline under domain shifts. The method combines a lightweight CNN, a dual-stage balanced buffer for class-balanced replay, and dynamic class-weighted loss to correct training imbalances. Evaluated on PneumoniaMNIST with five domain-shift scenarios, PneumoNet achieves 86.6% accuracy with 1.4% forgetting, outperforming baselines in size and speed.
domain-incremental learninglightweight cnndual-stage bufferdynamic losspneumoniamnist
Hallucination as Exploit: Evidence-Carrying Multimodal Agents
The paper introduces evidence-carrying multimodal agents (ECA) to mitigate hallucination-to-action conversion, where false visual claims trigger unauthorized privileged actions. ECA decomposes tool calls into action-critical predicates, verifies them via constrained DOM/OCR/AX certificates, and uses a deterministic gate to authorize only supported actions. Evaluations show ECA reduces gate bypass from 15% to 1.3% through targeted hardening, achieves 0% unsafe-action rate on 200-task and 120-task pipelines, and outperforms naive agents (100% unsafe execution) and prompt-only defenses (49.6%). Oracle-certificate replay on 7,488 GPT-5.4 traces validates gate correctness.
multimodal agentshallucination-to-actionevidence-carryingaction-critical predicatesdeterministic gate
Going PLACES: Participatory Localized Red Teaming for Text-to-Image Safety in the Global South
The study introduces PLACES, a dataset of 26,000 text-to-image (T2I) model failures collected through localized red teaming in Ghana, Nigeria, and India (Karnataka, Punjab). The method emphasizes participatory localization, involving community workshops in secondary urban centers to capture culturally specific adversarial patterns. Results reveal unique harms (e.g., religious norm violations, ominous symbolism) and structural gaps in Western-centric safety frameworks, demonstrating the need for culturally contextualized T2I safety evaluation.
text-to-image modelsred teamingsafety frameworkscultural pluralismadversarial patterns
Discoverable Agent Knowledge -- A Formal Framework for Agentic KG Affordances (Extended Version)
The paper proposes Agentic Affordance Profiles (AAPs), a formal framework extending Semantic Web Service concepts to Knowledge Graphs (KGs) by addressing epistemic gaps in current metadata standards (VoID, DCAT). AAPs model four dimensions: agent-provable knowledge, closure assumptions, vocabulary grounding, and entailment regime alignment, enabling principled KG selection and failure diagnosis during agent planning. The work identifies a five-point research agenda for scalable affordance matching, bridging ontological commitments between agents and heterogeneous KGs.
agentic affordance profileknowledge graphsemantic web servicesontological commitmententailment regime
Planner-Admissible Graph-PDE Value Extensions for Sparse Goal-Conditioned Planning
The paper establishes planner-admissibility conditions for graph-PDE value extensions in sparse goal-conditioned planning, proving a local action-gap certificate: greedy rollouts succeed if surrogate value errors remain below half the true action gap. The analysis compares Absolutely Minimal Lipschitz Extension (AMLE) and harmonic extension, showing AMLE's superiority through a comparison-principle fill-distance bound and its compatibility with local geometry. Experimental results on 120 AntMaze configurations demonstrate AMLE's 0.970 success rate versus harmonic's 0.584, with high-p methods (p=8, p=16) achieving 0.973-0.982 success by correcting harmonic's action-ranking inversions.
goal-conditioned planninggraph-pdeabsolutely minimal lipschitz extensionaction-gap certificateharmonic extension
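The half-gap condition can be motivated by a short standard argument. The notation below is an assumption of this sketch rather than the paper's: $Q(s,a)$ is the true value of taking action $a$ in state $s$, $\hat{Q}$ is the surrogate, $a^*$ is the truly best action, and $\Delta(s)$ is the action gap.

```latex
\[
\Delta(s) = Q(s, a^*) - \max_{a \neq a^*} Q(s, a) > 0,
\qquad
\bigl|\hat{Q}(s, a) - Q(s, a)\bigr| < \tfrac{1}{2}\Delta(s) \quad \text{for all } a .
\]
Then, for every $a \neq a^*$,
\[
\hat{Q}(s, a^*) \;>\; Q(s, a^*) - \tfrac{1}{2}\Delta(s)
\;\geq\; Q(s, a) + \Delta(s) - \tfrac{1}{2}\Delta(s)
\;=\; Q(s, a) + \tfrac{1}{2}\Delta(s)
\;>\; \hat{Q}(s, a),
\]
so the greedy choice under $\hat{Q}$ coincides with the greedy choice under $Q$ at $s$.
```

If the bound holds at every state visited by the rollout, the surrogate never inverts the action ranking along that rollout.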
Bridge: Retrieval-Augmented Spatiotemporal Modeling for Urban Delivery Demand
The paper introduces Bridge, a retrieval-augmented spatiotemporal framework for urban delivery demand forecasting in cold-start regions. Bridge combines an inductive graph backbone with a time-aware memory of region-time windows, retrieving future demand patterns based on regional context and recent dynamics, then refining forecasts via gated fusion. The retriever is trained with a future-aware objective to align retrieval with forecasting utility. Evaluations on four delivery datasets demonstrate Bridge's superiority over baselines in within-city cold-start and cross-city transfer scenarios, highlighting retrieval augmentation's value when parametric generalization falls short.
retrieval-augmentedspatiotemporalcold-startinductive graphgated fusion
How Far Are We From True Auto-Research?
The paper introduces ResearchArena, a framework for evaluating auto-research systems by generating and assessing 117 agent-written papers across 13 CS domains using Claude Code, Codex, and Kimi Code. Manuscript-only review (SAR) shows Claude Code outperforming Analemma's FARS and matching human ICLR 2025 submissions, but artifact-aware peer review reveals severe experimental rigor issues (fabrication, underpowering, execution mismatch) with agent-dependent failure rates (5-77%). No agent-generated paper meets top-tier venue standards, indicating significant gaps in true auto-research capability.
auto-researchresearcharenaartifact-aware reviewexperimental rigoragent-generated papers
Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use
The paper formalizes trust calibration for agentic tool use as a preference-learning problem, where a policy gateway models human risk tolerance via Gaussian-process classification with probit likelihood on binary approve/deny feedback. The method leverages Preferential Bayesian Optimization's inference machinery and uncertainty-targeted querying to classify actions into allow/block/ask regions, differing from standard optimization objectives. Theoretical connections to sample-efficient preference learning are established while addressing the distinct challenge of autonomous action approval.
trust calibrationagentic tool usegaussian-process classificationpreferential bayesian optimizationprobit likelihood
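As an illustration of how a probit likelihood and posterior uncertainty could yield allow/block/ask regions, the sketch below uses the standard closed form for a Gaussian latent under a probit link, $p = \Phi\bigl(\mu / \sqrt{1 + \sigma^2}\bigr)$. The latent means and variances stand in for a Gaussian-process posterior, and the 0.9 and 0.1 thresholds are hypothetical; none of these values come from the paper.

```python
import math

def approve_probability(mean, variance):
    """Predictive approval probability for a Gaussian latent N(mean, variance)
    under a probit likelihood: Phi(mean / sqrt(1 + variance))."""
    return 0.5 * (1.0 + math.erf(mean / math.sqrt(2.0 * (1.0 + variance))))

def gate(mean, variance, allow_at=0.9, block_at=0.1):
    """Map the predictive probability to allow / block / ask (assumed thresholds)."""
    p = approve_probability(mean, variance)
    if p >= allow_at:
        return "allow"
    if p <= block_at:
        return "block"
    return "ask"

# Hypothetical posterior (mean, variance) of the latent risk-tolerance function.
decisions = {
    "read_file": gate(3.0, 0.5),       # confident approval
    "delete_branch": gate(-3.0, 0.5),  # confident denial
    "send_email": gate(3.0, 20.0),     # same mean, high uncertainty
}
print(decisions)
```

The third case shows why uncertainty matters: a high latent mean with large variance is pulled toward 0.5 and falls into the "ask" region, which is where a human query is most informative.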
Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models
Flash PD-SSM introduces a memory-optimized structured sparse state-space model (SSM) that balances expressivity and efficiency through trainable structured sparse matrices, with discrete selection at each time-step. This approach matches unstructured matrix expressivity in finite-state automaton modeling while maintaining computational efficiency. Evaluations on synthetic tasks and long sequences (17k+ tokens) show state-of-the-art accuracy among SSMs, with improved throughput and memory usage in language modeling and state-tracking applications compared to existing SSMs.
state-space modelsstructured sparsityfinite-state automatonmemory optimizationtime-series modeling
Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks
The paper proposes open-book benign rewriting (OBBR) as a defense against LLM data poisoning attacks, demonstrating theoretically and empirically that projecting poisoned samples to benign prompt space via rewriting with benign references improves safety. The method outperforms closed-book rewriting by 25.7% and state-of-the-art defenses by 51% across five backdoor attacks on four LLMs, while maintaining computational efficiency and preserving natural language task performance. Results show OBBR's effectiveness against both trigger-based and non-trigger-based poisoning.
data poisoningbackdoor attacksllm rewritingopen-book defensebenign projection
GRASP: Deterministic argument ranking in interaction graphs
The paper introduces GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework for argument ranking in interaction graphs that addresses instability in holistic LLM-as-a-Judge evaluations. GRASP aggregates local interaction judgments via a convergent attack-defense propagation operator, yielding more consistent rankings than holistic judging, with reduced inter-model disagreement. Results show GRASP scores are reproducible but uncorrelated with human 'convincingness' labels; they instead measure structural sufficiency, that is, argument robustness within explicit interaction graphs. The method provides a transparent alternative to opaque LLM judging practices.
argument rankinginteraction graphsllm-as-a-judgestructural sufficiencypropagation operator
Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints
The paper introduces IC-$Q$, a provably convergent decentralized $Q$-learning algorithm for workflow learning under interface constraints, formalized as an interface-constrained semi-Markov decision process (IC-SMDP). The method extends the approximate information state (AIS) framework to multi-agent SMDPs and controls Markovian noise under random duration, achieving coordination via scalar handoffs. Theoretical analysis provides a finite-sample bound decomposing into neural approximation error, interface gap, and mixing-time residual, validated empirically on synthetic tasks, multi-LLM reasoning, routing, and CPU programming, matching centralized performance without joint trajectory access.
ic-smdpdecentralized q-learningapproximate information stateworkflow learningmulti-agent coordination
COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones
COBALT introduces a cloud-based teleoperation platform for scalable robot learning via crowdsourced demonstrations. The system leverages vectorized environments and load-balanced infrastructure to support concurrent teleoperation by multiple users on a single GPU, achieving sub-100 ms latency for up to 8 users. It accommodates diverse input devices (smartphones, VR headsets, etc.) and demonstrates scalability (256 simulated clients across 8 GPUs). A user study confirms smartphone-based teleoperation matches or exceeds specialized hardware performance. The platform includes real-time metrics for data quality filtering and a training curriculum, enabling collection of 7500+ demonstrations across nine countries. The resulting dataset validates state-of-the-art imitation learning algorithms.
teleoperationimitation learningvectorized environmentsload-balanced infrastructurecrowdsourcing
Knowing When Not to Predict: Self Supervised Learning and Abstention for Safer DR Screening
The study investigates how self-supervised learning (SSL) pretraining duration affects confidence calibration and abstention in diabetic retinopathy screening models. Using multiple SSL checkpoints with fixed fine-tuning, the authors evaluate calibrated confidence, coverage, selective accuracy, and selective macro-F1. Results show SSL improves selective prediction over random initialization, but prolonged pretraining does not consistently enhance reliability despite accuracy saturation. The work highlights pretraining length as a critical design choice for safety-aware models, not just a computational detail.
self-supervised learningconfidence calibrationselective predictiondiabetic retinopathyabstention
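Coverage and selective accuracy, two of the metrics named above, can be computed as in this short sketch. The confidence threshold and the toy predictions are hypothetical; the study's calibration procedure is not reproduced here.

```python
def selective_metrics(confidences, predictions, labels, threshold):
    """Coverage: fraction of cases the model answers (confidence >= threshold).
    Selective accuracy: accuracy on the answered cases only."""
    kept = [(p, y) for c, p, y in zip(confidences, predictions, labels)
            if c >= threshold]
    coverage = len(kept) / len(labels)
    if not kept:
        return coverage, None  # model abstained on everything
    accuracy = sum(1 for p, y in kept if p == y) / len(kept)
    return coverage, accuracy

# Hypothetical screening outputs: 1 = referable DR, 0 = non-referable.
confidences = [0.95, 0.90, 0.60, 0.55, 0.85]
predictions = [1, 0, 1, 0, 1]
labels = [1, 0, 0, 1, 1]

coverage, selective_accuracy = selective_metrics(
    confidences, predictions, labels, threshold=0.8)
print(coverage, selective_accuracy)
```

Here the model abstains on its two low-confidence cases, both of which it would have gotten wrong, so accuracy on the answered cases rises from 0.6 overall to 1.0 at 60% coverage.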
EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data
The paper introduces EgoBabyVLM, a benchmark for evaluating vision-language models (VLMs) on naturalistic egocentric video data resembling infant learning conditions. It proposes Machine-DevBench, an automatically generated evaluation suite that eliminates train/eval mismatches by sampling lexical and grammatical items across logarithmic frequency bins from the model's vocabulary. Experiments reveal current VLMs fail to leverage weakly-aligned visuo-linguistic signals prevalent in egocentric streams, despite human proficiency in such conditions. The work establishes the EgoBabyVLM Challenge to advance models capable of learning from infant-like naturalistic input.
vision-language modelsegocentric videocross-modal learningdevelopmental benchmarkssemantic alignment
POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents
POLAR-Bench introduces a diagnostic benchmark for evaluating privacy-utility trade-offs in LLM agents, featuring adversarial probing of protected attributes under user-defined policies. The method employs a trusted model with privacy policies conversing with adversarial third-party models across 10 domains (7,852 samples), scoring privacy/utility via set-membership and varying policy dimensions/attack strategies. Results show frontier models withhold >99% of protected attributes, while 1-30B open-weight models leak >50%, revealing critical gaps in privacy alignment for on-device deployments.
privacy-utility trade-offllm agentsadversarial probingpolicy-aware benchmarkingintent-following
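A set-membership score for privacy and utility might look like the sketch below. The exact scoring rules of POLAR-Bench are not given in the summary, so the two formulas here are assumptions: privacy is the share of protected attributes that were not revealed, and utility is the share of permitted attributes that were. All attribute names are hypothetical.

```python
def privacy_utility(revealed, protected, permitted):
    """Score a conversation by set membership of the revealed attributes."""
    revealed = set(revealed)
    # Privacy: fraction of protected attributes withheld.
    privacy = 1 - len(revealed & protected) / len(protected)
    # Utility: fraction of permitted attributes actually shared.
    utility = len(revealed & permitted) / len(permitted)
    return privacy, utility

# Hypothetical user policy and the attributes an agent disclosed under probing.
protected = {"ssn", "diagnosis"}
permitted = {"name", "appointment_time", "city"}
revealed = ["name", "city", "diagnosis"]

privacy, utility = privacy_utility(revealed, protected, permitted)
print(privacy, utility)
```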
GOAL: Graph-based Objective-Aligned Diffusion Solvers for Dynamic Multi-Objective Optimization
The paper introduces GOAL, a graph-based diffusion solver for dynamic multi-objective optimization that enables controllable decision generation by conditioning on human-specified objectives. The method employs a heterogeneous graph encoding with distinct edge types for different constraint classes, guiding message passing in a graph neural network. Evaluated on Flow Shop Problem (FSP), Job Shop Scheduling Problem (JSP), and Flexible Job Shop Scheduling Problem (FJSP), GOAL achieves 100% feasibility and <0.20% MAPE for problems up to 20 jobs/60 operations, outperforming NSGA-II and MOEA/D by 25x in speed and quality.
graph neural networkmulti-objective optimizationdiffusion solverheterogeneous graphscheduling benchmarks
FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models
FAGER introduces a factually grounded framework for evaluating and refining text-to-image models by assessing implicit and explicit factual correctness. The method constructs a structured rubric via LLM-based fact proposal and reference-guided visual verification, then converts it into VLM-evaluated QA pairs. Experiments on five datasets (science, history, products, culture, knowledge-intensive concepts) show FAGER outperforms prior metrics in Factual A/B tests and enables training-free output refinement with significant factuality improvements.
text-to-image evaluationfactual correctnessvisual verificationllm-based fact proposalvlm-based evaluation
Neural Operators for Design-Space Surrogate Modeling of Tendon-Actuated Continuum Robots
The authors propose neural operator architectures for surrogate modeling of tendon-actuated continuum robots, enabling generalization across robot designs. They formulate the problem as operator learning, mapping design parameters and tendon inputs to configurations, and develop four novel architectures: two DeepONet-based and two FNO-based variants. Trained on simulation data, all models achieve good accuracy while maintaining fast inference, demonstrating effective generalization for control, planning, and design optimization in surgical and industrial applications.
neural operatorscontinuum robotssurrogate modelingdeeponetfourier neural operators
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
The authors introduce DecisionBench, a benchmark for evaluating emergent delegation in long-horizon agentic workflows, featuring a fixed task suite (GAIA, tau-bench, BFCL multi-turn), peer-model pool (11 models across 7 vendors), and multi-axis metrics (quality, cost, latency, etc.). The substrate supports various evaluation methods, including learned routers and adaptive profile construction. Key findings include: (i) end-task quality is statistically similar across awareness conditions (|β|=0.21), masking orchestration signals; (ii) routing fidelity-at-1 varies 7.5-29.5% across conditions, with delivery channel being the dominant factor; (iii) counterfactual analysis reveals 15-31 percentage points of unrealized performance headroom. The release includes the substrate, annotations, and 220 per-condition run archives.
agentic workflowsdelegation interfacerouting fidelitycounterfactual ceilingorchestration signal
ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models
The paper introduces ScheduleFree+, a learning-rate-free and schedule-free method for training large language models, addressing scalability challenges in prior work. By incorporating necessary modifications for larger batch and model sizes, the method outperforms Warmup-Stable-Decay schedules by 31% at 1000 tokens per parameter. Results demonstrate its efficacy in long-duration training, establishing a theoretical basis for model averaging and checkpoint merging during pretraining.
schedule-free learninglarge language modelslearning-rate-freemodel averagingcheckpoint merging
Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts
The paper introduces ReElicit, a Bayesian optimization framework for tuning system prompts using aggregate feedback. The method employs embedding by elicitation, where an LLM constructs a compact feature space from task descriptions and prompt-score histories, enabling Gaussian process-based optimization with adaptive feature representations. Evaluated on ten prompt optimization tasks with a 30-evaluation budget, ReElicit outperforms baseline methods in aggregate performance, demonstrating LLMs' capability as adaptive semantic representation builders for natural-language optimization.
bayesian optimizationsystem promptsgaussian processembedding by elicitationllm
Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels
The paper introduces a counterfactual likelihood test to quantify indirect influence between private reasoning channels in modular AI systems. The method substitutes upstream private blocks with length-matched donor blocks while fixing public tokens and downstream targets, then measures negative-log-likelihood shifts. Validation on a 7B role-channel model shows textual probes (n-gram overlap, canary tests) fail to reliably detect leakage, whereas the counterfactual approach distinguishes unmasked/masked conditions and isolates public-channel pathways. Results demonstrate persistent A-to-B influence through public-speech hidden states (verified across 13,734 directional contrasts) and zero reverse influence, with graph-separation controls confirming the public channel as the sole signal carrier.
counterfactual likelihoodprivate reasoning channelsnegative-log-likelihood shiftrole-visibility maskgraph-separation control
MANGO: Meta-Adaptive Network Gradient Optimization for Online Continual Learning
The paper proposes MANGO, a meta-adaptive gradient optimization framework for online continual learning (OCL) that balances stability-plasticity via gradient-gating and meta-learned regularization. Gradient-gating scales parameter updates based on sensitivity, while meta-learned regularization adapts stability coefficients using replay data as both training signal and forgetting evaluator. Evaluated on CLEAR-10, CIFAR-100, and Tiny-ImageNet, MANGO achieves state-of-the-art accuracy and positive backward transfer, outperforming baselines across varying replay buffer sizes.
online continual learninggradient-gatingmeta-learned regularizationcatastrophic forgettingbackward transfer
ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking
ReacTOD introduces a neuro-symbolic architecture for zero-shot dialogue state tracking, combining bounded ReAct loops with deterministic validation to reduce hallucinations and format errors. The method employs iterative self-correction, symbolic validation, and compact prompt management via incremental state prediction. Results show 9.3% accuracy improvement over single-pass inference on MultiWOZ, 93.1% self-correction rate, and state-of-the-art zero-shot performance: 52.71% joint goal accuracy (gpt-oss-20B) and 47.34% (Qwen3-8B). On SGD, Claude-Opus-4.6 achieves 80.68% JGA, demonstrating cross-benchmark generalization.
reactodneuro-symboliczero-shotmultiwozsgd
CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering
The paper introduces CRAFT, a query-conditioned pipeline for multimodal video question answering that combines dynamic keyframe selection, multilingual ASR, and a hybrid critic loop for iterative claim verification. The system integrates UNLI temporal entailment, DeBERTa-v3 cross-claim screening, and a Llama-3.2-3B adjudicator, with a citation-merging stage for source attribution. On MAGMaR 2026, CRAFT achieves 0.739 average score, 0.810 reference recall, and 0.635 citation F1, outperforming baselines. Ablations confirm the importance of atomic claims, ASR, and the critic loop. The method also generalizes to WikiVideo (0.823 Avg), demonstrating cross-dataset applicability.
multimodal video qakeyframe selectiontemporal entailmentclaim verificationcitation merging
Learning Long-Term Temporal Dependencies in Photovoltaic Power Output Prediction Through Multi-Horizon Forecasting
The paper proposes a multi-horizon forecasting framework for photovoltaic power output prediction, demonstrating architecture-independent accuracy improvements by jointly optimizing over sequential future values. The method integrates sequential sky imagery with historical PV data, enabling deep neural networks to better capture long-term temporal dependencies through gradient and filter diversity preservation. Evaluations across diverse architectures show significant accuracy gains across all forecast horizons with minimal computational overhead, offering a scalable solution for grid stability.
multi-horizon forecastingphotovoltaic predictiontemporal dependenciesgradient diversityfilter diversity
Riemannian Networks over Full-Rank Correlation Matrices
The paper introduces Riemannian networks operating on the manifold of full-rank correlation matrices, an underexplored alternative to SPD matrices. The authors leverage five distinct correlation geometries to systematically extend basic layers (MLR, FC, convolutional) to these manifolds, while providing accurate backpropagation methods for two geometries. Experimental comparisons with SPD and Grassmannian networks demonstrate the approach's effectiveness.
riemannian networkscorrelation matricesspd manifoldbackpropagationgrassmannian
Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
This benchmark evaluates five commercial ASR systems on code-switching speech across Arabic-English, Persian-English, and German-English language pairs. A two-stage pipeline (heuristic filtering + LLM ensemble) selects 300 samples per pair, reducing scoring costs by 91%. Systems are assessed via WER and BERTScore, with ElevenLabs Scribe v2 achieving the lowest WER (13.2% overall) and the highest BERTScore (0.936). Difficulty-stratified analysis reveals performance gaps, while BERT embeddings confirm semantic proximity despite script differences. The dataset is available on Hugging Face.
asrcode-switchingbertscorewertransliteration
Toward an AI-Powered Computational Testbed for Workforce Policy
The article proposes dynamic employee agents as an AI-powered computational testbed for workforce policy, combining LLM-powered generative agents with management science to simulate employee responses to organizational changes. The method integrates HR records, psychometric data, and digital activity to model cognitive, emotional, and behavioral trajectories. The authors outline the required computational architecture and emphasize safeguards for privacy, accuracy, and representativeness, positioning this as a critical tool for managing AI-driven workforce transitions.
generative agentspsychometric measurescomputational architectureworkforce simulationorganizational behavior
Multi-axis Analysis of Image Manipulation Localization
The authors introduce AUDITS, a benchmark for image manipulation detection comprising 530K images from user and news photos, supporting analysis across domain shifts, quality, type, and size. The dataset includes diffusion-based inpaintings with diverse manipulation types and sizes. Experiments evaluate robustness of existing methods under domain shifts, aiming to advance research in reliable, generalizable detection. Results highlight the need for improved methods to address varied manipulation scenarios.
image manipulation detectiondomain shiftdiffusion-based inpaintingbenchmark datasetrobustness evaluation
Interpretable Computer Vision for Defect Detection in X-ray Tomography of Aerospace SiC/SiC Composites
The paper introduces p-ResNet-50, an interpretable convolutional framework for defect detection in X-ray tomography of aerospace SiC/SiC composites, combining high accuracy with case-based explanations. The method extends ResNet-50 with a prototype layer aligned to expert-defined defect categories, using novel anchor-based and medoid-based regularization to prevent prototype collapse. Evaluated on 12,000 XCT patches, it achieves comparable performance to black-box ResNet-50 (ROC-AUC 0.994 vs. 0.993) while providing traceable decisions via representative evidence patches and uncertainty mapping through UMAP visualization.
interpretable computer visionprototype networksx-ray tomographydefect detectionregularization
SAGE: Scalable Automatic Gating Ensemble for Confident Negative Harvesting in Fraud Detection
The paper introduces SAGE, a scalable automatic gating ensemble for confident negative harvesting in fraud detection, specifically targeting music streaming fraud. The method combines SimHash-based stratified sampling with a modular gating ensemble (using Mahalanobis distance and k-NN density) to address representation bias in Positive-Unlabeled learning. It ensures coverage of rare behavioral cohorts through floor-constrained sampling. Evaluation shows strong precision and recall on held-out data, with generalization across customer-level and artist-level fraud detection without methodological modifications.
fraud detectionpositive-unlabeled learningsimhashmahalanobis distancek-nn density
When Does Model Collapse Occur in Structured Interactive Learning?
The paper characterizes model collapse in structured interactive learning environments where multiple generative models train on each other's synthetic outputs. It formalizes model interactions via directed graphs, proving that collapse depends critically on interaction topology, and derives necessary/sufficient conditions for collapse occurrence. Theoretical results include finite-sample guarantees for linear regression and asymptotic guarantees for general M-estimators, validated through numerical experiments. The work extends prior single-model collapse analyses to multi-agent settings with arbitrary interaction patterns.
model collapseinteractive learningm-estimatorssynthetic datainteraction graphs
Goal-Oriented Lower-Tail Calibration of Gaussian Processes for Bayesian Optimization
The paper introduces goal-oriented calibration for Gaussian process (GP) predictive distributions in Bayesian optimization (BO), specifically targeting lower-tail miscalibration for minimization tasks. It proposes a framework with two spatial calibration metrics—occurrence calibration and thresholded μ-calibration—and develops tcGP, a post-hoc method to calibrate GP predictions below a threshold t. Theoretical analysis shows the resulting expected improvement (EI) algorithm maintains denseness in the design space. Empirical evaluations on benchmarks demonstrate improved lower-tail calibration and BO performance compared to standard and globally calibrated GP models.
bayesian optimizationgaussian processlower-tail calibrationexpected improvementspatial calibration
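For readers unfamiliar with the acquisition function named in the summary above, the textbook closed-form expected improvement for minimization under a Gaussian predictive distribution can be sketched as follows. This is the standard EI formula only, not the paper's tcGP calibration step; the function name and the arguments `mu`, `sigma`, and `f_best` are illustrative assumptions.

```python
import math


def expected_improvement(mu: float, sigma: float, f_best: float) -> float:
    """Textbook EI for minimization: E[max(f_best - Y, 0)] with Y ~ N(mu, sigma^2).

    mu, sigma: predictive mean and standard deviation at a candidate point.
    f_best:    best (lowest) objective value observed so far.
    """
    if sigma <= 0.0:
        # Degenerate predictive distribution: improvement is deterministic.
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (f_best - mu) * cdf + sigma * pdf
```

Because the formula depends only on the predictive mass below `f_best`, errors in the lower tail of the predictive distribution feed directly into the acquisition value.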
TrajTok: Adaptive Spatial Tokenization for Trajectory Representation Learning
TrajTok introduces an adaptive spatial tokenization method for learning transferable trajectory representations from raw GPS traces. The approach combines multi-resolution hexagonal cell partitioning with a factorized transformer encoder featuring modality-specific self-attention, cross-attention fusion, and spatiotemporal rotary position embeddings (ST-RoPE). Pretrained via masked-token modeling to recover geometric and kinematic patterns, TrajTok achieves state-of-the-art performance on Porto dataset benchmarks including similarity search (85.3% accuracy), classification (91.2% F1), and travel-time regression (12.4 min MAE), demonstrating generalizability across geometry- and kinematics-dominated tasks.
trajectory representationspatial tokenizationfactorized transformerrotary position embeddingsmasked-token modeling
FiLark: a streaming-first software framework for end-to-end exploration, annotation, and algorithm integration in distributed acoustic sensing
FiLark introduces a streaming-first Python framework for distributed acoustic sensing (DAS) that unifies data access, processing, and visualization under a continuous-stream abstraction. The system features an OpenGL-based ring-buffer renderer for interactive browsing of long recordings with constant memory, an integrated annotation interface for event labeling in streams, and a signal processing library with CPU/GPU implementations via PyTorch. By maintaining stateful chunked execution and standardized monitor interfaces, FiLark enables seamless transition from interactive exploration to production pipelines without modification.
distributed acoustic sensingstreaming-firstring-buffer rendererstateful chunked executiongpu-accelerated
Optimizing Computational-Statistical Runtime for Wasserstein Distance Estimation
The paper introduces a Sample-Sketch-Solve paradigm to optimize the computational-statistical runtime for estimating the Wasserstein distance between distributions. By sketching the samples on a regular Cartesian grid, the method compresses the data without increasing the asymptotic error, and is particularly effective for α-Hölder smooth distributions. The approach achieves ε-additive error in expectation, with runtime scaling as ε^(-max(2,(d+1+o(1))/(1+α))) for d=2,3, demonstrating near-optimal performance when α→1 in d=3.
wasserstein distancecomputational-statistical runtimesample-sketch-solveα-hölder smoothcartesian grid sketch
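The runtime exponent quoted in the summary above can be evaluated directly. The helper below is a small arithmetic sketch that drops the o(1) term; the function name is an assumption made for illustration.

```python
def runtime_exponent(d: int, alpha: float) -> float:
    """Exponent e such that runtime scales as eps**(-e), ignoring the o(1) term.

    Implements max(2, (d + 1) / (1 + alpha)) from the summary, which states
    the rate only for d = 2 and d = 3.
    """
    if d not in (2, 3):
        raise ValueError("the summary states the rate only for d = 2, 3")
    return max(2.0, (d + 1) / (1.0 + alpha))
```

For d = 3 the second argument of the max is 4 / (1 + α), so the exponent falls to 2 as α approaches 1; at α = 0.25 it is 3.2.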
Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing
The paper provides an analytical framework for understanding how pretrained representation dimensionality affects downstream generalization in high-dimensional linear models. Using principal component analysis for pretraining and linear regression for downstream tasks, the authors derive exact expressions for training and generalization error as functions of representation size, sample sizes, and task alignment. Key findings show maximal compression is optimal with abundant pretraining data but scarce labels, while higher-dimensional representations generalize better with limited pretraining. The work also quantifies a precise trade-off between unlabeled and labeled data requirements, with empirical validation in autoencoders and pretrained LLMs.
representation learninglinear probinghigh-dimensional analysispretraininggeneralization error
Towards Distillation Guarantees under Algorithmic Alignment for Combinatorial Optimization
The paper establishes sufficient conditions for successful knowledge distillation in combinatorial optimization tasks when the target architecture is algorithmically aligned with the problem structure. Focusing on graph neural networks (GNNs) aligned with dynamic programming (DP) algorithms, the analysis assumes the source model satisfies the linear representation hypothesis (LRH) and shows distillation efficiency depends on the decision tree complexity of DP transition functions. Theoretical results demonstrate that distillation succeeds when the target GNN's architecture matches the DP algorithm's structure, extending prior work on decision-tree distillation to structured prediction settings.
knowledge distillationalgorithmic alignmentgraph neural networksdynamic programmingcombinatorial optimization
Smooth Partial Lotteries for Stable Randomized Selection
The paper introduces smoothness as a design principle for partial lotteries in competitive selection processes, addressing instability in existing lottery designs where small score changes cause large probability shifts. It proposes the Clipped Linear Lottery, a mechanism where selection probabilities scale linearly between upper and lower thresholds, proving its worst-case regret matches a lower bound for smooth selection rules up to a factor of $(1 - k/n)$. Experiments on peer review data from ICLR 2025, NeurIPS 2024, and the Swiss National Science Foundation demonstrate the instability of existing designs and the superior smoothness-utility tradeoff of the proposed method.
partial lotteriessmooth selectionclipped linear lotterylipschitz conditionregret bound
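The smoothness idea in the summary above can be illustrated with a clipped linear map from a score to a selection probability. This is a minimal sketch under stated assumptions: the thresholds `lower` and `upper` are hypothetical inputs, and the summary does not say how the paper chooses them (for example, so that the probabilities sum to the number of accepted items), so that step is left out.

```python
def clipped_linear_probability(score: float, lower: float, upper: float) -> float:
    """Map a score to a selection probability that rises linearly
    from 0 at `lower` to 1 at `upper` and is clipped outside that range.

    The thresholds are assumed inputs; choosing them is not shown here.
    """
    if upper <= lower:
        raise ValueError("upper threshold must exceed lower threshold")
    return min(1.0, max(0.0, (score - lower) / (upper - lower)))
```

A map of this shape is Lipschitz with constant 1 / (upper - lower), so a small change in score can only move the selection probability by a proportionally small amount.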
Tail Annealing for Heavy-Tailed Flow Matching
The paper introduces Log-FM, a method for improving flow matching on heavy-tailed data via coordinate-wise soft-log transforms. The approach applies $\phi(x) = \mathrm{sign}(x) \cdot \log(1 + |x|)$ to the data before training and inverts it by exponentiation after generation, with a Hill diagnostic to selectively transform only heavy-tailed dimensions. Theoretical analysis shows the log-transform converts Pareto tails to exponentials, enabling tail annealing through power transformations. Evaluated on a 144-configuration benchmark (3 copulas, dimensions up to 100, 4 tail indices), Log-FM outperforms specialized baselines in $W_1$, CVaR$_{99}$, and extreme-quantile metrics, achieving zero severe divergences across 2,880 runs.
flow matchingheavy-tailed datatail annealinghill diagnosticsoft-log transform
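The soft-log transform quoted in the summary above and its inverse can be sketched as follows. The formula for the forward map is taken from the summary; the inverse follows by algebra. The function names and the `heavy_dims` argument are assumptions, and the Hill diagnostic that would select those dimensions is not shown.

```python
import math


def soft_log(x: float) -> float:
    """phi(x) = sign(x) * log(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)


def soft_exp(y: float) -> float:
    """Inverse of soft_log: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)


def transform(sample, heavy_dims):
    """Apply soft_log only to coordinates flagged as heavy-tailed.

    heavy_dims is an assumed set of coordinate indices.
    """
    return [soft_log(v) if i in heavy_dims else v for i, v in enumerate(sample)]


def inverse_transform(sample, heavy_dims):
    """Undo transform() on generated samples."""
    return [soft_exp(v) if i in heavy_dims else v for i, v in enumerate(sample)]
```

Using `log1p` and `expm1` keeps the round trip accurate for values near zero, where `log(1 + |x|)` computed naively loses precision.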
Active Context Selection Improves Simple Regret in Contextual Bandits
The paper introduces an active context selection method for contextual bandits, improving simple regret bounds compared to passive sampling. By allowing the learner to choose which contexts to sample, the authors derive tight regret rates: passive sampling achieves $\sqrt{n/T \lVert p \rVert_{1/2}}$, while active sampling with allocation $q_j \propto p_j^{2/3}$ achieves $\sqrt{n/T} \lVert p \rVert_{2/3}$, yielding improvements up to $\Theta(k^{1/4})$. They extend the analysis to budgeted active sampling and propose the Explore-Explore-Then-Commit (EETC) algorithm for unknown context distributions, matching known-$p$ rates asymptotically. Experiments validate the theoretical results.
contextual banditssimple regretactive samplingregret boundseetc algorithm
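The two rate expressions in the summary above can be compared numerically. The sketch below assumes the passive rate reads as $\sqrt{(n/T)\,\lVert p \rVert_{1/2}}$, that $\lVert p \rVert_r = (\sum_j p_j^r)^{1/r}$, and that constants are dropped; the digest does not define `n` and `T`, so they are passed through as given. All function names are illustrative.

```python
import math


def quasi_norm(p, r):
    """(sum_j p_j**r) ** (1/r); a quasi-norm for r < 1."""
    return sum(x ** r for x in p) ** (1.0 / r)


def passive_rate(p, n, T):
    """sqrt((n / T) * ||p||_{1/2}), constants dropped (assumed reading)."""
    return math.sqrt(n / T * quasi_norm(p, 0.5))


def active_rate(p, n, T):
    """sqrt(n / T) * ||p||_{2/3}, constants dropped."""
    return math.sqrt(n / T) * quasi_norm(p, 2.0 / 3.0)


def active_allocation(p):
    """Sampling allocation q_j proportional to p_j ** (2/3)."""
    weights = [x ** (2.0 / 3.0) for x in p]
    total = sum(weights)
    return [w / total for w in weights]
```

Under these assumptions the two rates coincide for a uniform context distribution and the active rate is smaller when the distribution is skewed, for example `p = [0.97, 0.01, 0.01, 0.01]`.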
D$^3$-Subsidy: Online and Sequential Driver Subsidy Decision-Making for Large-Scale Ride-Hailing Market
The paper introduces D$^3$-Subsidy, a hierarchical diffusion-based framework for dynamic driver-side subsidy optimization in ride-hailing platforms. The method employs a prefix-conditioned diffusion model to generate future trajectories from historical data, coupled with a context-conditioned inverse module for low-dimensional control signals. A Lagrangian-dual-derived mapping ensures subsidy-rate cap compliance without iterative optimization. Offline evaluations show improvements in completed rides (Rides) and gross merchandise value (GMV), while real-world A/B tests confirm operational feasibility with budget constraints.
diffusion modellagrangian dualonline decision-makingparameter-efficient fine-tuningride-hailing
CAMERA: Adapting to Semantic Camouflage in Unsupervised Text-Attributed Graph Fraud Detection
The paper introduces CAMERA, a Case-Adaptive Multi-cue Expert fRAmework for unsupervised text-attributed graph fraud detection (TAGFD) that addresses semantic camouflage by fraudsters. The method employs an ego-decoupled mixture-of-experts architecture, where each expert models distinct fraud-indicative cues, and a context-informed gating model adaptively integrates these cues based on ego node representations and local neighborhood context. CAMERA leverages fraudster rarity for unsupervised one-class learning with expert-level objectives that emphasize benign patterns. Evaluations on 4 datasets demonstrate CAMERA's superior performance against semantically camouflaged fraudsters compared to competitors.
text-attributed graphsemantic camouflagemixture-of-expertsunsupervised learningfraud detection
Take It or Leave It: Intent-Controlled Partial Optimal Transport
The paper introduces intent-controlled partial optimal transport (IC-POT), a generalization of partial optimal transport that replaces global rejection mechanisms with pointwise rejection costs based on side-specific reliability or external information. The method formulates the problem as a balanced Kantorovich OT on an augmented support and provides a dual interpretation via local acceptance thresholds. Experiments in positive-unlabeled learning, open-partial domain adaptation, and geophysical satellite data demonstrate improved performance when rejection rules encode statistical or physical priors.
partial optimal transportpointwise rejectionkantorovich problemside informationdomain adaptation
Training-Free Bayesian Filtering with Generative Emulators
The paper introduces a training-free Bayesian filtering method using generative emulators, specifically diffusion-based models, to address scalability issues in high-dimensional settings. By leveraging these emulators, the authors implement an optimal variant of particle filters that avoids the computational limitations of classical numerical solvers. Experimental results on nonlinear chaotic systems, including atmospheric dynamics, demonstrate the method's effectiveness in scaling particle filtering to high dimensions without additional training.
bayesian filteringparticle filtersdiffusion-based emulatorsnonlinear dynamicshigh-dimensional scaling
Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates
The paper introduces FINCH, a loss-adaptive learning-rate schedule that mitigates catastrophic forgetting during fine-tuning of large language models without modifying the objective function. The method dynamically adjusts learning rates based on batch loss, reducing rates for high-loss batches to limit forgetting while maintaining task performance. Evaluated on knowledge acquisition, science, and low-resource language benchmarks, FINCH reduces forgetting by 93% on average, preserves TruthfulQA and HaluEval performance on Qwen3-4B, and improves confidence calibration compared to standard fine-tuning.
catastrophic forgettingfine-tuninglearning-rate scheduleloss-adaptiveconfidence calibration
Minimalist Visual Inertial Odometry
The paper presents a minimalist visual-inertial odometry (VIO) system for differential-drive robots using only four photodiodes with optical Gabor masks and an IMU. The method jointly optimizes mask parameters and a Temporal Convolutional Network (TCN) via simulation to decode speed from photodiode measurements, combining them with IMU angular velocity for planar trajectory estimation. Experimental validation on a prototype shows that the estimated trajectories closely track ground truth across diverse terrains without real-world fine-tuning, demonstrating resource-efficient odometry with minimal sensing.
visual-inertial odometrygabor maskstemporal convolutional networkdifferential-drive robotsplanar odometry
Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation
The paper introduces MetaFine, a diagnostic meta-evaluation framework for fine-grained manipulation tasks, addressing limitations of binary success metrics in current embodied AI benchmarks. MetaFine decomposes manipulation competency into three axes (understanding, perception, controlled behavior) via a compositional task graph that integrates heterogeneous benchmarks under a unified protocol. Evaluation of vision-language-action models reveals dimension-specific failures, with visual encoder spatial structure preservation identified as a critical bottleneck; targeted improvements yield 70% gains in precision without policy modifications. The framework supports hybrid real-sim validation for stable physical benchmarking.
fine-grained manipulationmeta-evaluationvision-language-actioncompositional task graphspatial perception
Your Neighbors Know: Leveraging Local Neighborhoods for Backdoor Detection in Decentralized Learning
The paper introduces Argus, a decentralized backdoor detection framework for collaborative learning that requires no central coordinator or trigger knowledge. Argus leverages local trigger analysis and neighborhood consensus via structural similarity metrics to distinguish true backdoors from false alarms caused by data heterogeneity, with theoretical convergence guarantees. Evaluated on three datasets against three baselines, Argus reduces attack success rates by up to 90 percentage points while maintaining model utility within 5 points of an oracle, with improved effectiveness under higher data heterogeneity.
decentralized learningbackdoor detectionstructural similaritydata heterogeneityconvergence guarantees
Normative Networks for Source Separation via Local Plasticity and Dendritic Computation
The paper introduces Predictive Entropy Maximization, a biologically plausible neural network for blind source separation (BSS) that uses local plasticity and dendritic computation. The method approximates entropy maximization through an interpretable objective function, enabling error-driven feedforward synapses (implementable via dendritic mechanisms), Hebbian lateral inhibition, and output nonlinearities for domain constraints. Theoretical spectral bounds characterize approximation accuracy. Empirically, the approach outperforms biologically plausible baselines under correlated sources and noise, matching performance of exact determinant-based methods. Results demonstrate how local plasticity and adaptive inhibition emerge from regularized second-order entropy maximization.
blind source separationlocal plasticitydendritic computationentropy maximizationhebbian learning
Learning Orthonormal Bases for Function Spaces
The paper introduces a neural network-based method for learning adaptive orthonormal bases in function spaces, departing from fixed bases like Fourier or wavelets. By parameterizing basis transformations as continuous paths on the Lie manifold of the orthogonal group, governed by ODEs with neural network-defined finite-rank skew-adjoint operators, the approach achieves universality: rank-2 generators suffice to approximate any target basis. Theoretical results prove density in the orthogonal group, while experiments demonstrate applications to functional PCA, eigenfunction computation, and energy-preserving dynamical systems.
orthonormal basislie manifoldskew-adjoint operatorfunction spaceneural ode
Exploiting Non-Negativity in DAG Structure Learning
The paper introduces a novel approach for learning directed acyclic graphs (DAGs) from linear structural equation models by exploiting non-negative edge weights. The method formulates a regularized non-negative DAG learning problem using an augmented-Lagrangian approach, demonstrating that non-negativity simplifies the acyclicity characterization and yields a benign optimization landscape. Theoretical analysis proves the true DAG is the unique global minimizer without spurious stationary points, while experiments show superior performance over state-of-the-art continuous DAG-learning methods on synthetic and real-world data.
directed acyclic graphsstructural equation modelsnon-negative edge weightsaugmented-lagrangianoptimization landscape
Variance-Reduced Manifold Sampling via Polynomial-Maximization Density Estimation
The paper introduces PMM-MASEM, a variance-reduced manifold sampling method that replaces k-nearest-neighbor density estimation in MASEM with a polynomial-maximization moment estimator. The hybrid approach uses a gated PMM2/PMM3 estimator for non-flat spacing distributions while defaulting to the plug-in/MLE rule for homogeneous manifolds. Experiments show a 22-36% reduction in density MSE for asymmetric gamma and boundary-spacing regimes, though performance degrades on platykurtic uniform spacings. The results therefore mark out the conditions under which the estimator helps rather than showing a universal improvement.
manifold samplingpolynomial-maximizationvariance reductiondensity estimationk-nearest-neighbor
JAXenstein: Accelerated Benchmarking for First-Person Environments
JAXenstein introduces a JAX-based benchmark for visual first-person tasks in reinforcement learning, addressing the lack of such domains in the JAX ecosystem. Utilizing the Wolfenstein 3D rendering engine, it enables fast and scalable experimentation, outperforming comparable vision-based benchmarks in speed. The benchmark supports testing exploration and partial observability, facilitating rapid algorithm development.
jaxreinforcement learningwolfenstein 3dbenchmarkpartial observability
Hierarchical Contrastive Learning for Multi-Domain Protein-Ligand Binding
HCLBind introduces a hierarchical contrastive learning framework for multi-domain protein-ligand binding prediction, addressing limitations of monolithic graph approaches. The method combines self-supervised pre-training on Q-BioLiP with a novel hierarchical decoy strategy: local perturbations for single-domain physicochemical constraints and inter-domain rotations for global geometry. It employs a hybrid architecture with domain-gated graph attention, cross-modal attention, and LoRA-adapted foundation models. Evaluated on PDBBind, HCLBind demonstrates improved interface feature discrimination and uncertainty estimation compared to supervised baselines.
protein-ligand bindingcontrastive learninggraph attention networkmulti-domain proteinsuncertainty estimation
Auditing Privacy in Multi-Tenant RAG under Account Collusion
(No summary returned.)
Fast Tensorization of Neural Networks via Slice-wise Feature Distillation
The authors propose a scalable tensorization framework for neural network compression via slice-wise feature distillation. Unlike global tensor decomposition methods requiring expensive finetuning, their approach decomposes networks into modular slices (individual layers/blocks or layer groups) and tensorizes each slice independently to match intermediate representations of the original model. This method improves accuracy recovery, reduces data dependence, and enables parallel optimization. Experiments on ResNet-34 demonstrate near-lossless compression at moderate rates with faster optimization than global approaches, while GPT-2 XL results show scalability to large models in distributed settings.
tensorizationfeature distillationneural compressionslice-wise decompositionparallel optimization
Set-Valued Policy Learning
The paper introduces set-valued policy learning, a paradigm where policies output multiple plausible treatments instead of single recommendations, enabling intrinsic uncertainty quantification. The method extends learning-to-defer via a greatest lower bound approach and introduces conformal policy learning, which connects estimated optimal treatments with unobserved ground-truth rules. A randomness-injection technique guarantees marginal coverage without assumptions on black-box optimal rules. Experiments on synthetic data and in-vitro fertilization (IVF) data demonstrate robust policies that balance performance and reliability while incorporating clinical considerations.
set-valued policyconformal policy learninguncertainty quantificationlearning-to-deferrandomness-injection
General Lower Bounds for Differentially Private Federated Learning with Arbitrary Public-Transcript Interactions
The work establishes a general lower bound for differentially private federated learning protocols with arbitrary public-transcript interactions, applicable to adaptive rounds and client sample reuse. By developing a privacy-information contraction inequality for complete public transcripts, the authors derive a federated van Trees lower bound for estimators under total clientwise sample-level zero-concentrated differential privacy (zCDP). The results demonstrate the bound's applicability to mean estimation, linear regression, and nonparametric regression under squared ℓ2 loss.
federated learningdifferential privacyzcdplower boundsparameter estimation
LionMuon: Alternating Spectral and Sign Descent for Efficient Training
The paper introduces LionMuon, a hybrid optimizer alternating between Lion's sign-based updates and Muon's spectral matrix-sign updates with period P, sharing a dual-EMA momentum buffer. This approach reduces per-step cost while maintaining Muon's effectiveness, achieving Pareto dominance over Muon, Lion, Signum, and AdamW across 124M, 355M, and 720M model scales. Theoretical analysis provides complexity bounds under heavy-tailed noise, predicting compute-optimal P and conditions for superiority. LionMuon's state memory matches Lion (half of AdamW), and even a simpler single-EMA variant (SignMuon) outperforms pure Muon.
optimizerspectral descentsign descentdual-emaheavy-tailed noise
B-cos GNNs: Faithful Explanations through Dynamic Linearity
The paper introduces B-cos GNNs, a class of graph neural networks designed for faithful explainability through dynamic linearity. By employing linear aggregation and B-cos transforms instead of non-linear message passing, the model decomposes predictions into interpretable per-node, per-feature contributions via a single input-dependent linear map. This approach eliminates the need for auxiliary explainers or modified objectives, providing instance-level explanations efficiently. Evaluated as a GIN variant, B-cos GNNs achieve state-of-the-art explainability with minor accuracy trade-offs, outperforming post-hoc methods in speed across synthetic and real-world benchmarks.
b-cos gnnsdynamic linearitygraph neural networksexplainabilitygin variant
MSAlign: Aligning Molecule and Mass Spectra Foundation Models for Metabolite Identification
The paper introduces MSAlign, a framework for metabolite identification through alignment of mass spectra and molecular representations. The method combines frozen foundation models (DreaMS for mass spectra and ChemBERTa for molecules) via lightweight MLP projections trained with contrastive learning. MSAlign outperforms existing approaches across benchmarks while being simple and fast. The work also formalizes distribution shift in evaluation strategies, providing quantitative analysis of data splitting tradeoffs. All implementations and datasets are released for reproducibility.
metabolite identificationcontrastive learningfoundation modelsrepresentation alignmentmass spectrometry
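The MSAlign summary says only that the projections are "trained with contrastive learning". As an illustration, the sketch below computes a symmetric InfoNCE loss of the kind used in CLIP-style alignment. That choice of loss, the `temperature` value, and the function names are assumptions, and the inputs stand in for embeddings that have already passed through the frozen encoders and the MLP projections.

```python
import math

def normalize(v):
    """Scale a vector to unit length (no-op for the zero vector)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def symmetric_info_nce(spec_emb, mol_emb, temperature=0.1):
    """Symmetric contrastive loss over a batch of paired embeddings.

    spec_emb[i] and mol_emb[i] are assumed to describe the same compound;
    every other pairing in the batch acts as a negative.
    """
    s = [normalize(v) for v in spec_emb]
    m = [normalize(v) for v in mol_emb]
    n = len(s)
    # Cosine similarities scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(s[i], m[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            mx = max(row)  # subtract the max for numerical stability
            lse = mx + math.log(sum(math.exp(x - mx) for x in row))
            total += lse - row[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*logits)]
    # Average the spectrum-to-molecule and molecule-to-spectrum directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(cols))
```

With matched pairs the loss is near zero, and it grows when the pairing is scrambled, which is the signal that would drive the projections toward a shared space.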
Graph Neural Networks for Community Detection in Graph Signal Analysis
The paper proposes integrating GNN-derived community detection with Partition of Unity Method (PUM) interpolation for graph signal analysis. Using a taxonomy of GNN architectures for community detection, it constructs local subdomains via GNN clustering, computes Graph Basis Function (GBF) interpolants per community, and combines them into global approximations. Experiments on geometric and urban network benchmarks show accurate signal reconstruction, demonstrating that deep learning-based partitions enhance localized interpolation scalability.
graph neural networkscommunity detectionpartition of unity methodgraph basis functionssignal interpolation
Awakening the Hydra: Stabilizing Multi-Concept Backdoor Injection in Text-to-Image Diffusion Models
The paper introduces Hydra, a framework for stable multi-concept backdoor injection in text-to-image diffusion models under decentralized reuse scenarios. The method employs evolutionary trigger search in text encoder space to align triggers with target concepts while maintaining stability across injections, combined with multi-task fine-tuning and trigger-clean regularization. Experiments on multiple diffusion backbones demonstrate Hydra's effectiveness, achieving ~95% attack success rate across 8 attackers and 500 concept pairs while preserving clean generation quality.
backdoor injectiontext-to-image diffusionmulti-task fine-tuningevolutionary trigger searchtrigger-clean regularization
Probabilistic Multivariate Time Series Forecasting with Diffusion Copulas
The paper introduces a Diffusion-Copula framework for probabilistic multivariate time series forecasting, addressing the 'normality bias' in diffusion models by decoupling marginal distribution learning from dependence structure modeling. The method combines deep Mixture Density Networks for heavy-tailed marginal dynamics with a Classification-Diffusion Copula for joint dependence. Evaluated on cryptocurrency markets, the framework outperforms state-of-the-art baselines in forecasting extreme events, correctly identifying simultaneous market crashes as statistically probable rather than impossible, thus improving risk management during contagion events.
diffusion-copulamixture density networksmultivariate forecastingtail riskdependence structure
Agentic Discovery of Cryomicroneedle Formulations
The study presents an AI-driven closed-loop workflow for discovering cryoprotectant formulations for cryomicroneedles, combining literature curation, Gaussian-process surrogate modeling, Bayesian optimization, and wet-lab validation. A dataset of 198 mesenchymal stem-cell cryopreservation formulations was used to train an uncertainty-aware prior model, which was iteratively refined through 106 wet-lab observations. Final results showed improved predictive performance (batch RMSE reduced from 41.21 to 6.86 percentage points, R²=0.942) and identified a high-viability (95.15%) formulation with low toxicity. The work highlights the potential of agent-assisted discovery for labs lacking in-house data expertise.
cryomicroneedlesgaussian-processbayesian optimizationcryoprotectantmulti-objective optimization
Convergence of Consensus-Based Particle Methods for Nonconvex Bi-Level Optimization
The paper proposes a derivative-free consensus-based optimization method for nonconvex bi-level optimization, where the upper-level function is minimized over the set of lower-level global minimizers. The method employs smooth quantile selection and a Gibbs-type Laplace approximation to construct consensus points. Theoretical analysis establishes convergence guarantees for both mean-field dynamics and finite-particle approximations, demonstrating exponential convergence to arbitrary Wasserstein neighborhoods of the bi-level solution under smooth quantile localization and stability assumptions. Numerical experiments on constrained 2D problems and neural network training validate the theoretical findings.
bi-level optimizationconsensus-based optimizationmean-field dynamicswasserstein distancegibbs-type approximation
Cross-View Attention Fusion Net: A Prior-Guided Dual-View Representation Learning for Cardiac Output Estimation from Short-Term PPG Signals
The Cross-View Attention Fusion Network (CVAF-Net) is proposed for cardiac output (CO) estimation from short photoplethysmography (PPG) signals, combining raw temporal PPG data with structured feature sequence maps via cross-view attention. This dual-view approach leverages both end-to-end learning and physiological priors, achieving mean absolute error (MAE) of 0.19 L/min (3.95% MAPE) on simulated data and 1.20 L/min in real-world settings, while reducing FLOPs by 12× compared to Transformer-based models. The method demonstrates physiological plausibility through correlations with age (ρ=−0.274), heart rate (ρ=0.894), and vascular resistance (ρ=−0.740).
photoplethysmographycardiac output estimationcross-view attentionfeature sequence maphemodynamic monitoring
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
OScaR introduces a lightweight KV cache compression framework for X-LLMs (text-only, multi-modal, and omni-modal LLMs) by addressing Token Norm Imbalance (TNI) through Canalized Rotation and Omni-Token Scaling. The method mitigates sequence-dimensional variance efficiently, supported by optimized system design and CUDA kernels. Evaluations demonstrate near-lossless INT2 quantization performance, achieving 3.0x decoding speedup, 5.3x memory reduction, and 4.1x throughput increase compared to BF16 FlashDecoding-v2.
kv cachetoken norm imbalanceper-channel quantizationcanalized rotationomni-token scaling
BCI-sift: An automated feature selection toolbox for Brain Computer Interface applications
The authors present BCI-sift, a Python toolbox for systematic feature selection in Brain-Computer Interface applications, compatible with scikit-learn. The method integrates optimization algorithms to identify relevant features across electrode, temporal, and frequency dimensions in high-dimensional BCI data. Validation on HD ECoG data (8 participants, 64-128 electrodes) showed improved classification accuracy, with selected features anatomically consistent and temporally clustered around speech production, while high-frequency bands proved most informative. The open-source toolbox enhances decoding performance and interpretability for various BCI modalities.
feature selectionbrain-computer interfaceelectrocorticographysensorimotor cortexhigh-dimensional data
Inferring Sensitive Attributes from Knowledge Graph Embeddings: Attack and Defense Strategies
The paper investigates privacy risks in knowledge graph embeddings (KGEs), demonstrating that adversaries can infer sensitive user attributes from non-sensitive KGE outputs. It proposes a defense framework using post-processing sanitization techniques to mitigate these attribute inference attacks. Preliminary results reveal the attack effectiveness and explore the privacy-utility trade-off in randomization-based defenses, suggesting future work on advanced techniques is needed.
knowledge graph embeddingsattribute inference attacksprivacy riskssanitization techniquesprivacy-utility trade-off
Increasing Missingness to Reduce Bias: Richardson-SGD with Missing Data
The paper introduces Richardson-SGD, a debiasing method for stochastic gradient descent (SGD) with missing data that deliberately increases missingness to reduce gradient bias. By generating a further-thinned version of incomplete observations and combining gradients via Richardson extrapolation, the method reduces bias from $O(\|p\|)$ to $O(\|p\|^2)$, where $p$ is the missingness ratio vector. Theoretical analysis shows the approach is model-agnostic, computationally efficient, and generalizes to multi-step cancellation of higher-order bias terms. Empirical results demonstrate improved optimization and estimation across generalized linear models, particularly when combined with imputation methods like MICE.
richardson extrapolationgradient biasmissing datastochastic gradient descentimputation
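The bias reduction in the Richardson-SGD item can be checked on a toy case. The sketch below is a simplified illustration, not the paper's algorithm. It assumes a single scalar missingness rate `p` instead of a vector, zero imputation, least-squares loss, and the classic two-point extrapolation `2*G(p) - G(2p)`, and it works with exact expected gradients rather than sampled ones. `sigma` and `c` denote the complete-data second moments E[x x^T] and E[x y]; all names are hypothetical.

```python
def expected_gradient(w, sigma, c, p):
    """Expected least-squares gradient under zero imputation.

    Each feature is assumed missing independently with probability p,
    independently of the data, and missing entries are replaced by zero.
    """
    keep = 1.0 - p
    grad = []
    for i in range(len(w)):
        acc = 0.0
        for j in range(len(w)):
            # E[x~_i x~_j] is keep * Sigma_ii on the diagonal
            # and keep^2 * Sigma_ij off the diagonal.
            factor = keep if i == j else keep * keep
            acc += factor * sigma[i][j] * w[j]
        grad.append(acc - keep * c[i])  # E[x~_i y] = keep * c_i
    return grad

def richardson_gradient(w, sigma, c, p):
    """Cancel the O(p) bias term using a further-thinned copy of the data.

    Thinning observed entries with keep probability (1 - 2p) / (1 - p)
    raises the overall missingness rate from p to 2p.
    """
    g_p = expected_gradient(w, sigma, c, p)
    g_2p = expected_gradient(w, sigma, c, 2.0 * p)
    return [2.0 * a - b for a, b in zip(g_p, g_2p)]

def max_error(grad, reference):
    return max(abs(a - b) for a, b in zip(grad, reference))
```

In this toy setting the expected gradient is a quadratic polynomial in `p`, so the extrapolated error is exactly proportional to `p**2`: halving `p` cuts it by a factor of four, while the plain gradient's error shrinks only about linearly.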
Gaussian Approximation and Multiplier Bootstrap for Federated Linear Stochastic Approximation
The paper establishes Berry-Esseen-type bounds for federated linear stochastic approximation (LSA), providing the first federated Gaussian approximations that quantify communication-computation trade-offs and heterogeneity-aware error terms. It analyzes both constant and decreasing step size regimes, recovering prior results as special cases. A key contribution is an online multiplier bootstrap procedure for inference on the last iterate, which bypasses asymptotic covariance matrix estimation and offers non-asymptotic validity guarantees.
federated learningstochastic approximationberry-esseen boundmultiplier bootstrapheterogeneity-aware
Optimal Reconstruction from Linear Queries
The paper characterizes optimal reconstruction error for recovering an unknown point in ℝᵈ from noisy linear queries, establishing fundamental limits analogous to Bayes optimal error in supervised learning. Using geometric methods including a robust generalization of Jung's theorem via Lie group analysis, the authors prove: (1) asymptotic error √(2d/(d+1))δ as T→∞, (2) doubly exponential excess error decay for fixed d, and (3) Θ(exp(d)) query complexity for vanishing error in growing dimensions. The improper variant analysis further extends these theoretical foundations.
linear queriesreconstruction errorjung's theoremlie groupquery complexity
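The constant in result (1) can be evaluated directly. It also coincides with the Jung's theorem radius for a set of diameter 2δ, since 2δ·√(d/(2(d+1))) = √(2d/(d+1))·δ. Reading the bound this way is an interpretation of the summary, not a statement taken from the paper. A small sketch, with a hypothetical function name:

```python
import math

def asymptotic_error(d, delta):
    """Limiting reconstruction error sqrt(2d / (d + 1)) * delta as T grows."""
    return math.sqrt(2.0 * d / (d + 1.0)) * delta

# The factor starts at 1 for d = 1 and increases toward sqrt(2) as d grows.
for d in (1, 2, 3, 10, 100):
    print(d, asymptotic_error(d, 1.0))
```

So in one dimension the limiting error equals the noise level δ, and in high dimension it stays below √2·δ.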
Diffusion Graph Posterior Sampling for Nonlinear Inverse Problems with Application to Electrical Impedance Tomography
The paper introduces a graph-based diffusion framework for solving nonlinear inverse problems in PDEs, specifically electrical impedance tomography (EIT). The method extends diffusion posterior sampling (DPS) to unstructured meshes via an unconditional score-based diffusion model on 2D triangular meshes, supplemented by a regularized variant (RDPS) incorporating total variation and Tikhonov terms. Experiments on synthetic and real EIT data show RDPS achieves stable, physically plausible reconstructions, outperforming GPnP-BM3D and DP-SGS in accuracy and noise robustness while generalizing to out-of-distribution geometries.
diffusion posterior samplingelectrical impedance tomographyunstructured meshesscore-based diffusioninverse problems
A Family of Divergence Measures for Evaluating the Reconstruction Quality of Explainable Ensemble Trees
The authors propose a family of divergence measures for evaluating reconstruction quality in Explainable Ensemble Trees (E2Tree), addressing limitations of correlation-based validation. Their framework introduces the normalized Loss of Interpretability (nLoI), a Cressie-Read power divergence (λ=2) measure with closed-form decomposition into within-node and between-node components, enabling precise diagnostic analysis. Four complementary metrics capture distinct structural facets, supported by a unified permutation testing procedure. Theoretical analysis establishes boundedness and symmetry, while empirical evaluations on three benchmarks demonstrate superior detection of reconstruction fidelity gradients compared to correlation-based methods.
explainable ensemble treescressie-read divergencereconstruction fidelitypermutation testinginterpretability loss
Posterior Contraction of Lévy Adaptive B-spline Regression in Besov Spaces
The study establishes nearly minimax-optimal posterior contraction rates, up to a logarithmic factor, for the Lévy Adaptive B-spline (LABS) regression model in Besov spaces. LABS extends the Lévy Adaptive Regression Kernel (LARK) framework by incorporating B-spline kernels with independently defined knots, enabling adaptation to irregular and locally structured features. Theoretical results are complemented by simulations on standard Besov test functions (Blocks, Bumps, HeaviSine, Doppler), demonstrating practical utility while automatically adapting to unknown smoothness.
b-spline regressionbesov spacesposterior contractionnonparametric bayesianminimax-optimal rates
Physics-Informed Graph Neural Network Surrogates for Turbulent Nanoparticle Dispersion in Dental Clinical Environments
The paper introduces ELGIN, a physics-informed graph neural network surrogate for simulating turbulent nanoparticle dispersion in dental clinics. The model combines a multi-head Graph Transformer with Lagrangian particle tracking and symplectic integration, using a four-stage curriculum for stable autoregressive rollouts. Compared to a Lagrangian-only baseline, ELGIN reduces mean parcel displacement error from 19.56% to 16.20% and cloud radius-of-gyration error from 9.85% to 6.58%, while achieving 37x speedup over traditional CFD methods. The approach enables real-time infection-risk screening in clinical environments.
graph neural networkturbulent dispersionphysics-informed learningautoregressive rolloutsymplectic integrator
Online Market Making and the Value of Observing the Order Book
The paper introduces an action-dependent feedback model for online market making, where observing the order book provides partial information about supply and demand when trades don't occur. For stochastic i.i.d. prices, the authors propose an elimination-based algorithm achieving O(√T) high-probability regret without smoothness assumptions on trader valuations. They extend this to mean-reverting processes (both local autoregressive dynamics and global drift conditions) while maintaining O(√T) regret, and present an explore-then-perturb algorithm for adversarial settings with O(T^{2/3}) expected regret. The results demonstrate improved learnability compared to standard bandit feedback models.
online market makingaction-dependent feedbackregret boundsmean-reverting processeslimit order book
HiLiftAeroML: High-Fidelity Computational Fluid Dynamics Dataset for High-Lift Aircraft Aerodynamics
The authors introduce HiLiftAeroML, the first open-source high-fidelity CFD dataset for high-lift aircraft aerodynamics, targeting AI surrogate model development. The dataset comprises 1800 samples from 180 geometry variants and 10 angles of attack for the NASA Common Research Model, generated using GPU-accelerated explicit wall-modeled LES with solution-adapted grids (300M-500M cells). Results include time-averaged volume/surface variables and integral forces, released under CC-BY-4.0 to accelerate aerospace AI research.
computational fluid dynamicshigh-lift aerodynamicsles simulationsurrogate modelingnasa crm
Learning-Accelerated Optimization-based Trajectory Planning for Cooperative Aerial-Ground Handover Missions
The paper introduces a learning-augmented trajectory planning framework for UAV-UGV handover missions, combining neural surrogate planning with centralized optimization. A decoupled encoder-decoder LSTM network predicts initial trajectories from task specifications, accelerating downstream optimization. Evaluations show a threefold speedup and 100% optimization success rate compared to cold starts, demonstrating efficient, feasible trajectory generation for multi-robot systems.
trajectory planninguav-ugv cooperationlstm networksoptimization accelerationmulti-robot systems
Density-Ratio Losses for Post-Hoc Learning to Defer
The paper introduces density-ratio losses for post-hoc Learning to Defer (L2D), framing deferral decisions as density-ratio estimation between model and expert ideal distributions. The method derives DR CPE losses for L2D scorers via reduction from density-ratio to class-probability estimation, enabling adjustable deferral rates without retraining. Theoretical analysis shows connections to Chow's rule and expert-tilted Bayes posteriors. Experiments demonstrate competitive performance against baselines and robustness across datasets, positioning post-hoc L2D as density-ratio learning between ideal distributions.
learning to deferdensity-ratio estimationpost-hoc learningchow's rulebayes posterior
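Two ideas from the Learning to Defer item lend themselves to a short illustration: Chow's rule, and changing the deferral rate after training by moving a threshold on a fixed score. The sketch below shows the textbook forms only. It does not reproduce the paper's DR CPE losses, and the function names and the convention that higher scores mean "defer" are assumptions.

```python
def chow_defer(posteriors, cost):
    """Chow's rule: abstain when the top class posterior is below 1 - cost.

    `cost` is the price of deferring, with a misclassification costing 1.
    """
    return max(posteriors) < 1.0 - cost

def threshold_for_rate(scores, target_rate):
    """Pick a threshold so roughly `target_rate` of inputs are deferred.

    Inputs with score >= threshold are deferred; ties can push the
    realized rate slightly above the target.
    """
    n_defer = round(target_rate * len(scores))
    if n_defer == 0:
        return float("inf")  # defer nothing
    ranked = sorted(scores, reverse=True)
    return ranked[n_defer - 1]

def deferred(scores, threshold):
    return [s >= threshold for s in scores]
```

Because only the threshold changes, the scorer itself is left untouched, which is the sense in which the deferral rate is adjustable without retraining.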
Provable Fairness Repair for Deep Neural Networks
The paper introduces ProF, a provable fairness repair framework for deep neural networks (DNNs) addressing ethical concerns like individual discrimination. ProF leverages interval bound propagation to soundly capture model outputs over input neighborhoods, integrating fairness constraints into a Mixed-Integer Linear Programming (MILP) formulation for guaranteed repair. Evaluated on four benchmarks, ProF achieves up to 95.93% fairness generalization on datasets and 93.16% on the entire input space, with ~90% fairness improvement while supporting multiple sensitive attributes.
fairness repairinterval bound propagationmixed-integer linear programmingprovable guaranteessensitive attributes
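Interval bound propagation, which the ProF item relies on, can be sketched in a few lines. The code below pushes an input box through linear and ReLU layers and checks whether every point in the box receives the same decision. It covers only the bounding step, not the MILP repair. The network layout, the decision threshold at zero, and all function names are assumptions for illustration.

```python
def linear_bounds(lower, upper, weights, bias):
    """Sound output bounds of y = W x + b for x in the box [lower, upper]."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            if w >= 0:
                lo += w * l
                hi += w * u
            else:  # a negative weight swaps which endpoint is extreme
                lo += w * u
                hi += w * l
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi

def relu_bounds(lower, upper):
    """ReLU is monotone, so it maps endpoints to endpoints."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]

def output_bounds(lower, upper, layers):
    """Propagate a box through [(W, b), ...] with ReLU between layers."""
    for idx, (weights, bias) in enumerate(layers):
        lower, upper = linear_bounds(lower, upper, weights, bias)
        if idx < len(layers) - 1:
            lower, upper = relu_bounds(lower, upper)
    return lower, upper

def certified_same_decision(lower, upper):
    """True if a scalar logit keeps one sign over the whole input box."""
    return lower[0] > 0.0 or upper[0] < 0.0
```

If the box spans all values of a sensitive attribute for an otherwise fixed individual, a certified result means the decision cannot depend on that attribute inside the box. An uncertified result is inconclusive, because the bounds are sound but can be loose.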
The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility
This work identifies inference backends as a critical but underreported hyperparameter affecting LLM benchmark reproducibility. The authors survey 200 inference engines and analyze 35,000 ML publications, finding minimal reporting of inference stacks despite their diversity. Through controlled experiments with five backends (including vLLM, SGLang, and llama.cpp) across multiple models and benchmarks, they demonstrate that backend choice alone can alter scores by up to 16.6 percentage points and cause output divergence, which they trace to optimizations such as prefix caching, CUDA graphs, and logit processing defaults.
inference backendsbenchmark reproducibilitycuda graphsprefix cachinglogit processing
Boosting Text-to-Image Diffusion Models via Core Token Attention-Based Seed Selection
The paper introduces Attention-Based Seed Selection (ABSS), a training-free method to improve text-to-image diffusion models by ranking seeds based on cross-attention to core tokens during early denoising steps. ABSS operates at inference time, selecting top-k seeds without fixed thresholds or model modifications. Experiments on Stable Diffusion variants demonstrate consistent improvements in text-image alignment and visual quality across three benchmarks, validated by human preference metrics.
seed selectioncross-attentiondenoisingtext-to-imagestable diffusion
A dynamical systems view of training generative models and the memorization phenomenon
The paper provides a dynamical systems interpretation of memorization in generative models during SGD training, building on prior work on collapse phenomena and two-time-scale dynamics. By modeling the loss function with strongly and weakly coupled variables and leveraging the framework of Austin (2016), the authors formalize how constant-step SGD exhibits distinct time scales. This analysis, combined with the collapse model of Borkar (2025a) and the results of Azizian et al. (2024), explains memorization as prolonged output similarity during fine-tuning. The work unifies memorization, double descent, and collapse through a system-theoretic lens.
memorization phenomenontwo-time-scale dynamicsstochastic gradient descentgenerative modelscollapse phenomenon
Drifting Objectives for Refining Discrete Diffusion Language Models
The paper introduces TokenDrift, a method to refine discrete diffusion language models (DDLMs) by transferring drifting objectives from continuous to discrete domains. The approach lifts categorical predictions to soft-token features, applies anti-symmetric drifting in a frozen semantic space, and backpropagates stop-gradient feature targets to DDLM logits. Experiments with masked (MDLM) and uniform-state (DUO) diffusion backbones show significant improvements: 89% and 86% reductions in generation perplexity at 4 NFEs, respectively, compared to baselines.
discrete diffusiondrifting objectivesoft-token featuresanti-symmetricgeneration perplexity
Implicit Bias of Mirror Flow in Homogeneous Neural Networks: Sparse and Dense Feature Learning
The paper characterizes max-margin solutions induced by mirror flow in homogeneous neural networks through convex duality, deriving a balance equation for the horizon function governing margin formation. The analysis extends classical gradient flow results, providing convergence rates, norm growth estimates, and demonstrating how mirror maps influence solution geometry. Experiments on synthetic and vision datasets reveal: (1) non-homogeneous mirror maps can converge to identical max-margin solutions, (2) convergence exhibits extremely slow (including exponential) regimes, and (3) mirror maps induce diverse feature learning behaviors, from sparse to dense neuron activations.
mirror flowmax-marginhomogeneous networksfeature learningconvex duality
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
The paper introduces Contrastive Evidence Policy Optimization (CEPO), a reinforcement learning method that sharpens credit assignment in reasoning tasks by contrasting token-level preferences under correct versus incorrect answers. CEPO constructs a wrong-answer teacher from rejected rollouts without additional sampling, theoretically preserving safety guarantees while better identifying decisive reasoning steps versus filler tokens. Experiments on five multimodal mathematical reasoning benchmarks show CEPO improves average accuracy to 43.43% (2B) and 60.56% (4B) versus GRPO's 41.17% and 57.43%, while distribution-matching baselines (OPSD, SDPO) underperform due to information leakage.
reinforcement learningcredit assignmentpolicy optimizationmultimodal reasoningself-distillation
TIDE: Asymmetric Neural Circuits for Stabilized Temporal Inhibitory-Excitatory Dynamics
The paper introduces TIDE, a neuro-inspired architecture for stabilized neural dynamics, addressing stability limitations in Continuous Thought Machine (CTM) architectures. TIDE employs asymmetric Excitatory-Inhibitory (E-I) networks with Wilson-Cowan dynamics and lateral inhibition, ensuring stability via energy-based optimization and game-theoretic loss. It enforces Dale's principle and an 80:20 E-I ratio while incorporating Hierarchical Receptive Fields for biological realism. Theoretical proofs confirm convergence and stability, and empirical results show a +1.65% top-1 accuracy gain on ImageNet under perturbations while using <50% of CTM's training time.
neural dynamicswilson-cowan dynamicsdale's principlelateral inhibitionenergy-based systems
Neuron Incidence Redistribution for Fairness in Medical Image Classification
The paper introduces Neuron Incidence Redistribution (NIR), a regularization method to mitigate demographic disparities in medical image classification by redistributing latent disease evidence across penultimate-layer neurons. NIR penalizes variance in predicted-probability-weighted mean activations without requiring demographic labels. Evaluated on HAM10000 and Harvard OCT-RNFL datasets, NIR reduces TPR disparity from 10.81% to 0.93% (age) and 12.04% to 0.74% (gender), and FPR disparity from 15.68% to 10.66% (race) and 12.69% to 1.80% (age), while marginally improving AUC by 0.51 points.
fairnessmedical image classificationneuron incidence redistributionpenultimate-layer activationsdemographic disparity
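One way to read the NIR regularizer described above is as a variance penalty on per-neuron means, where each sample is weighted by its predicted probability. The sketch below implements that reading. The exact weighting, any per-class handling, and the normalization are not given in the summary, so they are assumptions here, as is the function name.

```python
def nir_penalty(activations, probs):
    """Variance across neurons of probability-weighted mean activations.

    activations: one list of penultimate-layer activations per sample.
    probs: the model's predicted disease probability for each sample.
    No demographic labels are used anywhere in the computation.
    """
    total_p = sum(probs)
    n_neurons = len(activations[0])
    # Weighted mean activation of each neuron over the batch.
    weighted_means = [
        sum(p * row[j] for p, row in zip(probs, activations)) / total_p
        for j in range(n_neurons)
    ]
    center = sum(weighted_means) / n_neurons
    # Large when a few neurons carry most of the disease evidence.
    return sum((m - center) ** 2 for m in weighted_means) / n_neurons
```

The penalty is zero when evidence is spread evenly across neurons and grows as it concentrates, so adding it to the training loss would encourage redistribution.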
Understanding Dynamics of Adam in Zero-Sum Games: An ODE Approach
The paper provides a theoretical analysis of Adam-DA in zero-sum games by deriving continuous-time ODE approximations of its discrete-time dynamics. Using this framework, the authors examine local convergence and implicit gradient regularization, revealing that momentum parameters exhibit opposite effects compared to minimization problems. Experimental validation on GANs across multiple architectures and datasets confirms these reversed momentum dynamics.
adam-dazero-sum gamesode approximationmomentum parametersimplicit regularization
Tweedie's Formulae and Diffusion Generative Models Beyond Gaussian
The authors extend Tweedie's formula to non-Gaussian diffusion processes, enabling denoising score-matching objectives for geometric Brownian motion (GBM), squared Bessel (BESQ), and Cox-Ingersoll-Ross (CIR) processes. This theoretical advancement facilitates score-based generative modeling beyond Gaussian noise assumptions. The derived formulae are empirically validated on image generation, financial time series modeling, and empirical Bayes estimation, demonstrating competitive performance with non-Gaussian diffusion models. Results indicate particular promise for GBM- and CIR-based approaches in their respective domains.
tweedie's formulanon-gaussian diffusiondenoising score matchinggeometric brownian motioncox-ingersoll-ross process
Accurate, Efficient, and Explainable Deep Learning Approaches for Environmental Science Problems
This dissertation develops three deep learning approaches for environmental science challenges. First, WaLeF and FIDLAr improve flood prediction and management in coastal systems, outperforming baselines in accuracy and efficiency while providing interpretability. Second, CoDiCast, a conditional diffusion model, enables probabilistic weather forecasting with explicit uncertainty quantification. Third, Hypercube-RAG enhances scientific QA by combining retrieval-augmented generation with a structured text cube framework, simultaneously improving accuracy, efficiency, and explainability. Evaluations demonstrate effectiveness in flood-prone regions and global weather prediction tasks.
water level forecastingconditional diffusion modelretrieval-augmented generationprobabilistic forecastinginterpretable deep learning
Scalable, Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection
The authors propose a hybrid digital-analog architecture for scalable deepfake video detection, combining a lightweight digital front-end with a spatially multiplexed optical back-end using programmable spatial light modulators. The system processes 15+ video streams in parallel via optical propagation, achieving 97.79% accuracy on Celeb-DF with 99.86% sensitivity and 95.72% specificity while reducing computational costs versus digital methods. Experimental validation demonstrates robustness to video degradation, noise, compression, and adversarial attacks, highlighting simultaneous improvements in throughput, energy efficiency, and adversarial resilience.
optical computationspatial multiplexingdeepfake detectionanalog inferenceadversarial robustness
MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification
The authors propose MAM-CLIP, a vision-language model for BI-RADS classification in mammography, leveraging contrastive pretraining on 2313 image-text pairs from mammography atlases. The method combines a PubMedBERT language encoder with a vision encoder, pretrained via image-text alignment to capture rich textual descriptors, then fine-tuned for BI-RADS prediction. Results show consistent improvements over image-only baselines, with F1-score gains of +1% (40K samples) to +14% (1K samples), and demonstrate that 2K image-text pairs outperform 2K labeled samples by +1.1% when >10K training samples are available. The work releases preprocessed TEKNOFEST data and model artifacts.
bi-radscontrastive learningvision-language pretrainingmammography atlasespubmedbert
CompoSE: Compositional Synthesis and Editing of 3D Shapes via Part-Aware Control
CompoSE introduces a novel method for compositional synthesis and editing of 3D shapes via part-aware control, enabling localized granular editing of individual parts. The approach uses a diffusion transformer architecture that alternates between local part processing and global context aggregation, with a novel conditioning technique for strong user input adherence. It infers part semantics and symmetries from coarse geometric primitives without requiring part-level text prompts. Experiments show superior performance in guided synthesis, with capabilities including part substitution, addition, deletion, and style-preserving resizing, validated by objective metrics and LLM-based evaluations.
compositional synthesisdiffusion transformerpart-aware control3d shape editinggeometric primitives
What Makes a Representation Good for Single-Cell Perturbation Prediction?
The paper introduces PerturbedVAE, a framework addressing signal imbalance in single-cell perturbation modeling by separating perturbation-specific information from invariant structure. It employs causal representation learning to recover sparse perturbation effects, supported by identifiability analysis for reliable recovery conditions. Empirical results demonstrate state-of-the-art performance on benchmark tasks, with notable improvements in out-of-distribution combinatorial predictions and interpretable perturbation-response programs.
perturbedvaesingle-cellrepresentation learningidentifiabilitysparsity
An Exterior Method for Nonnegative Matrix Factorization
The authors propose an exterior method for nonnegative matrix factorization (eNMF) that decouples low-rank approximation from nonnegativity enforcement, contrasting with traditional interior approaches. The method initializes from optimal unconstrained factorization and employs a rotation procedure to map factors to exterior points near the nonnegative orthant, yielding KKT-satisfying stationary points on the boundary. Experiments across 400 NMF trials show 99% convergence to equivalent factor matrices, with eNMF outperforming 81 competitor configurations by achieving 30% lower reconstruction error and 150% speedup. Downstream applications in audio processing and recommendation systems demonstrate practical benefits.
nonnegative matrix factorizationexterior optimizationlow-rank approximationkkt conditionsorthogonal transformations
BrainDyn: A Sheaf Neural ODE for Generative Brain Dynamics
BrainDyn introduces a sheaf neural ODE model for generating brain-like dynamics on structured graphs, addressing limitations of LLMs and RNNs in anatomical alignment and graph networks in expressiveness. The method combines LSTM-based activity history encoding with learnable restriction maps and a sheaf Laplacian for message passing, integrated with a neural ODE for continuous-time evolution. Evaluated on resting-state fMRI (PNC), EEG with epilepsy (TUSZ), and NEST spiking simulations, BrainDyn demonstrates strong forecasting and supports in silico perturbation prediction.
sheaf neural odebrain dynamicsrestriction mapssheaf laplacianin silico perturbation
A Unified Framework for Structure-Aware Clustering and Heterogeneous Causal Graph Learning
The paper proposes DAG-DC-ADMM, a unified framework for jointly learning cluster assignments and cluster-specific dependency structures in multivariate systems with heterogeneous causal relationships. The method combines Structural Equation Modeling (SEM) with a groupwise truncated Lasso fusion penalty (gTLP) to enforce structural similarity within clusters, while incorporating sparsity and acyclicity constraints via a smooth formulation. An adapted ADMM algorithm solves the resulting nonconvex optimization problem, with convergence guarantees to KKT points for certain graph structures. Experiments show the method achieves high true positive rates and low false discovery rates in recovering cluster-specific causal dependencies.
structural equation modelingdirected acyclic graphsalternating direction method of multipliersheterogeneous causal learningnonconvex optimization
An Objective Performance Evaluation of the LSTM Networks in Time Series Classification
This paper presents a framework for objectively evaluating LSTM networks against model-based approaches in time-series classification. The study compares an LSTM classifier with an expectation maximization (EM) classifier on binary classification tasks using scalar linear Gaussian state space models, with the Kalman filter likelihood ratio test as a reference. Monte Carlo simulations reveal that the EM classifier outperforms LSTM when data conform to the model structure, while LSTM requires larger noise separation for reliable classification and underperforms the reference in measurement noise scenarios regardless of sequence length or training size.
lstm networks · time-series classification · expectation maximization · kalman filter · monte carlo simulations
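The reference classifier mentioned above can be sketched for a scalar linear Gaussian state-space model. This is a minimal illustration, not the paper's experimental setup: the model x_t = a·x_{t-1} + w_t, y_t = x_t + v_t, the parameter values, and the function names are all assumptions.

```python
import math
import random

def kalman_loglik(ys, a, q, r, p0=1.0):
    """Log-likelihood of ys under x_t = a*x_{t-1} + w_t, y_t = x_t + v_t,
    with Var(w_t) = q, Var(v_t) = r and prior x_0 ~ N(0, p0)."""
    x, p, ll = 0.0, p0, 0.0
    for y in ys:
        x_pred = a * x                 # predicted state
        p_pred = a * a * p + q         # predicted state variance
        s = p_pred + r                 # innovation variance
        e = y - x_pred                 # innovation
        ll += -0.5 * (math.log(2.0 * math.pi * s) + e * e / s)
        k = p_pred / s                 # Kalman gain
        x = x_pred + k * e
        p = (1.0 - k) * p_pred
    return ll

def llr(ys, model1, model0):
    """Log-likelihood ratio; positive values favour model1."""
    return kalman_loglik(ys, **model1) - kalman_loglik(ys, **model0)

def simulate(n, a, q, r, rng):
    """Draw n observations from the model, starting from x = 0."""
    x, ys = 0.0, []
    for _ in range(n):
        x = a * x + rng.gauss(0.0, math.sqrt(q))
        ys.append(x + rng.gauss(0.0, math.sqrt(r)))
    return ys

# Two hypothetical classes that differ only in measurement noise.
low_noise = {"a": 0.9, "q": 0.1, "r": 0.5}
high_noise = {"a": 0.9, "q": 0.1, "r": 2.0}
ys = simulate(200, rng=random.Random(0), **high_noise)
decision = "high_noise" if llr(ys, high_noise, low_noise) > 0 else "low_noise"
```

When the data really come from one of the two models, this test uses the exact likelihoods, which is why it serves as a natural yardstick for learned classifiers.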
A Two-Phase Adaptive Balanced Penalty Method for Controllable Pareto Front Learning under Split Feasibility Conditions
The paper proposes a two-phase Adaptive Balanced Penalty (ABP) method for Controllable Pareto Front Learning under split feasibility conditions, reformulating the problem as a Bi-Level Scalarized Split Problem. ABP combines three gradient components (optimality, set feasibility, image feasibility) via an adaptive indicator, and its convergence is proved using convex surrogates. The method is implemented as ABP-HyperNet for Hyper-MLP and HyperTrans architectures and evaluated with a new Expected Feasible Hypervolume (EFHV) metric. Experiments on five multi-objective benchmarks and three multi-task datasets show ABP-HyperNet achieves 2.3× higher EFHV than baselines, improving feasibility from 36-49% to 87-100%.
pareto front learning · split feasibility · hypernetwork · bi-level optimization · feasible hypervolume
Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes
The paper introduces a triangulation-agnostic flow matching (FM) method for generating signals over triangle meshes, employing a Matérn process as a noise distribution to ensure triangulation invariance. The approach adapts FM to meshes by using PoissonNet, a state-of-the-art gradient-domain learning model, as the denoiser. Experiments demonstrate the method's efficacy in generating realistic elastic rest states and humanoid poses on meshes exceeding one million triangles, outperforming existing techniques in quality and diversity.
flow matching · matérn process · triangulation-agnostic · poissonnet · gradient-domain
Cross-Paradigm Knowledge Distillation: A Comprehensive Study of Bidirectional Transfer Between Random Forests and Deep Neural Networks for Big Data Applications
This paper introduces the first comprehensive study of bidirectional knowledge distillation between Random Forests (RF) and Deep Neural Networks (DNN), addressing a gap in cross-paradigm transfer. The authors propose novel methods including progressive multi-stage distillation, multi-teacher ensemble distillation, and uncertainty-aware transfer mechanisms. Evaluated across 144 experiments on 6 datasets, their approach achieves 98.13% classification accuracy (NN-COMPACT) and 92.6% R^2 score (NN-WIDE), demonstrating complementary benefits of interpretability (RF) and expressiveness (DNN) while enabling flexible deployment in big data environments.
knowledge distillation · random forests · deep neural networks · multi-teacher ensemble · interpretable ai
Domain-Adaptive Communication-Rate Optimization for Sim-to-Real Humanoid-Robot Wireless XR Teleoperation
The paper proposes a domain-adaptive communication-rate optimization framework for sim-to-real wireless XR teleoperation of humanoid robots, minimizing communication energy while maintaining motion trajectory reconstruction accuracy. The method integrates sampling, transmission, interpolation, and reconstruction, employing dimension-wise sampling-rate control and a PPO algorithm with density-ratio weighting and trust-region regularization for sim-to-real adaptation. Experiments on a public humanoid teleoperation dataset demonstrate improved tradeoffs between reconstruction error and energy consumption under distribution shift, with analysis across varying wireless channels and dynamic trajectories.
sim-to-real · wireless xr teleoperation · proximal policy optimization · density-ratio estimation · communication-rate optimization
Factor Augmented High-Dimensional SGD
The paper introduces Factor-Augmented SGD (FSGD), a novel optimization method for high-dimensional learning that leverages latent factor representations on streaming data, eliminating the need for offline preprocessing. Unlike traditional two-stage approaches, FSGD operates purely online, enhancing scalability. The authors provide the first theoretical framework incorporating latent factor estimation error into SGD analysis, proving moment convergence in ℓ^s norm under decaying step sizes and mini-batch updates. This work establishes a foundation for reliable, scalable SGD in high-dimensional systems.
stochastic gradient descent · latent factor · high-dimensional learning · moment convergence · streaming data
Language models struggle with compartmentalization
The study demonstrates that large language models (LLMs) exhibit compartmentalization, failing to share statistical strength between distinct presentations of unified concepts (e.g., multilingual or multi-representational data). Through empirical analysis, the authors show that LLMs often learn parallel internal representations for each presentation, saturating capacity and reducing sample efficiency. Synthetic parallel data fails to mitigate this issue, and early multilingual learning in small models appears highly compartmentalized. Interventions exhibit phase transitions in effectiveness based on presentation count, suggesting inconsistent representation unification under the language modeling objective.
compartmentalization · statistical strength · parallel representations · sample efficiency · phase transition
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
The paper introduces Pion, a modified version of the Muon optimizer that addresses spectral whitening limitations in vision-language-action (VLA) training and reinforcement learning with verifiable rewards (RLVR). Pion replaces Muon's uniform spectral whitening with a high-pass Newton-Schulz iteration, promoting dominant singular values while suppressing noisy components, and supports per-head updates for preserving attention-head heterogeneity. Experiments on LIBERO and LIBERO-Plus show Pion achieving 100% success rate in VLA tasks versus 97.0% for Muon and 32.2% for AdamW, with similar gains in RLVR on Qwen3-1.7B/4B models.
spectral whitening · newton-schulz iteration · vision-language-action · reinforcement learning · attention heads
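To make "uniform spectral whitening" concrete, the sketch below runs the classical cubic Newton-Schulz iteration, X ← 1.5·X − 0.5·X·Xᵀ·X, which pushes every singular value of a full-rank matrix toward 1. It illustrates the behaviour that Pion is said to replace. Pion's high-pass variant is not specified in the summary and is not reproduced here, and practical Muon implementations may use a different tuned polynomial. The helper names and the toy gradient are assumptions.

```python
def matmul(a, b):
    """Matrix product for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=30):
    """Approximate the orthogonal polar factor of a full-rank matrix g."""
    # Dividing by the Frobenius norm puts all singular values in (0, 1],
    # inside the convergence region of the cubic iteration.
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxt_x)]
    return x

# Hypothetical 2x2 "gradient" with unequal singular values.
grad = [[3.0, 1.0], [0.0, 2.0]]
whitened = newton_schulz(grad)
gram = matmul(whitened, transpose(whitened))  # close to the identity
```

After whitening, large and small singular directions contribute equally to the update, which is the property the paper argues is harmful outside pretraining.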
Do Better Volatility Forecasts Lead to Better Portfolios? Evidence from Graph Neural Networks
The paper demonstrates that volatility forecasting accuracy, cross-sectional ranking quality, and portfolio performance are distinct objectives in financial machine learning. Using weekly realized volatility data from 465 S&P 500 equities (2015-2025), the authors compare Heterogeneous Autoregressive, LSTM, and GraphSAGE models across correlation, sector, and Granger-causal graphs with macro regime features. Results show the best MSE, ranking accuracy, and Sharpe ratio metrics come from different models, indicating graph-based approaches only benefit portfolio rules that exploit their encoded cross-sectional structure.
realized volatility · graph neural networks · sharpe ratio · cross-sectional ranking · granger-causality
OpenCompass: A Universal Evaluation Platform for Large Language Models
The paper introduces OpenCompass, a modular and scalable evaluation platform for large language models (LLMs) addressing challenges in cross-domain benchmarking. The system features five core components: Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module, supporting rule-based, LLM-as-a-Judge, and cascaded evaluation methods. OpenCompass provides unified evaluation across multiple domains (knowledge, reasoning, computation, science, language, code) with high compatibility, flexibility, and concurrency, enabling efficient identification of LLM capabilities and optimization pathways.
large language models · benchmark evaluation · modular architecture · task partitioning · high-concurrency
CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
CODA introduces a GPU kernel abstraction that reformulates Transformer block computations as GEMM-plus-epilogue programs, addressing memory-bound bottlenecks in training systems. The method algebraically reparameterizes normalization, activations, and residual updates to execute during GEMM tile retention on-chip, using composable epilogue primitives for scaling, reductions, and accumulation. Evaluations on Transformer workloads show that both human- and LLM-authored CODA kernels achieve high performance, demonstrating the approach's efficacy in combining framework productivity with hardware efficiency.
transformer · gemm · epilogue · kernel · memory-bound
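One algebraic identity of the kind described above can be checked in a few lines: RMSNorm followed by a linear layer equals a plain GEMM against gain-folded weights followed by a per-row scaling. The sketch is illustrative only. Whether CODA uses exactly this rewrite is an assumption, and the function names are hypothetical.

```python
import math

def rmsnorm_then_linear(x, gain, w, eps=1e-6):
    """Reference: normalize each row, apply gain, then multiply by w."""
    out = []
    for row in x:
        rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
        normed = [v / rms * g for v, g in zip(row, gain)]
        out.append([sum(n * w[k][j] for k, n in enumerate(normed))
                    for j in range(len(w[0]))])
    return out

def gemm_with_epilogue(x, gain, w, eps=1e-6):
    """Same result: GEMM on raw inputs, normalization applied afterwards."""
    # Fold the gain into the weights once, i.e. diag(gain) @ w.
    w_folded = [[g * wv for wv in w_row] for g, w_row in zip(gain, w)]
    out = []
    for row in x:
        # Main loop: plain matrix product on the un-normalized row.
        acc = [sum(v * w_folded[k][j] for k, v in enumerate(row))
               for j in range(len(w[0]))]
        # Epilogue: one row reduction and one scaling of the output tile.
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([a * inv_rms for a in acc])
    return out

x = [[1.0, 2.0, 3.0], [-1.0, 0.5, 4.0]]
gain = [0.5, 1.0, 2.0]
w = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6]]
reference = rmsnorm_then_linear(x, gain, w)
fused = gemm_with_epilogue(x, gain, w)
```

The fused form never materializes the normalized activations, which is where the memory-traffic saving would come from on a GPU.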
From Simple to Complex: Curriculum-Guided Physics-Informed Neural Networks via Gaussian Mixture Models
The authors propose Curriculum-Guided Gaussian Mixture Physics-Informed Neural Networks (CGMPINN), a method combining Gaussian mixture modeling with dynamic curriculum learning to address training challenges in PINNs for PDEs. The approach periodically fits a GMM to residual distributions, implements a smooth curriculum schedule for progressive difficulty adaptation, and employs precision-based variance modulation. Theoretical analysis includes convergence guarantees and generalization bounds. Experiments on six PDE benchmarks demonstrate CGMPINN reduces relative L2 error by up to 97.8% compared to standard PINNs.
physics-informed neural networks · gaussian mixture model · curriculum learning · partial differential equations · adaptive optimization
Backdooring Masked Diffusion Language Models
The paper introduces SHADOWMASK, the first training-time backdoor attack for masked diffusion language models (MDLMs), addressing their unique discrete corruption and iterative denoising mechanics. The method modifies the forward process by replacing the standard all-mask terminal distribution with a trigger-mask mixture prior, creating a dedicated denoising pathway for trigger inputs while preserving clean performance. Evaluations on DiT-based MDLM and LLaDA-8B-Instruct across WikiText-103, OpenWebText, and Alpaca demonstrate near-100% attack success, minimal clean utility degradation, and robustness against fine-tuning and defenses.
masked diffusion language models · backdoor attack · denoising pathway · trigger-mask mixture · parameter-efficient fine-tuning
Beyond Extrapolation: Knowledge Utilization Paradigm with Bidirectional Inspiration for Time Series Forecasting
The paper proposes KUP-BI, a novel time-series forecasting paradigm that leverages bidirectional structural knowledge by approximating post-target continuation proxies from training data. The method distills continuation-style knowledge from historical trajectories and integrates it into standard forecasting models via a lightweight feature-level gating module, avoiding reliance on parametric extrapolation. Experiments on six public datasets demonstrate consistent performance improvements across state-of-the-art models with minimal computational overhead.
time-series forecasting · bidirectional inspiration · continuation proxy · structural knowledge · feature-level gating
GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning
The paper introduces Q-boosting, a variance-reduced advantage estimator for imperfect-information self-play reinforcement learning, addressing the high variance in generalized advantage estimation (GAE) caused by stochastic action sampling. The proposed Variance-Reduced Policy Optimization (VRPO) combines this estimator with a multi-step Expected SARSA(λ) trace to compute policy expectations, reducing action-sampling noise while retaining PPO's clipped objective and on-policy updates. Empirical results demonstrate VRPO's strong performance in large-scale games like Dou Dizhu and Heads-Up No-Limit Texas Hold'em.
generalized advantage estimation · self-play reinforcement learning · variance reduction · imperfect-information games · proximal policy optimization
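For context, the baseline estimator that the paper critiques can be written in a few lines. This is standard GAE, not Q-boosting or VRPO, whose details the summary does not give. The function name and the toy trajectory are assumptions.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    values must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final reward.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error built from the sampled reward.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical 3-step trajectory.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]
monte_carlo = gae(rewards, values, gamma=1.0, lam=1.0)  # full-return advantages
one_step = gae(rewards, values, gamma=1.0, lam=0.0)     # plain TD errors
```

Every term depends on the rewards and states reached by the sampled actions, which is the source of the action-sampling noise the summary refers to.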
Quantum Machine Learning for Cyber-Physical Anomaly Detection in Unmanned Aerial Vehicles: A Leakage-Free Evaluation with Proxy-Audited Feature Sets
The study presents a leakage-free evaluation of quantum machine learning for anomaly detection in unmanned aerial vehicles (UAVs) using the TLM:UAV benchmark. Key contributions include (i) a group-aware temporal protocol (B2) for dataset partitioning, (ii) a three-mode feature audit to quantify accuracy sources, and (iii) a hybrid XGBoost + Data Reuploading (DRU) classifier benchmarked against classical controls. Results show the trained-DRU hybrid exhibits a directional F1 macro improvement (+0.05) under strict feature auditing and the lowest mean false-alarm rate, though inter-seed variance limits statistical significance. The implementation is provided in Qiskit 2.x for NISQ-era aerospace cybersecurity.
quantum machine learning · anomaly detection · unmanned aerial vehicles · data reuploading · nisq-era
DeRegiME: Deep Regime Mixtures for Probabilistic Forecasting under Distribution Shift
DeRegiME introduces a deep regime mixture of experts for probabilistic forecasting under distribution shift, separating latent uncertainty regimes from the underlying signal via a sparse variational Gaussian process with a nonstationary regime-mixing kernel and Student-t likelihood. The method employs a shared gate to combine per-regime sub-kernels and noise processes, yielding an interpretable mean-residual-noise decomposition and regime transitions as implicit changepoints. Evaluated across ten benchmarks, DeRegiME improves negative log predictive density by 20.3%, CRPS by 3.0%, and MSE by 4.7% over encoder-matched baselines, demonstrating consistent gains across abrupt, gradual, and seasonal shifts.
probabilistic forecasting · gaussian process · distribution shift · regime mixture · sparse variational inference
Robust Mitigation of Age-Dependent Confounding Effects via Sample-Difficulty Decorrelation
The paper proposes a framework to mitigate age-dependent confounding in medical image classification by decorrelating sample difficulty from age trends, preserving diagnostically meaningful age information. The method employs a warm-up phase to model label-conditioned age-difficulty relationships, then applies Huber-weighted affinity weights for robust decorrelation, supplemented by an Age Coverage Score for stable optimization under limited age diversity. Evaluated on two radiology datasets, the approach reduces age-dependent disparities in true/false positive rates by 15-30% with <1% AUC impact, demonstrating robustness to train-test age distribution shifts.
confounding mitigation · sample-difficulty decorrelation · huber weighting · age coverage score · label-conditioned modeling
Worst-Group Equalized Odds Regularization for Multi-Attribute Fair Medical Image Classification
The paper introduces a worst-group equalized-odds margin regularizer for multi-attribute fair medical image classification, addressing disparities in true and false positive rates across demographic subgroups at fixed operating points. The method identifies extreme margin deviations in subgroups defined by attributes like age, sex, and race, applying a unified penalty without intersectional constraints. Evaluated on two medical imaging datasets in multi-label settings, it reduces Equalized Odds and Equalized Opportunity disparities while maintaining AUC performance.
equalized odds · multi-attribute fairness · margin regularizer · medical image classification · subgroup disparities
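The disparity being targeted can be measured as follows. The sketch computes, for each attribute separately (no intersections), the largest gap in true positive rate and false positive rate across that attribute's groups. It shows only the evaluation metric, not the paper's margin regularizer. The record layout and function names are assumptions, and every group is assumed to contain both positive and negative cases.

```python
from collections import defaultdict

def group_rates(records, attribute):
    """Return {group: (TPR, FPR)} for one demographic attribute."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for rec in records:
        c = counts[rec[attribute]]
        if rec["y"] == 1:
            c["tp" if rec["pred"] == 1 else "fn"] += 1
        else:
            c["fp" if rec["pred"] == 1 else "tn"] += 1
    return {g: (c["tp"] / (c["tp"] + c["fn"]), c["fp"] / (c["fp"] + c["tn"]))
            for g, c in counts.items()}

def equalized_odds_gaps(records, attributes):
    """Return {attribute: (max TPR gap, max FPR gap)} across its groups."""
    gaps = {}
    for attr in attributes:
        rates = list(group_rates(records, attr).values())
        tprs = [r[0] for r in rates]
        fprs = [r[1] for r in rates]
        gaps[attr] = (max(tprs) - min(tprs), max(fprs) - min(fprs))
    return gaps

# Hypothetical predictions at a fixed operating point.
rows = [("F", "old", 1, 1), ("F", "young", 1, 0), ("F", "old", 0, 0),
        ("F", "young", 0, 0), ("M", "old", 1, 1), ("M", "young", 1, 0),
        ("M", "old", 0, 1), ("M", "young", 0, 0)]
records = [{"sex": s, "age": a, "y": y, "pred": p} for s, a, y, p in rows]
gaps = equalized_odds_gaps(records, ["sex", "age"])
worst_gap = max(max(pair) for pair in gaps.values())
```

A worst-group objective would penalize the attribute and rate with the largest gap rather than an average over all groups.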
Precision Physical Activity Prescription via Reinforcement Learning for Functional Actions
The paper introduces a reinforcement learning (RL) algorithm to optimize personalized physical activity (PA) distributions for health biomarkers, using step count data from the All of Us Research Program. The method addresses the lack of PA recommendation systems by modeling daily step distributions as continuous actions in an offline RL framework. Results show superior performance over existing continuous-action RL methods, with optimal policies recommending higher and more consistent step counts, tailored to subgroups based on glucose levels, BMI, blood pressure, age, and sex.
reinforcement learning · physical activity · biomarkers · offline learning · personalized recommendation
Sequential Consensus for Multi-Agent LLM Debates: A Wald-SPRT compute governor with calibration-based failure detection
The paper introduces a Sequential Probability Ratio Test (SPRT)-based compute governor for multi-agent LLM debates that terminates a debate as soon as consensus is reached or the maximum number of rounds (R_max) is exhausted. The method employs a Beta-distributed LLM judge score to estimate convergence likelihood, with calibration ensuring domain-specific validity. Evaluations on GSM8K and MMLU show 3.7x fewer LLM calls (1.01 average rounds) at 97.0% accuracy versus fixed-5 debates (99.0%), while MMLU reveals calibration failure (99.5% capping at 2.1x cost). The SPRT layer reduces compute but provides no accuracy guarantee.
sequential probability ratio test · multi-agent llm debate · beta likelihood · compute governor · calibration-based failure detection
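A Wald SPRT with a Beta likelihood can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's governor. The two Beta hypotheses (scores concentrated near 1 under consensus, spread around 0.5 otherwise), the error rates, the return labels, and the function names are all hypothetical.

```python
import math

def beta_logpdf(x, a, b):
    """Log-density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x) - log_norm

def sprt_governor(scores, h1=(8.0, 2.0), h0=(2.0, 2.0),
                  alpha=0.05, beta=0.05, r_max=5):
    """Stop a debate early based on per-round judge scores in (0, 1).

    Returns (decision, rounds_used). h1 and h0 are assumed Beta parameters
    for "consensus" and "no consensus"; alpha and beta are the target
    false-accept and false-reject rates in Wald's thresholds.
    """
    upper = math.log((1.0 - beta) / alpha)
    lower = math.log(beta / (1.0 - alpha))
    llr, rounds = 0.0, 0
    for score in scores[:r_max]:
        rounds += 1
        llr += beta_logpdf(score, *h1) - beta_logpdf(score, *h0)
        if llr >= upper:
            return "consensus", rounds
        if llr <= lower:
            return "no_consensus", rounds
    return "budget_exhausted", rounds

confident = sprt_governor([0.95, 0.95, 0.95, 0.95, 0.95])
split = sprt_governor([0.3])
ambiguous = sprt_governor([0.7] * 8)
```

The thresholds control error rates only if judge scores really follow the assumed Beta laws, which is why calibration matters and why a miscalibrated domain can void the guarantee.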
A Cloud-Based Tool for Meteorite Recovery Using Drones and Machine Learning
The paper introduces a cloud-based tool integrating drones and machine learning for meteorite recovery, specifically targeting instrumentally observed falls. The system features iterative improvements over prior versions and has been tested in South and Western Australia. Results demonstrate both successes and limitations in field applications. The tool is accessible to the meteoritics community via https://find.gfo.rocks.
meteorite recovery · drones · machine learning · cloud-based tool · instrumentally observed falls
Activation Functions, Statistics and Learning of Higher-Order Interactions in Restricted Boltzmann Machines
The work analyzes how activation functions in Restricted Boltzmann Machines (RBMs) affect the representation and learning of higher-order interactions in data. By exploiting the duality between RBMs and interacting binary variable models, the authors characterize the space of representable models analytically for four activation functions: Linear, Step, ReLU, and Exponential. Results show that rapidly increasing nonlinearities (e.g., Exponential) facilitate learning of data with large higher-order interactions, while certain structures remain difficult to represent across all activation functions, with analytical predictions closely matching simulation outcomes.
restricted boltzmann machines · activation functions · higher-order interactions · nonlinearities · binary variables
Reducing Diffusion Model Memorization with Higher Order Langevin Dynamics
The paper theoretically characterizes how Higher-Order Langevin Dynamics (HOLD) reduces memorization in diffusion models by analyzing its regularization effect. HOLD introduces auxiliary variables (interpreted as velocity/acceleration) that impose dynamical constraints, causing the data variable's dynamics to follow a low-pass-filtered version of the learned score function. Theoretical analysis shows increased smoothness with higher-order HOLD, mitigating memorization risks while preventing distribution collapse. Empirical validation on real-world data confirms HOLD's advantage over standard diffusion models in reducing sample replication.
higher-order langevin dynamics · diffusion models · memorization · score function · distribution collapse
A Heuristic Approach for Performance Tuning in RL-based Quadrotor Control via Reward Design and Termination Conditions
The article presents a heuristic method for performance tuning in RL-based quadrotor control through reward design and termination conditions. A novel dual-bandwidth exponential reward structure enables critically damped setpoint tracking with low steady-state errors (∼2%), trained via Proximal Policy Optimization (PPO) in 6M time steps. Heuristic adjustments to reward weights and coefficients yield tunable settling times for acrobatic (fast) and inspection (slow) behaviors while maintaining baseline response characteristics. Evaluation across 100 trials demonstrates precise position/yaw tracking from random initial conditions.
reinforcement learning · quadrotor control · reward design · proximal policy optimization · setpoint tracking
Information Processing Capacity of Stationary Physical Systems: Theory, Data-efficient Estimation Methods, and Photonic Demonstration
The authors extend the Information Processing Capacity (IPC) framework to stationary physical computing systems and prove fundamental bounds: each individual capacity lies in [0,1], the sum of capacities is bounded by the readout count, and noise reduces this bound. They develop data-efficient IPC estimation methods using Richardson extrapolation and Sobol quasi-random sampling, addressing finite-sample bias. Experimental validation with a nonlinear optical fibre photonic system shows IPC shifts toward higher-order nonlinear capacities under Kerr effect modulation. Total IPC correlates strongly with benchmark ML task performance (the correlation coefficient is not specified), establishing it as a dimensionality measure linking physical dynamics to computational capability.
information processing capacity · physical computing · richardson extrapolation · kerr effect · quasi-random sampling
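Richardson extrapolation removes a leading bias term by combining estimates at two sample sizes. The sketch below shows the idea on a textbook case, the divide-by-N variance estimator, whose expected value is σ²(1 − 1/N). It does not compute IPC, and the assumption that the bias is of the form c/N is made only for this example.

```python
def richardson(estimate_n, estimate_2n):
    """Cancel a leading bias term c / N using estimates at N and 2N."""
    return 2.0 * estimate_2n - estimate_n

def expected_plugin_variance(sigma2, n):
    """Expected value of the biased (divide-by-n) variance estimator."""
    return sigma2 * (1.0 - 1.0 / n)

sigma2 = 4.0                                      # hypothetical true variance
coarse = expected_plugin_variance(sigma2, 50)     # biased low
fine = expected_plugin_variance(sigma2, 100)      # still biased, half as much
corrected = richardson(coarse, fine)              # bias term cancels
```

The correction costs only a second estimate at a different sample size, which is what makes it attractive when measurements on a physical system are expensive.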
PMF-CL: Pareto-Minimal-Forgetting Continual Learner for Conflicting Tasks
(No summary returned.)
Identifiable Multimodal Causal Representation Learning under Partial Latent Sharing
The paper establishes component-wise identifiability guarantees for causal latent representations in multimodal data with partially shared latent structures, addressing a key challenge in causal representation learning (CRL). The authors propose a non-parametric approach under flexible assumptions, using nonlinear mixing functions to model modality-specific latent subsets without requiring parametric latent distributions. A differentiable Wasserstein-based module is introduced to recover the shared structure, compatible with diverse architectures. Experiments on synthetic and real-world datasets demonstrate superior performance over state-of-the-art methods.
causal representation learning · multimodal learning · identifiability · wasserstein distance · latent variable models
CLIC: Contextual Language-Informed Cardiac Pathology Classification
The paper introduces CLIC (Contextual Language-Informed Cardiac pathology classification), a multimodal framework that enhances ECG-based diagnosis by integrating patient metadata and demographic variables through natural language encoding. The method translates contextual data into descriptive text, providing an informative anchor for disambiguating physiological patterns, and compares template-based clinical text with LLM-generated descriptions. Results show that controlled template-based text yields consistent classification improvements, though LLM-synthesized texts remain competitive in downstream performance.
electrocardiogram · multimodal framework · clinical text · large language models · pathology classification
Atomistic Modeling of Chemical Disorder in Materials: Bridging Classical Methods and AI-Assisted Approaches
The review addresses the representation gap between experimental and computational descriptions of chemical disorder in materials, proposing a framework that integrates classical and AI-driven methods. It evaluates techniques including mean-field theories, cluster expansion, Monte Carlo, and emerging AI approaches like universal interatomic potentials and generative models. The analysis demonstrates how AI can enhance disorder-native capabilities, such as configurational exploration, generative modeling of disordered structures, and kinetics-aware prediction, enabling more realistic AI-accelerated materials discovery.
chemical disorder · atomistic modeling · generative models · cluster expansion · monte carlo
Dual-Channel Tensor Neural Networks: Finite-Sample Theory and Conformal Structure Selection
The paper introduces a Dual-Channel Tensor Neural Network (DC-TNN) that processes tensor data through coupled channels for a low-rank core and sparse refinement, accommodating CP, Tucker, and tensor-train decompositions. It establishes non-asymptotic risk bounds for the estimator, showing effective dimension depends on core rank and refinement sparsity. A conformal ROC procedure provides finite-sample, distribution-free coverage for uncertainty quantification, while a conformal structure selector chooses among tensor decompositions. Experiments on synthetic data and a protein dataset demonstrate improved predictive accuracy and structure recovery.
tensor neural networks · non-asymptotic risk bounds · conformal inference · structure selection · low-rank decomposition
Learning Interpretable Point-Based Clinical Risk Scores via Direct Optimization
The paper introduces novel machine learning algorithms for constructing interpretable, point-based clinical risk scores via direct optimization of explicit objectives. The method employs a flexible greedy optimization strategy to learn additive scoring rules with nonnegative integer weights, addressing computational challenges of integer programming for nonconcave or discontinuous value functions. Applied to an Epic Cosmos EHR cohort, the approach constructs an integer-weighted comorbidity score for post-discharge mortality risk prediction, with performance validated through simulation studies.
clinical risk scores · integer programming · greedy optimization · electronic health records · comorbidity score
Performance Monitoring of Proton Exchange Membrane Water Electrolyzer by Transformers-Based Machine Learning Model
The authors propose a transformer-based machine learning framework for real-time state-of-health monitoring in proton exchange membrane (PEM) water electrolyzers without interrupting operation. Their encoder-decoder architecture employs patch-based sequence tokenization of operational data to reconstruct polarization curves, achieving a 10× MSE reduction versus baseline transformers across four longitudinal tests (≤478 hours). The method enables continuous performance monitoring while capturing latent representations of degradation, suggesting potential for interpretable health indicators in green hydrogen production systems.
proton exchange membrane · transformer model · state-of-health · polarization curve · patch tokenization
Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training
The paper proposes Grouped Sequential Training (GST), a heterogeneity-aware dataset scheduling method for efficient Audio Large Language Model (ALLM) training. GST organizes datasets into affinity-aware groups using gradient-based metrics and introduces them via progressive scheduling, balancing parallel training stability with sequential optimization efficiency. Evaluations on 14 AudioQA datasets show GST achieves 30–40% faster convergence than parallel training while matching or exceeding mix-all training performance, providing a scalable framework for multi-dataset ALLM optimization.
audio large language models · dataset heterogeneity · gradient-based affinity · grouped sequential training · audioqa
Chessformer: A Unified Architecture for Chess Modeling
Chessformer introduces a unified transformer architecture for chess modeling that simultaneously advances playing strength, human move prediction, and interpretability. The encoder-only model represents board squares as tokens, employs Geometric Attention Bias (GAB) for dynamic positional encoding, and uses an attention-based source-destination policy head. Evaluations show state-of-the-art human move prediction (57.1% accuracy), +100 Elo improvement in Leela Chess Zero, and granular interpretability via square-token attention patterns. Results demonstrate domain-aligned design enables concurrent gains across performance metrics.
geometric attention bias · source-destination policy · square-token representation · human move prediction · encoder-only transformer
The impact of observation density on Bayesian inversion of latent dynamics in shock-dominated flows
The authors present a non-intrusive reduced-order modeling framework for Bayesian initial-state inversion in shock-dominated compressible flows, addressing the ill-posed inverse problem through uncertainty quantification. The method combines a convolutional autoencoder (32-dimensional latent space) with a learned latent-space forward operator, trained on 500 high-fidelity Sod shock tube simulations computed with a fifth-order WENO scheme. Results demonstrate accurate reconstruction of shock-tube structures (rarefaction wave, contact discontinuity) and show that increased observation density reduces posterior uncertainty by 78% (density) and 76% (pressure), with 250 training simulations yielding sufficient accuracy.
bayesian inversion · reduced-order modeling · convolutional autoencoder · shock-dominated flows · uncertainty quantification
Mapping Uncharted Symmetries: Machine Discovery in Combinatorics
The paper demonstrates machine learning's capacity for verifiable mathematical discovery by addressing combinatorial function construction under exact constraints (SLURP problem). Two novel methods are introduced: MapSeek-Functional (alternating pseudo-labeling and supervised training) and MapSeek-Symbolic (direct symbolic formula generation). Applied to algebraic combinatorics, these methods yield a new combinatorial interpretation of q,t-Narayana polynomials via noncrossing partitions, resolving a previously open case with a symmetry proof. All discoveries are formally verified in Lean 4, with full code released for reproducibility.
slurp · mapseek · narayana · noncrossing · lean4
Provably Data-driven Lagrangian Relaxation for Mixed Integer Linear Programming
The paper establishes theoretical foundations for learning Lagrangian Relaxation (LR) multipliers in Mixed Integer Linear Programming (MILP) via data-driven methods. It derives a generalization bound of O(s^1.5/√N) for learned multipliers, proves a minimax lower bound of Ω(s/√N), and shows that Stochastic Gradient Ascent (SGA) achieves the optimal Θ(s/√N) rate. The framework extends to learning-to-warm-start, attaining Θ(s/N) rates. Contributions include tight bounds on sample complexity and constructive proofs for SGA optimality in LR contexts.
lagrangian relaxation · mixed integer linear programming · generalization bound · stochastic gradient ascent · minimax optimality
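For readers unfamiliar with Lagrangian relaxation, the sketch below works through a single tiny instance: pick at least k items at minimum cost, with the covering constraint moved into the objective through a multiplier λ ≥ 0. It uses deterministic projected subgradient ascent on one instance, not the data-driven stochastic setting analysed in the paper. The problem, step-size rule, and function names are assumptions.

```python
def dual_value(lmbda, costs, k):
    """L(lambda) = min over binary x of sum(c_i x_i) + lambda * (k - sum(x_i))."""
    # The relaxed problem separates per item: take item i iff c_i - lambda < 0.
    x = [1 if c - lmbda < 0 else 0 for c in costs]
    value = lmbda * k + sum((c - lmbda) * xi for c, xi in zip(costs, x))
    return value, x

def subgradient_ascent(costs, k, steps=50):
    """Maximize the concave dual function over lambda >= 0."""
    lmbda, best = 0.0, float("-inf")
    for t in range(steps):
        value, x = dual_value(lmbda, costs, k)
        best = max(best, value)
        grad = k - sum(x)                           # supergradient of the dual
        lmbda = max(0.0, lmbda + grad / (t + 1))    # projected, decaying step
    return best

costs = [3.0, 1.0, 2.0, 5.0]          # hypothetical item costs
k = 2                                 # at least k items must be chosen
primal_opt = sum(sorted(costs)[:k])   # exact optimum for this simple problem
best_dual = subgradient_ascent(costs, k)
```

By weak duality every dual value is a lower bound on the primal optimum. The quality of that bound depends on the multipliers, which is what makes predicting good multipliers from data worthwhile.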
Generative Pseudo-Force Fields for Molecular Generation
The paper introduces generative pseudo-force fields (GPFFs), a method combining energy-based relaxation and data-driven generation for molecular conformations. GPFFs train a machine learning force field (MLFF) on a quadratic pseudo-potential energy surface derived from reference equilibrium structures, eliminating the need for costly ab-initio data. The approach is shown to be a time-step-agnostic variant of variance exploding diffusion models, enabling efficient sampling with arbitrary structural priors. On QM9, GPFFs achieve 100% validity at 256 neural function evaluations (NFE) and over 50% at 6 NFE, outperforming diffusion baselines across all samplers.
generative pseudo-force fields · molecular conformations · machine learning force field · diffusion models · neural function evaluations
KVBuffer: IO-aware Serving for Linear Attention
KVBuffer introduces an IO-aware serving mechanism for linear attention to address inefficiencies in recurrent state updates during long-context inference. The method buffers recent keys and values, enabling chunkwise computation for decoding, parallel verification for speculative decoding, and direct attention output computation for short contexts. Implemented in SGLang for Qwen3-Next, KVBuffer reduces decoding latency by up to 45.17% and increases maximum serving requests by 5x for speculative decoding with four draft tokens.
linear attention · speculative decoding · memory access · chunkwise computation · io-aware serving
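The buffering idea can be illustrated on plain, unnormalized linear attention, where the recurrent state is S = Σ kᵀv and each output is q·S. In the sketch below the buffered path defers state writes and reads recent tokens directly from the buffer, yet produces the same outputs as the per-token recurrence. This is a simplified model under stated assumptions. Real linear-attention layers add feature maps, gating, or decay that are ignored here, and the function names and chunk size are hypothetical.

```python
def outer_add(state, k, v):
    """In-place rank-one update: state += outer(k, v)."""
    for i, ki in enumerate(k):
        for j, vj in enumerate(v):
            state[i][j] += ki * vj

def read(state, q):
    """Return q @ state."""
    return [sum(q[i] * state[i][j] for i in range(len(q)))
            for j in range(len(state[0]))]

def recurrent_decode(qs, ks, vs):
    """Baseline: update the full state at every token."""
    state = [[0.0] * len(vs[0]) for _ in range(len(ks[0]))]
    outs = []
    for q, k, v in zip(qs, ks, vs):
        outer_add(state, k, v)
        outs.append(read(state, q))
    return outs

def buffered_decode(qs, ks, vs, chunk=2):
    """Keep recent (k, v) pairs in a buffer; flush to the state per chunk."""
    state = [[0.0] * len(vs[0]) for _ in range(len(ks[0]))]
    buf, outs = [], []
    for q, k, v in zip(qs, ks, vs):
        buf.append((k, v))
        out = read(state, q)                     # contribution of flushed tokens
        for bk, bv in buf:                       # contribution of buffered tokens
            weight = sum(a * b for a, b in zip(q, bk))
            out = [o + weight * x for o, x in zip(out, bv)]
        outs.append(out)
        if len(buf) == chunk:                    # one state write per chunk
            for bk, bv in buf:
                outer_add(state, bk, bv)
            buf.clear()
    return outs

qs = [[1.0, 0.5], [0.2, -1.0], [0.3, 0.3], [-0.7, 0.1], [0.9, 0.4]]
ks = [[0.5, 1.0], [1.0, -0.5], [0.1, 0.2], [0.4, 0.4], [-0.3, 0.8]]
vs = [[1.0, 2.0], [0.0, -1.0], [0.5, 0.5], [2.0, 0.1], [-1.0, 1.0]]
baseline = recurrent_decode(qs, ks, vs)
buffered = buffered_decode(qs, ks, vs)
```

The design trade-off is a small amount of extra arithmetic against the buffer in exchange for fewer reads and writes of the much larger state matrix.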
Guiding Neuro-Symbolic Scenario Generation with Spatio-Temporal Logic
The paper introduces STRELGen, a neuro-symbolic framework for generating safety-critical autonomous driving scenarios. The method combines a multi-agent trajectory diffusion model with differentiable Spatio-Temporal Logic (STREL) specifications, enabling gradient-based optimization in latent space to produce plausible edge cases. This approach addresses the inefficiency of brute-force real-world testing by generating targeted scenarios that satisfy complex safety constraints while remaining within the learned data distribution. Results demonstrate efficient synthesis of interpretable, safety-critical multi-agent interactions for stress-testing autonomous systems.
autonomous driving · diffusion models · spatio-temporal logic · scenario generation · safety validation
RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning
RLFTSim introduces a reinforcement learning fine-tuning framework for multi-agent traffic simulation, enhancing realism and controllability by aligning simulator rollouts with real-world data distributions. The method builds on a pre-trained model, employs a dense reward signal balancing fidelity and controllability, and demonstrates state-of-the-art performance on the Waymo Open Motion Dataset. Results show improved realism with fewer samples compared to heuristic search-based methods, while effectively enabling goal-conditioned scenario generation.
multi-agent simulation · reinforcement learning fine-tuning · goal-conditioned controllability · waymo open motion dataset · dense reward signal
Learning When to Adapt
The paper introduces DISeL (Dynamic Input-Sensitive LoRA), a parameter-efficient fine-tuning method that addresses catastrophic forgetting in static low-rank adaptation (LoRA) by incorporating input-dependent gating over rank-one components. DISeL preserves pre-trained model behavior by default while activating task-specific components during fine-tuning, adding minimal parameters. Evaluated on RoBERTa (GLUE), Llama, and Mistral for mathematical reasoning and code generation, DISeL reduces forgetting compared to LoRA variants while maintaining competitive accuracy. The gating mechanism also provides interpretable insights into layer-wise and rank-wise adaptation patterns.
low-rank adaptation · catastrophic forgetting · parameter-efficient fine-tuning · input-dependent gating · interpretable adaptation
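As a rough illustration of input-dependent gating over rank-one components, the sketch below adds a per-component gate to a LoRA-style update. The gate form (a sigmoid of a linear function of the input) and all names (`gated_lora_forward`, `G`, `g_bias`) are assumptions made for illustration; the summary does not specify how DISeL parameterizes its gates.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_lora_forward(x, W, A, B, G, g_bias):
    """Compute y = W x + sum_i gate_i(x) * (a_i . x) * b_i.

    W: frozen base weight, list of d_out rows of length d_in.
    A[i], B[i]: down- and up-projection vectors of rank-one component i.
    G[i], g_bias[i]: hypothetical gate parameters, gate_i = sigmoid(G[i].x + g_bias[i]).
    Returns the output vector and the list of gate values.
    """
    y = [dot(row, x) for row in W]        # frozen pre-trained path
    gates = []
    for a, b, g, c in zip(A, B, G, g_bias):
        gate = sigmoid(dot(g, x) + c)     # input-dependent, in (0, 1)
        gates.append(gate)
        coeff = gate * dot(a, x)
        for j in range(len(y)):
            y[j] += coeff * b[j]
    return y, gates
```

With strongly negative gate biases the gates stay near zero and the layer reproduces the frozen base output, which matches the summary's description of preserving pre-trained behavior by default. The returned gate values are the kind of signal one could inspect per layer and per rank.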
Conformal Prediction via Transported Beta Laws
The paper introduces a framework for analyzing calibration-conditional coverage in conformal prediction by modeling it as a transported beta law. Using Wasserstein distances on [0,1], the method quantifies departures from the ideal beta reference distribution under non-i.i.d. settings, distinguishing between test-side shifts (via transport maps) and calibration dependence (via order-statistic law changes). Theoretical bounds are derived for marginal coverage gaps and bad-calibration probabilities, with applications to scale-shift, clustered, and stationary mixing scenarios. Simulations demonstrate that first-order approximations accurately track empirical Wasserstein distances even at moderate sample sizes.
conformal prediction · wasserstein distance · beta law · calibration-conditional coverage · order-statistic law
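For context on the "ideal beta reference distribution", here is a small sketch of the standard i.i.d. result for split conformal prediction: with n calibration scores, continuous and exchangeable, the coverage conditional on the calibration set follows Beta(k, n + 1 - k) with k = ceil((n + 1)(1 - alpha)). The helper names are hypothetical, and the paper's transported (non-i.i.d.) version is not reproduced here.

```python
import math
from fractions import Fraction

def split_conformal_rank(n, alpha):
    """Rank k of the sorted calibration score used as the threshold.

    Fractions are used so that ceil((n + 1) * (1 - alpha)) is not affected
    by floating-point rounding.
    """
    k = math.ceil((n + 1) * (1 - Fraction(str(alpha))))
    if k > n:
        raise ValueError("calibration set too small for this alpha")
    return k

def coverage_beta_law(n, alpha):
    """Return (a, b, mean, variance) of the Beta(a, b) coverage law."""
    k = split_conformal_rank(n, alpha)
    a, b = k, n + 1 - k
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return a, b, mean, variance
```

The mean of this law is the marginal coverage, which is at least 1 - alpha, and the variance shrinks roughly like 1/n, so the spread of calibration-conditional coverage narrows as the calibration set grows.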
Deep Neural Sheaf Diffusion
The paper introduces Deep Neural Sheaf Diffusion (DNSD), a novel approach to address representation collapse in deep Graph Neural Networks (GNNs) by replacing the sheaf Laplacian with a sheaf adjacency operator. DNSD incorporates normalization, odd nonlinearities, and gating to maintain signal integrity across layers. The method is theoretically contrasted with graph attention mechanisms, emphasizing matrix-valued edge functions and node representation normalization. Empirical results show DNSD outperforms GNN and Neural Sheaf Diffusion (NSD) baselines by up to 30 percentage points in accuracy on synthetic datasets and consistently on real-world benchmarks, positioning sheaf-based architectures as viable for graph foundation models.
neural sheaf diffusion · graph neural networks · sheaf laplacian · adjacency operator · representation collapse
LoRA vs. Full Fine-Tuning: A Theoretical Perspective
The paper provides a theoretical analysis comparing Low-Rank Adaptation (LoRA) and full fine-tuning in linear regression, identifying conditions where LoRA achieves lower excess risk. By modeling the pretraining-downstream task relationship as a low-rank difference, the analysis shows LoRA outperforms full fine-tuning when this difference is effectively low-rank. Theoretical results demonstrate that optimal rank selection can improve generalization despite reduced expressivity, with experimental validation suggesting broader applicability beyond linear settings.
low-rank adaptation · excess risk · fine-tuning · generalization performance · linear regression
SAGA: A Sequence-Adaptive Generative Architecture for Multi-Horizon Probabilistic Forecasting with Adaptive Temporal Conformal Prediction
The paper introduces SAGA, a decoder-only transformer architecture for multi-horizon probabilistic forecasting of irregular tabular panel sequences, enhanced with adaptive temporal conformal prediction for finite-sample coverage guarantees. The method processes longitudinal data from 2,143,817 individuals in the Swedish LISA register (1990-2022), forecasting annual labor earnings at 1- to 30-year horizons and aggregating them into lifetime earnings distributions via Monte Carlo. SAGA reduces continuous ranked probability score by 31.9% at 10 years and mean absolute error by 37.7% at 20 years versus parametric baselines, while the conformal intervals stay within 0.4 percentage points of nominal marginal coverage.
decoder-only transformer · conformal prediction · probabilistic forecasting · panel data · monte carlo aggregation
Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency
The paper introduces Learn-by-Wire Guard (LBW-Guard), a bounded autonomous training-control governance layer that operates above AdamW to mitigate instability in large language model training. LBW-Guard monitors training telemetry and applies bounded control to optimizer execution without altering fixed objectives, evaluated on Qwen2.5 models (3B-14B) and TinyLlama-1B under stress conditions. Results show an 18.7% perplexity reduction (13.21→10.74) and 1.10x speedup for Qwen2.5-7B, with maintained trainability under aggressive learning rates (LR=3e-3: 1885.24→11.57 perplexity) where AdamW fails.
training stability · optimizer governance · learning-rate stress · perplexity reduction · bounded control
EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction
The authors introduce EgoTraj, a multimodal egocentric dataset for human trajectory prediction, addressing the scarcity of real-world egocentric trajectory data. Collected using Meta Quest Pro, EgoTraj comprises 75 sequences of human navigation in urban environments, featuring synchronized RGB video, 6-DOF head poses, 3D eye gaze vectors, and scene annotations. Benchmarking state-of-the-art methods reveals the utility of gaze, scene, and motion cues for trajectory prediction, demonstrating EgoTraj's potential for AR-based perception and assistive systems.
egocentric trajectory · multimodal dataset · 6-dof head poses · 3d eye gaze · ar-based perception
Distance-Aware Muon: Adaptive Step Scaling for Normalized Optimization
The paper introduces three adaptive step-scaling algorithms for the Muon optimizer, addressing sensitivity to step scale in normalized optimization. Distance-Adaptive Muon uses trajectory radius for trust-region scaling, proving stationarity for smooth non-convex objectives. Scale-Calibrated Muon employs local descent certificates for star-convex objectives, achieving O(1/T) objective-gap bounds. Distance-Free Muon eliminates distance-to-minimizer knowledge via scalar certificates. Experiments on GPT-124M/WikiText-103 and ViT-Tiny/CIFAR-100 demonstrate reduced tuning sensitivity and performance that matches or exceeds fixed-scale baselines.
normalized optimization · adaptive scaling · trust-region methods · star-convex objectives · trajectory radius
📰 Industry Media (8)
Roundtables: Inside the Musk v. Altman Trial
The California Superior Court ruled against Elon Musk's lawsuit alleging OpenAI executives Sam Altman and Greg Brockman misrepresented the company's nonprofit status, as analyzed by MIT Technology Review's legal correspondent Michelle Kim. The trial proceedings revealed Musk's claims of deception regarding OpenAI's governance structure and his concurrent development of competing AI systems through xAI. Key testimony included internal communications about model distillation practices and recruitment attempts between the parties. The verdict maintains OpenAI's current organizational framework amid ongoing debates about AI safety and commercial competition in foundation model development.
nonprofit status · model distillation · foundation models · governance structure · legal precedent
How to Build Knowledge Graph Generation Pipelines From Text With kg-gen, NetworkX Analytics, and Interactive Visualizations
The tutorial presents a comprehensive pipeline for generating knowledge graphs from unstructured text using kg-gen, NetworkX, and PyVis. The method employs LLMs (GPT-4o-mini via LiteLLM) for entity-relation extraction, implements chunking and clustering for long documents, and demonstrates multi-source aggregation with entity resolution. Results include NetworkX-based centrality analysis (degree: 0.317 for 'Deep learning'), Louvain community detection (4 communities in AI text), and interactive PyVis visualizations with PageRank-weighted nodes (size=12+80*PR). The system exports to JSON/GraphML and supports neighborhood queries (2-hop around 'machine learning').
knowledge graph · entity resolution · pagerank · louvain communities · graphml
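The PageRank-weighted node sizing mentioned in the summary (size = 12 + 80 × PR) can be sketched without dependencies. The tutorial itself uses NetworkX and PyVis; the power-iteration routine below is a stand-in written for illustration, and the toy edges and function names are hypothetical.

```python
def pagerank(edges, damping=0.85, iters=100):
    """Power-iteration PageRank on a directed edge list of (source, target)."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # Dangling nodes (no outgoing edges) spread their rank evenly.
            targets = out_links[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

def node_sizes(rank):
    """Apply the sizing rule quoted in the summary: size = 12 + 80 * PR."""
    return {n: 12 + 80 * pr for n, pr in rank.items()}

# Toy graph of extracted (subject, object) pairs, for illustration only.
edges = [
    ("deep learning", "machine learning"),
    ("deep learning", "neural networks"),
    ("neural networks", "machine learning"),
    ("machine learning", "artificial intelligence"),
]
sizes = node_sizes(pagerank(edges))
```

Because PageRank scores sum to one, the rule keeps every node at a minimum size of 12 and scales the most central entities up from there, which keeps low-rank nodes visible in the rendered graph.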
NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6× Tokens Per Forward Over Qwen3-8B
NVIDIA introduces Nemotron-Labs-Diffusion, a tri-mode language model family (3B/8B/14B) unifying autoregressive (AR), diffusion-based parallel, and self-speculation decoding within a single architecture. The model employs a joint AR-diffusion training objective (α=0.3) with two-stage pretraining (1T AR tokens + 300B joint tokens), achieving 63.61% average accuracy on 10-task evaluation in AR mode (8B). Key innovations include block-wise bidirectional diffusion (2.57× tokens/forward), LoRA-enhanced linear self-speculation (5.99× tokens/forward), and quadratic self-speculation (6.38× tokens/forward), yielding a 2.4× speedup over Qwen3-8B at batch size 1.
tri-mode language model · diffusion-based parallel decoding · self-speculation decoding · joint ar-diffusion objective · lora-enhanced linear self-speculation
Alibaba Qwen Team Introduces Qwen3.5-LiveTranslate-Flash: Real-Time Multimodal Interpretation Across 60 Languages at 2.8-Second Latency
Alibaba's Qwen team introduces Qwen3.5-LiveTranslate-Flash, a multimodal real-time translation system achieving 2.8-second latency across 60 input languages. The model employs semantic unit prediction for streaming output, integrates visual cues (lip movements, gestures) to disambiguate noisy audio, and performs real-time voice cloning from a single utterance. Benchmarked on FLEURS and CoVoST2, it outperforms commercial alternatives while offering dynamic keyword injection for domain-specific terminology. The system supports 29 speech output languages via a WebSocket API with vision-audio fusion.
multimodal translation · semantic unit prediction · voice cloning · dynamic keyword configuration · websocket protocol
Google Introduces Gemini 3.5 Flash at I/O 2026: A Faster and Cheaper Model for AI Agents and Coding
Google introduced Gemini 3.5 Flash, a cost-efficient variant of its Gemini series optimized for AI agents and coding tasks. The model features a 1,048,576-token context window, multimodal input support, and dynamic compute allocation for complex problems. Benchmark results show superior performance over Gemini 3.1 Pro, with Terminal-Bench 2.1 (76.2%), GDPval-AA (1656 Elo), MCP Atlas (83.6%), and CharXiv Reasoning (84.2%). Priced at $1.50/M input tokens and $9.00/M output tokens, it integrates with Google's Managed Agents API and Antigravity 2.0 platform for enterprise-scale agentic workflows.
gemini 3.5 flash · managed agents api · antigravity 2.0 · multimodal understanding · dynamic compute allocation
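To make the quoted per-token prices concrete, here is a small cost estimate using only the two figures from the summary ($1.50 per million input tokens, $9.00 per million output tokens). The helper name is hypothetical, and the sketch assumes flat pricing with no caching, batching, or tiered discounts, none of which the summary mentions.

```python
INPUT_USD_PER_MILLION = 1.50   # from the summary
OUTPUT_USD_PER_MILLION = 9.00  # from the summary

def request_cost_usd(input_tokens, output_tokens):
    """Estimated cost of one request under flat per-token pricing (assumed)."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000

# Example: an agent step that reads a long context and writes a short patch.
example_cost = request_cost_usd(input_tokens=200_000, output_tokens=4_000)
```

At these rates an output token costs six times as much as an input token, so for agentic workloads the length of generated output can dominate the bill even when prompts are long.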
Upstash for Redis vs Supabase vs Neon: Which One Fits Vibe Coding Workflows in 2026?
The article provides a technical comparison of Upstash for Redis, Supabase, and Neon, clarifying their distinct roles in serverless architectures. Upstash specializes in HTTP-based Redis for caching and rate-limiting in edge environments, Supabase offers a full-stack BaaS with PostgreSQL, auth, and storage, while Neon provides serverless PostgreSQL with scale-to-zero compute and copy-on-write branching. Key findings highlight Upstash's compatibility with Vercel/Cloudflare Workers, Supabase's integrated AI tooling, and Neon's cost efficiency for idle workloads. The analysis emphasizes their complementary nature rather than direct competition.
serverless postgresql · http-based redis · copy-on-write branching · backend-as-a-service · scale-to-zero
Alibaba is designing AI chips around agents, and that changes what the race is actually about
(No summary returned.)
Enterprise AI roadblocks and roadmaps, security and physical AI: Day two at TechEx
TechEx North America's second day analyzed enterprise AI deployment challenges, identifying pilot-to-production scaling as a critical bottleneck. Sessions emphasized agentic AI specialization, data infrastructure readiness, and token-based cost management, while infrastructure discussions contrasted build-vs-buy decisions for physical compute. Cybersecurity tracks highlighted velocity gaps between AI adoption and governance, proposing zero-trust architectures for agent permissions. Physical AI emerged as a focus area beyond LLMs, with hands-on workshops demonstrating agent self-improvement techniques via Google Colab instances.
agentic ai · zero-trust · token-based charging · velocity gap · physical ai
Generated automatically at 2026-05-20 21:36 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.
