Customer-obsessed science
Research areas
- June 8, 2026. 7 min read. Four approaches can dramatically improve the performance and trustworthiness of AI agents in operational environments.
- May 27, 2026. 4 min read. Machine learning
Featured news
- ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems, 2026. The LLM Jury, a Panel of LLM Evaluators (POLL) (Verga et al., 2024) reporting consensus scores, has become a practical alternative to single-judge LLM evaluation, yet its statistical behavior remains poorly understood. Formalizing the setup under the Huber contamination model, we show that POLL incurs unbounded bias under any positive contamination, regardless of jury size, whenever a single judge fails
- ECML-PKDD 2026, 2026. When fusing heterogeneous modalities for classification, a central challenge is cardinality heterogeneity: modalities often produce token sequences of vastly different lengths, yet standard symmetric fusion wastes attention capacity under this asymmetry. We present CRAFT, a modality-agnostic fusion framework that selects a high-density attention backbone using token cardinality and standalone task relevance
- arXiv, 2026. Large language model (LLM) agents deployed in healthcare and life sciences (HCLS) routinely receive queries that are semantically ambiguous: the same terms carry different meanings across clinical, regulatory, pharmacovigilance, data-standards, and research domains. Existing approaches address ambiguity post-hoc through output filtering or retrieval augmentation, but do not quantify it before the model responds
- 2026. Continual learning methods for vision-language models are developed on benchmarks where each new task introduces entirely new domain knowledge. Real-world task sequences are more natural: they routinely share visual concepts, language patterns, and even training samples across stages. However, existing mixture-of-experts methods that assign one expert per task with fixed routing can split similar inputs
- 2026. Time reasoning is a make-or-break capability for Large Language Models (LLMs) aspiring to act as reliable personal and enterprise assistants. This paper introduces the Temporal Reasoning Dataset (TRD), a programmatically generated multilingual benchmark designed to evaluate temporal reasoning operational capabilities in LLMs across ten languages, with particular focus on basic operations relevant to conversational
Collaborations
Whether you're a faculty member or a student, there are a number of ways you can engage with Amazon.