Mitigating hallucinations in LLMs for international trade: Introducing the TradeGov evaluation dataset and TradeGuard hallucination mitigation framework for trade Q&A
2026
Given the constant flux of geopolitics, staying up to date and compliant with international trade regulations is challenging. Whether LLMs can aid this task is a question hitherto unexplored in the LLM evaluation literature, primarily due to the lack of a dataset for benchmarking LLM capabilities on international trade questions. To address this gap, we introduce TradeGov, a novel, human-audited dataset of 5,000 international trade question-answer pairs spanning 138 countries, created with ChatGPT from the Country Commercial Guides on the International Trade Administration website. The dataset achieves 98% relevance and faithfulness and shows no systematic biases along macroeconomic or geographical dimensions, making it equally applicable for assessing LLMs across countries. Evaluating GPT-4o and Claude 3.5 Sonnet on this dataset, the first systematic evaluation of LLMs on international trade question answering, we find that GPT-4o achieves 85% accuracy and Claude 3.5 Sonnet achieves 88%. Building on these insights, we develop TradeGuard, an ensemble trade regulation hallucination mitigation framework that leverages majority-vote summarization and multi-agent debate to reach 91% accuracy on the TradeGov dataset, outperforming the vanilla versions of both models. TradeGuard's ensemble hallucination detection algorithm, which combines entailment verification, cross-questioning, and Bayesian regression, achieves an F1 score of 91%, significantly enhancing reliability in legal contexts. Notably, we demonstrate that TradeGuard reduces 'I don't know' responses while maintaining accuracy, particularly for low-income countries, and exhibits no systematic biases along key macroeconomic dimensions.
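The abstract does not specify how the ensemble detector is implemented. As a minimal illustrative sketch, assuming the Bayesian-regression stage fuses per-answer detector scores (an entailment score from an NLI model and a cross-questioning consistency score), one could combine them with scikit-learn's BayesianRidge as below. The feature names, training values, and decision threshold are all hypothetical, not the paper's implementation.

```python
# Hypothetical sketch: fuse per-answer detector signals with Bayesian
# regression to flag hallucinations. Features and threshold are
# illustrative assumptions only.
import numpy as np
from sklearn.linear_model import BayesianRidge

# Each row holds two hypothetical scores for one candidate answer:
# [entailment score from an NLI verifier, cross-questioning consistency score]
X_train = np.array([
    [0.95, 0.90],  # well-grounded answer
    [0.20, 0.35],  # likely hallucination
    [0.80, 0.75],
    [0.10, 0.15],
])
# Target: 1.0 = faithful to the source passage, 0.0 = hallucinated
y_train = np.array([1.0, 0.0, 1.0, 0.0])

model = BayesianRidge()
model.fit(X_train, y_train)

def flag_hallucination(entailment, consistency, threshold=0.5):
    """Flag an answer when its regressed faithfulness score falls below threshold."""
    score = model.predict(np.array([[entailment, consistency]]))[0]
    return score < threshold

print(flag_hallucination(0.15, 0.25))  # True: low scores on both signals -> flagged
```

In practice the two input scores would come from an entailment model checking the answer against the retrieved Country Commercial Guide passage and from re-asking the model paraphrased versions of the question; the sketch only shows how such signals could be fused.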