CausalFusion: Integrating LLMs and graph falsification for causal discovery
2026
Causal discovery is central to enabling causal models for tasks such as effect estimation, counterfactual reasoning, and root cause attribution. Yet existing approaches face trade-offs: purely statistical methods (e.g., PC, LiNGAM) often return structures that overlook domain knowledge, while expert-designed DAGs are difficult to scale and time-consuming to construct. We propose CausalFusion, a hybrid framework that combines graph falsification tests with large language models (LLMs) acting as domain-specialized data scientists. LLMs incorporate domain expertise into candidate structures, while graph falsification tests iteratively refine DAGs to balance statistical validity with expert plausibility. We evaluate CausalFusion through two experiments: (i) a synthetic e-commerce dataset with a precisely defined ground truth DAG, and (ii) real-world supply chain data from Amazon, where the ground truth was constructed with domain experts. To benchmark performance, we compare against classical causal discovery algorithms (PC, LiNGAM) as well as LLM-only baselines that generate DAGs without iterative falsification. Structural Hamming Distance (SHD) is used as the primary evaluation metric to quantify similarity between generated and “true” DAGs. We also analyze the chain-of-thought traces of different foundation models to examine whether deeper reasoning correlates with improved structural accuracy or reproducibility. Results show that CausalFusion produces DAGs more closely aligned with ground truth than both classical algorithms and LLM-only baselines, while offering interpretable reasoning at each iteration, though challenges in reproducibility and generalizability remain.
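The SHD metric mentioned above counts the minimum number of edge additions, deletions, and reversals needed to turn one DAG into another. A minimal sketch of one common convention (reversals counted as a single error), assuming DAGs are given as binary adjacency matrices where `A[i, j] = 1` denotes an edge i → j; the function name and the toy graphs are illustrative, not from the paper:

```python
import numpy as np

def structural_hamming_distance(true_adj, pred_adj):
    """SHD between two DAGs given as 0/1 adjacency matrices.

    Counts each missing edge, extra edge, and reversed edge as one
    error; a reversal is counted once, not as a deletion plus an
    addition.
    """
    true_adj = np.asarray(true_adj)
    pred_adj = np.asarray(pred_adj)
    diff = np.abs(true_adj - pred_adj)
    # A reversed edge appears as a mismatch at both (i, j) and (j, i).
    # Symmetrizing the mismatch matrix and clipping to 1 collapses each
    # reversal (as well as each addition/deletion) to a single entry
    # per unordered node pair; summing the upper triangle counts pairs.
    mismatch = np.clip(diff + diff.T, 0, 1)
    return int(np.triu(mismatch).sum())

# Toy 3-node example: true DAG is X -> Y -> Z; the predicted DAG has
# X -> Y correct but the second edge reversed (Z -> Y instead of Y -> Z).
true_dag = [[0, 1, 0],
            [0, 0, 1],
            [0, 0, 0]]
pred_dag = [[0, 1, 0],
            [0, 0, 0],
            [0, 1, 0]]
print(structural_hamming_distance(true_dag, pred_dag))  # -> 1 (one reversal)
```

Lower SHD means closer structural agreement with the ground-truth DAG; note that some SHD variants instead count a reversal as two errors, so the convention should be fixed when comparing methods.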