Label with confidence: Effective confidence calibration and ensembles in LLM-powered classification
2024
Large Language Models (LLMs) have been employed as crowd-sourced annotators to alleviate the burden of human labeling. However, broader adoption of LLM-based automated labeling systems faces two main challenges: 1) LLMs are prone to producing unexpected and unreliable predictions, and 2) no single LLM excels at all labeling tasks. To address these challenges, we first develop fast and effective logit-based confidence-score calibration pipelines that leverage calibrated LLM confidence scores to accurately estimate an LLM's level of confidence. We propose a novel calibration-error-based sampling strategy to efficiently select labeled data for calibration, reducing calibration error by 46% compared with uncalibrated scores. Leveraging the calibrated confidence scores, we then design a cost-aware cascading LLM ensemble policy that achieves improved accuracy while cutting inference cost by more than a factor of two compared with the conventional weighted-majority-voting ensemble policy.
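The abstract does not spell out the calibration pipeline, but a standard logit-based approach it alludes to is temperature scaling fit on a small labeled calibration set, with expected calibration error (ECE) as the quality metric. The sketch below is a minimal illustration under those assumptions (synthetic logits, NLL-based temperature fitting); it is not the paper's exact pipeline or sampling strategy.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over per-label logits."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def ece(confidences, correct, n_bins=10):
    """Expected calibration error: bin-weighted gap between
    mean confidence and empirical accuracy within each bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total, err = len(confidences), 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            err += mask.sum() / total * gap
    return err

def fit_temperature(logits, labels):
    """Pick the temperature T that minimizes negative log-likelihood
    on a labeled calibration set (classic temperature scaling)."""
    def nll(T):
        probs = softmax(logits, T)
        return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

# Toy usage: synthetic, overconfident logits over 3 labels stand in
# for an LLM classifier's raw scores.
rng = np.random.default_rng(0)
logits = rng.normal(size=(500, 3)) * 3.0
labels = rng.integers(0, 3, size=500)
T = fit_temperature(logits, labels)
probs = softmax(logits, T)
conf, pred = probs.max(axis=1), probs.argmax(axis=1)
print(f"T = {T:.2f}, ECE after calibration = {ece(conf, pred == labels):.3f}")
```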
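Likewise, the cost-aware cascading ensemble can be pictured as routing each example through models ordered cheap-to-expensive and stopping as soon as the calibrated confidence clears a threshold, so the expensive model is only paid for on hard examples. The sketch below is a hypothetical illustration of that idea; the `Stage` wrapper, stub predictors, and the 0.9 threshold are assumptions, not the paper's actual policy.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Stage:
    name: str
    cost: float                                  # cost per call (arbitrary units)
    predict: Callable[[str], Tuple[str, float]]  # text -> (label, calibrated confidence)

def cascade_classify(text: str, stages: List[Stage], threshold: float = 0.9):
    """Escalate through models ordered cheap-to-expensive; stop early
    once calibrated confidence clears the threshold."""
    spent, best = 0.0, None
    for stage in stages:
        label, conf = stage.predict(text)
        spent += stage.cost
        if best is None or conf > best[1]:
            best = (label, conf)                 # track most confident answer seen
        if conf >= threshold:
            return label, conf, spent            # confident enough: skip pricier models
    return best[0], best[1], spent               # fall back to best answer from the cascade

# Toy usage with stub predictors standing in for calibrated LLMs.
cheap = Stage("small-llm", 1.0, lambda t: ("positive", 0.95 if "great" in t else 0.60))
big = Stage("large-llm", 10.0, lambda t: ("negative", 0.97))
print(cascade_classify("a great movie", [cheap, big]))    # stops at the cheap model
print(cascade_classify("ambiguous review", [cheap, big])) # escalates to the large model
```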