DiscoverGPT: Multi-task fine-tuning of large language model for related table discovery
2024
Natural language understanding over tabular data is crucial for data discovery tasks such as joinable and unionable table search. State-of-the-art approaches adopt large language models (LLMs) trained over massive text corpora to assess semantic relatedness between tables, typically following a pretrain-and-finetune paradigm with labeled tabular data. Recent studies incorporate auxiliary tasks such as entity resolution and column type classification in the fine-tuning phase to improve performance. However, there is a lack of studies on how different supervisions complement or even conflict with each other, leading to suboptimal performance on the final data discovery tasks. In this paper, we propose a simple yet effective multi-task fine-tuning framework named DiscoverGPT that holistically discovers and leverages the intricate relationships among the supervisions to optimize model performance on the data discovery task. Moreover, DiscoverGPT is plug-and-play: by utilizing the generative power of LLMs, it allows a broad range of open-domain auxiliary tasks to be incorporated. We demonstrate the usability and effectiveness of DiscoverGPT with baseline comparisons and ablation studies. DiscoverGPT outperforms the top baseline by up to 7% in F1 score.
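To make the multi-task fine-tuning setup concrete, the sketch below shows one common way to combine a primary discovery objective with auxiliary supervisions: a shared encoder, one classification head per task, and a weighted sum of per-task losses. Everything here is an illustrative assumption, not DiscoverGPT's actual design: the task names, head structure, stand-in encoder (a small MLP in place of a pretrained LLM), random toy data, and the fixed equal loss weights. The paper's contribution is precisely to move beyond such a naive combination by modeling how the supervisions relate.

```python
import torch
import torch.nn as nn

# Hypothetical multi-task setup: primary discovery task plus two
# auxiliary supervisions mentioned in the abstract. Label counts are made up.
NUM_LABELS = {"table_discovery": 2, "entity_resolution": 2, "column_type": 10}


class MultiTaskModel(nn.Module):
    """Shared encoder with one lightweight classification head per task."""

    def __init__(self, in_dim=128, hidden=256, num_labels=None):
        super().__init__()
        num_labels = num_labels or NUM_LABELS
        # Stand-in for a pretrained LLM encoder producing one vector per input.
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, n) for task, n in num_labels.items()}
        )

    def forward(self, x, task):
        return self.heads[task](self.encoder(x))


model = MultiTaskModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
# Fixed, equal task weights; a key question the paper targets is how
# supervisions interact, which this naive weighting ignores.
task_weights = {task: 1.0 for task in NUM_LABELS}

for step in range(3):  # toy training loop over random mini-batches
    optimizer.zero_grad()
    total_loss = torch.tensor(0.0)
    for task, head in model.heads.items():
        x = torch.randn(8, 128)                        # fake serialized tables
        y = torch.randint(0, head.out_features, (8,))  # fake labels
        total_loss = total_loss + task_weights[task] * loss_fn(model(x, task), y)
    total_loss.backward()
    optimizer.step()
```

In practice the encoder would be a pretrained LLM fine-tuned end to end, and the question of how to set (or learn) the task weights is where approaches diverge; uniform weighting is only the simplest baseline.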