Effective product schema matching and duplicate detection with large language models
2025
Building and maintaining a rich, high-quality product schema helps customers of an e-commerce service find products based on the characteristics they want. As the number of products sold on the service grows, so does the complexity of maintaining the schema. Expanding it requires finding gaps, designing new product attributes, and ensuring that those attributes do not already exist in the schema. In this paper, we present an automated system for product schema matching that combines semantic search with large language models (LLMs) to align the product concepts from two schemas. The approach was tested on the duplicate attribute detection task using a dataset of 1,399 product attributes, where it achieved an F2 score of 90.2%, outperforming humans by 8.4% on the same task. On the product schema matching task, it achieved an F1 score of 78.12%, which is close to human-level performance. Moreover, we estimate that the system can reduce the time humans spend reviewing new attributes by more than 90%.
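The F2 and F1 scores reported above are both instances of the standard F-beta measure, the weighted harmonic mean of precision and recall, in which beta = 2 weights recall more heavily than precision and beta = 1 weights them equally. A minimal sketch of that standard definition follows; the function name `f_beta` and the precision and recall values are illustrative assumptions, not the paper's code or results.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 favors recall, beta < 1 favors precision, beta = 1 gives F1.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        # Both precision and recall are zero; the score is defined as 0 here.
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Illustrative values only (assumed, not taken from the paper).
precision, recall = 0.5, 1.0
print(f"F1 = {f_beta(precision, recall, beta=1.0):.4f}")
print(f"F2 = {f_beta(precision, recall, beta=2.0):.4f}")
```

With high recall and lower precision, as in these assumed values, F2 comes out higher than F1, which is why F2 suits a task such as duplicate detection when missing a duplicate is costlier than raising a false alarm.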