Gazetteer enhanced named entity recognition for code-mixed web queries

Besnik Fetahu; Anjie Fang; Oleg Rokhlenko; Shervin Malmasi


2021


Named entity recognition (NER) for Web queries is very challenging. Queries often do not consist of well-formed sentences and contain very little context, and the queried entities are highly ambiguous. Code-mixed queries, in which entities appear in a different language than the rest of the query, pose a particular challenge in domains like e-commerce (e.g., queries containing movie or product names). This work tackles NER for code-mixed queries, where entities and non-entity query terms co-exist in different languages. Our contributions are twofold. First, to address the lack of code-mixed NER data, we create EMBER, a large-scale dataset in six languages with four different scripts. Based on Bing query data, we include numerous language combinations that showcase real-world search scenarios. Second, we propose a novel gated architecture that enhances existing multilingual Transformers with a Mixture-of-Experts model to dynamically infuse multilingual gazetteers, allowing it to simultaneously differentiate and handle entities and non-entity query terms in multiple languages. Experimental evaluation on code-mixed queries in several languages shows that our approach efficiently utilizes gazetteers to recognize entities in code-mixed queries with an F1 of 68%, an absolute improvement of 31 percentage points over a non-gazetteer baseline.
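The abstract does not spell out the architecture, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a toy gazetteer, a longest-match lookup, and a single scalar gate that blends a contextual token vector with a gazetteer-derived vector (a two-expert mixture, where a sigmoid gate is equivalent to a two-way softmax). The names `GAZETTEER`, `TYPES`, `gazetteer_features`, and `gated_mix`, as well as the example query, are hypothetical and chosen for illustration.

```python
import math

# Hypothetical toy gazetteer: lowercased token tuple -> entity type.
GAZETTEER = {
    ("harry", "potter"): "CREATIVE_WORK",
    ("la", "casa", "de", "papel"): "CREATIVE_WORK",
}
TYPES = ["CREATIVE_WORK"]


def gazetteer_features(tokens):
    """Mark each token with a one-hot type vector if it lies inside a
    gazetteer match (greedy, longest match first)."""
    feats = [[0.0] * len(TYPES) for _ in tokens]
    max_len = max(len(key) for key in GAZETTEER)
    i = 0
    while i < len(tokens):
        step = 1
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + n])
            if span in GAZETTEER:
                idx = TYPES.index(GAZETTEER[span])
                for j in range(i, i + n):
                    feats[j][idx] = 1.0
                step = n
                break
        i += step
    return feats


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_mix(h_ctx, h_gaz, w, b):
    """Blend two same-sized token vectors with a scalar gate.

    g = sigmoid(w . [h_ctx; h_gaz] + b); output = g * h_ctx + (1 - g) * h_gaz.
    In a trained model, w and b would be learned; here they are passed in.
    """
    g = sigmoid(sum(wi * xi for wi, xi in zip(w, h_ctx + h_gaz)) + b)
    return [g * c + (1.0 - g) * z for c, z in zip(h_ctx, h_gaz)]


# A Spanish query containing an English title (illustrative example).
query = "ver harry potter online".split()
features = gazetteer_features(query)
mixed = gated_mix([1.0, 0.0], [0.0, 1.0], w=[0.0, 0.0, 0.0, 0.0], b=0.0)
```

In this sketch, only `harry` and `potter` receive a non-zero gazetteer feature, and a gate with zero weights and zero bias averages the two input vectors. A real system would project the gazetteer features to the Transformer's hidden size before gating and would learn the gate parameters jointly with the tagger.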
