Named entity recognition (NER) is a vital task in spoken language understanding, which aims to identify mentions of named entities in text, e.g., from transcribed speech. Existing neural models for NER rely mostly on dedicated word-level representations, which suffer from two main shortcomings. First, the vocabulary size is large, yielding large memory requirements and long training times. Second, these models are not able to learn morphological or phonological representations. To remedy the above shortcomings, we adopt a neural solution based on bi-directional LSTMs and conditional random fields, where we rely on sub-word units, namely characters, phonemes, and bytes. For each word in an utterance, our model learns a representation from each of the sub-word units. We conducted experiments in a real-world large-scale setting for the use case of a voice-controlled device covering four languages with up to 5.5M utterances per language. Our experiments show that (1) with increasing training data, the performance of models trained solely on sub-word units approaches that of models with dedicated word-level embeddings (91.35 vs. 93.92 F1 for English), while using a much smaller vocabulary size (332 vs. 74K), (2) sub-word units enhance models with dedicated word-level embeddings, and (3) combining different sub-word units improves performance.
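As a rough illustration of the architecture described above (not the authors' implementation), the following minimal PyTorch sketch encodes each word from its characters with a character-level BiLSTM, feeds the resulting word representations to a word-level BiLSTM, and produces per-word emission scores; a CRF layer, omitted here for brevity, would normally decode over those scores. All module names, dimensions, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of a sub-word (character) BiLSTM encoder feeding a word-level
# BiLSTM tagger. The CRF decoding layer is omitted; only emission scores are shown.
import torch
import torch.nn as nn

class SubwordBiLSTMTagger(nn.Module):
    def __init__(self, n_chars, n_tags, char_dim=32, char_hidden=64, word_hidden=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # Character-level BiLSTM: builds one vector per word from its characters.
        self.char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True, batch_first=True)
        # Word-level BiLSTM over the sequence of word vectors.
        self.word_lstm = nn.LSTM(2 * char_hidden, word_hidden, bidirectional=True, batch_first=True)
        # Per-tag emission scores; a CRF layer would sit on top of these.
        self.emissions = nn.Linear(2 * word_hidden, n_tags)

    def forward(self, char_ids):
        # char_ids: (num_words, max_word_len) character indices for one utterance.
        embedded = self.char_emb(char_ids)                    # (W, L, char_dim)
        _, (h_n, _) = self.char_lstm(embedded)                # h_n: (2, W, char_hidden)
        word_reprs = torch.cat([h_n[0], h_n[1]], dim=-1)      # (W, 2*char_hidden)
        seq_out, _ = self.word_lstm(word_reprs.unsqueeze(0))  # (1, W, 2*word_hidden)
        return self.emissions(seq_out).squeeze(0)             # (W, n_tags)

# Usage: emission scores for a 3-word utterance with up to 6 characters per word.
model = SubwordBiLSTMTagger(n_chars=100, n_tags=9)
char_ids = torch.randint(1, 100, (3, 6))
tag_scores = model(char_ids)  # shape (3, 9): one row of tag scores per word
```

Phoneme or byte units would plug into the same structure by swapping the character vocabulary for a phoneme or byte vocabulary, and representations from several sub-word units can be concatenated before the word-level BiLSTM.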