Controlled data generation via insertion operations for NLU

Manoj Kumar; Haidar Khan; Yuval Merhav; Wael Hamza; Anna Rumshisky; Rahul Gupta

2022

The use of synthetic data is rapidly emerging as a realistic alternative to manually annotating real data for industry-scale model building. Manual data annotation is slow and expensive, and it is poorly suited to meeting customer privacy expectations. Further, commercial natural language applications must support continuously evolving features as well as newly added experiences. To address these requirements, we propose a targeted synthetic data generation technique that inserts tokens into a given semantic signature. The generated data are used as additional training samples for the tasks of intent classification and named entity recognition. We evaluate on a real-world voice assistant dataset: using only 33% of the available training set, we achieve the same accuracy as training with all available data. We also analyze the effects of data generation across varied real-world applications and propose heuristics that further improve task performance.
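To make the idea of inserting tokens into a semantic signature concrete, here is a minimal toy sketch. It is not the method from the paper: the abstract does not define the signature format or say how insertions are chosen, so everything below is an assumption. The signature is represented as an intent plus an ordered list of slot values, the candidate insertions (`fillers`) are hard-coded stand-ins for whatever model would propose them, and `generate` is a hypothetical helper name. Each output keeps the slot values fixed, inserts carrier tokens around them, and derives IOB tags so the sample could serve both intent classification and named entity recognition.

```python
from itertools import product


def generate(signature, fillers):
    """Enumerate utterances by inserting filler tokens around fixed slot values.

    signature: {"intent": str, "slots": [(slot_name, slot_value), ...]}  (assumed format)
    fillers:   one list of candidate token tuples per gap; there are
               len(slots) + 1 gaps (before each slot value, plus one trailing gap).
    Returns a list of (intent, tokens, iob_tags) triples.
    """
    intent = signature["intent"]
    slots = signature["slots"]
    assert len(fillers) == len(slots) + 1, "need one filler list per gap"

    samples = []
    for choice in product(*fillers):  # one filler picked for every gap
        tokens, tags = [], []
        # zip stops after len(slots) items, leaving the trailing gap for later
        for gap, (name, value) in zip(choice, slots):
            tokens.extend(gap)                  # inserted carrier tokens
            tags.extend(["O"] * len(gap))
            words = value.split()
            tokens.extend(words)                # slot value is kept verbatim
            tags.extend([f"B-{name}"] + [f"I-{name}"] * (len(words) - 1))
        tokens.extend(choice[-1])               # trailing gap
        tags.extend(["O"] * len(choice[-1]))
        samples.append((intent, tokens, tags))
    return samples


# Hypothetical example data, invented for illustration only.
signature = {
    "intent": "PlayMusic",
    "slots": [("song", "yesterday"), ("artist", "the beatles")],
}
fillers = [
    [("play",), ("please", "play")],  # candidates before the song slot
    [("by",)],                        # candidates before the artist slot
    [()],                             # nothing inserted after the last slot
]

samples = generate(signature, fillers)
for intent, tokens, tags in samples:
    print(intent, " ".join(tokens), tags)
```

Exhaustive enumeration keeps the sketch deterministic; a practical generator would more likely sample or score insertions rather than list every combination.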
