Amazon scientists help SK telecom create Korean-based natural language processor
AWS services used to process massive amounts of data needed to develop the sophisticated, open-source artificial intelligence language model.
Korean is a major world language, spoken by some 80 million people. It has a long history, with origins believed to lie in Manchuria, yet Korean is what linguists call an “isolate”: a language with no apparent link to other languages of the kind English has with French and Latin.
But now Korean is part of the revolution in natural language processing, a branch of artificial intelligence that helps computers recognize and interpret human language. In late April, Amazon announced that Korean mobile telecommunications company SK telecom, working with Amazon Web Services researchers, has released the first open-source, advanced Korean-language Generative Pre-trained Transformer-2 (GPT-2) model, called KoGPT-2.
GPT-2 is a language model that has been trained to predict – to “generate” – the completion of a sentence or a paragraph based on as little as a one-word prompt. It was developed in 2019 by OpenAI, an AI research firm. The GPT-2 model is similar to the next-word prediction on your smartphone keyboard, but much larger and more sophisticated.
KoGPT-2 is an open-source GPT-2 model pre-trained with Korean texts to improve machine learning (ML) performance in the Korean language. It can be used for chatbots, search engines, and other purposes.
In creating KoGPT-2, a team of deep-learning engineers from the Amazon Machine Learning (ML) Solutions Lab at AWS was paired with the Conversational AI Team from the SK telecom AI Center. Using AWS services such as Amazon Elastic Compute Cloud, Elastic Fabric Adapter, and Amazon FSx for Lustre, the researchers built KoGPT-2 using a large Korean-language data set provided by SK telecom.
Natural language processing models utilize a large collection of language samples to train a computer on the structure of the language, the meaning of words, and more. GPT-2 requires a particularly large dataset for its algorithm to infer the intent of someone speaking to it or writing a question. The original GPT-2 from OpenAI has some 1.5 billion parameters and was trained on a text corpus of more than 40 gigabytes of internet data. GPT-2 is trained with the objective of predicting the next word, given all of the previous words within some text.
OpenAI researchers have described the GPT-2 model as “chameleon-like,” saying it adapts to the style and context of the conditioning text. This allows researchers and engineers to generate coherent sentences about topics of their choosing. GPT-2 has already proved itself to be astonishingly powerful, with an ability to generate plausible text from a prompt of just a few words or a generalized scenario. GPT-2 has mimicked a writer creating a new Lord of the Rings battle scene, posed as a presidential speechwriter, and performed other linguistic feats.
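The next-word objective described above can be sketched in a few lines of Python. This is a toy illustration only, not SK telecom's or OpenAI's training code: the function names, the four-word sentence, and the uniform "model" are all hypothetical stand-ins. It shows how a piece of text turns into (previous words, next word) training examples, and how the average negative log-probability of the true next word serves as the loss that training tries to reduce.

```python
import math

def next_word_examples(tokens):
    """Return (context, target) pairs: each word is predicted from all words before it."""
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

def average_loss(examples, prob):
    """Mean negative log-probability that `prob` assigns to the true next word."""
    total = sum(math.log(prob(context, target)) for context, target in examples)
    return -total / len(examples)

# Hypothetical stand-in for a real model: guesses uniformly over a 4-word vocabulary.
def uniform_prob(context, target, vocab_size=4):
    return 1.0 / vocab_size

tokens = ["the", "cat", "sat", "down"]  # made-up example sentence
examples = next_word_examples(tokens)
loss = average_loss(examples, uniform_prob)

for context, target in examples:
    print(" ".join(context), "->", target)
print("average loss:", loss)
```

A trained model would assign higher probability to the correct next word than the uniform guess does, which lowers this loss.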
To train KoGPT-2, SK telecom created a corpus of 125 million sentences and more than 1.6 billion words, drawing on data from the Korean Wiki Project, Korean news sources, and other sources.
That posed a formidable technical challenge, says Muhyun Kim, a senior data scientist in the Amazon ML Solutions Lab. “We needed a lot of computing power to train the model,” he says. “We used 64 GPUs (graphics processing units) for one week. Before that, though, we did a lot of experimentation to find the right configuration for analyzing the data and to troubleshoot possible errors.”
“Without human expertise, however, nothing can happen. Our experience helped us work with SK telecom to refine their models and speed up training. AWS is perfect for training a large model like KoGPT-2. It’s easy to use and offers a tremendous amount of bandwidth. But even if the network is fast, if the storage is slow, training will be slow. With Amazon FSx for Lustre we were able to accelerate the entire process,” Muhyun added.
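For a rough sense of scale, the figures quoted above can be turned into back-of-the-envelope numbers. The sketch below assumes the 64 GPUs ran continuously for a full seven days, which the article does not state, and treats "more than 1.6 billion words" as exactly 1.6 billion, so both results are approximations.

```python
# Figures taken from the article.
GPUS = 64
SENTENCES = 125_000_000
WORDS = 1_600_000_000        # "more than 1.6 billion", so this is a lower bound

# Assumption: "one week" means 7 days of uninterrupted use on every GPU.
DAYS = 7
HOURS_PER_DAY = 24

gpu_hours = GPUS * DAYS * HOURS_PER_DAY
words_per_sentence = WORDS / SENTENCES

print(f"{gpu_hours:,} GPU-hours")
print(f"about {words_per_sentence:.1f} words per sentence")
```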
SK telecom also used GluonNLP, an open-source deep-learning toolkit for natural language processing, to speed up the model-training process.
“GluonNLP offers various tokenizers and data pipeline utilities, which make it easy to train state-of-the-art models on custom data sets. We adopted techniques such as mixed precision training, efficient GPU kernels for activation functions, and integration with Elastic Fabric Adapter, which significantly accelerated large-scale distributed training with GluonNLP,” said Haibin Lin, an applied scientist from the AWS MXNet team.
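Mixed precision training generally means doing much of the arithmetic in 16-bit floating point while keeping sensitive values in 32-bit. The article does not describe how the team configured it, so the sketch below is only an illustration of one issue commonly associated with the technique: very small gradient values can round to zero in 16-bit, and a common remedy is to multiply by a scale factor before the 16-bit step and divide it back out afterward. The gradient value and scale factor here are hypothetical, and the example uses only Python's standard library rather than GluonNLP or MXNet.

```python
import struct

def to_half(x):
    """Round a Python float to IEEE 754 half precision (float16) and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

tiny_gradient = 1e-8   # hypothetical gradient value, chosen for illustration
loss_scale = 1024.0    # hypothetical scale factor (a power of two)

# Stored directly in float16, the value is too small to represent and becomes zero.
unscaled = to_half(tiny_gradient)

# Scaled up first, it survives the float16 round trip; unscaling happens in higher precision.
scaled = to_half(tiny_gradient * loss_scale)
recovered = scaled / loss_scale

print("float16 without scaling:", unscaled)
print("float16 with scaling, then unscaled:", recovered)
```

The recovered value is not exact, because float16 keeps few significant bits, but it is no longer lost entirely.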
While the Amazon ML Solutions Lab implemented and provided the large-scale infrastructure that made training feasible, SKT AI Center’s Conversational AI Team provided the key ingredients and linguistic expertise. As mentioned above, the team painstakingly created the dataset used to train the model. They also implemented the training code and trained the KoGPT-2 model.
“We wanted to help scale out SK telecom’s burgeoning natural language efforts by training the state-of-the-art KoGPT-2 model,” said Kim Tae Yoon, the leader of the Conversational AI Team from SK telecom. “Open source and contribution back to the growing Korean NLP community are core values of our team, so it was only natural to open source this model.”
From a practical standpoint, KoGPT-2 will give SK telecom’s customers a surprisingly human-like experience when speaking with a chatbot or finding answers to questions.
KoGPT-2 is available from the GitHub repository of SKT AI Center (https://github.com/SKT-AI/KoGPT2) under a Modified MIT License. AWS has also released a Git repository with guidance on how to deploy the KoGPT-2 model into Amazon SageMaker.